Message ID | 20240223081346.2052267-1-thinker.li@gmail.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net-next] selftests/net: force synchronized GC for a test. | expand |
On Fri, 23 Feb 2024 00:13:46 -0800 Kui-Feng Lee wrote:
> Due to the slowness of the test environment
Would be interesting if it's slowness, because it failed 2 times
on the debug runner but 5 times on the non-debug one. We'll see.
On 2/23/24 7:21 PM, Jakub Kicinski wrote: > On Fri, 23 Feb 2024 00:13:46 -0800 Kui-Feng Lee wrote: >> Due to the slowness of the test environment > > Would be interesting if it's slowness, because it failed 2 times > on the debug runner but 5 times on the non-debug one. We'll see. hmmm... that should be debugged. waiting 2*N + 1 and then requesting GC and still failing suggests something else is at play
On 2/23/24 19:15, David Ahern wrote: > On 2/23/24 7:21 PM, Jakub Kicinski wrote: >> On Fri, 23 Feb 2024 00:13:46 -0800 Kui-Feng Lee wrote: >>> Due to the slowness of the test environment >> >> Would be interesting if it's slowness, because it failed 2 times >> on the debug runner but 5 times on the non-debug one. We'll see. > > hmmm... that should be debugged. waiting 2*N + 1 and then requesting GC > and still failing suggests something else is at play Do you mean it is still happening even with this patch?
On 2/23/24 18:21, Jakub Kicinski wrote: > On Fri, 23 Feb 2024 00:13:46 -0800 Kui-Feng Lee wrote: >> Due to the slowness of the test environment > > Would be interesting if it's slowness, because it failed 2 times > on the debug runner but 5 times on the non-debug one. We'll see. The code was landed on Feb 12, 2024. It is actually 1 time with debug runner but 3 times on the non-debug one. 1 time with debug runner and 2 times with the non-debug runner happened in a 12 hours period.
On Fri, 2024-02-23 at 20:40 -0800, Kui-Feng Lee wrote: > > On 2/23/24 18:21, Jakub Kicinski wrote: > > On Fri, 23 Feb 2024 00:13:46 -0800 Kui-Feng Lee wrote: > > > Due to the slowness of the test environment > > > > Would be interesting if it's slowness, because it failed 2 times > > on the debug runner but 5 times on the non-debug one. We'll see. > > The code was landed on Feb 12, 2024. > It is actually 1 time with debug runner but 3 times on the non-debug > one. 1 time with debug runner and 2 times with the non-debug runner > happened in a 12 hours period. The only failure in a debug run was causes by an unrelated issue: https://netdev-3.bots.linux.dev/vmksft-net-dbg/results/467362/7-fib-tests-sh/stdout (grep for UBSAN) Basically the problems happens quite sporadically only in "fast" env. I think it's worthy a deeper investigation (it sounds like a race somewhere). I guess/hope then raising the number of temporary and permanent routes created by the relevant test-case could help making the issue more easily reproducible. Thanks, Paolo
On 2/23/24 19:15, David Ahern wrote: > On 2/23/24 7:21 PM, Jakub Kicinski wrote: >> On Fri, 23 Feb 2024 00:13:46 -0800 Kui-Feng Lee wrote: >>> Due to the slowness of the test environment >> >> Would be interesting if it's slowness, because it failed 2 times >> on the debug runner but 5 times on the non-debug one. We'll see. > > hmmm... that should be debugged. waiting 2*N + 1 and then requesting GC > and still failing suggests something else is at play I did some tests, and found fib6_run_gc() do round_jiffies() for the gc timer. So, gc_interval can increase 0.75 seconds in some case. I am doing more investigation on this.
On 2/29/24 09:39, Kui-Feng Lee wrote: > > > On 2/23/24 19:15, David Ahern wrote: >> On 2/23/24 7:21 PM, Jakub Kicinski wrote: >>> On Fri, 23 Feb 2024 00:13:46 -0800 Kui-Feng Lee wrote: >>>> Due to the slowness of the test environment >>> >>> Would be interesting if it's slowness, because it failed 2 times >>> on the debug runner but 5 times on the non-debug one. We'll see. >> >> hmmm... that should be debugged. waiting 2*N + 1 and then requesting GC >> and still failing suggests something else is at play > > I did some tests, and found fib6_run_gc() do round_jiffies() > for the gc timer. So, gc_interval can increase 0.75 seconds in > some case. I am doing more investigation on this. My conclusion is routes going to expire in N seconds can stay for 2*N + 1 seconds. Adding a new route going to expire in N seconds, it will starts gc timer for N seconds. In our case, we add several routes going to expire in N seconds consecutively. The later routes may expires in different ticks than the first route added. So, a route may wait nearly 2*N seconds. The test case waits for 2*N + 1 seconds, that should be enough. However, some extra waiting may be added to it. There are two possible extra waiting. The first one is calling round_jiffies() in fib6_run_gc(), that may add 750ms at most. The second one is the granularity of waiting for 5 seconds (in our case) is 512ms for HZ 1000 according to the comment at the very begin of timer.c. In fact, it can add 392ms for 5750ms (5000ms + 750ms). Overall, they may contribute up to 1144ms. Does that make sense? Debug build is slower. So, the test scripts will be slower than normal build. That means the script is actually waiting longer with a debug build.
On Fri, 1 Mar 2024 16:45:58 -0800 Kui-Feng Lee wrote: > However, some extra waiting may be added to it. > There are two possible extra waiting. The first one is calling > round_jiffies() in fib6_run_gc(), that may add 750ms at most. The second > one is the granularity of waiting for 5 seconds (in our case) is 512ms > for HZ 1000 according to the comment at the very begin of timer.c. > In fact, it can add 392ms for 5750ms (5000ms + 750ms). Overall, they may > contribute up to 1144ms. > > Does that make sense? > > Debug build is slower. So, the test scripts will be slower than normal > build. That means the script is actually waiting longer with a debug build. Meaning bumping the wait to $((($EXPIRE + 1) * 2)) should be enough for the non-debug runner?
On 3/4/24 07:44, Jakub Kicinski wrote: > On Fri, 1 Mar 2024 16:45:58 -0800 Kui-Feng Lee wrote: >> However, some extra waiting may be added to it. >> There are two possible extra waiting. The first one is calling >> round_jiffies() in fib6_run_gc(), that may add 750ms at most. The second >> one is the granularity of waiting for 5 seconds (in our case) is 512ms >> for HZ 1000 according to the comment at the very begin of timer.c. >> In fact, it can add 392ms for 5750ms (5000ms + 750ms). Overall, they may >> contribute up to 1144ms. >> >> Does that make sense? >> >> Debug build is slower. So, the test scripts will be slower than normal >> build. That means the script is actually waiting longer with a debug build. > > Meaning bumping the wait to $((($EXPIRE + 1) * 2)) > should be enough for the non-debug runner? Yes, it should be enough. I will send out another patch base on this.
On 3/4/24 8:44 AM, Jakub Kicinski wrote: > On Fri, 1 Mar 2024 16:45:58 -0800 Kui-Feng Lee wrote: >> However, some extra waiting may be added to it. >> There are two possible extra waiting. The first one is calling >> round_jiffies() in fib6_run_gc(), that may add 750ms at most. The second >> one is the granularity of waiting for 5 seconds (in our case) is 512ms >> for HZ 1000 according to the comment at the very begin of timer.c. >> In fact, it can add 392ms for 5750ms (5000ms + 750ms). Overall, they may >> contribute up to 1144ms. >> >> Does that make sense? >> >> Debug build is slower. So, the test scripts will be slower than normal >> build. That means the script is actually waiting longer with a debug build. > > Meaning bumping the wait to $((($EXPIRE + 1) * 2)) > should be enough for the non-debug runner? I have not had time to do a deep a dive on the timing, but it seems odd to me that a 1 second timer can turn into 11 sec. That means for 10 seconds (10x the time the user requested) the route survived.
On 3/4/24 7:41 PM, David Ahern wrote: > On 3/4/24 8:44 AM, Jakub Kicinski wrote: >> On Fri, 1 Mar 2024 16:45:58 -0800 Kui-Feng Lee wrote: >>> However, some extra waiting may be added to it. >>> There are two possible extra waiting. The first one is calling >>> round_jiffies() in fib6_run_gc(), that may add 750ms at most. The second >>> one is the granularity of waiting for 5 seconds (in our case) is 512ms >>> for HZ 1000 according to the comment at the very begin of timer.c. >>> In fact, it can add 392ms for 5750ms (5000ms + 750ms). Overall, they may >>> contribute up to 1144ms. >>> >>> Does that make sense? >>> >>> Debug build is slower. So, the test scripts will be slower than normal >>> build. That means the script is actually waiting longer with a debug build. >> >> Meaning bumping the wait to $((($EXPIRE + 1) * 2)) >> should be enough for the non-debug runner? > > I have not had time to do a deep a dive on the timing, but it seems odd > to me that a 1 second timer can turn into 11 sec. That means for 10 > seconds (10x the time the user requested) the route survived. Also, you added a direct call to ipv6_sysctl_rtcache_flush to force a flush which is going to try to kick off gc at that moment. Is the delay kicking in? delay = net->ipv6.sysctl.flush_delay;
On 3/4/24 18:49, David Ahern wrote: > On 3/4/24 7:41 PM, David Ahern wrote: >> On 3/4/24 8:44 AM, Jakub Kicinski wrote: >>> On Fri, 1 Mar 2024 16:45:58 -0800 Kui-Feng Lee wrote: >>>> However, some extra waiting may be added to it. >>>> There are two possible extra waiting. The first one is calling >>>> round_jiffies() in fib6_run_gc(), that may add 750ms at most. The second >>>> one is the granularity of waiting for 5 seconds (in our case) is 512ms >>>> for HZ 1000 according to the comment at the very begin of timer.c. >>>> In fact, it can add 392ms for 5750ms (5000ms + 750ms). Overall, they may >>>> contribute up to 1144ms. >>>> >>>> Does that make sense? >>>> >>>> Debug build is slower. So, the test scripts will be slower than normal >>>> build. That means the script is actually waiting longer with a debug build. >>> >>> Meaning bumping the wait to $((($EXPIRE + 1) * 2)) >>> should be enough for the non-debug runner? >> >> I have not had time to do a deep a dive on the timing, but it seems odd >> to me that a 1 second timer can turn into 11 sec. That means for 10 >> seconds (10x the time the user requested) the route survived. > > Also, you added a direct call to ipv6_sysctl_rtcache_flush to force a > flush which is going to try to kick off gc at that moment. Is the delay > kicking in? > > delay = net->ipv6.sysctl.flush_delay; The delay doesn't cause any different here. The delay affects only exceptions. It doesn't 10x. It does 2 times. In fib_tests.sh sysctl -wq net.ipv6.route.gc_interval= $EXPIRE The test sets the gc interval to 5 seconds. The test install several routes in a round. So, some later routes may miss the first GC that emitted after the first 5 seconds. These routes have to wait for 5 more seconds for the next time GC. So, it is 10 seconds. However, due to calling round_jiffies() in fib6_run_gc(), the waiting time of the second GC period can be longer. It could increase 750ms at most. And, due to the granularity of the timer, it can adds another 512ms at the range 4s~32s.
diff --git a/tools/testing/selftests/net/fib_tests.sh b/tools/testing/selftests/net/fib_tests.sh index 3ec1050e47a2..0a82c9bc07bb 100755 --- a/tools/testing/selftests/net/fib_tests.sh +++ b/tools/testing/selftests/net/fib_tests.sh @@ -823,7 +823,9 @@ fib6_gc_test() $IP -6 route add 2001:20::$i \ via 2001:10::2 dev dummy_10 expires $EXPIRE done + # Wait for GC sleep $(($EXPIRE * 2 + 1)) + $NS_EXEC sysctl -wq net.ipv6.route.flush=1 check_rt_num 0 $($IP -6 route list |grep expires|wc -l) log_test $ret 0 "ipv6 route garbage collection (with permanent routes)" @@ -864,7 +866,7 @@ fib6_gc_test() # Wait for GC sleep $(($EXPIRE * 2 + 1)) - + $NS_EXEC sysctl -wq net.ipv6.route.flush=1 check_rt_num 5 $($IP -6 route list |grep -v expires|grep 2001:20::|wc -l) log_test $ret 0 "ipv6 route garbage collection (replace with permanent)"
Due to the slowness of the test environment, always set off a synchronized GC after waiting for GC. This can fix the problem that Fib6 garbage collection test fails occasionally. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> --- tools/testing/selftests/net/fib_tests.sh | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)