[net-next] selftests/net: force synchronized GC for a test.

Message ID	20240223081346.2052267-1-thinker.li@gmail.com (mailing list archive)
State	Changes Requested
Delegated to:	Netdev Maintainers
Headers	show Received: from mail-yb1-f182.google.com (mail-yb1-f182.google.com [209.85.219.182]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 069C71756F for <netdev@vger.kernel.org>; Fri, 23 Feb 2024 08:14:00 +0000 (UTC) From: Kui-Feng Lee <thinker.li@gmail.com> To: netdev@vger.kernel.org, ast@kernel.org, martin.lau@linux.dev, kernel-team@meta.com, kuba@kernel.org, davem@davemloft.net, dsahern@kernel.org Cc: sinquersw@gmail.com, kuifeng@meta.com, Kui-Feng Lee <thinker.li@gmail.com> Subject: [PATCH net-next] selftests/net: force synchronized GC for a test. Date: Fri, 23 Feb 2024 00:13:46 -0800 Message-Id: <20240223081346.2052267-1-thinker.li@gmail.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	[net-next] selftests/net: force synchronized GC for a test. \| expand [net-next] selftests/net: force synchronized GC for a test.

Context	Check	Description
netdev/series_format	success	Single patches do not need cover letters
netdev/tree_selection	success	Clearly marked for net-next
netdev/ynl	success	Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present	success	Fixes tag not required for -next series
netdev/header_inline	success	No static functions without inline keyword in header files
netdev/build_32bit	success	Errors and warnings before: 8 this patch: 8
netdev/build_tools	success	Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers	warning	4 maintainers not CCed: shuah@kernel.org pabeni@redhat.com linux-kselftest@vger.kernel.org edumazet@google.com
netdev/build_clang	success	Errors and warnings before: 8 this patch: 8
netdev/verify_signedoff	success	Signed-off-by tag matches author and committer
netdev/deprecated_api	success	None detected
netdev/check_selftest	success	net selftest script(s) already in Makefile
netdev/verify_fixes	success	No Fixes tag
netdev/build_allmodconfig_warn	success	Errors and warnings before: 8 this patch: 8
netdev/checkpatch	success	total: 0 errors, 0 warnings, 0 checks, 17 lines checked
netdev/build_clang_rust	success	No Rust files in patch. Skipping build
netdev/kdoc	success	Errors and warnings before: 0 this patch: 0
netdev/source_inline	success	Was 0 now: 0
netdev/contest	success	net-next-2024-02-23--18-00 (tests: 1457)

Kui-Feng Lee Feb. 23, 2024, 8:13 a.m. UTC

Due to the slowness of the test environment, always set off a synchronized
GC after waiting for GC. This can fix the problem that Fib6 garbage
collection test fails occasionally.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
---
 tools/testing/selftests/net/fib_tests.sh | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Jakub Kicinski Feb. 24, 2024, 2:21 a.m. UTC | #1

On Fri, 23 Feb 2024 00:13:46 -0800 Kui-Feng Lee wrote:
> Due to the slowness of the test environment

Would be interesting if it's slowness, because it failed 2 times
on the debug runner but 5 times on the non-debug one. We'll see.

David Ahern Feb. 24, 2024, 3:15 a.m. UTC | #2

On 2/23/24 7:21 PM, Jakub Kicinski wrote:
> On Fri, 23 Feb 2024 00:13:46 -0800 Kui-Feng Lee wrote:
>> Due to the slowness of the test environment
> 
> Would be interesting if it's slowness, because it failed 2 times
> on the debug runner but 5 times on the non-debug one. We'll see.

hmmm... that should be debugged. waiting 2*N + 1 and then requesting GC
and still failing suggests something else is at play

Kui-Feng Lee Feb. 24, 2024, 3:54 a.m. UTC | #3

On 2/23/24 19:15, David Ahern wrote:
> On 2/23/24 7:21 PM, Jakub Kicinski wrote:
>> On Fri, 23 Feb 2024 00:13:46 -0800 Kui-Feng Lee wrote:
>>> Due to the slowness of the test environment
>>
>> Would be interesting if it's slowness, because it failed 2 times
>> on the debug runner but 5 times on the non-debug one. We'll see.
> 
> hmmm... that should be debugged. waiting 2*N + 1 and then requesting GC
> and still failing suggests something else is at play

Do you mean it is still happening even with this patch?

Kui-Feng Lee Feb. 24, 2024, 4:40 a.m. UTC | #4

On 2/23/24 18:21, Jakub Kicinski wrote:
> On Fri, 23 Feb 2024 00:13:46 -0800 Kui-Feng Lee wrote:
>> Due to the slowness of the test environment
> 
> Would be interesting if it's slowness, because it failed 2 times
> on the debug runner but 5 times on the non-debug one. We'll see.

The code was landed on Feb 12, 2024.
It is actually 1 time with debug runner but 3 times on the non-debug 
one. 1 time with debug runner and 2 times with the non-debug runner 
happened in a 12 hours period.

Paolo Abeni Feb. 27, 2024, 8:53 a.m. UTC | #5

On Fri, 2024-02-23 at 20:40 -0800, Kui-Feng Lee wrote:
> 
> On 2/23/24 18:21, Jakub Kicinski wrote:
> > On Fri, 23 Feb 2024 00:13:46 -0800 Kui-Feng Lee wrote:
> > > Due to the slowness of the test environment
> > 
> > Would be interesting if it's slowness, because it failed 2 times
> > on the debug runner but 5 times on the non-debug one. We'll see.
> 
> The code was landed on Feb 12, 2024.
> It is actually 1 time with debug runner but 3 times on the non-debug 
> one. 1 time with debug runner and 2 times with the non-debug runner 
> happened in a 12 hours period.

The only failure in a debug run was causes by an unrelated issue:

https://netdev-3.bots.linux.dev/vmksft-net-dbg/results/467362/7-fib-tests-sh/stdout

(grep for UBSAN)

Basically the problems happens quite sporadically only in "fast" env. I
think it's worthy a deeper investigation (it sounds like a race
somewhere).

I guess/hope then raising the number of temporary and permanent routes
created by the relevant test-case could help making the issue more
easily reproducible.

Thanks,

Paolo

Kui-Feng Lee Feb. 29, 2024, 5:39 p.m. UTC | #6

On 2/23/24 19:15, David Ahern wrote:
> On 2/23/24 7:21 PM, Jakub Kicinski wrote:
>> On Fri, 23 Feb 2024 00:13:46 -0800 Kui-Feng Lee wrote:
>>> Due to the slowness of the test environment
>>
>> Would be interesting if it's slowness, because it failed 2 times
>> on the debug runner but 5 times on the non-debug one. We'll see.
> 
> hmmm... that should be debugged. waiting 2*N + 1 and then requesting GC
> and still failing suggests something else is at play

I did some tests, and found fib6_run_gc() do round_jiffies()
for the gc timer. So, gc_interval can increase 0.75 seconds in
some case. I am doing more investigation on this.

Kui-Feng Lee March 2, 2024, 12:45 a.m. UTC | #7

On 2/29/24 09:39, Kui-Feng Lee wrote:
> 
> 
> On 2/23/24 19:15, David Ahern wrote:
>> On 2/23/24 7:21 PM, Jakub Kicinski wrote:
>>> On Fri, 23 Feb 2024 00:13:46 -0800 Kui-Feng Lee wrote:
>>>> Due to the slowness of the test environment
>>>
>>> Would be interesting if it's slowness, because it failed 2 times
>>> on the debug runner but 5 times on the non-debug one. We'll see.
>>
>> hmmm... that should be debugged. waiting 2*N + 1 and then requesting GC
>> and still failing suggests something else is at play
> 
> I did some tests, and found fib6_run_gc() do round_jiffies()
> for the gc timer. So, gc_interval can increase 0.75 seconds in
> some case. I am doing more investigation on this.

My conclusion is routes going to expire in N seconds can stay
for 2*N + 1 seconds.

Adding a new route going to expire in N seconds, it will starts
gc timer for N seconds. In our case, we add several routes going to
expire in N seconds consecutively. The later routes may expires in
different ticks than the first route added. So, a route may wait nearly
2*N seconds. The test case waits for 2*N + 1 seconds, that should be
enough.

However, some extra waiting may be added to it.
There are two possible extra waiting. The first one is calling
round_jiffies() in fib6_run_gc(), that may add 750ms at most. The second
one is the granularity of waiting for 5 seconds (in our case) is 512ms
for HZ 1000 according to the comment at the very begin of timer.c.
In fact, it can add 392ms for 5750ms (5000ms + 750ms). Overall, they may
contribute up to 1144ms.

Does that make sense?

Debug build is slower. So, the test scripts will be slower than normal
build. That means the script is actually waiting longer with a debug build.

Jakub Kicinski March 4, 2024, 3:44 p.m. UTC | #8

On Fri, 1 Mar 2024 16:45:58 -0800 Kui-Feng Lee wrote:
> However, some extra waiting may be added to it.
> There are two possible extra waiting. The first one is calling
> round_jiffies() in fib6_run_gc(), that may add 750ms at most. The second
> one is the granularity of waiting for 5 seconds (in our case) is 512ms
> for HZ 1000 according to the comment at the very begin of timer.c.
> In fact, it can add 392ms for 5750ms (5000ms + 750ms). Overall, they may
> contribute up to 1144ms.
> 
> Does that make sense?
> 
> Debug build is slower. So, the test scripts will be slower than normal
> build. That means the script is actually waiting longer with a debug build.

Meaning bumping the wait to $((($EXPIRE + 1) * 2))
should be enough for the non-debug runner?

Kui-Feng Lee March 4, 2024, 5:27 p.m. UTC | #9

On 3/4/24 07:44, Jakub Kicinski wrote:
> On Fri, 1 Mar 2024 16:45:58 -0800 Kui-Feng Lee wrote:
>> However, some extra waiting may be added to it.
>> There are two possible extra waiting. The first one is calling
>> round_jiffies() in fib6_run_gc(), that may add 750ms at most. The second
>> one is the granularity of waiting for 5 seconds (in our case) is 512ms
>> for HZ 1000 according to the comment at the very begin of timer.c.
>> In fact, it can add 392ms for 5750ms (5000ms + 750ms). Overall, they may
>> contribute up to 1144ms.
>>
>> Does that make sense?
>>
>> Debug build is slower. So, the test scripts will be slower than normal
>> build. That means the script is actually waiting longer with a debug build.
> 
> Meaning bumping the wait to $((($EXPIRE + 1) * 2))
> should be enough for the non-debug runner?

Yes, it should be enough. I will send out another patch base on this.

David Ahern March 5, 2024, 2:41 a.m. UTC | #10

On 3/4/24 8:44 AM, Jakub Kicinski wrote:
> On Fri, 1 Mar 2024 16:45:58 -0800 Kui-Feng Lee wrote:
>> However, some extra waiting may be added to it.
>> There are two possible extra waiting. The first one is calling
>> round_jiffies() in fib6_run_gc(), that may add 750ms at most. The second
>> one is the granularity of waiting for 5 seconds (in our case) is 512ms
>> for HZ 1000 according to the comment at the very begin of timer.c.
>> In fact, it can add 392ms for 5750ms (5000ms + 750ms). Overall, they may
>> contribute up to 1144ms.
>>
>> Does that make sense?
>>
>> Debug build is slower. So, the test scripts will be slower than normal
>> build. That means the script is actually waiting longer with a debug build.
> 
> Meaning bumping the wait to $((($EXPIRE + 1) * 2))
> should be enough for the non-debug runner?

I have not had time to do a deep a dive on the timing, but it seems odd
to me that a 1 second timer can turn into 11 sec. That means for 10
seconds (10x the time the user requested) the route survived.

David Ahern March 5, 2024, 2:49 a.m. UTC | #11

On 3/4/24 7:41 PM, David Ahern wrote:
> On 3/4/24 8:44 AM, Jakub Kicinski wrote:
>> On Fri, 1 Mar 2024 16:45:58 -0800 Kui-Feng Lee wrote:
>>> However, some extra waiting may be added to it.
>>> There are two possible extra waiting. The first one is calling
>>> round_jiffies() in fib6_run_gc(), that may add 750ms at most. The second
>>> one is the granularity of waiting for 5 seconds (in our case) is 512ms
>>> for HZ 1000 according to the comment at the very begin of timer.c.
>>> In fact, it can add 392ms for 5750ms (5000ms + 750ms). Overall, they may
>>> contribute up to 1144ms.
>>>
>>> Does that make sense?
>>>
>>> Debug build is slower. So, the test scripts will be slower than normal
>>> build. That means the script is actually waiting longer with a debug build.
>>
>> Meaning bumping the wait to $((($EXPIRE + 1) * 2))
>> should be enough for the non-debug runner?
> 
> I have not had time to do a deep a dive on the timing, but it seems odd
> to me that a 1 second timer can turn into 11 sec. That means for 10
> seconds (10x the time the user requested) the route survived.

Also, you added a direct call to ipv6_sysctl_rtcache_flush to force a
flush which is going to try to kick off gc at that moment. Is the delay
kicking in?

delay = net->ipv6.sysctl.flush_delay;

Kui-Feng Lee March 5, 2024, 3:20 a.m. UTC | #12

On 3/4/24 18:49, David Ahern wrote:
> On 3/4/24 7:41 PM, David Ahern wrote:
>> On 3/4/24 8:44 AM, Jakub Kicinski wrote:
>>> On Fri, 1 Mar 2024 16:45:58 -0800 Kui-Feng Lee wrote:
>>>> However, some extra waiting may be added to it.
>>>> There are two possible extra waiting. The first one is calling
>>>> round_jiffies() in fib6_run_gc(), that may add 750ms at most. The second
>>>> one is the granularity of waiting for 5 seconds (in our case) is 512ms
>>>> for HZ 1000 according to the comment at the very begin of timer.c.
>>>> In fact, it can add 392ms for 5750ms (5000ms + 750ms). Overall, they may
>>>> contribute up to 1144ms.
>>>>
>>>> Does that make sense?
>>>>
>>>> Debug build is slower. So, the test scripts will be slower than normal
>>>> build. That means the script is actually waiting longer with a debug build.
>>>
>>> Meaning bumping the wait to $((($EXPIRE + 1) * 2))
>>> should be enough for the non-debug runner?
>>
>> I have not had time to do a deep a dive on the timing, but it seems odd
>> to me that a 1 second timer can turn into 11 sec. That means for 10
>> seconds (10x the time the user requested) the route survived.
> 
> Also, you added a direct call to ipv6_sysctl_rtcache_flush to force a
> flush which is going to try to kick off gc at that moment. Is the delay
> kicking in?
> 
> delay = net->ipv6.sysctl.flush_delay;

The delay doesn't cause any different here. The delay affects only 
exceptions.

It doesn't 10x. It does 2 times.
In fib_tests.sh

   sysctl -wq net.ipv6.route.gc_interval= $EXPIRE

The test sets the gc interval to 5 seconds.

The test install several routes in a round. So, some later routes may
miss the first GC that emitted after the first 5 seconds. These routes
have to wait for 5 more seconds for the next time GC. So, it is 10
seconds. However, due to calling round_jiffies() in fib6_run_gc(), the
waiting time of the second GC period can be longer. It could increase
750ms at most.  And, due to the granularity of the timer, it can adds
another 512ms at the range 4s~32s.

[net-next] selftests/net: force synchronized GC for a test.

Checks

Commit Message

Comments

Patch