diff mbox series

selftests: net: ip_defrag: increase netdev_max_backlog

Message ID 20191204195321.406365-1-cascardo@canonical.com (mailing list archive)
State New
Series selftests: net: ip_defrag: increase netdev_max_backlog

Commit Message

Thadeu Lima de Souza Cascardo Dec. 4, 2019, 7:53 p.m. UTC
When using fragments with size 8 and payload larger than 8000, the backlog
might fill up and packets will be dropped, causing the test to fail. This
happens often enough when conntrack is on during the IPv6 test.

As the largest payload in the test is 10000, using a backlog of 1250 allows
the test to run repeatedly without failure. At least 1000 runs completed
with no failures, whereas fewer than 50 runs were usually enough to show a
failure.

As netdev_max_backlog is not a pernet setting, this sets the backlog to
1000 during exit to prevent disturbing following tests.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Fixes: 4c3510483d26 ("selftests: net: ip_defrag: cover new IPv6 defrag behavior")
---
 tools/testing/selftests/net/ip_defrag.sh | 3 +++
 1 file changed, 3 insertions(+)

Comments

Eric Dumazet Dec. 4, 2019, 8:03 p.m. UTC | #1
On 12/4/19 11:53 AM, Thadeu Lima de Souza Cascardo wrote:
> When using fragments with size 8 and payload larger than 8000, the backlog
> might fill up and packets will be dropped, causing the test to fail. This
> happens often enough when conntrack is on during the IPv6 test.
> 
> As the larger payload in the test is 10000, using a backlog of 1250 allow
> the test to run repeatedly without failure. At least a 1000 runs were
> possible with no failures, when usually less than 50 runs were good enough
> for showing a failure.
> 
> As netdev_max_backlog is not a pernet setting, this sets the backlog to
> 1000 during exit to prevent disturbing following tests.
> 

Hmmm... I would prefer not changing a global setting like that.
This is going to be flaky since we often run tests in parallel (using different netns)

What about adding a small delay after each sent packet ?

diff --git a/tools/testing/selftests/net/ip_defrag.c b/tools/testing/selftests/net/ip_defrag.c
index c0c9ecb891e1d78585e0db95fd8783be31bc563a..24d0723d2e7e9b94c3e365ee2ee30e9445deafa8 100644
--- a/tools/testing/selftests/net/ip_defrag.c
+++ b/tools/testing/selftests/net/ip_defrag.c
@@ -198,6 +198,7 @@ static void send_fragment(int fd_raw, struct sockaddr *addr, socklen_t alen,
                error(1, 0, "send_fragment: %d vs %d", res, frag_len);
 
        frag_counter++;
+       usleep(1000);
 }
 
 static void send_udp_frags(int fd_raw, struct sockaddr *addr,
Thadeu Lima de Souza Cascardo Dec. 6, 2019, 12:17 p.m. UTC | #2
On Wed, Dec 04, 2019 at 12:03:57PM -0800, Eric Dumazet wrote:
> 
> 
> On 12/4/19 11:53 AM, Thadeu Lima de Souza Cascardo wrote:
> > When using fragments with size 8 and payload larger than 8000, the backlog
> > might fill up and packets will be dropped, causing the test to fail. This
> > happens often enough when conntrack is on during the IPv6 test.
> > 
> > As the larger payload in the test is 10000, using a backlog of 1250 allow
> > the test to run repeatedly without failure. At least a 1000 runs were
> > possible with no failures, when usually less than 50 runs were good enough
> > for showing a failure.
> > 
> > As netdev_max_backlog is not a pernet setting, this sets the backlog to
> > 1000 during exit to prevent disturbing following tests.
> > 
> 
> Hmmm... I would prefer not changing a global setting like that.
> This is going to be flaky since we often run tests in parallel (using different netns)
> 
> What about adding a small delay after each sent packet ?
> 
> diff --git a/tools/testing/selftests/net/ip_defrag.c b/tools/testing/selftests/net/ip_defrag.c
> index c0c9ecb891e1d78585e0db95fd8783be31bc563a..24d0723d2e7e9b94c3e365ee2ee30e9445deafa8 100644
> --- a/tools/testing/selftests/net/ip_defrag.c
> +++ b/tools/testing/selftests/net/ip_defrag.c
> @@ -198,6 +198,7 @@ static void send_fragment(int fd_raw, struct sockaddr *addr, socklen_t alen,
>                 error(1, 0, "send_fragment: %d vs %d", res, frag_len);
>  
>         frag_counter++;
> +       usleep(1000);
>  }
>  
>  static void send_udp_frags(int fd_raw, struct sockaddr *addr,
> 

That won't work because the issue only shows up when we are using conntrack, as the
packet will be reassembled on output, then fragmented again. When this happens,
the fragmentation code is transmitting the fragments in a tight loop, which
floods the backlog.

Another option is to limit the number of fragments to 1000, as in the
following patch:

diff --git a/tools/testing/selftests/net/ip_defrag.c b/tools/testing/selftests/net/ip_defrag.c
index c0c9ecb891e1..f4086ba9d16c 100644
--- a/tools/testing/selftests/net/ip_defrag.c
+++ b/tools/testing/selftests/net/ip_defrag.c
@@ -16,6 +16,9 @@
 #include <time.h>
 #include <unistd.h>
 
+#define ALIGN(x, sz) ((x + (sz-1)) & ~(sz-1))
+#define MAX(a, b) ((a < b) ? b : a)
+
 static bool		cfg_do_ipv4;
 static bool		cfg_do_ipv6;
 static bool		cfg_verbose;
@@ -362,6 +365,7 @@ static void run_test(struct sockaddr *addr, socklen_t alen, bool ipv6)
 
 	for (payload_len = min_frag_len; payload_len < MSG_LEN_MAX;
 			payload_len += (rand() % 4096)) {
+		min_frag_len = MAX(8, ALIGN(payload_len / 1000, 8));
 		if (cfg_verbose)
 			printf("payload_len: %d\n", payload_len);
Eric Dumazet Dec. 6, 2019, 1:41 p.m. UTC | #3
On 12/6/19 4:17 AM, Thadeu Lima de Souza Cascardo wrote:
> On Wed, Dec 04, 2019 at 12:03:57PM -0800, Eric Dumazet wrote:
>>
>>
>> On 12/4/19 11:53 AM, Thadeu Lima de Souza Cascardo wrote:
>>> When using fragments with size 8 and payload larger than 8000, the backlog
>>> might fill up and packets will be dropped, causing the test to fail. This
>>> happens often enough when conntrack is on during the IPv6 test.
>>>
>>> As the larger payload in the test is 10000, using a backlog of 1250 allow
>>> the test to run repeatedly without failure. At least a 1000 runs were
>>> possible with no failures, when usually less than 50 runs were good enough
>>> for showing a failure.
>>>
>>> As netdev_max_backlog is not a pernet setting, this sets the backlog to
>>> 1000 during exit to prevent disturbing following tests.
>>>
>>
>> Hmmm... I would prefer not changing a global setting like that.
>> This is going to be flaky since we often run tests in parallel (using different netns)
>>
>> What about adding a small delay after each sent packet ?
>>
>> diff --git a/tools/testing/selftests/net/ip_defrag.c b/tools/testing/selftests/net/ip_defrag.c
>> index c0c9ecb891e1d78585e0db95fd8783be31bc563a..24d0723d2e7e9b94c3e365ee2ee30e9445deafa8 100644
>> --- a/tools/testing/selftests/net/ip_defrag.c
>> +++ b/tools/testing/selftests/net/ip_defrag.c
>> @@ -198,6 +198,7 @@ static void send_fragment(int fd_raw, struct sockaddr *addr, socklen_t alen,
>>                 error(1, 0, "send_fragment: %d vs %d", res, frag_len);
>>  
>>         frag_counter++;
>> +       usleep(1000);
>>  }
>>  
>>  static void send_udp_frags(int fd_raw, struct sockaddr *addr,
>>
> 
> That won't work because the issue only shows when we using conntrack, as the
> packet will be reassembled on output, then fragmented again. When this happens,
> the fragmentation code is transmitting the fragments in a tight loop, which
> floods the backlog.

Interesting !

So it looks like the test is correct, and exposed a long-standing problem in this code.

We should not adjust the test to some kernel-of-the-day-constraints, and instead fix the kernel bug ;)

Where is this tight loop exactly ?

If this is feeding/bursting ~1000 skbs via netif_rx() in a BH context, maybe we need to call a variant
that allows immediate processing instead of (ab)using the softnet backlog.

Thanks !
Thadeu Lima de Souza Cascardo Dec. 6, 2019, 2:50 p.m. UTC | #4
On Fri, Dec 06, 2019 at 05:41:01AM -0800, Eric Dumazet wrote:
> 
> 
> On 12/6/19 4:17 AM, Thadeu Lima de Souza Cascardo wrote:
> > On Wed, Dec 04, 2019 at 12:03:57PM -0800, Eric Dumazet wrote:
> >>
> >>
> >> On 12/4/19 11:53 AM, Thadeu Lima de Souza Cascardo wrote:
> >>> When using fragments with size 8 and payload larger than 8000, the backlog
> >>> might fill up and packets will be dropped, causing the test to fail. This
> >>> happens often enough when conntrack is on during the IPv6 test.
> >>>
> >>> As the larger payload in the test is 10000, using a backlog of 1250 allow
> >>> the test to run repeatedly without failure. At least a 1000 runs were
> >>> possible with no failures, when usually less than 50 runs were good enough
> >>> for showing a failure.
> >>>
> >>> As netdev_max_backlog is not a pernet setting, this sets the backlog to
> >>> 1000 during exit to prevent disturbing following tests.
> >>>
> >>
> >> Hmmm... I would prefer not changing a global setting like that.
> >> This is going to be flaky since we often run tests in parallel (using different netns)
> >>
> >> What about adding a small delay after each sent packet ?
> >>
> >> diff --git a/tools/testing/selftests/net/ip_defrag.c b/tools/testing/selftests/net/ip_defrag.c
> >> index c0c9ecb891e1d78585e0db95fd8783be31bc563a..24d0723d2e7e9b94c3e365ee2ee30e9445deafa8 100644
> >> --- a/tools/testing/selftests/net/ip_defrag.c
> >> +++ b/tools/testing/selftests/net/ip_defrag.c
> >> @@ -198,6 +198,7 @@ static void send_fragment(int fd_raw, struct sockaddr *addr, socklen_t alen,
> >>                 error(1, 0, "send_fragment: %d vs %d", res, frag_len);
> >>  
> >>         frag_counter++;
> >> +       usleep(1000);
> >>  }
> >>  
> >>  static void send_udp_frags(int fd_raw, struct sockaddr *addr,
> >>
> > 
> > That won't work because the issue only shows when we using conntrack, as the
> > packet will be reassembled on output, then fragmented again. When this happens,
> > the fragmentation code is transmitting the fragments in a tight loop, which
> > floods the backlog.
> 
> Interesting !
> 
> So it looks like the test is correct, and exposed a long standing problem in this code.
> 
> We should not adjust the test to some kernel-of-the-day-constraints, and instead fix the kernel bug ;)
> 
> Where is this tight loop exactly ?
> 
> If this is feeding/bursting ~1000 skbs via netif_rx() in a BH context, maybe we need to call a variant
> that allows immediate processing instead of (ab)using the softnet backlog.
> 
> Thanks !

This is the loopback interface, so its xmit calls netif_rx. I suppose we would
have the same problem with veth, for example.

So net/ipv6/ip6_output.c:ip6_fragment has this:

		for (;;) {
			/* Prepare header of the next frame,
			 * before previous one went down. */
			if (iter.frag)
				ip6_fraglist_prepare(skb, &iter);

			skb->tstamp = tstamp;
			err = output(net, sk, skb);
			if (!err)
				IP6_INC_STATS(net, ip6_dst_idev(&rt->dst),
					      IPSTATS_MIB_FRAGCREATES);

			if (err || !iter.frag)
				break;

			skb = ip6_fraglist_next(&iter);
		}

output is ip6_finish_output2, which will call neigh_output, which ends up
calling dev_queue_xmit.

In this case, ip6_fragment is being called probably from rawv6_send_hdrinc ->
dst_output -> ip6_output -> ip6_finish_output -> __ip6_finish_output ->
ip6_fragment.

dst_output at rawv6_send_hdrinc is being called after the netfilter
NF_INET_LOCAL_OUT hook. That one is gathering the fragments and only accepting
the last, reassembled skb, which causes ip6_fragment to enter that loop.

So, basically, the easiest way to reproduce this is using this test with
loopback and netfilter doing the reassembly during conntrack. I see some BH
locks here and there, but I think this is just filling up the backlog too fast
to give any chance for softirq to kick in.

I will see if I can reproduce this using routed veths.

Cascardo.

Thadeu Lima de Souza Cascardo Dec. 6, 2019, 3:50 p.m. UTC | #5
On Fri, Dec 06, 2019 at 11:50:15AM -0300, Thadeu Lima de Souza Cascardo wrote:
> On Fri, Dec 06, 2019 at 05:41:01AM -0800, Eric Dumazet wrote:
> > 
> > 
> > On 12/6/19 4:17 AM, Thadeu Lima de Souza Cascardo wrote:
> > > On Wed, Dec 04, 2019 at 12:03:57PM -0800, Eric Dumazet wrote:
> > >>
> > >>
> > >> On 12/4/19 11:53 AM, Thadeu Lima de Souza Cascardo wrote:
> > >>> When using fragments with size 8 and payload larger than 8000, the backlog
> > >>> might fill up and packets will be dropped, causing the test to fail. This
> > >>> happens often enough when conntrack is on during the IPv6 test.
> > >>>
> > >>> As the larger payload in the test is 10000, using a backlog of 1250 allow
> > >>> the test to run repeatedly without failure. At least a 1000 runs were
> > >>> possible with no failures, when usually less than 50 runs were good enough
> > >>> for showing a failure.
> > >>>
> > >>> As netdev_max_backlog is not a pernet setting, this sets the backlog to
> > >>> 1000 during exit to prevent disturbing following tests.
> > >>>
> > >>
> > >> Hmmm... I would prefer not changing a global setting like that.
> > >> This is going to be flaky since we often run tests in parallel (using different netns)
> > >>
> > >> What about adding a small delay after each sent packet ?
> > >>
> > >> diff --git a/tools/testing/selftests/net/ip_defrag.c b/tools/testing/selftests/net/ip_defrag.c
> > >> index c0c9ecb891e1d78585e0db95fd8783be31bc563a..24d0723d2e7e9b94c3e365ee2ee30e9445deafa8 100644
> > >> --- a/tools/testing/selftests/net/ip_defrag.c
> > >> +++ b/tools/testing/selftests/net/ip_defrag.c
> > >> @@ -198,6 +198,7 @@ static void send_fragment(int fd_raw, struct sockaddr *addr, socklen_t alen,
> > >>                 error(1, 0, "send_fragment: %d vs %d", res, frag_len);
> > >>  
> > >>         frag_counter++;
> > >> +       usleep(1000);
> > >>  }
> > >>  
> > >>  static void send_udp_frags(int fd_raw, struct sockaddr *addr,
> > >>
> > > 
> > > That won't work because the issue only shows when we using conntrack, as the
> > > packet will be reassembled on output, then fragmented again. When this happens,
> > > the fragmentation code is transmitting the fragments in a tight loop, which
> > > floods the backlog.
> > 
> > Interesting !
> > 
> > So it looks like the test is correct, and exposed a long standing problem in this code.
> > 
> > We should not adjust the test to some kernel-of-the-day-constraints, and instead fix the kernel bug ;)
> > 
> > Where is this tight loop exactly ?
> > 
> > If this is feeding/bursting ~1000 skbs via netif_rx() in a BH context, maybe we need to call a variant
> > that allows immediate processing instead of (ab)using the softnet backlog.
> > 
> > Thanks !
> 
> This is the loopback interface, so its xmit calls netif_rx. I suppose we would
> have the same problem with veth, for example.
> 
> So net/ipv6/ip6_output.c:ip6_fragment has this:
> 
> 		for (;;) {
> 			/* Prepare header of the next frame,
> 			 * before previous one went down. */
> 			if (iter.frag)
> 				ip6_fraglist_prepare(skb, &iter);
> 
> 			skb->tstamp = tstamp;
> 			err = output(net, sk, skb);
> 			if (!err)
> 				IP6_INC_STATS(net, ip6_dst_idev(&rt->dst),
> 					      IPSTATS_MIB_FRAGCREATES);
> 
> 			if (err || !iter.frag)
> 				break;
> 
> 			skb = ip6_fraglist_next(&iter);
> 		}
> 
> output is ip6_finish_output2, which will call neigh_output, which ends up
> calling dev_queue_xmit.
> 
> In this case, ip6_fragment is being called probably from rawv6_send_hdrinc ->
> dst_output -> ip6_output -> ip6_finish_output -> __ip6_finish_output ->
> ip6_fragment.
> 
> dst_output at rawv6_send_hdrinc is being called after netfilter
> NF_INET_LOCAL_OUT hook. That one is gathering the fragments and only accepting
> that last, reassembled skb, which causes ip6_fragment enter that loop.
> 
> So, basically, the easiest way to reproduce this is using this test with
> loopback and netfilter doing the reassembly during conntrack. I see some BH
> locks here and there, but I think this is just filling up the backlog too fast
> to give any chance for softirq to kick in.
> 
> I will see if I can reproduce this using routed veths.
> 

Confirmed that the same happens when using veth.

vethX (nsX) <-> veth1 (router) forwards through veth2 (router) <-> vethY (nsY)

With such a setup, when I send those fragments from nsX to nsY, they get
through, until I set up that same conntrack rule on the router. Then, increasing
netdev_max_backlog allows those fragments to go through again.

That at least seems to be a plausible scenario that we would like to fix, as
you said, instead of only making a test pass.

Next Monday, I may test anything you come up with.

Thanks.
Cascardo.

Patch

diff --git a/tools/testing/selftests/net/ip_defrag.sh b/tools/testing/selftests/net/ip_defrag.sh
index 15d3489ecd9c..c91cfecfa245 100755
--- a/tools/testing/selftests/net/ip_defrag.sh
+++ b/tools/testing/selftests/net/ip_defrag.sh
@@ -12,6 +12,8 @@  setup() {
 	ip netns add "${NETNS}"
 	ip -netns "${NETNS}" link set lo up
 
+	sysctl -w net.core.netdev_max_backlog=1250 >/dev/null 2>&1
+
 	ip netns exec "${NETNS}" sysctl -w net.ipv4.ipfrag_high_thresh=9000000 >/dev/null 2>&1
 	ip netns exec "${NETNS}" sysctl -w net.ipv4.ipfrag_low_thresh=7000000 >/dev/null 2>&1
 	ip netns exec "${NETNS}" sysctl -w net.ipv4.ipfrag_time=1 >/dev/null 2>&1
@@ -30,6 +32,7 @@  setup() {
 
 cleanup() {
 	ip netns del "${NETNS}"
+	sysctl -w net.core.netdev_max_backlog=1000 >/dev/null 2>&1
 }
 
 trap cleanup EXIT