[RFC] fix xfrm MTU regression

Message ID	20210429170254.5grfgsz2hgy2qjhk@dwarf.suse.cz (mailing list archive)
State	RFC
Delegated to:	Netdev Maintainers
Headers	show Return-Path: <netdev-owner@kernel.org> Date: Thu, 29 Apr 2021 19:02:54 +0200 From: Jiri Bohac <jbohac@suse.cz> To: Mike Maloney <maloneykernel@gmail.com>, Eric Dumazet <eric.dumazet@gmail.com>, davem@davemloft.net Cc: netdev@vger.kernel.org, Steffen Klassert <steffen.klassert@secunet.com>, Herbert Xu <herbert@gondor.apana.org.au> Subject: [RFC PATCH] fix xfrm MTU regression Message-ID: <20210429170254.5grfgsz2hgy2qjhk@dwarf.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Precedence: bulk
Series	[RFC] fix xfrm MTU regression \| expand [RFC] fix xfrm MTU regression

Context	Check	Description
netdev/cover_letter	success	Link
netdev/fixes_present	success	Link
netdev/patch_count	success	Link
netdev/tree_selection	success	Guessed tree name to be net-next
netdev/subject_prefix	warning	Target tree name not specified in the subject
netdev/cc_maintainers	warning	3 maintainers not CCed: yoshfuji@linux-ipv6.org dsahern@kernel.org kuba@kernel.org
netdev/source_inline	success	Was 0 now: 0
netdev/verify_signedoff	success	Link
netdev/module_param	success	Was 0 now: 0
netdev/build_32bit	success	Errors and warnings before: 37 this patch: 37
netdev/kdoc	success	Errors and warnings before: 0 this patch: 0
netdev/verify_fixes	success	Link
netdev/checkpatch	warning	WARNING: line length of 83 exceeds 80 columns
netdev/build_allmodconfig_warn	success	Errors and warnings before: 30 this patch: 30
netdev/header_inline	success	Link

Jiri Bohac April 29, 2021, 5:02 p.m. UTC

Hi,

Commit 749439bfac6e1a2932c582e2699f91d329658196 ("ipv6: fix udpv6
sendmsg crash caused by too small MTU") breaks PMTU for xfrm.

A Packet Too Big ICMPv6 message received in response to an ESP
packet will prevent all further communication through the tunnel
if the reported MTU minus the ESP overhead is smaller than 1280.

E.g. in a case of a tunnel-mode ESP with sha256/aes the overhead
is 92 bytes. Receiving a PTB with MTU of 1371 or less will result
in all further packets in the tunnel dropped. A ping through the
tunnel fails with "ping: sendmsg: Invalid argument".

Apparently the MTU on the xfrm route is smaller than 1280 and
fails the check inside ip6_setup_cork() added by 749439bf.

We found this by debugging USGv6/ipv6ready failures. Failing
tests are: "Phase-2 Interoperability Test Scenario IPsec" /
5.3.11 and 5.4.11 (Tunnel Mode: Fragmentation).

Below is my attempt to fix the situation by dropping the MTU
check and instead checking for the underflows described in the
749439bf commit message (without much understanding of the
details!). Does this make sense?:

Signed-off-by: Jiri Bohac <jbohac@suse.cz>

Sabrina Dubroca April 29, 2021, 7:48 p.m. UTC | #1

2021-04-29, 19:02:54 +0200, Jiri Bohac wrote:
> Hi,
> 
> Commit 749439bfac6e1a2932c582e2699f91d329658196 ("ipv6: fix udpv6
> sendmsg crash caused by too small MTU") breaks PMTU for xfrm.
> 
> A Packet Too Big ICMPv6 message received in response to an ESP
> packet will prevent all further communication through the tunnel
> if the reported MTU minus the ESP overhead is smaller than 1280.
> 
> E.g. in a case of a tunnel-mode ESP with sha256/aes the overhead
> is 92 bytes. Receiving a PTB with MTU of 1371 or less will result
> in all further packets in the tunnel dropped. A ping through the
> tunnel fails with "ping: sendmsg: Invalid argument".
> 
> Apparently the MTU on the xfrm route is smaller than 1280 and
> fails the check inside ip6_setup_cork() added by 749439bf.
> 
> We found this by debugging USGv6/ipv6ready failures. Failing
> tests are: "Phase-2 Interoperability Test Scenario IPsec" /
> 5.3.11 and 5.4.11 (Tunnel Mode: Fragmentation).

That should be fixed with commit b515d2637276 ("xfrm: xfrm_state_mtu
should return at least 1280 for ipv6"), currently in Steffen's ipsec
tree:
https://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec.git/commit/?id=b515d2637276

Jiri Bohac April 29, 2021, 8:25 p.m. UTC | #2

On Thu, Apr 29, 2021 at 09:48:09PM +0200, Sabrina Dubroca wrote:
> That should be fixed with commit b515d2637276 ("xfrm: xfrm_state_mtu
> should return at least 1280 for ipv6"), currently in Steffen's ipsec
> tree:
> https://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec.git/commit/?id=b515d2637276

Thanks, that is interesting! The patch makes my large (-s 1400) pings inside
ESP pass through a 1280-MTU link on an intermediary router  but in a suboptimal
double-fragmented way. tcpdump on the router shows:

	22:09:44.556452 IP6 2001:db8:ffff::1 > 2001:db8:ffff:1::1: frag (0|1232) ESP(spi=0x00000001,seq=0xdd), length 1232                    
	22:09:44.566269 IP6 2001:db8:ffff::1 > 2001:db8:ffff:1::1: frag (1232|100)                                                            
	22:09:44.566553 IP6 2001:db8:ffff::1 > 2001:db8:ffff:1::1: ESP(spi=0x00000001,seq=0xde), length 276

I.e. the ping is fragmented into two ESP packets and the first ESP packet is then fragmented again.

The same pings with my patch come through in two fragments:

	22:13:22.072934 IP6 2001:db8:ffff::1 > 2001:db8:ffff:1::1: ESP(spi=0x00000001,seq=0x28), length 1236
	22:13:22.073039 IP6 2001:db8:ffff::1 > 2001:db8:ffff:1::1: ESP(spi=0x00000001,seq=0x29), length 356 

I can do more tests if needed.

Jiri Bohac April 30, 2021, 5:36 a.m. UTC | #3

On Thu, Apr 29, 2021 at 07:02:55PM +0200, Jiri Bohac wrote:
> Below is my attempt to fix the situation by dropping the MTU
> check and instead checking for the underflows described in the
> 749439bf commit message (without much understanding of the
> details!). Does this make sense?:

the first version left headersize uninitialized in the error
path; v2 below fixes this.

Signed-off-by: Jiri Bohac <jbohac@suse.cz>
 
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index ff4f9ebcf7f6..171eb4ec1e67 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1402,8 +1402,6 @@ static int ip6_setup_cork(struct sock *sk, struct inet_cork_full *cork,
 		if (np->frag_size)
 			mtu = np->frag_size;
 	}
-	if (mtu < IPV6_MIN_MTU)
-		return -EINVAL;
 	cork->base.fragsize = mtu;
 	cork->base.gso_size = ipc6->gso_size;
 	cork->base.tx_flags = 0;
@@ -1465,8 +1463,6 @@ static int __ip6_append_data(struct sock *sk,
 
 	fragheaderlen = sizeof(struct ipv6hdr) + rt->rt6i_nfheader_len +
 			(opt ? opt->opt_nflen : 0);
-	maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen -
-		     sizeof(struct frag_hdr);
 
 	headersize = sizeof(struct ipv6hdr) +
 		     (opt ? opt->opt_flen + opt->opt_nflen : 0) +
@@ -1474,6 +1470,13 @@ static int __ip6_append_data(struct sock *sk,
 		      sizeof(struct frag_hdr) : 0) +
 		     rt->rt6i_nfheader_len;
 
+	if (mtu < fragheaderlen ||
+	    ((mtu - fragheaderlen) & ~7) + fragheaderlen < sizeof(struct frag_hdr))
+		goto emsgsize;
+
+	maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen -
+		     sizeof(struct frag_hdr);
+
 	/* as per RFC 7112 section 5, the entire IPv6 Header Chain must fit
 	 * the first fragment
 	 */

Sabrina Dubroca May 1, 2021, 10:23 a.m. UTC | #4

2021-04-29, 22:25:29 +0200, Jiri Bohac wrote:
> On Thu, Apr 29, 2021 at 09:48:09PM +0200, Sabrina Dubroca wrote:
> > That should be fixed with commit b515d2637276 ("xfrm: xfrm_state_mtu
> > should return at least 1280 for ipv6"), currently in Steffen's ipsec
> > tree:
> > https://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec.git/commit/?id=b515d2637276
> 
> Thanks, that is interesting! The patch makes my large (-s 1400) pings inside
> ESP pass through a 1280-MTU link on an intermediary router  but in a suboptimal
> double-fragmented way. tcpdump on the router shows:
> 
> 	22:09:44.556452 IP6 2001:db8:ffff::1 > 2001:db8:ffff:1::1: frag (0|1232) ESP(spi=0x00000001,seq=0xdd), length 1232                    
> 	22:09:44.566269 IP6 2001:db8:ffff::1 > 2001:db8:ffff:1::1: frag (1232|100)                                                            
> 	22:09:44.566553 IP6 2001:db8:ffff::1 > 2001:db8:ffff:1::1: ESP(spi=0x00000001,seq=0xde), length 276
> 
> I.e. the ping is fragmented into two ESP packets and the first ESP packet is then fragmented again.

It's a bit ugly, but I don't think we can do any better. We're going
through the stack twice in tunnel mode. The first pass (before xfrm)
we fragment according to the PMTU (adjusted to IPV6_MIN_MTU, because
MTUs lower than that are illegal in IPv6). The second time (after
xfrm), the first ESP packet is too big so we fragment it. This
behavior is consistent with a vti device running over a network with
MTU=1280 (which doesn't seem to work without my patch).

In transport mode, we're only going through the stack once, so we
don't see this double fragmentation.

I think my patch is correct, because without it we have IPv6 dsts
going around the kernel with an associated MTU smaller than
IPV6_MIN_MTU.

[RFC] fix xfrm MTU regression

Checks

Commit Message

Comments

Patch