diff mbox series

[RFC] fix xfrm MTU regression

Message ID 20210429170254.5grfgsz2hgy2qjhk@dwarf.suse.cz (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Headers show
Series [RFC] fix xfrm MTU regression | expand

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present success Link
netdev/patch_count success Link
netdev/tree_selection success Guessed tree name to be net-next
netdev/subject_prefix warning Target tree name not specified in the subject
netdev/cc_maintainers warning 3 maintainers not CCed: yoshfuji@linux-ipv6.org dsahern@kernel.org kuba@kernel.org
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 37 this patch: 37
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch warning WARNING: line length of 83 exceeds 80 columns
netdev/build_allmodconfig_warn success Errors and warnings before: 30 this patch: 30
netdev/header_inline success Link

Commit Message

Jiri Bohac April 29, 2021, 5:02 p.m. UTC
Hi,

Commit 749439bfac6e1a2932c582e2699f91d329658196 ("ipv6: fix udpv6
sendmsg crash caused by too small MTU") breaks PMTU for xfrm.

A Packet Too Big ICMPv6 message received in response to an ESP
packet will prevent all further communication through the tunnel
if the reported MTU minus the ESP overhead is smaller than 1280.

E.g. in a case of a tunnel-mode ESP with sha256/aes the overhead
is 92 bytes. Receiving a PTB with MTU of 1371 or less will result
in all further packets in the tunnel dropped. A ping through the
tunnel fails with "ping: sendmsg: Invalid argument".

Apparently the MTU on the xfrm route is smaller than 1280 and
fails the check inside ip6_setup_cork() added by 749439bf.

We found this by debugging USGv6/ipv6ready failures. Failing
tests are: "Phase-2 Interoperability Test Scenario IPsec" /
5.3.11 and 5.4.11 (Tunnel Mode: Fragmentation).

Below is my attempt to fix the situation by dropping the MTU
check and instead checking for the underflows described in the
749439bf commit message (without much understanding of the
details!). Does this make sense?:

Signed-off-by: Jiri Bohac <jbohac@suse.cz>

Comments

Sabrina Dubroca April 29, 2021, 7:48 p.m. UTC | #1
2021-04-29, 19:02:54 +0200, Jiri Bohac wrote:
> Hi,
> 
> Commit 749439bfac6e1a2932c582e2699f91d329658196 ("ipv6: fix udpv6
> sendmsg crash caused by too small MTU") breaks PMTU for xfrm.
> 
> A Packet Too Big ICMPv6 message received in response to an ESP
> packet will prevent all further communication through the tunnel
> if the reported MTU minus the ESP overhead is smaller than 1280.
> 
> E.g. in a case of a tunnel-mode ESP with sha256/aes the overhead
> is 92 bytes. Receiving a PTB with MTU of 1371 or less will result
> in all further packets in the tunnel dropped. A ping through the
> tunnel fails with "ping: sendmsg: Invalid argument".
> 
> Apparently the MTU on the xfrm route is smaller than 1280 and
> fails the check inside ip6_setup_cork() added by 749439bf.
> 
> We found this by debugging USGv6/ipv6ready failures. Failing
> tests are: "Phase-2 Interoperability Test Scenario IPsec" /
> 5.3.11 and 5.4.11 (Tunnel Mode: Fragmentation).

That should be fixed with commit b515d2637276 ("xfrm: xfrm_state_mtu
should return at least 1280 for ipv6"), currently in Steffen's ipsec
tree:
https://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec.git/commit/?id=b515d2637276
Jiri Bohac April 29, 2021, 8:25 p.m. UTC | #2
On Thu, Apr 29, 2021 at 09:48:09PM +0200, Sabrina Dubroca wrote:
> That should be fixed with commit b515d2637276 ("xfrm: xfrm_state_mtu
> should return at least 1280 for ipv6"), currently in Steffen's ipsec
> tree:
> https://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec.git/commit/?id=b515d2637276

Thanks, that is interesting! The patch makes my large (-s 1400) pings inside
ESP pass through a 1280-MTU link on an intermediary router  but in a suboptimal
double-fragmented way. tcpdump on the router shows:

	22:09:44.556452 IP6 2001:db8:ffff::1 > 2001:db8:ffff:1::1: frag (0|1232) ESP(spi=0x00000001,seq=0xdd), length 1232                    
	22:09:44.566269 IP6 2001:db8:ffff::1 > 2001:db8:ffff:1::1: frag (1232|100)                                                            
	22:09:44.566553 IP6 2001:db8:ffff::1 > 2001:db8:ffff:1::1: ESP(spi=0x00000001,seq=0xde), length 276

I.e. the ping is fragmented into two ESP packets and the first ESP packet is then fragmented again.

The same pings with my patch come through in two fragments:

	22:13:22.072934 IP6 2001:db8:ffff::1 > 2001:db8:ffff:1::1: ESP(spi=0x00000001,seq=0x28), length 1236
	22:13:22.073039 IP6 2001:db8:ffff::1 > 2001:db8:ffff:1::1: ESP(spi=0x00000001,seq=0x29), length 356 

I can do more tests if needed.
Jiri Bohac April 30, 2021, 5:36 a.m. UTC | #3
On Thu, Apr 29, 2021 at 07:02:55PM +0200, Jiri Bohac wrote:
> Below is my attempt to fix the situation by dropping the MTU
> check and instead checking for the underflows described in the
> 749439bf commit message (without much understanding of the
> details!). Does this make sense?:

the first version left headersize uninitialized in the error
path; v2 below fixes this.

Signed-off-by: Jiri Bohac <jbohac@suse.cz>
 
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index ff4f9ebcf7f6..171eb4ec1e67 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1402,8 +1402,6 @@ static int ip6_setup_cork(struct sock *sk, struct inet_cork_full *cork,
 		if (np->frag_size)
 			mtu = np->frag_size;
 	}
-	if (mtu < IPV6_MIN_MTU)
-		return -EINVAL;
 	cork->base.fragsize = mtu;
 	cork->base.gso_size = ipc6->gso_size;
 	cork->base.tx_flags = 0;
@@ -1465,8 +1463,6 @@ static int __ip6_append_data(struct sock *sk,
 
 	fragheaderlen = sizeof(struct ipv6hdr) + rt->rt6i_nfheader_len +
 			(opt ? opt->opt_nflen : 0);
-	maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen -
-		     sizeof(struct frag_hdr);
 
 	headersize = sizeof(struct ipv6hdr) +
 		     (opt ? opt->opt_flen + opt->opt_nflen : 0) +
@@ -1474,6 +1470,13 @@ static int __ip6_append_data(struct sock *sk,
 		      sizeof(struct frag_hdr) : 0) +
 		     rt->rt6i_nfheader_len;
 
+	if (mtu < fragheaderlen ||
+	    ((mtu - fragheaderlen) & ~7) + fragheaderlen < sizeof(struct frag_hdr))
+		goto emsgsize;
+
+	maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen -
+		     sizeof(struct frag_hdr);
+
 	/* as per RFC 7112 section 5, the entire IPv6 Header Chain must fit
 	 * the first fragment
 	 */
Sabrina Dubroca May 1, 2021, 10:23 a.m. UTC | #4
2021-04-29, 22:25:29 +0200, Jiri Bohac wrote:
> On Thu, Apr 29, 2021 at 09:48:09PM +0200, Sabrina Dubroca wrote:
> > That should be fixed with commit b515d2637276 ("xfrm: xfrm_state_mtu
> > should return at least 1280 for ipv6"), currently in Steffen's ipsec
> > tree:
> > https://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec.git/commit/?id=b515d2637276
> 
> Thanks, that is interesting! The patch makes my large (-s 1400) pings inside
> ESP pass through a 1280-MTU link on an intermediary router  but in a suboptimal
> double-fragmented way. tcpdump on the router shows:
> 
> 	22:09:44.556452 IP6 2001:db8:ffff::1 > 2001:db8:ffff:1::1: frag (0|1232) ESP(spi=0x00000001,seq=0xdd), length 1232                    
> 	22:09:44.566269 IP6 2001:db8:ffff::1 > 2001:db8:ffff:1::1: frag (1232|100)                                                            
> 	22:09:44.566553 IP6 2001:db8:ffff::1 > 2001:db8:ffff:1::1: ESP(spi=0x00000001,seq=0xde), length 276
> 
> I.e. the ping is fragmented into two ESP packets and the first ESP packet is then fragmented again.

It's a bit ugly, but I don't think we can do any better. We're going
through the stack twice in tunnel mode. The first pass (before xfrm)
we fragment according to the PMTU (adjusted to IPV6_MIN_MTU, because
MTUs lower than that are illegal in IPv6). The second time (after
xfrm), the first ESP packet is too big so we fragment it. This
behavior is consistent with a vti device running over a network with
MTU=1280 (which doesn't seem to work without my patch).

In transport mode, we're only going through the stack once, so we
don't see this double fragmentation.

I think my patch is correct, because without it we have IPv6 dsts
going around the kernel with an associated MTU smaller than
IPV6_MIN_MTU.
diff mbox series

Patch

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index ff4f9ebcf7f6..8af6adb42c85 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1402,8 +1402,6 @@  static int ip6_setup_cork(struct sock *sk, struct inet_cork_full *cork,
 		if (np->frag_size)
 			mtu = np->frag_size;
 	}
-	if (mtu < IPV6_MIN_MTU)
-		return -EINVAL;
 	cork->base.fragsize = mtu;
 	cork->base.gso_size = ipc6->gso_size;
 	cork->base.tx_flags = 0;
@@ -1465,6 +1463,11 @@  static int __ip6_append_data(struct sock *sk,
 
 	fragheaderlen = sizeof(struct ipv6hdr) + rt->rt6i_nfheader_len +
 			(opt ? opt->opt_nflen : 0);
+
+	if (mtu < fragheaderlen ||
+	    ((mtu - fragheaderlen) & ~7) + fragheaderlen < sizeof(struct frag_hdr))
+		goto emsgsize;
+
 	maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen -
 		     sizeof(struct frag_hdr);