Message ID | 1603272115-25351-1-git-send-email-wenxu@ucloud.cn (mailing list archive) |
---|---|
State | Not Applicable |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net] ip_tunnel: fix over-mtu packet send fail without TUNNEL_DONT_FRAGMENT flags | expand |
On Wed, 21 Oct 2020 17:21:55 +0800 wenxu@ucloud.cn wrote: > From: wenxu <wenxu@ucloud.cn> > > The TUNNEL_DONT_FRAGMENT flags specific the tunnel outer ip can do > fragment or not in the md mode. Without the TUNNEL_DONT_FRAGMENT > should always do fragment. So it should not care the frag_off in > inner ip. Can you describe the use case better? My understanding is that we should propagate DF in normally functioning networks, and let PMTU do its job. > Fixes: cfc7381b3002 ("ip_tunnel: add collect_md mode to IPIP tunnel") > Signed-off-by: wenxu <wenxu@ucloud.cn> > --- > net/ipv4/ip_tunnel.c | 3 --- > 1 file changed, 3 deletions(-) > > diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c > index 8b04d1d..ee65c92 100644 > --- a/net/ipv4/ip_tunnel.c > +++ b/net/ipv4/ip_tunnel.c > @@ -608,9 +608,6 @@ void ip_md_tunnel_xmit(struct sk_buff *skb, struct net_device *dev, > ttl = ip4_dst_hoplimit(&rt->dst); > } > > - if (!df && skb->protocol == htons(ETH_P_IP)) > - df = inner_iph->frag_off & htons(IP_DF); > - > headroom += LL_RESERVED_SPACE(rt->dst.dev) + rt->dst.header_len; > if (headroom > dev->needed_headroom) > dev->needed_headroom = headroom;
On 10/24/2020 5:12 AM, Jakub Kicinski wrote: > On Wed, 21 Oct 2020 17:21:55 +0800 wenxu@ucloud.cn wrote: >> From: wenxu <wenxu@ucloud.cn> >> >> The TUNNEL_DONT_FRAGMENT flags specific the tunnel outer ip can do >> fragment or not in the md mode. Without the TUNNEL_DONT_FRAGMENT >> should always do fragment. So it should not care the frag_off in >> inner ip. > Can you describe the use case better? My understanding is that we > should propagate DF in normally functioning networks, and let PMTU > do its job. Sorry for relying so late. ip_md_tunnel_xmit send packet in the collect_md mode. For OpenVswitch example, ovs set the gre port with flags df_default=false which will not set TUNNEL_DONT_FRAGMENT for tun_flags. And the mtu of virtual machine is 1500 with default. And the tunnel underlay device mtu is 1500 default too. So if the size of packet send from vm + underlay length > underlay device mtu. The packet always be dropped if the ip header of packet set flags with DF. In the collect_md the outer packet can fragment or not should depends on the tun_flags but not inner ip header like vxlan device did. >> Fixes: cfc7381b3002 ("ip_tunnel: add collect_md mode to IPIP tunnel") >> Signed-off-by: wenxu <wenxu@ucloud.cn> >> --- >> net/ipv4/ip_tunnel.c | 3 --- >> 1 file changed, 3 deletions(-) >> >> diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c >> index 8b04d1d..ee65c92 100644 >> --- a/net/ipv4/ip_tunnel.c >> +++ b/net/ipv4/ip_tunnel.c >> @@ -608,9 +608,6 @@ void ip_md_tunnel_xmit(struct sk_buff *skb, struct net_device *dev, >> ttl = ip4_dst_hoplimit(&rt->dst); >> } >> >> - if (!df && skb->protocol == htons(ETH_P_IP)) >> - df = inner_iph->frag_off & htons(IP_DF); >> - >> headroom += LL_RESERVED_SPACE(rt->dst.dev) + rt->dst.header_len; >> if (headroom > dev->needed_headroom) >> dev->needed_headroom = headroom; >
On Mon, 26 Oct 2020 16:23:29 +0800 wenxu wrote: > On 10/24/2020 5:12 AM, Jakub Kicinski wrote: > > On Wed, 21 Oct 2020 17:21:55 +0800 wenxu@ucloud.cn wrote: > >> From: wenxu <wenxu@ucloud.cn> > >> > >> The TUNNEL_DONT_FRAGMENT flags specific the tunnel outer ip can do > >> fragment or not in the md mode. Without the TUNNEL_DONT_FRAGMENT > >> should always do fragment. So it should not care the frag_off in > >> inner ip. > > Can you describe the use case better? My understanding is that we > > should propagate DF in normally functioning networks, and let PMTU > > do its job. > > Sorry for relying so late. ip_md_tunnel_xmit send packet in the collect_md mode. > > For OpenVswitch example, ovs set the gre port with flags df_default=false which will not > > set TUNNEL_DONT_FRAGMENT for tun_flags. > > And the mtu of virtual machine is 1500 with default. And the tunnel underlay device mtu > > is 1500 default too. So if the size of packet send from vm + underlay length > underlay device mtu. > > The packet always be dropped if the ip header of packet set flags with DF. > > In the collect_md the outer packet can fragment or not should depends on the tun_flags but not inner > > ip header like vxlan device did. Is this another incarnation of 4cb47a8644cc ("tunnels: PMTU discovery support for directly bridged IP packets")? Sounds like non-UDP tunnels need the same treatment to make PMTUD work. RFC2003 seems to clearly forbid ignoring the inner DF: Identification, Flags, Fragment Offset These three fields are set as specified in [10]. However, if the "Don't Fragment" bit is set in the inner IP header, it MUST be set in the outer IP header; if the "Don't Fragment" bit is not set in the inner IP header, it MAY be set in the outer IP header, as described in Section 5.1. and: [..] In particular, use of IP options is allowed, and use of fragmentation is allowed unless the "Don't Fragment" bit is set in the inner IP header. This restriction on fragmentation is required so that nodes employing Path MTU Discovery [7] can obtain the information they seek.
On 10/26/20 2:56 PM, Jakub Kicinski wrote: > On Mon, 26 Oct 2020 16:23:29 +0800 wenxu wrote: >> On 10/24/2020 5:12 AM, Jakub Kicinski wrote: >>> On Wed, 21 Oct 2020 17:21:55 +0800 wenxu@ucloud.cn wrote: >>>> From: wenxu <wenxu@ucloud.cn> >>>> >>>> The TUNNEL_DONT_FRAGMENT flags specific the tunnel outer ip can do >>>> fragment or not in the md mode. Without the TUNNEL_DONT_FRAGMENT >>>> should always do fragment. So it should not care the frag_off in >>>> inner ip. >>> Can you describe the use case better? My understanding is that we >>> should propagate DF in normally functioning networks, and let PMTU >>> do its job. >> >> Sorry for relying so late. ip_md_tunnel_xmit send packet in the collect_md mode. >> >> For OpenVswitch example, ovs set the gre port with flags df_default=false which will not >> >> set TUNNEL_DONT_FRAGMENT for tun_flags. >> >> And the mtu of virtual machine is 1500 with default. And the tunnel underlay device mtu >> >> is 1500 default too. So if the size of packet send from vm + underlay length > underlay device mtu. >> >> The packet always be dropped if the ip header of packet set flags with DF. >> >> In the collect_md the outer packet can fragment or not should depends on the tun_flags but not inner >> >> ip header like vxlan device did. > > Is this another incarnation of 4cb47a8644cc ("tunnels: PMTU discovery > support for directly bridged IP packets")? Sounds like non-UDP tunnels > need the same treatment to make PMTUD work. > > RFC2003 seems to clearly forbid ignoring the inner DF: I was looking at this patch Sunday night. To me it seems odd that packets flowing through the overlay affect decisions in the underlay which meant I agree with the proposed change. ip_md_tunnel_xmit is inconsistent right now. tnl_update_pmtu is called based on the TUNNEL_DONT_FRAGMENT flag, so why let it be changed later based on the inner header? Or, if you agree with RFC 2003 and the DF should be propagated outer to inner, then it seems like the df reset needs to be moved up before the call to tnl_update_pmtu
On Tue, 27 Oct 2020 08:51:07 -0600 David Ahern wrote: > > Is this another incarnation of 4cb47a8644cc ("tunnels: PMTU discovery > > support for directly bridged IP packets")? Sounds like non-UDP tunnels > > need the same treatment to make PMTUD work. > > > > RFC2003 seems to clearly forbid ignoring the inner DF: > > I was looking at this patch Sunday night. To me it seems odd that > packets flowing through the overlay affect decisions in the underlay > which meant I agree with the proposed change. The RFC was probably written before we invented terms like underlay and overlay, and still considered tunneling to be an inefficient hack ;) > ip_md_tunnel_xmit is inconsistent right now. tnl_update_pmtu is called > based on the TUNNEL_DONT_FRAGMENT flag, so why let it be changed later > based on the inner header? Or, if you agree with RFC 2003 and the DF > should be propagated outer to inner, then it seems like the df reset > needs to be moved up before the call to tnl_update_pmtu Looks like TUNNEL_DONT_FRAGMENT is intended to switch between using PMTU inside the tunnel or just the tunnel dev MTU. ICMP PTB is still generated based on the inner headers. We should be okay to add something like IFLA_GRE_IGNORE_DF to lwt, but IMHO the default should not be violating the RFC.
On 10/27/2020 11:55 PM, Jakub Kicinski wrote: > On Tue, 27 Oct 2020 08:51:07 -0600 David Ahern wrote: >>> Is this another incarnation of 4cb47a8644cc ("tunnels: PMTU discovery >>> support for directly bridged IP packets")? Sounds like non-UDP tunnels >>> need the same treatment to make PMTUD work. >>> >>> RFC2003 seems to clearly forbid ignoring the inner DF: >> I was looking at this patch Sunday night. To me it seems odd that >> packets flowing through the overlay affect decisions in the underlay >> which meant I agree with the proposed change. > The RFC was probably written before we invented terms like underlay > and overlay, and still considered tunneling to be an inefficient hack ;) > >> ip_md_tunnel_xmit is inconsistent right now. tnl_update_pmtu is called >> based on the TUNNEL_DONT_FRAGMENT flag, so why let it be changed later >> based on the inner header? Or, if you agree with RFC 2003 and the DF >> should be propagated outer to inner, then it seems like the df reset >> needs to be moved up before the call to tnl_update_pmtu > Looks like TUNNEL_DONT_FRAGMENT is intended to switch between using > PMTU inside the tunnel or just the tunnel dev MTU. ICMP PTB is still > generated based on the inner headers. > > We should be okay to add something like IFLA_GRE_IGNORE_DF to lwt, > but IMHO the default should not be violating the RFC. If we add TUNNEL_IGNORE_DF to lwt, the two IGNORE_DF and DONT_FRAGMENT flags should not coexist ? Or DONT_FRAGMENT is prior to the IGNORE_DF? Also there is inconsistent in the kernel for the tunnel device. For geneve and vxlan tunnel (don't send tunnel with ip_md_tunnel_xmit) in the lwt mode set the outer df only based TUNNEL_DONT_FRAGMENT . And this is also the some behavior for gre device before switching to use ip_md_tunnel_xmit as the following patch. 962924f ip_gre: Refactor collect metatdata mode tunnel xmit to ip_md_tunnel_xmit
On Thu, 29 Oct 2020 10:30:50 +0800 wenxu wrote: > On 10/27/2020 11:55 PM, Jakub Kicinski wrote: > > On Tue, 27 Oct 2020 08:51:07 -0600 David Ahern wrote: > >>> Is this another incarnation of 4cb47a8644cc ("tunnels: PMTU discovery > >>> support for directly bridged IP packets")? Sounds like non-UDP tunnels > >>> need the same treatment to make PMTUD work. > >>> > >>> RFC2003 seems to clearly forbid ignoring the inner DF: > >> I was looking at this patch Sunday night. To me it seems odd that > >> packets flowing through the overlay affect decisions in the underlay > >> which meant I agree with the proposed change. > > The RFC was probably written before we invented terms like underlay > > and overlay, and still considered tunneling to be an inefficient hack ;) > > > >> ip_md_tunnel_xmit is inconsistent right now. tnl_update_pmtu is called > >> based on the TUNNEL_DONT_FRAGMENT flag, so why let it be changed later > >> based on the inner header? Or, if you agree with RFC 2003 and the DF > >> should be propagated outer to inner, then it seems like the df reset > >> needs to be moved up before the call to tnl_update_pmtu > > Looks like TUNNEL_DONT_FRAGMENT is intended to switch between using > > PMTU inside the tunnel or just the tunnel dev MTU. ICMP PTB is still > > generated based on the inner headers. > > > > We should be okay to add something like IFLA_GRE_IGNORE_DF to lwt, > > but IMHO the default should not be violating the RFC. > > If we add TUNNEL_IGNORE_DF to lwt, the two IGNORE_DF and DONT_FRAGMENT > > flags should not coexist ? Or DONT_FRAGMENT is prior to the IGNORE_DF? > > > Also there is inconsistent in the kernel for the tunnel device. For geneve and > > vxlan tunnel (don't send tunnel with ip_md_tunnel_xmit) in the lwt mode set > > the outer df only based TUNNEL_DONT_FRAGMENT . > > And this is also the some behavior for gre device before switching to use > ip_md_tunnel_xmit as the following patch. > > 962924f ip_gre: Refactor collect metatdata mode tunnel xmit to ip_md_tunnel_xmit Ah, that's a lot more convincing, I was looking at the Fixes tag you provided, but it seems like Fixes should really point at the commit you mention here. Please mention the change in GRE behavior and the discrepancy between handling of DF by different tunnels in the commit message and repost.
diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c index 8b04d1d..ee65c92 100644 --- a/net/ipv4/ip_tunnel.c +++ b/net/ipv4/ip_tunnel.c @@ -608,9 +608,6 @@ void ip_md_tunnel_xmit(struct sk_buff *skb, struct net_device *dev, ttl = ip4_dst_hoplimit(&rt->dst); } - if (!df && skb->protocol == htons(ETH_P_IP)) - df = inner_iph->frag_off & htons(IP_DF); - headroom += LL_RESERVED_SPACE(rt->dst.dev) + rt->dst.header_len; if (headroom > dev->needed_headroom) dev->needed_headroom = headroom;