diff mbox series

[net] ip_tunnel: fix over-mtu packet send fail without TUNNEL_DONT_FRAGMENT flags

Message ID 1603272115-25351-1-git-send-email-wenxu@ucloud.cn (mailing list archive)
State Not Applicable
Delegated to: Netdev Maintainers
Headers show
Series [net] ip_tunnel: fix over-mtu packet send fail without TUNNEL_DONT_FRAGMENT flags | expand

Commit Message

wenxu Oct. 21, 2020, 9:21 a.m. UTC
From: wenxu <wenxu@ucloud.cn>

The TUNNEL_DONT_FRAGMENT flags specific the tunnel outer ip can do
fragment or not in the md mode. Without the TUNNEL_DONT_FRAGMENT
should always do fragment. So it should not care the frag_off in
inner ip.

Fixes: cfc7381b3002 ("ip_tunnel: add collect_md mode to IPIP tunnel")
Signed-off-by: wenxu <wenxu@ucloud.cn>
---
 net/ipv4/ip_tunnel.c | 3 ---
 1 file changed, 3 deletions(-)

Comments

Jakub Kicinski Oct. 23, 2020, 9:12 p.m. UTC | #1
On Wed, 21 Oct 2020 17:21:55 +0800 wenxu@ucloud.cn wrote:
> From: wenxu <wenxu@ucloud.cn>
> 
> The TUNNEL_DONT_FRAGMENT flags specific the tunnel outer ip can do
> fragment or not in the md mode. Without the TUNNEL_DONT_FRAGMENT
> should always do fragment. So it should not care the frag_off in
> inner ip.

Can you describe the use case better? My understanding is that we
should propagate DF in normally functioning networks, and let PMTU 
do its job.

> Fixes: cfc7381b3002 ("ip_tunnel: add collect_md mode to IPIP tunnel")
> Signed-off-by: wenxu <wenxu@ucloud.cn>
> ---
>  net/ipv4/ip_tunnel.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
> index 8b04d1d..ee65c92 100644
> --- a/net/ipv4/ip_tunnel.c
> +++ b/net/ipv4/ip_tunnel.c
> @@ -608,9 +608,6 @@ void ip_md_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
>  			ttl = ip4_dst_hoplimit(&rt->dst);
>  	}
>  
> -	if (!df && skb->protocol == htons(ETH_P_IP))
> -		df = inner_iph->frag_off & htons(IP_DF);
> -
>  	headroom += LL_RESERVED_SPACE(rt->dst.dev) + rt->dst.header_len;
>  	if (headroom > dev->needed_headroom)
>  		dev->needed_headroom = headroom;
wenxu Oct. 26, 2020, 8:23 a.m. UTC | #2
On 10/24/2020 5:12 AM, Jakub Kicinski wrote:
> On Wed, 21 Oct 2020 17:21:55 +0800 wenxu@ucloud.cn wrote:
>> From: wenxu <wenxu@ucloud.cn>
>>
>> The TUNNEL_DONT_FRAGMENT flags specific the tunnel outer ip can do
>> fragment or not in the md mode. Without the TUNNEL_DONT_FRAGMENT
>> should always do fragment. So it should not care the frag_off in
>> inner ip.
> Can you describe the use case better? My understanding is that we
> should propagate DF in normally functioning networks, and let PMTU 
> do its job.

Sorry for relying so late.  ip_md_tunnel_xmit send packet in the collect_md mode.

For OpenVswitch example, ovs set the gre port with flags df_default=false which will not

set TUNNEL_DONT_FRAGMENT for tun_flags.

And the mtu of virtual machine is 1500 with default. And the tunnel underlay device mtu

is 1500 default too. So if the size of packet send from vm +  underlay length > underlay device mtu.

The packet always be dropped if the ip header of  packet set flags with DF.

In the collect_md the outer packet can fragment or not should depends on the tun_flags but not inner

ip header like vxlan device did.

>> Fixes: cfc7381b3002 ("ip_tunnel: add collect_md mode to IPIP tunnel")
>> Signed-off-by: wenxu <wenxu@ucloud.cn>
>> ---
>>  net/ipv4/ip_tunnel.c | 3 ---
>>  1 file changed, 3 deletions(-)
>>
>> diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
>> index 8b04d1d..ee65c92 100644
>> --- a/net/ipv4/ip_tunnel.c
>> +++ b/net/ipv4/ip_tunnel.c
>> @@ -608,9 +608,6 @@ void ip_md_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
>>  			ttl = ip4_dst_hoplimit(&rt->dst);
>>  	}
>>  
>> -	if (!df && skb->protocol == htons(ETH_P_IP))
>> -		df = inner_iph->frag_off & htons(IP_DF);
>> -
>>  	headroom += LL_RESERVED_SPACE(rt->dst.dev) + rt->dst.header_len;
>>  	if (headroom > dev->needed_headroom)
>>  		dev->needed_headroom = headroom;
>
Jakub Kicinski Oct. 26, 2020, 8:56 p.m. UTC | #3
On Mon, 26 Oct 2020 16:23:29 +0800 wenxu wrote:
> On 10/24/2020 5:12 AM, Jakub Kicinski wrote:
> > On Wed, 21 Oct 2020 17:21:55 +0800 wenxu@ucloud.cn wrote:  
> >> From: wenxu <wenxu@ucloud.cn>
> >>
> >> The TUNNEL_DONT_FRAGMENT flags specific the tunnel outer ip can do
> >> fragment or not in the md mode. Without the TUNNEL_DONT_FRAGMENT
> >> should always do fragment. So it should not care the frag_off in
> >> inner ip.  
> > Can you describe the use case better? My understanding is that we
> > should propagate DF in normally functioning networks, and let PMTU 
> > do its job.  
> 
> Sorry for relying so late.  ip_md_tunnel_xmit send packet in the collect_md mode.
> 
> For OpenVswitch example, ovs set the gre port with flags df_default=false which will not
> 
> set TUNNEL_DONT_FRAGMENT for tun_flags.
> 
> And the mtu of virtual machine is 1500 with default. And the tunnel underlay device mtu
> 
> is 1500 default too. So if the size of packet send from vm +  underlay length > underlay device mtu.
> 
> The packet always be dropped if the ip header of  packet set flags with DF.
> 
> In the collect_md the outer packet can fragment or not should depends on the tun_flags but not inner
> 
> ip header like vxlan device did.

Is this another incarnation of 4cb47a8644cc ("tunnels: PMTU discovery
support for directly bridged IP packets")? Sounds like non-UDP tunnels
need the same treatment to make PMTUD work.

RFC2003 seems to clearly forbid ignoring the inner DF:

      Identification, Flags, Fragment Offset

         These three fields are set as specified in [10].  However, if
         the "Don't Fragment" bit is set in the inner IP header, it MUST
         be set in the outer IP header; if the "Don't Fragment" bit is
         not set in the inner IP header, it MAY be set in the outer IP
         header, as described in Section 5.1.

and:

   [..] In
   particular, use of IP options is allowed, and use of fragmentation is
   allowed unless the "Don't Fragment" bit is set in the inner IP
   header.  This restriction on fragmentation is required so that nodes
   employing Path MTU Discovery [7] can obtain the information they
   seek.
David Ahern Oct. 27, 2020, 2:51 p.m. UTC | #4
On 10/26/20 2:56 PM, Jakub Kicinski wrote:
> On Mon, 26 Oct 2020 16:23:29 +0800 wenxu wrote:
>> On 10/24/2020 5:12 AM, Jakub Kicinski wrote:
>>> On Wed, 21 Oct 2020 17:21:55 +0800 wenxu@ucloud.cn wrote:  
>>>> From: wenxu <wenxu@ucloud.cn>
>>>>
>>>> The TUNNEL_DONT_FRAGMENT flags specific the tunnel outer ip can do
>>>> fragment or not in the md mode. Without the TUNNEL_DONT_FRAGMENT
>>>> should always do fragment. So it should not care the frag_off in
>>>> inner ip.  
>>> Can you describe the use case better? My understanding is that we
>>> should propagate DF in normally functioning networks, and let PMTU 
>>> do its job.  
>>
>> Sorry for relying so late.  ip_md_tunnel_xmit send packet in the collect_md mode.
>>
>> For OpenVswitch example, ovs set the gre port with flags df_default=false which will not
>>
>> set TUNNEL_DONT_FRAGMENT for tun_flags.
>>
>> And the mtu of virtual machine is 1500 with default. And the tunnel underlay device mtu
>>
>> is 1500 default too. So if the size of packet send from vm +  underlay length > underlay device mtu.
>>
>> The packet always be dropped if the ip header of  packet set flags with DF.
>>
>> In the collect_md the outer packet can fragment or not should depends on the tun_flags but not inner
>>
>> ip header like vxlan device did.
> 
> Is this another incarnation of 4cb47a8644cc ("tunnels: PMTU discovery
> support for directly bridged IP packets")? Sounds like non-UDP tunnels
> need the same treatment to make PMTUD work.
> 
> RFC2003 seems to clearly forbid ignoring the inner DF:

I was looking at this patch Sunday night. To me it seems odd that
packets flowing through the overlay affect decisions in the underlay
which meant I agree with the proposed change.

ip_md_tunnel_xmit is inconsistent right now. tnl_update_pmtu is called
based on the TUNNEL_DONT_FRAGMENT flag, so why let it be changed later
based on the inner header? Or, if you agree with RFC 2003 and the DF
should be propagated outer to inner, then it seems like the df reset
needs to be moved up before the call to tnl_update_pmtu
Jakub Kicinski Oct. 27, 2020, 3:55 p.m. UTC | #5
On Tue, 27 Oct 2020 08:51:07 -0600 David Ahern wrote:
> > Is this another incarnation of 4cb47a8644cc ("tunnels: PMTU discovery
> > support for directly bridged IP packets")? Sounds like non-UDP tunnels
> > need the same treatment to make PMTUD work.
> > 
> > RFC2003 seems to clearly forbid ignoring the inner DF:  
> 
> I was looking at this patch Sunday night. To me it seems odd that
> packets flowing through the overlay affect decisions in the underlay
> which meant I agree with the proposed change.

The RFC was probably written before we invented terms like underlay 
and overlay, and still considered tunneling to be an inefficient hack ;)

> ip_md_tunnel_xmit is inconsistent right now. tnl_update_pmtu is called
> based on the TUNNEL_DONT_FRAGMENT flag, so why let it be changed later
> based on the inner header? Or, if you agree with RFC 2003 and the DF
> should be propagated outer to inner, then it seems like the df reset
> needs to be moved up before the call to tnl_update_pmtu

Looks like TUNNEL_DONT_FRAGMENT is intended to switch between using
PMTU inside the tunnel or just the tunnel dev MTU. ICMP PTB is still
generated based on the inner headers.

We should be okay to add something like IFLA_GRE_IGNORE_DF to lwt, 
but IMHO the default should not be violating the RFC.
wenxu Oct. 29, 2020, 2:30 a.m. UTC | #6
On 10/27/2020 11:55 PM, Jakub Kicinski wrote:
> On Tue, 27 Oct 2020 08:51:07 -0600 David Ahern wrote:
>>> Is this another incarnation of 4cb47a8644cc ("tunnels: PMTU discovery
>>> support for directly bridged IP packets")? Sounds like non-UDP tunnels
>>> need the same treatment to make PMTUD work.
>>>
>>> RFC2003 seems to clearly forbid ignoring the inner DF:  
>> I was looking at this patch Sunday night. To me it seems odd that
>> packets flowing through the overlay affect decisions in the underlay
>> which meant I agree with the proposed change.
> The RFC was probably written before we invented terms like underlay 
> and overlay, and still considered tunneling to be an inefficient hack ;)
>
>> ip_md_tunnel_xmit is inconsistent right now. tnl_update_pmtu is called
>> based on the TUNNEL_DONT_FRAGMENT flag, so why let it be changed later
>> based on the inner header? Or, if you agree with RFC 2003 and the DF
>> should be propagated outer to inner, then it seems like the df reset
>> needs to be moved up before the call to tnl_update_pmtu
> Looks like TUNNEL_DONT_FRAGMENT is intended to switch between using
> PMTU inside the tunnel or just the tunnel dev MTU. ICMP PTB is still
> generated based on the inner headers.
>
> We should be okay to add something like IFLA_GRE_IGNORE_DF to lwt, 
> but IMHO the default should not be violating the RFC.

If we add  TUNNEL_IGNORE_DF to lwt,  the two IGNORE_DF and DONT_FRAGMENT

flags should not coexist ?   Or DONT_FRAGMENT is prior to the IGNORE_DF?


Also there is inconsistent in the kernel for the tunnel device. For geneve and

vxlan tunnel (don't send tunnel with ip_md_tunnel_xmit) in the lwt mode set

the outer df only based  TUNNEL_DONT_FRAGMENT .

And this is also the some behavior for gre device before switching to use 
ip_md_tunnel_xmit as the following patch.

962924f ip_gre: Refactor collect metatdata mode tunnel xmit to ip_md_tunnel_xmit
Jakub Kicinski Oct. 30, 2020, 1:14 a.m. UTC | #7
On Thu, 29 Oct 2020 10:30:50 +0800 wenxu wrote:
> On 10/27/2020 11:55 PM, Jakub Kicinski wrote:
> > On Tue, 27 Oct 2020 08:51:07 -0600 David Ahern wrote:  
> >>> Is this another incarnation of 4cb47a8644cc ("tunnels: PMTU discovery
> >>> support for directly bridged IP packets")? Sounds like non-UDP tunnels
> >>> need the same treatment to make PMTUD work.
> >>>
> >>> RFC2003 seems to clearly forbid ignoring the inner DF:    
> >> I was looking at this patch Sunday night. To me it seems odd that
> >> packets flowing through the overlay affect decisions in the underlay
> >> which meant I agree with the proposed change.  
> > The RFC was probably written before we invented terms like underlay 
> > and overlay, and still considered tunneling to be an inefficient hack ;)
> >  
> >> ip_md_tunnel_xmit is inconsistent right now. tnl_update_pmtu is called
> >> based on the TUNNEL_DONT_FRAGMENT flag, so why let it be changed later
> >> based on the inner header? Or, if you agree with RFC 2003 and the DF
> >> should be propagated outer to inner, then it seems like the df reset
> >> needs to be moved up before the call to tnl_update_pmtu  
> > Looks like TUNNEL_DONT_FRAGMENT is intended to switch between using
> > PMTU inside the tunnel or just the tunnel dev MTU. ICMP PTB is still
> > generated based on the inner headers.
> >
> > We should be okay to add something like IFLA_GRE_IGNORE_DF to lwt, 
> > but IMHO the default should not be violating the RFC.  
> 
> If we add  TUNNEL_IGNORE_DF to lwt,  the two IGNORE_DF and DONT_FRAGMENT
> 
> flags should not coexist ?   Or DONT_FRAGMENT is prior to the IGNORE_DF?
> 
> 
> Also there is inconsistent in the kernel for the tunnel device. For geneve and
> 
> vxlan tunnel (don't send tunnel with ip_md_tunnel_xmit) in the lwt mode set
> 
> the outer df only based  TUNNEL_DONT_FRAGMENT .
> 
> And this is also the some behavior for gre device before switching to use 
> ip_md_tunnel_xmit as the following patch.
> 
> 962924f ip_gre: Refactor collect metatdata mode tunnel xmit to ip_md_tunnel_xmit

Ah, that's a lot more convincing, I was looking at the Fixes tag you
provided, but it seems like Fixes should really point at the commit you
mention here.

Please mention the change in GRE behavior and the discrepancy between
handling of DF by different tunnels in the commit message and repost.
diff mbox series

Patch

diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index 8b04d1d..ee65c92 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -608,9 +608,6 @@  void ip_md_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
 			ttl = ip4_dst_hoplimit(&rt->dst);
 	}
 
-	if (!df && skb->protocol == htons(ETH_P_IP))
-		df = inner_iph->frag_off & htons(IP_DF);
-
 	headroom += LL_RESERVED_SPACE(rt->dst.dev) + rt->dst.header_len;
 	if (headroom > dev->needed_headroom)
 		dev->needed_headroom = headroom;