Message ID:   20220303181607.1094358-9-eric.dumazet@gmail.com (mailing list archive)
State:        Superseded
Delegated to: Netdev Maintainers
Series:       tcp: BIG TCP implementation
On 3/3/22 11:16 AM, Eric Dumazet wrote:
> From: Coco Li <lixiaoyan@google.com>
>
> Instead of simply forcing a 0 payload_len in IPv6 header,
> implement RFC 2675 and insert a custom extension header.
>
> Note that only TCP stack is currently potentially generating
> jumbograms, and that this extension header is purely local,
> it won't be sent on a physical link.
>
> This is needed so that packet capture (tcpdump and friends)
> can properly dissect these large packets.

I am fairly certain I know how you are going to respond, but I will ask
this anyways :-) :

The networking stack as it stands today does not care that skb->len >
64kB, and nothing stops a driver from setting max gso size to be > 64kB.
Sure, packet socket apps (tcpdump) get confused, but if the h/w supports
the larger packet size it just works.

The jumbogram header is getting added at the L3/IPv6 layer and then
removed by the drivers before pushing to hardware. So the only benefit
of the push and pop of the jumbogram header is for packet sockets and
tc/ebpf programs - assuming those programs understand the header
(tcpdump (libpcap?) yes, a random packet socket program maybe not). Yes,
it is a standard header, so apps have a chance to understand the larger
packet size, but what is the likelihood that random apps or even ebpf
programs will understand it?

Alternative solutions to the packet socket problem (ebpf programs have
access to skb->len) would allow IPv4 to join the Big TCP party. I am
wondering how feasible an alternative solution is to get large packet
sizes across the board with less overhead and changes.
On Thu, 2022-03-03 at 21:33 -0700, David Ahern wrote:
> [...]
> Alternative solutions to the packet socket (ebpf programs have access to
> skb->len) problem would allow IPv4 to join the Big TCP party. I am
> wondering how feasible an alternative solution is to get large packet
> sizes across the board with less overhead and changes.

I agree that the header insertion and removal seems like a lot of extra
overhead for the sake of correctness. In the Microsoft case I am pretty
sure their LSOv2 supported both v4 and v6. I think we could do something
similar; we would just need to make certain the device supports it, and
as such maybe it would make sense to implement it as a gso type flag?

Could we handle the length field like we handle the checksum and place a
value in there that we know is wrong, but could be used to provide
additional data? Perhaps we could even use it to store the MSS in the
form of the length of the first packet, so if examined, the packet would
look like the first frame of the flow with a set of trailing data.
On Fri, Mar 4, 2022 at 7:48 AM Alexander H Duyck
<alexander.duyck@gmail.com> wrote:
> [...]
> Could we handle the length field like we handle the checksum and place
> a value in there that we know is wrong, but could be used to provide
> additional data? Perhaps we could even use it to store the MSS in the
> form of the length of the first packet so if examined, the packet would
> look like the first frame of the flow with a set of trailing data.

I am a bit sad you did not give all this feedback back in August when I
presented BIG TCP.

We did a lot of work in the last 6 months to implement and test all
this, making sure this worked.

I am not sure I want to spend another 6 months implementing what you
suggest.

For instance, the input path will not like packets larger than 64KB.
There is this thing trimming padding bytes; you probably do not want to
mess with this.
On Thu, Mar 3, 2022 at 8:33 PM David Ahern <dsahern@kernel.org> wrote:
> [...]
> The networking stack as it stands today does not care that skb->len >
> 64kB and nothing stops a driver from setting max gso size to be > 64kB.
> Sure, packet socket apps (tcpdump) get confused but if the h/w supports
> the larger packet size it just works.

Observability is key. "just works" is a bold claim.

> The jumbogram header is getting added at the L3/IPv6 layer and then
> removed by the drivers before pushing to hardware. [...] what is the
> likelihood that random apps or even ebpf programs will understand it?

Can you explain to me what you are referring to by "random apps"
exactly? TCP does not expose any individual packet length to user space.

> Alternative solutions to the packet socket (ebpf programs have access to
> skb->len) problem would allow IPv4 to join the Big TCP party. I am
> wondering how feasible an alternative solution is to get large packet
> sizes across the board with less overhead and changes.

You know, I think I already answered this question 6 months ago.

We need to carry extra metadata describing how much TCP payload is in a
packet, on both the RX and TX sides.

Adding an skb field for that was not an option for me.

Adding an 8-byte header is basically free: the headers need to be in CPU
caches anyway when the header is added/removed. This is zero cost on
current CPUs, compared to the gains.

I think you focus on the TSO side, which is only 25% of the possible
gains that BIG TCP was seeking. We covered both RX and TX with a common
mechanism.
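For reference, the 8-byte header being discussed is the RFC 2675 Jumbo
Payload option carried in a Hop-by-Hop extension header, which the patch
pushes in ip6_xmit(). A minimal standalone sketch of its wire layout
(this is a model for illustration, not kernel code; the field values
follow RFC 2675 and the patch's struct hop_jumbo_hdr):

```python
import struct

IPV6_TLV_JUMBO = 0xC2  # Jumbo Payload option type (RFC 2675)
IPPROTO_TCP = 6
NEXTHDR_HOP = 0

def build_hbh_jumbo(nexthdr: int, payload_len: int) -> bytes:
    """Build the local 8-byte HBH jumbo header pushed before the L2 driver.

    Layout: nexthdr (1) | hdrlen (1) | tlv_type (1) | tlv_len (1) | jumbo_len (4).
    hdrlen is 0 because the whole extension header is exactly 8 octets
    (RFC 8200 counts length in 8-octet units, excluding the first 8).
    jumbo_len covers the payload plus this 8-byte header itself, matching
    the patch's htonl(seg_len + hoplen).
    """
    return struct.pack("!BBBBI", nexthdr, 0, IPV6_TLV_JUMBO, 4,
                       payload_len + 8)

def parse_hbh_jumbo(hdr: bytes):
    """Recover the inner protocol and the 32-bit jumbo length."""
    nexthdr, hdrlen, tlv_type, tlv_len, jumbo_len = struct.unpack("!BBBBI", hdr)
    assert tlv_type == IPV6_TLV_JUMBO and tlv_len == 4 and hdrlen == 0
    return nexthdr, jumbo_len

if __name__ == "__main__":
    seg_len = 100_000                  # hypothetical >64KB TSO payload
    hdr = build_hbh_jumbo(IPPROTO_TCP, seg_len)
    assert len(hdr) == 8               # "adding a 8 bytes header"
    assert parse_hbh_jumbo(hdr) == (IPPROTO_TCP, seg_len + 8)
```

The point of the sketch is that all the metadata BIG TCP needs (true
payload length plus the chained inner protocol) fits in a single
cache-line-resident 8-byte push/pop.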
On Fri, 2022-03-04 at 09:09 -0800, Eric Dumazet wrote:
> [...]
> I am a bit sad you did not give all this feedback back in August when
> I presented BIG TCP.

As I recall, I was thinking along the same lines as what you have done
here, but Dave's question about including IPv4 does bring up an
interesting point. And the Microsoft version supported both.

> We did a lot of work in the last 6 months to implement, test all this,
> making sure this worked.
>
> I am not sure I want to spend another 6 months implementing what you suggest.

I am not saying we have to do this. I am simply stating a "what if" just
to gauge this approach. You could think of it as thinking out loud, but
in written form.

> For instance, input path will not like packets larger than 64KB.
>
> There is this thing trimming padding bytes, you probably do not want
> to mess with this.

I had overlooked the fact that this is being used on the input path; the
trimming would be an issue. I suppose the fact that LSOv2 didn't have an
Rx counterpart would be one reason for us not to consider the IPv4
approach.
On Fri, Mar 4, 2022 at 11:00 AM Alexander H Duyck
<alexander.duyck@gmail.com> wrote:
> [...]
> As I recall, I was thinking along the same lines as what you have done
> here, but Dave's question about including IPv4 does bring up an
> interesting point. And the Microsoft version supported both.

Yes, maybe they added metadata for that, and decided to leave packet
capture in the dark, or changed tcpdump/wireshark to fetch/use this
metadata?

This was the first thing I tried one year ago, and I eventually gave up,
because this was a no-go for us.

Then, seeing HBH Jumbo support being added recently in tcpdump, I
understood we could finally get visibility, and started BIG TCP using
this.

I guess someone might add extra logic to allow IPv4 BIG TCP if they
really need it; I will not object to it.

> I am not saying we have to do this. I am simply stating a "what if"
> just to gauge this approach. You could think of it as thinking out
> loud, but in written form.

Understood.

BTW, I spent time adding a new gso_type flag, but also gave up because
we have no more room in the features_t type.

Solving features_t exhaustion alone is a delicate topic.
On 3/4/22 10:47 AM, Eric Dumazet wrote:
> On Thu, Mar 3, 2022 at 8:33 PM David Ahern <dsahern@kernel.org> wrote:
>> [...]
>> it is a standard header so apps have a chance to understand the larger
>> packet size, but what is the likelihood that random apps or even ebpf
>> programs will understand it?
>
> Can you explain to me what you are referring to by "random apps" exactly?
> TCP does not expose to user space any individual packet length.

TCP apps are not affected; they do not have direct access to L3 headers.
This is about packet sockets and ebpf programs and their knowledge of
the HBH header. This does not seem like a widely used feature, and even
tcpdump only recently gained support for it (e.g., Ubuntu 20.04 does not
support it; 21.10 does). Given that, what are the odds most packet
programs are affected by the change? And if they need to have support,
we could just as easily add that support in a way that gets both
networking layers working.

>> Alternative solutions to the packet socket (ebpf programs have access to
>> skb->len) problem would allow IPv4 to join the Big TCP party. [...]
>
> We need to carry an extra metadata to carry how much TCP payload is in
> a packet, both on RX and TX side.
>
> Adding an skb field for that was not an option for me.

Why? skb->len is not limited to a u16. The only effect is when skb->len
is used to fill in the ipv4/ipv6 header.
On 3/4/22 12:13 PM, Eric Dumazet wrote:
>> I am not saying we have to do this. I am simply stating a "what if"
>> just to gauge this approach. You could think of it as thinking out
>> loud, but in written form.

My point as well.

> BTW I spent time adding a new gso_type flag, but also gave up because
> we have no more room in features_t type.
>
> Solving features_t exhaustion alone is a delicate topic.

>> I had overlooked the fact that this is being used on the input path,
>> the trimming would be an issue. I suppose the fact that the LSOv2
>> didn't have an Rx counterpart would be one reason for us to not
>> consider the IPv4 approach.

I'm aware of the trim on ingress; it can be properly handled. Drivers
(LRO) and the S/W GRO stack would know when they are exceeding the 64kB
length, so the skb can be marked as a large packet.

I am not trying to derail this set from getting merged. v6 has a
standard header for the large packet support, so certainly use it. That
said, it is always best in the long run for IPv4 and IPv6 to have
consistent feature support and implementation. Hence the asking about
alternative solutions that work for both.
On 3/3/22 11:16 AM, Eric Dumazet wrote:
> diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
> index 16870f86c74d3d1f5dfb7edac1e7db85f1ef6755..93b273db1c9926aba4199f486ce90778311916f5 100644
> --- a/include/linux/ipv6.h
> +++ b/include/linux/ipv6.h
> @@ -144,6 +144,7 @@ struct inet6_skb_parm {
>  #define IP6SKB_L3SLAVE    64
>  #define IP6SKB_JUMBOGRAM  128
>  #define IP6SKB_SEG6       256
> +#define IP6SKB_FAKEJUMBO  512
>  };

Why is this considered a FAKEJUMBO? The proper header is getting added,
correct?
On Sat, Mar 5, 2022 at 8:46 AM David Ahern <dsahern@kernel.org> wrote:
> [...]
> > Adding an skb field for that was not an option for me.
>
> Why? skb->len is not limited to a u16. The only effect is when skb->len
> is used to fill in the ipv4/ipv6 header.

Seriously?

Have you looked recently at the core networking stack, and have you read
my netdev presentation?

The core networking stack will trim your packets' skb->len based on what
is found in the network header, which is a 16-bit field, unless you use
HBH.

Look at ip6_rcv_core(). Do you want to modify it? Let us know how
exactly, and why it is not going to break things.

	pkt_len = ntohs(hdr->payload_len);

	/* pkt_len may be zero if Jumbo payload option is present */
	if (pkt_len || hdr->nexthdr != NEXTHDR_HOP) {
		if (pkt_len + sizeof(struct ipv6hdr) > skb->len) {
			__IP6_INC_STATS(net, idev,
					IPSTATS_MIB_INTRUNCATEDPKTS);
			goto drop;
		}
		if (pskb_trim_rcsum(skb, pkt_len + sizeof(struct ipv6hdr))) {
			__IP6_INC_STATS(net, idev,
					IPSTATS_MIB_INHDRERRORS);
			goto drop;
		}
		hdr = ipv6_hdr(skb);
	}

	if (hdr->nexthdr == NEXTHDR_HOP) {
		if (ipv6_parse_hopopts(skb) < 0) {
			__IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS);
			rcu_read_unlock();
			return NULL;
		}
	}
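The pkt_len == 0 / NEXTHDR_HOP escape hatch in the ip6_rcv_core()
excerpt is the crux of Eric's point: without it, the 16-bit payload_len
caps what survives pskb_trim_rcsum(). A rough standalone model of that
branch (function name and return convention are illustrative, not the
kernel's; None stands in for the drop paths):

```python
NEXTHDR_HOP = 0
IPV6HDR_LEN = 40  # sizeof(struct ipv6hdr)

def rcv_trim_len(payload_len_field, nexthdr, skb_len, jumbo_len=None):
    """Model of the ip6_rcv_core() length check quoted above.

    Returns the length the packet keeps after the check, or None for a
    drop. A zero payload_len with a leading Hop-by-Hop header defers to
    the Jumbo Payload option, which is how >64KB packets survive this
    path; any nonzero 16-bit value would trim the packet to <= 64KB + 40.
    """
    if payload_len_field or nexthdr != NEXTHDR_HOP:
        if payload_len_field + IPV6HDR_LEN > skb_len:
            return None                          # IPSTATS_MIB_INTRUNCATEDPKTS
        return payload_len_field + IPV6HDR_LEN   # pskb_trim_rcsum() target
    # payload_len == 0 and HBH first: ipv6_parse_hopopts() reads the
    # 32-bit jumbo length instead of trimming.
    return (jumbo_len or 0) + IPV6HDR_LEN

# Ordinary MTU-sized packet: trimmed to the 16-bit length as usual.
assert rcv_trim_len(1460, 6, 1500) == 1500
# A 100KB packet with payload_len forced to 0 + HBH keeps its full length;
# there is no 16-bit field a larger skb->len could honestly fit into.
assert rcv_trim_len(0, NEXTHDR_HOP, 100_040, jumbo_len=100_000) == 100_040
```

This is why "just make skb->len bigger" is not enough on RX: the trim is
driven by the on-wire 16-bit field unless the jumbo option supplies a
32-bit one.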
On 3/5/22 11:08 AM, Eric Dumazet wrote:
>> Why? skb->len is not limited to a u16. The only effect is when skb->len
>> is used to fill in the ipv4/ipv6 header.
>
> Seriously?
>
> Have you looked recently at the core networking stack, and have you read
> my netdev presentation?

Yes, I have.

> The core networking stack will trim your packets' skb->len based on
> what is found in the network header,

The core networking stack is s/w under our control; much more
complicated and intricate changes have been made.
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index 16870f86c74d3d1f5dfb7edac1e7db85f1ef6755..93b273db1c9926aba4199f486ce90778311916f5 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -144,6 +144,7 @@ struct inet6_skb_parm {
 #define IP6SKB_L3SLAVE    64
 #define IP6SKB_JUMBOGRAM  128
 #define IP6SKB_SEG6       256
+#define IP6SKB_FAKEJUMBO  512
 };
 
 #if defined(CONFIG_NET_L3_MASTER_DEV)
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 50db9b20d746bc59c7ef7114492db8b9585c575b..38a8e1c9894cd99ecbec5968fcc97549ea0c7508 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -180,7 +180,9 @@ static int __ip6_finish_output(struct net *net, struct sock *sk, struct sk_buff
 #endif
 
 	mtu = ip6_skb_dst_mtu(skb);
-	if (skb_is_gso(skb) && !skb_gso_validate_network_len(skb, mtu))
+	if (skb_is_gso(skb) &&
+	    !(IP6CB(skb)->flags & IP6SKB_FAKEJUMBO) &&
+	    !skb_gso_validate_network_len(skb, mtu))
 		return ip6_finish_output_gso_slowpath_drop(net, sk, skb, mtu);
 
 	if ((skb->len > mtu && !skb_is_gso(skb)) ||
@@ -251,6 +253,8 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
 	struct dst_entry *dst = skb_dst(skb);
 	struct net_device *dev = dst->dev;
 	struct inet6_dev *idev = ip6_dst_idev(dst);
+	struct hop_jumbo_hdr *hop_jumbo;
+	int hoplen = sizeof(*hop_jumbo);
 	unsigned int head_room;
 	struct ipv6hdr *hdr;
 	u8 proto = fl6->flowi6_proto;
@@ -258,7 +262,7 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
 	int hlimit = -1;
 	u32 mtu;
 
-	head_room = sizeof(struct ipv6hdr) + LL_RESERVED_SPACE(dev);
+	head_room = sizeof(struct ipv6hdr) + hoplen + LL_RESERVED_SPACE(dev);
 	if (opt)
 		head_room += opt->opt_nflen + opt->opt_flen;
 
@@ -281,6 +285,20 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
 			    &fl6->saddr);
 	}
 
+	if (unlikely(seg_len > IPV6_MAXPLEN)) {
+		hop_jumbo = skb_push(skb, hoplen);
+
+		hop_jumbo->nexthdr = proto;
+		hop_jumbo->hdrlen = 0;
+		hop_jumbo->tlv_type = IPV6_TLV_JUMBO;
+		hop_jumbo->tlv_len = 4;
+		hop_jumbo->jumbo_payload_len = htonl(seg_len + hoplen);
+
+		proto = IPPROTO_HOPOPTS;
+		seg_len = 0;
+		IP6CB(skb)->flags |= IP6SKB_FAKEJUMBO;
+	}
+
 	skb_push(skb, sizeof(struct ipv6hdr));
 	skb_reset_network_header(skb);
 	hdr = ipv6_hdr(skb);
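Condensing the ip6_xmit() hunk above: only payloads that overflow the
16-bit payload_len get the extra Hop-by-Hop header; smaller packets are
untouched. A hedged standalone model of that decision (the dict fields
are illustrative stand-ins for the header fields the kernel writes, not
a real API):

```python
IPV6_MAXPLEN = 65535
NEXTHDR_HOP = 0       # IPPROTO_HOPOPTS
IPPROTO_TCP = 6
HOPLEN = 8            # sizeof(struct hop_jumbo_hdr)

def maybe_push_jumbo(seg_len, proto):
    """Model of the ip6_xmit() hunk: decide what goes into the IPv6 header.

    For a jumbogram, payload_len is forced to 0, the inner protocol is
    chained behind the HBH header, the 32-bit jumbo length covers the
    payload plus the 8-byte option, and the skb is flagged FAKEJUMBO so
    __ip6_finish_output() skips the GSO-vs-MTU validation.
    """
    if seg_len > IPV6_MAXPLEN:
        return {"payload_len": 0,            # seg_len = 0 in the patch
                "nexthdr": NEXTHDR_HOP,      # proto = IPPROTO_HOPOPTS
                "jumbo_len": seg_len + HOPLEN,
                "inner_proto": proto,
                "fakejumbo": True}
    return {"payload_len": seg_len, "nexthdr": proto, "fakejumbo": False}

# Ordinary packet: nothing changes.
assert maybe_push_jumbo(1400, IPPROTO_TCP)["payload_len"] == 1400
# BIG TCP packet: zero payload_len, HBH first, 32-bit length carries truth.
big = maybe_push_jumbo(100_000, IPPROTO_TCP)
assert big["payload_len"] == 0 and big["jumbo_len"] == 100_008
assert big["nexthdr"] == NEXTHDR_HOP and big["fakejumbo"]
```

The FAKEJUMBO flag also answers David's naming question implicitly: the
header is a well-formed RFC 2675 option, but it is "fake" in the sense
that it exists only inside the stack and is popped before hitting the
wire.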