
[v2,net-next,08/14] ipv6: Add hop-by-hop header to jumbograms in ip6_output

Message ID 20220303181607.1094358-9-eric.dumazet@gmail.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Series tcp: BIG TCP implementation

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 2109 this patch: 2109
netdev/cc_maintainers warning 1 maintainers not CCed: yoshfuji@linux-ipv6.org
netdev/build_clang success Errors and warnings before: 309 this patch: 309
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 2237 this patch: 2237
netdev/checkpatch warning WARNING: 'wont' may be misspelled - perhaps 'won't'?
netdev/kdoc success Errors and warnings before: 37 this patch: 37
netdev/source_inline success Was 0 now: 0

Commit Message

Eric Dumazet March 3, 2022, 6:16 p.m. UTC
From: Coco Li <lixiaoyan@google.com>

Instead of simply forcing a 0 payload_len in the IPv6 header,
implement RFC 2675 and insert a Hop-by-Hop extension header
carrying the Jumbo Payload option.

Note that only the TCP stack can currently generate jumbograms,
and that this extension header is purely local; it won't be
sent on a physical link.

This is needed so that packet capture (tcpdump and friends)
can properly dissect these large packets.

Signed-off-by: Coco Li <lixiaoyan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/ipv6.h  |  1 +
 net/ipv6/ip6_output.c | 22 ++++++++++++++++++++--
 2 files changed, 21 insertions(+), 2 deletions(-)
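
For reference, the extension header inserted by this patch is the 8-byte
Hop-by-Hop header carrying the RFC 2675 Jumbo Payload option. A sketch of
its layout, using the field names that appear in the ip6_xmit() hunk below
(the actual struct hop_jumbo_hdr definition is added by an earlier patch in
this series):

struct hop_jumbo_hdr {
        u8      nexthdr;                /* protocol following the HBH header, e.g. IPPROTO_TCP */
        u8      hdrlen;                 /* HBH length in 8-byte units beyond the first 8: here 0 */
        u8      tlv_type;               /* IPV6_TLV_JUMBO (0xc2) */
        u8      tlv_len;                /* option data length: 4 */
        __be32  jumbo_payload_len;      /* real payload length, may exceed 65535 */
};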

Comments

David Ahern March 4, 2022, 4:33 a.m. UTC | #1
On 3/3/22 11:16 AM, Eric Dumazet wrote:
> From: Coco Li <lixiaoyan@google.com>
> 
> Instead of simply forcing a 0 payload_len in IPv6 header,
> implement RFC 2675 and insert a custom extension header.
> 
> Note that only TCP stack is currently potentially generating
> jumbograms, and that this extension header is purely local,
> it wont be sent on a physical link.
> 
> This is needed so that packet capture (tcpdump and friends)
> can properly dissect these large packets.
> 


I am fairly certain I know how you are going to respond, but I will ask
this anyways :-) :

The networking stack as it stands today does not care that skb->len >
64kB and nothing stops a driver from setting max gso size to be > 64kB.
Sure, packet socket apps (tcpdump) get confused but if the h/w supports
the larger packet size it just works.

The jumbogram header is getting added at the L3/IPv6 layer and then
removed by the drivers before pushing to hardware. So, the only benefit
of the push and pop of the jumbogram header is for packet sockets and
tc/ebpf programs - assuming those programs understand the header
(tcpdump (libpcap?) yes, random packet socket program maybe not). Yes,
it is a standard header so apps have a chance to understand the larger
packet size, but what is the likelihood that random apps or even ebpf
programs will understand it?

Alternative solutions to the packet socket (ebpf programs have access to
skb->len) problem would allow IPv4 to join the Big TCP party. I am
wondering how feasible an alternative solution is to get large packet
sizes across the board with less overhead and fewer changes.
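
To make the dissection question concrete: a packet-socket or eBPF consumer
only needs to recognize one option to recover the real length. A rough
userspace sketch (illustrative only, and assuming the Jumbo option is the
first TLV of the Hop-by-Hop header, which is how this patch emits it):

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/ip6.h>

/* Return the IPv6 payload length of a captured frame, honoring the
 * RFC 2675 Jumbo Payload option when payload_len is zero.
 */
static uint32_t ipv6_real_payload_len(const uint8_t *pkt, size_t caplen)
{
        const struct ip6_hdr *ip6 = (const struct ip6_hdr *)pkt;
        uint32_t jumbo;
        uint16_t plen;

        if (caplen < sizeof(*ip6))
                return 0;
        plen = ntohs(ip6->ip6_plen);
        if (plen || ip6->ip6_nxt != 0 /* NEXTHDR_HOP */)
                return plen;
        /* The 8-byte HBH header follows; offset 2 holds the TLV type (0xc2 = Jumbo). */
        if (caplen < sizeof(*ip6) + 8 || pkt[sizeof(*ip6) + 2] != 0xc2)
                return 0;
        memcpy(&jumbo, pkt + sizeof(*ip6) + 4, sizeof(jumbo));
        return ntohl(jumbo);
}

Recent tcpdump releases dissect this option, which is the support mentioned
later in the thread.
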
Alexander Duyck March 4, 2022, 3:48 p.m. UTC | #2
On Thu, 2022-03-03 at 21:33 -0700, David Ahern wrote:
> On 3/3/22 11:16 AM, Eric Dumazet wrote:
> > From: Coco Li <lixiaoyan@google.com>
> > 
> > Instead of simply forcing a 0 payload_len in IPv6 header,
> > implement RFC 2675 and insert a custom extension header.
> > 
> > Note that only TCP stack is currently potentially generating
> > jumbograms, and that this extension header is purely local,
> > it wont be sent on a physical link.
> > 
> > This is needed so that packet capture (tcpdump and friends)
> > can properly dissect these large packets.
> > 
> 
> 
> I am fairly certain I know how you are going to respond, but I will ask
> this anyways :-) :
> 
> The networking stack as it stands today does not care that skb->len >
> 64kB and nothing stops a driver from setting max gso size to be > 64kB.
> Sure, packet socket apps (tcpdump) get confused but if the h/w supports
> the larger packet size it just works.
> 
> The jumbogram header is getting adding at the L3/IPv6 layer and then
> removed by the drivers before pushing to hardware. So, the only benefit
> of the push and pop of the jumbogram header is for packet sockets and
> tc/ebpf programs - assuming those programs understand the header
> (tcpdump (libpcap?) yes, random packet socket program maybe not). Yes,
> it is a standard header so apps have a chance to understand the larger
> packet size, but what is the likelihood that random apps or even ebpf
> programs will understand it?
> 
> Alternative solutions to the packet socket (ebpf programs have access to
> skb->len) problem would allow IPv4 to join the Big TCP party. I am
> wondering how feasible an alternative solution is to get large packet
> sizes across the board with less overhead and changes.

I agree that the header insertion and removal seem like a lot of extra
overhead for the sake of correctness. In the Microsoft case I am pretty
sure their LSOv2 supported both v4 and v6. I think we could do
something similar; we would just need to make certain the device
supports it, so maybe it would make sense to implement it as a
gso type flag?

Could we handle the length field like we handle the checksum and place
a value in there that we know is wrong, but could be used to provide
additional data? Perhaps we could even use it to store the MSS in the
form of the length of the first packet so if examined, the packet would
look like the first frame of the flow with a set of trailing data.
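
As a purely hypothetical sketch of that suggestion (not something this
series implements, and using made-up helper names): the transmit path could
leave the first segment's length in payload_len instead of zero, analogous
to how checksum offload leaves a pseudo-header checksum in the checksum
field:

#include <net/ipv6.h>
#include <net/tcp.h>

/* Hypothetical alternative to the HBH option: encode the length of the
 * first on-wire segment in payload_len, so a capture of the super-packet
 * still resembles a valid first frame followed by trailing data.
 */
static void set_first_segment_payload_len(struct sk_buff *skb)
{
        struct ipv6hdr *hdr = ipv6_hdr(skb);
        unsigned int seg = skb_shinfo(skb)->gso_size + tcp_hdrlen(skb);

        hdr->payload_len = htons(min_t(unsigned int, seg, IPV6_MAXPLEN));
}

Eric's replies below point out why this does not fly on the receive side,
where the stack trims skb->len down to the advertised payload_len.
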
Eric Dumazet March 4, 2022, 5:09 p.m. UTC | #3
On Fri, Mar 4, 2022 at 7:48 AM Alexander H Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Thu, 2022-03-03 at 21:33 -0700, David Ahern wrote:
> > On 3/3/22 11:16 AM, Eric Dumazet wrote:
> > > From: Coco Li <lixiaoyan@google.com>
> > >
> > > Instead of simply forcing a 0 payload_len in IPv6 header,
> > > implement RFC 2675 and insert a custom extension header.
> > >
> > > Note that only TCP stack is currently potentially generating
> > > jumbograms, and that this extension header is purely local,
> > > it wont be sent on a physical link.
> > >
> > > This is needed so that packet capture (tcpdump and friends)
> > > can properly dissect these large packets.
> > >
> >
> >
> > I am fairly certain I know how you are going to respond, but I will ask
> > this anyways :-) :
> >
> > The networking stack as it stands today does not care that skb->len >
> > 64kB and nothing stops a driver from setting max gso size to be > 64kB.
> > Sure, packet socket apps (tcpdump) get confused but if the h/w supports
> > the larger packet size it just works.
> >
> > The jumbogram header is getting adding at the L3/IPv6 layer and then
> > removed by the drivers before pushing to hardware. So, the only benefit
> > of the push and pop of the jumbogram header is for packet sockets and
> > tc/ebpf programs - assuming those programs understand the header
> > (tcpdump (libpcap?) yes, random packet socket program maybe not). Yes,
> > it is a standard header so apps have a chance to understand the larger
> > packet size, but what is the likelihood that random apps or even ebpf
> > programs will understand it?
> >
> > Alternative solutions to the packet socket (ebpf programs have access to
> > skb->len) problem would allow IPv4 to join the Big TCP party. I am
> > wondering how feasible an alternative solution is to get large packet
> > sizes across the board with less overhead and changes.
>
> I agree that the header insertion and removal seems like a lot of extra
> overhead for the sake of correctness. In the Microsoft case I am pretty
> sure their LSOv2 supported both v4 and v6. I think we could do
> something similar, we would just need to make certain the device
> supports it and as such maybe it would make sense to implement it as a
> gso type flag?
>
> Could we handle the length field like we handle the checksum and place
> a value in there that we know is wrong, but could be used to provide
> additional data? Perhaps we could even use it to store the MSS in the
> form of the length of the first packet so if examined, the packet would
> look like the first frame of the flow with a set of trailing data.
>

I am a bit sad you did not give all this feedback back in August when
I presented BIG TCP.

We did a lot of work over the last 6 months to implement and test all this,
making sure it worked.

I am not sure I want to spend another 6 months implementing what you suggest.

For instance, the input path will not like packets larger than 64KB.

There is this thing trimming padding bytes; you probably do not want
to mess with it.
Eric Dumazet March 4, 2022, 5:47 p.m. UTC | #4
On Thu, Mar 3, 2022 at 8:33 PM David Ahern <dsahern@kernel.org> wrote:
>
> On 3/3/22 11:16 AM, Eric Dumazet wrote:
> > From: Coco Li <lixiaoyan@google.com>
> >
> > Instead of simply forcing a 0 payload_len in IPv6 header,
> > implement RFC 2675 and insert a custom extension header.
> >
> > Note that only TCP stack is currently potentially generating
> > jumbograms, and that this extension header is purely local,
> > it wont be sent on a physical link.
> >
> > This is needed so that packet capture (tcpdump and friends)
> > can properly dissect these large packets.
> >
>
>
> I am fairly certain I know how you are going to respond, but I will ask
> this anyways :-) :
>
> The networking stack as it stands today does not care that skb->len >
> 64kB and nothing stops a driver from setting max gso size to be > 64kB.
> Sure, packet socket apps (tcpdump) get confused but if the h/w supports
> the larger packet size it just works.

Observability is key. "just works" is a bold claim.

>
> The jumbogram header is getting adding at the L3/IPv6 layer and then
> removed by the drivers before pushing to hardware. So, the only benefit
> of the push and pop of the jumbogram header is for packet sockets and
> tc/ebpf programs - assuming those programs understand the header
> (tcpdump (libpcap?) yes, random packet socket program maybe not). Yes,
> it is a standard header so apps have a chance to understand the larger
> packet size, but what is the likelihood that random apps or even ebpf
> programs will understand it?

Can you explain to me what you are referring to by "random apps" exactly?
TCP does not expose any individual packet length to user space.



>
> Alternative solutions to the packet socket (ebpf programs have access to
> skb->len) problem would allow IPv4 to join the Big TCP party. I am
> wondering how feasible an alternative solution is to get large packet
> sizes across the board with less overhead and changes.

You know, I think I already answered this question 6 months ago.

We need to carry extra metadata conveying how much TCP payload is in a packet,
on both the RX and TX sides.

Adding an skb field for that was not an option for me.

Adding an 8-byte header is basically free: the headers need to be in cpu caches
when the header is added/removed.

This is zero cost on current cpus, compared to the gains.

I think you are focusing on the TSO side, which is only 25% of the possible
gains BIG TCP was seeking.

We covered both RX and TX with a common mechanism.
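
For a sense of what the "pop" costs on transmit, here is a rough sketch of
the removal a driver would perform before handing the frame to hardware
(illustrative only; the helper name and details are not taken from this
series, and struct hop_jumbo_hdr is the 8-byte header shown earlier):

#include <linux/errno.h>
#include <linux/skbuff.h>
#include <linux/string.h>
#include <net/ipv6.h>

/* Strip the locally inserted 8-byte HBH Jumbo header before the packet
 * reaches the NIC; both the MAC and IPv6 headers are already hot in the
 * cpu cache at this point, which is why the operation is cheap.
 */
static int strip_local_jumbo_hbh(struct sk_buff *skb)
{
        const int hoplen = sizeof(struct hop_jumbo_hdr);
        struct ipv6hdr *h6 = ipv6_hdr(skb);
        const struct hop_jumbo_hdr *jumbo;
        u8 nexthdr;

        if (h6->nexthdr != NEXTHDR_HOP)
                return 0;

        jumbo = (const struct hop_jumbo_hdr *)(h6 + 1);
        if (jumbo->tlv_type != IPV6_TLV_JUMBO || jumbo->hdrlen != 0)
                return -EINVAL;
        nexthdr = jumbo->nexthdr;

        if (skb_cow_head(skb, 0))
                return -ENOMEM;

        /* Layout: [MAC][IPv6][HBH jumbo][TCP]; slide MAC + IPv6 forward
         * by 8 bytes over the option, then adjust the header offsets.
         */
        memmove(skb_mac_header(skb) + hoplen, skb_mac_header(skb),
                skb_network_header(skb) - skb_mac_header(skb) +
                sizeof(struct ipv6hdr));
        __skb_pull(skb, hoplen);
        skb->mac_header += hoplen;
        skb->network_header += hoplen;

        h6 = ipv6_hdr(skb);
        h6->nexthdr = nexthdr;
        /* payload_len stays 0; the NIC writes real lengths per segment. */
        return 0;
}
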
Alexander Duyck March 4, 2022, 7 p.m. UTC | #5
On Fri, 2022-03-04 at 09:09 -0800, Eric Dumazet wrote:
> On Fri, Mar 4, 2022 at 7:48 AM Alexander H Duyck
> <alexander.duyck@gmail.com> wrote:
> > 
> > On Thu, 2022-03-03 at 21:33 -0700, David Ahern wrote:
> > > On 3/3/22 11:16 AM, Eric Dumazet wrote:
> > > > From: Coco Li <lixiaoyan@google.com>
> > > > 
> > > > Instead of simply forcing a 0 payload_len in IPv6 header,
> > > > implement RFC 2675 and insert a custom extension header.
> > > > 
> > > > Note that only TCP stack is currently potentially generating
> > > > jumbograms, and that this extension header is purely local,
> > > > it wont be sent on a physical link.
> > > > 
> > > > This is needed so that packet capture (tcpdump and friends)
> > > > can properly dissect these large packets.
> > > > 
> > > 
> > > 
> > > I am fairly certain I know how you are going to respond, but I will ask
> > > this anyways :-) :
> > > 
> > > The networking stack as it stands today does not care that skb->len >
> > > 64kB and nothing stops a driver from setting max gso size to be > 64kB.
> > > Sure, packet socket apps (tcpdump) get confused but if the h/w supports
> > > the larger packet size it just works.
> > > 
> > > The jumbogram header is getting adding at the L3/IPv6 layer and then
> > > removed by the drivers before pushing to hardware. So, the only benefit
> > > of the push and pop of the jumbogram header is for packet sockets and
> > > tc/ebpf programs - assuming those programs understand the header
> > > (tcpdump (libpcap?) yes, random packet socket program maybe not). Yes,
> > > it is a standard header so apps have a chance to understand the larger
> > > packet size, but what is the likelihood that random apps or even ebpf
> > > programs will understand it?
> > > 
> > > Alternative solutions to the packet socket (ebpf programs have access to
> > > skb->len) problem would allow IPv4 to join the Big TCP party. I am
> > > wondering how feasible an alternative solution is to get large packet
> > > sizes across the board with less overhead and changes.
> > 
> > I agree that the header insertion and removal seems like a lot of extra
> > overhead for the sake of correctness. In the Microsoft case I am pretty
> > sure their LSOv2 supported both v4 and v6. I think we could do
> > something similar, we would just need to make certain the device
> > supports it and as such maybe it would make sense to implement it as a
> > gso type flag?
> > 
> > Could we handle the length field like we handle the checksum and place
> > a value in there that we know is wrong, but could be used to provide
> > additional data? Perhaps we could even use it to store the MSS in the
> > form of the length of the first packet so if examined, the packet would
> > look like the first frame of the flow with a set of trailing data.
> > 
> 
> I am a bit sad you did not give all this feedback back in August when
> I presented BIG TCP.
> 

As I recall, I was thinking along the same lines as what you have done
here, but Dave's question about including IPv4 does bring up an
interesting point. And the Microsoft version supported both.

> We did a lot of work in the last 6 months to implement, test all this,
> making sure this worked.
> 
> I am not sure I want to spend another 6 months implementing what you suggest.

I am not saying we have to do this. I am simply stating a "what if"
just to gauge this approach. You could think of it as thinking out
loud, but in written form.

> For instance, input path will not like packets larger than 64KB.
> 
> There is this thing trimming padding bytes, you probably do not want
> to mess with this.

I had overlooked the fact that this is being used on the input path;
the trimming would be an issue. I suppose the fact that the LSOv2
didn't have an Rx counterpart would be one reason for us to not
consider the IPv4 approach.
Eric Dumazet March 4, 2022, 7:13 p.m. UTC | #6
On Fri, Mar 4, 2022 at 11:00 AM Alexander H Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Fri, 2022-03-04 at 09:09 -0800, Eric Dumazet wrote:
> > On Fri, Mar 4, 2022 at 7:48 AM Alexander H Duyck
> > <alexander.duyck@gmail.com> wrote:
> > >
> > > On Thu, 2022-03-03 at 21:33 -0700, David Ahern wrote:
> > > > On 3/3/22 11:16 AM, Eric Dumazet wrote:
> > > > > From: Coco Li <lixiaoyan@google.com>
> > > > >
> > > > > Instead of simply forcing a 0 payload_len in IPv6 header,
> > > > > implement RFC 2675 and insert a custom extension header.
> > > > >
> > > > > Note that only TCP stack is currently potentially generating
> > > > > jumbograms, and that this extension header is purely local,
> > > > > it wont be sent on a physical link.
> > > > >
> > > > > This is needed so that packet capture (tcpdump and friends)
> > > > > can properly dissect these large packets.
> > > > >
> > > >
> > > >
> > > > I am fairly certain I know how you are going to respond, but I will ask
> > > > this anyways :-) :
> > > >
> > > > The networking stack as it stands today does not care that skb->len >
> > > > 64kB and nothing stops a driver from setting max gso size to be > 64kB.
> > > > Sure, packet socket apps (tcpdump) get confused but if the h/w supports
> > > > the larger packet size it just works.
> > > >
> > > > The jumbogram header is getting adding at the L3/IPv6 layer and then
> > > > removed by the drivers before pushing to hardware. So, the only benefit
> > > > of the push and pop of the jumbogram header is for packet sockets and
> > > > tc/ebpf programs - assuming those programs understand the header
> > > > (tcpdump (libpcap?) yes, random packet socket program maybe not). Yes,
> > > > it is a standard header so apps have a chance to understand the larger
> > > > packet size, but what is the likelihood that random apps or even ebpf
> > > > programs will understand it?
> > > >
> > > > Alternative solutions to the packet socket (ebpf programs have access to
> > > > skb->len) problem would allow IPv4 to join the Big TCP party. I am
> > > > wondering how feasible an alternative solution is to get large packet
> > > > sizes across the board with less overhead and changes.
> > >
> > > I agree that the header insertion and removal seems like a lot of extra
> > > overhead for the sake of correctness. In the Microsoft case I am pretty
> > > sure their LSOv2 supported both v4 and v6. I think we could do
> > > something similar, we would just need to make certain the device
> > > supports it and as such maybe it would make sense to implement it as a
> > > gso type flag?
> > >
> > > Could we handle the length field like we handle the checksum and place
> > > a value in there that we know is wrong, but could be used to provide
> > > additional data? Perhaps we could even use it to store the MSS in the
> > > form of the length of the first packet so if examined, the packet would
> > > look like the first frame of the flow with a set of trailing data.
> > >
> >
> > I am a bit sad you did not give all this feedback back in August when
> > I presented BIG TCP.
> >
>
> As I recall, I was thinking along the same lines as what you have done
> here, but Dave's question about including IPv4 does bring up an
> interesting point. And the Microsoft version supported both.

Yes, maybe they added metadata for that and decided to leave packet capture
in the dark, or changed tcpdump/wireshark to fetch/use this metadata?

This was the first thing I tried one year ago, and I eventually gave up
because it was a no-go for us.

Then, seeing HBH Jumbo support recently added to tcpdump, I understood
we could finally get visibility, and started BIG TCP using it.

I guess someone might add extra logic to allow IPv4 BIG TCP if they
really need it; I will not object to it.

>
> > We did a lot of work in the last 6 months to implement, test all this,
> > making sure this worked.
> >
> > I am not sure I want to spend another 6 months implementing what you suggest.
>
> I am not saying we have to do this. I am simply stating a "what if"
> just to gauge this approach. You could think of it as thinking out
> loud, but in written form.

Understood.

BTW I spent time adding a new gso_type flag, but also gave up because we have
no more room in the features_t type.

Solving features_t exhaustion alone is a delicate topic.


>
> > For instance, input path will not like packets larger than 64KB.
> >
> > There is this thing trimming padding bytes, you probably do not want
> > to mess with this.
>
> I had overlooked the fact that this is being used on the input path,
> the trimming would be an issue. I suppose the fact that the LSOv2
> didn't have an Rx counterpart would be one reason for us to not
> consider the IPv4 approach.
>
David Ahern March 5, 2022, 4:46 p.m. UTC | #7
On 3/4/22 10:47 AM, Eric Dumazet wrote:
> On Thu, Mar 3, 2022 at 8:33 PM David Ahern <dsahern@kernel.org> wrote:
>>
>> On 3/3/22 11:16 AM, Eric Dumazet wrote:
>>> From: Coco Li <lixiaoyan@google.com>
>>>
>>> Instead of simply forcing a 0 payload_len in IPv6 header,
>>> implement RFC 2675 and insert a custom extension header.
>>>
>>> Note that only TCP stack is currently potentially generating
>>> jumbograms, and that this extension header is purely local,
>>> it wont be sent on a physical link.
>>>
>>> This is needed so that packet capture (tcpdump and friends)
>>> can properly dissect these large packets.
>>>
>>
>>
>> I am fairly certain I know how you are going to respond, but I will ask
>> this anyways :-) :
>>
>> The networking stack as it stands today does not care that skb->len >
>> 64kB and nothing stops a driver from setting max gso size to be > 64kB.
>> Sure, packet socket apps (tcpdump) get confused but if the h/w supports
>> the larger packet size it just works.
> 
> Observability is key. "just works" is a bold claim.
> 
>>
>> The jumbogram header is getting adding at the L3/IPv6 layer and then
>> removed by the drivers before pushing to hardware. So, the only benefit
>> of the push and pop of the jumbogram header is for packet sockets and
>> tc/ebpf programs - assuming those programs understand the header
>> (tcpdump (libpcap?) yes, random packet socket program maybe not). Yes,
>> it is a standard header so apps have a chance to understand the larger
>> packet size, but what is the likelihood that random apps or even ebpf
>> programs will understand it?
> 
> Can you explain to me what you are referring to by " random apps" exactly ?
> TCP does not expose to user space any individual packet length.

TCP apps are not affected; they do not have direct access to L3 headers.
This is about packet sockets and ebpf programs and their knowledge of
the HBH header. This does not seem like a widely used feature, and even
tcpdump only recently gained support for it (e.g., Ubuntu 20.04 does
not support it, 21.10 does). Given that, what are the odds that most packet
programs are affected by the change? And if they do need support, we
could just as easily add it in a way that gets both networking
layers working.

> 
> 
> 
>>
>> Alternative solutions to the packet socket (ebpf programs have access to
>> skb->len) problem would allow IPv4 to join the Big TCP party. I am
>> wondering how feasible an alternative solution is to get large packet
>> sizes across the board with less overhead and changes.
> 
> You know, I think I already answered this question 6 months ago.
> 
> We need to carry an extra metadata to carry how much TCP payload is in a packet,
> both on RX and TX side.
> 
> Adding an skb field for that was not an option for me.

Why? skb->len is not limited to a u16. The only effect is when skb->len
is used to fill in the ipv4/ipv6 header.

> 
> Adding a 8 bytes header is basically free, the headers need to be in cpu caches
> when the header is added/removed.
> 
> This is zero cost on current cpus, compared to the gains.
> 
> I think you focus on TSO side, which is only 25% of the possible gains
> that BIG TCP was seeking for.
> 
> We covered both RX and TX with a common mechanism.
David Ahern March 5, 2022, 4:53 p.m. UTC | #8
On 3/4/22 12:13 PM, Eric Dumazet wrote:
>> I am not saying we have to do this. I am simply stating a "what if"
>> just to gauge this approach. You could think of it as thinking out
>> loud, but in written form.

my point as well.

> 
> Understood.
> 
> BTW I spent time adding a new gso_type flag, but also gave up because we have
> no more room in features_t type.
> 
> Solving features_t exhaustion alone is a delicate topic.
> 
> 
>>
>>> For instance, input path will not like packets larger than 64KB.
>>>
>>> There is this thing trimming padding bytes, you probably do not want
>>> to mess with this.
>>
>> I had overlooked the fact that this is being used on the input path,
>> the trimming would be an issue. I suppose the fact that the LSOv2
>> didn't have an Rx counterpart would be one reason for us to not
>> consider the IPv4 approach.
>>

I'm aware of the trim on ingress; it can be properly handled. Drivers
(LRO) and the S/W GRO stack would know when they exceed the 64kB
length, so the skb can be marked as a large packet.

I am not trying to derail this set from getting merged. IPv6 has a
standard header for large packet support, so certainly use it. That
said, it is always best in the long run for IPv4 and IPv6 to have
consistent feature support and implementation. Hence my asking about
alternative solutions that work for both.
David Ahern March 5, 2022, 4:55 p.m. UTC | #9
On 3/3/22 11:16 AM, Eric Dumazet wrote:
> diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
> index 16870f86c74d3d1f5dfb7edac1e7db85f1ef6755..93b273db1c9926aba4199f486ce90778311916f5 100644
> --- a/include/linux/ipv6.h
> +++ b/include/linux/ipv6.h
> @@ -144,6 +144,7 @@ struct inet6_skb_parm {
>  #define IP6SKB_L3SLAVE         64
>  #define IP6SKB_JUMBOGRAM      128
>  #define IP6SKB_SEG6	      256
> +#define IP6SKB_FAKEJUMBO      512
>  };
>  

Why is this considered a FAKEJUMBO? The proper header is getting added,
correct?
Eric Dumazet March 5, 2022, 6:08 p.m. UTC | #10
On Sat, Mar 5, 2022 at 8:46 AM David Ahern <dsahern@kernel.org> wrote:
>
> On 3/4/22 10:47 AM, Eric Dumazet wrote:
> > On Thu, Mar 3, 2022 at 8:33 PM David Ahern <dsahern@kernel.org> wrote:
> >>
> >> On 3/3/22 11:16 AM, Eric Dumazet wrote:
> >>> From: Coco Li <lixiaoyan@google.com>
> >>>
> >>> Instead of simply forcing a 0 payload_len in IPv6 header,
> >>> implement RFC 2675 and insert a custom extension header.
> >>>
> >>> Note that only TCP stack is currently potentially generating
> >>> jumbograms, and that this extension header is purely local,
> >>> it wont be sent on a physical link.
> >>>
> >>> This is needed so that packet capture (tcpdump and friends)
> >>> can properly dissect these large packets.
> >>>
> >>
> >>
> >> I am fairly certain I know how you are going to respond, but I will ask
> >> this anyways :-) :
> >>
> >> The networking stack as it stands today does not care that skb->len >
> >> 64kB and nothing stops a driver from setting max gso size to be > 64kB.
> >> Sure, packet socket apps (tcpdump) get confused but if the h/w supports
> >> the larger packet size it just works.
> >
> > Observability is key. "just works" is a bold claim.
> >
> >>
> >> The jumbogram header is getting adding at the L3/IPv6 layer and then
> >> removed by the drivers before pushing to hardware. So, the only benefit
> >> of the push and pop of the jumbogram header is for packet sockets and
> >> tc/ebpf programs - assuming those programs understand the header
> >> (tcpdump (libpcap?) yes, random packet socket program maybe not). Yes,
> >> it is a standard header so apps have a chance to understand the larger
> >> packet size, but what is the likelihood that random apps or even ebpf
> >> programs will understand it?
> >
> > Can you explain to me what you are referring to by " random apps" exactly ?
> > TCP does not expose to user space any individual packet length.
>
> TCP apps are not affected; they do not have direct access to L3 headers.
> This is about packet sockets and ebpf programs and their knowledge of
> the HBH header. This does not seem like a widely used feature and even
> tcpdump only recently gained support for it (e.g.,  Ubuntu 20.04 does
> not support it, 21.10 does). Given that what are the odds most packet
> programs are affected by the change and if they need to have support we
> could just as easily add that support in a way that gets both networking
> layers working.
>
> >
> >
> >
> >>
> >> Alternative solutions to the packet socket (ebpf programs have access to
> >> skb->len) problem would allow IPv4 to join the Big TCP party. I am
> >> wondering how feasible an alternative solution is to get large packet
> >> sizes across the board with less overhead and changes.
> >
> > You know, I think I already answered this question 6 months ago.
> >
> > We need to carry an extra metadata to carry how much TCP payload is in a packet,
> > both on RX and TX side.
> >
> > Adding an skb field for that was not an option for me.
>
> Why? skb->len is not limited to a u16. The only affect is when skb->len
> is used to fill in the ipv4/ipv6 header.

Seriously?

Have you looked recently at the core networking stack, and have you read
my netdev presentation?

Core networking stack will trim your packets skb->len based on what is
found in the network header,
which is a 16-bit field, unless you use HBH.

Look at ip6_rcv_core().
Do you want to modify it?
Let us know how exactly, and why it is not going to break things.

pkt_len = ntohs(hdr->payload_len);

/* pkt_len may be zero if Jumbo payload option is present */
if (pkt_len || hdr->nexthdr != NEXTHDR_HOP) {
        if (pkt_len + sizeof(struct ipv6hdr) > skb->len) {
                __IP6_INC_STATS(net, idev, IPSTATS_MIB_INTRUNCATEDPKTS);
                goto drop;
        }
        if (pskb_trim_rcsum(skb, pkt_len + sizeof(struct ipv6hdr))) {
                __IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS);
                goto drop;
        }
        hdr = ipv6_hdr(skb);
}

if (hdr->nexthdr == NEXTHDR_HOP) {
        if (ipv6_parse_hopopts(skb) < 0) {
                __IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS);
                rcu_read_unlock();
                return NULL;
        }
}







>
> >
> > Adding a 8 bytes header is basically free, the headers need to be in cpu caches
> > when the header is added/removed.
> >
> > This is zero cost on current cpus, compared to the gains.
> >
> > I think you focus on TSO side, which is only 25% of the possible gains
> > that BIG TCP was seeking for.
> >
> > We covered both RX and TX with a common mechanism.
>
David Ahern March 5, 2022, 7:06 p.m. UTC | #11
On 3/5/22 11:08 AM, Eric Dumazet wrote:
>>
>> Why? skb->len is not limited to a u16. The only affect is when skb->len
>> is used to fill in the ipv4/ipv6 header.
> 
> Seriously ?
> 
> Have you looked recently at core networking stack, and have you read
> my netdev presentation ?

yes I have.

> 
> Core networking stack will trim your packets skb->len based on what is
> found in the network header,

Core networking stack is s/w under our control; much more complicated
and intricate changes have been made.

Patch

diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index 16870f86c74d3d1f5dfb7edac1e7db85f1ef6755..93b273db1c9926aba4199f486ce90778311916f5 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -144,6 +144,7 @@  struct inet6_skb_parm {
 #define IP6SKB_L3SLAVE         64
 #define IP6SKB_JUMBOGRAM      128
 #define IP6SKB_SEG6	      256
+#define IP6SKB_FAKEJUMBO      512
 };
 
 #if defined(CONFIG_NET_L3_MASTER_DEV)
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 50db9b20d746bc59c7ef7114492db8b9585c575b..38a8e1c9894cd99ecbec5968fcc97549ea0c7508 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -180,7 +180,9 @@  static int __ip6_finish_output(struct net *net, struct sock *sk, struct sk_buff
 #endif
 
 	mtu = ip6_skb_dst_mtu(skb);
-	if (skb_is_gso(skb) && !skb_gso_validate_network_len(skb, mtu))
+	if (skb_is_gso(skb) &&
+	    !(IP6CB(skb)->flags & IP6SKB_FAKEJUMBO) &&
+	    !skb_gso_validate_network_len(skb, mtu))
 		return ip6_finish_output_gso_slowpath_drop(net, sk, skb, mtu);
 
 	if ((skb->len > mtu && !skb_is_gso(skb)) ||
@@ -251,6 +253,8 @@  int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
 	struct dst_entry *dst = skb_dst(skb);
 	struct net_device *dev = dst->dev;
 	struct inet6_dev *idev = ip6_dst_idev(dst);
+	struct hop_jumbo_hdr *hop_jumbo;
+	int hoplen = sizeof(*hop_jumbo);
 	unsigned int head_room;
 	struct ipv6hdr *hdr;
 	u8  proto = fl6->flowi6_proto;
@@ -258,7 +262,7 @@  int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
 	int hlimit = -1;
 	u32 mtu;
 
-	head_room = sizeof(struct ipv6hdr) + LL_RESERVED_SPACE(dev);
+	head_room = sizeof(struct ipv6hdr) + hoplen + LL_RESERVED_SPACE(dev);
 	if (opt)
 		head_room += opt->opt_nflen + opt->opt_flen;
 
@@ -281,6 +285,20 @@  int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
 					     &fl6->saddr);
 	}
 
+	if (unlikely(seg_len > IPV6_MAXPLEN)) {
+		hop_jumbo = skb_push(skb, hoplen);
+
+		hop_jumbo->nexthdr = proto;
+		hop_jumbo->hdrlen = 0;
+		hop_jumbo->tlv_type = IPV6_TLV_JUMBO;
+		hop_jumbo->tlv_len = 4;
+		hop_jumbo->jumbo_payload_len = htonl(seg_len + hoplen);
+
+		proto = IPPROTO_HOPOPTS;
+		seg_len = 0;
+		IP6CB(skb)->flags |= IP6SKB_FAKEJUMBO;
+	}
+
 	skb_push(skb, sizeof(struct ipv6hdr));
 	skb_reset_network_header(skb);
 	hdr = ipv6_hdr(skb);