[RFC] net: add TCP fraglist GRO support

Message ID: 20240423094117.93206-1-nbd@nbd.name (mailing list archive)
State: RFC
Delegated to: Netdev Maintainers
Series: [RFC] net: add TCP fraglist GRO support

Checks

Context Check Description
netdev/series_format warning Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next, async
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1847 this patch: 1847
netdev/build_tools success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers success CCed 5 of 5 maintainers
netdev/build_clang success Errors and warnings before: 957 this patch: 957
netdev/verify_signedoff fail author Signed-off-by missing
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 1883 this patch: 1883
netdev/checkpatch warning WARNING: Non-standard signature: 'Signe-off-by:' - perhaps 'Signed-off-by:'? WARNING: line length of 85 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 2 this patch: 2
netdev/source_inline success Was 0 now: 0

Commit Message

Felix Fietkau April 23, 2024, 9:41 a.m. UTC
When forwarding TCP after GRO, software segmentation is very expensive,
especially when the checksum needs to be recalculated.
One case where that's currently unavoidable is when routing packets over
PPPoE. Performance improves significantly when using fraglist GRO
implemented in the same way as for UDP.

Here's a measurement of running 2 TCP streams through a MediaTek MT7622
device (2-core Cortex-A53), which runs NAT with flow offload enabled from
one ethernet port to PPPoE on another ethernet port + cake qdisc set to
1Gbps.

rx-gro-list off: 630 Mbit/s, CPU 35% idle
rx-gro-list on:  770 Mbit/s, CPU 40% idle
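
For reference, fraglist GRO is the opt-in rx-gro-list feature measured
above; assuming an interface named eth0, it can be toggled per device
with:

  ethtool -K eth0 rx-gro-list on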

Signe-off-by: Felix Fietkau <nbd@nbd.name>
---
 include/net/gro.h        |   1 +
 include/net/tcp.h        |   3 +-
 net/core/gro.c           |  27 ++++++++
 net/ipv4/tcp_offload.c   | 143 ++++++++++++++++++++++++++++++++++++++-
 net/ipv4/udp_offload.c   |  27 --------
 net/ipv6/tcpv6_offload.c |  58 +++++++++++++++-
 6 files changed, 228 insertions(+), 31 deletions(-)

Comments

Eric Dumazet April 23, 2024, 10:15 a.m. UTC | #1
On Tue, Apr 23, 2024 at 11:41 AM Felix Fietkau <nbd@nbd.name> wrote:
>
> When forwarding TCP after GRO, software segmentation is very expensive,
> especially when the checksum needs to be recalculated.
> One case where that's currently unavoidable is when routing packets over
> PPPoE. Performance improves significantly when using fraglist GRO
> implemented in the same way as for UDP.
>
> Here's a measurement of running 2 TCP streams through a MediaTek MT7622
> device (2-core Cortex-A53), which runs NAT with flow offload enabled from
> one ethernet port to PPPoE on another ethernet port + cake qdisc set to
> 1Gbps.
>
> rx-gro-list off: 630 Mbit/s, CPU 35% idle
> rx-gro-list on:  770 Mbit/s, CPU 40% idle

Hi Felix

changelog is a bit terse, and patch complex.

Could you elaborate why this issue
seems to be related to a specific driver ?

I think we should push hard to not use frag_list in drivers :/

And GRO itself could avoid building frag_list skbs
in hosts where forwarding is enabled.

(Note that we also can increase MAX_SKB_FRAGS to 45 these days)
Felix Fietkau April 23, 2024, 10:25 a.m. UTC | #2
On 23.04.24 12:15, Eric Dumazet wrote:
> On Tue, Apr 23, 2024 at 11:41 AM Felix Fietkau <nbd@nbd.name> wrote:
>>
>> When forwarding TCP after GRO, software segmentation is very expensive,
>> especially when the checksum needs to be recalculated.
>> One case where that's currently unavoidable is when routing packets over
>> PPPoE. Performance improves significantly when using fraglist GRO
>> implemented in the same way as for UDP.
>>
>> Here's a measurement of running 2 TCP streams through a MediaTek MT7622
>> device (2-core Cortex-A53), which runs NAT with flow offload enabled from
>> one ethernet port to PPPoE on another ethernet port + cake qdisc set to
>> 1Gbps.
>>
>> rx-gro-list off: 630 Mbit/s, CPU 35% idle
>> rx-gro-list on:  770 Mbit/s, CPU 40% idle
> 
> Hi Felix
> 
> changelog is a bit terse, and patch complex.
> 
> Could you elaborate why this issue
> seems to be related to a specific driver ?
> 
> I think we should push hard to not use frag_list in drivers :/
> 
> And GRO itself could avoid building frag_list skbs
> in hosts where forwarding is enabled.
> 
> (Note that we also can increase MAX_SKB_FRAGS to 45 these days)

The issue is not related to a specific driver at all. Here's how traffic 
flows: TCP packets are received on the SoC ethernet driver, the network 
stack performs regular GRO. The packet gets forwarded by flow offloading 
until it reaches the PPPoE device. PPPoE does not support GSO packets, 
so the packets need to be segmented again.
This is *very* expensive, since data needs to be copied and checksummed.

So in my patch, I changed the code to build fraglist GRO instead of 
regular GRO packets, whenever there is no local socket to receive the 
packets. This makes segmenting very cheap, since the original skbs are 
preserved on the trip through the stack. The only cost is an extra 
socket lookup whenever NETIF_F_GRO_FRAGLIST is enabled.
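
A condensed sketch of that heuristic (the full version is
tcp4_check_fraglist_gro() in the patch below; the helper name here is
made up, and the header pulls/error handling are omitted):

static bool use_fraglist_gro(struct sk_buff *skb, const struct iphdr *iph,
			     const struct tcphdr *th)
{
	struct net *net = dev_net(skb->dev);
	struct sock *sk;
	int iif, sdif;

	/* the device has to opt in via the rx-gro-list feature */
	if (!(skb->dev->features & NETIF_F_GRO_FRAGLIST))
		return false;

	inet_get_iif_sdif(skb, &iif, &sdif);
	sk = __inet_lookup_established(net, net->ipv4.tcp_death_row.hashinfo,
				       iph->saddr, th->source,
				       iph->daddr, ntohs(th->dest),
				       iif, sdif);
	if (!sk)
		return true;	/* no local receiver: likely forwarded */

	sock_put(sk);
	return false;		/* local socket found: use regular GRO */
}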

PPPoE in this case is only an example. The same issue appears when 
forwarding to any netdev which does not support TSO, which in my case 
affects most wifi drivers as well.

- Felix
Eric Dumazet April 23, 2024, 11:17 a.m. UTC | #3
On Tue, Apr 23, 2024 at 12:25 PM Felix Fietkau <nbd@nbd.name> wrote:
>
> On 23.04.24 12:15, Eric Dumazet wrote:
> > On Tue, Apr 23, 2024 at 11:41 AM Felix Fietkau <nbd@nbd.name> wrote:
> >>
> >> When forwarding TCP after GRO, software segmentation is very expensive,
> >> especially when the checksum needs to be recalculated.
> >> One case where that's currently unavoidable is when routing packets over
> >> PPPoE. Performance improves significantly when using fraglist GRO
> >> implemented in the same way as for UDP.
> >>
> >> Here's a measurement of running 2 TCP streams through a MediaTek MT7622
> >> device (2-core Cortex-A53), which runs NAT with flow offload enabled from
> >> one ethernet port to PPPoE on another ethernet port + cake qdisc set to
> >> 1Gbps.
> >>
> >> rx-gro-list off: 630 Mbit/s, CPU 35% idle
> >> rx-gro-list on:  770 Mbit/s, CPU 40% idle
> >
> > Hi Felix
> >
> > changelog is a bit terse, and patch complex.
> >
> > Could you elaborate why this issue
> > seems to be related to a specific driver ?
> >
> > I think we should push hard to not use frag_list in drivers :/
> >
> > And GRO itself could avoid building frag_list skbs
> > in hosts where forwarding is enabled.
> >
> > (Note that we also can increase MAX_SKB_FRAGS to 45 these days)
>
> The issue is not related to a specific driver at all. Here's how traffic
> flows: TCP packets are received on the SoC ethernet driver, the network
> stack performs regular GRO. The packet gets forwarded by flow offloading
> until it reaches the PPPoE device. PPPoE does not support GSO packets,
> so the packets need to be segmented again.
> This is *very* expensive, since data needs to be copied and checksummed.

gso segmentation does not copy the payload, unless the device has no
SG capability.

I guess something should be done about that, regardless of your GRO work,
since most ethernet devices support SG these days.

Some drivers use header split for RX, so forwarding to  PPPoE
would require a linearization anyway, if SG is not properly handled.

>
> So in my patch, I changed the code to build fraglist GRO instead of
> regular GRO packets, whenever there is no local socket to receive the
> packets. This makes segmenting very cheap, since the original skbs are
> preserved on the trip through the stack. The only cost is an extra
> socket lookup whenever NETIF_F_GRO_FRAGLIST is enabled.

A socket lookup in a multi-net-namespace world is not going to work generically,
but I get the idea now.
Felix Fietkau April 23, 2024, 11:55 a.m. UTC | #4
On 23.04.24 13:17, Eric Dumazet wrote:
> On Tue, Apr 23, 2024 at 12:25 PM Felix Fietkau <nbd@nbd.name> wrote:
>>
>> On 23.04.24 12:15, Eric Dumazet wrote:
>> > On Tue, Apr 23, 2024 at 11:41 AM Felix Fietkau <nbd@nbd.name> wrote:
>> >>
>> >> When forwarding TCP after GRO, software segmentation is very expensive,
>> >> especially when the checksum needs to be recalculated.
>> >> One case where that's currently unavoidable is when routing packets over
>> >> PPPoE. Performance improves significantly when using fraglist GRO
>> >> implemented in the same way as for UDP.
>> >>
>> >> Here's a measurement of running 2 TCP streams through a MediaTek MT7622
>> >> device (2-core Cortex-A53), which runs NAT with flow offload enabled from
>> >> one ethernet port to PPPoE on another ethernet port + cake qdisc set to
>> >> 1Gbps.
>> >>
>> >> rx-gro-list off: 630 Mbit/s, CPU 35% idle
>> >> rx-gro-list on:  770 Mbit/s, CPU 40% idle
>> >
>> > Hi Felix
>> >
>> > changelog is a bit terse, and patch complex.
>> >
>> > Could you elaborate why this issue
>> > seems to be related to a specific driver ?
>> >
>> > I think we should push hard to not use frag_list in drivers :/
>> >
>> > And GRO itself could avoid building frag_list skbs
>> > in hosts where forwarding is enabled.
>> >
>> > (Note that we also can increase MAX_SKB_FRAGS to 45 these days)
>>
>> The issue is not related to a specific driver at all. Here's how traffic
>> flows: TCP packets are received on the SoC ethernet driver, the network
>> stack performs regular GRO. The packet gets forwarded by flow offloading
>> until it reaches the PPPoE device. PPPoE does not support GSO packets,
>> so the packets need to be segmented again.
>> This is *very* expensive, since data needs to be copied and checksummed.
> 
> gso segmentation does not copy the payload, unless the device has no
> SG capability.
> 
> I guess something should be done about that, regardless of your GRO work,
> since most ethernet devices support SG these days.
> 
> Some drivers use header split for RX, so forwarding to  PPPoE
> would require a linearization anyway, if SG is not properly handled.

In the world of consumer-grade WiFi devices, there are a lot of chipsets 
with limited or nonexistent SG support, and very limited checksum 
offload capabilities on Ethernet. The WiFi side of these devices is 
often even worse. I think fraglist GRO is a decent fallback for the 
inevitable corner cases.

>> So in my patch, I changed the code to build fraglist GRO instead of
>> regular GRO packets, whenever there is no local socket to receive the
>> packets. This makes segmenting very cheap, since the original skbs are
>> preserved on the trip through the stack. The only cost is an extra
>> socket lookup whenever NETIF_F_GRO_FRAGLIST is enabled.
> 
> A socket lookup in a multi-net-namespace world is not going to work generically,
> but I get the idea now.

Right, I can't think of a proper solution to this at the moment. 
Considering that NETIF_F_GRO_FRAGLIST is opt-in and only meant for 
rather specific configurations anyway, this should not be too much of a 
problem, right?

- Felix
Eric Dumazet April 23, 2024, 12:11 p.m. UTC | #5
On Tue, Apr 23, 2024 at 1:55 PM Felix Fietkau <nbd@nbd.name> wrote:
>
> In the world of consumer-grade WiFi devices, there are a lot of chipsets
> with limited or nonexistent SG support, and very limited checksum
> offload capabilities on Ethernet. The WiFi side of these devices is
> often even worse. I think fraglist GRO is a decent fallback for the
> inevitable corner cases.

What about netfilter and NAT ? Are they okay with NETIF_F_GRO_FRAGLIST already ?

Many of these devices are probably using NAT.
Felix Fietkau April 23, 2024, 12:23 p.m. UTC | #6
On 23.04.24 14:11, Eric Dumazet wrote:
> On Tue, Apr 23, 2024 at 1:55 PM Felix Fietkau <nbd@nbd.name> wrote:
>>
>> In the world of consumer-grade WiFi devices, there are a lot of chipsets
>> with limited or nonexistent SG support, and very limited checksum
>> offload capabilities on Ethernet. The WiFi side of these devices is
>> often even worse. I think fraglist GRO is a decent fallback for the
>> inevitable corner cases.
> 
> What about netfilter and NAT ? Are they okay with NETIF_F_GRO_FRAGLIST already ?
> 
> Many of these devices are probably using NAT.

In my tests, nftables NAT works just fine, both with and without 
flowtable offloading. I didn't see anything in netfilter that would have 
a problem with this.
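
For context, a minimal nft ruleset of the kind exercised in these tests
might look like this (interface names lan0/pppoe-wan are made up; the
flowtable block is what enables the offloaded path):

table inet nat {
	chain postrouting {
		type nat hook postrouting priority srcnat; policy accept;
		oifname "pppoe-wan" masquerade
	}
}

table inet filter {
	flowtable ft {
		hook ingress priority 0;
		devices = { lan0, pppoe-wan };
	}
	chain forward {
		type filter hook forward priority filter; policy accept;
		ip protocol tcp flow add @ft
	}
}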

- Felix
Eric Dumazet April 23, 2024, 1:07 p.m. UTC | #7
On Tue, Apr 23, 2024 at 2:23 PM Felix Fietkau <nbd@nbd.name> wrote:
>
> On 23.04.24 14:11, Eric Dumazet wrote:
> > On Tue, Apr 23, 2024 at 1:55 PM Felix Fietkau <nbd@nbd.name> wrote:
> >>
> >> In the world of consumer-grade WiFi devices, there are a lot of chipsets
> >> with limited or nonexistent SG support, and very limited checksum
> >> offload capabilities on Ethernet. The WiFi side of these devices is
> >> often even worse. I think fraglist GRO is a decent fallback for the
> >> inevitable corner cases.
> >
> > What about netfilter and NAT ? Are they okay with NETIF_F_GRO_FRAGLIST already ?
> >
> > Many of these devices are probably using NAT.
>
> In my tests, nftables NAT works just fine, both with and without
> flowtable offloading. I didn't see anything in netfilter that would have
> a problem with this.

This is great !
Paolo Abeni April 23, 2024, 2:34 p.m. UTC | #8
On Tue, 2024-04-23 at 14:23 +0200, Felix Fietkau wrote:
> On 23.04.24 14:11, Eric Dumazet wrote:
> > On Tue, Apr 23, 2024 at 1:55 PM Felix Fietkau <nbd@nbd.name> wrote:
> > > 
> > > In the world of consumer-grade WiFi devices, there are a lot of chipsets
> > > with limited or nonexistent SG support, and very limited checksum
> > > offload capabilities on Ethernet. The WiFi side of these devices is
> > > often even worse. I think fraglist GRO is a decent fallback for the
> > > inevitable corner cases.
> > 
> > What about netfilter and NAT ? Are they okay with NETIF_F_GRO_FRAGLIST already ?
> > 
> > Many of these devices are probably using NAT.
> 
> In my tests, nftables NAT works just fine, both with and without 
> flowtable offloading. I didn't see anything in netfilter that would have 
> a problem with this.

I see you explicitly handle NAT changes in __tcpv4_gso_segment_csum(),
like the current UDP code. 

The TCP header has many other fields that could be updated affecting
the TCP csum.
Handling every possible mutation looks cumbersome and will likely
reduce the performance benefits.

What is your plan WRT other TCP header fields update?

Strictly WRT the patch, I guess it deserves to be split in series,
moving UDP helpers in common code and possibly factoring out more
helpers with separate patches.

e.g. __tcpv4_gso_segment_csum() is quite similar to
__udpv4_gso_segment_csum() - even too much, as the tcp csum should
always be updated when the ports or addresses change ;)

Cheers,

Paolo
David Ahern April 23, 2024, 3:03 p.m. UTC | #9
On 4/23/24 4:15 AM, Eric Dumazet wrote:
> I think we should push hard to not use frag_list in drivers :/

why is that? I noticed significant gains for local delivery after adding
frag_list support for H/W GRO. Fewer skbs going up the stack is
essential for high throughput and reducing CPU load.

> 
> And GRO itself could avoid building frag_list skbs
> in hosts where forwarding is enabled.

But if the egress device supports SG and the driver understands
frag_list, walking the frag_list should be cheaper than multiple skbs
traversing the forwarding path.

> 
> (Note that we also can increase MAX_SKB_FRAGS to 45 these days)

Using 45 frags has other side effects and is not something that can be done
universally (hence why it is a config option).

45 frags is for Big TCP at 4kB and that is ~ 3 skbs at the default
setting of 17 which means an skb chain 2 deep. 1 skb going up the stack
vs 3 skbs - that is a big difference.

Was there a conference talk or a discussion around tests performed
comparing use of frag_list with MAX_SKB_FRAGS at 17 vs expanding
MAX_SKB_FRAGS to 45?
Eric Dumazet April 23, 2024, 3:18 p.m. UTC | #10
On Tue, Apr 23, 2024 at 5:03 PM David Ahern <dsahern@kernel.org> wrote:
>
> On 4/23/24 4:15 AM, Eric Dumazet wrote:
> > I think we should push hard to not use frag_list in drivers :/
>
> why is that? I noticed significant gains for local delivery after adding
> frag_list support for H/W GRO. Fewer skbs going up the stack is
> essential for high throughput and reducing CPU load.

Felix case is about forwarding, not local delivery (which is fine)

>
> >
> > And GRO itself could avoid building frag_list skbs
> > in hosts where forwarding is enabled.
>
> But if the egress device supports SG and the driver understands
> frag_list, walking the frag_list should be cheaper than multiple skbs
> traversing the forwarding path.

I do not see any relevant (modern) driver supporting NETIF_F_FRAGLIST

>
> >
> > (Note that we also can increase MAX_SKB_FRAGS to 45 these days)
>
> Using 45 frags has other side effects and is not something that can be done
> universally (hence why it is a config option).
>
> 45 frags is for Big TCP at 4kB and that is ~ 3 skbs at the default
> setting of 17 which means an skb chain 2 deep. 1 skb going up the stack
> vs 3 skbs - that is a big difference.

45 frags can also be for non BIG TCP, allowing ~64KB GRO packets
without frag_list.

45*1448 = 65160

Max number of frags per skb is orthogonal to (BIG) TCP.
Felix Fietkau April 23, 2024, 4:55 p.m. UTC | #11
On 23.04.24 16:34, Paolo Abeni wrote:
> On Tue, 2024-04-23 at 14:23 +0200, Felix Fietkau wrote:
>> On 23.04.24 14:11, Eric Dumazet wrote:
>> > On Tue, Apr 23, 2024 at 1:55 PM Felix Fietkau <nbd@nbd.name> wrote:
>> > > 
>> > > In the world of consumer-grade WiFi devices, there are a lot of chipsets
>> > > with limited or nonexistent SG support, and very limited checksum
>> > > offload capabilities on Ethernet. The WiFi side of these devices is
>> > > often even worse. I think fraglist GRO is a decent fallback for the
>> > > inevitable corner cases.
>> > 
>> > What about netfilter and NAT ? Are they okay with NETIF_F_GRO_FRAGLIST already ?
>> > 
>> > Many of these devices are probably using NAT.
>> 
>> In my tests, nftables NAT works just fine, both with and without 
>> flowtable offloading. I didn't see anything in netfilter that would have 
>> a problem with this.
> 
> I see you explicitly handle NAT changes in __tcpv4_gso_segment_csum(),
> like the current UDP code.
> 
> The TCP header has many other fields that could be updated affecting
> the TCP csum.
> Handling every possible mutation looks cumbersome and will likely
> reduce the performance benefits.
> 
> What is your plan WRT other TCP header fields update?

I think that should be easy enough to handle. My patch already only 
combines packets where tcp_flag_word(th) is identical. So when 
segmenting, I could handle all flags changes with a single 
inet_proto_csum_replace4 call.
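
For illustration, a hypothetical helper along those lines (the name and
calling convention are made up; 'flags' would be the current flag word
of the head skb, 'th' the header of a chained segment):

static void tcp_seg_update_flags_csum(struct sk_buff *seg, struct tcphdr *th,
				      __be32 flags)
{
	__be32 old_flags = tcp_flag_word(th);

	if (old_flags == flags)
		return;

	/* fold the rewritten flag word into the TCP checksum in one call */
	inet_proto_csum_replace4(&th->check, seg, old_flags, flags, false);
	tcp_flag_word(th) = flags;
}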

> Strictly WRT the patch, I guess it deserves to be split in series,
> moving UDP helpers in common code and possibly factoring out more
> helpers with separate patches.
Will do.

> e.g. __tcpv4_gso_segment_csum() is quite similar to
> __udpv4_gso_segment_csum() - even too much, as the tcp csum should
> always be updated when the ports or addresses change ;)

Will fix that.

Thanks,

- Felix
Willem de Bruijn April 24, 2024, 1:24 a.m. UTC | #12
Felix Fietkau wrote:
> On 23.04.24 16:34, Paolo Abeni wrote:
> > On Tue, 2024-04-23 at 14:23 +0200, Felix Fietkau wrote:
> >> On 23.04.24 14:11, Eric Dumazet wrote:
> >> > On Tue, Apr 23, 2024 at 1:55 PM Felix Fietkau <nbd@nbd.name> wrote:
> >> > > 
> >> > > In the world of consumer-grade WiFi devices, there are a lot of chipsets
> >> > > with limited or nonexistent SG support, and very limited checksum
> >> > > offload capabilities on Ethernet. The WiFi side of these devices is
> >> > > often even worse. I think fraglist GRO is a decent fallback for the
> >> > > inevitable corner cases.
> >> > 
> >> > What about netfilter and NAT ? Are they okay with NETIF_F_GRO_FRAGLIST already ?
> >> > 
> >> > Many of these devices are probably using NAT.
> >> 
> >> In my tests, nftables NAT works just fine, both with and without 
> >> flowtable offloading. I didn't see anything in netfilter that would have 
> >> a problem with this.
> > 
> > I see you explicitly handle NAT changes in __tcpv4_gso_segment_csum(),
> > like the current UDP code.
> > 
> > The TCP header has many other fields that could be updated affecting
> > the TCP csum.
> > Handling every possible mutation looks cumbersome and will likely
> > reduce the performance benefits.
> > 
> > What is your plan WRT other TCP header fields update?
> 
> I think that should be easy enough to handle. My patch already only 
> combines packets where tcp_flag_word(th) is identical. So when 
> segmenting, I could handle all flags changes with a single 
> inet_proto_csum_replace4 call.
> 
> > Strictly WRT the patch, I guess it deserves to be split in series,
> > moving UDP helpers in common code and possibly factoring out more
> > helpers with separate patches.
> Will do.

A significant chunk of the complexity is in the
tcp[46]_check_fraglist_gro sk match. Is this heuristic worth the
complexity?

It seems that the platforms that will enable NETIF_F_FRAGLIST will
be mainly forwarding planes.

If keeping, this refinement can probably be a separate follow-on patch in
the series too:

- refactor existing udp code
- add segmentation support to handle such packets on tx
- add coalescing support that starts building such packets on rx
- refine coalescing choice

> > e.g. __tcpv4_gso_segment_csum() is quite similar to
> > __udpv4_gso_segment_csum() - even too much, as the tcp csum should
> > always be updated when the ports or addresses change ;)
> 
> Will fix that.
> 
> Thanks,
> 
> - Felix
Felix Fietkau April 24, 2024, 1:50 p.m. UTC | #13
On 24.04.24 03:24, Willem de Bruijn wrote:
> Felix Fietkau wrote:
>> On 23.04.24 16:34, Paolo Abeni wrote:
>> > On Tue, 2024-04-23 at 14:23 +0200, Felix Fietkau wrote:
>> >> On 23.04.24 14:11, Eric Dumazet wrote:
>> >> > On Tue, Apr 23, 2024 at 1:55 PM Felix Fietkau <nbd@nbd.name> wrote:
>> >> > > 
>> >> > > In the world of consumer-grade WiFi devices, there are a lot of chipsets
>> >> > > with limited or nonexistent SG support, and very limited checksum
>> >> > > offload capabilities on Ethernet. The WiFi side of these devices is
>> >> > > often even worse. I think fraglist GRO is a decent fallback for the
>> >> > > inevitable corner cases.
>> >> > 
>> >> > What about netfilter and NAT ? Are they okay with NETIF_F_GRO_FRAGLIST already ?
>> >> > 
>> >> > Many of these devices are probably using NAT.
>> >> 
>> >> In my tests, nftables NAT works just fine, both with and without 
>> >> flowtable offloading. I didn't see anything in netfilter that would have 
>> >> a problem with this.
>> > 
>> > I see you explicitly handle NAT changes in __tcpv4_gso_segment_csum(),
>> > like the current UDP code.
>> > 
>> > The TCP header has many other fields that could be updated affecting
>> > the TCP csum.
>> > Handling every possible mutation looks cumbersome and will likely
>> > reduce the performance benefits.
>> > 
>> > What is your plan WRT other TCP header fields update?
>> 
>> I think that should be easy enough to handle. My patch already only 
>> combines packets where tcp_flag_word(th) is identical. So when 
>> segmenting, I could handle all flags changes with a single 
>> inet_proto_csum_replace4 call.
>> 
>> > Strictly WRT the patch, I guess it deserves to be split in series,
>> > moving UDP helpers in common code and possibly factoring out more
>> > helpers with separate patches.
>> Will do.
> 
> A significant chunk of the complexity is in the
> tcp[46]_check_fraglist_gro sk match. Is this heuristic worth the
> complexity?
> 
> It seems that the platforms that will enable NETIF_F_FRAGLIST will
> be mainly forwarding planes.

There are people using their devices as file servers and routers at the 
same time. The heuristic helps for those cases.

> If keeping, this refinement can probably be a separate follow-on patch in
> the series too:
> 
> - refactor existing udp code
> - add segmentation support to handle such packets on tx
> - add coalescing support that starts building such packets on rx
> - refine coalescing choice
I don't really understand what you're suggesting. With my patch, the GRO 
code handles coalescing of packets. Segmentation on output is also 
supported. The next version of my patch will fix the cases that were too 
similar to the UDP code, so I guess refactoring to share code doesn't 
really make sense there.
Am I missing something?

Thanks,

- Felix
Willem de Bruijn April 24, 2024, 2:30 p.m. UTC | #14
Felix Fietkau wrote:
> On 24.04.24 03:24, Willem de Bruijn wrote:
> > Felix Fietkau wrote:
> >> On 23.04.24 16:34, Paolo Abeni wrote:
> >> > On Tue, 2024-04-23 at 14:23 +0200, Felix Fietkau wrote:
> >> >> On 23.04.24 14:11, Eric Dumazet wrote:
> >> >> > On Tue, Apr 23, 2024 at 1:55 PM Felix Fietkau <nbd@nbd.name> wrote:
> >> >> > > 
> >> >> > > In the world of consumer-grade WiFi devices, there are a lot of chipsets
> >> >> > > with limited or nonexistent SG support, and very limited checksum
> >> >> > > offload capabilities on Ethernet. The WiFi side of these devices is
> >> >> > > often even worse. I think fraglist GRO is a decent fallback for the
> >> >> > > inevitable corner cases.
> >> >> > 
> >> >> > What about netfilter and NAT ? Are they okay with NETIF_F_GRO_FRAGLIST already ?
> >> >> > 
> >> >> > Many of these devices are probably using NAT.
> >> >> 
> >> >> In my tests, nftables NAT works just fine, both with and without 
> >> >> flowtable offloading. I didn't see anything in netfilter that would have 
> >> >> a problem with this.
> >> > 
> >> > I see you explicitly handle NAT changes in __tcpv4_gso_segment_csum(),
> >> > like the current UDP code.
> >> > 
> >> > The TCP header has many other fields that could be updated affecting
> >> > the TCP csum.
> >> > Handling every possible mutation looks cumbersome and will likely
> >> > reduce the performance benefits.
> >> > 
> >> > What is your plan WRT other TCP header fields update?
> >> 
> >> I think that should be easy enough to handle. My patch already only 
> >> combines packets where tcp_flag_word(th) is identical. So when 
> >> segmenting, I could handle all flags changes with a single 
> >> inet_proto_csum_replace4 call.
> >> 
> >> > Strictly WRT the patch, I guess it deserves to be split in series,
> >> > moving UDP helpers in common code and possibly factoring out more
> >> > helpers with separate patches.
> >> Will do.
> > 
> > A significant chunk of the complexity is in the
> > tcp[46]_check_fraglist_gro sk match. Is this heuristic worth the
> > complexity?
> > 
> > It seems that the platforms that will enable NETIF_F_FRAGLIST will
> > be mainly forwarding planes.
> 
> There are people using their devices as file servers and routers at the 
> same time. The heuristic helps for those cases.

Ok.

> > If keeping, this refinement can probably be a separate follow-on patch in
> > the series too:
> > 
> > - refactor existing udp code
> > - add segmentation support to handle such packets on tx
> > - add coalescing support that starts building such packets on rx
> > - refine coalescing choice
> I don't really understand what you're suggesting. With my patch, the GRO 
> code handles coalescing of packets. Segmentation on output is also 
> supported. The next version of my patch will fix the cases that were too 
> similar to the UDP code, so I guess refactoring to share code doesn't 
> really make sense there.
> Am I missing something?

I mean when breaking this up into a series. First have the
refactoring patch, which should be easy to review since it is a noop.
Then add the segmentation
code, which needs to exist before packets may arrive that depend on
it. Then add the code that produces such packets. To break up into
manageable chunks.
Felix Fietkau April 24, 2024, 4:26 p.m. UTC | #15
On 24.04.24 16:30, Willem de Bruijn wrote:
>> > If keeping, this refinement can probably be a separate follow-on patch in
>> > the series too:
>> > 
>> > - refactor existing udp code
>> > - add segmentation support to handle such packets on tx
>> > - add coalescing support that starts building such packets on rx
>> > - refine coalescing choice
>> I don't really understand what you're suggesting. With my patch, the GRO 
>> code handles coalescing of packets. Segmentation on output is also 
>> supported. The next version of my patch will fix the cases that were too 
>> similar to the UDP code, so I guess refactoring to share code doesn't 
>> really make sense there.
>> Am I missing something?
> 
> I mean when breaking this up into a series. First have the
> refactoring patch, which should be easy to review since it is a noop.
> Then add the segmentation
> code, which needs to exist before packets may arrive that depend on
> it. Then add the code that produces such packets. To break up into
> manageable chunks.

Right, that makes sense.

Thanks,

- Felix

Patch

diff --git a/include/net/gro.h b/include/net/gro.h
index 50f1e403dbbb..ca8e4b3de044 100644
--- a/include/net/gro.h
+++ b/include/net/gro.h
@@ -429,6 +429,7 @@  static inline __wsum ip6_gro_compute_pseudo(const struct sk_buff *skb,
 }
 
 int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb);
+int skb_gro_receive_list(struct sk_buff *p, struct sk_buff *skb);
 
 /* Pass the currently batched GRO_NORMAL SKBs up to the stack. */
 static inline void gro_normal_list(struct napi_struct *napi)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index b935e1ae4caf..875cda53a7c9 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2194,7 +2194,8 @@  void tcp_v4_destroy_sock(struct sock *sk);
 
 struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
 				netdev_features_t features);
-struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb);
+struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb,
+				bool fraglist);
 INDIRECT_CALLABLE_DECLARE(int tcp4_gro_complete(struct sk_buff *skb, int thoff));
 INDIRECT_CALLABLE_DECLARE(struct sk_buff *tcp4_gro_receive(struct list_head *head, struct sk_buff *skb));
 INDIRECT_CALLABLE_DECLARE(int tcp6_gro_complete(struct sk_buff *skb, int thoff));
diff --git a/net/core/gro.c b/net/core/gro.c
index 2459ab697f7f..268c6c826d09 100644
--- a/net/core/gro.c
+++ b/net/core/gro.c
@@ -231,6 +231,33 @@  int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
 	return 0;
 }
 
+int skb_gro_receive_list(struct sk_buff *p, struct sk_buff *skb)
+{
+	if (unlikely(p->len + skb->len >= 65536))
+		return -E2BIG;
+
+	if (NAPI_GRO_CB(p)->last == p)
+		skb_shinfo(p)->frag_list = skb;
+	else
+		NAPI_GRO_CB(p)->last->next = skb;
+
+	skb_pull(skb, skb_gro_offset(skb));
+
+	NAPI_GRO_CB(p)->last = skb;
+	NAPI_GRO_CB(p)->count++;
+	p->data_len += skb->len;
+
+	/* sk ownership - if any - completely transferred to the aggregated packet */
+	skb->destructor = NULL;
+	skb->sk = NULL;
+	p->truesize += skb->truesize;
+	p->len += skb->len;
+
+	NAPI_GRO_CB(skb)->same_flow = 1;
+
+	return 0;
+}
+
 
 static void napi_gro_complete(struct napi_struct *napi, struct sk_buff *skb)
 {
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index fab0973f995b..4f6e40a30b0c 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -28,6 +28,74 @@  static void tcp_gso_tstamp(struct sk_buff *skb, unsigned int ts_seq,
 	}
 }
 
+static void __tcpv4_gso_segment_csum(struct sk_buff *seg,
+				     __be32 *oldip, __be32 *newip,
+				     __be16 *oldport, __be16 *newport)
+{
+	struct tcphdr *th;
+	struct iphdr *iph;
+
+	if (*oldip == *newip && *oldport == *newport)
+		return;
+
+	th = tcp_hdr(seg);
+	iph = ip_hdr(seg);
+
+	if (th->check) {
+		inet_proto_csum_replace4(&th->check, seg, *oldip, *newip,
+					 true);
+		inet_proto_csum_replace2(&th->check, seg, *oldport, *newport,
+					 false);
+		if (!th->check)
+			th->check = CSUM_MANGLED_0;
+	}
+	*oldport = *newport;
+
+	csum_replace4(&iph->check, *oldip, *newip);
+	*oldip = *newip;
+}
+
+static struct sk_buff *__tcpv4_gso_segment_list_csum(struct sk_buff *segs)
+{
+	struct sk_buff *seg;
+	struct tcphdr *th, *th2;
+	struct iphdr *iph, *iph2;
+
+	seg = segs;
+	th = tcp_hdr(seg);
+	iph = ip_hdr(seg);
+
+	if ((tcp_hdr(seg)->dest == tcp_hdr(seg->next)->dest) &&
+	    (tcp_hdr(seg)->source == tcp_hdr(seg->next)->source) &&
+	    (ip_hdr(seg)->daddr == ip_hdr(seg->next)->daddr) &&
+	    (ip_hdr(seg)->saddr == ip_hdr(seg->next)->saddr))
+		return segs;
+
+	while ((seg = seg->next)) {
+		th2 = tcp_hdr(seg);
+		iph2 = ip_hdr(seg);
+
+		__tcpv4_gso_segment_csum(seg,
+					 &iph2->saddr, &iph->saddr,
+					 &th2->source, &th->source);
+		__tcpv4_gso_segment_csum(seg,
+					 &iph2->daddr, &iph->daddr,
+					 &th2->dest, &th->dest);
+	}
+
+	return segs;
+}
+
+static struct sk_buff *__tcp_gso_segment_list(struct sk_buff *skb,
+					      netdev_features_t features)
+{
+	skb = skb_segment_list(skb, features, skb_mac_header_len(skb));
+	if (IS_ERR(skb))
+		return skb;
+
+	return __tcpv4_gso_segment_list_csum(skb);
+}
+
 static struct sk_buff *tcp4_gso_segment(struct sk_buff *skb,
 					netdev_features_t features)
 {
@@ -37,6 +105,9 @@  static struct sk_buff *tcp4_gso_segment(struct sk_buff *skb,
 	if (!pskb_may_pull(skb, sizeof(struct tcphdr)))
 		return ERR_PTR(-EINVAL);
 
+	if (skb_shinfo(skb)->gso_type & SKB_GSO_FRAGLIST)
+		return __tcp_gso_segment_list(skb, features);
+
 	if (unlikely(skb->ip_summed != CHECKSUM_PARTIAL)) {
 		const struct iphdr *iph = ip_hdr(skb);
 		struct tcphdr *th = tcp_hdr(skb);
@@ -178,7 +249,8 @@  struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
 	return segs;
 }
 
-struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb)
+struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb,
+				bool fraglist)
 {
 	struct sk_buff *pp = NULL;
 	struct sk_buff *p;
@@ -215,6 +287,7 @@  struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb)
 	len = skb_gro_len(skb);
 	flags = tcp_flag_word(th);
 
+	NAPI_GRO_CB(skb)->is_flist = fraglist;
 	list_for_each_entry(p, head, list) {
 		if (!NAPI_GRO_CB(p)->same_flow)
 			continue;
@@ -234,6 +307,7 @@  struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb)
 found:
 	/* Include the IP ID check below from the inner most IP hdr */
 	flush = NAPI_GRO_CB(p)->flush;
+	flush |= fraglist != NAPI_GRO_CB(p)->is_flist;
 	flush |= (__force int)(flags & TCP_FLAG_CWR);
 	flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
 		  ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
@@ -267,6 +341,19 @@  struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb)
 	flush |= (ntohl(th2->seq) + skb_gro_len(p)) ^ ntohl(th->seq);
 	flush |= skb_cmp_decrypted(p, skb);
 
+	if (fraglist) {
+		flush |= (__force int)(flags ^ tcp_flag_word(th2));
+		flush |= skb->ip_summed != p->ip_summed;
+		flush |= skb->csum_level != p->csum_level;
+		flush |= !pskb_may_pull(skb, skb_gro_offset(skb));
+		flush |= NAPI_GRO_CB(p)->count >= 64;
+
+		if (flush || skb_gro_receive_list(p, skb))
+			mss = 1;
+
+		goto out_check_final;
+	}
+
 	if (flush || skb_gro_receive(p, skb)) {
 		mss = 1;
 		goto out_check_final;
@@ -314,6 +401,49 @@  void tcp_gro_complete(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(tcp_gro_complete);
 
+static bool tcp4_check_fraglist_gro(struct sk_buff *skb)
+{
+	const struct iphdr *iph = skb_gro_network_header(skb);
+	struct net *net = dev_net(skb->dev);
+	unsigned int off, hlen, thlen;
+	struct tcphdr *th;
+	struct sock *sk;
+	int iif, sdif;
+
+	if (!(skb->dev->features & NETIF_F_GRO_FRAGLIST))
+		return false;
+
+	inet_get_iif_sdif(skb, &iif, &sdif);
+
+	off = skb_gro_offset(skb);
+	hlen = off + sizeof(*th);
+	th = skb_gro_header(skb, hlen, off);
+	if (unlikely(!th))
+		return false;
+
+	thlen = th->doff * 4;
+	if (thlen < sizeof(*th))
+		return false;
+
+	hlen = off + thlen;
+	if (!skb_gro_may_pull(skb, hlen)) {
+		th = skb_gro_header_slow(skb, hlen, off);
+		if (unlikely(!th))
+			return false;
+	}
+
+	sk = __inet_lookup_established(net, net->ipv4.tcp_death_row.hashinfo,
+				       iph->saddr, th->source,
+				       iph->daddr, ntohs(th->dest),
+				       iif, sdif);
+	if (!sk)
+		return true;
+
+	sock_put(sk);
+
+	return false;
+}
+
 INDIRECT_CALLABLE_SCOPE
 struct sk_buff *tcp4_gro_receive(struct list_head *head, struct sk_buff *skb)
 {
@@ -325,7 +455,7 @@  struct sk_buff *tcp4_gro_receive(struct list_head *head, struct sk_buff *skb)
 		return NULL;
 	}
 
-	return tcp_gro_receive(head, skb);
+	return tcp_gro_receive(head, skb, tcp4_check_fraglist_gro(skb));
 }
 
 INDIRECT_CALLABLE_SCOPE int tcp4_gro_complete(struct sk_buff *skb, int thoff)
@@ -333,6 +463,15 @@  INDIRECT_CALLABLE_SCOPE int tcp4_gro_complete(struct sk_buff *skb, int thoff)
 	const struct iphdr *iph = ip_hdr(skb);
 	struct tcphdr *th = tcp_hdr(skb);
 
+	if (NAPI_GRO_CB(skb)->is_flist) {
+		skb_shinfo(skb)->gso_type |= SKB_GSO_FRAGLIST | SKB_GSO_TCPV4;
+		skb_shinfo(skb)->gso_segs = NAPI_GRO_CB(skb)->count;
+
+		__skb_incr_checksum_unnecessary(skb);
+
+		return 0;
+	}
+
 	th->check = ~tcp_v4_check(skb->len - thoff, iph->saddr,
 				  iph->daddr, 0);
 
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 3498dd1d0694..a3cd546a1aea 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -433,33 +433,6 @@  static struct sk_buff *udp4_ufo_fragment(struct sk_buff *skb,
 	return segs;
 }
 
-static int skb_gro_receive_list(struct sk_buff *p, struct sk_buff *skb)
-{
-	if (unlikely(p->len + skb->len >= 65536))
-		return -E2BIG;
-
-	if (NAPI_GRO_CB(p)->last == p)
-		skb_shinfo(p)->frag_list = skb;
-	else
-		NAPI_GRO_CB(p)->last->next = skb;
-
-	skb_pull(skb, skb_gro_offset(skb));
-
-	NAPI_GRO_CB(p)->last = skb;
-	NAPI_GRO_CB(p)->count++;
-	p->data_len += skb->len;
-
-	/* sk ownership - if any - completely transferred to the aggregated packet */
-	skb->destructor = NULL;
-	skb->sk = NULL;
-	p->truesize += skb->truesize;
-	p->len += skb->len;
-
-	NAPI_GRO_CB(skb)->same_flow = 1;
-
-	return 0;
-}
-
 
 #define UDP_GRO_CNT_MAX 64
 static struct sk_buff *udp_gro_receive_segment(struct list_head *head,
diff --git a/net/ipv6/tcpv6_offload.c b/net/ipv6/tcpv6_offload.c
index 4b07d1e6c952..7c82532d8aa7 100644
--- a/net/ipv6/tcpv6_offload.c
+++ b/net/ipv6/tcpv6_offload.c
@@ -7,12 +7,56 @@ 
  */
 #include <linux/indirect_call_wrapper.h>
 #include <linux/skbuff.h>
+#include <net/inet6_hashtables.h>
 #include <net/gro.h>
 #include <net/protocol.h>
 #include <net/tcp.h>
 #include <net/ip6_checksum.h>
 #include "ip6_offload.h"
 
+static bool tcp6_check_fraglist_gro(struct sk_buff *skb)
+{
+	const struct ipv6hdr *hdr = skb_gro_network_header(skb);
+	struct net *net = dev_net(skb->dev);
+	unsigned int off, hlen, thlen;
+	struct tcphdr *th;
+	struct sock *sk;
+	int iif, sdif;
+
+	if (!(skb->dev->features & NETIF_F_GRO_FRAGLIST))
+		return false;
+
+	inet6_get_iif_sdif(skb, &iif, &sdif);
+
+	off = skb_gro_offset(skb);
+	hlen = off + sizeof(*th);
+	th = skb_gro_header(skb, hlen, off);
+	if (unlikely(!th))
+		return false;
+
+	thlen = th->doff * 4;
+	if (thlen < sizeof(*th))
+		return false;
+
+	hlen = off + thlen;
+	if (!skb_gro_may_pull(skb, hlen)) {
+		th = skb_gro_header_slow(skb, hlen, off);
+		if (unlikely(!th))
+			return false;
+	}
+
+	sk = __inet6_lookup_established(net, net->ipv4.tcp_death_row.hashinfo,
+					&hdr->saddr, th->source,
+					&hdr->daddr, ntohs(th->dest),
+					iif, sdif);
+	if (!sk)
+		return true;
+
+	sock_put(sk);
+
+	return false;
+}
+
 INDIRECT_CALLABLE_SCOPE
 struct sk_buff *tcp6_gro_receive(struct list_head *head, struct sk_buff *skb)
 {
@@ -24,7 +68,7 @@  struct sk_buff *tcp6_gro_receive(struct list_head *head, struct sk_buff *skb)
 		return NULL;
 	}
 
-	return tcp_gro_receive(head, skb);
+	return tcp_gro_receive(head, skb, tcp6_check_fraglist_gro(skb));
 }
 
 INDIRECT_CALLABLE_SCOPE int tcp6_gro_complete(struct sk_buff *skb, int thoff)
@@ -32,6 +76,15 @@  INDIRECT_CALLABLE_SCOPE int tcp6_gro_complete(struct sk_buff *skb, int thoff)
 	const struct ipv6hdr *iph = ipv6_hdr(skb);
 	struct tcphdr *th = tcp_hdr(skb);
 
+	if (NAPI_GRO_CB(skb)->is_flist) {
+		skb_shinfo(skb)->gso_type |= SKB_GSO_FRAGLIST | SKB_GSO_TCPV6;
+		skb_shinfo(skb)->gso_segs = NAPI_GRO_CB(skb)->count;
+
+		__skb_incr_checksum_unnecessary(skb);
+
+		return 0;
+	}
+
 	th->check = ~tcp_v6_check(skb->len - thoff, &iph->saddr,
 				  &iph->daddr, 0);
 	skb_shinfo(skb)->gso_type |= SKB_GSO_TCPV6;
@@ -51,6 +104,9 @@  static struct sk_buff *tcp6_gso_segment(struct sk_buff *skb,
 	if (!pskb_may_pull(skb, sizeof(*th)))
 		return ERR_PTR(-EINVAL);
 
+	if (skb_shinfo(skb)->gso_type & SKB_GSO_FRAGLIST)
+		return skb_segment_list(skb, features, skb_mac_header_len(skb));
+
 	if (unlikely(skb->ip_summed != CHECKSUM_PARTIAL)) {
 		const struct ipv6hdr *ipv6h = ipv6_hdr(skb);
 		struct tcphdr *th = tcp_hdr(skb);