diff mbox series

[v4,net-next,09/14] gro: prevent ACE field corruption & better AccECN handling

Message ID 20241021215910.59767-10-chia-yu.chang@nokia-bell-labs.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series AccECN protocol preparation patch series | expand

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next, async
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 5 this patch: 5
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 6 of 6 maintainers
netdev/build_clang success Errors and warnings before: 3 this patch: 3
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 8 this patch: 8
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 16 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-10-23--12-00 (tests: 777)

Commit Message

Chia-Yu Chang (Nokia) Oct. 21, 2024, 9:59 p.m. UTC
From: Ilpo Järvinen <ij@kernel.org>

There are important differences in how the CWR field behaves
in RFC3168 and AccECN. With AccECN, CWR flag is part of the
ACE counter and its changes are important so adjust the flags
changed mask accordingly.

Also, if CWR is there, set the Accurate ECN GSO flag to avoid
corrupting CWR flag somewhere.

Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
 net/ipv4/tcp_offload.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Paolo Abeni Oct. 29, 2024, 12:03 p.m. UTC | #1
On 10/21/24 23:59, chia-yu.chang@nokia-bell-labs.com wrote:
> From: Ilpo Järvinen <ij@kernel.org>
> 
> There are important differences in how the CWR field behaves
> in RFC3168 and AccECN. With AccECN, CWR flag is part of the
> ACE counter and its changes are important so adjust the flags
> changed mask accordingly.
> 
> Also, if CWR is there, set the Accurate ECN GSO flag to avoid
> corrupting CWR flag somewhere.
> 
> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> ---
>  net/ipv4/tcp_offload.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
> index 0b05f30e9e5f..f59762d88c38 100644
> --- a/net/ipv4/tcp_offload.c
> +++ b/net/ipv4/tcp_offload.c
> @@ -329,7 +329,7 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb,
>  	th2 = tcp_hdr(p);
>  	flush = (__force int)(flags & TCP_FLAG_CWR);
>  	flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
> -		  ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
> +		  ~(TCP_FLAG_FIN | TCP_FLAG_PSH));

If I read correctly, if the peer is using RFC3168 and TSO_ECN, GRO will
now pump into the stack twice the number of packets it was doing prior
to this patch, am I correct?

That is likely causing measurable performance regressions.

>  	flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
>  	for (i = sizeof(*th); i < thlen; i += 4)
>  		flush |= *(u32 *)((u8 *)th + i) ^
> @@ -405,7 +405,7 @@ void tcp_gro_complete(struct sk_buff *skb)
>  	shinfo->gso_segs = NAPI_GRO_CB(skb)->count;
>  
>  	if (th->cwr)
> -		shinfo->gso_type |= SKB_GSO_TCP_ECN;
> +		shinfo->gso_type |= SKB_GSO_TCP_ACCECN;

If this packet is forwarded, it will not leverage TSO anymore - with
current H/W.

I think we need a way to enable this feature conditionally, but I fear
another sysctl will be ugly and the additional conditionals will not be
good for GRO.

Smarter suggestions welcome ;)

Cheers,

Paolo

>  }
>  EXPORT_SYMBOL(tcp_gro_complete);
>
Ilpo Järvinen Oct. 29, 2024, 9:17 p.m. UTC | #2
On Tue, 29 Oct 2024, Paolo Abeni wrote:
> On 10/21/24 23:59, chia-yu.chang@nokia-bell-labs.com wrote:
> > From: Ilpo Järvinen <ij@kernel.org>
> > 
> > There are important differences in how the CWR field behaves
> > in RFC3168 and AccECN. With AccECN, CWR flag is part of the
> > ACE counter and its changes are important so adjust the flags
> > changed mask accordingly.
> > 
> > Also, if CWR is there, set the Accurate ECN GSO flag to avoid
> > corrupting CWR flag somewhere.
> > 
> > Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> > Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> > ---
> >  net/ipv4/tcp_offload.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
> > index 0b05f30e9e5f..f59762d88c38 100644
> > --- a/net/ipv4/tcp_offload.c
> > +++ b/net/ipv4/tcp_offload.c
> > @@ -329,7 +329,7 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb,
> >  	th2 = tcp_hdr(p);
> >  	flush = (__force int)(flags & TCP_FLAG_CWR);
> >  	flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
> > -		  ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
> > +		  ~(TCP_FLAG_FIN | TCP_FLAG_PSH));
> 
> If I read correctly, if the peer is using RFC3168 and TSO_ECN, GRO will
> now pump into the stack twice the number of packets it was doing prior
> to this patch, am I correct?
> 
> That is likely causing measurable performance regressions.

Hi Paolo,

Thanks for taking a look!

While it's true on surface that this might cause some more packets with 
RFC3168 (by design, as network cannot know if the sender is using RFC3168 
or not), the important question is the scale how many of extra packets 
will occur in practice.

First of all, RFC3168 requires CWR flag to be sent no more frequently 
than once per window of data, or in other words, once per RTT. And that
means just one packet, not e.g. all packets of a super-skb (the RFC3168
signalling will lose its integrity if this is violated by the sender).

Secondly, the TCP sender uses CWR flag to indicate it just halved its 
congestion window which mean it is sending half the amount of packets in 
this window than in the previous window (analoguous to halving sending 
rate). 2 RTTs with CWR each means two window reductions (this behavior 
is spec'ed in RFC3168).

So lets say the sender was using 100 packets congestion window, this 
change will add one packet to 50 packets on this next RTT. Note those are
raw numbers of packets on wire and do not tell how many packets GRO 
combined into each super-skb which will wary case-by-case basis. 
Regardless, I suspect the extra packet added to the half of the packets 
will be hard/impossible to measure to cause a performance regression.

This change would double the number of packets only if the congestion 
window is 1 or 2 packets and in that case TSO/GSO/GRO benefits will be 
pretty small to begin with (or even counterproductive). Also, the 
traditional TCP congestion control (RFC3168 included) has many issues 
anyway with that small windows because it doesn't deal with fractional 
congestion windows well.

> >  	flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
> >  	for (i = sizeof(*th); i < thlen; i += 4)
> >  		flush |= *(u32 *)((u8 *)th + i) ^
> > @@ -405,7 +405,7 @@ void tcp_gro_complete(struct sk_buff *skb)
> >  	shinfo->gso_segs = NAPI_GRO_CB(skb)->count;
> >  
> >  	if (th->cwr)
> > -		shinfo->gso_type |= SKB_GSO_TCP_ECN;
> > +		shinfo->gso_type |= SKB_GSO_TCP_ACCECN;
> 
> If this packet is forwarded, it will not leverage TSO anymore - with
> current H/W.
> 
> I think we need a way to enable this feature conditionally, but I fear
> another sysctl will be ugly and the additional conditionals will not be
> good for GRO.
>
> Smarter suggestions welcome ;)

Well, it is already very selectively _conditional_, SKB_GSO_TCP_ACCECN is 
only set for the skb when CWR is set. That is, once per RTT (data window) 
when it comes to RFC3168. 

I don't have any source for this (other than reading many many tcpdumps 
in the past) but I believe the percentage of packets with CWR set (due to 
RFC3168 signalling) is going to be very small overall.

Do you think that is not good enough?

To answer more generally to your suggestion on making it conditional based 
on some other logic, it would mean you accept network middleboxes are 
allowed to corrupt AccECN ACE field when forwarding. If RFC3168 TSO/GSO 
trickery remains in use (without a middlebox explicitly tracking the 
connection had negotiated RFC3168), a forwarder won't be able to reproduce 
the exactly same stream of TCP packets headers thus corrupting non-RFC3168 
use of CWR flag. It's not something any middlebox should be doing (I hope 
we agree on this as a general principle)!
diff mbox series

Patch

diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index 0b05f30e9e5f..f59762d88c38 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -329,7 +329,7 @@  struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb,
 	th2 = tcp_hdr(p);
 	flush = (__force int)(flags & TCP_FLAG_CWR);
 	flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
-		  ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
+		  ~(TCP_FLAG_FIN | TCP_FLAG_PSH));
 	flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
 	for (i = sizeof(*th); i < thlen; i += 4)
 		flush |= *(u32 *)((u8 *)th + i) ^
@@ -405,7 +405,7 @@  void tcp_gro_complete(struct sk_buff *skb)
 	shinfo->gso_segs = NAPI_GRO_CB(skb)->count;
 
 	if (th->cwr)
-		shinfo->gso_type |= SKB_GSO_TCP_ECN;
+		shinfo->gso_type |= SKB_GSO_TCP_ACCECN;
 }
 EXPORT_SYMBOL(tcp_gro_complete);