Message ID | 20241021215910.59767-10-chia-yu.chang@nokia-bell-labs.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | AccECN protocol preparation patch series | expand |
On 10/21/24 23:59, chia-yu.chang@nokia-bell-labs.com wrote: > From: Ilpo Järvinen <ij@kernel.org> > > There are important differences in how the CWR field behaves > in RFC3168 and AccECN. With AccECN, CWR flag is part of the > ACE counter and its changes are important so adjust the flags > changed mask accordingly. > > Also, if CWR is there, set the Accurate ECN GSO flag to avoid > corrupting CWR flag somewhere. > > Signed-off-by: Ilpo Järvinen <ij@kernel.org> > Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> > --- > net/ipv4/tcp_offload.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c > index 0b05f30e9e5f..f59762d88c38 100644 > --- a/net/ipv4/tcp_offload.c > +++ b/net/ipv4/tcp_offload.c > @@ -329,7 +329,7 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb, > th2 = tcp_hdr(p); > flush = (__force int)(flags & TCP_FLAG_CWR); > flush |= (__force int)((flags ^ tcp_flag_word(th2)) & > - ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH)); > + ~(TCP_FLAG_FIN | TCP_FLAG_PSH)); If I read correctly, if the peer is using RFC3168 and TSO_ECN, GRO will now pump into the stack twice the number of packets it was doing prior to this patch, am I correct? That is likely causing measurable performance regressions. > flush |= (__force int)(th->ack_seq ^ th2->ack_seq); > for (i = sizeof(*th); i < thlen; i += 4) > flush |= *(u32 *)((u8 *)th + i) ^ > @@ -405,7 +405,7 @@ void tcp_gro_complete(struct sk_buff *skb) > shinfo->gso_segs = NAPI_GRO_CB(skb)->count; > > if (th->cwr) > - shinfo->gso_type |= SKB_GSO_TCP_ECN; > + shinfo->gso_type |= SKB_GSO_TCP_ACCECN; If this packet is forwarded, it will not leverage TSO anymore - with current H/W. I think we need a way to enable this feature conditionally, but I fear another sysctl will be ugly and the additional conditionals will not be good for GRO. Smarter suggestions welcome ;) Cheers, Paolo > } > EXPORT_SYMBOL(tcp_gro_complete); >
On Tue, 29 Oct 2024, Paolo Abeni wrote: > On 10/21/24 23:59, chia-yu.chang@nokia-bell-labs.com wrote: > > From: Ilpo Järvinen <ij@kernel.org> > > > > There are important differences in how the CWR field behaves > > in RFC3168 and AccECN. With AccECN, CWR flag is part of the > > ACE counter and its changes are important so adjust the flags > > changed mask accordingly. > > > > Also, if CWR is there, set the Accurate ECN GSO flag to avoid > > corrupting CWR flag somewhere. > > > > Signed-off-by: Ilpo Järvinen <ij@kernel.org> > > Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> > > --- > > net/ipv4/tcp_offload.c | 4 ++-- > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c > > index 0b05f30e9e5f..f59762d88c38 100644 > > --- a/net/ipv4/tcp_offload.c > > +++ b/net/ipv4/tcp_offload.c > > @@ -329,7 +329,7 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb, > > th2 = tcp_hdr(p); > > flush = (__force int)(flags & TCP_FLAG_CWR); > > flush |= (__force int)((flags ^ tcp_flag_word(th2)) & > > - ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH)); > > + ~(TCP_FLAG_FIN | TCP_FLAG_PSH)); > > If I read correctly, if the peer is using RFC3168 and TSO_ECN, GRO will > now pump into the stack twice the number of packets it was doing prior > to this patch, am I correct? > > That is likely causing measurable performance regressions. Hi Paolo, Thanks for taking a look! While it's true on surface that this might cause some more packets with RFC3168 (by design, as network cannot know if the sender is using RFC3168 or not), the important question is the scale how many of extra packets will occur in practice. First of all, RFC3168 requires CWR flag to be sent no more frequently than once per window of data, or in other words, once per RTT. And that means just one packet, not e.g. all packets of a super-skb (the RFC3168 signalling will lose its integrity if this is violated by the sender). Secondly, the TCP sender uses CWR flag to indicate it just halved its congestion window which mean it is sending half the amount of packets in this window than in the previous window (analoguous to halving sending rate). 2 RTTs with CWR each means two window reductions (this behavior is spec'ed in RFC3168). So lets say the sender was using 100 packets congestion window, this change will add one packet to 50 packets on this next RTT. Note those are raw numbers of packets on wire and do not tell how many packets GRO combined into each super-skb which will wary case-by-case basis. Regardless, I suspect the extra packet added to the half of the packets will be hard/impossible to measure to cause a performance regression. This change would double the number of packets only if the congestion window is 1 or 2 packets and in that case TSO/GSO/GRO benefits will be pretty small to begin with (or even counterproductive). Also, the traditional TCP congestion control (RFC3168 included) has many issues anyway with that small windows because it doesn't deal with fractional congestion windows well. > > flush |= (__force int)(th->ack_seq ^ th2->ack_seq); > > for (i = sizeof(*th); i < thlen; i += 4) > > flush |= *(u32 *)((u8 *)th + i) ^ > > @@ -405,7 +405,7 @@ void tcp_gro_complete(struct sk_buff *skb) > > shinfo->gso_segs = NAPI_GRO_CB(skb)->count; > > > > if (th->cwr) > > - shinfo->gso_type |= SKB_GSO_TCP_ECN; > > + shinfo->gso_type |= SKB_GSO_TCP_ACCECN; > > If this packet is forwarded, it will not leverage TSO anymore - with > current H/W. > > I think we need a way to enable this feature conditionally, but I fear > another sysctl will be ugly and the additional conditionals will not be > good for GRO. > > Smarter suggestions welcome ;) Well, it is already very selectively _conditional_, SKB_GSO_TCP_ACCECN is only set for the skb when CWR is set. That is, once per RTT (data window) when it comes to RFC3168. I don't have any source for this (other than reading many many tcpdumps in the past) but I believe the percentage of packets with CWR set (due to RFC3168 signalling) is going to be very small overall. Do you think that is not good enough? To answer more generally to your suggestion on making it conditional based on some other logic, it would mean you accept network middleboxes are allowed to corrupt AccECN ACE field when forwarding. If RFC3168 TSO/GSO trickery remains in use (without a middlebox explicitly tracking the connection had negotiated RFC3168), a forwarder won't be able to reproduce the exactly same stream of TCP packets headers thus corrupting non-RFC3168 use of CWR flag. It's not something any middlebox should be doing (I hope we agree on this as a general principle)!
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c index 0b05f30e9e5f..f59762d88c38 100644 --- a/net/ipv4/tcp_offload.c +++ b/net/ipv4/tcp_offload.c @@ -329,7 +329,7 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb, th2 = tcp_hdr(p); flush = (__force int)(flags & TCP_FLAG_CWR); flush |= (__force int)((flags ^ tcp_flag_word(th2)) & - ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH)); + ~(TCP_FLAG_FIN | TCP_FLAG_PSH)); flush |= (__force int)(th->ack_seq ^ th2->ack_seq); for (i = sizeof(*th); i < thlen; i += 4) flush |= *(u32 *)((u8 *)th + i) ^ @@ -405,7 +405,7 @@ void tcp_gro_complete(struct sk_buff *skb) shinfo->gso_segs = NAPI_GRO_CB(skb)->count; if (th->cwr) - shinfo->gso_type |= SKB_GSO_TCP_ECN; + shinfo->gso_type |= SKB_GSO_TCP_ACCECN; } EXPORT_SYMBOL(tcp_gro_complete);