Message ID | 20240423125620.3309458-1-edumazet@google.com
---|---
State | Accepted
Commit | ec00ed472bdb7d0af840da68c8c11bff9f4d9caa
Delegated to | Netdev Maintainers
Series | [net-next] tcp: avoid premature drops in tcp_add_backlog()
On Tue, 23 Apr 2024 12:56:20 +0000 Eric Dumazet wrote:
> Subject: [PATCH net-next] tcp: avoid premature drops in tcp_add_backlog()
This is intentionally for net-next?
On Thu, Apr 25, 2024 at 8:46 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 23 Apr 2024 12:56:20 +0000 Eric Dumazet wrote:
> > Subject: [PATCH net-next] tcp: avoid premature drops in tcp_add_backlog()
>
> This is intentionally for net-next?

Yes, this has been broken for a long time.

We can soak this a bit, then it will reach stable trees when this reaches Linus tree?
Hello:

This patch was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Tue, 23 Apr 2024 12:56:20 +0000 you wrote:
> While testing TCP performance with latest trees,
> I saw suspect SOCKET_BACKLOG drops.
>
> tcp_add_backlog() computes its limit with :
>
>     limit = (u32)READ_ONCE(sk->sk_rcvbuf) +
>             (u32)(READ_ONCE(sk->sk_sndbuf) >> 1);
>     limit += 64 * 1024;
>
> [...]

Here is the summary with links:
  - [net-next] tcp: avoid premature drops in tcp_add_backlog()
    https://git.kernel.org/netdev/net-next/c/ec00ed472bdb

You are awesome, thank you!
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 88c83ac4212957f19efad0f967952d2502bdbc7f..e06f0cd04f7eee2b00fcaebe17cbd23c26f1d28f 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1995,7 +1995,7 @@ int tcp_v4_early_demux(struct sk_buff *skb)
 bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb,
 		     enum skb_drop_reason *reason)
 {
-	u32 limit, tail_gso_size, tail_gso_segs;
+	u32 tail_gso_size, tail_gso_segs;
 	struct skb_shared_info *shinfo;
 	const struct tcphdr *th;
 	struct tcphdr *thtail;
@@ -2004,6 +2004,7 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb,
 	bool fragstolen;
 	u32 gso_segs;
 	u32 gso_size;
+	u64 limit;
 	int delta;
 
 	/* In case all data was pulled from skb frags (in __pskb_pull_tail()),
@@ -2099,7 +2100,13 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb,
 	__skb_push(skb, hdrlen);
 
 no_coalesce:
-	limit = (u32)READ_ONCE(sk->sk_rcvbuf) + (u32)(READ_ONCE(sk->sk_sndbuf) >> 1);
+	/* sk->sk_backlog.len is reset only at the end of __release_sock().
+	 * Both sk->sk_backlog.len and sk->sk_rmem_alloc could reach
+	 * sk_rcvbuf in normal conditions.
+	 */
+	limit = ((u64)READ_ONCE(sk->sk_rcvbuf)) << 1;
+
+	limit += ((u32)READ_ONCE(sk->sk_sndbuf)) >> 1;
 
 	/* Only socket owner can try to collapse/prune rx queues
 	 * to reduce memory overhead, so add a little headroom here.
@@ -2107,6 +2114,8 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb,
 	 */
 	limit += 64 * 1024;
 
+	limit = min_t(u64, limit, UINT_MAX);
+
 	if (unlikely(sk_add_backlog(sk, skb, limit))) {
 		bh_unlock_sock(sk);
 		*reason = SKB_DROP_REASON_SOCKET_BACKLOG;
While testing TCP performance with latest trees,
I saw suspect SOCKET_BACKLOG drops.

tcp_add_backlog() computes its limit with :

    limit = (u32)READ_ONCE(sk->sk_rcvbuf) +
            (u32)(READ_ONCE(sk->sk_sndbuf) >> 1);
    limit += 64 * 1024;

This does not take into account that sk->sk_backlog.len
is reset only at the very end of __release_sock().

Both sk->sk_backlog.len and sk->sk_rmem_alloc could reach
sk_rcvbuf in normal conditions.

We should double sk->sk_rcvbuf contribution in the formula
to absorb bubbles in the backlog, which happen more often
for very fast flows.

This change maintains decent protection against abuses.

Fixes: c377411f2494 ("net: sk_add_backlog() take rmem_alloc into account")
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_ipv4.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)
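To make the new arithmetic concrete, here is a minimal user-space sketch (not kernel code) of the idea behind the patch: compute the backlog limit in 64 bits with a doubled sk_rcvbuf contribution, clamp it to UINT_MAX, and compare it against the sum of backlog and receive-queue bytes. The struct fake_sock, backlog_limit() and would_drop() names, and all the buffer sizes are hypothetical simplifications for illustration; they are not the real struct sock layout or the sk_add_backlog() API.

#include <limits.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified stand-in for the few socket fields involved. */
struct fake_sock {
	uint32_t sk_rcvbuf;
	uint32_t sk_sndbuf;
	uint32_t backlog_len;	/* bytes already sitting in the backlog */
	uint32_t rmem_alloc;	/* bytes charged to the receive queue */
};

/* New-style limit: widen to 64 bits so doubling sk_rcvbuf cannot wrap,
 * then clamp back to the 32-bit range used further down.
 */
static uint32_t backlog_limit(const struct fake_sock *sk)
{
	uint64_t limit = ((uint64_t)sk->sk_rcvbuf) << 1;

	limit += sk->sk_sndbuf >> 1;
	limit += 64 * 1024;	/* headroom for collapse/prune by the owner */

	return limit > UINT_MAX ? UINT_MAX : (uint32_t)limit;
}

/* Simplified admission check: both backlog bytes and receive-queue bytes
 * count against the limit, which is why each of them reaching sk_rcvbuf
 * must not be enough to trigger a drop.
 */
static bool would_drop(const struct fake_sock *sk)
{
	uint64_t qsize = (uint64_t)sk->backlog_len + sk->rmem_alloc;

	return qsize > backlog_limit(sk);
}

int main(void)
{
	struct fake_sock sk = {
		.sk_rcvbuf = 8 * 1024 * 1024,	/* hypothetical 8 MiB rcvbuf */
		.sk_sndbuf = 4 * 1024 * 1024,
		.backlog_len = 8 * 1024 * 1024,	/* backlog filled up to sk_rcvbuf */
		.rmem_alloc = 8 * 1024 * 1024,	/* receive queue also at sk_rcvbuf */
	};

	/* Old formula (rcvbuf + sndbuf/2 + 64K ~= 10 MiB) would drop here;
	 * the doubled rcvbuf contribution (~18 MiB) absorbs the bubble.
	 */
	printf("limit=%u drop=%s\n", backlog_limit(&sk),
	       would_drop(&sk) ? "yes" : "no");
	return 0;
}

With the pre-patch formula the example socket would be over the limit and the packet dropped, even though both queues are merely at sk_rcvbuf. The doubled contribution absorbs that bubble, and the min_t() clamp in the real patch keeps the 64-bit intermediate within the 32-bit range expected by the backlog helpers.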