Message ID | 20240307220016.3147666-1-edumazet@google.com (mailing list archive) |
---|---|
State | Accepted |
Commit | 08842c43d0165b0ed78907fd8cc92ce17d857913 |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net-next] udp: no longer touch sk->sk_refcnt in early demux | expand |
On Thu, 2024-03-07 at 22:00 +0000, Eric Dumazet wrote: > After commits ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU") > and 7ae215d23c12 ("bpf: Don't refcount LISTEN sockets in sk_assign()") > UDP early demux no longer need to grab a refcount on the UDP socket. > > This save two atomic operations per incoming packet for connected > sockets. This reminds me of a old series: https://lore.kernel.org/netdev/cover.1506114055.git.pabeni@redhat.com/ and I'm wondering if we could reconsider such option. > Signed-off-by: Eric Dumazet <edumazet@google.com> > Cc: Martin KaFai Lau <kafai@fb.com> > Cc: Joe Stringer <joe@wand.net.nz> > Cc: Alexei Starovoitov <ast@kernel.org> > Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> > Cc: Kuniyuki Iwashima <kuniyu@amazon.com> > --- > net/ipv4/udp.c | 5 +++-- > net/ipv6/udp.c | 5 +++-- > 2 files changed, 6 insertions(+), 4 deletions(-) > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c > index a8acea17b4e5344d022ae8f8eb674d1a36f8035a..e43ad1d846bdc2ddf5767606b78bbd055f692aa8 100644 > --- a/net/ipv4/udp.c > +++ b/net/ipv4/udp.c > @@ -2570,11 +2570,12 @@ int udp_v4_early_demux(struct sk_buff *skb) > uh->source, iph->saddr, dif, sdif); > } > > - if (!sk || !refcount_inc_not_zero(&sk->sk_refcnt)) > + if (!sk) > return 0; > > skb->sk = sk; > - skb->destructor = sock_efree; > + DEBUG_NET_WARN_ON_ONCE(sk_is_refcounted(sk)); > + skb->destructor = sock_pfree; I *think* that the skb may escape the current rcu section if e.g. if matches a nf dup target in the input tables. Back then I tried to implement some debug infra to track such accesses: https://lore.kernel.org/lkml/cover.1507294365.git.pabeni@redhat.com/ which was buggy (prone to false negative). I think it can be improved to something more reliable, perhaps I should revamp it? I'm also wondering if the DEBUG_NET_WARN_ON_ONCE is worthy?!? the sk is an hashed UDP socket so is a full sock and has the bit SOCK_RCU_FREE set. Perhaps we could use a simple 'noop' destructor as in: https://lore.kernel.org/netdev/b16163e3a4fa4d772edeabd8743acb4a07206bb9.1506114055.git.pabeni@redhat.com/ Thanks! Paolo
On Fri, 2024-03-08 at 09:37 +0100, Paolo Abeni wrote: > On Thu, 2024-03-07 at 22:00 +0000, Eric Dumazet wrote: > > After commits ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU") > > and 7ae215d23c12 ("bpf: Don't refcount LISTEN sockets in sk_assign()") > > UDP early demux no longer need to grab a refcount on the UDP socket. > > > > This save two atomic operations per incoming packet for connected > > sockets. > > This reminds me of a old series: > > https://lore.kernel.org/netdev/cover.1506114055.git.pabeni@redhat.com/ > > and I'm wondering if we could reconsider such option. > > > Signed-off-by: Eric Dumazet <edumazet@google.com> > > Cc: Martin KaFai Lau <kafai@fb.com> > > Cc: Joe Stringer <joe@wand.net.nz> > > Cc: Alexei Starovoitov <ast@kernel.org> > > Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> > > Cc: Kuniyuki Iwashima <kuniyu@amazon.com> > > --- > > net/ipv4/udp.c | 5 +++-- > > net/ipv6/udp.c | 5 +++-- > > 2 files changed, 6 insertions(+), 4 deletions(-) > > > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c > > index a8acea17b4e5344d022ae8f8eb674d1a36f8035a..e43ad1d846bdc2ddf5767606b78bbd055f692aa8 100644 > > --- a/net/ipv4/udp.c > > +++ b/net/ipv4/udp.c > > @@ -2570,11 +2570,12 @@ int udp_v4_early_demux(struct sk_buff *skb) > > uh->source, iph->saddr, dif, sdif); > > } > > > > - if (!sk || !refcount_inc_not_zero(&sk->sk_refcnt)) > > + if (!sk) > > return 0; > > > > skb->sk = sk; > > - skb->destructor = sock_efree; > > + DEBUG_NET_WARN_ON_ONCE(sk_is_refcounted(sk)); > > + skb->destructor = sock_pfree; > > I *think* that the skb may escape the current rcu section if e.g. if > matches a nf dup target in the input tables. > > Back then I tried to implement some debug infra to track such accesses: > > https://lore.kernel.org/lkml/cover.1507294365.git.pabeni@redhat.com/ > > which was buggy (prone to false negative). I think it can be improved > to something more reliable, perhaps I should revamp it? > > I'm also wondering if the DEBUG_NET_WARN_ON_ONCE is worthy?!? the sk is > an hashed UDP socket so is a full sock and has the bit SOCK_RCU_FREE > set. > > Perhaps we could use a simple 'noop' destructor as in: > > https://lore.kernel.org/netdev/b16163e3a4fa4d772edeabd8743acb4a07206bb9.1506114055.git.pabeni@redhat.com/ Please ignore this last part, too late I noticed we need 'sock_pfree' to let inet_steal_sock() work as expected. Paolo
On Fri, Mar 8, 2024 at 9:37 AM Paolo Abeni <pabeni@redhat.com> wrote: > > On Thu, 2024-03-07 at 22:00 +0000, Eric Dumazet wrote: > > After commits ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU") > > and 7ae215d23c12 ("bpf: Don't refcount LISTEN sockets in sk_assign()") > > UDP early demux no longer need to grab a refcount on the UDP socket. > > > > This save two atomic operations per incoming packet for connected > > sockets. > > This reminds me of a old series: > > https://lore.kernel.org/netdev/cover.1506114055.git.pabeni@redhat.com/ > > and I'm wondering if we could reconsider such option. > > > Signed-off-by: Eric Dumazet <edumazet@google.com> > > Cc: Martin KaFai Lau <kafai@fb.com> > > Cc: Joe Stringer <joe@wand.net.nz> > > Cc: Alexei Starovoitov <ast@kernel.org> > > Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> > > Cc: Kuniyuki Iwashima <kuniyu@amazon.com> > > --- > > net/ipv4/udp.c | 5 +++-- > > net/ipv6/udp.c | 5 +++-- > > 2 files changed, 6 insertions(+), 4 deletions(-) > > > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c > > index a8acea17b4e5344d022ae8f8eb674d1a36f8035a..e43ad1d846bdc2ddf5767606b78bbd055f692aa8 100644 > > --- a/net/ipv4/udp.c > > +++ b/net/ipv4/udp.c > > @@ -2570,11 +2570,12 @@ int udp_v4_early_demux(struct sk_buff *skb) > > uh->source, iph->saddr, dif, sdif); > > } > > > > - if (!sk || !refcount_inc_not_zero(&sk->sk_refcnt)) > > + if (!sk) > > return 0; > > > > skb->sk = sk; > > - skb->destructor = sock_efree; > > + DEBUG_NET_WARN_ON_ONCE(sk_is_refcounted(sk)); > > + skb->destructor = sock_pfree; > > I *think* that the skb may escape the current rcu section if e.g. if > matches a nf dup target in the input tables. You mean the netfilter queueing stuff perhaps ? This is already safe, it uses a refcount_inc_not_zero(&sk->sk_refcnt): if (skb_sk_is_prefetched(skb)) { struct sock *sk = skb->sk; if (!sk_is_refcounted(sk)) { if (!refcount_inc_not_zero(&sk->sk_refcnt)) return -ENOTCONN; /* drop refcount on skb_orphan */ skb->destructor = sock_edemux; } } I would think a duplicate can not duplicate skb->sk in general, or must also attempt an refcount_inc_not_zero(&sk->sk_refcnt) and use a related destructor. > > Back then I tried to implement some debug infra to track such accesses: > > https://lore.kernel.org/lkml/cover.1507294365.git.pabeni@redhat.com/ > > which was buggy (prone to false negative). I think it can be improved > to something more reliable, perhaps I should revamp it? > > I'm also wondering if the DEBUG_NET_WARN_ON_ONCE is worthy?!? the sk is > an hashed UDP socket so is a full sock and has the bit SOCK_RCU_FREE > set. This was mostly to catch any future issues and related to my use of sock_pfree() DEBUG_NET_WARN_ON_ONCE() is a nop, unless you compile a DEV kernel. > > Perhaps we could use a simple 'noop' destructor as in: > > https://lore.kernel.org/netdev/b16163e3a4fa4d772edeabd8743acb4a07206bb9.1506114055.git.pabeni@redhat.com/ > I think we need sock_pfree() for inet_steal_sock(), I might be wrong.
On Fri, Mar 8, 2024 at 10:20 AM Paolo Abeni <pabeni@redhat.com> wrote: > > > Please ignore this last part, too late I noticed we need 'sock_pfree' > to let inet_steal_sock() work as expected. Yes, this is skb_sk_is_prefetched() part.
On Fri, 2024-03-08 at 10:21 +0100, Eric Dumazet wrote: > > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c > > > index a8acea17b4e5344d022ae8f8eb674d1a36f8035a..e43ad1d846bdc2ddf5767606b78bbd055f692aa8 100644 > > > --- a/net/ipv4/udp.c > > > +++ b/net/ipv4/udp.c > > > @@ -2570,11 +2570,12 @@ int udp_v4_early_demux(struct sk_buff *skb) > > > uh->source, iph->saddr, dif, sdif); > > > } > > > > > > - if (!sk || !refcount_inc_not_zero(&sk->sk_refcnt)) > > > + if (!sk) > > > return 0; > > > > > > skb->sk = sk; > > > - skb->destructor = sock_efree; > > > + DEBUG_NET_WARN_ON_ONCE(sk_is_refcounted(sk)); > > > + skb->destructor = sock_pfree; > > > > I *think* that the skb may escape the current rcu section if e.g. if > > matches a nf dup target in the input tables. > > You mean the netfilter queueing stuff perhaps ? > > This is already safe, it uses a refcount_inc_not_zero(&sk->sk_refcnt): > > if (skb_sk_is_prefetched(skb)) { > struct sock *sk = skb->sk; > > if (!sk_is_refcounted(sk)) { > if (!refcount_inc_not_zero(&sk->sk_refcnt)) > return -ENOTCONN; > > /* drop refcount on skb_orphan */ > skb->destructor = sock_edemux; > } > } > > I would think a duplicate can not duplicate skb->sk in general, or must also > attempt an refcount_inc_not_zero(&sk->sk_refcnt) and use a related destructor. Right, looks safe. Acked-by: Paolo Abeni <pabeni@redhat.com>
On Fri, Mar 8, 2024 at 12:11 PM Paolo Abeni <pabeni@redhat.com> wrote: > Right, looks safe. > > Acked-by: Paolo Abeni <pabeni@redhat.com> Thanks ! BTW, I was thinking (for next cycle) to let users (or TCP stack with ad-hoc heuristics ) opt-in TCP sockets to SOCK_RCU_FREE. This would avoid the refcnt dance for their incoming packets.
Hello: This patch was applied to netdev/net-next.git (main) by David S. Miller <davem@davemloft.net>: On Thu, 7 Mar 2024 22:00:16 +0000 you wrote: > After commits ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU") > and 7ae215d23c12 ("bpf: Don't refcount LISTEN sockets in sk_assign()") > UDP early demux no longer need to grab a refcount on the UDP socket. > > This save two atomic operations per incoming packet for connected > sockets. > > [...] Here is the summary with links: - [net-next] udp: no longer touch sk->sk_refcnt in early demux https://git.kernel.org/netdev/net-next/c/08842c43d016 You are awesome, thank you!
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index a8acea17b4e5344d022ae8f8eb674d1a36f8035a..e43ad1d846bdc2ddf5767606b78bbd055f692aa8 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -2570,11 +2570,12 @@ int udp_v4_early_demux(struct sk_buff *skb) uh->source, iph->saddr, dif, sdif); } - if (!sk || !refcount_inc_not_zero(&sk->sk_refcnt)) + if (!sk) return 0; skb->sk = sk; - skb->destructor = sock_efree; + DEBUG_NET_WARN_ON_ONCE(sk_is_refcounted(sk)); + skb->destructor = sock_pfree; dst = rcu_dereference(sk->sk_rx_dst); if (dst) diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index 3f2249b4cd5f6a594dd9768e29f20f0d9a57faed..fad6667fad6644db8c679ae9b723ccda15edaede 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -1101,11 +1101,12 @@ void udp_v6_early_demux(struct sk_buff *skb) else return; - if (!sk || !refcount_inc_not_zero(&sk->sk_refcnt)) + if (!sk) return; skb->sk = sk; - skb->destructor = sock_efree; + DEBUG_NET_WARN_ON_ONCE(sk_is_refcounted(sk)); + skb->destructor = sock_pfree; dst = rcu_dereference(sk->sk_rx_dst); if (dst)
After commits ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU") and 7ae215d23c12 ("bpf: Don't refcount LISTEN sockets in sk_assign()") UDP early demux no longer need to grab a refcount on the UDP socket. This save two atomic operations per incoming packet for connected sockets. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Martin KaFai Lau <kafai@fb.com> Cc: Joe Stringer <joe@wand.net.nz> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Cc: Kuniyuki Iwashima <kuniyu@amazon.com> --- net/ipv4/udp.c | 5 +++-- net/ipv6/udp.c | 5 +++-- 2 files changed, 6 insertions(+), 4 deletions(-)