diff mbox series

[net-next] udp: no longer touch sk->sk_refcnt in early demux

Message ID 20240307220016.3147666-1-edumazet@google.com (mailing list archive)
State Accepted
Commit 08842c43d0165b0ed78907fd8cc92ce17d857913
Delegated to: Netdev Maintainers
Headers show
Series [net-next] udp: no longer touch sk->sk_refcnt in early demux | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 948 this patch: 948
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers warning 2 maintainers not CCed: bpf@vger.kernel.org dsahern@kernel.org
netdev/build_clang success Errors and warnings before: 957 this patch: 957
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 964 this patch: 964
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 28 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-03-08--09-00 (tests: 890)

Commit Message

Eric Dumazet March 7, 2024, 10 p.m. UTC
After commits ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU")
and 7ae215d23c12 ("bpf: Don't refcount LISTEN sockets in sk_assign()")
UDP early demux no longer need to grab a refcount on the UDP socket.

This save two atomic operations per incoming packet for connected
sockets.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Joe Stringer <joe@wand.net.nz>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/ipv4/udp.c | 5 +++--
 net/ipv6/udp.c | 5 +++--
 2 files changed, 6 insertions(+), 4 deletions(-)

Comments

Paolo Abeni March 8, 2024, 8:37 a.m. UTC | #1
On Thu, 2024-03-07 at 22:00 +0000, Eric Dumazet wrote:
> After commits ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU")
> and 7ae215d23c12 ("bpf: Don't refcount LISTEN sockets in sk_assign()")
> UDP early demux no longer need to grab a refcount on the UDP socket.
> 
> This save two atomic operations per incoming packet for connected
> sockets.

This reminds me of a old series:

https://lore.kernel.org/netdev/cover.1506114055.git.pabeni@redhat.com/

and I'm wondering if we could reconsider such option.

> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Martin KaFai Lau <kafai@fb.com>
> Cc: Joe Stringer <joe@wand.net.nz>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
>  net/ipv4/udp.c | 5 +++--
>  net/ipv6/udp.c | 5 +++--
>  2 files changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index a8acea17b4e5344d022ae8f8eb674d1a36f8035a..e43ad1d846bdc2ddf5767606b78bbd055f692aa8 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -2570,11 +2570,12 @@ int udp_v4_early_demux(struct sk_buff *skb)
>  					     uh->source, iph->saddr, dif, sdif);
>  	}
>  
> -	if (!sk || !refcount_inc_not_zero(&sk->sk_refcnt))
> +	if (!sk)
>  		return 0;
>  
>  	skb->sk = sk;
> -	skb->destructor = sock_efree;
> +	DEBUG_NET_WARN_ON_ONCE(sk_is_refcounted(sk));
> +	skb->destructor = sock_pfree;

I *think* that the skb may escape the current rcu section if e.g. if
matches a nf dup target in the input tables.

Back then I tried to implement some debug infra to track such accesses:

https://lore.kernel.org/lkml/cover.1507294365.git.pabeni@redhat.com/

which was buggy (prone to false negative). I think it can be improved
to something more reliable, perhaps I should revamp it?

I'm also wondering if the DEBUG_NET_WARN_ON_ONCE is worthy?!? the sk is
an hashed UDP socket so is a full sock and has the bit SOCK_RCU_FREE
set.

Perhaps we could use a simple 'noop' destructor as in:

https://lore.kernel.org/netdev/b16163e3a4fa4d772edeabd8743acb4a07206bb9.1506114055.git.pabeni@redhat.com/


Thanks!

Paolo
Paolo Abeni March 8, 2024, 9:19 a.m. UTC | #2
On Fri, 2024-03-08 at 09:37 +0100, Paolo Abeni wrote:
> On Thu, 2024-03-07 at 22:00 +0000, Eric Dumazet wrote:
> > After commits ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU")
> > and 7ae215d23c12 ("bpf: Don't refcount LISTEN sockets in sk_assign()")
> > UDP early demux no longer need to grab a refcount on the UDP socket.
> > 
> > This save two atomic operations per incoming packet for connected
> > sockets.
> 
> This reminds me of a old series:
> 
> https://lore.kernel.org/netdev/cover.1506114055.git.pabeni@redhat.com/
> 
> and I'm wondering if we could reconsider such option.
> 
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Cc: Martin KaFai Lau <kafai@fb.com>
> > Cc: Joe Stringer <joe@wand.net.nz>
> > Cc: Alexei Starovoitov <ast@kernel.org>
> > Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> > Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
> > ---
> >  net/ipv4/udp.c | 5 +++--
> >  net/ipv6/udp.c | 5 +++--
> >  2 files changed, 6 insertions(+), 4 deletions(-)
> > 
> > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > index a8acea17b4e5344d022ae8f8eb674d1a36f8035a..e43ad1d846bdc2ddf5767606b78bbd055f692aa8 100644
> > --- a/net/ipv4/udp.c
> > +++ b/net/ipv4/udp.c
> > @@ -2570,11 +2570,12 @@ int udp_v4_early_demux(struct sk_buff *skb)
> >  					     uh->source, iph->saddr, dif, sdif);
> >  	}
> >  
> > -	if (!sk || !refcount_inc_not_zero(&sk->sk_refcnt))
> > +	if (!sk)
> >  		return 0;
> >  
> >  	skb->sk = sk;
> > -	skb->destructor = sock_efree;
> > +	DEBUG_NET_WARN_ON_ONCE(sk_is_refcounted(sk));
> > +	skb->destructor = sock_pfree;
> 
> I *think* that the skb may escape the current rcu section if e.g. if
> matches a nf dup target in the input tables.
> 
> Back then I tried to implement some debug infra to track such accesses:
> 
> https://lore.kernel.org/lkml/cover.1507294365.git.pabeni@redhat.com/
> 
> which was buggy (prone to false negative). I think it can be improved
> to something more reliable, perhaps I should revamp it?
> 
> I'm also wondering if the DEBUG_NET_WARN_ON_ONCE is worthy?!? the sk is
> an hashed UDP socket so is a full sock and has the bit SOCK_RCU_FREE
> set.
> 
> Perhaps we could use a simple 'noop' destructor as in:
> 
> https://lore.kernel.org/netdev/b16163e3a4fa4d772edeabd8743acb4a07206bb9.1506114055.git.pabeni@redhat.com/

Please ignore this last part, too late I noticed we need 'sock_pfree'
to let inet_steal_sock() work as expected.

Paolo
Eric Dumazet March 8, 2024, 9:21 a.m. UTC | #3
On Fri, Mar 8, 2024 at 9:37 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Thu, 2024-03-07 at 22:00 +0000, Eric Dumazet wrote:
> > After commits ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU")
> > and 7ae215d23c12 ("bpf: Don't refcount LISTEN sockets in sk_assign()")
> > UDP early demux no longer need to grab a refcount on the UDP socket.
> >
> > This save two atomic operations per incoming packet for connected
> > sockets.
>
> This reminds me of a old series:
>
> https://lore.kernel.org/netdev/cover.1506114055.git.pabeni@redhat.com/
>
> and I'm wondering if we could reconsider such option.
>
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Cc: Martin KaFai Lau <kafai@fb.com>
> > Cc: Joe Stringer <joe@wand.net.nz>
> > Cc: Alexei Starovoitov <ast@kernel.org>
> > Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> > Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
> > ---
> >  net/ipv4/udp.c | 5 +++--
> >  net/ipv6/udp.c | 5 +++--
> >  2 files changed, 6 insertions(+), 4 deletions(-)
> >
> > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > index a8acea17b4e5344d022ae8f8eb674d1a36f8035a..e43ad1d846bdc2ddf5767606b78bbd055f692aa8 100644
> > --- a/net/ipv4/udp.c
> > +++ b/net/ipv4/udp.c
> > @@ -2570,11 +2570,12 @@ int udp_v4_early_demux(struct sk_buff *skb)
> >                                            uh->source, iph->saddr, dif, sdif);
> >       }
> >
> > -     if (!sk || !refcount_inc_not_zero(&sk->sk_refcnt))
> > +     if (!sk)
> >               return 0;
> >
> >       skb->sk = sk;
> > -     skb->destructor = sock_efree;
> > +     DEBUG_NET_WARN_ON_ONCE(sk_is_refcounted(sk));
> > +     skb->destructor = sock_pfree;
>
> I *think* that the skb may escape the current rcu section if e.g. if
> matches a nf dup target in the input tables.

You mean the netfilter queueing stuff perhaps ?

This is already safe, it uses a refcount_inc_not_zero(&sk->sk_refcnt):

if (skb_sk_is_prefetched(skb)) {
    struct sock *sk = skb->sk;

    if (!sk_is_refcounted(sk)) {
             if (!refcount_inc_not_zero(&sk->sk_refcnt))
                   return -ENOTCONN;

        /* drop refcount on skb_orphan */
        skb->destructor = sock_edemux;
    }
}

I would think a duplicate can not duplicate skb->sk in general, or must also
attempt an refcount_inc_not_zero(&sk->sk_refcnt) and use a related destructor.

>
> Back then I tried to implement some debug infra to track such accesses:
>
> https://lore.kernel.org/lkml/cover.1507294365.git.pabeni@redhat.com/
>
> which was buggy (prone to false negative). I think it can be improved
> to something more reliable, perhaps I should revamp it?
>
> I'm also wondering if the DEBUG_NET_WARN_ON_ONCE is worthy?!? the sk is
> an hashed UDP socket so is a full sock and has the bit SOCK_RCU_FREE
> set.

This was mostly to catch any future issues and related to my use of sock_pfree()

DEBUG_NET_WARN_ON_ONCE() is a nop, unless you compile a DEV kernel.

>
> Perhaps we could use a simple 'noop' destructor as in:
>
> https://lore.kernel.org/netdev/b16163e3a4fa4d772edeabd8743acb4a07206bb9.1506114055.git.pabeni@redhat.com/
>

I think we need sock_pfree() for inet_steal_sock(), I might be wrong.
Eric Dumazet March 8, 2024, 9:23 a.m. UTC | #4
On Fri, Mar 8, 2024 at 10:20 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
>
> Please ignore this last part, too late I noticed we need 'sock_pfree'
> to let inet_steal_sock() work as expected.

Yes, this is skb_sk_is_prefetched() part.
Paolo Abeni March 8, 2024, 11:10 a.m. UTC | #5
On Fri, 2024-03-08 at 10:21 +0100, Eric Dumazet wrote:
> > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > > index a8acea17b4e5344d022ae8f8eb674d1a36f8035a..e43ad1d846bdc2ddf5767606b78bbd055f692aa8 100644
> > > --- a/net/ipv4/udp.c
> > > +++ b/net/ipv4/udp.c
> > > @@ -2570,11 +2570,12 @@ int udp_v4_early_demux(struct sk_buff *skb)
> > >                                            uh->source, iph->saddr, dif, sdif);
> > >       }
> > > 
> > > -     if (!sk || !refcount_inc_not_zero(&sk->sk_refcnt))
> > > +     if (!sk)
> > >               return 0;
> > > 
> > >       skb->sk = sk;
> > > -     skb->destructor = sock_efree;
> > > +     DEBUG_NET_WARN_ON_ONCE(sk_is_refcounted(sk));
> > > +     skb->destructor = sock_pfree;
> > 
> > I *think* that the skb may escape the current rcu section if e.g. if
> > matches a nf dup target in the input tables.
> 
> You mean the netfilter queueing stuff perhaps ?
> 
> This is already safe, it uses a refcount_inc_not_zero(&sk->sk_refcnt):
> 
> if (skb_sk_is_prefetched(skb)) {
>     struct sock *sk = skb->sk;
> 
>     if (!sk_is_refcounted(sk)) {
>              if (!refcount_inc_not_zero(&sk->sk_refcnt))
>                    return -ENOTCONN;
> 
>         /* drop refcount on skb_orphan */
>         skb->destructor = sock_edemux;
>     }
> }
> 
> I would think a duplicate can not duplicate skb->sk in general, or must also
> attempt an refcount_inc_not_zero(&sk->sk_refcnt) and use a related destructor.

Right, looks safe.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Eric Dumazet March 8, 2024, 12:40 p.m. UTC | #6
On Fri, Mar 8, 2024 at 12:11 PM Paolo Abeni <pabeni@redhat.com> wrote:
> Right, looks safe.
>
> Acked-by: Paolo Abeni <pabeni@redhat.com>

Thanks !

BTW, I was thinking (for next cycle) to let users (or TCP stack with
ad-hoc heuristics )
opt-in TCP sockets to SOCK_RCU_FREE.

This would avoid the refcnt dance for their incoming packets.
patchwork-bot+netdevbpf@kernel.org March 11, 2024, 7:39 p.m. UTC | #7
Hello:

This patch was applied to netdev/net-next.git (main)
by David S. Miller <davem@davemloft.net>:

On Thu,  7 Mar 2024 22:00:16 +0000 you wrote:
> After commits ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU")
> and 7ae215d23c12 ("bpf: Don't refcount LISTEN sockets in sk_assign()")
> UDP early demux no longer need to grab a refcount on the UDP socket.
> 
> This save two atomic operations per incoming packet for connected
> sockets.
> 
> [...]

Here is the summary with links:
  - [net-next] udp: no longer touch sk->sk_refcnt in early demux
    https://git.kernel.org/netdev/net-next/c/08842c43d016

You are awesome, thank you!
diff mbox series

Patch

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index a8acea17b4e5344d022ae8f8eb674d1a36f8035a..e43ad1d846bdc2ddf5767606b78bbd055f692aa8 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2570,11 +2570,12 @@  int udp_v4_early_demux(struct sk_buff *skb)
 					     uh->source, iph->saddr, dif, sdif);
 	}
 
-	if (!sk || !refcount_inc_not_zero(&sk->sk_refcnt))
+	if (!sk)
 		return 0;
 
 	skb->sk = sk;
-	skb->destructor = sock_efree;
+	DEBUG_NET_WARN_ON_ONCE(sk_is_refcounted(sk));
+	skb->destructor = sock_pfree;
 	dst = rcu_dereference(sk->sk_rx_dst);
 
 	if (dst)
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 3f2249b4cd5f6a594dd9768e29f20f0d9a57faed..fad6667fad6644db8c679ae9b723ccda15edaede 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1101,11 +1101,12 @@  void udp_v6_early_demux(struct sk_buff *skb)
 	else
 		return;
 
-	if (!sk || !refcount_inc_not_zero(&sk->sk_refcnt))
+	if (!sk)
 		return;
 
 	skb->sk = sk;
-	skb->destructor = sock_efree;
+	DEBUG_NET_WARN_ON_ONCE(sk_is_refcounted(sk));
+	skb->destructor = sock_pfree;
 	dst = rcu_dereference(sk->sk_rx_dst);
 
 	if (dst)