diff mbox series

[net] net: set SOCK_RCU_FREE before inserting socket into hashtable

Message ID 20231108202819.1932920-1-sdf@google.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series [net] net: set SOCK_RCU_FREE before inserting socket into hashtable | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1314 this patch: 1314
netdev/cc_maintainers warning 2 maintainers not CCed: dsahern@kernel.org bpf@vger.kernel.org
netdev/build_clang success Errors and warnings before: 1341 this patch: 1341
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 1342 this patch: 1342
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 13 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Stanislav Fomichev Nov. 8, 2023, 8:28 p.m. UTC
We've started to see the following kernel traces:

 WARNING: CPU: 83 PID: 0 at net/core/filter.c:6641 sk_lookup+0x1bd/0x1d0

 Call Trace:
  <IRQ>
  __bpf_skc_lookup+0x10d/0x120
  bpf_sk_lookup+0x48/0xd0
  bpf_sk_lookup_tcp+0x19/0x20
  bpf_prog_<redacted>+0x37c/0x16a3
  cls_bpf_classify+0x205/0x2e0
  tcf_classify+0x92/0x160
  __netif_receive_skb_core+0xe52/0xf10
  __netif_receive_skb_list_core+0x96/0x2b0
  napi_complete_done+0x7b5/0xb70
  <redacted>_poll+0x94/0xb0
  net_rx_action+0x163/0x1d70
  __do_softirq+0xdc/0x32e
  asm_call_irq_on_stack+0x12/0x20
  </IRQ>
  do_softirq_own_stack+0x36/0x50
  do_softirq+0x44/0x70

I'm not 100% what is causing them. It might be some kernel change or
new code path in the bpf program. But looking at the code,
I'm assuming the issue has been there for a while.

__inet_hash can race with lockless (rcu) readers on the other cpus:

  __inet_hash
    __sk_nulls_add_node_rcu
    <- (bpf triggers here)
    sock_set_flag(SOCK_RCU_FREE)

Let's move the SOCK_RCU_FREE part up a bit, before we are inserting
the socket into hashtables. Note, that the race is really harmless;
the bpf callers are handling this situation (where listener socket
doesn't have SOCK_RCU_FREE set) correctly, so the only
annoyance is a WARN_ONCE (so not 100% sure whether it should
wait until net-next instead).

For the fixes tag, I'm using the original commit which added the flag.

Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 net/ipv4/inet_hashtables.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Eric Dumazet Nov. 8, 2023, 8:58 p.m. UTC | #1
On Wed, Nov 8, 2023 at 9:28 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> We've started to see the following kernel traces:
>
>  WARNING: CPU: 83 PID: 0 at net/core/filter.c:6641 sk_lookup+0x1bd/0x1d0
>
>  Call Trace:
>   <IRQ>
>   __bpf_skc_lookup+0x10d/0x120
>   bpf_sk_lookup+0x48/0xd0
>   bpf_sk_lookup_tcp+0x19/0x20
>   bpf_prog_<redacted>+0x37c/0x16a3
>   cls_bpf_classify+0x205/0x2e0
>   tcf_classify+0x92/0x160
>   __netif_receive_skb_core+0xe52/0xf10
>   __netif_receive_skb_list_core+0x96/0x2b0
>   napi_complete_done+0x7b5/0xb70
>   <redacted>_poll+0x94/0xb0
>   net_rx_action+0x163/0x1d70
>   __do_softirq+0xdc/0x32e
>   asm_call_irq_on_stack+0x12/0x20
>   </IRQ>
>   do_softirq_own_stack+0x36/0x50
>   do_softirq+0x44/0x70
>
> I'm not 100% what is causing them. It might be some kernel change or
> new code path in the bpf program. But looking at the code,
> I'm assuming the issue has been there for a while.
>
> __inet_hash can race with lockless (rcu) readers on the other cpus:
>
>   __inet_hash
>     __sk_nulls_add_node_rcu
>     <- (bpf triggers here)
>     sock_set_flag(SOCK_RCU_FREE)
>
> Let's move the SOCK_RCU_FREE part up a bit, before we are inserting
> the socket into hashtables. Note, that the race is really harmless;
> the bpf callers are handling this situation (where listener socket
> doesn't have SOCK_RCU_FREE set) correctly, so the only
> annoyance is a WARN_ONCE (so not 100% sure whether it should
> wait until net-next instead).
>
> For the fixes tag, I'm using the original commit which added the flag.

When this commit added the flag, precise location of the
sock_set_flag(sk, SOCK_RCU_FREE)
did not matter, because the thread calling __inet_hash() owns a reference on sk.

SOCK_RCU_FREE was tested only at dismantle time.

Back then BPF was not able yet to perform lookups, and double check if
SOCK_RCU_FREE
was set or not.

Checking SOCK_RCU_FREE _after_ the lookup to infer if a refcount has
been taken came
with commit 6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF")

I think we can be more precise and help future debugging, in case more problems
need investigations.

Can you augment the changelog and use a different Fixes: tag ?

With that,

Reviewed-by: Eric Dumazet <edumazet@google.com>

>
> Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>  net/ipv4/inet_hashtables.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index 598c1b114d2c..a532f749e477 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -751,12 +751,12 @@ int __inet_hash(struct sock *sk, struct sock *osk)
>                 if (err)
>                         goto unlock;
>         }
> +       sock_set_flag(sk, SOCK_RCU_FREE);
>         if (IS_ENABLED(CONFIG_IPV6) && sk->sk_reuseport &&
>                 sk->sk_family == AF_INET6)
>                 __sk_nulls_add_node_tail_rcu(sk, &ilb2->nulls_head);
>         else
>                 __sk_nulls_add_node_rcu(sk, &ilb2->nulls_head);
> -       sock_set_flag(sk, SOCK_RCU_FREE);
>         sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
>  unlock:
>         spin_unlock(&ilb2->lock);
> --
> 2.42.0.869.gea05f2083d-goog
>
Stanislav Fomichev Nov. 8, 2023, 9:01 p.m. UTC | #2
On Wed, Nov 8, 2023 at 12:58 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Nov 8, 2023 at 9:28 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > We've started to see the following kernel traces:
> >
> >  WARNING: CPU: 83 PID: 0 at net/core/filter.c:6641 sk_lookup+0x1bd/0x1d0
> >
> >  Call Trace:
> >   <IRQ>
> >   __bpf_skc_lookup+0x10d/0x120
> >   bpf_sk_lookup+0x48/0xd0
> >   bpf_sk_lookup_tcp+0x19/0x20
> >   bpf_prog_<redacted>+0x37c/0x16a3
> >   cls_bpf_classify+0x205/0x2e0
> >   tcf_classify+0x92/0x160
> >   __netif_receive_skb_core+0xe52/0xf10
> >   __netif_receive_skb_list_core+0x96/0x2b0
> >   napi_complete_done+0x7b5/0xb70
> >   <redacted>_poll+0x94/0xb0
> >   net_rx_action+0x163/0x1d70
> >   __do_softirq+0xdc/0x32e
> >   asm_call_irq_on_stack+0x12/0x20
> >   </IRQ>
> >   do_softirq_own_stack+0x36/0x50
> >   do_softirq+0x44/0x70
> >
> > I'm not 100% what is causing them. It might be some kernel change or
> > new code path in the bpf program. But looking at the code,
> > I'm assuming the issue has been there for a while.
> >
> > __inet_hash can race with lockless (rcu) readers on the other cpus:
> >
> >   __inet_hash
> >     __sk_nulls_add_node_rcu
> >     <- (bpf triggers here)
> >     sock_set_flag(SOCK_RCU_FREE)
> >
> > Let's move the SOCK_RCU_FREE part up a bit, before we are inserting
> > the socket into hashtables. Note, that the race is really harmless;
> > the bpf callers are handling this situation (where listener socket
> > doesn't have SOCK_RCU_FREE set) correctly, so the only
> > annoyance is a WARN_ONCE (so not 100% sure whether it should
> > wait until net-next instead).
> >
> > For the fixes tag, I'm using the original commit which added the flag.
>
> When this commit added the flag, precise location of the
> sock_set_flag(sk, SOCK_RCU_FREE)
> did not matter, because the thread calling __inet_hash() owns a reference on sk.
>
> SOCK_RCU_FREE was tested only at dismantle time.
>
> Back then BPF was not able yet to perform lookups, and double check if
> SOCK_RCU_FREE
> was set or not.
>
> Checking SOCK_RCU_FREE _after_ the lookup to infer if a refcount has
> been taken came
> with commit 6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF")
>
> I think we can be more precise and help future debugging, in case more problems
> need investigations.
>
> Can you augment the changelog and use a different Fixes: tag ?
>
> With that,
>
> Reviewed-by: Eric Dumazet <edumazet@google.com>

Sure, thank you for the timeline! Will resend shortly with the updated
changelog.

> >
> > Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >  net/ipv4/inet_hashtables.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > index 598c1b114d2c..a532f749e477 100644
> > --- a/net/ipv4/inet_hashtables.c
> > +++ b/net/ipv4/inet_hashtables.c
> > @@ -751,12 +751,12 @@ int __inet_hash(struct sock *sk, struct sock *osk)
> >                 if (err)
> >                         goto unlock;
> >         }
> > +       sock_set_flag(sk, SOCK_RCU_FREE);
> >         if (IS_ENABLED(CONFIG_IPV6) && sk->sk_reuseport &&
> >                 sk->sk_family == AF_INET6)
> >                 __sk_nulls_add_node_tail_rcu(sk, &ilb2->nulls_head);
> >         else
> >                 __sk_nulls_add_node_rcu(sk, &ilb2->nulls_head);
> > -       sock_set_flag(sk, SOCK_RCU_FREE);
> >         sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
> >  unlock:
> >         spin_unlock(&ilb2->lock);
> > --
> > 2.42.0.869.gea05f2083d-goog
> >
diff mbox series

Patch

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 598c1b114d2c..a532f749e477 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -751,12 +751,12 @@  int __inet_hash(struct sock *sk, struct sock *osk)
 		if (err)
 			goto unlock;
 	}
+	sock_set_flag(sk, SOCK_RCU_FREE);
 	if (IS_ENABLED(CONFIG_IPV6) && sk->sk_reuseport &&
 		sk->sk_family == AF_INET6)
 		__sk_nulls_add_node_tail_rcu(sk, &ilb2->nulls_head);
 	else
 		__sk_nulls_add_node_rcu(sk, &ilb2->nulls_head);
-	sock_set_flag(sk, SOCK_RCU_FREE);
 	sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
 unlock:
 	spin_unlock(&ilb2->lock);