diff mbox series

[net] tcp: prevent concurrent execution of tcp_sk_exit_batch

Message ID 20240812222857.29837-1-fw@strlen.de (mailing list archive)
State Accepted
Commit 565d121b69980637f040eb4d84289869cdaabedf
Delegated to: Netdev Maintainers
Headers show
Series [net] tcp: prevent concurrent execution of tcp_sk_exit_batch | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 29 this patch: 29
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers warning 1 maintainers not CCed: dsahern@kernel.org
netdev/build_clang success Errors and warnings before: 29 this patch: 29
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 46 this patch: 46
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 32 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-08-16--21-00 (tests: 710)

Commit Message

Florian Westphal Aug. 12, 2024, 10:28 p.m. UTC
Its possible that two threads call tcp_sk_exit_batch() concurrently,
once from the cleanup_net workqueue, once from a task that failed to clone
a new netns.  In the latter case, error unwinding calls the exit handlers
in reverse order for the 'failed' netns.

tcp_sk_exit_batch() calls tcp_twsk_purge().
Problem is that since commit b099ce2602d8 ("net: Batch inet_twsk_purge"),
this function picks up twsk in any dying netns, not just the one passed
in via exit_batch list.

This means that the error unwind of setup_net() can "steal" and destroy
timewait sockets belonging to the exiting netns.

This allows the netns exit worker to proceed to call

WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));

without the expected 1 -> 0 transition, which then splats.

At same time, error unwind path that is also running inet_twsk_purge()
will splat as well:

WARNING: .. at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210
...
 refcount_dec include/linux/refcount.h:351 [inline]
 inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70
 inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221
 inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304
 tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522
 ops_exit_list+0x128/0x180 net/core/net_namespace.c:178
 setup_net+0x714/0xb40 net/core/net_namespace.c:375
 copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508
 create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110

... because refcount_dec() of tw_refcount unexpectedly dropped to 0.

This doesn't seem like an actual bug (no tw sockets got lost and I don't
see a use-after-free) but as erroneous trigger of debug check.

Add a mutex to force strict ordering: the task that calls tcp_twsk_purge()
blocks other task from doing final _dec_and_test before mutex-owner has
removed all tw sockets of dying netns.

Fixes: e9bd0cca09d1 ("tcp: Don't allocate tcp_death_row outside of struct netns_ipv4.")
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: Jason Xing <kerneljasonxing@gmail.com>
Reported-by: syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/0000000000003a5292061f5e4e19@google.com/
Link: https://lore.kernel.org/netdev/20240812140104.GA21559@breakpoint.cc/
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/ipv4/tcp_ipv4.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

Comments

Kuniyuki Iwashima Aug. 12, 2024, 11:28 p.m. UTC | #1
From: Florian Westphal <fw@strlen.de>
Date: Tue, 13 Aug 2024 00:28:25 +0200
> Its possible that two threads call tcp_sk_exit_batch() concurrently,
> once from the cleanup_net workqueue, once from a task that failed to clone
> a new netns.  In the latter case, error unwinding calls the exit handlers
> in reverse order for the 'failed' netns.
> 
> tcp_sk_exit_batch() calls tcp_twsk_purge().
> Problem is that since commit b099ce2602d8 ("net: Batch inet_twsk_purge"),
> this function picks up twsk in any dying netns, not just the one passed
> in via exit_batch list.
> 
> This means that the error unwind of setup_net() can "steal" and destroy
> timewait sockets belonging to the exiting netns.
> 
> This allows the netns exit worker to proceed to call
> 
> WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
> 
> without the expected 1 -> 0 transition, which then splats.
> 
> At same time, error unwind path that is also running inet_twsk_purge()
> will splat as well:
> 
> WARNING: .. at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210
> ...
>  refcount_dec include/linux/refcount.h:351 [inline]
>  inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70
>  inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221
>  inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304
>  tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522
>  ops_exit_list+0x128/0x180 net/core/net_namespace.c:178
>  setup_net+0x714/0xb40 net/core/net_namespace.c:375
>  copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508
>  create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110
> 
> ... because refcount_dec() of tw_refcount unexpectedly dropped to 0.
> 
> This doesn't seem like an actual bug (no tw sockets got lost and I don't
> see a use-after-free) but as erroneous trigger of debug check.

I guess the reason you don't move the check back to tcp_sk_exit() is
to catch a potential issue explained in solution 4 in the link, right ?

Then,

Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>

> 
> Add a mutex to force strict ordering: the task that calls tcp_twsk_purge()
> blocks other task from doing final _dec_and_test before mutex-owner has
> removed all tw sockets of dying netns.
> 
> Fixes: e9bd0cca09d1 ("tcp: Don't allocate tcp_death_row outside of struct netns_ipv4.")
> Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
> Cc: Jason Xing <kerneljasonxing@gmail.com>
> Reported-by: syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/netdev/0000000000003a5292061f5e4e19@google.com/
> Link: https://lore.kernel.org/netdev/20240812140104.GA21559@breakpoint.cc/
> Signed-off-by: Florian Westphal <fw@strlen.de>
Florian Westphal Aug. 12, 2024, 11:52 p.m. UTC | #2
Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > ... because refcount_dec() of tw_refcount unexpectedly dropped to 0.
> > 
> > This doesn't seem like an actual bug (no tw sockets got lost and I don't
> > see a use-after-free) but as erroneous trigger of debug check.
> 
> I guess the reason you don't move the check back to tcp_sk_exit() is

No, it would not work. .exit runs before .exit_batch, we'd splat.

Before e9bd0cca09d1 ordering doesn't matter because
refcount_dec_and_test is used consistently, so it was not relevant
if the 0-transition occured from .exit or later via inet_twsk_kill.

> to catch a potential issue explained in solution 4 in the link, right ?

Yes, it would help to catch such issue wrt. twsk lifetime.
Having the WARN ensures no twsk can escape .exit_batch completion.

E.g. if twsk destructors in the future refer to some other
object that has to be released before netns pointers become invalid
or something like that.

Does that make sense?  Otherwise I'm open to alternative approaches.
Or we can wait until syzbot finds a reproducer, I don't think its
a real/urgent bug.
Kuniyuki Iwashima Aug. 13, 2024, 12:01 a.m. UTC | #3
From: Florian Westphal <fw@strlen.de>
Date: Tue, 13 Aug 2024 01:52:59 +0200
> Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > ... because refcount_dec() of tw_refcount unexpectedly dropped to 0.
> > > 
> > > This doesn't seem like an actual bug (no tw sockets got lost and I don't
> > > see a use-after-free) but as erroneous trigger of debug check.
> > 
> > I guess the reason you don't move the check back to tcp_sk_exit() is
> 
> No, it would not work. .exit runs before .exit_batch, we'd splat.

ahh, sorry, somehow I checked the order of exit_batch_rtnl() and
ops_exit_list() ... :/


> 
> Before e9bd0cca09d1 ordering doesn't matter because
> refcount_dec_and_test is used consistently, so it was not relevant
> if the 0-transition occured from .exit or later via inet_twsk_kill.
> 
> > to catch a potential issue explained in solution 4 in the link, right ?
> 
> Yes, it would help to catch such issue wrt. twsk lifetime.
> Having the WARN ensures no twsk can escape .exit_batch completion.
> 
> E.g. if twsk destructors in the future refer to some other
> object that has to be released before netns pointers become invalid
> or something like that.
> 
> Does that make sense?

Yes.


> Otherwise I'm open to alternative approaches.
> Or we can wait until syzbot finds a reproducer, I don't think its
> a real/urgent bug.

Agree, also I guess syzbot will not find a repro as it's been
3 months since the first report.

Thanks!
Jason Xing Aug. 13, 2024, 2:48 a.m. UTC | #4
On Tue, Aug 13, 2024 at 7:04 AM Florian Westphal <fw@strlen.de> wrote:
>
> Its possible that two threads call tcp_sk_exit_batch() concurrently,
> once from the cleanup_net workqueue, once from a task that failed to clone
> a new netns.  In the latter case, error unwinding calls the exit handlers
> in reverse order for the 'failed' netns.
>
> tcp_sk_exit_batch() calls tcp_twsk_purge().
> Problem is that since commit b099ce2602d8 ("net: Batch inet_twsk_purge"),
> this function picks up twsk in any dying netns, not just the one passed
> in via exit_batch list.
>
> This means that the error unwind of setup_net() can "steal" and destroy
> timewait sockets belonging to the exiting netns.
>
> This allows the netns exit worker to proceed to call
>
> WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
>
> without the expected 1 -> 0 transition, which then splats.
>
> At same time, error unwind path that is also running inet_twsk_purge()
> will splat as well:
>
> WARNING: .. at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210
> ...
>  refcount_dec include/linux/refcount.h:351 [inline]
>  inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70
>  inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221
>  inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304
>  tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522
>  ops_exit_list+0x128/0x180 net/core/net_namespace.c:178
>  setup_net+0x714/0xb40 net/core/net_namespace.c:375
>  copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508
>  create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110
>
> ... because refcount_dec() of tw_refcount unexpectedly dropped to 0.
>
> This doesn't seem like an actual bug (no tw sockets got lost and I don't
> see a use-after-free) but as erroneous trigger of debug check.
>
> Add a mutex to force strict ordering: the task that calls tcp_twsk_purge()
> blocks other task from doing final _dec_and_test before mutex-owner has
> removed all tw sockets of dying netns.
>
> Fixes: e9bd0cca09d1 ("tcp: Don't allocate tcp_death_row outside of struct netns_ipv4.")
> Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
> Cc: Jason Xing <kerneljasonxing@gmail.com>
> Reported-by: syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/netdev/0000000000003a5292061f5e4e19@google.com/
> Link: https://lore.kernel.org/netdev/20240812140104.GA21559@breakpoint.cc/
> Signed-off-by: Florian Westphal <fw@strlen.de>

Really thanks for your great effort :) It's not easy to analyze at all.

Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>

Thanks,
Jason
Paolo Abeni Aug. 15, 2024, 10:47 a.m. UTC | #5
On 8/13/24 00:28, Florian Westphal wrote:
> Its possible that two threads call tcp_sk_exit_batch() concurrently,
> once from the cleanup_net workqueue, once from a task that failed to clone
> a new netns.  In the latter case, error unwinding calls the exit handlers
> in reverse order for the 'failed' netns.
> 
> tcp_sk_exit_batch() calls tcp_twsk_purge().
> Problem is that since commit b099ce2602d8 ("net: Batch inet_twsk_purge"),
> this function picks up twsk in any dying netns, not just the one passed
> in via exit_batch list.
> 
> This means that the error unwind of setup_net() can "steal" and destroy
> timewait sockets belonging to the exiting netns.
> 
> This allows the netns exit worker to proceed to call
> 
> WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
> 
> without the expected 1 -> 0 transition, which then splats.
> 
> At same time, error unwind path that is also running inet_twsk_purge()
> will splat as well:
> 
> WARNING: .. at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210
> ...
>   refcount_dec include/linux/refcount.h:351 [inline]
>   inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70
>   inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221
>   inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304
>   tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522
>   ops_exit_list+0x128/0x180 net/core/net_namespace.c:178
>   setup_net+0x714/0xb40 net/core/net_namespace.c:375
>   copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508
>   create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110
> 
> ... because refcount_dec() of tw_refcount unexpectedly dropped to 0.
> 
> This doesn't seem like an actual bug (no tw sockets got lost and I don't
> see a use-after-free) but as erroneous trigger of debug check.
> 
> Add a mutex to force strict ordering: the task that calls tcp_twsk_purge()
> blocks other task from doing final _dec_and_test before mutex-owner has
> removed all tw sockets of dying netns.
> 
> Fixes: e9bd0cca09d1 ("tcp: Don't allocate tcp_death_row outside of struct netns_ipv4.")
> Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
> Cc: Jason Xing <kerneljasonxing@gmail.com>
> Reported-by: syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/netdev/0000000000003a5292061f5e4e19@google.com/
> Link: https://lore.kernel.org/netdev/20240812140104.GA21559@breakpoint.cc/
> Signed-off-by: Florian Westphal <fw@strlen.de>

The fixes LGTM, but I prefer to keep the patch in PW a little longer to 
allow Eric having a look here.

Cheers,

Paolo
Eric Dumazet Aug. 19, 2024, 3:36 p.m. UTC | #6
On Thu, Aug 15, 2024 at 12:47 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 8/13/24 00:28, Florian Westphal wrote:
> > Its possible that two threads call tcp_sk_exit_batch() concurrently,
> > once from the cleanup_net workqueue, once from a task that failed to clone
> > a new netns.  In the latter case, error unwinding calls the exit handlers
> > in reverse order for the 'failed' netns.
> >
> > tcp_sk_exit_batch() calls tcp_twsk_purge().
> > Problem is that since commit b099ce2602d8 ("net: Batch inet_twsk_purge"),
> > this function picks up twsk in any dying netns, not just the one passed
> > in via exit_batch list.
> >
> > This means that the error unwind of setup_net() can "steal" and destroy
> > timewait sockets belonging to the exiting netns.
> >
> > This allows the netns exit worker to proceed to call
> >
> > WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
> >
> > without the expected 1 -> 0 transition, which then splats.
> >
> > At same time, error unwind path that is also running inet_twsk_purge()
> > will splat as well:
> >
> > WARNING: .. at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210
> > ...
> >   refcount_dec include/linux/refcount.h:351 [inline]
> >   inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70
> >   inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221
> >   inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304
> >   tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522
> >   ops_exit_list+0x128/0x180 net/core/net_namespace.c:178
> >   setup_net+0x714/0xb40 net/core/net_namespace.c:375
> >   copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508
> >   create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110
> >
> > ... because refcount_dec() of tw_refcount unexpectedly dropped to 0.
> >
> > This doesn't seem like an actual bug (no tw sockets got lost and I don't
> > see a use-after-free) but as erroneous trigger of debug check.
> >
> > Add a mutex to force strict ordering: the task that calls tcp_twsk_purge()
> > blocks other task from doing final _dec_and_test before mutex-owner has
> > removed all tw sockets of dying netns.
> >
> > Fixes: e9bd0cca09d1 ("tcp: Don't allocate tcp_death_row outside of struct netns_ipv4.")
> > Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
> > Cc: Jason Xing <kerneljasonxing@gmail.com>
> > Reported-by: syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com
> > Closes: https://lore.kernel.org/netdev/0000000000003a5292061f5e4e19@google.com/
> > Link: https://lore.kernel.org/netdev/20240812140104.GA21559@breakpoint.cc/
> > Signed-off-by: Florian Westphal <fw@strlen.de>
>
> The fixes LGTM, but I prefer to keep the patch in PW a little longer to
> allow Eric having a look here.

Excellent patch, thanks a lot Florian !

I had this syzbot report in my queue for a while, but I presumed this
was caused by RDS,
thus waiting for more signal.

Reviewed-by: Eric Dumazet <edumazet@google.com>
patchwork-bot+netdevbpf@kernel.org Aug. 19, 2024, 3:50 p.m. UTC | #7
Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Tue, 13 Aug 2024 00:28:25 +0200 you wrote:
> Its possible that two threads call tcp_sk_exit_batch() concurrently,
> once from the cleanup_net workqueue, once from a task that failed to clone
> a new netns.  In the latter case, error unwinding calls the exit handlers
> in reverse order for the 'failed' netns.
> 
> tcp_sk_exit_batch() calls tcp_twsk_purge().
> Problem is that since commit b099ce2602d8 ("net: Batch inet_twsk_purge"),
> this function picks up twsk in any dying netns, not just the one passed
> in via exit_batch list.
> 
> [...]

Here is the summary with links:
  - [net] tcp: prevent concurrent execution of tcp_sk_exit_batch
    https://git.kernel.org/netdev/net/c/565d121b6998

You are awesome, thank you!
diff mbox series

Patch

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index fd17f25ff288..a4e510846905 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -97,6 +97,8 @@  static DEFINE_PER_CPU(struct sock_bh_locked, ipv4_tcp_sk) = {
 	.bh_lock = INIT_LOCAL_LOCK(bh_lock),
 };
 
+static DEFINE_MUTEX(tcp_exit_batch_mutex);
+
 static u32 tcp_v4_init_seq(const struct sk_buff *skb)
 {
 	return secure_tcp_seq(ip_hdr(skb)->daddr,
@@ -3514,6 +3516,16 @@  static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list)
 {
 	struct net *net;
 
+	/* make sure concurrent calls to tcp_sk_exit_batch from net_cleanup_work
+	 * and failed setup_net error unwinding path are serialized.
+	 *
+	 * tcp_twsk_purge() handles twsk in any dead netns, not just those in
+	 * net_exit_list, the thread that dismantles a particular twsk must
+	 * do so without other thread progressing to refcount_dec_and_test() of
+	 * tcp_death_row.tw_refcount.
+	 */
+	mutex_lock(&tcp_exit_batch_mutex);
+
 	tcp_twsk_purge(net_exit_list);
 
 	list_for_each_entry(net, net_exit_list, exit_list) {
@@ -3521,6 +3533,8 @@  static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list)
 		WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
 		tcp_fastopen_ctx_destroy(net);
 	}
+
+	mutex_unlock(&tcp_exit_batch_mutex);
 }
 
 static struct pernet_operations __net_initdata tcp_sk_ops = {