Message ID | 20240812222857.29837-1-fw@strlen.de (mailing list archive) |
---|---|
State | Accepted |
Commit | 565d121b69980637f040eb4d84289869cdaabedf |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net] tcp: prevent concurrent execution of tcp_sk_exit_batch | expand |
From: Florian Westphal <fw@strlen.de> Date: Tue, 13 Aug 2024 00:28:25 +0200 > Its possible that two threads call tcp_sk_exit_batch() concurrently, > once from the cleanup_net workqueue, once from a task that failed to clone > a new netns. In the latter case, error unwinding calls the exit handlers > in reverse order for the 'failed' netns. > > tcp_sk_exit_batch() calls tcp_twsk_purge(). > Problem is that since commit b099ce2602d8 ("net: Batch inet_twsk_purge"), > this function picks up twsk in any dying netns, not just the one passed > in via exit_batch list. > > This means that the error unwind of setup_net() can "steal" and destroy > timewait sockets belonging to the exiting netns. > > This allows the netns exit worker to proceed to call > > WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount)); > > without the expected 1 -> 0 transition, which then splats. > > At same time, error unwind path that is also running inet_twsk_purge() > will splat as well: > > WARNING: .. at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 > ... > refcount_dec include/linux/refcount.h:351 [inline] > inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70 > inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221 > inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304 > tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522 > ops_exit_list+0x128/0x180 net/core/net_namespace.c:178 > setup_net+0x714/0xb40 net/core/net_namespace.c:375 > copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508 > create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110 > > ... because refcount_dec() of tw_refcount unexpectedly dropped to 0. > > This doesn't seem like an actual bug (no tw sockets got lost and I don't > see a use-after-free) but as erroneous trigger of debug check. I guess the reason you don't move the check back to tcp_sk_exit() is to catch a potential issue explained in solution 4 in the link, right ? Then, Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> > > Add a mutex to force strict ordering: the task that calls tcp_twsk_purge() > blocks other task from doing final _dec_and_test before mutex-owner has > removed all tw sockets of dying netns. > > Fixes: e9bd0cca09d1 ("tcp: Don't allocate tcp_death_row outside of struct netns_ipv4.") > Cc: Kuniyuki Iwashima <kuniyu@amazon.com> > Cc: Jason Xing <kerneljasonxing@gmail.com> > Reported-by: syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com > Closes: https://lore.kernel.org/netdev/0000000000003a5292061f5e4e19@google.com/ > Link: https://lore.kernel.org/netdev/20240812140104.GA21559@breakpoint.cc/ > Signed-off-by: Florian Westphal <fw@strlen.de>
Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > ... because refcount_dec() of tw_refcount unexpectedly dropped to 0. > > > > This doesn't seem like an actual bug (no tw sockets got lost and I don't > > see a use-after-free) but as erroneous trigger of debug check. > > I guess the reason you don't move the check back to tcp_sk_exit() is No, it would not work. .exit runs before .exit_batch, we'd splat. Before e9bd0cca09d1 ordering doesn't matter because refcount_dec_and_test is used consistently, so it was not relevant if the 0-transition occured from .exit or later via inet_twsk_kill. > to catch a potential issue explained in solution 4 in the link, right ? Yes, it would help to catch such issue wrt. twsk lifetime. Having the WARN ensures no twsk can escape .exit_batch completion. E.g. if twsk destructors in the future refer to some other object that has to be released before netns pointers become invalid or something like that. Does that make sense? Otherwise I'm open to alternative approaches. Or we can wait until syzbot finds a reproducer, I don't think its a real/urgent bug.
From: Florian Westphal <fw@strlen.de> Date: Tue, 13 Aug 2024 01:52:59 +0200 > Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > > ... because refcount_dec() of tw_refcount unexpectedly dropped to 0. > > > > > > This doesn't seem like an actual bug (no tw sockets got lost and I don't > > > see a use-after-free) but as erroneous trigger of debug check. > > > > I guess the reason you don't move the check back to tcp_sk_exit() is > > No, it would not work. .exit runs before .exit_batch, we'd splat. ahh, sorry, somehow I checked the order of exit_batch_rtnl() and ops_exit_list() ... :/ > > Before e9bd0cca09d1 ordering doesn't matter because > refcount_dec_and_test is used consistently, so it was not relevant > if the 0-transition occured from .exit or later via inet_twsk_kill. > > > to catch a potential issue explained in solution 4 in the link, right ? > > Yes, it would help to catch such issue wrt. twsk lifetime. > Having the WARN ensures no twsk can escape .exit_batch completion. > > E.g. if twsk destructors in the future refer to some other > object that has to be released before netns pointers become invalid > or something like that. > > Does that make sense? Yes. > Otherwise I'm open to alternative approaches. > Or we can wait until syzbot finds a reproducer, I don't think its > a real/urgent bug. Agree, also I guess syzbot will not find a repro as it's been 3 months since the first report. Thanks!
On Tue, Aug 13, 2024 at 7:04 AM Florian Westphal <fw@strlen.de> wrote: > > Its possible that two threads call tcp_sk_exit_batch() concurrently, > once from the cleanup_net workqueue, once from a task that failed to clone > a new netns. In the latter case, error unwinding calls the exit handlers > in reverse order for the 'failed' netns. > > tcp_sk_exit_batch() calls tcp_twsk_purge(). > Problem is that since commit b099ce2602d8 ("net: Batch inet_twsk_purge"), > this function picks up twsk in any dying netns, not just the one passed > in via exit_batch list. > > This means that the error unwind of setup_net() can "steal" and destroy > timewait sockets belonging to the exiting netns. > > This allows the netns exit worker to proceed to call > > WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount)); > > without the expected 1 -> 0 transition, which then splats. > > At same time, error unwind path that is also running inet_twsk_purge() > will splat as well: > > WARNING: .. at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 > ... > refcount_dec include/linux/refcount.h:351 [inline] > inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70 > inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221 > inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304 > tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522 > ops_exit_list+0x128/0x180 net/core/net_namespace.c:178 > setup_net+0x714/0xb40 net/core/net_namespace.c:375 > copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508 > create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110 > > ... because refcount_dec() of tw_refcount unexpectedly dropped to 0. > > This doesn't seem like an actual bug (no tw sockets got lost and I don't > see a use-after-free) but as erroneous trigger of debug check. > > Add a mutex to force strict ordering: the task that calls tcp_twsk_purge() > blocks other task from doing final _dec_and_test before mutex-owner has > removed all tw sockets of dying netns. > > Fixes: e9bd0cca09d1 ("tcp: Don't allocate tcp_death_row outside of struct netns_ipv4.") > Cc: Kuniyuki Iwashima <kuniyu@amazon.com> > Cc: Jason Xing <kerneljasonxing@gmail.com> > Reported-by: syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com > Closes: https://lore.kernel.org/netdev/0000000000003a5292061f5e4e19@google.com/ > Link: https://lore.kernel.org/netdev/20240812140104.GA21559@breakpoint.cc/ > Signed-off-by: Florian Westphal <fw@strlen.de> Really thanks for your great effort :) It's not easy to analyze at all. Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Thanks, Jason
On 8/13/24 00:28, Florian Westphal wrote: > Its possible that two threads call tcp_sk_exit_batch() concurrently, > once from the cleanup_net workqueue, once from a task that failed to clone > a new netns. In the latter case, error unwinding calls the exit handlers > in reverse order for the 'failed' netns. > > tcp_sk_exit_batch() calls tcp_twsk_purge(). > Problem is that since commit b099ce2602d8 ("net: Batch inet_twsk_purge"), > this function picks up twsk in any dying netns, not just the one passed > in via exit_batch list. > > This means that the error unwind of setup_net() can "steal" and destroy > timewait sockets belonging to the exiting netns. > > This allows the netns exit worker to proceed to call > > WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount)); > > without the expected 1 -> 0 transition, which then splats. > > At same time, error unwind path that is also running inet_twsk_purge() > will splat as well: > > WARNING: .. at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 > ... > refcount_dec include/linux/refcount.h:351 [inline] > inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70 > inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221 > inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304 > tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522 > ops_exit_list+0x128/0x180 net/core/net_namespace.c:178 > setup_net+0x714/0xb40 net/core/net_namespace.c:375 > copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508 > create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110 > > ... because refcount_dec() of tw_refcount unexpectedly dropped to 0. > > This doesn't seem like an actual bug (no tw sockets got lost and I don't > see a use-after-free) but as erroneous trigger of debug check. > > Add a mutex to force strict ordering: the task that calls tcp_twsk_purge() > blocks other task from doing final _dec_and_test before mutex-owner has > removed all tw sockets of dying netns. > > Fixes: e9bd0cca09d1 ("tcp: Don't allocate tcp_death_row outside of struct netns_ipv4.") > Cc: Kuniyuki Iwashima <kuniyu@amazon.com> > Cc: Jason Xing <kerneljasonxing@gmail.com> > Reported-by: syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com > Closes: https://lore.kernel.org/netdev/0000000000003a5292061f5e4e19@google.com/ > Link: https://lore.kernel.org/netdev/20240812140104.GA21559@breakpoint.cc/ > Signed-off-by: Florian Westphal <fw@strlen.de> The fixes LGTM, but I prefer to keep the patch in PW a little longer to allow Eric having a look here. Cheers, Paolo
On Thu, Aug 15, 2024 at 12:47 PM Paolo Abeni <pabeni@redhat.com> wrote: > > On 8/13/24 00:28, Florian Westphal wrote: > > Its possible that two threads call tcp_sk_exit_batch() concurrently, > > once from the cleanup_net workqueue, once from a task that failed to clone > > a new netns. In the latter case, error unwinding calls the exit handlers > > in reverse order for the 'failed' netns. > > > > tcp_sk_exit_batch() calls tcp_twsk_purge(). > > Problem is that since commit b099ce2602d8 ("net: Batch inet_twsk_purge"), > > this function picks up twsk in any dying netns, not just the one passed > > in via exit_batch list. > > > > This means that the error unwind of setup_net() can "steal" and destroy > > timewait sockets belonging to the exiting netns. > > > > This allows the netns exit worker to proceed to call > > > > WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount)); > > > > without the expected 1 -> 0 transition, which then splats. > > > > At same time, error unwind path that is also running inet_twsk_purge() > > will splat as well: > > > > WARNING: .. at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 > > ... > > refcount_dec include/linux/refcount.h:351 [inline] > > inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70 > > inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221 > > inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304 > > tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522 > > ops_exit_list+0x128/0x180 net/core/net_namespace.c:178 > > setup_net+0x714/0xb40 net/core/net_namespace.c:375 > > copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508 > > create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110 > > > > ... because refcount_dec() of tw_refcount unexpectedly dropped to 0. > > > > This doesn't seem like an actual bug (no tw sockets got lost and I don't > > see a use-after-free) but as erroneous trigger of debug check. > > > > Add a mutex to force strict ordering: the task that calls tcp_twsk_purge() > > blocks other task from doing final _dec_and_test before mutex-owner has > > removed all tw sockets of dying netns. > > > > Fixes: e9bd0cca09d1 ("tcp: Don't allocate tcp_death_row outside of struct netns_ipv4.") > > Cc: Kuniyuki Iwashima <kuniyu@amazon.com> > > Cc: Jason Xing <kerneljasonxing@gmail.com> > > Reported-by: syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com > > Closes: https://lore.kernel.org/netdev/0000000000003a5292061f5e4e19@google.com/ > > Link: https://lore.kernel.org/netdev/20240812140104.GA21559@breakpoint.cc/ > > Signed-off-by: Florian Westphal <fw@strlen.de> > > The fixes LGTM, but I prefer to keep the patch in PW a little longer to > allow Eric having a look here. Excellent patch, thanks a lot Florian ! I had this syzbot report in my queue for a while, but I presumed this was caused by RDS, thus waiting for more signal. Reviewed-by: Eric Dumazet <edumazet@google.com>
Hello: This patch was applied to netdev/net.git (main) by Jakub Kicinski <kuba@kernel.org>: On Tue, 13 Aug 2024 00:28:25 +0200 you wrote: > Its possible that two threads call tcp_sk_exit_batch() concurrently, > once from the cleanup_net workqueue, once from a task that failed to clone > a new netns. In the latter case, error unwinding calls the exit handlers > in reverse order for the 'failed' netns. > > tcp_sk_exit_batch() calls tcp_twsk_purge(). > Problem is that since commit b099ce2602d8 ("net: Batch inet_twsk_purge"), > this function picks up twsk in any dying netns, not just the one passed > in via exit_batch list. > > [...] Here is the summary with links: - [net] tcp: prevent concurrent execution of tcp_sk_exit_batch https://git.kernel.org/netdev/net/c/565d121b6998 You are awesome, thank you!
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index fd17f25ff288..a4e510846905 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -97,6 +97,8 @@ static DEFINE_PER_CPU(struct sock_bh_locked, ipv4_tcp_sk) = { .bh_lock = INIT_LOCAL_LOCK(bh_lock), }; +static DEFINE_MUTEX(tcp_exit_batch_mutex); + static u32 tcp_v4_init_seq(const struct sk_buff *skb) { return secure_tcp_seq(ip_hdr(skb)->daddr, @@ -3514,6 +3516,16 @@ static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list) { struct net *net; + /* make sure concurrent calls to tcp_sk_exit_batch from net_cleanup_work + * and failed setup_net error unwinding path are serialized. + * + * tcp_twsk_purge() handles twsk in any dead netns, not just those in + * net_exit_list, the thread that dismantles a particular twsk must + * do so without other thread progressing to refcount_dec_and_test() of + * tcp_death_row.tw_refcount. + */ + mutex_lock(&tcp_exit_batch_mutex); + tcp_twsk_purge(net_exit_list); list_for_each_entry(net, net_exit_list, exit_list) { @@ -3521,6 +3533,8 @@ static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list) WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount)); tcp_fastopen_ctx_destroy(net); } + + mutex_unlock(&tcp_exit_batch_mutex); } static struct pernet_operations __net_initdata tcp_sk_ops = {
Its possible that two threads call tcp_sk_exit_batch() concurrently, once from the cleanup_net workqueue, once from a task that failed to clone a new netns. In the latter case, error unwinding calls the exit handlers in reverse order for the 'failed' netns. tcp_sk_exit_batch() calls tcp_twsk_purge(). Problem is that since commit b099ce2602d8 ("net: Batch inet_twsk_purge"), this function picks up twsk in any dying netns, not just the one passed in via exit_batch list. This means that the error unwind of setup_net() can "steal" and destroy timewait sockets belonging to the exiting netns. This allows the netns exit worker to proceed to call WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount)); without the expected 1 -> 0 transition, which then splats. At same time, error unwind path that is also running inet_twsk_purge() will splat as well: WARNING: .. at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 ... refcount_dec include/linux/refcount.h:351 [inline] inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70 inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221 inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304 tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522 ops_exit_list+0x128/0x180 net/core/net_namespace.c:178 setup_net+0x714/0xb40 net/core/net_namespace.c:375 copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508 create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110 ... because refcount_dec() of tw_refcount unexpectedly dropped to 0. This doesn't seem like an actual bug (no tw sockets got lost and I don't see a use-after-free) but as erroneous trigger of debug check. Add a mutex to force strict ordering: the task that calls tcp_twsk_purge() blocks other task from doing final _dec_and_test before mutex-owner has removed all tw sockets of dying netns. Fixes: e9bd0cca09d1 ("tcp: Don't allocate tcp_death_row outside of struct netns_ipv4.") Cc: Kuniyuki Iwashima <kuniyu@amazon.com> Cc: Jason Xing <kerneljasonxing@gmail.com> Reported-by: syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/0000000000003a5292061f5e4e19@google.com/ Link: https://lore.kernel.org/netdev/20240812140104.GA21559@breakpoint.cc/ Signed-off-by: Florian Westphal <fw@strlen.de> --- net/ipv4/tcp_ipv4.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+)