Message ID | 20221119130317.39158-1-jakub@cloudflare.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net] l2tp: Don't sleep and disable BH under writer-side sk_callback_lock | expand |
On 2022/11/19 22:03, Jakub Sitnicki wrote: > When holding a reader-writer spin lock we cannot sleep. Calling > setup_udp_tunnel_sock() with write lock held violates this rule, because we > end up calling percpu_down_read(), which might sleep, as syzbot reports > [1]: > > __might_resched.cold+0x222/0x26b kernel/sched/core.c:9890 > percpu_down_read include/linux/percpu-rwsem.h:49 [inline] > cpus_read_lock+0x1b/0x140 kernel/cpu.c:310 > static_key_slow_inc+0x12/0x20 kernel/jump_label.c:158 > udp_tunnel_encap_enable include/net/udp_tunnel.h:187 [inline] > setup_udp_tunnel_sock+0x43d/0x550 net/ipv4/udp_tunnel_core.c:81 > l2tp_tunnel_register+0xc51/0x1210 net/l2tp/l2tp_core.c:1509 > pppol2tp_connect+0xcdc/0x1a10 net/l2tp/l2tp_ppp.c:723 > > Trim the writer-side critical section for sk_callback_lock down to the > minimum, so that it covers only operations on sk_user_data. This patch does not look correct. Since l2tp_validate_socket() checks that sk->sk_user_data == NULL with sk->sk_callback_lock held, you need to call rcu_assign_sk_user_data(sk, tunnel) before releasing sk->sk_callback_lock.
On 2022/11/19 22:52, Tetsuo Handa wrote: > On 2022/11/19 22:03, Jakub Sitnicki wrote: >> When holding a reader-writer spin lock we cannot sleep. Calling >> setup_udp_tunnel_sock() with write lock held violates this rule, because we >> end up calling percpu_down_read(), which might sleep, as syzbot reports >> [1]: >> >> __might_resched.cold+0x222/0x26b kernel/sched/core.c:9890 >> percpu_down_read include/linux/percpu-rwsem.h:49 [inline] >> cpus_read_lock+0x1b/0x140 kernel/cpu.c:310 >> static_key_slow_inc+0x12/0x20 kernel/jump_label.c:158 >> udp_tunnel_encap_enable include/net/udp_tunnel.h:187 [inline] >> setup_udp_tunnel_sock+0x43d/0x550 net/ipv4/udp_tunnel_core.c:81 >> l2tp_tunnel_register+0xc51/0x1210 net/l2tp/l2tp_core.c:1509 >> pppol2tp_connect+0xcdc/0x1a10 net/l2tp/l2tp_ppp.c:723 >> >> Trim the writer-side critical section for sk_callback_lock down to the >> minimum, so that it covers only operations on sk_user_data. > > This patch does not look correct. > > Since l2tp_validate_socket() checks that sk->sk_user_data == NULL with > sk->sk_callback_lock held, you need to call rcu_assign_sk_user_data(sk, tunnel) > before releasing sk->sk_callback_lock. > Is it safe to temporarily set a dummy pointer like below? If it is not safe, what makes assignments done by setup_udp_tunnel_sock() safe? diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c index 754fdda8a5f5..198d38d8fceb 100644 --- a/net/l2tp/l2tp_core.c +++ b/net/l2tp/l2tp_core.c @@ -1474,11 +1474,12 @@ int l2tp_tunnel_register(struct l2tp_tunnel *tunnel, struct net *net, } sk = sock->sk; - write_lock(&sk->sk_callback_lock); - + write_lock_bh(&sk->sk_callback_lock); ret = l2tp_validate_socket(sk, net, tunnel->encap); if (ret < 0) goto err_sock; + rcu_assign_sk_user_data(sk, (void *) 1); + write_unlock_bh(&sk->sk_callback_lock); tunnel->l2tp_net = net; pn = l2tp_pernet(net); @@ -1492,6 +1493,8 @@ int l2tp_tunnel_register(struct l2tp_tunnel *tunnel, struct net *net, spin_unlock_bh(&pn->l2tp_tunnel_list_lock); sock_put(sk); ret = -EEXIST; + write_lock_bh(&sk->sk_callback_lock); + rcu_assign_sk_user_data(sk, NULL); goto err_sock; } } @@ -1522,16 +1525,15 @@ int l2tp_tunnel_register(struct l2tp_tunnel *tunnel, struct net *net, if (tunnel->fd >= 0) sockfd_put(sock); - write_unlock(&sk->sk_callback_lock); return 0; err_sock: + write_unlock_bh(&sk->sk_callback_lock); + if (tunnel->fd < 0) sock_release(sock); else sockfd_put(sock); - - write_unlock(&sk->sk_callback_lock); err: return ret; }
On Sat, Nov 19, 2022 at 10:52 PM +09, Tetsuo Handa wrote: > On 2022/11/19 22:03, Jakub Sitnicki wrote: >> When holding a reader-writer spin lock we cannot sleep. Calling >> setup_udp_tunnel_sock() with write lock held violates this rule, because we >> end up calling percpu_down_read(), which might sleep, as syzbot reports >> [1]: >> >> __might_resched.cold+0x222/0x26b kernel/sched/core.c:9890 >> percpu_down_read include/linux/percpu-rwsem.h:49 [inline] >> cpus_read_lock+0x1b/0x140 kernel/cpu.c:310 >> static_key_slow_inc+0x12/0x20 kernel/jump_label.c:158 >> udp_tunnel_encap_enable include/net/udp_tunnel.h:187 [inline] >> setup_udp_tunnel_sock+0x43d/0x550 net/ipv4/udp_tunnel_core.c:81 >> l2tp_tunnel_register+0xc51/0x1210 net/l2tp/l2tp_core.c:1509 >> pppol2tp_connect+0xcdc/0x1a10 net/l2tp/l2tp_ppp.c:723 >> >> Trim the writer-side critical section for sk_callback_lock down to the >> minimum, so that it covers only operations on sk_user_data. > > This patch does not look correct. > > Since l2tp_validate_socket() checks that sk->sk_user_data == NULL with > sk->sk_callback_lock held, you need to call rcu_assign_sk_user_data(sk, tunnel) > before releasing sk->sk_callback_lock. You're right. v2 posted: https://lore.kernel.org/netdev/20221121085426.21315-1-jakub@cloudflare.com/
On Sat, Nov 19, 2022 at 11:27 PM +09, Tetsuo Handa wrote: > On 2022/11/19 22:52, Tetsuo Handa wrote: >> On 2022/11/19 22:03, Jakub Sitnicki wrote: >>> When holding a reader-writer spin lock we cannot sleep. Calling >>> setup_udp_tunnel_sock() with write lock held violates this rule, because we >>> end up calling percpu_down_read(), which might sleep, as syzbot reports >>> [1]: >>> >>> __might_resched.cold+0x222/0x26b kernel/sched/core.c:9890 >>> percpu_down_read include/linux/percpu-rwsem.h:49 [inline] >>> cpus_read_lock+0x1b/0x140 kernel/cpu.c:310 >>> static_key_slow_inc+0x12/0x20 kernel/jump_label.c:158 >>> udp_tunnel_encap_enable include/net/udp_tunnel.h:187 [inline] >>> setup_udp_tunnel_sock+0x43d/0x550 net/ipv4/udp_tunnel_core.c:81 >>> l2tp_tunnel_register+0xc51/0x1210 net/l2tp/l2tp_core.c:1509 >>> pppol2tp_connect+0xcdc/0x1a10 net/l2tp/l2tp_ppp.c:723 >>> >>> Trim the writer-side critical section for sk_callback_lock down to the >>> minimum, so that it covers only operations on sk_user_data. >> >> This patch does not look correct. >> >> Since l2tp_validate_socket() checks that sk->sk_user_data == NULL with >> sk->sk_callback_lock held, you need to call rcu_assign_sk_user_data(sk, tunnel) >> before releasing sk->sk_callback_lock. >> > > Is it safe to temporarily set a dummy pointer like below? > If it is not safe, what makes assignments done by > setup_udp_tunnel_sock() safe? Yes, I think so. Great idea. I've used it in v2. We can check & assign sk_user_data under sk_callback_lock, and then just let setup_udp_tunnel_sock overwrite it with the same value, without holding the lock. I still think that it's best to keep the critical section as short as possible, though. [...]
On 2022/11/21 18:00, Jakub Sitnicki wrote: >> Is it safe to temporarily set a dummy pointer like below? >> If it is not safe, what makes assignments done by >> setup_udp_tunnel_sock() safe? > > Yes, I think so. Great idea. I've used it in v2. So, you are sure that e.g. udp_sk(sk)->gro_receive = cfg->gro_receive; in setup_udp_tunnel_sock() (where the caller is passing cfg->gro_receive == NULL) never races with e.g. below check (because the socket might be sending/receiving in progress since the socket is retrieved via user-specified file descriptor) ? Then, v2 patch would be OK for fixing this regression. (But I think we still should avoid retrieving a socket from user-specified file descriptor in order to avoid lockdep race window.) struct sk_buff *udp_gro_receive(struct list_head *head, struct sk_buff *skb, struct udphdr *uh, struct sock *sk) { (...snipped...) if (!sk || !udp_sk(sk)->gro_receive) { (...snipped...) /* no GRO, be sure flush the current packet */ goto out; } (...snipped...) pp = call_gro_receive_sk(udp_sk(sk)->gro_receive, sk, head, skb); out: skb_gro_flush_final(skb, pp, flush); return pp; } > > We can check & assign sk_user_data under sk_callback_lock, and then just > let setup_udp_tunnel_sock overwrite it with the same value, without > holding the lock. Given that sk_user_data is RCU-protected on reader-side, don't we need to wait for RCU grace period after resetting to NULL? > > I still think that it's best to keep the critical section as short as > possible, though.
On Mon, Nov 21, 2022 at 07:03 PM +09, Tetsuo Handa wrote: > On 2022/11/21 18:00, Jakub Sitnicki wrote: >>> Is it safe to temporarily set a dummy pointer like below? >>> If it is not safe, what makes assignments done by >>> setup_udp_tunnel_sock() safe? >> >> Yes, I think so. Great idea. I've used it in v2. > > So, you are sure that e.g. > > udp_sk(sk)->gro_receive = cfg->gro_receive; > > in setup_udp_tunnel_sock() (where the caller is passing cfg->gro_receive == NULL) > never races with e.g. below check (because the socket might be sending/receiving > in progress since the socket is retrieved via user-specified file descriptor) ? > > Then, v2 patch would be OK for fixing this regression. (But I think we still should > avoid retrieving a socket from user-specified file descriptor in order to avoid > lockdep race window.) > > > struct sk_buff *udp_gro_receive(struct list_head *head, struct sk_buff *skb, > struct udphdr *uh, struct sock *sk) > { > (...snipped...) > if (!sk || !udp_sk(sk)->gro_receive) { > (...snipped...) > /* no GRO, be sure flush the current packet */ > goto out; > } > (...snipped...) > pp = call_gro_receive_sk(udp_sk(sk)->gro_receive, sk, head, skb); > out: > skb_gro_flush_final(skb, pp, flush); > return pp; > } > First, let me say, that I get the impression that setup_udp_tunnel_sock was not really meant to be used on pre-existing sockets created by user-space. Even though l2tp and gtp seem to be doing that. That is because, I don't see how it could be used properly. Given that we need to check-and-set sk_user_data under sk_callback_lock, which setup_udp_tunnel_sock doesn't grab itself. At the same time it might sleep. There is no way to apply it without resorting to tricks, like we did here. So - yeah - there may be other problems. But if there are, they are not related to the faulty commit b68777d54fac ("l2tp: Serialize access to sk_user_data with sk_callback_lock"), which we're trying to fix. There was no locking present in l2tp_tunnel_register before that point. >> >> We can check & assign sk_user_data under sk_callback_lock, and then just >> let setup_udp_tunnel_sock overwrite it with the same value, without >> holding the lock. > > Given that sk_user_data is RCU-protected on reader-side, don't we need to > wait for RCU grace period after resetting to NULL? Who would be the reader? If you have l2tp_udp_encap_recv in mind - the encap_rcv callback has not been set yet, if we are on the error handling path in l2tp_tunnel_register. In general, though, yes - I think you are right. We must synchronize_rcu before we kfree the tunnel. However, that is also not related to the race to check-and-set sk_user_data, which commit b68777d54fac is trying to fix.
On 2022/11/22 6:55, Jakub Sitnicki wrote: > First, let me say, that I get the impression that setup_udp_tunnel_sock > was not really meant to be used on pre-existing sockets created by > user-space. Even though l2tp and gtp seem to be doing that. > > That is because, I don't see how it could be used properly. Given that > we need to check-and-set sk_user_data under sk_callback_lock, which > setup_udp_tunnel_sock doesn't grab itself. At the same time it might > sleep. There is no way to apply it without resorting to tricks, like we > did here. > > So - yeah - there may be other problems. But if there are, they are not > related to the faulty commit b68777d54fac ("l2tp: Serialize access to > sk_user_data with sk_callback_lock"), which we're trying to fix. There > was no locking present in l2tp_tunnel_register before that point. https://syzkaller.appspot.com/bug?extid=94cc2a66fc228b23f360 is the one where changing lockdep class is concurrently done on pre-existing sockets. I think we need to always create a new socket inside l2tp_tunnel_register(), rather than trying to serialize setting of sk_user_data under sk_callback_lock. > However, that is also not related to the race to check-and-set > sk_user_data, which commit b68777d54fac is trying to fix. Therefore, I feel that reverting commit b68777d54fac "l2tp: Serialize access to sk_user_data with sk_callback_lock" might be the better choice.
On Tue, Nov 22, 2022 at 06:48 PM +09, Tetsuo Handa wrote: > On 2022/11/22 6:55, Jakub Sitnicki wrote: >> First, let me say, that I get the impression that setup_udp_tunnel_sock >> was not really meant to be used on pre-existing sockets created by >> user-space. Even though l2tp and gtp seem to be doing that. >> >> That is because, I don't see how it could be used properly. Given that >> we need to check-and-set sk_user_data under sk_callback_lock, which >> setup_udp_tunnel_sock doesn't grab itself. At the same time it might >> sleep. There is no way to apply it without resorting to tricks, like we >> did here. >> >> So - yeah - there may be other problems. But if there are, they are not >> related to the faulty commit b68777d54fac ("l2tp: Serialize access to >> sk_user_data with sk_callback_lock"), which we're trying to fix. There >> was no locking present in l2tp_tunnel_register before that point. > > https://syzkaller.appspot.com/bug?extid=94cc2a66fc228b23f360 is the one > where changing lockdep class is concurrently done on pre-existing sockets. > > I think we need to always create a new socket inside l2tp_tunnel_register(), > rather than trying to serialize setting of sk_user_data under sk_callback_lock. While that would be easier to handle, I don't see how it can be done in a backward-compatible way. User-space is allowed to pass a socket to l2tp today [1]. > >> However, that is also not related to the race to check-and-set >> sk_user_data, which commit b68777d54fac is trying to fix. > > Therefore, I feel that reverting commit b68777d54fac "l2tp: Serialize access > to sk_user_data with sk_callback_lock" might be the better choice. I'm okay with that. Providing we can come up with have an alternative fix to the race between l2tp and other sk_user_data users. [1] https://elixir.bootlin.com/linux/v6.1-rc6/source/net/l2tp/l2tp_netlink.c#L220
On 2022/11/22 19:46, Jakub Sitnicki wrote: >> https://syzkaller.appspot.com/bug?extid=94cc2a66fc228b23f360 is the one >> where changing lockdep class is concurrently done on pre-existing sockets. >> >> I think we need to always create a new socket inside l2tp_tunnel_register(), >> rather than trying to serialize setting of sk_user_data under sk_callback_lock. > > While that would be easier to handle, I don't see how it can be done in > a backward-compatible way. User-space is allowed to pass a socket to > l2tp today [1]. What is the expected usage of the socket which was passed to l2tp_tunnel_register() ? Is the userspace supposed to just close() that socket? Or, is the userspace allowed to continue using the socket? If the userspace might continue using the socket, we would create a new socket, copy required attributes (the source and destination addresses?) from the socket fetched via sockfd_lookup(), and call replace_fd() like e.g. umh_pipe_setup() does inside l2tp_tunnel_register(). i-node number of the socket would change, but I assume that the process which called l2tp_tunnel_register() is not using that i-node number. Since the socket is a datagram socket, I think we can copy required attributes. But since I'm not familiar with networking code, I don't know what attributes need to be copied. Thus, I leave implementing it to netdev people.
On Tue, Nov 22, 2022 at 08:14:33PM +0900, Tetsuo Handa wrote: > On 2022/11/22 19:46, Jakub Sitnicki wrote: > >> https://syzkaller.appspot.com/bug?extid=94cc2a66fc228b23f360 is the one > >> where changing lockdep class is concurrently done on pre-existing sockets. > >> > >> I think we need to always create a new socket inside l2tp_tunnel_register(), > >> rather than trying to serialize setting of sk_user_data under sk_callback_lock. > > > > While that would be easier to handle, I don't see how it can be done in > > a backward-compatible way. User-space is allowed to pass a socket to > > l2tp today [1]. > > What is the expected usage of the socket which was passed to l2tp_tunnel_register() ? It receives L2TP packets. Those can be either control or data ones. Data packets are processed by the kernel. Control packets are queued to user space. > Is the userspace supposed to just close() that socket? Or, is the userspace allowed to > continue using the socket? User space uses this socket to send and receive L2TP control packets (tunnel and session configuration, keep alive and tear down). Therefore it absolutely needs to continue using this socket after the registration phase. > If the userspace might continue using the socket, we would > > create a new socket, copy required attributes (the source and destination addresses?) from > the socket fetched via sockfd_lookup(), and call replace_fd() like e.g. umh_pipe_setup() does > > inside l2tp_tunnel_register(). i-node number of the socket would change, but I assume that > the process which called l2tp_tunnel_register() is not using that i-node number. > > Since the socket is a datagram socket, I think we can copy required attributes. But since > I'm not familiar with networking code, I don't know what attributes need to be copied. Thus, > I leave implementing it to netdev people. That looks fragile to me. If the problem is that setup_udp_tunnel_sock() can sleep, we can just drop the udp_tunnel_encap_enable() call from setup_udp_tunnel_sock(), rename it __udp_tunnel_encap_enable() and make make udp_tunnel_encap_enable() a wrapper around it that'd also call udp_tunnel_encap_enable().
On 2022/11/22 23:10, Guillaume Nault wrote: > User space uses this socket to send and receive L2TP control packets > (tunnel and session configuration, keep alive and tear down). Therefore > it absolutely needs to continue using this socket after the > registration phase. Thank you for explanation. >> If the userspace might continue using the socket, we would >> >> create a new socket, copy required attributes (the source and destination addresses?) from >> the socket fetched via sockfd_lookup(), and call replace_fd() like e.g. umh_pipe_setup() does >> >> inside l2tp_tunnel_register(). i-node number of the socket would change, but I assume that >> the process which called l2tp_tunnel_register() is not using that i-node number. >> >> Since the socket is a datagram socket, I think we can copy required attributes. But since >> I'm not familiar with networking code, I don't know what attributes need to be copied. Thus, >> I leave implementing it to netdev people. > > That looks fragile to me. If the problem is that setup_udp_tunnel_sock() > can sleep, we can just drop the udp_tunnel_encap_enable() call from > setup_udp_tunnel_sock(), rename it __udp_tunnel_encap_enable() and make > make udp_tunnel_encap_enable() a wrapper around it that'd also call > udp_tunnel_encap_enable(). > That's what I thought at https://lkml.kernel.org/r/c64284f4-2c2a-ecb9-a08e-9e49d49c720b@I-love.SAKURA.ne.jp . But the problem is not that setup_udp_tunnel_sock() can sleep. The problem is that lockdep gets confused due to changing lockdep class after the socket is already published. We need to avoid calling lockdep_set_class_and_name() on a socket retrieved via sockfd_lookup().
On Tue, Nov 22, 2022 at 11:28:45PM +0900, Tetsuo Handa wrote: > On 2022/11/22 23:10, Guillaume Nault wrote: > > User space uses this socket to send and receive L2TP control packets > > (tunnel and session configuration, keep alive and tear down). Therefore > > it absolutely needs to continue using this socket after the > > registration phase. > > Thank you for explanation. > > >> If the userspace might continue using the socket, we would > >> > >> create a new socket, copy required attributes (the source and destination addresses?) from > >> the socket fetched via sockfd_lookup(), and call replace_fd() like e.g. umh_pipe_setup() does > >> > >> inside l2tp_tunnel_register(). i-node number of the socket would change, but I assume that > >> the process which called l2tp_tunnel_register() is not using that i-node number. > >> > >> Since the socket is a datagram socket, I think we can copy required attributes. But since > >> I'm not familiar with networking code, I don't know what attributes need to be copied. Thus, > >> I leave implementing it to netdev people. > > > > That looks fragile to me. If the problem is that setup_udp_tunnel_sock() > > can sleep, we can just drop the udp_tunnel_encap_enable() call from > > setup_udp_tunnel_sock(), rename it __udp_tunnel_encap_enable() and make > > make udp_tunnel_encap_enable() a wrapper around it that'd also call > > udp_tunnel_encap_enable(). > > > > That's what I thought at https://lkml.kernel.org/r/c64284f4-2c2a-ecb9-a08e-9e49d49c720b@I-love.SAKURA.ne.jp . > > But the problem is not that setup_udp_tunnel_sock() can sleep. The problem is that lockdep > gets confused due to changing lockdep class after the socket is already published. We need > to avoid calling lockdep_set_class_and_name() on a socket retrieved via sockfd_lookup(). This is a second problem. The problem of setting sk_user_data under sk_callback_lock write protection (while still calling udp_tunnel_encap_enable() from sleepable context) still remains. For lockdep_set_class_and_name(), maybe we could store the necessary socket information (addresses, ports and checksum configuration) in the l2tp_tunnel structure, thus avoiding the need to read them from the socket. This way, we could stop locking the user space socket in l2tp_xmit_core() and drop the lockdep_set_class_and_name() call. I think either you or Jakub proposed something like this in another thread.
On Wed, Nov 23, 2022 at 16:24:00 +0100, Guillaume Nault wrote: > On Tue, Nov 22, 2022 at 11:28:45PM +0900, Tetsuo Handa wrote: > > That's what I thought at https://lkml.kernel.org/r/c64284f4-2c2a-ecb9-a08e-9e49d49c720b@I-love.SAKURA.ne.jp . > > > > But the problem is not that setup_udp_tunnel_sock() can sleep. The problem is that lockdep > > gets confused due to changing lockdep class after the socket is already published. We need > > to avoid calling lockdep_set_class_and_name() on a socket retrieved via sockfd_lookup(). > > This is a second problem. The problem of setting sk_user_data under > sk_callback_lock write protection (while still calling > udp_tunnel_encap_enable() from sleepable context) still remains. > > For lockdep_set_class_and_name(), maybe we could store the necessary > socket information (addresses, ports and checksum configuration) in the > l2tp_tunnel structure, thus avoiding the need to read them from the > socket. This way, we could stop locking the user space socket in > l2tp_xmit_core() and drop the lockdep_set_class_and_name() call. > I think either you or Jakub proposed something like this in another > thread. I note that l2tp_xmit_core calls ip_queue_xmit which expects a socket atomic context*. It also accesses struct inet_sock corking data which may also need locks to safely access. Possibly we could somehow work around that, but on the face of it we'd need to do a bit more work to avoid the socket lock in the tx path. * davem fixed locking in the l2tp xmit path in: 6af88da14ee2 ("l2tp: Fix locking in l2tp_core.c")
On Thu, Nov 24, 2022 at 10:07:51AM +0000, Tom Parkin wrote: > On Wed, Nov 23, 2022 at 16:24:00 +0100, Guillaume Nault wrote: > > On Tue, Nov 22, 2022 at 11:28:45PM +0900, Tetsuo Handa wrote: > > > That's what I thought at https://lkml.kernel.org/r/c64284f4-2c2a-ecb9-a08e-9e49d49c720b@I-love.SAKURA.ne.jp . > > > > > > But the problem is not that setup_udp_tunnel_sock() can sleep. The problem is that lockdep > > > gets confused due to changing lockdep class after the socket is already published. We need > > > to avoid calling lockdep_set_class_and_name() on a socket retrieved via sockfd_lookup(). > > > > This is a second problem. The problem of setting sk_user_data under > > sk_callback_lock write protection (while still calling > > udp_tunnel_encap_enable() from sleepable context) still remains. > > > > For lockdep_set_class_and_name(), maybe we could store the necessary > > socket information (addresses, ports and checksum configuration) in the > > l2tp_tunnel structure, thus avoiding the need to read them from the > > socket. This way, we could stop locking the user space socket in > > l2tp_xmit_core() and drop the lockdep_set_class_and_name() call. > > I think either you or Jakub proposed something like this in another > > thread. > > I note that l2tp_xmit_core calls ip_queue_xmit which expects a socket > atomic context*. > > It also accesses struct inet_sock corking data which may also need locks > to safely access. > > Possibly we could somehow work around that, but on the face of it we'd > need to do a bit more work to avoid the socket lock in the tx path. I was thinking of avoiding using the socket entirely, which indeed means replacing ip_queue_xmit(). We should probably use the different variants of udp_tunnel_xmit_skb() instead. > * davem fixed locking in the l2tp xmit path in: > > 6af88da14ee2 ("l2tp: Fix locking in l2tp_core.c") > -- > Tom Parkin > Katalix Systems Ltd > https://katalix.com > Catalysts for your Embedded Linux software development
diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c index 754fdda8a5f5..100d17908196 100644 --- a/net/l2tp/l2tp_core.c +++ b/net/l2tp/l2tp_core.c @@ -1474,11 +1474,20 @@ int l2tp_tunnel_register(struct l2tp_tunnel *tunnel, struct net *net, } sk = sock->sk; - write_lock(&sk->sk_callback_lock); - + write_lock_bh(&sk->sk_callback_lock); ret = l2tp_validate_socket(sk, net, tunnel->encap); if (ret < 0) - goto err_sock; + goto err_inval_sock; + + switch (tunnel->encap) { + case L2TP_ENCAPTYPE_IP: + rcu_assign_sk_user_data(sk, tunnel); + break; + case L2TP_ENCAPTYPE_UDP: + /* nothing to do */ + break; + } + write_unlock_bh(&sk->sk_callback_lock); tunnel->l2tp_net = net; pn = l2tp_pernet(net); @@ -1507,8 +1516,6 @@ int l2tp_tunnel_register(struct l2tp_tunnel *tunnel, struct net *net, }; setup_udp_tunnel_sock(net, sock, &udp_cfg); - } else { - rcu_assign_sk_user_data(sk, tunnel); } tunnel->old_sk_destruct = sk->sk_destruct; @@ -1522,16 +1529,18 @@ int l2tp_tunnel_register(struct l2tp_tunnel *tunnel, struct net *net, if (tunnel->fd >= 0) sockfd_put(sock); - write_unlock(&sk->sk_callback_lock); return 0; err_sock: + write_lock_bh(&sk->sk_callback_lock); + rcu_assign_sk_user_data(sk, NULL); +err_inval_sock: + write_unlock_bh(&sk->sk_callback_lock); + if (tunnel->fd < 0) sock_release(sock); else sockfd_put(sock); - - write_unlock(&sk->sk_callback_lock); err: return ret; }
When holding a reader-writer spin lock we cannot sleep. Calling setup_udp_tunnel_sock() with write lock held violates this rule, because we end up calling percpu_down_read(), which might sleep, as syzbot reports [1]: __might_resched.cold+0x222/0x26b kernel/sched/core.c:9890 percpu_down_read include/linux/percpu-rwsem.h:49 [inline] cpus_read_lock+0x1b/0x140 kernel/cpu.c:310 static_key_slow_inc+0x12/0x20 kernel/jump_label.c:158 udp_tunnel_encap_enable include/net/udp_tunnel.h:187 [inline] setup_udp_tunnel_sock+0x43d/0x550 net/ipv4/udp_tunnel_core.c:81 l2tp_tunnel_register+0xc51/0x1210 net/l2tp/l2tp_core.c:1509 pppol2tp_connect+0xcdc/0x1a10 net/l2tp/l2tp_ppp.c:723 Trim the writer-side critical section for sk_callback_lock down to the minimum, so that it covers only operations on sk_user_data. Also, when grabbing the sk_callback_lock, we always need to disable BH, as Eric points out. Failing to do so leads to deadlocks because we acquire sk_callback_lock in softirq context, which can get stuck waiting on us if: 1) it runs on the same CPU, or CPU0 ---- lock(clock-AF_INET6); <Interrupt> lock(clock-AF_INET6); 2) lock ordering leads to priority inversion CPU0 CPU1 ---- ---- lock(clock-AF_INET6); local_irq_disable(); lock(&tcp_hashinfo.bhash[i].lock); lock(clock-AF_INET6); <Interrupt> lock(&tcp_hashinfo.bhash[i].lock); ... as syzbot reports [2,3]. Use the _bh variants for write_(un)lock. [1] https://lore.kernel.org/netdev/0000000000004e78ec05eda79749@google.com/ [2] https://lore.kernel.org/netdev/000000000000e38b6605eda76f98@google.com/ [3] https://lore.kernel.org/netdev/000000000000dfa31e05eda76f75@google.com/ Cc: Tom Parkin <tparkin@katalix.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Fixes: b68777d54fac ("l2tp: Serialize access to sk_user_data with sk_callback_lock") Reported-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot+703d9e154b3b58277261@syzkaller.appspotmail.com Reported-by: syzbot+50680ced9e98a61f7698@syzkaller.appspotmail.com Reported-by: syzbot+de987172bb74a381879b@syzkaller.appspotmail.com Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> --- net/l2tp/l2tp_core.c | 25 +++++++++++++++++-------- 1 file changed, 17 insertions(+), 8 deletions(-)