Message ID | 20240507170018.83385-1-kuniyu@amazon.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Series | [v1,net] af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock. |
On Tue, 2024-05-07 at 10:00 -0700, Kuniyuki Iwashima wrote:
> Billy Jheng Bing-Jhong reported a race between __unix_gc() and
> queue_oob().
>
> __unix_gc() tries to garbage-collect close()d inflight sockets,
> and then if the socket has MSG_OOB in unix_sk(sk)->oob_skb, GC
> will drop the reference and set it to NULL locklessly.
>
> However, the peer socket can still send a MSG_OOB message to the
> GC candidate, and queue_oob() can update unix_sk(sk)->oob_skb
> concurrently, resulting in a NULL pointer dereference. [0]
>
> To avoid the race, let's update unix_sk(sk)->oob_skb under
> sk_receive_queue's lock.

I'm sorry to delay this fix but...

AFAICS every time AF_UNIX touches the oob_skb, it's under the receiver's
unix_state_lock. The only exception is __unix_gc. What about just
acquiring such lock there?

Otherwise there are other chunks touching the oob_skb where this patch
does not add the receive queue spin lock protection, e.g. in
unix_stream_recv_urg(), making the code a bit inconsistent.

Thanks,

Paolo
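Concretely, the alternative suggested above would amount to something like
the sketch below in the old __unix_gc() (an untested illustration of the
idea, mirroring the garbage.c hunk quoted later in this thread; it is not a
change anyone posted):

#if IS_ENABLED(CONFIG_AF_UNIX_OOB)
		/* Sketch only: reuse the receiver's unix_state_lock(),
		 * which queue_oob() already holds, instead of introducing
		 * the receive queue lock here.
		 */
		unix_state_lock(&u->sk);
		if (u->oob_skb) {
			kfree_skb(u->oob_skb);
			u->oob_skb = NULL;
		}
		unix_state_unlock(&u->sk);
#endif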
From: Paolo Abeni <pabeni@redhat.com>
Date: Thu, 09 May 2024 11:12:38 +0200
> On Tue, 2024-05-07 at 10:00 -0700, Kuniyuki Iwashima wrote:
> > Billy Jheng Bing-Jhong reported a race between __unix_gc() and
> > queue_oob().
> >
> > __unix_gc() tries to garbage-collect close()d inflight sockets,
> > and then if the socket has MSG_OOB in unix_sk(sk)->oob_skb, GC
> > will drop the reference and set it to NULL locklessly.
> >
> > However, the peer socket can still send a MSG_OOB message to the
> > GC candidate, and queue_oob() can update unix_sk(sk)->oob_skb
> > concurrently, resulting in a NULL pointer dereference. [0]
> >
> > To avoid the race, let's update unix_sk(sk)->oob_skb under
> > sk_receive_queue's lock.
>
> I'm sorry to delay this fix but...
>
> AFAICS every time AF_UNIX touches the oob_skb, it's under the receiver's
> unix_state_lock. The only exception is __unix_gc. What about just
> acquiring such lock there?

In the new GC, there is unix_state_lock -> gc_lock ordering, and
we would need another fix then.

That's why I chose locking the recvq for the old GC too.
https://lore.kernel.org/netdev/20240507172606.85532-1-kuniyu@amazon.com/

Also, Linus says:

  I really get the feeling that 'sb->oob_skb' should actually be forced
  to always be in sync with the receive queue by always doing the
  accesses under the receive_queue lock.

( That's in the security@ thread I added you to, but I just noticed
  Linus replied to the previous mail. I'll forward the mails to you. )

> Otherwise there are other chunks touching the oob_skb where this patch
> does not add the receive queue spin lock protection, e.g. in
> unix_stream_recv_urg(), making the code a bit inconsistent.

Yes, now the receive path is protected by unix_state_lock() and the
send path by unix_state_lock() and the recvq lock.

Ideally, as Linus suggested, we should acquire the recvq lock everywhere
we touch oob_skb and remove the additional refcount taken by skb_get(),
but I thought that was too much for a fix, and I would do that
refactoring in the next cycle.

What do you think ?
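Schematically, the lock-ordering constraint described above looks like the
comment below (an illustration only, not kernel code; unix_gc_lock stands in
for the GC-side lock and the nesting sites are simplified):

/*
 * With the new GC, the send path already nests the GC-side lock inside
 * the receiver's state lock, so the GC worker must not take
 * unix_state_lock() while holding its own lock:
 *
 *	send path			GC worker (if it took the state lock)
 *	unix_state_lock(sk);		spin_lock(&unix_gc_lock);
 *	  spin_lock(&unix_gc_lock);	  unix_state_lock(sk);	// ABBA
 *	  spin_unlock(&unix_gc_lock);
 *	unix_state_unlock(sk);
 *
 * sk->sk_receive_queue.lock has no such established ordering with the
 * GC lock, so queue_oob() and the GC can both take it safely.
 */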
On Fri, 2024-05-10 at 14:03 +0900, Kuniyuki Iwashima wrote:
> From: Paolo Abeni <pabeni@redhat.com>
> Date: Thu, 09 May 2024 11:12:38 +0200
> > On Tue, 2024-05-07 at 10:00 -0700, Kuniyuki Iwashima wrote:
> > > Billy Jheng Bing-Jhong reported a race between __unix_gc() and
> > > queue_oob().
> > >
> > > __unix_gc() tries to garbage-collect close()d inflight sockets,
> > > and then if the socket has MSG_OOB in unix_sk(sk)->oob_skb, GC
> > > will drop the reference and set it to NULL locklessly.
> > >
> > > However, the peer socket can still send a MSG_OOB message to the
> > > GC candidate, and queue_oob() can update unix_sk(sk)->oob_skb
> > > concurrently, resulting in a NULL pointer dereference. [0]
> > >
> > > To avoid the race, let's update unix_sk(sk)->oob_skb under
> > > sk_receive_queue's lock.
> >
> > I'm sorry to delay this fix but...
> >
> > AFAICS every time AF_UNIX touches the oob_skb, it's under the receiver's
> > unix_state_lock. The only exception is __unix_gc. What about just
> > acquiring such lock there?
>
> In the new GC, there is unix_state_lock -> gc_lock ordering, and
> we would need another fix then.
>
> That's why I chose locking the recvq for the old GC too.
> https://lore.kernel.org/netdev/20240507172606.85532-1-kuniyu@amazon.com/
>
> Also, Linus says:
>
>   I really get the feeling that 'sb->oob_skb' should actually be forced
>   to always be in sync with the receive queue by always doing the
>   accesses under the receive_queue lock.
>
> ( That's in the security@ thread I added you to, but I just noticed
>   Linus replied to the previous mail. I'll forward the mails to you. )
>
> > Otherwise there are other chunks touching the oob_skb where this patch
> > does not add the receive queue spin lock protection, e.g. in
> > unix_stream_recv_urg(), making the code a bit inconsistent.
>
> Yes, now the receive path is protected by unix_state_lock() and the
> send path by unix_state_lock() and the recvq lock.
>
> Ideally, as Linus suggested, we should acquire the recvq lock everywhere
> we touch oob_skb and remove the additional refcount taken by skb_get(),
> but I thought that was too much for a fix, and I would do that
> refactoring in the next cycle.
>
> What do you think ?

I missed/forgot the unix_state_lock -> gc_lock ordering on net-next.

What about using the receive queue lock, and acquiring it everywhere
oob_skb is touched, without the additional refcount refactor?

That would be more consistent and reasonably small. It should work on
the new GC, too.

The refcount refactor could later come on net-next, and will be less
complex with the lock already in place.
Incremental patch on top of yours, completely untested:
---
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 9a6ad5974dff..a489f2aef29d 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2614,8 +2614,10 @@ static int unix_stream_recv_urg(struct unix_stream_read_state *state)
 
 	mutex_lock(&u->iolock);
 	unix_state_lock(sk);
+	spin_lock(&sk->sk_receive_queue.lock);
 
 	if (sock_flag(sk, SOCK_URGINLINE) || !u->oob_skb) {
+		spin_unlock(&sk->sk_receive_queue.lock);
 		unix_state_unlock(sk);
 		mutex_unlock(&u->iolock);
 		return -EINVAL;
@@ -2627,6 +2629,7 @@ static int unix_stream_recv_urg(struct unix_stream_read_state *state)
 		WRITE_ONCE(u->oob_skb, NULL);
 	else
 		skb_get(oob_skb);
+	spin_unlock(&sk->sk_receive_queue.lock);
 	unix_state_unlock(sk);
 
 	chunk = state->recv_actor(oob_skb, 0, chunk, state);
@@ -2655,6 +2658,7 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
 			consume_skb(skb);
 			skb = NULL;
 		} else {
+			spin_lock(&sk->sk_receive_queue.lock);
 			if (skb == u->oob_skb) {
 				if (copied) {
 					skb = NULL;
@@ -2673,6 +2677,7 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
 					skb = skb_peek(&sk->sk_receive_queue);
 				}
 			}
+			spin_unlock(&sk->sk_receive_queue.lock);
 		}
 	}
 	return skb;
 }
From: Paolo Abeni <pabeni@redhat.com>
Date: Fri, 10 May 2024 09:53:25 +0200
> On Fri, 2024-05-10 at 14:03 +0900, Kuniyuki Iwashima wrote:
> > From: Paolo Abeni <pabeni@redhat.com>
> > Date: Thu, 09 May 2024 11:12:38 +0200
> > > On Tue, 2024-05-07 at 10:00 -0700, Kuniyuki Iwashima wrote:
> > > > Billy Jheng Bing-Jhong reported a race between __unix_gc() and
> > > > queue_oob().
> > > >
> > > > __unix_gc() tries to garbage-collect close()d inflight sockets,
> > > > and then if the socket has MSG_OOB in unix_sk(sk)->oob_skb, GC
> > > > will drop the reference and set it to NULL locklessly.
> > > >
> > > > However, the peer socket can still send a MSG_OOB message to the
> > > > GC candidate, and queue_oob() can update unix_sk(sk)->oob_skb
> > > > concurrently, resulting in a NULL pointer dereference. [0]
> > > >
> > > > To avoid the race, let's update unix_sk(sk)->oob_skb under
> > > > sk_receive_queue's lock.
> > >
> > > I'm sorry to delay this fix but...
> > >
> > > AFAICS every time AF_UNIX touches the oob_skb, it's under the receiver's
> > > unix_state_lock. The only exception is __unix_gc. What about just
> > > acquiring such lock there?
> >
> > In the new GC, there is unix_state_lock -> gc_lock ordering, and
> > we would need another fix then.
> >
> > That's why I chose locking the recvq for the old GC too.
> > https://lore.kernel.org/netdev/20240507172606.85532-1-kuniyu@amazon.com/
> >
> > Also, Linus says:
> >
> >   I really get the feeling that 'sb->oob_skb' should actually be forced
> >   to always be in sync with the receive queue by always doing the
> >   accesses under the receive_queue lock.
> >
> > ( That's in the security@ thread I added you to, but I just noticed
> >   Linus replied to the previous mail. I'll forward the mails to you. )
> >
> > > Otherwise there are other chunks touching the oob_skb where this patch
> > > does not add the receive queue spin lock protection, e.g. in
> > > unix_stream_recv_urg(), making the code a bit inconsistent.
> >
> > Yes, now the receive path is protected by unix_state_lock() and the
> > send path by unix_state_lock() and the recvq lock.
> >
> > Ideally, as Linus suggested, we should acquire the recvq lock everywhere
> > we touch oob_skb and remove the additional refcount taken by skb_get(),
> > but I thought that was too much for a fix, and I would do that
> > refactoring in the next cycle.
> >
> > What do you think ?
>
> I missed/forgot the unix_state_lock -> gc_lock ordering on net-next.
>
> What about using the receive queue lock, and acquiring it everywhere
> oob_skb is touched, without the additional refcount refactor?
>
> That would be more consistent and reasonably small. It should work on
> the new GC, too.
>
> The refcount refactor could later come on net-next, and will be less
> complex with the lock already in place.

Yeah, sounds good. Will post v2 with additional recvq locks.

Thanks!
>
> Incremental patch on top of yours, completely untested:
> ---
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 9a6ad5974dff..a489f2aef29d 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -2614,8 +2614,10 @@ static int unix_stream_recv_urg(struct unix_stream_read_state *state)
>  
>  	mutex_lock(&u->iolock);
>  	unix_state_lock(sk);
> +	spin_lock(&sk->sk_receive_queue.lock);
>  
>  	if (sock_flag(sk, SOCK_URGINLINE) || !u->oob_skb) {
> +		spin_unlock(&sk->sk_receive_queue.lock);
>  		unix_state_unlock(sk);
>  		mutex_unlock(&u->iolock);
>  		return -EINVAL;
> @@ -2627,6 +2629,7 @@ static int unix_stream_recv_urg(struct unix_stream_read_state *state)
>  		WRITE_ONCE(u->oob_skb, NULL);
>  	else
>  		skb_get(oob_skb);
> +	spin_unlock(&sk->sk_receive_queue.lock);
>  	unix_state_unlock(sk);
>  
>  	chunk = state->recv_actor(oob_skb, 0, chunk, state);
> @@ -2655,6 +2658,7 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
>  			consume_skb(skb);
>  			skb = NULL;
>  		} else {
> +			spin_lock(&sk->sk_receive_queue.lock);
>  			if (skb == u->oob_skb) {
>  				if (copied) {
>  					skb = NULL;
> @@ -2673,6 +2677,7 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
>  					skb = skb_peek(&sk->sk_receive_queue);
>  				}
>  			}
> +			spin_unlock(&sk->sk_receive_queue.lock);
>  		}
>  	}
>  	return skb;
>  }
>
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 9a6ad5974dff..6ae0370f038f 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2217,13 +2217,20 @@ static int queue_oob(struct socket *sock, struct msghdr *msg, struct sock *other
 	maybe_add_creds(skb, sock, other);
 	skb_get(skb);
 
+	scm_stat_add(other, skb);
+
+	/* oob_skb must be changed under sk_recv_queue's
+	 * lock to avoid the race with GC.
+	 */
+	spin_lock(&other->sk_receive_queue.lock);
 	if (ousk->oob_skb)
 		consume_skb(ousk->oob_skb);
 
 	WRITE_ONCE(ousk->oob_skb, skb);
 
-	scm_stat_add(other, skb);
-	skb_queue_tail(&other->sk_receive_queue, skb);
+	__skb_queue_tail(&other->sk_receive_queue, skb);
+	spin_unlock(&other->sk_receive_queue.lock);
+
 	sk_send_sigurg(other);
 	unix_state_unlock(other);
 	other->sk_data_ready(other);
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index 0104be9d4704..b87e48e2b51b 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -342,10 +342,12 @@ static void __unix_gc(struct work_struct *work)
 		scan_children(&u->sk, inc_inflight, &hitlist);
 
 #if IS_ENABLED(CONFIG_AF_UNIX_OOB)
+		spin_lock(&u->sk.sk_receive_queue.lock);
 		if (u->oob_skb) {
-			kfree_skb(u->oob_skb);
+			WARN_ON_ONCE(skb_unref(u->oob_skb));
 			u->oob_skb = NULL;
 		}
+		spin_unlock(&u->sk.sk_receive_queue.lock);
 #endif
 	}
 
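The garbage.c hunk above swaps kfree_skb() for WARN_ON_ONCE(skb_unref());
the reference pairing this relies on is sketched below (an illustration of
the intent, not code from the patch):

/*
 * queue_oob():
 *	skb_get(skb);		// extra ref for ousk->oob_skb; the original
 *				// ref stays with the skb sitting on the
 *				// receive queue
 *	__skb_queue_tail(&other->sk_receive_queue, skb);
 *
 * __unix_gc():
 *	WARN_ON_ONCE(skb_unref(u->oob_skb));	// drop only the oob_skb ref;
 *						// the queued ref remains, so
 *						// this must never hit zero
 *	u->oob_skb = NULL;
 *
 * unix_release_sock():
 *	// the skb_dequeue()/kfree_skb() loop drops the last reference
 */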
Billy Jheng Bing-Jhong reported a race between __unix_gc() and
queue_oob().

__unix_gc() tries to garbage-collect close()d inflight sockets,
and then if the socket has MSG_OOB in unix_sk(sk)->oob_skb, GC
will drop the reference and set it to NULL locklessly.

However, the peer socket can still send a MSG_OOB message to the
GC candidate, and queue_oob() can update unix_sk(sk)->oob_skb
concurrently, resulting in a NULL pointer dereference. [0]

To avoid the race, let's update unix_sk(sk)->oob_skb under
sk_receive_queue's lock.

Note that the same issue exists in the new GC, and the change
in queue_oob() can be applied as is.

[0]:
BUG: kernel NULL pointer dereference, address: 0000000000000008
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 8000000009f5e067 P4D 8000000009f5e067 PUD 9f5d067 PMD 0
Oops: 0002 [#1] PREEMPT SMP PTI
CPU: 3 PID: 50 Comm: kworker/3:1 Not tainted 6.9.0-rc5-00191-gd091e579b864 #110
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Workqueue: events delayed_fput
RIP: 0010:skb_dequeue (./include/linux/skbuff.h:2386 ./include/linux/skbuff.h:2402 net/core/skbuff.c:3847)
Code: 39 e3 74 3e 8b 43 10 48 89 ef 83 e8 01 89 43 10 49 8b 44 24 08 49 c7 44 24 08 00 00 00 00 49 8b 14 24 49 c7 04 24 00 00 00 00 <48> 89 42 08 48 89 10 e8 e7 c5 42 00 4c 89 e0 5b 5d 41 5c c3 cc cc
RSP: 0018:ffffc900001bfd48 EFLAGS: 00000002
RAX: 0000000000000000 RBX: ffff8880088f5ae8 RCX: 00000000361289f9
RDX: 0000000000000000 RSI: 0000000000000206 RDI: ffff8880088f5b00
RBP: ffff8880088f5b00 R08: 0000000000080000 R09: 0000000000000001
R10: 0000000000000003 R11: 0000000000000001 R12: ffff8880056b6a00
R13: ffff8880088f5280 R14: 0000000000000001 R15: ffff8880088f5a80
FS: 0000000000000000(0000) GS:ffff88807dd80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 0000000006314000 CR4: 00000000007506f0
PKRU: 55555554
Call Trace:
 <TASK>
 unix_release_sock (net/unix/af_unix.c:654)
 unix_release (net/unix/af_unix.c:1050)
 __sock_release (net/socket.c:660)
 sock_close (net/socket.c:1423)
 __fput (fs/file_table.c:423)
 delayed_fput (fs/file_table.c:444 (discriminator 3))
 process_one_work (kernel/workqueue.c:3259)
 worker_thread (kernel/workqueue.c:3329 kernel/workqueue.c:3416)
 kthread (kernel/kthread.c:388)
 ret_from_fork (arch/x86/kernel/process.c:153)
 ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
 </TASK>
Modules linked in:
CR2: 0000000000000008

Fixes: 1279f9d9dec2 ("af_unix: Call kfree_skb() for dead unix_(sk)->oob_skb in GC.")
Reported-by: Billy Jheng Bing-Jhong <billy@starlabs.sg>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/unix/af_unix.c | 11 +++++++++--
 net/unix/garbage.c |  4 +++-
 2 files changed, 12 insertions(+), 3 deletions(-)
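For background on the code path involved: AF_UNIX stream sockets support
MSG_OOB (when CONFIG_AF_UNIX_OOB is enabled, kernel 5.15+), and the
out-of-band byte is parked in unix_sk(sk)->oob_skb on the receiver, which is
exactly the field queue_oob(), unix_stream_recv_urg() and __unix_gc() race
on. A minimal userspace illustration of that feature (it only exercises the
normal path and is not a reproducer for the race) might look like:

#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	int sv[2];
	char c;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv))
		return 1;

	/* Sender: queues one out-of-band byte via queue_oob(). */
	if (send(sv[0], "x", 1, MSG_OOB) != 1)
		return 1;

	/* Receiver: fetches it via unix_stream_recv_urg(). */
	if (recv(sv[1], &c, 1, MSG_OOB) != 1)
		return 1;

	printf("OOB byte: %c\n", c);

	close(sv[0]);
	close(sv[1]);
	return 0;
}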