diff mbox series

[v1,net] af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock.

Message ID 20240507170018.83385-1-kuniyu@amazon.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Series [v1,net] af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock.

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 932 this patch: 932
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers warning 1 maintainers not CCed: dhowells@redhat.com
netdev/build_clang success Errors and warnings before: 938 this patch: 938
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 943 this patch: 943
netdev/checkpatch warning WARNING: Commit log lines starting with '#' are dropped by git as comments
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Kuniyuki Iwashima May 7, 2024, 5 p.m. UTC
Billy Jheng Bing-Jhong reported a race between __unix_gc() and
queue_oob().

__unix_gc() tries to garbage-collect close()d inflight sockets,
and if a socket has a MSG_OOB skb in unix_sk(sk)->oob_skb, GC
drops the reference and sets the pointer to NULL locklessly.

However, the peer socket can still send a MSG_OOB message to
the GC candidate, and queue_oob() can update unix_sk(sk)->oob_skb
concurrently, resulting in a NULL pointer dereference. [0]

To avoid the race, let's update unix_sk(sk)->oob_skb under the
sk_receive_queue's lock.

Note that the same issue exists in the new GC, and the change
in queue_oob() can be applied as is.

[0]:
BUG: kernel NULL pointer dereference, address: 0000000000000008
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 8000000009f5e067 P4D 8000000009f5e067 PUD 9f5d067 PMD 0
Oops: 0002 [#1] PREEMPT SMP PTI
CPU: 3 PID: 50 Comm: kworker/3:1 Not tainted 6.9.0-rc5-00191-gd091e579b864 #110
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Workqueue: events delayed_fput
RIP: 0010:skb_dequeue (./include/linux/skbuff.h:2386 ./include/linux/skbuff.h:2402 net/core/skbuff.c:3847)
Code: 39 e3 74 3e 8b 43 10 48 89 ef 83 e8 01 89 43 10 49 8b 44 24 08 49 c7 44 24 08 00 00 00 00 49 8b 14 24 49 c7 04 24 00 00 00 00 <48> 89 42 08 48 89 10 e8 e7 c5 42 00 4c 89 e0 5b 5d 41 5c c3 cc cc
RSP: 0018:ffffc900001bfd48 EFLAGS: 00000002
RAX: 0000000000000000 RBX: ffff8880088f5ae8 RCX: 00000000361289f9
RDX: 0000000000000000 RSI: 0000000000000206 RDI: ffff8880088f5b00
RBP: ffff8880088f5b00 R08: 0000000000080000 R09: 0000000000000001
R10: 0000000000000003 R11: 0000000000000001 R12: ffff8880056b6a00
R13: ffff8880088f5280 R14: 0000000000000001 R15: ffff8880088f5a80
FS:  0000000000000000(0000) GS:ffff88807dd80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 0000000006314000 CR4: 00000000007506f0
PKRU: 55555554
Call Trace:
 <TASK>
 unix_release_sock (net/unix/af_unix.c:654)
 unix_release (net/unix/af_unix.c:1050)
 __sock_release (net/socket.c:660)
 sock_close (net/socket.c:1423)
 __fput (fs/file_table.c:423)
 delayed_fput (fs/file_table.c:444 (discriminator 3))
 process_one_work (kernel/workqueue.c:3259)
 worker_thread (kernel/workqueue.c:3329 kernel/workqueue.c:3416)
 kthread (kernel/kthread.c:388)
 ret_from_fork (arch/x86/kernel/process.c:153)
 ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
 </TASK>
Modules linked in:
CR2: 0000000000000008

Fixes: 1279f9d9dec2 ("af_unix: Call kfree_skb() for dead unix_(sk)->oob_skb in GC.")
Reported-by: Billy Jheng Bing-Jhong <billy@starlabs.sg>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/unix/af_unix.c | 11 +++++++++--
 net/unix/garbage.c |  4 +++-
 2 files changed, 12 insertions(+), 3 deletions(-)

Comments

Paolo Abeni May 9, 2024, 9:12 a.m. UTC | #1
On Tue, 2024-05-07 at 10:00 -0700, Kuniyuki Iwashima wrote:
> Billy Jheng Bing-Jhong reported a race between __unix_gc() and
> queue_oob().
> 
> __unix_gc() tries to garbage-collect close()d inflight sockets,
> and if a socket has a MSG_OOB skb in unix_sk(sk)->oob_skb, GC
> drops the reference and sets the pointer to NULL locklessly.
> 
> However, the peer socket can still send a MSG_OOB message to
> the GC candidate, and queue_oob() can update unix_sk(sk)->oob_skb
> concurrently, resulting in a NULL pointer dereference. [0]
> 
> To avoid the race, let's update unix_sk(sk)->oob_skb under the
> sk_receive_queue's lock.

I'm sorry to delay this fix but...

AFAICS every time AF_UNIX touches the oob_skb, it's under the receiver's
unix_state_lock. The only exception is __unix_gc(). What about just
acquiring that lock there?

Otherwise there are other chunks where the oob_skb is touched and
this patch does not add the receive queue spin lock protection, e.g. in
unix_stream_recv_urg(), making the code a bit inconsistent.

Thanks

Paolo
Kuniyuki Iwashima May 10, 2024, 5:03 a.m. UTC | #2
From: Paolo Abeni <pabeni@redhat.com>
Date: Thu, 09 May 2024 11:12:38 +0200
> On Tue, 2024-05-07 at 10:00 -0700, Kuniyuki Iwashima wrote:
> > Billy Jheng Bing-Jhong reported a race between __unix_gc() and
> > queue_oob().
> > 
> > __unix_gc() tries to garbage-collect close()d inflight sockets,
> > and if a socket has a MSG_OOB skb in unix_sk(sk)->oob_skb, GC
> > drops the reference and sets the pointer to NULL locklessly.
> > 
> > However, the peer socket can still send a MSG_OOB message to
> > the GC candidate, and queue_oob() can update unix_sk(sk)->oob_skb
> > concurrently, resulting in a NULL pointer dereference. [0]
> > 
> > To avoid the race, let's update unix_sk(sk)->oob_skb under the
> > sk_receive_queue's lock.
> 
> I'm sorry to delay this fix but...
> 
> AFAICS every time AF_UNIX touches the oob_skb, it's under the receiver's
> unix_state_lock. The only exception is __unix_gc(). What about just
> acquiring that lock there?

In the new GC, there is unix_state_lock -> gc_lock ordering, and
we need another fix then.

That's why I chose locking recvq for old GC too.
https://lore.kernel.org/netdev/20240507172606.85532-1-kuniyu@amazon.com/

Also, Linus says:

    I really get the feeling that 'sb->oob_skb' should actually be forced
    to always be in sync with the receive queue by always doing the
    accesses under the receive_queue lock.

( That's in the security@ thread I added you, but I just noticed
  Linus replied to the previous mail.  I'll forward the mails to you. )


> Otherwise there are other chunks where the oob_skb is touched and
> this patch does not add the receive queue spin lock protection, e.g. in
> unix_stream_recv_urg(), making the code a bit inconsistent.

Yes, now the receive path is protected by unix_state_lock() and the
send path is by unix_state_lock() and recvq lock.

Ideally, as Linus suggested, we should acquire recvq lock everywhere
touching oob_skb and remove the additional refcount by skb_get(), but
I thought it's too much as a fix and I would do that refactoring in
the next cycle.

What do you think ?
Paolo Abeni May 10, 2024, 7:53 a.m. UTC | #3
On Fri, 2024-05-10 at 14:03 +0900, Kuniyuki Iwashima wrote:
> From: Paolo Abeni <pabeni@redhat.com>
> Date: Thu, 09 May 2024 11:12:38 +0200
> > On Tue, 2024-05-07 at 10:00 -0700, Kuniyuki Iwashima wrote:
> > > Billy Jheng Bing-Jhong reported a race between __unix_gc() and
> > > queue_oob().
> > > 
> > > __unix_gc() tries to garbage-collect close()d inflight sockets,
> > > and if a socket has a MSG_OOB skb in unix_sk(sk)->oob_skb, GC
> > > drops the reference and sets the pointer to NULL locklessly.
> > > 
> > > However, the peer socket can still send a MSG_OOB message to
> > > the GC candidate, and queue_oob() can update unix_sk(sk)->oob_skb
> > > concurrently, resulting in a NULL pointer dereference. [0]
> > > 
> > > To avoid the race, let's update unix_sk(sk)->oob_skb under the
> > > sk_receive_queue's lock.
> > 
> > I'm sorry to delay this fix but...
> > 
> > AFAICS every time AF_UNIX touches the oob_skb, it's under the receiver's
> > unix_state_lock. The only exception is __unix_gc(). What about just
> > acquiring that lock there?
> 
> In the new GC, there is unix_state_lock -> gc_lock ordering, and
> we need another fix then.
> 
> That's why I chose locking recvq for old GC too.
> https://lore.kernel.org/netdev/20240507172606.85532-1-kuniyu@amazon.com/
> 
> Also, Linus says:
> 
>     I really get the feeling that 'sb->oob_skb' should actually be forced
>     to always be in sync with the receive queue by always doing the
>     accesses under the receive_queue lock.
> 
> ( That's in the security@ thread I added you, but I just noticed
>   Linus replied to the previous mail.  I'll forward the mails to you. )
> 
> 
> > Otherwise there are other chunks where the oob_skb is touched and
> > this patch does not add the receive queue spin lock protection, e.g. in
> > unix_stream_recv_urg(), making the code a bit inconsistent.
> 
> Yes, now the receive path is protected by unix_state_lock() and the
> send path is by unix_state_lock() and recvq lock.
> 
> Ideally, as Linus suggested, we should acquire recvq lock everywhere
> touching oob_skb and remove the additional refcount by skb_get(), but
> I thought it's too much as a fix and I would do that refactoring in
> the next cycle.
> 
> What do you think ?

I missed/forgot the unix_state_lock -> gc_lock ordering on net-next.

What about using the receive queue lock, and acquiring that everywhere
oob_skb is touched, without the additional refcount refactor?

Would be more consistent and reasonably small. It should work on the
new GC, too.

The refcount refactor could later come on net-next, and will be less
complex with the lock already in place.

Incremental patch on top of yours, completely untested:
---
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 9a6ad5974dff..a489f2aef29d 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2614,8 +2614,10 @@ static int unix_stream_recv_urg(struct unix_stream_read_state *state)
 
 	mutex_lock(&u->iolock);
 	unix_state_lock(sk);
+	spin_lock(&sk->sk_receive_queue.lock);
 
 	if (sock_flag(sk, SOCK_URGINLINE) || !u->oob_skb) {
+		spin_unlock(&sk->sk_receive_queue.lock);
 		unix_state_unlock(sk);
 		mutex_unlock(&u->iolock);
 		return -EINVAL;
@@ -2627,6 +2629,7 @@ static int unix_stream_recv_urg(struct unix_stream_read_state *state)
 		WRITE_ONCE(u->oob_skb, NULL);
 	else
 		skb_get(oob_skb);
+	spin_unlock(&sk->sk_receive_queue.lock);
 	unix_state_unlock(sk);
 
 	chunk = state->recv_actor(oob_skb, 0, chunk, state);
@@ -2655,6 +2658,7 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
 		consume_skb(skb);
 		skb = NULL;
 	} else {
+		spin_lock(&sk->sk_receive_queue.lock);
 		if (skb == u->oob_skb) {
 			if (copied) {
 				skb = NULL;
@@ -2673,6 +2677,7 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
 				skb = skb_peek(&sk->sk_receive_queue);
 			}
 		}
+		spin_unlock(&sk->sk_receive_queue.lock);
 	}
 	return skb;
 }
Kuniyuki Iwashima May 10, 2024, 9:11 a.m. UTC | #4
From: Paolo Abeni <pabeni@redhat.com>
Date: Fri, 10 May 2024 09:53:25 +0200
> On Fri, 2024-05-10 at 14:03 +0900, Kuniyuki Iwashima wrote:
> > From: Paolo Abeni <pabeni@redhat.com>
> > Date: Thu, 09 May 2024 11:12:38 +0200
> > > On Tue, 2024-05-07 at 10:00 -0700, Kuniyuki Iwashima wrote:
> > > > Billy Jheng Bing-Jhong reported a race between __unix_gc() and
> > > > queue_oob().
> > > > 
> > > > __unix_gc() tries to garbage-collect close()d inflight sockets,
> > > > and if a socket has a MSG_OOB skb in unix_sk(sk)->oob_skb, GC
> > > > drops the reference and sets the pointer to NULL locklessly.
> > > > 
> > > > However, the peer socket can still send a MSG_OOB message to
> > > > the GC candidate, and queue_oob() can update unix_sk(sk)->oob_skb
> > > > concurrently, resulting in a NULL pointer dereference. [0]
> > > > 
> > > > To avoid the race, let's update unix_sk(sk)->oob_skb under the
> > > > sk_receive_queue's lock.
> > > 
> > > I'm sorry to delay this fix but...
> > > 
> > > AFAICS every time AF_UNIX touches the oob_skb, it's under the receiver's
> > > unix_state_lock. The only exception is __unix_gc(). What about just
> > > acquiring that lock there?
> > 
> > In the new GC, there is unix_state_lock -> gc_lock ordering, and
> > we need another fix then.
> > 
> > That's why I chose locking recvq for old GC too.
> > https://lore.kernel.org/netdev/20240507172606.85532-1-kuniyu@amazon.com/
> > 
> > Also, Linus says:
> > 
> >     I really get the feeling that 'sb->oob_skb' should actually be forced
> >     to always be in sync with the receive queue by always doing the
> >     accesses under the receive_queue lock.
> > 
> > ( That's in the security@ thread I added you, but I just noticed
> >   Linus replied to the previous mail.  I'll forward the mails to you. )
> > 
> > 
> > > Otherwise there are other chunks where the oob_skb is touched and
> > > this patch does not add the receive queue spin lock protection, e.g. in
> > > unix_stream_recv_urg(), making the code a bit inconsistent.
> > 
> > Yes, now the receive path is protected by unix_state_lock() and the
> > send path is by unix_state_lock() and recvq lock.
> > 
> > Ideally, as Linus suggested, we should acquire recvq lock everywhere
> > touching oob_skb and remove the additional refcount by skb_get(), but
> > I thought it's too much as a fix and I would do that refactoring in
> > the next cycle.
> > 
> > What do you think ?
> 
> I missed/forgot the unix_state_lock -> gc_lock ordering on net-next.
> 
> What about using the receive queue lock, and acquiring that everywhere
> oob_skb is touched, without the additional refcount refactor?
> 
> Would be more consistent and reasonably small. It should work on the
> new GC, too.
> 
> The refcount refactor could later come on net-next, and will be less
> complex with the lock already in place.

yeah, sounds good.

will post v2 with additional recvq locks.

Thanks!


> 
> Incremental patch on top of yours, completely untested:
> ---
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 9a6ad5974dff..a489f2aef29d 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -2614,8 +2614,10 @@ static int unix_stream_recv_urg(struct unix_stream_read_state *state)
>  
>  	mutex_lock(&u->iolock);
>  	unix_state_lock(sk);
> +	spin_lock(&sk->sk_receive_queue.lock);
>  
>  	if (sock_flag(sk, SOCK_URGINLINE) || !u->oob_skb) {
> +		spin_unlock(&sk->sk_receive_queue.lock);
>  		unix_state_unlock(sk);
>  		mutex_unlock(&u->iolock);
>  		return -EINVAL;
> @@ -2627,6 +2629,7 @@ static int unix_stream_recv_urg(struct unix_stream_read_state *state)
>  		WRITE_ONCE(u->oob_skb, NULL);
>  	else
>  		skb_get(oob_skb);
> +	spin_unlock(&sk->sk_receive_queue.lock);
>  	unix_state_unlock(sk);
>  
>  	chunk = state->recv_actor(oob_skb, 0, chunk, state);
> @@ -2655,6 +2658,7 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
>  		consume_skb(skb);
>  		skb = NULL;
>  	} else {
> +		spin_lock(&sk->sk_receive_queue.lock);
>  		if (skb == u->oob_skb) {
>  			if (copied) {
>  				skb = NULL;
> @@ -2673,6 +2677,7 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
>  				skb = skb_peek(&sk->sk_receive_queue);
>  			}
>  		}
> +		spin_unlock(&sk->sk_receive_queue.lock);
>  	}
>  	return skb;
>  }
>

Patch

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 9a6ad5974dff..6ae0370f038f 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2217,13 +2217,20 @@  static int queue_oob(struct socket *sock, struct msghdr *msg, struct sock *other
 	maybe_add_creds(skb, sock, other);
 	skb_get(skb);
 
+	scm_stat_add(other, skb);
+
+	/* oob_skb must be changed under sk_recv_queue's
+	 * lock to avoid the race with GC.
+	 */
+	spin_lock(&other->sk_receive_queue.lock);
 	if (ousk->oob_skb)
 		consume_skb(ousk->oob_skb);
 
 	WRITE_ONCE(ousk->oob_skb, skb);
 
-	scm_stat_add(other, skb);
-	skb_queue_tail(&other->sk_receive_queue, skb);
+	__skb_queue_tail(&other->sk_receive_queue, skb);
+	spin_unlock(&other->sk_receive_queue.lock);
+
 	sk_send_sigurg(other);
 	unix_state_unlock(other);
 	other->sk_data_ready(other);
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index 0104be9d4704..b87e48e2b51b 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -342,10 +342,12 @@  static void __unix_gc(struct work_struct *work)
 		scan_children(&u->sk, inc_inflight, &hitlist);
 
 #if IS_ENABLED(CONFIG_AF_UNIX_OOB)
+		spin_lock(&u->sk.sk_receive_queue.lock);
 		if (u->oob_skb) {
-			kfree_skb(u->oob_skb);
+			WARN_ON_ONCE(skb_unref(u->oob_skb));
 			u->oob_skb = NULL;
 		}
+		spin_unlock(&u->sk.sk_receive_queue.lock);
 #endif
 	}