diff mbox series

[bpf,v2,1/2] bpf: sockmap, af_unix stream sockets need to hold ref for pair sock

Message ID 20231122192452.335312-2-john.fastabend@gmail.com (mailing list archive)
State Superseded
Delegated to: BPF
Headers show
Series sockmap fix for KASAN_VMALLOC and af_unix | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for bpf
netdev/apply fail Patch does not apply to bpf
bpf/vmtest-bpf-VM_Test-0 success Logs for Lint
bpf/vmtest-bpf-VM_Test-3 success Logs for aarch64-gcc / build / build for aarch64 with gcc
bpf/vmtest-bpf-VM_Test-4 success Logs for aarch64-gcc / test (test_maps, false, 360) / test_maps on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-5 success Logs for aarch64-gcc / test (test_progs, false, 360) / test_progs on aarch64 with gcc
bpf/vmtest-bpf-PR success PR summary
bpf/vmtest-bpf-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-VM_Test-6 success Logs for aarch64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-8 success Logs for aarch64-gcc / veristat
bpf/vmtest-bpf-VM_Test-11 success Logs for s390x-gcc / test (test_progs, false, 360) / test_progs on s390x with gcc
bpf/vmtest-bpf-VM_Test-7 success Logs for aarch64-gcc / test (test_verifier, false, 360) / test_verifier on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-9 success Logs for s390x-gcc / build / build for s390x with gcc
bpf/vmtest-bpf-VM_Test-2 success Logs for Validate matrix.py
bpf/vmtest-bpf-VM_Test-12 success Logs for s390x-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-VM_Test-10 success Logs for s390x-gcc / test (test_maps, false, 360) / test_maps on s390x with gcc
bpf/vmtest-bpf-VM_Test-14 success Logs for s390x-gcc / veristat
bpf/vmtest-bpf-VM_Test-13 success Logs for s390x-gcc / test (test_verifier, false, 360) / test_verifier on s390x with gcc
bpf/vmtest-bpf-VM_Test-16 success Logs for x86_64-gcc / build / build for x86_64 with gcc
bpf/vmtest-bpf-VM_Test-19 success Logs for x86_64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-27 success Logs for x86_64-llvm-16 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-15 success Logs for set-matrix
bpf/vmtest-bpf-VM_Test-17 success Logs for x86_64-gcc / test (test_maps, false, 360) / test_maps on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-20 success Logs for x86_64-gcc / test (test_progs_no_alu32_parallel, true, 30) / test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-18 success Logs for x86_64-gcc / test (test_progs, false, 360) / test_progs on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-22 success Logs for x86_64-gcc / test (test_verifier, false, 360) / test_verifier on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-23 success Logs for x86_64-gcc / veristat / veristat on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-26 success Logs for x86_64-llvm-16 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-24 success Logs for x86_64-llvm-16 / build / build for x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-21 success Logs for x86_64-gcc / test (test_progs_parallel, true, 30) / test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-28 success Logs for x86_64-llvm-16 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-29 success Logs for x86_64-llvm-16 / veristat
bpf/vmtest-bpf-VM_Test-25 success Logs for x86_64-llvm-16 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-16

Commit Message

John Fastabend Nov. 22, 2023, 7:24 p.m. UTC
AF_UNIX stream sockets are a paired socket. So sending on one of the pairs
will lookup the paired socket as part of the send operation. It is possible
however to put just one of the pairs in a BPF map. This currently
increments the refcnt on the sock in the sockmap to ensure it is not
free'd by the stack before sockmap cleans up its state and stops any
skbs being sent/recv'd to that socket.

But we missed a case. If the peer socket is closed it will be
free'd by the stack. However, the paired socket can still be
referenced from BPF sockmap side because we hold a reference
there. Then if we are sending traffic through BPF sockmap to
that socket it will try to dereference the free'd pair in its
send logic creating a use after free.  And following splat,

   [59.900375] BUG: KASAN: slab-use-after-free in sk_wake_async+0x31/0x1b0
   [59.901211] Read of size 8 at addr ffff88811acbf060 by task kworker/1:2/954
   [...]
   [59.905468] Call Trace:
   [59.905787]  <TASK>
   [59.906066]  dump_stack_lvl+0x130/0x1d0
   [59.908877]  print_report+0x16f/0x740
   [59.910629]  kasan_report+0x118/0x160
   [59.912576]  sk_wake_async+0x31/0x1b0
   [59.913554]  sock_def_readable+0x156/0x2a0
   [59.914060]  unix_stream_sendmsg+0x3f9/0x12a0
   [59.916398]  sock_sendmsg+0x20e/0x250
   [59.916854]  skb_send_sock+0x236/0xac0
   [59.920527]  sk_psock_backlog+0x287/0xaa0

To fix let BPF sockmap hold a refcnt on both the socket in the
sockmap and its paired socket.  It wasn't obvious how to contain
the fix to bpf_unix logic. The primarily problem with keeping this
logic in bpf_unix was: In the sock close() we could handle the
deref by having a close handler. But, when we are destroying the
psock through a map delete operation we wouldn't have gotten any
signal thorugh the proto struct other than it being replaced.
If we do the deref from the proto replace its too early because
we need to deref the skpair after the backlog worker has been
stopped.

Given all this it seems best to just cache it at the end of the
psock and eat 8B for the af_unix and vsock users.

Notice dgram sockets are OK because they handle locking already.

Fixes: 94531cfcbe79 ("af_unix: Add unix_stream_proto for sockmap")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 include/linux/skmsg.h | 1 +
 include/net/af_unix.h | 1 +
 net/core/skmsg.c      | 2 ++
 net/unix/af_unix.c    | 2 --
 net/unix/unix_bpf.c   | 5 +++++
 5 files changed, 9 insertions(+), 2 deletions(-)

Comments

Yonghong Song Nov. 23, 2023, 5:55 p.m. UTC | #1
On 11/22/23 2:24 PM, John Fastabend wrote:
> AF_UNIX stream sockets are a paired socket. So sending on one of the pairs
> will lookup the paired socket as part of the send operation. It is possible
> however to put just one of the pairs in a BPF map. This currently
> increments the refcnt on the sock in the sockmap to ensure it is not
> free'd by the stack before sockmap cleans up its state and stops any
> skbs being sent/recv'd to that socket.
>
> But we missed a case. If the peer socket is closed it will be
> free'd by the stack. However, the paired socket can still be
> referenced from BPF sockmap side because we hold a reference
> there. Then if we are sending traffic through BPF sockmap to
> that socket it will try to dereference the free'd pair in its
> send logic creating a use after free.  And following splat,
>
>     [59.900375] BUG: KASAN: slab-use-after-free in sk_wake_async+0x31/0x1b0
>     [59.901211] Read of size 8 at addr ffff88811acbf060 by task kworker/1:2/954
>     [...]
>     [59.905468] Call Trace:
>     [59.905787]  <TASK>
>     [59.906066]  dump_stack_lvl+0x130/0x1d0
>     [59.908877]  print_report+0x16f/0x740
>     [59.910629]  kasan_report+0x118/0x160
>     [59.912576]  sk_wake_async+0x31/0x1b0
>     [59.913554]  sock_def_readable+0x156/0x2a0
>     [59.914060]  unix_stream_sendmsg+0x3f9/0x12a0
>     [59.916398]  sock_sendmsg+0x20e/0x250
>     [59.916854]  skb_send_sock+0x236/0xac0
>     [59.920527]  sk_psock_backlog+0x287/0xaa0
>
> To fix let BPF sockmap hold a refcnt on both the socket in the
> sockmap and its paired socket.  It wasn't obvious how to contain
> the fix to bpf_unix logic. The primarily problem with keeping this
> logic in bpf_unix was: In the sock close() we could handle the
> deref by having a close handler. But, when we are destroying the
> psock through a map delete operation we wouldn't have gotten any
> signal thorugh the proto struct other than it being replaced.
> If we do the deref from the proto replace its too early because
> we need to deref the skpair after the backlog worker has been
> stopped.
>
> Given all this it seems best to just cache it at the end of the
> psock and eat 8B for the af_unix and vsock users.
>
> Notice dgram sockets are OK because they handle locking already.
>
> Fixes: 94531cfcbe79 ("af_unix: Add unix_stream_proto for sockmap")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---
>   include/linux/skmsg.h | 1 +
>   include/net/af_unix.h | 1 +
>   net/core/skmsg.c      | 2 ++
>   net/unix/af_unix.c    | 2 --
>   net/unix/unix_bpf.c   | 5 +++++
>   5 files changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
> index c1637515a8a4..fbe628961cf8 100644
> --- a/include/linux/skmsg.h
> +++ b/include/linux/skmsg.h
> @@ -106,6 +106,7 @@ struct sk_psock {
>   	struct mutex			work_mutex;
>   	struct sk_psock_work_state	work_state;
>   	struct delayed_work		work;
> +	struct sock			*skpair;
>   	struct rcu_work			rwork;
>   };
>   
> diff --git a/include/net/af_unix.h b/include/net/af_unix.h
> index 824c258143a3..49c4640027d8 100644
> --- a/include/net/af_unix.h
> +++ b/include/net/af_unix.h
> @@ -75,6 +75,7 @@ struct unix_sock {
>   };
>   
>   #define unix_sk(ptr) container_of_const(ptr, struct unix_sock, sk)
> +#define unix_peer(sk) (unix_sk(sk)->peer)
>   
>   #define peer_wait peer_wq.wait
>   
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index 6c31eefbd777..6236164b9bce 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -826,6 +826,8 @@ static void sk_psock_destroy(struct work_struct *work)
>   
>   	if (psock->sk_redir)
>   		sock_put(psock->sk_redir);
> +	if (psock->skpair)
> +		sock_put(psock->skpair);
>   	sock_put(psock->sk);
>   	kfree(psock);
>   }
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 3e8a04a13668..87dd723aacf9 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -212,8 +212,6 @@ static inline bool unix_secdata_eq(struct scm_cookie *scm, struct sk_buff *skb)
>   }
>   #endif /* CONFIG_SECURITY_NETWORK */
>   
> -#define unix_peer(sk) (unix_sk(sk)->peer)
> -
>   static inline int unix_our_peer(struct sock *sk, struct sock *osk)
>   {
>   	return unix_peer(osk) == sk;
> diff --git a/net/unix/unix_bpf.c b/net/unix/unix_bpf.c
> index 2f9d8271c6ec..7e7791029198 100644
> --- a/net/unix/unix_bpf.c
> +++ b/net/unix/unix_bpf.c
> @@ -159,12 +159,17 @@ int unix_dgram_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool re
>   
>   int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
>   {
> +	struct sock *skpair = unix_peer(sk);

The above assignment is redundant.

> +
>   	if (restore) {
>   		sk->sk_write_space = psock->saved_write_space;
>   		sock_replace_proto(sk, psock->sk_proto);
>   		return 0;
>   	}
>   
> +	skpair = unix_peer(sk);
> +	sock_hold(skpair);
> +	psock->skpair = skpair;
>   	unix_stream_bpf_check_needs_rebuild(psock->sk_proto);
>   	sock_replace_proto(sk, &unix_stream_bpf_prot);
>   	return 0;
Cong Wang Nov. 23, 2023, 8:24 p.m. UTC | #2
On Wed, Nov 22, 2023 at 11:24:51AM -0800, John Fastabend wrote:
> AF_UNIX stream sockets are a paired socket. So sending on one of the pairs
> will lookup the paired socket as part of the send operation. It is possible
> however to put just one of the pairs in a BPF map. This currently
> increments the refcnt on the sock in the sockmap to ensure it is not
> free'd by the stack before sockmap cleans up its state and stops any
> skbs being sent/recv'd to that socket.
> 
> But we missed a case. If the peer socket is closed it will be
> free'd by the stack. However, the paired socket can still be
> referenced from BPF sockmap side because we hold a reference
> there. Then if we are sending traffic through BPF sockmap to
> that socket it will try to dereference the free'd pair in its
> send logic creating a use after free.  And following splat,

Hmm, how could it pass the SOCK_DEAD test in unix_stream_sendmsg()?

2285                 unix_state_lock(other);
2286
2287                 if (sock_flag(other, SOCK_DEAD) ||
2288                     (other->sk_shutdown & RCV_SHUTDOWN))
2289                         goto pipe_err_free;


Thanks.
Jakub Sitnicki Nov. 24, 2023, 1:43 p.m. UTC | #3
On Thu, Nov 23, 2023 at 12:24 PM -08, Cong Wang wrote:
> On Wed, Nov 22, 2023 at 11:24:51AM -0800, John Fastabend wrote:
>> AF_UNIX stream sockets are a paired socket. So sending on one of the pairs
>> will lookup the paired socket as part of the send operation. It is possible
>> however to put just one of the pairs in a BPF map. This currently
>> increments the refcnt on the sock in the sockmap to ensure it is not
>> free'd by the stack before sockmap cleans up its state and stops any
>> skbs being sent/recv'd to that socket.
>> 
>> But we missed a case. If the peer socket is closed it will be
>> free'd by the stack. However, the paired socket can still be
>> referenced from BPF sockmap side because we hold a reference
>> there. Then if we are sending traffic through BPF sockmap to
>> that socket it will try to dereference the free'd pair in its
>> send logic creating a use after free.  And following splat,
>
> Hmm, how could it pass the SOCK_DEAD test in unix_stream_sendmsg()?
>
> 2285                 unix_state_lock(other);
> 2286
> 2287                 if (sock_flag(other, SOCK_DEAD) ||
> 2288                     (other->sk_shutdown & RCV_SHUTDOWN))
> 2289                         goto pipe_err_free;

The quoted UAF happens after unix_state_unlock(other):

  2285                  unix_state_lock(other);
  2286
  2287                  if (sock_flag(other, SOCK_DEAD) ||
  2288                      (other->sk_shutdown & RCV_SHUTDOWN))
  2289                          goto pipe_err_free;
  2290
  2291                  maybe_add_creds(skb, sock, other);
  2292                  scm_stat_add(other, skb);
  2293                  skb_queue_tail(&other->sk_receive_queue, skb);
  2294                  unix_state_unlock(other);
  2295                  other->sk_data_ready(other); <-- UAF

Although, I think I saw it happen at unix_state_lock(other) as well.

We don't hold a ref on other, so we're racing with __sock_release /
unix_release_sock.
diff mbox series

Patch

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index c1637515a8a4..fbe628961cf8 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -106,6 +106,7 @@  struct sk_psock {
 	struct mutex			work_mutex;
 	struct sk_psock_work_state	work_state;
 	struct delayed_work		work;
+	struct sock			*skpair;
 	struct rcu_work			rwork;
 };
 
diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 824c258143a3..49c4640027d8 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -75,6 +75,7 @@  struct unix_sock {
 };
 
 #define unix_sk(ptr) container_of_const(ptr, struct unix_sock, sk)
+#define unix_peer(sk) (unix_sk(sk)->peer)
 
 #define peer_wait peer_wq.wait
 
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 6c31eefbd777..6236164b9bce 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -826,6 +826,8 @@  static void sk_psock_destroy(struct work_struct *work)
 
 	if (psock->sk_redir)
 		sock_put(psock->sk_redir);
+	if (psock->skpair)
+		sock_put(psock->skpair);
 	sock_put(psock->sk);
 	kfree(psock);
 }
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 3e8a04a13668..87dd723aacf9 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -212,8 +212,6 @@  static inline bool unix_secdata_eq(struct scm_cookie *scm, struct sk_buff *skb)
 }
 #endif /* CONFIG_SECURITY_NETWORK */
 
-#define unix_peer(sk) (unix_sk(sk)->peer)
-
 static inline int unix_our_peer(struct sock *sk, struct sock *osk)
 {
 	return unix_peer(osk) == sk;
diff --git a/net/unix/unix_bpf.c b/net/unix/unix_bpf.c
index 2f9d8271c6ec..7e7791029198 100644
--- a/net/unix/unix_bpf.c
+++ b/net/unix/unix_bpf.c
@@ -159,12 +159,17 @@  int unix_dgram_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool re
 
 int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
 {
+	struct sock *skpair = unix_peer(sk);
+
 	if (restore) {
 		sk->sk_write_space = psock->saved_write_space;
 		sock_replace_proto(sk, psock->sk_proto);
 		return 0;
 	}
 
+	skpair = unix_peer(sk);
+	sock_hold(skpair);
+	psock->skpair = skpair;
 	unix_stream_bpf_check_needs_rebuild(psock->sk_proto);
 	sock_replace_proto(sk, &unix_stream_bpf_prot);
 	return 0;