Message ID: 20250317092257.68760-2-jiayuan.chen@linux.dev (mailing list archive)
State: New
Series: bpf: Fix use-after-free of sockmap
On Mon, Mar 17, 2025 at 05:22:54PM +0800, Jiayuan Chen wrote:
> The sk->sk_socket is not locked or referenced, and during the call to

Hm? We should have a reference in socket map, whether directly or
indirectly, right? When we add a socket to a socket map, we do call
sock_map_psock_get_checked() to obtain a reference.

> skb_send_sock(), there is a race condition with the release of sk_socket.
> All types of sockets(tcp/udp/unix/vsock) will be affected.
>
> Race conditions:
> '''
> CPU0                               CPU1
>
> skb_send_sock
>   sendmsg_unlocked
>     sock_sendmsg
>       sock_sendmsg_nosec
>                                    close(fd):
>                                      ...
>                                      ops->release()

IIRC, ->release() is only called when the refcnt of fd becomes zero, so
I wonder how we reach here despite we have a reference of psock->refcnt?

>                                        sock_map_close()
>                                      sk_socket->ops = NULL
>                                      free(socket)
>       sock->ops->sendmsg
>             ^
>             panic here

Thanks.
2025/3/20 07:02, "Cong Wang" <xiyou.wangcong@gmail.com> wrote:
> On Mon, Mar 17, 2025 at 05:22:54PM +0800, Jiayuan Chen wrote:
> > The sk->sk_socket is not locked or referenced, and during the call to
>
> Hm? We should have a reference in socket map, whether directly or
> indirectly, right? When we add a socket to a socket map, we do call
> sock_map_psock_get_checked() to obtain a reference.

Yes, but we remove the psock from the sockmap when sock_map_close() is
called:
'''
sock_map_close
    lock_sock(sk);
    rcu_read_lock();
    psock = sk_psock(sk);
    // here we remove psock and the reference of psock becomes 0
    sock_map_remove_links(sk, psock)
    psock = sk_psock_get(sk);
    if (unlikely(!psock))
        goto no_psock; <=== jmp to no_psock
    rcu_read_unlock();
    release_sock(sk);
    cancel_delayed_work_sync(&psock->work); <== no chance to run cancel
'''
So I think we should hold the psock while the backlog is running.

Thanks,
On Wed, Mar 19, 2025 at 11:36:13PM +0000, Jiayuan Chen wrote:
> 2025/3/20 07:02, "Cong Wang" <xiyou.wangcong@gmail.com> wrote:
> > Hm? We should have a reference in socket map, whether directly or
> > indirectly, right? When we add a socket to a socket map, we do call
> > sock_map_psock_get_checked() to obtain a reference.
>
> Yes, but we remove psock from sockmap when sock_map_close() was called
> '''
> sock_map_close
>     lock_sock(sk);
>     rcu_read_lock();
>     psock = sk_psock(sk);
>     // here we remove psock and the reference of psock become 0
>     sock_map_remove_links(sk, psock)

sk_psock_drop() also calls cancel_delayed_work_sync(&psock->work),
although in yet another work. Is this also a contribution to this bug?

>     psock = sk_psock_get(sk);
>     if (unlikely(!psock))
>         goto no_psock; <=== jmp to no_psock
>     rcu_read_unlock();
>     release_sock(sk);
>     cancel_delayed_work_sync(&psock->work); <== no chance to run cancel
> '''

I have to say sock_map_close() becomes harder and harder to understand
now. And I am feeling we may have more bugs since we have two in-flight
works here: psock->rwork and psock->work.

Thanks.
March 20, 2025 at 08:06, "Cong Wang" <xiyou.wangcong@gmail.com> wrote:
> On Wed, Mar 19, 2025 at 11:36:13PM +0000, Jiayuan Chen wrote:
> > Yes, but we remove psock from sockmap when sock_map_close() was called
> > '''
> > sock_map_close
> >     lock_sock(sk);
> >     rcu_read_lock();
> >     psock = sk_psock(sk);
> >     // here we remove psock and the reference of psock become 0
> >     sock_map_remove_links(sk, psock)
>
> sk_psock_drop() also calls cancel_delayed_work_sync(&psock->work),
> although in yet another work. Is this also a contribution to this bug?

Maybe it's related, but calling cancel_delayed_work_sync() in
sk_psock_drop() is too late for our scenario.

To be more precise, the core goal of this patch is to prevent
sock_map_close() from executing until the backlog work completes. This is
because sock_map_close() resides in the close(fd) path; once it finishes,
subsequent steps will release the sk_socket. Therefore, performing the
cancellation in sk_psock_drop() is too late.

Upon reviewing historical commits, I found that the backlog work
originally held lock_sk, which naturally synchronized with the lock_sk in
sock_map_close(). However, when the backlog work later dropped lock_sk, an
alternative synchronization mechanism (such as holding a psock reference,
as this patch does) became necessary.
> >     psock = sk_psock_get(sk);
> >     if (unlikely(!psock))
> >         goto no_psock; <=== jmp to no_psock
> >     rcu_read_unlock();
> >     release_sock(sk);
> >     cancel_delayed_work_sync(&psock->work); <== no chance to run cancel
> > '''
>
> I have to say sock_map_close() becomes harder and harder to understand
> now. And I am feeling we may have more bugs since we have two in-flight
> works here: psock->rwork and psock->work.
>
> Thanks.

Yes, this patch prevents sock_map_close() from executing until the backlog
work completes, which likely makes the cancel_delayed_work() in
sk_psock_destroy() redundant.

The code has undergone too many iterations. While sk_psock_destroy()
certainly contains redundant operations, we should retain it for now;
there may be hidden dependencies we haven't fully untangled.

Thanks.
On 3/17/25 10:22, Jiayuan Chen wrote:
> The sk->sk_socket is not locked or referenced, and during the call to
> skb_send_sock(), there is a race condition with the release of sk_socket.
> All types of sockets(tcp/udp/unix/vsock) will be affected.
> ...
> Some approach I tried
> ...
> 2. Increased the reference of sk_socket->file:
>    - If the user calls close(fd), we will do nothing because the reference
>      count is not set to 0. It's unexpected.

Have you considered bumping file's refcnt only for the time of
send/callback? Along the lines of:

static struct file *sock_get_file(struct sock *sk)
{
	struct file *file = NULL;
	struct socket *sock;

	rcu_read_lock();
	sock = sk->sk_socket;
	if (sock)
		file = get_file_active(&sock->file);
	rcu_read_unlock();

	return file;
}

static int sk_psock_handle_skb(struct sk_psock *psock, struct sk_buff *skb,
			       u32 off, u32 len, bool ingress)
{
	int err;

	if (!ingress) {
		struct sock *sk = psock->sk;
		struct file *file;
		...
		file = sock_get_file(sk);
		if (!file)
			return -EIO;

		err = skb_send_sock(sk, skb, off, len);
		fput(file);
		return err;
	}
	...
}

static void sk_psock_verdict_data_ready(struct sock *sk)
{
	struct file *file;
	...
	file = sock_get_file(sk);
	if (!file)
		return;

	copied = sk->sk_socket->ops->read_skb(sk, sk_psock_verdict_recv);
	fput(file);
	...
}
March 20, 2025 at 20:32, "Michal Luczaj" <mhal@rbox.co> wrote:
> Have you considered bumping file's refcnt only for the time of
> send/callback? Along the lines of:
>
> static struct file *sock_get_file(struct sock *sk)
> {
> 	struct file *file = NULL;
> 	struct socket *sock;
>
> 	rcu_read_lock();
> 	sock = sk->sk_socket;
> 	if (sock)
> 		file = get_file_active(&sock->file);
> 	rcu_read_unlock();
>
> 	return file;
> }
> [...]

Thank you for your suggestions. I previously attempted a similar approach
in another version, but felt that manipulating the file layer within
sockmap was quite odd. Moreover, the actual process flow is more complex
than we initially imagined.
The current overall close process looks roughly like this:
'''
close(fd):
  file_ref_put(file):
    __sock_release(sock)
      sock_map_close()
        saved_close()
          sk->sk_socket = NULL
      sock->ops = NULL
      sock->file = NULL
'''
We can observe that while sk->sk_socket might not be NULL, that doesn't
guarantee sock->file is not NULL. This means the following code is
problematic:
'''
sock = sk->sk_socket;
if (sock)
	sock->file->xx <== sock->file may be set to NULL
'''

Of course, we might try something like:
'''
try_hold_sock() {
	rcu_read_lock();
	sock = sk->sk_socket;
	if (!sock)
		unlock_return;
	file = sock->file;
	if (!file)
		unlock_return;
	file = get_file_active(&file);
	rcu_read_unlock();
	return file;
}
'''

Acquiring the socket's reference count requires significant effort, and we
need to pay special attention to the socket's own release process to
ensure correctness. Ultimately, this led me to decide to operate on the
psock instead of directly manipulating the file layer, to avoid this
issue.
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 0ddc4c718833..6101c1bb279a 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -655,6 +655,14 @@ static void sk_psock_backlog(struct work_struct *work)
 	bool ingress;
 	int ret;
 
+	/* Increment the psock refcnt to synchronize with close(fd) path in
+	 * sock_map_close(), ensuring we wait for backlog thread completion
+	 * before sk_socket freed. If refcnt increment fails, it indicates
+	 * sock_map_close() completed with sk_socket potentially already freed.
+	 */
+	if (!sk_psock_get(psock->sk))
+		return;
+
 	mutex_lock(&psock->work_mutex);
 	if (unlikely(state->len)) {
 		len = state->len;
@@ -702,6 +710,7 @@ static void sk_psock_backlog(struct work_struct *work)
 	}
 end:
 	mutex_unlock(&psock->work_mutex);
+	sk_psock_put(psock->sk, psock);
 }
 
 struct sk_psock *sk_psock_init(struct sock *sk, int node)
The sk->sk_socket is not locked or referenced, and during the call to
skb_send_sock(), there is a race condition with the release of sk_socket.
All types of sockets(tcp/udp/unix/vsock) will be affected.

Race conditions:
'''
CPU0                               CPU1

skb_send_sock
  sendmsg_unlocked
    sock_sendmsg
      sock_sendmsg_nosec
                                   close(fd):
                                     ...
                                     ops->release()
                                       sock_map_close()
                                     sk_socket->ops = NULL
                                     free(socket)
      sock->ops->sendmsg
            ^
            panic here
'''

Based on the fact that we already wait for the workqueue to finish in
sock_map_close() if psock is held, we simply increase the psock reference
count to avoid race conditions.

'''
void sock_map_close()
{
	...
	if (likely(psock)) {
		...
		psock = sk_psock_get(sk);
		if (unlikely(!psock))
			goto no_psock; <=== Control usually jumps here via goto
		...
		cancel_delayed_work_sync(&psock->work); <=== not executed
		sk_psock_put(sk, psock);
		...
	}
'''

The panic I caught:
'''
Workqueue: events sk_psock_backlog
RIP: 0010:sock_sendmsg+0x21d/0x440
RAX: 0000000000000000 RBX: ffffc9000521fad8 RCX: 0000000000000001
...
Call Trace:
 <TASK>
 ? die_addr+0x40/0xa0
 ? exc_general_protection+0x14c/0x230
 ? asm_exc_general_protection+0x26/0x30
 ? sock_sendmsg+0x21d/0x440
 ? sock_sendmsg+0x3e0/0x440
 ? __pfx_sock_sendmsg+0x10/0x10
 __skb_send_sock+0x543/0xb70
 sk_psock_backlog+0x247/0xb80
...
'''

Reported-by: Michal Luczaj <mhal@rbox.co>
Fixes: 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
---
Some approaches I tried:
1. Add RCU:
   - RCU conflicts with mutex_lock in the Unix socket send path.
   - Race conditions still exist when reading sk->sk_socket->ops in the
     current sock_sendmsg implementation.
2. Increase the reference count of sk_socket->file:
   - If the user calls close(fd), we will do nothing because the reference
     count is not set to 0. It's unexpected.
3. Use sock_lock when calling skb_send_sock:
   - skb_send_sock itself already does the locking.
   - If we call skb_send_sock_locked instead, we have to implement
     sendmsg_locked for each protocol, which is not easy for UDP or Unix,
     as the sending process involves frequent locking and unlocking, which
     makes it challenging to isolate the locking logic.
---
 net/core/skmsg.c | 9 +++++++++
 1 file changed, 9 insertions(+)