[net,v2] net: sockmap: avoid race between sock_map_destroy() and sk_psock_put()

Message ID 20240910114354.14283-1-dmantipov@yandex.ru (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 16 this patch: 16
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers fail 1 blamed authors not CCed: ast@kernel.org; 3 maintainers not CCed: bpf@vger.kernel.org ast@kernel.org edumazet@google.com
netdev/build_clang success Errors and warnings before: 16 this patch: 16
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 16 this patch: 16
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 19 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-09-26--21-00 (tests: 768)

Commit Message

Dmitry Antipov Sept. 10, 2024, 11:43 a.m. UTC
Syzbot has triggered the following race condition:

On CPU0, 'sk_psock_drop()' (most likely scheduled from 'sock_map_unref()'
called by 'sock_map_update_common()') is running at [1]:

void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
{
        write_lock_bh(&sk->sk_callback_lock);
        sk_psock_restore_proto(sk, psock);                              [1]
        rcu_assign_sk_user_data(sk, NULL);                              [2]
        ...
}

If 'sock_map_destroy()' is scheduled on CPU1 at the same time, 'psock' is
always NULL at [3] ('sk_psock_get()' fails because the refcount is already
zero). But since [1] may still be in progress when [4] runs, the value of
'saved_destroy' at that point is undefined:

void sock_map_destroy(struct sock *sk)
{
        void (*saved_destroy)(struct sock *sk);
        struct sk_psock *psock;

        rcu_read_lock();
        psock = sk_psock_get(sk);                                       [3]
        if (unlikely(!psock)) {
                rcu_read_unlock();
                saved_destroy = READ_ONCE(sk->sk_prot)->destroy;        [4]
        } else {
                saved_destroy = psock->saved_destroy;                   [5]
                sock_map_remove_links(sk, psock);
                rcu_read_unlock();
                sk_psock_stop(psock);
                sk_psock_put(sk, psock);
        }
        if (WARN_ON_ONCE(saved_destroy == sock_map_destroy))
                return;
        if (saved_destroy)
                saved_destroy(sk);
}
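
A possible interleaving of the annotated lines (editor's sketch, in the
same style as the diagram later in this thread):

CPU0                                    CPU1
====                                    ====
sk_psock_put()
  refcount drops to zero
  sk_psock_drop()
                                        sock_map_destroy()
                                          psock = sk_psock_get()            [3]
                                          (refcount is zero, so psock
                                           is NULL)
    write_lock_bh(&sk->sk_callback_lock)
    sk_psock_restore_proto()            [1]
                                          saved_destroy =
                                            READ_ONCE(sk->sk_prot)->destroy [4]
                                          (races with [1]: may still
                                           observe sock_map_destroy itself)
    rcu_assign_sk_user_data(sk, NULL)   [2]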

Fix this issue in 3 steps:

1. Prefer 'sk_psock()' over 'sk_psock_get()' at [3]. Since the former
   ignores the refcount, 'psock' remains non-NULL until [2] has completed.

2. Take 'sk->sk_callback_lock' for reading around [5], to make sure that
   [1] is not in progress while [5] executes.

3. Since 'sk_psock()' does not take a reference, drop 'sk_psock_put()'
   and the redundant 'sk_psock_stop()' (the latter is executed by
   'sk_psock_drop()' anyway). The resulting function is sketched below.
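
For reference, this is how 'sock_map_destroy()' looks with all three
steps applied (reconstructed from the diff at the bottom of this page;
the comments are the editor's):

void sock_map_destroy(struct sock *sk)
{
        void (*saved_destroy)(struct sock *sk);
        struct sk_psock *psock;

        rcu_read_lock();
        psock = sk_psock(sk);           /* step 1: no refcount taken */
        if (unlikely(!psock)) {
                rcu_read_unlock();
                saved_destroy = READ_ONCE(sk->sk_prot)->destroy;
        } else {
                /* step 2: [1] runs under write_lock_bh() on the same
                 * lock, so it cannot be in progress here
                 */
                read_lock_bh(&sk->sk_callback_lock);
                saved_destroy = psock->saved_destroy;
                read_unlock_bh(&sk->sk_callback_lock);
                sock_map_remove_links(sk, psock);
                rcu_read_unlock();
                /* step 3: no sk_psock_stop()/sk_psock_put() */
        }
        if (WARN_ON_ONCE(saved_destroy == sock_map_destroy))
                return;
        if (saved_destroy)
                saved_destroy(sk);
}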

Fixes: 5b4a79ba65a1 ("bpf, sockmap: Don't let sock_map_{close,destroy,unhash} call itself")
Reported-by: syzbot+f363afac6b0ace576f45@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=f363afac6b0ace576f45
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
---
 net/core/sock_map.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Comments

Paolo Abeni Sept. 19, 2024, 8:51 a.m. UTC | #1
On 9/10/24 13:43, Dmitry Antipov wrote:
> Syzbot has triggered the following race condition:
> 
> On CPU0, 'sk_psock_drop()' (most likely scheduled from 'sock_map_unref()'
> called by 'sock_map_update_common()') is running at [1]:
> 
> void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
> {
>          write_lock_bh(&sk->sk_callback_lock);
>          sk_psock_restore_proto(sk, psock);                              [1]
>          rcu_assign_sk_user_data(sk, NULL);                              [2]
>          ...
> }
> 
> If 'sock_map_destroy()' is scheduled on CPU1 at the same time, 'psock' is
> always NULL at [3] ('sk_psock_get()' fails because the refcount is already
> zero). But since [1] may still be in progress when [4] runs, the value of
> 'saved_destroy' at that point is undefined:
> 
> void sock_map_destroy(struct sock *sk)
> {
>          void (*saved_destroy)(struct sock *sk);
>          struct sk_psock *psock;
> 
>          rcu_read_lock();
>          psock = sk_psock_get(sk);                                       [3]
>          if (unlikely(!psock)) {
>                  rcu_read_unlock();
>                  saved_destroy = READ_ONCE(sk->sk_prot)->destroy;        [4]
>          } else {
>                  saved_destroy = psock->saved_destroy;                   [5]
>                  sock_map_remove_links(sk, psock);
>                  rcu_read_unlock();
>                  sk_psock_stop(psock);
>                  sk_psock_put(sk, psock);
>          }
>          if (WARN_ON_ONCE(saved_destroy == sock_map_destroy))
>                  return;
>          if (saved_destroy)
>                  saved_destroy(sk);
> }
> 
> Fix this issue in 3 steps:
> 
> 1. Prefer 'sk_psock()' over 'sk_psock_get()' at [3]. Since the former
>     ignores the refcount, 'psock' remains non-NULL until [2] has completed.
> 
> 2. Take 'sk->sk_callback_lock' for reading around [5], to make sure that
>     [1] is not in progress while [5] executes.
> 
> 3. Since 'sk_psock()' does not take a reference, drop 'sk_psock_put()'
>     and the redundant 'sk_psock_stop()' (the latter is executed by
>     'sk_psock_drop()' anyway).
> 
> Fixes: 5b4a79ba65a1 ("bpf, sockmap: Don't let sock_map_{close,destroy,unhash} call itself")
> Reported-by: syzbot+f363afac6b0ace576f45@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=f363afac6b0ace576f45
> Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>

I think Cong's question is still pending: why could/should this not be
addressed in the RDS code instead?

Thanks,

Paolo
Dmitry Antipov Sept. 19, 2024, 9:26 a.m. UTC | #2
On 9/19/24 11:51 AM, Paolo Abeni wrote:

> I think Cong's question is still pending: why could/should this not be addressed in the RDS code instead?

Hm. Except for the 'rds_tcp_accept_xxx()' functions, the rest of the
backtrace is contributed by generic TCP and IP code. I tried to debug
this issue starting from the closest internals, and it seems I have
found an issue within the sockmap code. Anyway, I strongly disagree
with the "unless otherwise proven, there are a lot of bugs everywhere
except the subsystem I'm responsible for" bias; a bit more friendly
and cooperative effort should give us better results.

Dmitry
Paolo Abeni Sept. 24, 2024, 8:23 a.m. UTC | #3
On 9/19/24 11:26, Dmitry Antipov wrote:
> On 9/19/24 11:51 AM, Paolo Abeni wrote:
> 
>> I think Cong's question is still pending: why could/should this not be addressed in the RDS code instead?
> 
> Hm. Except for the 'rds_tcp_accept_xxx()' functions, the rest of the
> backtrace is contributed by generic TCP and IP code.

AFAICS most of RDS is built on top of TCP, so most RDS-specific bugs will 
show a lot of tcp/ip in their backtraces.

E.g. with mptcp we had quite a few bugs with a 'tcp mostly' or 'tcp 
only' backtrace that were root-caused in the mptcp code.

> I tried to debug this issue starting from the closest internals, and
> it seems I have found an issue within the sockmap code.
> Anyway, I strongly disagree with the "unless otherwise proven, there
> are a lot of bugs everywhere except the subsystem I'm responsible for"
> bias; a bit more friendly and cooperative effort should give us better
> results.

I guess that the main point in Cong's feedback is that a sockmap update 
is not supposed to race with sock_map_destroy() (???) @Cong, @John, 
@JakubS: any comments on that?

Cheers,

Paolo
Michal Luczaj Sept. 25, 2024, 12:22 a.m. UTC | #4
On 9/24/24 10:23, Paolo Abeni wrote:
> ...
> I guess that the main point in Cong's feedback is that a sockmap update 
> is not supposed to race with sock_map_destroy() (???) @Cong, @John, 
> @JakubS: any comments on that?

In somewhat related news: sock_map_unhash() races with the update, hitting
WARN_ON_ONCE(saved_unhash == sock_map_unhash).

CPU0					CPU1
====					====

BPF_MAP_DELETE_ELEM
  sk_psock_drop()
    sk_psock_restore_proto
    rcu_assign_sk_user_data(NULL)
    					shutdown()
					  sock_map_unhash()
					    psock = sk_psock(sk)
					    if (unlikely(!psock)) {
BPF_MAP_UPDATE_ELEM
  sock_map_init_proto()
    sock_replace_proto
					      saved_unhash = READ_ONCE(sk->sk_prot)->unhash;
					    }
					    if (WARN_ON_ONCE(saved_unhash == sock_map_unhash))
					      return;

[   20.860668] WARNING: CPU: 1 PID: 1238 at net/core/sock_map.c:1641 sock_map_unhash+0xa1/0x220
[   20.860686] CPU: 1 UID: 0 PID: 1238 Comm: a.out Not tainted 6.11.0+ #59
[   20.860688] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[   20.860705] Call Trace:
[   20.860706]  <TASK>
[   20.860725]  unix_shutdown+0xb0/0x470
[   20.860728]  __sys_shutdown+0x7a/0xa0
[   20.860731]  __x64_sys_shutdown+0x10/0x20
[   20.860733]  do_syscall_64+0x93/0x180
[   20.860788]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

In a VM it takes about 10 minutes to trigger the splat:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/un.h>
#include <sys/syscall.h>
#include <sys/socket.h>
#include <linux/bpf.h>

int s[2], sockmap;

static void die(char *msg)
{
	perror(msg);
	exit(-1);
}

static int create_sockmap(int key_size, int value_size, int max_entries)
{
	union bpf_attr attr = {
		.map_type = BPF_MAP_TYPE_SOCKMAP,
		.key_size = key_size,
		.value_size = value_size,
		.max_entries = max_entries
	};

	int map = syscall(SYS_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
	if (map < 0)
		die("bpf_create_map");

	return map;
}

static void map_update_elem(int map_fd, int key, void *value, uint64_t flags)
{
	union bpf_attr attr = {
		.map_fd = map_fd,
		.key = (uint64_t)&key,
		.value = (uint64_t)value,
		.flags = flags
	};

	syscall(SYS_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
}

static void map_del_elem(int map_fd, int key)
{
	union bpf_attr attr = {
		.map_fd = map_fd,
		.key = (uint64_t)&key
	};

	syscall(SYS_bpf, BPF_MAP_DELETE_ELEM, &attr, sizeof(attr));
}

static void *racer_del(void *unused)
{
	for (;;)
		map_del_elem(sockmap, 0);

	return NULL;
}

static void *racer_update(void *unused)
{
	for (;;)
		map_update_elem(sockmap, 0, &s[0], BPF_ANY);

	return NULL;
}

int main(void)
{
	pthread_t t0, t1;

	if (pthread_create(&t0, NULL, racer_del, NULL))
		die("pthread_create");

	if (pthread_create(&t1, NULL, racer_update, NULL))
		die("pthread_create");

	sockmap = create_sockmap(sizeof(int), sizeof(int), 1);

	for (;;) {
		if (socketpair(AF_UNIX, SOCK_STREAM, 0, s) < 0)
			die("socketpair");

		map_update_elem(sockmap, 0, &s[0], BPF_ANY);
		shutdown(s[1], 0);

		close(s[0]);
		close(s[1]);
	}
}
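
A note on running it: the reproducer is self-contained; build it with
something like 'cc -pthread repro.c -o repro' and run it as root, since
creating a sockmap requires CAP_NET_ADMIN. The splat then shows up in
dmesg.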

With mdelay(1) it's a matter of seconds:

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 724b6856fcc3..98a964399813 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -1631,6 +1631,7 @@ void sock_map_unhash(struct sock *sk)
 	psock = sk_psock(sk);
 	if (unlikely(!psock)) {
 		rcu_read_unlock();
+		mdelay(1);
 		saved_unhash = READ_ONCE(sk->sk_prot)->unhash;
 	} else {
 		saved_unhash = psock->saved_unhash;

I've tried the patch below and it seems to do the trick

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 724b6856fcc3..a384771a66e8 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -1627,6 +1627,7 @@ void sock_map_unhash(struct sock *sk)
 	void (*saved_unhash)(struct sock *sk);
 	struct sk_psock *psock;
 
+	lock_sock(sk);
 	rcu_read_lock();
 	psock = sk_psock(sk);
 	if (unlikely(!psock)) {
@@ -1637,6 +1638,7 @@ void sock_map_unhash(struct sock *sk)
 		sock_map_remove_links(sk, psock);
 		rcu_read_unlock();
 	}
+	release_sock(sk);
 	if (WARN_ON_ONCE(saved_unhash == sock_map_unhash))
 		return;
 	if (saved_unhash)

but perhaps what needs to be fixed instead is af_unix shutdown()?
CCing Kuniyuki.

thanks,
Michal
Jakub Sitnicki Sept. 27, 2024, 1:04 p.m. UTC | #5
On Tue, Sep 24, 2024 at 10:23 AM +02, Paolo Abeni wrote:
> I guess that the main point in Cong's feedback is that a sockmap update is not
> supposed to race with sock_map_destroy() (???) @Cong, @John, @JakubS: any
> comments on that?

Looking into it, but will need a bit more time because I'm working
through a backlog and OoO next week.

@Dmitry,

To start off, could you ask syzbot to give your patch a spin by replying
to its report? See instructions following the report [1].

Thanks,
-jkbs

[1] https://lore.kernel.org/netdev/000000000000abe6b50620a7f370@google.com/
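
For completeness, such a test request is a reply to the syzbot report
containing a line like

#syz test: <git tree URL> <branch or commit>

with the proposed patch inline or attached (the tree URL and commit
above are placeholders to be filled in).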

Patch

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index d3dbb92153f2..1eeb1d3a6b71 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -1649,16 +1649,16 @@  void sock_map_destroy(struct sock *sk)
 	struct sk_psock *psock;
 
 	rcu_read_lock();
-	psock = sk_psock_get(sk);
+	psock = sk_psock(sk);
 	if (unlikely(!psock)) {
 		rcu_read_unlock();
 		saved_destroy = READ_ONCE(sk->sk_prot)->destroy;
 	} else {
+		read_lock_bh(&sk->sk_callback_lock);
 		saved_destroy = psock->saved_destroy;
+		read_unlock_bh(&sk->sk_callback_lock);
 		sock_map_remove_links(sk, psock);
 		rcu_read_unlock();
-		sk_psock_stop(psock);
-		sk_psock_put(sk, psock);
 	}
 	if (WARN_ON_ONCE(saved_destroy == sock_map_destroy))
 		return;