Message ID | 20240910114354.14283-1-dmantipov@yandex.ru (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net,v2] net: sockmap: avoid race between sock_map_destroy() and sk_psock_put() | expand |
On 9/10/24 13:43, Dmitry Antipov wrote:
> Syzbot has triggered the following race condition:
>
> On CPU0, 'sk_psock_drop()' (most likely scheduled from 'sock_map_unref()'
> called by 'sock_map_update_common()') is running at [1]:
>
>     void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
>     {
>         write_lock_bh(&sk->sk_callback_lock);
>         sk_psock_restore_proto(sk, psock);       [1]
>         rcu_assign_sk_user_data(sk, NULL);       [2]
>         ...
>     }
>
> If 'sock_map_destroy()' is scheduled on CPU1 at the same time, psock is
> always NULL at [3]. But, since [1] may still be in progress during [4],
> the value of 'saved_destroy' at this point is undefined:
>
>     void sock_map_destroy(struct sock *sk)
>     {
>         void (*saved_destroy)(struct sock *sk);
>         struct sk_psock *psock;
>
>         rcu_read_lock();
>         psock = sk_psock_get(sk);                                  [3]
>         if (unlikely(!psock)) {
>             rcu_read_unlock();
>             saved_destroy = READ_ONCE(sk->sk_prot)->destroy;       [4]
>         } else {
>             saved_destroy = psock->saved_destroy;                  [5]
>             sock_map_remove_links(sk, psock);
>             rcu_read_unlock();
>             sk_psock_stop(psock);
>             sk_psock_put(sk, psock);
>         }
>         if (WARN_ON_ONCE(saved_destroy == sock_map_destroy))
>             return;
>         if (saved_destroy)
>             saved_destroy(sk);
>     }
>
> Fix this issue in 3 steps:
>
> 1. Prefer 'sk_psock()' over 'sk_psock_get()' at [3]. Since a zero
>    refcount is ignored, 'psock' is non-NULL until [2] is completed.
>
> 2. Add a read lock around [5] to make sure that [1] is not in progress
>    when the former is executed.
>
> 3. Since 'sk_psock()' does not adjust reference counting, drop
>    'sk_psock_put()' and the redundant 'sk_psock_stop()' (which is
>    executed by 'sk_psock_drop()' anyway).
>
> Fixes: 5b4a79ba65a1 ("bpf, sockmap: Don't let sock_map_{close,destroy,unhash} call itself")
> Reported-by: syzbot+f363afac6b0ace576f45@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=f363afac6b0ace576f45
> Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>

I think there is still Cong's question pending: why this could not/should
not be addressed in the RDS code instead.

Thanks,

Paolo
On 9/19/24 11:51 AM, Paolo Abeni wrote:
> I think there is still Cong's question pending: why this could not/should not be addressed in the RDS code instead.
Hm. Apart from the 'rds_tcp_accept_xxx()' functions, the rest of the
backtrace is contributed by generic TCP and IP code. I've tried to debug
this issue starting from the closest internals and seem to have found an
issue within the sockmap code. Anyway, I strongly disagree with the
"unless otherwise proven, there are a lot of bugs everywhere except the
subsystem I'm responsible for" bias; a bit more friendly and cooperative
effort should give us better results.
Dmitry
On 9/19/24 11:26, Dmitry Antipov wrote:
> On 9/19/24 11:51 AM, Paolo Abeni wrote:
>> I think there is still Cong's question pending: why this could not/should
>> not be addressed in the RDS code instead.
>
> Hm. Apart from the 'rds_tcp_accept_xxx()' functions, the rest of the
> backtrace is contributed by generic TCP and IP code.

AFAICS, since most of RDS is built on top of TCP, most RDS-specific bugs
will show a lot of tcp/ip in their backtraces. E.g. with mptcp we had
quite a few bugs with a 'tcp mostly' or 'tcp only' backtrace that were
root-caused in the mptcp code.

> I've tried to debug this issue starting from the closest internals and
> seem to have found an issue within the sockmap code. Anyway, I strongly
> disagree with the "unless otherwise proven, there are a lot of bugs
> everywhere except the subsystem I'm responsible for" bias; a bit more
> friendly and cooperative effort should give us better results.

I guess that the main point in Cong's feedback is that a sockmap update
is not supposed to race with sock_map_destroy() (???). @Cong, @John,
@JakubS: any comments on that?

Cheers,

Paolo
On 9/24/24 10:23, Paolo Abeni wrote:
> ...
> I guess that the main point in Cong's feedback is that a sockmap update
> is not supposed to race with sock_map_destroy() (???). @Cong, @John,
> @JakubS: any comments on that?

In somewhat related news: sock_map_unhash() races with the update,
hitting WARN_ON_ONCE(saved_unhash == sock_map_unhash):

    CPU0                                CPU1
    ====                                ====
    BPF_MAP_DELETE_ELEM
      sk_psock_drop()
        sk_psock_restore_proto()
        rcu_assign_sk_user_data(NULL)
                                        shutdown()
                                          sock_map_unhash()
                                            psock = sk_psock(sk)
                                            if (unlikely(!psock)) {
    BPF_MAP_UPDATE_ELEM
      sock_map_init_proto()
        sock_replace_proto()
                                              saved_unhash =
                                                READ_ONCE(sk->sk_prot)->unhash;
                                            }
                                            if (WARN_ON_ONCE(saved_unhash ==
                                                             sock_map_unhash))
                                              return;

[   20.860668] WARNING: CPU: 1 PID: 1238 at net/core/sock_map.c:1641 sock_map_unhash+0xa1/0x220
[   20.860686] CPU: 1 UID: 0 PID: 1238 Comm: a.out Not tainted 6.11.0+ #59
[   20.860688] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[   20.860705] Call Trace:
[   20.860706]  <TASK>
[   20.860725]  unix_shutdown+0xb0/0x470
[   20.860728]  __sys_shutdown+0x7a/0xa0
[   20.860731]  __x64_sys_shutdown+0x10/0x20
[   20.860733]  do_syscall_64+0x93/0x180
[   20.860788]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

Under a VM it takes about 10 minutes to trigger the splat:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/un.h>
#include <sys/syscall.h>
#include <sys/socket.h>
#include <linux/bpf.h>

int s[2], sockmap;

static void die(char *msg)
{
	perror(msg);
	exit(-1);
}

static int create_sockmap(int key_size, int value_size, int max_entries)
{
	union bpf_attr attr = {
		.map_type = BPF_MAP_TYPE_SOCKMAP,
		.key_size = key_size,
		.value_size = value_size,
		.max_entries = max_entries
	};
	int map = syscall(SYS_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));

	if (map < 0)
		die("bpf_create_map");
	return map;
}

static void map_update_elem(int map_fd, int key, void *value, uint64_t flags)
{
	union bpf_attr attr = {
		.map_fd = map_fd,
		.key = (uint64_t)&key,
		.value = (uint64_t)value,
		.flags = flags
	};

	syscall(SYS_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
}

static void map_del_elem(int map_fd, int key)
{
	union bpf_attr attr = {
		.map_fd = map_fd,
		.key = (uint64_t)&key
	};

	syscall(SYS_bpf, BPF_MAP_DELETE_ELEM, &attr, sizeof(attr));
}

static void *racer_del(void *unused)
{
	for (;;)
		map_del_elem(sockmap, 0);
	return NULL;
}

static void *racer_update(void *unused)
{
	for (;;)
		map_update_elem(sockmap, 0, &s[0], BPF_ANY);
	return NULL;
}

int main(void)
{
	pthread_t t0, t1;

	if (pthread_create(&t0, NULL, racer_del, NULL))
		die("pthread_create");
	if (pthread_create(&t1, NULL, racer_update, NULL))
		die("pthread_create");

	sockmap = create_sockmap(sizeof(int), sizeof(int), 1);

	for (;;) {
		if (socketpair(AF_UNIX, SOCK_STREAM, 0, s) < 0)
			die("socketpair");
		map_update_elem(sockmap, 0, &s[0], BPF_ANY);
		shutdown(s[1], 0);
		close(s[0]);
		close(s[1]);
	}
}

With mdelay(1) it's a matter of seconds:

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 724b6856fcc3..98a964399813 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -1631,6 +1631,7 @@ void sock_map_unhash(struct sock *sk)
 	psock = sk_psock(sk);
 	if (unlikely(!psock)) {
 		rcu_read_unlock();
+		mdelay(1);
 		saved_unhash = READ_ONCE(sk->sk_prot)->unhash;
 	} else {
 		saved_unhash = psock->saved_unhash;

I've tried the patch below and it seems to do the trick:

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 724b6856fcc3..a384771a66e8 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -1627,6 +1627,7 @@ void sock_map_unhash(struct sock *sk)
 	void (*saved_unhash)(struct sock *sk);
 	struct sk_psock *psock;
 
+	lock_sock(sk);
 	rcu_read_lock();
 	psock = sk_psock(sk);
 	if (unlikely(!psock)) {
@@ -1637,6 +1638,7 @@ void sock_map_unhash(struct sock *sk)
 		sock_map_remove_links(sk, psock);
 		rcu_read_unlock();
 	}
+	release_sock(sk);
 	if (WARN_ON_ONCE(saved_unhash == sock_map_unhash))
 		return;
 	if (saved_unhash)

but perhaps what needs to be fixed instead is af_unix shutdown()?
CCing Kuniyuki.

Thanks,
Michal
On Tue, Sep 24, 2024 at 10:23 AM +02, Paolo Abeni wrote:
> I guess that the main point in Cong's feedback is that a sockmap update
> is not supposed to race with sock_map_destroy() (???). @Cong, @John,
> @JakubS: any comments on that?

Looking into it, but I will need a bit more time because I'm working
through a backlog and will be OoO next week.

@Dmitry, to start off, could you ask syzbot to give your patch a spin by
replying to its report? See the instructions following the report [1].

Thanks,
-jkbs

[1] https://lore.kernel.org/netdev/000000000000abe6b50620a7f370@google.com/
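For reference, syzbot patch testing is requested by replying to the report email (keeping the syzbot address in To/Cc) with a '#syz test' command; a sketch of such a reply is below, where the tree URL and branch are placeholders that would need to match the tree the patch applies to, and the patch body itself follows inline in the same mail:

```
#syz test: git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git main

<patch inline below the command, or attached>
```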
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index d3dbb92153f2..1eeb1d3a6b71 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -1649,16 +1649,16 @@ void sock_map_destroy(struct sock *sk)
 	struct sk_psock *psock;
 
 	rcu_read_lock();
-	psock = sk_psock_get(sk);
+	psock = sk_psock(sk);
 	if (unlikely(!psock)) {
 		rcu_read_unlock();
 		saved_destroy = READ_ONCE(sk->sk_prot)->destroy;
 	} else {
+		read_lock_bh(&sk->sk_callback_lock);
 		saved_destroy = psock->saved_destroy;
+		read_unlock_bh(&sk->sk_callback_lock);
 		sock_map_remove_links(sk, psock);
 		rcu_read_unlock();
-		sk_psock_stop(psock);
-		sk_psock_put(sk, psock);
 	}
 	if (WARN_ON_ONCE(saved_destroy == sock_map_destroy))
 		return;
Syzbot has triggered the following race condition:

On CPU0, 'sk_psock_drop()' (most likely scheduled from 'sock_map_unref()'
called by 'sock_map_update_common()') is running at [1]:

    void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
    {
        write_lock_bh(&sk->sk_callback_lock);
        sk_psock_restore_proto(sk, psock);       [1]
        rcu_assign_sk_user_data(sk, NULL);       [2]
        ...
    }

If 'sock_map_destroy()' is scheduled on CPU1 at the same time, psock is
always NULL at [3]. But, since [1] may still be in progress during [4],
the value of 'saved_destroy' at this point is undefined:

    void sock_map_destroy(struct sock *sk)
    {
        void (*saved_destroy)(struct sock *sk);
        struct sk_psock *psock;

        rcu_read_lock();
        psock = sk_psock_get(sk);                                  [3]
        if (unlikely(!psock)) {
            rcu_read_unlock();
            saved_destroy = READ_ONCE(sk->sk_prot)->destroy;       [4]
        } else {
            saved_destroy = psock->saved_destroy;                  [5]
            sock_map_remove_links(sk, psock);
            rcu_read_unlock();
            sk_psock_stop(psock);
            sk_psock_put(sk, psock);
        }
        if (WARN_ON_ONCE(saved_destroy == sock_map_destroy))
            return;
        if (saved_destroy)
            saved_destroy(sk);
    }

Fix this issue in 3 steps:

1. Prefer 'sk_psock()' over 'sk_psock_get()' at [3]. Since a zero
   refcount is ignored, 'psock' is non-NULL until [2] is completed.

2. Add a read lock around [5] to make sure that [1] is not in progress
   when the former is executed.

3. Since 'sk_psock()' does not adjust reference counting, drop
   'sk_psock_put()' and the redundant 'sk_psock_stop()' (which is
   executed by 'sk_psock_drop()' anyway).

Fixes: 5b4a79ba65a1 ("bpf, sockmap: Don't let sock_map_{close,destroy,unhash} call itself")
Reported-by: syzbot+f363afac6b0ace576f45@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=f363afac6b0ace576f45
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
---
 net/core/sock_map.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)