diff mbox series

[bpf-next] bpf,sockmap: fix sk->sk_forward_alloc warn_on in sk_stream_kill_queues

Message ID 20220524075311.649153-1-wangyufen@huawei.com (mailing list archive)
State Accepted
Commit 84dc313f7b79e27fb495e7be67572c00a94b3bc2
Delegated to: BPF
Headers show
Series [bpf-next] bpf,sockmap: fix sk->sk_forward_alloc warn_on in sk_stream_kill_queues | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1479 this patch: 1479
netdev/cc_maintainers warning 3 maintainers not CCed: yoshfuji@linux-ipv6.org pabeni@redhat.com edumazet@google.com
netdev/build_clang success Errors and warnings before: 184 this patch: 184
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 1486 this patch: 1486
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 57 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-PR success PR summary
bpf/vmtest-bpf-next-VM_Test-2 success Logs for Kernel LATEST on ubuntu-latest with llvm-15
bpf/vmtest-bpf-next-VM_Test-3 success Logs for Kernel LATEST on z15 with gcc
bpf/vmtest-bpf-next-VM_Test-1 success Logs for Kernel LATEST on ubuntu-latest with gcc

Commit Message

wangyufen May 24, 2022, 7:53 a.m. UTC
During TCP sockmap redirect pressure test, the following warning is triggered:
WARNING: CPU: 3 PID: 2145 at net/core/stream.c:205 sk_stream_kill_queues+0xbc/0xd0
CPU: 3 PID: 2145 Comm: iperf Kdump: loaded Tainted: G        W         5.10.0+ #9
Call Trace:
 inet_csk_destroy_sock+0x55/0x110
 inet_csk_listen_stop+0xbb/0x380
 tcp_close+0x41b/0x480
 inet_release+0x42/0x80
 __sock_release+0x3d/0xa0
 sock_close+0x11/0x20
 __fput+0x9d/0x240
 task_work_run+0x62/0x90
 exit_to_user_mode_prepare+0x110/0x120
 syscall_exit_to_user_mode+0x27/0x190
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

The reason we observed is that:
When the listener is closing, a connection may have completed the three-way
handshake but not accepted, and the client has sent some packets. The child
sks in accept queue release by inet_child_forget()->inet_csk_destroy_sock(),
but psocks of child sks have not released.

To fix, add sock_map_destroy to release psocks.

Signed-off-by: Wang Yufen <wangyufen@huawei.com>
---
 include/linux/bpf.h   |  1 +
 include/linux/skmsg.h |  1 +
 net/core/skmsg.c      |  1 +
 net/core/sock_map.c   | 23 +++++++++++++++++++++++
 net/ipv4/tcp_bpf.c    |  1 +
 5 files changed, 27 insertions(+)

Comments

Jakub Sitnicki May 25, 2022, 8:41 a.m. UTC | #1
On Tue, May 24, 2022 at 03:53 PM +08, Wang Yufen wrote:
> During TCP sockmap redirect pressure test, the following warning is triggered:
> WARNING: CPU: 3 PID: 2145 at net/core/stream.c:205 sk_stream_kill_queues+0xbc/0xd0
> CPU: 3 PID: 2145 Comm: iperf Kdump: loaded Tainted: G        W         5.10.0+ #9
> Call Trace:
>  inet_csk_destroy_sock+0x55/0x110
>  inet_csk_listen_stop+0xbb/0x380
>  tcp_close+0x41b/0x480
>  inet_release+0x42/0x80
>  __sock_release+0x3d/0xa0
>  sock_close+0x11/0x20
>  __fput+0x9d/0x240
>  task_work_run+0x62/0x90
>  exit_to_user_mode_prepare+0x110/0x120
>  syscall_exit_to_user_mode+0x27/0x190
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> The reason we observed is that:
> When the listener is closing, a connection may have completed the three-way
> handshake but not accepted, and the client has sent some packets. The child
> sks in accept queue release by inet_child_forget()->inet_csk_destroy_sock(),
> but psocks of child sks have not released.
>
> To fix, add sock_map_destroy to release psocks.
>
> Signed-off-by: Wang Yufen <wangyufen@huawei.com>
> ---

Thanks for the fix.

Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Cong Wang May 27, 2022, 9:37 p.m. UTC | #2
On Tue, May 24, 2022 at 03:53:11PM +0800, Wang Yufen wrote:
> During TCP sockmap redirect pressure test, the following warning is triggered:
> WARNING: CPU: 3 PID: 2145 at net/core/stream.c:205 sk_stream_kill_queues+0xbc/0xd0
> CPU: 3 PID: 2145 Comm: iperf Kdump: loaded Tainted: G        W         5.10.0+ #9
> Call Trace:
>  inet_csk_destroy_sock+0x55/0x110
>  inet_csk_listen_stop+0xbb/0x380
>  tcp_close+0x41b/0x480
>  inet_release+0x42/0x80
>  __sock_release+0x3d/0xa0
>  sock_close+0x11/0x20
>  __fput+0x9d/0x240
>  task_work_run+0x62/0x90
>  exit_to_user_mode_prepare+0x110/0x120
>  syscall_exit_to_user_mode+0x27/0x190
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> The reason we observed is that:
> When the listener is closing, a connection may have completed the three-way
> handshake but not accepted, and the client has sent some packets. The child
> sks in accept queue release by inet_child_forget()->inet_csk_destroy_sock(),
> but psocks of child sks have not released.
> 

Hm, in this scenario, how does the child socket end up in the sockmap?
Clearly user-space does not have a chance to get an fd yet.

And, how does your patch work? Since the child sock does not even inheirt
the sock proto after clone (see the comments above tcp_bpf_clone()) at
all?

Thanks.
wangyufen May 28, 2022, 1:54 a.m. UTC | #3
在 2022/5/28 5:37, Cong Wang 写道:
> On Tue, May 24, 2022 at 03:53:11PM +0800, Wang Yufen wrote:
>> During TCP sockmap redirect pressure test, the following warning is triggered:
>> WARNING: CPU: 3 PID: 2145 at net/core/stream.c:205 sk_stream_kill_queues+0xbc/0xd0
>> CPU: 3 PID: 2145 Comm: iperf Kdump: loaded Tainted: G        W         5.10.0+ #9
>> Call Trace:
>>   inet_csk_destroy_sock+0x55/0x110
>>   inet_csk_listen_stop+0xbb/0x380
>>   tcp_close+0x41b/0x480
>>   inet_release+0x42/0x80
>>   __sock_release+0x3d/0xa0
>>   sock_close+0x11/0x20
>>   __fput+0x9d/0x240
>>   task_work_run+0x62/0x90
>>   exit_to_user_mode_prepare+0x110/0x120
>>   syscall_exit_to_user_mode+0x27/0x190
>>   entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> The reason we observed is that:
>> When the listener is closing, a connection may have completed the three-way
>> handshake but not accepted, and the client has sent some packets. The child
>> sks in accept queue release by inet_child_forget()->inet_csk_destroy_sock(),
>> but psocks of child sks have not released.
>>
> Hm, in this scenario, how does the child socket end up in the sockmap?
> Clearly user-space does not have a chance to get an fd yet.
>
> And, how does your patch work? Since the child sock does not even inheirt
> the sock proto after clone (see the comments above tcp_bpf_clone()) at
> all?
>
> Thanks.
> .
My test cases are as follows:

__section("sockops")
int bpf_sockmap(struct bpf_sock_ops *skops)
{
     switch (skops->op) {
         case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
         case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
             ...
             bpf_sock_hash_update(skops, &sock_ops_map, &key, BPF_NOEXIST);
             break;
         ...
}

__section("sk_msg")
int bpf_redir(struct sk_msg_md *msg)
{
     ...
     bpf_msg_redirect_hash(msg, &sock_ops_map, &key, BPF_F_INGRESS);
     return SK_PASS;
}

//tcp_server
int main(char **argv)
{
     int sk = 0;
     int port, ret;
     struct sockaddr_in addr;

     signal(SIGCHLD, SIG_IGN);

     sk = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
     if (sk < 0) {
         perror("Can't create socket");
         return -1;
     }

     port = atoi(argv[1]);
     memset(&addr, 0, sizeof(addr));
     addr.sin_family = AF_INET;
     addr.sin_addr.s_addr = htonl(INADDR_ANY);
     addr.sin_port = htons(port);

     printf("Binding to port %d\n", port);

     ret = bind(sk, (struct sockaddr *)&addr, sizeof(addr));
     if (ret < 0) {
         perror("Can't bind socket");
         return -1;
     }

     ret = listen(sk, size);
     if (ret < 0) {
         perror("Can't put sock to listen");
         return -1;
     }

     printf("Waiting for connections\n");
     while (1) {
         //not accpet
         sleep(1);
     }
}

//tcp_client
int main(char **argv)
{
     int port, write_size;
     int val[10], rval[10];
     int sk = 0;

     port = atoi(argv[2]);
     val[0] = 1;

     sk = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
     if (sk < 0) {
         perror("Can't create socket");
         return -1;
     }

     memset(&addr, 0, sizeof(addr));
     addr.sin_family = AF_INET;
     inet_aton(argv[1], &addr.sin_addr);
     addr.sin_port = htons(port);

     ret = connect(sk[i], (struct sockaddr *)&addr, sizeof(addr));
     if (ret < 0) {
         perror("Can't connect");
         return -1;
     }

    while (1) {
         printf("send %d -> %d\n", val[0], val[0]);
         write(sk, &val, sizeof(val));
         val[0]++;
         sleep(1);
    }
}


1. start tcp_server
2. start tcp_client
3. kill tcp_server
The problem can be reproduced easily.
Jakub Sitnicki May 30, 2022, 4:37 p.m. UTC | #4
On Sat, May 28, 2022 at 09:54 AM +08, wangyufen wrote:
> 在 2022/5/28 5:37, Cong Wang 写道:
>> On Tue, May 24, 2022 at 03:53:11PM +0800, Wang Yufen wrote:
>>> During TCP sockmap redirect pressure test, the following warning is triggered:
>>> WARNING: CPU: 3 PID: 2145 at net/core/stream.c:205 sk_stream_kill_queues+0xbc/0xd0
>>> CPU: 3 PID: 2145 Comm: iperf Kdump: loaded Tainted: G        W         5.10.0+ #9
>>> Call Trace:
>>>   inet_csk_destroy_sock+0x55/0x110
>>>   inet_csk_listen_stop+0xbb/0x380
>>>   tcp_close+0x41b/0x480
>>>   inet_release+0x42/0x80
>>>   __sock_release+0x3d/0xa0
>>>   sock_close+0x11/0x20
>>>   __fput+0x9d/0x240
>>>   task_work_run+0x62/0x90
>>>   exit_to_user_mode_prepare+0x110/0x120
>>>   syscall_exit_to_user_mode+0x27/0x190
>>>   entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>
>>> The reason we observed is that:
>>> When the listener is closing, a connection may have completed the three-way
>>> handshake but not accepted, and the client has sent some packets. The child
>>> sks in accept queue release by inet_child_forget()->inet_csk_destroy_sock(),
>>> but psocks of child sks have not released.
>>>
>> Hm, in this scenario, how does the child socket end up in the sockmap?
>> Clearly user-space does not have a chance to get an fd yet.
>>
>> And, how does your patch work? Since the child sock does not even inheirt
>> the sock proto after clone (see the comments above tcp_bpf_clone()) at
>> all?
>>
>> Thanks.
>> .
> My test cases are as follows:
>
> __section("sockops")
> int bpf_sockmap(struct bpf_sock_ops *skops)
> {
>     switch (skops->op) {
>         case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
>         case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
>             ...
>             bpf_sock_hash_update(skops, &sock_ops_map, &key, BPF_NOEXIST);
>             break;
>         ...
> }

Right, when processing the final ACK in tcp_rcv_state_process(), we
invoke the BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB BPF callback.

This gives a chance to install sockmap sk_prot callbacks.

An accept() without ever calling accept() ;-)

[...]
John Fastabend May 31, 2022, 11:40 p.m. UTC | #5
Jakub Sitnicki wrote:
> On Sat, May 28, 2022 at 09:54 AM +08, wangyufen wrote:
> > 在 2022/5/28 5:37, Cong Wang 写道:
> >> On Tue, May 24, 2022 at 03:53:11PM +0800, Wang Yufen wrote:
> >>> During TCP sockmap redirect pressure test, the following warning is triggered:
> >>> WARNING: CPU: 3 PID: 2145 at net/core/stream.c:205 sk_stream_kill_queues+0xbc/0xd0
> >>> CPU: 3 PID: 2145 Comm: iperf Kdump: loaded Tainted: G        W         5.10.0+ #9
> >>> Call Trace:
> >>>   inet_csk_destroy_sock+0x55/0x110
> >>>   inet_csk_listen_stop+0xbb/0x380
> >>>   tcp_close+0x41b/0x480
> >>>   inet_release+0x42/0x80
> >>>   __sock_release+0x3d/0xa0
> >>>   sock_close+0x11/0x20
> >>>   __fput+0x9d/0x240
> >>>   task_work_run+0x62/0x90
> >>>   exit_to_user_mode_prepare+0x110/0x120
> >>>   syscall_exit_to_user_mode+0x27/0x190
> >>>   entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >>>
> >>> The reason we observed is that:
> >>> When the listener is closing, a connection may have completed the three-way
> >>> handshake but not accepted, and the client has sent some packets. The child
> >>> sks in accept queue release by inet_child_forget()->inet_csk_destroy_sock(),
> >>> but psocks of child sks have not released.
> >>>
> >> Hm, in this scenario, how does the child socket end up in the sockmap?
> >> Clearly user-space does not have a chance to get an fd yet.
> >>
> >> And, how does your patch work? Since the child sock does not even inheirt
> >> the sock proto after clone (see the comments above tcp_bpf_clone()) at
> >> all?
> >>
> >> Thanks.
> >> .
> > My test cases are as follows:
> >
> > __section("sockops")
> > int bpf_sockmap(struct bpf_sock_ops *skops)
> > {
> >     switch (skops->op) {
> >         case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
> >         case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
> >             ...
> >             bpf_sock_hash_update(skops, &sock_ops_map, &key, BPF_NOEXIST);
> >             break;
> >         ...
> > }
> 
> Right, when processing the final ACK in tcp_rcv_state_process(), we
> invoke the BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB BPF callback.
> 
> This gives a chance to install sockmap sk_prot callbacks.
> 
> An accept() without ever calling accept() ;-)
> 
> [...]

LGTM as well.

Acked-by: John Fastabend <john.fastabend@gmail.com>
patchwork-bot+netdevbpf@kernel.org May 31, 2022, 11:50 p.m. UTC | #6
Hello:

This patch was applied to bpf/bpf-next.git (master)
by Daniel Borkmann <daniel@iogearbox.net>:

On Tue, 24 May 2022 15:53:11 +0800 you wrote:
> During TCP sockmap redirect pressure test, the following warning is triggered:
> WARNING: CPU: 3 PID: 2145 at net/core/stream.c:205 sk_stream_kill_queues+0xbc/0xd0
> CPU: 3 PID: 2145 Comm: iperf Kdump: loaded Tainted: G        W         5.10.0+ #9
> Call Trace:
>  inet_csk_destroy_sock+0x55/0x110
>  inet_csk_listen_stop+0xbb/0x380
>  tcp_close+0x41b/0x480
>  inet_release+0x42/0x80
>  __sock_release+0x3d/0xa0
>  sock_close+0x11/0x20
>  __fput+0x9d/0x240
>  task_work_run+0x62/0x90
>  exit_to_user_mode_prepare+0x110/0x120
>  syscall_exit_to_user_mode+0x27/0x190
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> [...]

Here is the summary with links:
  - [bpf-next] bpf,sockmap: fix sk->sk_forward_alloc warn_on in sk_stream_kill_queues
    https://git.kernel.org/bpf/bpf-next/c/84dc313f7b79

You are awesome, thank you!
diff mbox series

Patch

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index cc4d5e394031..c4de82d5e72c 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2092,6 +2092,7 @@  int sock_map_bpf_prog_query(const union bpf_attr *attr,
 			    union bpf_attr __user *uattr);
 
 void sock_map_unhash(struct sock *sk);
+void sock_map_destroy(struct sock *sk);
 void sock_map_close(struct sock *sk, long timeout);
 #else
 static inline int bpf_prog_offload_init(struct bpf_prog *prog,
diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index c5a2d6f50f25..153b6dec9b6a 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -95,6 +95,7 @@  struct sk_psock {
 	spinlock_t			link_lock;
 	refcount_t			refcnt;
 	void (*saved_unhash)(struct sock *sk);
+	void (*saved_destroy)(struct sock *sk);
 	void (*saved_close)(struct sock *sk, long timeout);
 	void (*saved_write_space)(struct sock *sk);
 	void (*saved_data_ready)(struct sock *sk);
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 22b983ade0e7..7e03f96e441b 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -715,6 +715,7 @@  struct sk_psock *sk_psock_init(struct sock *sk, int node)
 	psock->eval = __SK_NONE;
 	psock->sk_proto = prot;
 	psock->saved_unhash = prot->unhash;
+	psock->saved_destroy = prot->destroy;
 	psock->saved_close = prot->close;
 	psock->saved_write_space = sk->sk_write_space;
 
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 81d4b4756a02..9f08ccfaf6da 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -1561,6 +1561,29 @@  void sock_map_unhash(struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(sock_map_unhash);
 
+void sock_map_destroy(struct sock *sk)
+{
+	void (*saved_destroy)(struct sock *sk);
+	struct sk_psock *psock;
+
+	rcu_read_lock();
+	psock = sk_psock_get(sk);
+	if (unlikely(!psock)) {
+		rcu_read_unlock();
+		if (sk->sk_prot->destroy)
+			sk->sk_prot->destroy(sk);
+		return;
+	}
+
+	saved_destroy = psock->saved_destroy;
+	sock_map_remove_links(sk, psock);
+	rcu_read_unlock();
+	sk_psock_stop(psock, true);
+	sk_psock_put(sk, psock);
+	saved_destroy(sk);
+}
+EXPORT_SYMBOL_GPL(sock_map_destroy);
+
 void sock_map_close(struct sock *sk, long timeout)
 {
 	void (*saved_close)(struct sock *sk, long timeout);
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index be3947e70fec..38550bb1b90b 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -540,6 +540,7 @@  static void tcp_bpf_rebuild_protos(struct proto prot[TCP_BPF_NUM_CFGS],
 				   struct proto *base)
 {
 	prot[TCP_BPF_BASE]			= *base;
+	prot[TCP_BPF_BASE].destroy		= sock_map_destroy;
 	prot[TCP_BPF_BASE].close		= sock_map_close;
 	prot[TCP_BPF_BASE].recvmsg		= tcp_bpf_recvmsg;
 	prot[TCP_BPF_BASE].sock_is_readable	= sk_msg_is_readable;