[bpf] sock_map: convert cancel_work_sync() to cancel_work()

Message ID 20221018020258.197333-1-xiyou.wangcong@gmail.com (mailing list archive)
State Changes Requested
Delegated to: BPF

Checks

Context Check Description
netdev/tree_selection success Clearly marked for bpf
netdev/fixes_present success Fixes tag present in non-next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 35 this patch: 35
netdev/cc_maintainers fail 1 blamed authors not CCed: ast@kernel.org; 5 maintainers not CCed: kuba@kernel.org davem@davemloft.net ast@kernel.org edumazet@google.com pabeni@redhat.com
netdev/build_clang success Errors and warnings before: 5 this patch: 5
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 35 this patch: 35
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 81 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-PR fail PR summary
bpf/vmtest-bpf-VM_Test-1 fail Logs for ShellCheck
bpf/vmtest-bpf-VM_Test-17 success Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-20 success Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-23 success Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-24 success Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-14 success Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-15 success Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-18 success Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-21 success Logs for test_progs_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-22 success Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-VM_Test-8 success Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-9 success Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-10 success Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-VM_Test-11 success Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-12 success Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-13 success Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-VM_Test-19 success Logs for test_progs_parallel on s390x with gcc
bpf/vmtest-bpf-VM_Test-16 success Logs for test_progs_no_alu32_parallel on s390x with gcc
bpf/vmtest-bpf-VM_Test-2 fail Logs for build for s390x with gcc
bpf/vmtest-bpf-VM_Test-3 fail Logs for build for s390x with gcc
bpf/vmtest-bpf-VM_Test-4 fail Logs for build for x86_64 with gcc
bpf/vmtest-bpf-VM_Test-5 fail Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-6 success Logs for llvm-toolchain
bpf/vmtest-bpf-VM_Test-7 success Logs for set-matrix

Commit Message

Cong Wang Oct. 18, 2022, 2:02 a.m. UTC
From: Cong Wang <cong.wang@bytedance.com>

Technically we don't need to lock the sock in the psock work, but we
need to prevent this work from running in parallel with sock_map_close().

With this, we no longer need to wait for the psock->work synchronously,
because when we reach here, this work is either still pending, or
blocking on the lock_sock(), or completed. We only need to cancel
the first case asynchronously, and we need to bail out of the second case
quickly by checking the SK_PSOCK_TX_ENABLED bit.

Fixes: 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
Reported-by: Stanislav Fomichev <sdf@google.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 include/linux/skmsg.h |  2 +-
 net/core/skmsg.c      | 19 +++++++++++++------
 net/core/sock_map.c   |  4 ++--
 3 files changed, 16 insertions(+), 9 deletions(-)
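
The three states named in the commit message can be walked through with a small userspace model. This is only a sketch, not kernel code: every `model_*` name is hypothetical, the sock lock is implied by call ordering rather than taken, and it covers only the three cases the message lists.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
enum model_work_state {
	MODEL_WORK_PENDING,	/* queued, handler has not started */
	MODEL_WORK_BLOCKED,	/* handler started, waiting for the sock lock */
	MODEL_WORK_DONE,	/* handler already finished (or was cancelled) */
};

struct model_psock {
	bool tx_enabled;		/* stands in for SK_PSOCK_TX_ENABLED */
	enum model_work_state work;
	int skbs_sent;			/* skbs processed after close() began */
};

/* Models the patched sk_psock_stop(): clear the flag, then cancel
 * without waiting. A non-waiting cancel only removes pending work.
 */
static void model_psock_stop(struct model_psock *p)
{
	p->tx_enabled = false;
	if (p->work == MODEL_WORK_PENDING)
		p->work = MODEL_WORK_DONE;
}

/* Models a blocked handler resuming once close() drops the sock lock. */
static void model_backlog_resume(struct model_psock *p)
{
	if (p->work != MODEL_WORK_BLOCKED)
		return;
	if (p->tx_enabled)	/* the check added right after lock_sock() */
		p->skbs_sent++;
	p->work = MODEL_WORK_DONE;
}

/* Runs close() against a starting state; returns skbs sent afterwards. */
static int model_close(enum model_work_state initial)
{
	struct model_psock p = {
		.tx_enabled = true,
		.work = initial,
		.skbs_sent = 0,
	};

	model_psock_stop(&p);		/* close() holds the sock lock here */
	model_backlog_resume(&p);	/* lock released, blocked handler runs */
	assert(p.work == MODEL_WORK_DONE);
	return p.skbs_sent;
}
```

In this model `model_close()` returns 0 for all three starting states, which is the property the commit message argues for: no skb is processed once close() has cleared the flag.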

Comments

Stanislav Fomichev Oct. 18, 2022, 6:13 p.m. UTC | #1
On 10/17, Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>

> Technically we don't need lock the sock in the psock work, but we
> need to prevent this work running in parallel with sock_map_close().

> With this, we no longer need to wait for the psock->work synchronously,
> because when we reach here, either this work is still pending, or
> blocking on the lock_sock(), or it is completed. We only need to cancel
> the first case asynchronously, and we need to bail out the second case
> quickly by checking SK_PSOCK_TX_ENABLED bit.

> Fixes: 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
> Reported-by: Stanislav Fomichev <sdf@google.com>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: Jakub Sitnicki <jakub@cloudflare.com>
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>

This seems to remove the splat for me:

Tested-by: Stanislav Fomichev <sdf@google.com>

The patch looks good, but I'll leave the review to Jakub/John.

> ---
>   include/linux/skmsg.h |  2 +-
>   net/core/skmsg.c      | 19 +++++++++++++------
>   net/core/sock_map.c   |  4 ++--
>   3 files changed, 16 insertions(+), 9 deletions(-)

> diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
> index 48f4b645193b..70d6cb94e580 100644
> --- a/include/linux/skmsg.h
> +++ b/include/linux/skmsg.h
> @@ -376,7 +376,7 @@ static inline void sk_psock_report_error(struct sk_psock *psock, int err)
>   }

>   struct sk_psock *sk_psock_init(struct sock *sk, int node);
> -void sk_psock_stop(struct sk_psock *psock, bool wait);
> +void sk_psock_stop(struct sk_psock *psock);

>   #if IS_ENABLED(CONFIG_BPF_STREAM_PARSER)
>   int sk_psock_init_strp(struct sock *sk, struct sk_psock *psock);
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index ca70525621c7..c329e71ea924 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -647,6 +647,11 @@ static void sk_psock_backlog(struct work_struct *work)
>   	int ret;

>   	mutex_lock(&psock->work_mutex);
> +	lock_sock(psock->sk);
> +
> +	if (!sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED))
> +		goto end;
> +
>   	if (unlikely(state->skb)) {
>   		spin_lock_bh(&psock->ingress_lock);
>   		skb = state->skb;
> @@ -672,9 +677,12 @@ static void sk_psock_backlog(struct work_struct *work)
>   		skb_bpf_redirect_clear(skb);
>   		do {
>   			ret = -EIO;
> -			if (!sock_flag(psock->sk, SOCK_DEAD))
> +			if (!sock_flag(psock->sk, SOCK_DEAD)) {
> +				release_sock(psock->sk);
>   				ret = sk_psock_handle_skb(psock, skb, off,
>   							  len, ingress);
> +				lock_sock(psock->sk);
> +			}
>   			if (ret <= 0) {
>   				if (ret == -EAGAIN) {
>   					sk_psock_skb_state(psock, state, skb,
> @@ -695,6 +703,7 @@ static void sk_psock_backlog(struct work_struct *work)
>   			kfree_skb(skb);
>   	}
>   end:
> +	release_sock(psock->sk);
>   	mutex_unlock(&psock->work_mutex);
>   }

> @@ -803,16 +812,14 @@ static void sk_psock_link_destroy(struct sk_psock *psock)
>   	}
>   }

> -void sk_psock_stop(struct sk_psock *psock, bool wait)
> +void sk_psock_stop(struct sk_psock *psock)
>   {
>   	spin_lock_bh(&psock->ingress_lock);
>   	sk_psock_clear_state(psock, SK_PSOCK_TX_ENABLED);
>   	sk_psock_cork_free(psock);
>   	__sk_psock_zap_ingress(psock);
>   	spin_unlock_bh(&psock->ingress_lock);
> -
> -	if (wait)
> -		cancel_work_sync(&psock->work);
> +	cancel_work(&psock->work);
>   }

>   static void sk_psock_done_strp(struct sk_psock *psock);
> @@ -850,7 +857,7 @@ void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
>   		sk_psock_stop_verdict(sk, psock);
>   	write_unlock_bh(&sk->sk_callback_lock);

> -	sk_psock_stop(psock, false);
> +	sk_psock_stop(psock);

>   	INIT_RCU_WORK(&psock->rwork, sk_psock_destroy);
>   	queue_rcu_work(system_wq, &psock->rwork);
> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index a660baedd9e7..d4e11d7f459c 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -1596,7 +1596,7 @@ void sock_map_destroy(struct sock *sk)
>   	saved_destroy = psock->saved_destroy;
>   	sock_map_remove_links(sk, psock);
>   	rcu_read_unlock();
> -	sk_psock_stop(psock, false);
> +	sk_psock_stop(psock);
>   	sk_psock_put(sk, psock);
>   	saved_destroy(sk);
>   }
> @@ -1619,7 +1619,7 @@ void sock_map_close(struct sock *sk, long timeout)
>   	saved_close = psock->saved_close;
>   	sock_map_remove_links(sk, psock);
>   	rcu_read_unlock();
> -	sk_psock_stop(psock, true);
> +	sk_psock_stop(psock);
>   	sk_psock_put(sk, psock);
>   	release_sock(sk);
>   	saved_close(sk, timeout);
> --
> 2.34.1
Jakub Sitnicki Oct. 24, 2022, 1:33 p.m. UTC | #2
On Tue, Oct 18, 2022 at 11:13 AM -07, sdf@google.com wrote:
> On 10/17, Cong Wang wrote:
>> From: Cong Wang <cong.wang@bytedance.com>
>
>> Technically we don't need lock the sock in the psock work, but we
>> need to prevent this work running in parallel with sock_map_close().
>
>> With this, we no longer need to wait for the psock->work synchronously,
>> because when we reach here, either this work is still pending, or
>> blocking on the lock_sock(), or it is completed. We only need to cancel
>> the first case asynchronously, and we need to bail out the second case
>> quickly by checking SK_PSOCK_TX_ENABLED bit.
>
>> Fixes: 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
>> Reported-by: Stanislav Fomichev <sdf@google.com>
>> Cc: John Fastabend <john.fastabend@gmail.com>
>> Cc: Jakub Sitnicki <jakub@cloudflare.com>
>> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
>
> This seems to remove the splat for me:
>
> Tested-by: Stanislav Fomichev <sdf@google.com>
>
> The patch looks good, but I'll leave the review to Jakub/John.

I can't poke any holes in it either.

However, it is harder for me to follow than the initial idea [1].
So I'm wondering if there was anything wrong with it?

This seems like a step back when it comes to simplifying locking in
sk_psock_backlog() that was done in 799aa7f98d53.

[1] https://lore.kernel.org/bpf/87ilk9ftls.fsf@cloudflare.com/T/#md486941e228a1b29729dba842ccd396c2c07d9fd

>
>> ---
>>   include/linux/skmsg.h |  2 +-
>>   net/core/skmsg.c      | 19 +++++++++++++------
>>   net/core/sock_map.c   |  4 ++--
>>   3 files changed, 16 insertions(+), 9 deletions(-)
>
>> diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
>> index 48f4b645193b..70d6cb94e580 100644
>> --- a/include/linux/skmsg.h
>> +++ b/include/linux/skmsg.h
>> @@ -376,7 +376,7 @@ static inline void sk_psock_report_error(struct sk_psock *psock, int err)
>>   }
>
>>   struct sk_psock *sk_psock_init(struct sock *sk, int node);
>> -void sk_psock_stop(struct sk_psock *psock, bool wait);
>> +void sk_psock_stop(struct sk_psock *psock);
>
>>   #if IS_ENABLED(CONFIG_BPF_STREAM_PARSER)
>>   int sk_psock_init_strp(struct sock *sk, struct sk_psock *psock);
>> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
>> index ca70525621c7..c329e71ea924 100644
>> --- a/net/core/skmsg.c
>> +++ b/net/core/skmsg.c
>> @@ -647,6 +647,11 @@ static void sk_psock_backlog(struct work_struct  *work)
>>   	int ret;
>
>>   	mutex_lock(&psock->work_mutex);
>> +	lock_sock(psock->sk);
>> +
>> +	if (!sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED))
>> +		goto end;
>> +
>>   	if (unlikely(state->skb)) {
>>   		spin_lock_bh(&psock->ingress_lock);
>>   		skb = state->skb;
>> @@ -672,9 +677,12 @@ static void sk_psock_backlog(struct work_struct  *work)
>>   		skb_bpf_redirect_clear(skb);
>>   		do {
>>   			ret = -EIO;
>> -			if (!sock_flag(psock->sk, SOCK_DEAD))
>> +			if (!sock_flag(psock->sk, SOCK_DEAD)) {
>> +				release_sock(psock->sk);
>>   				ret = sk_psock_handle_skb(psock, skb, off,
>>   							  len, ingress);
>> +				lock_sock(psock->sk);
>> +			}
>>   			if (ret <= 0) {
>>   				if (ret == -EAGAIN) {
>>   					sk_psock_skb_state(psock, state, skb,
>> @@ -695,6 +703,7 @@ static void sk_psock_backlog(struct work_struct *work)
>>   			kfree_skb(skb);
>>   	}
>>   end:
>> +	release_sock(psock->sk);
>>   	mutex_unlock(&psock->work_mutex);
>>   }
>
>> @@ -803,16 +812,14 @@ static void sk_psock_link_destroy(struct sk_psock *psock)
>>   	}
>>   }
>
>> -void sk_psock_stop(struct sk_psock *psock, bool wait)
>> +void sk_psock_stop(struct sk_psock *psock)
>>   {
>>   	spin_lock_bh(&psock->ingress_lock);
>>   	sk_psock_clear_state(psock, SK_PSOCK_TX_ENABLED);
>>   	sk_psock_cork_free(psock);
>>   	__sk_psock_zap_ingress(psock);
>>   	spin_unlock_bh(&psock->ingress_lock);
>> -
>> -	if (wait)
>> -		cancel_work_sync(&psock->work);
>> +	cancel_work(&psock->work);
>>   }
>
>>   static void sk_psock_done_strp(struct sk_psock *psock);
>> @@ -850,7 +857,7 @@ void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
>>   		sk_psock_stop_verdict(sk, psock);
>>   	write_unlock_bh(&sk->sk_callback_lock);
>
>> -	sk_psock_stop(psock, false);
>> +	sk_psock_stop(psock);
>
>>   	INIT_RCU_WORK(&psock->rwork, sk_psock_destroy);
>>   	queue_rcu_work(system_wq, &psock->rwork);
>> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
>> index a660baedd9e7..d4e11d7f459c 100644
>> --- a/net/core/sock_map.c
>> +++ b/net/core/sock_map.c
>> @@ -1596,7 +1596,7 @@ void sock_map_destroy(struct sock *sk)
>>   	saved_destroy = psock->saved_destroy;
>>   	sock_map_remove_links(sk, psock);
>>   	rcu_read_unlock();
>> -	sk_psock_stop(psock, false);
>> +	sk_psock_stop(psock);
>>   	sk_psock_put(sk, psock);
>>   	saved_destroy(sk);
>>   }
>> @@ -1619,7 +1619,7 @@ void sock_map_close(struct sock *sk, long timeout)
>>   	saved_close = psock->saved_close;
>>   	sock_map_remove_links(sk, psock);
>>   	rcu_read_unlock();
>> -	sk_psock_stop(psock, true);
>> +	sk_psock_stop(psock);
>>   	sk_psock_put(sk, psock);
>>   	release_sock(sk);
>>   	saved_close(sk, timeout);
>> --
>> 2.34.1
Cong Wang Oct. 28, 2022, 7:16 p.m. UTC | #3
On Mon, Oct 24, 2022 at 03:33:13PM +0200, Jakub Sitnicki wrote:
> On Tue, Oct 18, 2022 at 11:13 AM -07, sdf@google.com wrote:
> > On 10/17, Cong Wang wrote:
> >> From: Cong Wang <cong.wang@bytedance.com>
> >
> >> Technically we don't need lock the sock in the psock work, but we
> >> need to prevent this work running in parallel with sock_map_close().
> >
> >> With this, we no longer need to wait for the psock->work synchronously,
> >> because when we reach here, either this work is still pending, or
> >> blocking on the lock_sock(), or it is completed. We only need to cancel
> >> the first case asynchronously, and we need to bail out the second case
> >> quickly by checking SK_PSOCK_TX_ENABLED bit.
> >
> >> Fixes: 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
> >> Reported-by: Stanislav Fomichev <sdf@google.com>
> >> Cc: John Fastabend <john.fastabend@gmail.com>
> >> Cc: Jakub Sitnicki <jakub@cloudflare.com>
> >> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> >
> > This seems to remove the splat for me:
> >
> > Tested-by: Stanislav Fomichev <sdf@google.com>
> >
> > The patch looks good, but I'll leave the review to Jakub/John.
> 
> I can't poke any holes in it either.
> 
> However, it is harder for me to follow than the initial idea [1].
> So I'm wondering if there was anything wrong with it?

It caused a warning in sk_stream_kill_queues() when I actually tested
it (after posting).

> 
> This seems like a step back when comes to simplifying locking in
> sk_psock_backlog() that was done in 799aa7f98d53.

Kinda, but it is still true that this sock lock is not for sk_socket
(merely for closing this race condition).

Thanks.
Jakub Sitnicki Oct. 31, 2022, 10:03 p.m. UTC | #4
On Fri, Oct 28, 2022 at 12:16 PM -07, Cong Wang wrote:
> On Mon, Oct 24, 2022 at 03:33:13PM +0200, Jakub Sitnicki wrote:
>> On Tue, Oct 18, 2022 at 11:13 AM -07, sdf@google.com wrote:
>> > On 10/17, Cong Wang wrote:
>> >> From: Cong Wang <cong.wang@bytedance.com>
>> >
>> >> Technically we don't need lock the sock in the psock work, but we
>> >> need to prevent this work running in parallel with sock_map_close().
>> >
>> >> With this, we no longer need to wait for the psock->work synchronously,
>> >> because when we reach here, either this work is still pending, or
>> >> blocking on the lock_sock(), or it is completed. We only need to cancel
>> >> the first case asynchronously, and we need to bail out the second case
>> >> quickly by checking SK_PSOCK_TX_ENABLED bit.
>> >
>> >> Fixes: 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
>> >> Reported-by: Stanislav Fomichev <sdf@google.com>
>> >> Cc: John Fastabend <john.fastabend@gmail.com>
>> >> Cc: Jakub Sitnicki <jakub@cloudflare.com>
>> >> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
>> >
>> > This seems to remove the splat for me:
>> >
>> > Tested-by: Stanislav Fomichev <sdf@google.com>
>> >
>> > The patch looks good, but I'll leave the review to Jakub/John.
>> 
>> I can't poke any holes in it either.
>> 
>> However, it is harder for me to follow than the initial idea [1].
>> So I'm wondering if there was anything wrong with it?
>
> It caused a warning in sk_stream_kill_queues() when I actually tested
> it (after posting).

We must have seen the same warnings. They seemed unrelated so I went
digging. We have a fix for these [1]. They were present since 5.18-rc1.

>> This seems like a step back when comes to simplifying locking in
>> sk_psock_backlog() that was done in 799aa7f98d53.
>
> Kinda, but it is still true that this sock lock is not for sk_socket
> (merely for closing this race condition).

I really think the initial idea [2] is much nicer. I can turn it into a
patch, if you are short on time.

With [1] and [2] applied, the deadlock and memory accounting warnings
are gone when running `test_sockmap`.

Thanks,
Jakub

[1] https://lore.kernel.org/netdev/1667000674-13237-1-git-send-email-wangyufen@huawei.com/
[2] https://lore.kernel.org/netdev/Y0xJUc%2FLRu8K%2FAf8@pop-os.localdomain/
John Fastabend Nov. 1, 2022, 8:01 p.m. UTC | #5
Jakub Sitnicki wrote:
> On Fri, Oct 28, 2022 at 12:16 PM -07, Cong Wang wrote:
> > On Mon, Oct 24, 2022 at 03:33:13PM +0200, Jakub Sitnicki wrote:
> >> On Tue, Oct 18, 2022 at 11:13 AM -07, sdf@google.com wrote:
> >> > On 10/17, Cong Wang wrote:
> >> >> From: Cong Wang <cong.wang@bytedance.com>
> >> >
> >> >> Technically we don't need lock the sock in the psock work, but we
> >> >> need to prevent this work running in parallel with sock_map_close().
> >> >
> >> >> With this, we no longer need to wait for the psock->work synchronously,
> >> >> because when we reach here, either this work is still pending, or
> >> >> blocking on the lock_sock(), or it is completed. We only need to cancel
> >> >> the first case asynchronously, and we need to bail out the second case
> >> >> quickly by checking SK_PSOCK_TX_ENABLED bit.
> >> >
> >> >> Fixes: 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
> >> >> Reported-by: Stanislav Fomichev <sdf@google.com>
> >> >> Cc: John Fastabend <john.fastabend@gmail.com>
> >> >> Cc: Jakub Sitnicki <jakub@cloudflare.com>
> >> >> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> >> >
> >> > This seems to remove the splat for me:
> >> >
> >> > Tested-by: Stanislav Fomichev <sdf@google.com>
> >> >
> >> > The patch looks good, but I'll leave the review to Jakub/John.
> >> 
> >> I can't poke any holes in it either.
> >> 
> >> However, it is harder for me to follow than the initial idea [1].
> >> So I'm wondering if there was anything wrong with it?
> >
> > It caused a warning in sk_stream_kill_queues() when I actually tested
> > it (after posting).
> 
> We must have seen the same warnings. They seemed unrelated so I went
> digging. We have a fix for these [1]. They were present since 5.18-rc1.
> 
> >> This seems like a step back when comes to simplifying locking in
> >> sk_psock_backlog() that was done in 799aa7f98d53.
> >
> > Kinda, but it is still true that this sock lock is not for sk_socket
> > (merely for closing this race condition).
> 
> I really think the initial idea [2] is much nicer. I can turn it into a
> patch, if you are short on time.
> 
> With [1] and [2] applied, the dead lock and memory accounting warnings
> are gone, when running `test_sockmap`.
> 
> Thanks,
> Jakub
> 
> [1] https://lore.kernel.org/netdev/1667000674-13237-1-git-send-email-wangyufen@huawei.com/
> [2] https://lore.kernel.org/netdev/Y0xJUc%2FLRu8K%2FAf8@pop-os.localdomain/

Cong, what do you think? I tend to agree [2] looks nicer to me.

@Jakub,

Also I think we could simply drop the proposed cancel_work_sync in
sock_map_close()?

 }
@@ -1619,9 +1619,10 @@ void sock_map_close(struct sock *sk, long timeout)
 	saved_close = psock->saved_close;
 	sock_map_remove_links(sk, psock);
 	rcu_read_unlock();
-	sk_psock_stop(psock, true);
-	sk_psock_put(sk, psock);
+	sk_psock_stop(psock);
 	release_sock(sk);
+	cancel_work_sync(&psock->work);
+	sk_psock_put(sk, psock);
 	saved_close(sk, timeout);
 }

The sk_psock_put is going to cancel the work before destroying the psock,

 sk_psock_put()
   sk_psock_drop()
     queue_rcu_work(system_wq, psock->rwork)

and then in callback we

  sk_psock_destroy()
    cancel_work_sync(psock->work)

although it might be nice to have the work cancelled earlier rather than
later.

Thanks,
John
Jakub Sitnicki Nov. 3, 2022, 7:22 p.m. UTC | #6
On Tue, Nov 01, 2022 at 01:01 PM -07, John Fastabend wrote:
> Jakub Sitnicki wrote:
>> On Fri, Oct 28, 2022 at 12:16 PM -07, Cong Wang wrote:
>> > On Mon, Oct 24, 2022 at 03:33:13PM +0200, Jakub Sitnicki wrote:
>> >> On Tue, Oct 18, 2022 at 11:13 AM -07, sdf@google.com wrote:
>> >> > On 10/17, Cong Wang wrote:
>> >> >> From: Cong Wang <cong.wang@bytedance.com>
>> >> >
>> >> >> Technically we don't need lock the sock in the psock work, but we
>> >> >> need to prevent this work running in parallel with sock_map_close().
>> >> >
>> >> >> With this, we no longer need to wait for the psock->work synchronously,
>> >> >> because when we reach here, either this work is still pending, or
>> >> >> blocking on the lock_sock(), or it is completed. We only need to cancel
>> >> >> the first case asynchronously, and we need to bail out the second case
>> >> >> quickly by checking SK_PSOCK_TX_ENABLED bit.
>> >> >
>> >> >> Fixes: 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
>> >> >> Reported-by: Stanislav Fomichev <sdf@google.com>
>> >> >> Cc: John Fastabend <john.fastabend@gmail.com>
>> >> >> Cc: Jakub Sitnicki <jakub@cloudflare.com>
>> >> >> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
>> >> >
>> >> > This seems to remove the splat for me:
>> >> >
>> >> > Tested-by: Stanislav Fomichev <sdf@google.com>
>> >> >
>> >> > The patch looks good, but I'll leave the review to Jakub/John.
>> >> 
>> >> I can't poke any holes in it either.
>> >> 
>> >> However, it is harder for me to follow than the initial idea [1].
>> >> So I'm wondering if there was anything wrong with it?
>> >
>> > It caused a warning in sk_stream_kill_queues() when I actually tested
>> > it (after posting).
>> 
>> We must have seen the same warnings. They seemed unrelated so I went
>> digging. We have a fix for these [1]. They were present since 5.18-rc1.
>> 
>> >> This seems like a step back when comes to simplifying locking in
>> >> sk_psock_backlog() that was done in 799aa7f98d53.
>> >
>> > Kinda, but it is still true that this sock lock is not for sk_socket
>> > (merely for closing this race condition).
>> 
>> I really think the initial idea [2] is much nicer. I can turn it into a
>> patch, if you are short on time.
>> 
>> With [1] and [2] applied, the dead lock and memory accounting warnings
>> are gone, when running `test_sockmap`.
>> 
>> Thanks,
>> Jakub
>> 
>> [1] https://lore.kernel.org/netdev/1667000674-13237-1-git-send-email-wangyufen@huawei.com/
>> [2] https://lore.kernel.org/netdev/Y0xJUc%2FLRu8K%2FAf8@pop-os.localdomain/
>
> Cong, what do you think? I tend to agree [2] looks nicer to me.
>
> @Jakub,
>
> Also I think we could simply drop the proposed cancel_work_sync in
> sock_map_close()?
>
>  }
> @@ -1619,9 +1619,10 @@ void sock_map_close(struct sock *sk, long timeout)
>  	saved_close = psock->saved_close;
>  	sock_map_remove_links(sk, psock);
>  	rcu_read_unlock();
> -	sk_psock_stop(psock, true);
> -	sk_psock_put(sk, psock);
> +	sk_psock_stop(psock);
>  	release_sock(sk);
> +	cancel_work_sync(&psock->work);
> +	sk_psock_put(sk, psock);
>  	saved_close(sk, timeout);
>  }
>
> The sk_psock_put is going to cancel the work before destroying the psock,
>
>  sk_psock_put()
>    sk_psock_drop()
>      queue_rcu_work(system_wq, psock->rwork)
>
> and then in callback we
>
>   sk_psock_destroy()
>     cancel_work_sync(psock->work)
>
> although it might be nice to have the work cancelled earlier rather than
> latter maybe.

Good point.

I kinda like the property that once close() returns we know there is no
deferred work running for the socket.

I find the APIs where a deferred cleanup happens sometimes harder to
write tests for.

But I don't really have a strong opinion here.

-Jakub
John Fastabend Nov. 3, 2022, 9:36 p.m. UTC | #7
Jakub Sitnicki wrote:
> On Tue, Nov 01, 2022 at 01:01 PM -07, John Fastabend wrote:
> > Jakub Sitnicki wrote:
> >> On Fri, Oct 28, 2022 at 12:16 PM -07, Cong Wang wrote:
> >> > On Mon, Oct 24, 2022 at 03:33:13PM +0200, Jakub Sitnicki wrote:
> >> >> On Tue, Oct 18, 2022 at 11:13 AM -07, sdf@google.com wrote:
> >> >> > On 10/17, Cong Wang wrote:
> >> >> >> From: Cong Wang <cong.wang@bytedance.com>
> >> >> >
> >> >> >> Technically we don't need lock the sock in the psock work, but we
> >> >> >> need to prevent this work running in parallel with sock_map_close().
> >> >> >
> >> >> >> With this, we no longer need to wait for the psock->work synchronously,
> >> >> >> because when we reach here, either this work is still pending, or
> >> >> >> blocking on the lock_sock(), or it is completed. We only need to cancel
> >> >> >> the first case asynchronously, and we need to bail out the second case
> >> >> >> quickly by checking SK_PSOCK_TX_ENABLED bit.
> >> >> >
> >> >> >> Fixes: 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
> >> >> >> Reported-by: Stanislav Fomichev <sdf@google.com>
> >> >> >> Cc: John Fastabend <john.fastabend@gmail.com>
> >> >> >> Cc: Jakub Sitnicki <jakub@cloudflare.com>
> >> >> >> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> >> >> >
> >> >> > This seems to remove the splat for me:
> >> >> >
> >> >> > Tested-by: Stanislav Fomichev <sdf@google.com>
> >> >> >
> >> >> > The patch looks good, but I'll leave the review to Jakub/John.
> >> >> 
> >> >> I can't poke any holes in it either.
> >> >> 
> >> >> However, it is harder for me to follow than the initial idea [1].
> >> >> So I'm wondering if there was anything wrong with it?
> >> >
> >> > It caused a warning in sk_stream_kill_queues() when I actually tested
> >> > it (after posting).
> >> 
> >> We must have seen the same warnings. They seemed unrelated so I went
> >> digging. We have a fix for these [1]. They were present since 5.18-rc1.
> >> 
> >> >> This seems like a step back when comes to simplifying locking in
> >> >> sk_psock_backlog() that was done in 799aa7f98d53.
> >> >
> >> > Kinda, but it is still true that this sock lock is not for sk_socket
> >> > (merely for closing this race condition).
> >> 
> >> I really think the initial idea [2] is much nicer. I can turn it into a
> >> patch, if you are short on time.
> >> 
> >> With [1] and [2] applied, the dead lock and memory accounting warnings
> >> are gone, when running `test_sockmap`.
> >> 
> >> Thanks,
> >> Jakub
> >> 
> >> [1] https://lore.kernel.org/netdev/1667000674-13237-1-git-send-email-wangyufen@huawei.com/
> >> [2] https://lore.kernel.org/netdev/Y0xJUc%2FLRu8K%2FAf8@pop-os.localdomain/
> >
> > Cong, what do you think? I tend to agree [2] looks nicer to me.
> >
> > @Jakub,
> >
> > Also I think we could simply drop the proposed cancel_work_sync in
> > sock_map_close()?
> >
> >  }
> > @@ -1619,9 +1619,10 @@ void sock_map_close(struct sock *sk, long timeout)
> >  	saved_close = psock->saved_close;
> >  	sock_map_remove_links(sk, psock);
> >  	rcu_read_unlock();
> > -	sk_psock_stop(psock, true);
> > -	sk_psock_put(sk, psock);
> > +	sk_psock_stop(psock);
> >  	release_sock(sk);
> > +	cancel_work_sync(&psock->work);
> > +	sk_psock_put(sk, psock);
> >  	saved_close(sk, timeout);
> >  }
> >
> > The sk_psock_put is going to cancel the work before destroying the psock,
> >
> >  sk_psock_put()
> >    sk_psock_drop()
> >      queue_rcu_work(system_wq, psock->rwork)
> >
> > and then in callback we
> >
> >   sk_psock_destroy()
> >     cancel_work_sync(psock->work)
> >
> > although it might be nice to have the work cancelled earlier rather than
> > latter maybe.
> 
> Good point.
> 
> I kinda like the property that once close() returns we know there is no
> deferred work running for the socket.
> 
> I find the APIs where a deferred cleanup happens sometimes harder to
> write tests for.
> 
> But I don't really have a strong opinion here.

I don't either and Cong left it so I'm good with that.

Reviewing backlog logic though I think there is another bug there, but
I haven't been able to trigger it in any of our tests.

The sk_psock_backlog() logic is,

 sk_psock_backlog(struct work_struct *work)
   mutex_lock()
   while (skb = ...) {
     ...
     do {
       ret = sk_psock_handle_skb()
       if (ret <= 0) {
         if (ret == -EAGAIN) {
           sk_psock_skb_state()
           goto end;
         }
         ...
       }
     } while (len);
     ...
   }
  end:
   mutex_unlock()

What I'm not seeing is how we schedule the backlog again if we get an
EAGAIN through sk_psock_handle_skb(). For egress we would set the
SOCK_NOSPACE bit and then get a write space available callback, which
would do the schedule(). The ingress side could fail with EAGAIN
through the alloc_sk_msg(GFP_ATOMIC) call. This is just a kzalloc,

   sk_psock_handle_skb()
    sk_psock_skb_ingress()
     sk_psock_skb_ingress_self()
       msg = alloc_sk_msg()
               kzalloc()          <- this can return NULL
       if (!msg)
          return -EAGAIN          <- could we stall now


I think we could stall here if there was nothing else to kick it. I
was thinking about this maybe,

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 1efdc47a999b..b96e95625027 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -624,13 +624,20 @@ static int sk_psock_handle_skb(struct sk_psock *psock, struct sk_buff *skb,
 static void sk_psock_skb_state(struct sk_psock *psock,
                               struct sk_psock_work_state *state,
                               struct sk_buff *skb,
-                              int len, int off)
+                              int len, int off, bool ingress)
 {
        spin_lock_bh(&psock->ingress_lock);
        if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
                state->skb = skb;
                state->len = len;
                state->off = off;
+               /* For ingress we may not have a wakeup callback to trigger
+                * the reschedule, so we need to reschedule the retry here.
+                * For egress we will get a TCP stack callback when it's a
+                * good time to retry.
+                */
+               if (ingress)
+                       schedule_work(&psock->work);
        } else {
                sock_drop(psock->sk, skb);
        }
@@ -678,7 +685,7 @@ static void sk_psock_backlog(struct work_struct *work)
                        if (ret <= 0) {
                                if (ret == -EAGAIN) {
                                        sk_psock_skb_state(psock, state, skb,
-                                                          len, off);
+                                                          len, off, ingress);
                                        goto end;
                                }
                                /* Hard errors break pipe and stop xmit. */


It's tempting to try and use the memory pressure callbacks, but those are
built for the skb cache so I think overloading them is not so nice. The
drawback to the above is it's possible no memory is available even when we
get back to the backlog. We could use a delayed reschedule but it's not
clear what delay makes sense here. Maybe some backoff...
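
To make the stall concrete, here is a toy userspace model of the worker
loop. It is only a sketch: every toy_* name is made up for illustration
and none of it is kernel code. It assumes the allocation fails a fixed
number of times and that nothing else kicks the work.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for a psock; all names here are hypothetical. */
struct toy_psock {
	int queued;          /* skbs still waiting on the ingress queue */
	int delivered;       /* skbs handed over to the receiver */
	int alloc_failures;  /* how many upcoming allocations will fail */
	bool work_pending;   /* models psock->work being scheduled */
};

/* Models the ingress path: the allocation may fail with -EAGAIN. */
static int toy_handle_skb(struct toy_psock *p)
{
	if (p->alloc_failures > 0) {
		p->alloc_failures--;
		return -EAGAIN;
	}
	return 1;
}

/* One run of the worker: drain the queue, bail out on -EAGAIN. */
static void toy_backlog(struct toy_psock *p, bool resched_on_eagain)
{
	p->work_pending = false;
	while (p->queued > 0) {
		if (toy_handle_skb(p) == -EAGAIN) {
			/* Without this nothing schedules the work again. */
			if (resched_on_eagain)
				p->work_pending = true;
			return;
		}
		p->queued--;
		p->delivered++;
	}
}

/* Run the worker while it is pending; return the skbs left stuck. */
static int toy_run(int queued, int alloc_failures, bool resched_on_eagain)
{
	struct toy_psock p = {
		.queued = queued,
		.alloc_failures = alloc_failures,
		.work_pending = true,
	};

	assert(queued >= 0 && alloc_failures >= 0);
	while (p.work_pending)
		toy_backlog(&p, resched_on_eagain);
	return p.queued;
}
```

With one failed allocation, toy_run(3, 1, false) leaves all three skbs
queued, while toy_run(3, 1, true) drains them on the second run. The
model also shows the drawback: as long as allocations keep failing, the
immediate reschedule just keeps re-running the worker.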

Any thoughts?

Thanks,
John
Jakub Sitnicki Nov. 8, 2022, 6:49 p.m. UTC | #8
On Thu, Nov 03, 2022 at 02:36 PM -07, John Fastabend wrote:
> Jakub Sitnicki wrote:
>> On Tue, Nov 01, 2022 at 01:01 PM -07, John Fastabend wrote:
>> > Jakub Sitnicki wrote:
>> >> On Fri, Oct 28, 2022 at 12:16 PM -07, Cong Wang wrote:
>> >> > On Mon, Oct 24, 2022 at 03:33:13PM +0200, Jakub Sitnicki wrote:
>> >> >> On Tue, Oct 18, 2022 at 11:13 AM -07, sdf@google.com wrote:
>> >> >> > On 10/17, Cong Wang wrote:
>> >> >> >> From: Cong Wang <cong.wang@bytedance.com>
>> >> >> >
>> >> >> >> Technically we don't need lock the sock in the psock work, but we
>> >> >> >> need to prevent this work running in parallel with sock_map_close().
>> >> >> >
>> >> >> >> With this, we no longer need to wait for the psock->work synchronously,
>> >> >> >> because when we reach here, either this work is still pending, or
>> >> >> >> blocking on the lock_sock(), or it is completed. We only need to cancel
>> >> >> >> the first case asynchronously, and we need to bail out the second case
>> >> >> >> quickly by checking SK_PSOCK_TX_ENABLED bit.
>> >> >> >
>> >> >> >> Fixes: 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
>> >> >> >> Reported-by: Stanislav Fomichev <sdf@google.com>
>> >> >> >> Cc: John Fastabend <john.fastabend@gmail.com>
>> >> >> >> Cc: Jakub Sitnicki <jakub@cloudflare.com>
>> >> >> >> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
>> >> >> >
>> >> >> > This seems to remove the splat for me:
>> >> >> >
>> >> >> > Tested-by: Stanislav Fomichev <sdf@google.com>
>> >> >> >
>> >> >> > The patch looks good, but I'll leave the review to Jakub/John.
>> >> >> 
>> >> >> I can't poke any holes in it either.
>> >> >> 
>> >> >> However, it is harder for me to follow than the initial idea [1].
>> >> >> So I'm wondering if there was anything wrong with it?
>> >> >
>> >> > It caused a warning in sk_stream_kill_queues() when I actually tested
>> >> > it (after posting).
>> >> 
>> >> We must have seen the same warnings. They seemed unrelated so I went
>> >> digging. We have a fix for these [1]. They were present since 5.18-rc1.
>> >> 
>> >> >> This seems like a step back when comes to simplifying locking in
>> >> >> sk_psock_backlog() that was done in 799aa7f98d53.
>> >> >
>> >> > Kinda, but it is still true that this sock lock is not for sk_socket
>> >> > (merely for closing this race condition).
>> >> 
>> >> I really think the initial idea [2] is much nicer. I can turn it into a
>> >> patch, if you are short on time.
>> >> 
>> >> With [1] and [2] applied, the dead lock and memory accounting warnings
>> >> are gone, when running `test_sockmap`.
>> >> 
>> >> Thanks,
>> >> Jakub
>> >> 
>> >> [1]
>> >> https://lore.kernel.org/netdev/1667000674-13237-1-git-send-email-wangyufen@huawei.com/
>> >> [2] https://lore.kernel.org/netdev/Y0xJUc%2FLRu8K%2FAf8@pop-os.localdomain/
>> >
>> > Cong, what do you think? I tend to agree [2] looks nicer to me.
>> >
>> > @Jakub,
>> >
>> > Also I think we could simply drop the proposed cancel_work_sync in
>> > sock_map_close()?
>> >
>> >  }
>> > @@ -1619,9 +1619,10 @@ void sock_map_close(struct sock *sk, long timeout)
>> >  	saved_close = psock->saved_close;
>> >  	sock_map_remove_links(sk, psock);
>> >  	rcu_read_unlock();
>> > -	sk_psock_stop(psock, true);
>> > -	sk_psock_put(sk, psock);
>> > +	sk_psock_stop(psock);
>> >  	release_sock(sk);
>> > +	cancel_work_sync(&psock->work);
>> > +	sk_psock_put(sk, psock);
>> >  	saved_close(sk, timeout);
>> >  }
>> >
>> > The sk_psock_put is going to cancel the work before destroying the psock,
>> >
>> >  sk_psock_put()
>> >    sk_psock_drop()
>> >      queue_rcu_work(system_wq, psock->rwork)
>> >
>> > and then in callback we
>> >
>> >   sk_psock_destroy()
>> >     cancel_work_sync(psock->work)
>> >
>> > although it might be nice to have the work cancelled earlier rather than
>> > later maybe.
>> 
>> Good point.
>> 
>> I kinda like the property that once close() returns we know there is no
>> deferred work running for the socket.
>> 
>> I find the APIs where a deferred cleanup happens sometimes harder to
>> write tests for.
>> 
>> But I don't really have a strong opinion here.
>
> I don't either and Cong left it so I'm good with that.
>
> Reviewing backlog logic though I think there is another bug there, but
> I haven't been able to trigger it in any of our tests.
>
> The sk_psock_backlog() logic is,
>
>  sk_psock_backlog(struct work_struct *work)
>    mutex_lock()
>    while (skb = ...)
>    ...
>    do {
>      ret = sk_psock_handle_skb()
>      if (ret <= 0) {
>        if (ret == -EAGAIN) {
>            sk_psock_skb_state()
>            goto  end;
>        } 
>       ...
>    } while (len);
>    ...
>   end:
>    mutex_unlock()
>
> what I'm not seeing is if we get an EAGAIN through sk_psock_handle_skb
> how do we schedule the backlog again. For egress we would set the
> SOCK_NOSPACE bit and then get a write space available callback which
> would do the schedule(). The ingress side could fail with EAGAIN
> through the alloc_sk_msg(GFP_ATOMIC) call. This is just a kzalloc,
>
>    sk_psock_handle_skb()
>     sk_psock_skb_ingress()
>      sk_psock_skb_ingress_self()
>        msg = alloc_sk_msg()
>                kzalloc()          <- this can return NULL
>        if (!msg)
>           return -EAGAIN          <- could we stall now
>
>
> I think we could stall here if there was nothing else to kick it. I
> was thinking about this maybe,
>
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index 1efdc47a999b..b96e95625027 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -624,13 +624,20 @@ static int sk_psock_handle_skb(struct sk_psock *psock, struct sk_buff *skb,
>  static void sk_psock_skb_state(struct sk_psock *psock,
>                                struct sk_psock_work_state *state,
>                                struct sk_buff *skb,
> -                              int len, int off)
> +                              int len, int off, bool ingress)
>  {
>         spin_lock_bh(&psock->ingress_lock);
>         if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
>                 state->skb = skb;
>                 state->len = len;
>                 state->off = off;
> +               /* For ingress we may not have a wakeup callback to trigger
> +                * the reschedule, so we need to schedule the retry here. For
> +                * egress we will get a TCP stack callback when it's a good
> +                * time to retry.
> +                */
> +               if (ingress)
> +                       schedule_work(&psock->work);
>         } else {
>                 sock_drop(psock->sk, skb);
>         }
> @@ -678,7 +685,7 @@ static void sk_psock_backlog(struct work_struct *work)
>                         if (ret <= 0) {
>                                 if (ret == -EAGAIN) {
>                                         sk_psock_skb_state(psock, state, skb,
> -                                                          len, off);
> +                                                          len, off, ingress);
>                                         goto end;
>                                 }
>                                 /* Hard errors break pipe and stop xmit. */
>
>
> It's tempting to try and use the memory pressure callbacks, but those are
> built for the skb cache so I think overloading them is not so nice. The
> drawback to the above is it's possible no memory is available even when we
> get back to the backlog. We could use a delayed reschedule but it's not
> clear what delay makes sense here. Maybe some backoff...
>
> Any thoughts?

I don't have any thoughts on the fix yet, but I have a repro.

We can use fault injection [1]. For some reason it's been disabled on
x86-64 since 2007 (stack walking didn't work back then?), so we need to
patch the kernel slightly.

Also, to better target the failure, just for this case, I've de-inlined
alloc_sk_msg(). But in general testing we can just inject any alloc
under sk_psock_backlog().

Incantation looks like so:

#!/usr/bin/env bash

readonly TARGET_FUNC=alloc_sk_msg
readonly ADDR=($(grep -A1 ${TARGET_FUNC} /proc/kallsyms | awk '{print "0x" $1}'))

exec bash \
     ../../fault-injection/failcmd.sh \
     --require-start=${ADDR[0]} --require-end=${ADDR[1]} \
     --stacktrace-depth=32 \
     --probability=50 --times=100 \
     --ignore-gfp-wait=N --task-filter=N \
     -- \
     ./test_sockmap

We won't get a message in dmesg (even with --verbosity=1 set) because
we're allocating with __GFP_NOWARN, and the fault injection interface
doesn't provide a way to override that. But we can observe the 'times'
count go down after ./test_sockmap blocks (also confirmed with a printk
added on the -EAGAIN error path).
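
For reference, the ADDR lookup in the script takes the matching
/proc/kallsyms line as the range start and the line right after it as
the range end. A rough C equivalent is sketched below. It assumes the
usual "<hex address> <type> <name>" line layout and that the next
symbol sits directly after the target. The function name
find_func_range is made up. Unlike grep, it matches the symbol name
exactly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Scan kallsyms-formatted text for `func`. On success store the address
 * of `func` in *start and the address of the following symbol in *end.
 * Returns false if `func` is missing or is the last symbol in the text.
 */
static bool find_func_range(const char *kallsyms, const char *func,
			    unsigned long long *start,
			    unsigned long long *end)
{
	const char *line = kallsyms;
	bool found = false;

	assert(kallsyms && func && start && end);
	while (*line) {
		unsigned long long addr;
		char type;
		char name[256];

		if (sscanf(line, "%llx %c %255s", &addr, &type, name) == 3) {
			if (found) {
				/* First symbol after the match ends the range. */
				*end = addr;
				return true;
			}
			if (strcmp(name, func) == 0) {
				*start = addr;
				found = true;
			}
		}
		line = strchr(line, '\n');
		if (!line)
			break;
		line++;
	}
	return false;
}
```

The two values correspond to what the script passes as --require-start
and --require-end.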

This is what I observe:

bash-5.1# ./repro.sh
# 1/ 6  sockmap::txmsg test passthrough:OK
# 2/ 6  sockmap::txmsg test redirect:OK
# 3/ 1  sockmap::txmsg test redirect wait send mem:OK
# 4/ 6  sockmap::txmsg test drop:OK
# 5/ 6  sockmap::txmsg test ingress redirect:OK <-- blocked here
^Z
[1]+  Stopped                 ./repro.sh
bash-5.1# cat /sys/kernel/debug/failslab/times
99
bash-5.1#

Kernel tweaks attached below.

-jkbs

[1] https://www.kernel.org/doc/html/latest/fault-injection/fault-injection.html

---8<---

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 3fc7abffc7aa..32c5329b0dd9 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1963,7 +1963,6 @@ config FAIL_SUNRPC
 config FAULT_INJECTION_STACKTRACE_FILTER
 	bool "stacktrace filter for fault-injection capabilities"
 	depends on FAULT_INJECTION_DEBUG_FS && STACKTRACE_SUPPORT
-	depends on !X86_64
 	select STACKTRACE
 	depends on FRAME_POINTER || MIPS || PPC || S390 || MICROBLAZE || ARM || ARC || X86
 	help
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index e6b9ced3eda8..0f7dc67a3708 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -500,7 +500,7 @@ bool sk_msg_is_readable(struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(sk_msg_is_readable);
 
-static struct sk_msg *alloc_sk_msg(gfp_t gfp)
+static noinline struct sk_msg *alloc_sk_msg(gfp_t gfp)
 {
 	struct sk_msg *msg;
 
diff --git a/tools/testing/fault-injection/failcmd.sh b/tools/testing/fault-injection/failcmd.sh
index 78dac34264be..887dd4553cae 100644
--- a/tools/testing/fault-injection/failcmd.sh
+++ b/tools/testing/fault-injection/failcmd.sh
@@ -212,7 +212,7 @@ done
 echo $oom_kill_allocating_task > /proc/sys/vm/oom_kill_allocating_task
 echo $task_filter > $FAULTATTR/task-filter
 echo $probability > $FAULTATTR/probability
-echo $times > $FAULTATTR/times
+printf "%#x" $times > $FAULTATTR/times
 
 trap "restore_values" SIGINT SIGTERM EXIT
John Fastabend Nov. 8, 2022, 7:57 p.m. UTC | #9
Jakub Sitnicki wrote:
> On Thu, Nov 03, 2022 at 02:36 PM -07, John Fastabend wrote:
> > Jakub Sitnicki wrote:
> >> On Tue, Nov 01, 2022 at 01:01 PM -07, John Fastabend wrote:
> >> > Jakub Sitnicki wrote:
> >> >> On Fri, Oct 28, 2022 at 12:16 PM -07, Cong Wang wrote:
> >> >> > On Mon, Oct 24, 2022 at 03:33:13PM +0200, Jakub Sitnicki wrote:
> >> >> >> On Tue, Oct 18, 2022 at 11:13 AM -07, sdf@google.com wrote:
> >> >> >> > On 10/17, Cong Wang wrote:
> >> >> >> >> From: Cong Wang <cong.wang@bytedance.com>
> >> >> >> >
> >> >> >> >> Technically we don't need lock the sock in the psock work, but we
> >> >> >> >> need to prevent this work running in parallel with sock_map_close().
> >> >> >> >
> >> >> >> >> With this, we no longer need to wait for the psock->work synchronously,
> >> >> >> >> because when we reach here, either this work is still pending, or
> >> >> >> >> blocking on the lock_sock(), or it is completed. We only need to cancel
> >> >> >> >> the first case asynchronously, and we need to bail out the second case
> >> >> >> >> quickly by checking SK_PSOCK_TX_ENABLED bit.
> >> >> >> >
> >> >> >> >> Fixes: 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
> >> >> >> >> Reported-by: Stanislav Fomichev <sdf@google.com>
> >> >> >> >> Cc: John Fastabend <john.fastabend@gmail.com>
> >> >> >> >> Cc: Jakub Sitnicki <jakub@cloudflare.com>
> >> >> >> >> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> >> >> >> >
> >> >> >> > This seems to remove the splat for me:
> >> >> >> >
> >> >> >> > Tested-by: Stanislav Fomichev <sdf@google.com>
> >> >> >> >
> >> >> >> > The patch looks good, but I'll leave the review to Jakub/John.
> >> >> >> 
> >> >> >> I can't poke any holes in it either.
> >> >> >> 
> >> >> >> However, it is harder for me to follow than the initial idea [1].
> >> >> >> So I'm wondering if there was anything wrong with it?
> >> >> >
> >> >> > It caused a warning in sk_stream_kill_queues() when I actually tested
> >> >> > it (after posting).
> >> >> 
> >> >> We must have seen the same warnings. They seemed unrelated so I went
> >> >> digging. We have a fix for these [1]. They were present since 5.18-rc1.
> >> >> 
> >> >> >> This seems like a step back when comes to simplifying locking in
> >> >> >> sk_psock_backlog() that was done in 799aa7f98d53.
> >> >> >
> >> >> > Kinda, but it is still true that this sock lock is not for sk_socket
> >> >> > (merely for closing this race condition).
> >> >> 
> >> >> I really think the initial idea [2] is much nicer. I can turn it into a
> >> >> patch, if you are short on time.
> >> >> 
> >> >> With [1] and [2] applied, the dead lock and memory accounting warnings
> >> >> are gone, when running `test_sockmap`.
> >> >> 
> >> >> Thanks,
> >> >> Jakub
> >> >> 
> >> >> [1]
> >> >> https://lore.kernel.org/netdev/1667000674-13237-1-git-send-email-wangyufen@huawei.com/
> >> >> [2] https://lore.kernel.org/netdev/Y0xJUc%2FLRu8K%2FAf8@pop-os.localdomain/
> >> >
> >> > Cong, what do you think? I tend to agree [2] looks nicer to me.
> >> >
> >> > @Jakub,
> >> >
> >> > Also I think we could simply drop the proposed cancel_work_sync in
> >> > sock_map_close()?
> >> >
> >> >  }
> >> > @@ -1619,9 +1619,10 @@ void sock_map_close(struct sock *sk, long timeout)
> >> >  	saved_close = psock->saved_close;
> >> >  	sock_map_remove_links(sk, psock);
> >> >  	rcu_read_unlock();
> >> > -	sk_psock_stop(psock, true);
> >> > -	sk_psock_put(sk, psock);
> >> > +	sk_psock_stop(psock);
> >> >  	release_sock(sk);
> >> > +	cancel_work_sync(&psock->work);
> >> > +	sk_psock_put(sk, psock);
> >> >  	saved_close(sk, timeout);
> >> >  }
> >> >
> >> > The sk_psock_put is going to cancel the work before destroying the psock,
> >> >
> >> >  sk_psock_put()
> >> >    sk_psock_drop()
> >> >      queue_rcu_work(system_wq, psock->rwork)
> >> >
> >> > and then in callback we
> >> >
> >> >   sk_psock_destroy()
> >> >     cancel_work_sync(psock->work)
> >> >
> >> > although it might be nice to have the work cancelled earlier rather than
> >> > later maybe.
> >> 
> >> Good point.
> >> 
> >> I kinda like the property that once close() returns we know there is no
> >> deferred work running for the socket.
> >> 
> >> I find the APIs where a deferred cleanup happens sometimes harder to
> >> write tests for.
> >> 
> >> But I don't really have a strong opinion here.
> >
> > I don't either and Cong left it so I'm good with that.
> >
> > Reviewing backlog logic though I think there is another bug there, but
> > I haven't been able to trigger it in any of our tests.
> >
> > The sk_psock_backlog() logic is,
> >
> >  sk_psock_backlog(struct work_struct *work)
> >    mutex_lock()
> >    while (skb = ...)
> >    ...
> >    do {
> >      ret = sk_psock_handle_skb()
> >      if (ret <= 0) {
> >        if (ret == -EAGAIN) {
> >            sk_psock_skb_state()
> >            goto  end;
> >        } 
> >       ...
> >    } while (len);
> >    ...
> >   end:
> >    mutex_unlock()
> >
> > what I'm not seeing is if we get an EAGAIN through sk_psock_handle_skb
> > how do we schedule the backlog again. For egress we would set the
> > SOCK_NOSPACE bit and then get a write space available callback which
> > would do the schedule(). The ingress side could fail with EAGAIN
> > through the alloc_sk_msg(GFP_ATOMIC) call. This is just a kzalloc,
> >
> >    sk_psock_handle_skb()
> >     sk_psock_skb_ingress()
> >      sk_psock_skb_ingress_self()
> >        msg = alloc_sk_msg()
> >                kzalloc()          <- this can return NULL
> >        if (!msg)
> >           return -EAGAIN          <- could we stall now
> >
> >
> > I think we could stall here if there was nothing else to kick it. I
> > was thinking about this maybe,
> >
> > diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> > index 1efdc47a999b..b96e95625027 100644
> > --- a/net/core/skmsg.c
> > +++ b/net/core/skmsg.c
> > @@ -624,13 +624,20 @@ static int sk_psock_handle_skb(struct sk_psock *psock, struct sk_buff *skb,
> >  static void sk_psock_skb_state(struct sk_psock *psock,
> >                                struct sk_psock_work_state *state,
> >                                struct sk_buff *skb,
> > -                              int len, int off)
> > +                              int len, int off, bool ingress)
> >  {
> >         spin_lock_bh(&psock->ingress_lock);
> >         if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
> >                 state->skb = skb;
> >                 state->len = len;
> >                 state->off = off;
> > +               /* For ingress we may not have a wakeup callback to trigger
> > +                * the reschedule, so we need to schedule the retry here. For
> > +                * egress we will get a TCP stack callback when it's a good
> > +                * time to retry.
> > +                */
> > +               if (ingress)
> > +                       schedule_work(&psock->work);
> >         } else {
> >                 sock_drop(psock->sk, skb);
> >         }
> > @@ -678,7 +685,7 @@ static void sk_psock_backlog(struct work_struct *work)
> >                         if (ret <= 0) {
> >                                 if (ret == -EAGAIN) {
> >                                         sk_psock_skb_state(psock, state, skb,
> > -                                                          len, off);
> > +                                                          len, off, ingress);
> >                                         goto end;
> >                                 }
> >                                 /* Hard errors break pipe and stop xmit. */
> >
> >
> > It's tempting to try and use the memory pressure callbacks, but those are
> > built for the skb cache so I think overloading them is not so nice. The
> > drawback to the above is it's possible no memory is available even when we
> > get back to the backlog. We could use a delayed reschedule but it's not
> > clear what delay makes sense here. Maybe some backoff...
> >
> > Any thoughts?
> 
> I don't have any thoughts on the fix yet, but I have a repro.

I'm testing it with a delayed workqueue now and a backoff just so
we don't bang on this repeatedly when OOM condition is met. Then
all the other schedule_work() calls become the delayed variant
but I think this is OK.

Better ideas welcome but running the above through our CI today.
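
Purely as an illustration of the backoff idea (this is not the patch
being tested), a capped exponential delay could be computed as below.
The function name and both constants are placeholders, and the values
are arbitrary. In the kernel the result would presumably be converted
with msecs_to_jiffies() and passed to schedule_delayed_work().

```c
#include <assert.h>

/* Placeholder tunables, not taken from any real patch. */
#define TOY_BACKOFF_BASE_MS 1u
#define TOY_BACKOFF_MAX_MS  128u

/*
 * Delay before the next retry after `failures` consecutive -EAGAIN
 * results: 0 for no failure, then 1, 2, 4, ... ms, capped at the max.
 */
static unsigned int toy_backoff_ms(unsigned int failures)
{
	unsigned int delay = TOY_BACKOFF_BASE_MS;

	assert(TOY_BACKOFF_BASE_MS <= TOY_BACKOFF_MAX_MS);
	if (failures == 0)
		return 0;
	/* Double once per extra failure until the cap is reached. */
	while (--failures > 0 && delay < TOY_BACKOFF_MAX_MS)
		delay *= 2;
	return delay > TOY_BACKOFF_MAX_MS ? TOY_BACKOFF_MAX_MS : delay;
}
```

The failure count would be reset once the backlog makes progress, so
the non-OOM path keeps rescheduling with no delay.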

> 
> We can use fault injection [1]. For some reason it's been disabled on
> x86-64 since 2007 (stack walking didn't work back then?), so we need to
> patch the kernel slightly.

Could add the function to ALLOW_OVERRIDE as well. But not sure we want
to force it to be _not_ inlined in general case.

> 
> Also, to better target the failure, just for this case, I've de-inlined
> alloc_sk_msg(). But in general testing we can just inject any alloc
> under sk_psock_backlog().
> 
> Incantation looks like so:
> 
> #!/usr/bin/env bash
> 
> readonly TARGET_FUNC=alloc_sk_msg
> readonly ADDR=($(grep -A1 ${TARGET_FUNC} /proc/kallsyms | awk '{print "0x" $1}'))
> 
> exec bash \
>      ../../fault-injection/failcmd.sh \
>      --require-start=${ADDR[0]} --require-end=${ADDR[1]} \
>      --stacktrace-depth=32 \
>      --probability=50 --times=100 \
>      --ignore-gfp-wait=N --task-filter=N \
>      -- \
>      ./test_sockmap
> 
> We won't get a message in dmesg (even with --verbosity=1 set) because
> we're allocating with __GFP_NOWARN, and the fault injection interface
> doesn't provide a way to override that. But we can observe the 'times'
> count go down after ./test_sockmap blocks (also confirmed with a printk
> added on the -EAGAIN error path).

We can probably do it through BPF prog with ALLOW_OVERRIDE on one of those
functions in that call path then we can write a selftest for it.

> 
> This is what I observe:

Very cool.

> 
> bash-5.1# ./repro.sh
> # 1/ 6  sockmap::txmsg test passthrough:OK
> # 2/ 6  sockmap::txmsg test redirect:OK
> # 3/ 1  sockmap::txmsg test redirect wait send mem:OK
> # 4/ 6  sockmap::txmsg test drop:OK
> # 5/ 6  sockmap::txmsg test ingress redirect:OK <-- blocked here
> ^Z
> [1]+  Stopped                 ./repro.sh
> bash-5.1# cat /sys/kernel/debug/failslab/times
> 99
> bash-5.1#
> 
[...]
Jakub Sitnicki Nov. 10, 2022, 12:59 p.m. UTC | #10
On Tue, Nov 08, 2022 at 11:57 AM -08, John Fastabend wrote:
> Jakub Sitnicki wrote:
>> On Thu, Nov 03, 2022 at 02:36 PM -07, John Fastabend wrote:
>> > Jakub Sitnicki wrote:
>> >> On Tue, Nov 01, 2022 at 01:01 PM -07, John Fastabend wrote:
>> >> > Jakub Sitnicki wrote:
>> >> >> On Fri, Oct 28, 2022 at 12:16 PM -07, Cong Wang wrote:
>> >> >> > On Mon, Oct 24, 2022 at 03:33:13PM +0200, Jakub Sitnicki wrote:
>> >> >> >> On Tue, Oct 18, 2022 at 11:13 AM -07, sdf@google.com wrote:
>> >> >> >> > On 10/17, Cong Wang wrote:
>> >> >> >> >> From: Cong Wang <cong.wang@bytedance.com>
>> >> >> >> >
>> >> >> >> >> Technically we don't need lock the sock in the psock work, but we
>> >> >> >> >> need to prevent this work running in parallel with sock_map_close().
>> >> >> >> >
>> >> >> >> >> With this, we no longer need to wait for the psock->work synchronously,
>> >> >> >> >> because when we reach here, either this work is still pending, or
>> >> >> >> >> blocking on the lock_sock(), or it is completed. We only need to cancel
>> >> >> >> >> the first case asynchronously, and we need to bail out the second case
>> >> >> >> >> quickly by checking SK_PSOCK_TX_ENABLED bit.
>> >> >> >> >
>> >> >> >> >> Fixes: 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
>> >> >> >> >> Reported-by: Stanislav Fomichev <sdf@google.com>
>> >> >> >> >> Cc: John Fastabend <john.fastabend@gmail.com>
>> >> >> >> >> Cc: Jakub Sitnicki <jakub@cloudflare.com>
>> >> >> >> >> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
>> >> >> >> >
>> >> >> >> > This seems to remove the splat for me:
>> >> >> >> >
>> >> >> >> > Tested-by: Stanislav Fomichev <sdf@google.com>
>> >> >> >> >
>> >> >> >> > The patch looks good, but I'll leave the review to Jakub/John.
>> >> >> >> 
>> >> >> >> I can't poke any holes in it either.
>> >> >> >> 
>> >> >> >> However, it is harder for me to follow than the initial idea [1].
>> >> >> >> So I'm wondering if there was anything wrong with it?
>> >> >> >
>> >> >> > It caused a warning in sk_stream_kill_queues() when I actually tested
>> >> >> > it (after posting).
>> >> >> 
>> >> >> We must have seen the same warnings. They seemed unrelated so I went
>> >> >> digging. We have a fix for these [1]. They were present since 5.18-rc1.
>> >> >> 
>> >> >> >> This seems like a step back when comes to simplifying locking in
>> >> >> >> sk_psock_backlog() that was done in 799aa7f98d53.
>> >> >> >
>> >> >> > Kinda, but it is still true that this sock lock is not for sk_socket
>> >> >> > (merely for closing this race condition).
>> >> >> 
>> >> >> I really think the initial idea [2] is much nicer. I can turn it into a
>> >> >> patch, if you are short on time.
>> >> >> 
>> >> >> With [1] and [2] applied, the dead lock and memory accounting warnings
>> >> >> are gone, when running `test_sockmap`.
>> >> >> 
>> >> >> Thanks,
>> >> >> Jakub
>> >> >> 
>> >> >> [1]
>> >> >> https://lore.kernel.org/netdev/1667000674-13237-1-git-send-email-wangyufen@huawei.com/
>> >> >> [2] https://lore.kernel.org/netdev/Y0xJUc%2FLRu8K%2FAf8@pop-os.localdomain/
>> >> >
>> >> > Cong, what do you think? I tend to agree [2] looks nicer to me.
>> >> >
>> >> > @Jakub,
>> >> >
>> >> > Also I think we could simply drop the proposed cancel_work_sync in
>> >> > sock_map_close()?
>> >> >
>> >> >  }
>> >> > @@ -1619,9 +1619,10 @@ void sock_map_close(struct sock *sk, long timeout)
>> >> >  	saved_close = psock->saved_close;
>> >> >  	sock_map_remove_links(sk, psock);
>> >> >  	rcu_read_unlock();
>> >> > -	sk_psock_stop(psock, true);
>> >> > -	sk_psock_put(sk, psock);
>> >> > +	sk_psock_stop(psock);
>> >> >  	release_sock(sk);
>> >> > +	cancel_work_sync(&psock->work);
>> >> > +	sk_psock_put(sk, psock);
>> >> >  	saved_close(sk, timeout);
>> >> >  }
>> >> >
>> >> > The sk_psock_put is going to cancel the work before destroying the psock,
>> >> >
>> >> >  sk_psock_put()
>> >> >    sk_psock_drop()
>> >> >      queue_rcu_work(system_wq, psock->rwork)
>> >> >
>> >> > and then in callback we
>> >> >
>> >> >   sk_psock_destroy()
>> >> >     cancel_work_sync(psock->work)
>> >> >
>> >> > although it might be nice to have the work cancelled earlier rather than
>> >> > later maybe.
>> >> 
>> >> Good point.
>> >> 
>> >> I kinda like the property that once close() returns we know there is no
>> >> deferred work running for the socket.
>> >> 
>> >> I find the APIs where a deferred cleanup happens sometimes harder to
>> >> write tests for.
>> >> 
>> >> But I don't really have a strong opinion here.
>> >
>> > I don't either and Cong left it so I'm good with that.
>> >
>> > Reviewing backlog logic though I think there is another bug there, but
>> > I haven't been able to trigger it in any of our tests.
>> >
>> > The sk_psock_backlog() logic is,
>> >
>> >  sk_psock_backlog(struct work_struct *work)
>> >    mutex_lock()
>> >    while (skb = ...)
>> >    ...
>> >    do {
>> >      ret = sk_psock_handle_skb()
>> >      if (ret <= 0) {
>> >        if (ret == -EAGAIN) {
>> >            sk_psock_skb_state()
>> >            goto  end;
>> >        } 
>> >       ...
>> >    } while (len);
>> >    ...
>> >   end:
>> >    mutex_unlock()
>> >
>> > what I'm not seeing is if we get an EAGAIN through sk_psock_handle_skb
>> > how do we schedule the backlog again. For egress we would set the
>> > SOCK_NOSPACE bit and then get a write space available callback which
>> > would do the schedule(). The ingress side could fail with EAGAIN
>> > through the alloc_sk_msg(GFP_ATOMIC) call. This is just a kzalloc,
>> >
>> >    sk_psock_handle_skb()
>> >     sk_psock_skb_ingress()
>> >      sk_psock_skb_ingress_self()
>> >        msg = alloc_sk_msg()
>> >                kzalloc()          <- this can return NULL
>> >        if (!msg)
>> >           return -EAGAIN          <- could we stall now
>> >
>> >
>> > I think we could stall here if there was nothing else to kick it. I
>> > was thinking about this maybe,
>> >
>> > diff --git a/net/core/skmsg.c b/net/core/skmsg.c
>> > index 1efdc47a999b..b96e95625027 100644
>> > --- a/net/core/skmsg.c
>> > +++ b/net/core/skmsg.c
>> > @@ -624,13 +624,20 @@ static int sk_psock_handle_skb(struct sk_psock *psock, struct sk_buff *skb,
>> >  static void sk_psock_skb_state(struct sk_psock *psock,
>> >                                struct sk_psock_work_state *state,
>> >                                struct sk_buff *skb,
>> > -                              int len, int off)
>> > +                              int len, int off, bool ingress)
>> >  {
>> >         spin_lock_bh(&psock->ingress_lock);
>> >         if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
>> >                 state->skb = skb;
>> >                 state->len = len;
>> >                 state->off = off;
>> > +               /* For ingress we may not have a wakeup callback to trigger
>> > +                * the reschedule, so we need to schedule the retry here. For
>> > +                * egress we will get a TCP stack callback when it's a good
>> > +                * time to retry.
>> > +                */
>> > +               if (ingress)
>> > +                       schedule_work(&psock->work);
>> >         } else {
>> >                 sock_drop(psock->sk, skb);
>> >         }
>> > @@ -678,7 +685,7 @@ static void sk_psock_backlog(struct work_struct *work)
>> >                         if (ret <= 0) {
>> >                                 if (ret == -EAGAIN) {
>> >                                         sk_psock_skb_state(psock, state, skb,
>> > -                                                          len, off);
>> > +                                                          len, off, ingress);
>> >                                         goto end;
>> >                                 }
>> >                                 /* Hard errors break pipe and stop xmit. */
>> >
>> >
>> > It's tempting to try and use the memory pressure callbacks, but those are
>> > built for the skb cache so I think overloading them is not so nice. The
>> > drawback to the above is it's possible no memory is available even when we
>> > get back to the backlog. We could use a delayed reschedule but it's not
>> > clear what delay makes sense here. Maybe some backoff...
>> >
>> > Any thoughts?
>> 
>> I don't have any thoughts on the fix yet, but I have a repro.
>
> I'm testing it with a delayed workqueue now and a backoff just so
> we don't bang on this repeatedly when OOM condition is met. Then
> all the other schedule_work() calls become the delayed variant
> but I think this is OK.
>
> Better ideas welcome but running the above through our CI today.

That sounds good to me because it's easy to comprehend.

If it does not work out, for some reason, we can explore allocating
sk_msg at the time of queuing an skb onto psock->ingress_skb. We know
when we're redirecting to ingress, and are going to need an sk_msg.

Downside is that we would have to bundle up sk_msg somehow with the skb,
so it seems quite convoluted.

>> We can use fault injection [1]. For some reason it's been disabled on
>> x86-64 since 2007 (stack walking didn't work back then?), so we need to
>> patch the kernel slightly.
>
> Could add the function to ALLOW_OVERRIDE as well. But not sure we want
> to force it to be _not_ inlined in general case.

You mean ALLOW_ERROR_INJECTION?

In general, I suspect it will be enough to filter the stacktrace by the
presence of sk_psock_backlog, when smoke testing the code against memory
allocation failures.

>> Also, to better target the failure, just for this case, I've de-inlined
>> alloc_sk_msg(). But in general testing we can just inject any alloc
>> under sk_psock_backlog().
>> 
>> Incantation looks like so:
>> 
>> #!/usr/bin/env bash
>> 
>> readonly TARGET_FUNC=alloc_sk_msg
>> readonly ADDR=($(grep -A1 ${TARGET_FUNC} /proc/kallsyms | awk '{print "0x" $1}'))
>> 
>> exec bash \
>>      ../../fault-injection/failcmd.sh \
>>      --require-start=${ADDR[0]} --require-end=${ADDR[1]} \
>>      --stacktrace-depth=32 \
>>      --probability=50 --times=100 \
>>      --ignore-gfp-wait=N --task-filter=N \
>>      -- \
>>      ./test_sockmap
>> 
>> We won't get a message in dmesg (even with --verbosity=1 set) because
>> we're allocating with __GFP_NOWARN, and the fault injection interface
>> doesn't provide a way to override that. But we can observe the 'times'
>> count go down after ./test_sockmap blocks (also confirmed with a printk
>> added on the -EAGAIN error path).
>
> We can probably do it through BPF prog with ALLOW_OVERRIDE on one of those
> functions in that call path then we can write a selftest for it.

You mean BPF_MODIFY_RETURN? Right, that would be another option.

Right now, I see that alloc_sk_msg does not get inlined on my distro:

$ uname -r
6.0.5-100.fc35.x86_64
$ grep alloc_sk_msg /proc/kallsyms
0000000000000000 t alloc_sk_msg
$

But that seems very build-dependent, so I'm not sure if we want
selftests relying on that.

I'd just do smoke-testing of the whole sk_psock_backlog.

[...]
Cong Wang Nov. 19, 2022, 6:37 p.m. UTC | #11
On Thu, Nov 03, 2022 at 02:36:09PM -0700, John Fastabend wrote:
> Jakub Sitnicki wrote:
> > On Tue, Nov 01, 2022 at 01:01 PM -07, John Fastabend wrote:
> > > Jakub Sitnicki wrote:
> > >> On Fri, Oct 28, 2022 at 12:16 PM -07, Cong Wang wrote:
> > >> > On Mon, Oct 24, 2022 at 03:33:13PM +0200, Jakub Sitnicki wrote:
> > >> >> On Tue, Oct 18, 2022 at 11:13 AM -07, sdf@google.com wrote:
> > >> >> > On 10/17, Cong Wang wrote:
> > >> >> >> From: Cong Wang <cong.wang@bytedance.com>
> > >> >> >
> > >> >> >> Technically we don't need to lock the sock in the psock work, but we
> > >> >> >> need to prevent this work running in parallel with sock_map_close().
> > >> >> >
> > >> >> >> With this, we no longer need to wait for the psock->work synchronously,
> > >> >> >> because when we reach here, either this work is still pending, or
> > >> >> >> blocking on the lock_sock(), or it is completed. We only need to cancel
> > >> >> >> the first case asynchronously, and we need to bail out the second case
> > >> >> >> quickly by checking SK_PSOCK_TX_ENABLED bit.
> > >> >> >
> > >> >> >> Fixes: 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
> > >> >> >> Reported-by: Stanislav Fomichev <sdf@google.com>
> > >> >> >> Cc: John Fastabend <john.fastabend@gmail.com>
> > >> >> >> Cc: Jakub Sitnicki <jakub@cloudflare.com>
> > >> >> >> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> > >> >> >
> > >> >> > This seems to remove the splat for me:
> > >> >> >
> > >> >> > Tested-by: Stanislav Fomichev <sdf@google.com>
> > >> >> >
> > >> >> > The patch looks good, but I'll leave the review to Jakub/John.
> > >> >> 
> > >> >> I can't poke any holes in it either.
> > >> >> 
> > >> >> However, it is harder for me to follow than the initial idea [1].
> > >> >> So I'm wondering if there was anything wrong with it?
> > >> >
> > >> > It caused a warning in sk_stream_kill_queues() when I actually tested
> > >> > it (after posting).
> > >> 
> > >> We must have seen the same warnings. They seemed unrelated so I went
> > >> digging. We have a fix for these [1]. They were present since 5.18-rc1.
> > >> 
> > >> >> This seems like a step back when it comes to simplifying locking in
> > >> >> sk_psock_backlog() that was done in 799aa7f98d53.
> > >> >
> > >> > Kinda, but it is still true that this sock lock is not for sk_socket
> > >> > (merely for closing this race condition).
> > >> 
> > >> I really think the initial idea [2] is much nicer. I can turn it into a
> > >> patch, if you are short on time.
> > >> 
> > >> With [1] and [2] applied, the deadlock and memory accounting warnings
> > >> are gone, when running `test_sockmap`.
> > >> 
> > >> Thanks,
> > >> Jakub
> > >> 
> > >> [1] https://lore.kernel.org/netdev/1667000674-13237-1-git-send-email-wangyufen@huawei.com/
> > >> [2] https://lore.kernel.org/netdev/Y0xJUc%2FLRu8K%2FAf8@pop-os.localdomain/
> > >
> > > Cong, what do you think? I tend to agree [2] looks nicer to me.
> > >
> > > @Jakub,
> > >
> > > Also I think we could simply drop the proposed cancel_work_sync in
> > > sock_map_close()?
> > >
> > >  }
> > > @@ -1619,9 +1619,10 @@ void sock_map_close(struct sock *sk, long timeout)
> > >  	saved_close = psock->saved_close;
> > >  	sock_map_remove_links(sk, psock);
> > >  	rcu_read_unlock();
> > > -	sk_psock_stop(psock, true);
> > > -	sk_psock_put(sk, psock);
> > > +	sk_psock_stop(psock);
> > >  	release_sock(sk);
> > > +	cancel_work_sync(&psock->work);
> > > +	sk_psock_put(sk, psock);
> > >  	saved_close(sk, timeout);
> > >  }
> > >
> > > The sk_psock_put is going to cancel the work before destroying the psock,
> > >
> > >  sk_psock_put()
> > >    sk_psock_drop()
> > >      queue_rcu_work(system_wq, psock->rwork)
> > >
> > > and then in callback we
> > >
> > >   sk_psock_destroy()
> > >     cancel_work_sync(psock->work)
> > >
> > > although it might be nice to have the work cancelled earlier rather than
> > > later, maybe.
> > 
> > Good point.
> > 
> > I kinda like the property that once close() returns we know there is no
> > deferred work running for the socket.
> > 
> > I find the APIs where a deferred cleanup happens sometimes harder to
> > write tests for.
> > 
> > But I don't really have a strong opinion here.
> 
> I don't either and Cong left it so I'm good with that.

It has been there because of the infamous warnings triggered in
sk_stream_kill_queues(). We have to wait for in-flight packets, but this
_may_ change after we switch to tcp_read_skb(), where we call
skb_set_owner_sk_safe().


> 
> Reviewing backlog logic though I think there is another bug there, but
> I haven't been able to trigger it in any of our tests.
> 
> The sk_psock_backlog() logic is,
> 
>  sk_psock_backlog(struct work_struct *work)
>    mutex_lock()
>    while (skb = ...)
>    ...
>    do {
>      ret = sk_psock_handle_skb()
>      if (ret <= 0) {
>        if (ret == -EAGAIN) {
>            sk_psock_skb_state()
>            goto  end;
>        } 
>       ...
>    } while (len);
>    ...
>   end:
>    mutex_unlock()
> 
> what I'm not seeing is if we get an EAGAIN through sk_psock_handle_skb
> how do we schedule the backlog again. For egress we would set the
> SOCK_NOSPACE bit and then get a write space available callback which
> would do the schedule(). The ingress side could fail with EAGAIN
> through the alloc_sk_msg(GFP_ATOMIC) call. This is just a kzalloc,
> 
>    sk_psock_handle_skb()
>     sk_psock_skb_ingress()
>      sk_psock_skb_ingress_self()
>        msg = alloc_sk_msg()
>                kzalloc()          <- this can return NULL
>        if (!msg)
>           return -EAGAIN          <- could we stall now

Returning EAGAIN here makes little sense to me; it should be ENOMEM,
which is not worth retrying.

For other EAGAIN cases, why not just reschedule this work since state is
already saved in sk_psock_work_state?
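
A minimal userspace sketch of that idea, assuming nothing beyond what the
pseudocode above shows: the "if (ret <= 0)" branch is modelled as a pure
decision function, where -EAGAIN means "save state and run the work again"
and every other failure, including -ENOMEM, means "drop". The names
work_state, backlog_action and backlog_on_result() are hypothetical, not
kernel symbols.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct sk_psock_work_state. */
struct work_state {
	const void *skb;	/* skb to resume with, NULL if none */
	unsigned int len;	/* bytes still to be sent */
	unsigned int off;	/* offset into the skb */
};

enum backlog_action {
	BACKLOG_CONTINUE,	/* progress was made, keep looping */
	BACKLOG_RESCHEDULE,	/* state saved, run the work again later */
	BACKLOG_DROP,		/* hard error, free the skb and stop */
};

/*
 * Model of the "if (ret <= 0)" branch of the backlog loop.
 * ret is the (assumed) return value of the per-skb handler:
 * > 0 bytes consumed, -EAGAIN for a transient failure, anything
 * else is treated as a hard error.
 */
enum backlog_action backlog_on_result(struct work_state *state,
				      const void *skb, unsigned int len,
				      unsigned int off, int ret)
{
	assert(state != NULL);

	if (ret > 0)
		return BACKLOG_CONTINUE;

	if (ret == -EAGAIN) {
		/* Remember where we stopped so the next run can resume. */
		state->skb = skb;
		state->len = len;
		state->off = off;
		return BACKLOG_RESCHEDULE;
	}

	/* -ENOMEM and other errors: not retried in this model. */
	return BACKLOG_DROP;
}
```

In the kernel, the BACKLOG_RESCHEDULE branch would presumably map to
schedule_work(&psock->work) or a delayed variant; that mapping is an
assumption here, not something this patch does.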

Thanks.
John Fastabend Nov. 21, 2022, 6:13 a.m. UTC | #12
Cong Wang wrote:
> On Thu, Nov 03, 2022 at 02:36:09PM -0700, John Fastabend wrote:
> > Jakub Sitnicki wrote:
> > > On Tue, Nov 01, 2022 at 01:01 PM -07, John Fastabend wrote:
> > > > Jakub Sitnicki wrote:
> > > >> On Fri, Oct 28, 2022 at 12:16 PM -07, Cong Wang wrote:
> > > >> > On Mon, Oct 24, 2022 at 03:33:13PM +0200, Jakub Sitnicki wrote:
> > > >> >> On Tue, Oct 18, 2022 at 11:13 AM -07, sdf@google.com wrote:
> > > >> >> > On 10/17, Cong Wang wrote:
> > > >> >> >> From: Cong Wang <cong.wang@bytedance.com>
> > > >> >> >
> > > >> >> >> Technically we don't need to lock the sock in the psock work, but we
> > > >> >> >> need to prevent this work running in parallel with sock_map_close().
> > > >> >> >
> > > >> >> >> With this, we no longer need to wait for the psock->work synchronously,
> > > >> >> >> because when we reach here, either this work is still pending, or
> > > >> >> >> blocking on the lock_sock(), or it is completed. We only need to cancel
> > > >> >> >> the first case asynchronously, and we need to bail out the second case
> > > >> >> >> quickly by checking SK_PSOCK_TX_ENABLED bit.
> > > >> >> >
> > > >> >> >> Fixes: 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
> > > >> >> >> Reported-by: Stanislav Fomichev <sdf@google.com>
> > > >> >> >> Cc: John Fastabend <john.fastabend@gmail.com>
> > > >> >> >> Cc: Jakub Sitnicki <jakub@cloudflare.com>
> > > >> >> >> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> > > >> >> >
> > > >> >> > This seems to remove the splat for me:
> > > >> >> >
> > > >> >> > Tested-by: Stanislav Fomichev <sdf@google.com>
> > > >> >> >
> > > >> >> > The patch looks good, but I'll leave the review to Jakub/John.
> > > >> >> 
> > > >> >> I can't poke any holes in it either.
> > > >> >> 
> > > >> >> However, it is harder for me to follow than the initial idea [1].
> > > >> >> So I'm wondering if there was anything wrong with it?
> > > >> >
> > > >> > It caused a warning in sk_stream_kill_queues() when I actually tested
> > > >> > it (after posting).
> > > >> 
> > > >> We must have seen the same warnings. They seemed unrelated so I went
> > > >> digging. We have a fix for these [1]. They were present since 5.18-rc1.
> > > >> 
> > > >> >> This seems like a step back when it comes to simplifying locking in
> > > >> >> sk_psock_backlog() that was done in 799aa7f98d53.
> > > >> >
> > > >> > Kinda, but it is still true that this sock lock is not for sk_socket
> > > >> > (merely for closing this race condition).
> > > >> 
> > > >> I really think the initial idea [2] is much nicer. I can turn it into a
> > > >> patch, if you are short on time.
> > > >> 
> > > >> With [1] and [2] applied, the deadlock and memory accounting warnings
> > > >> are gone, when running `test_sockmap`.
> > > >> 
> > > >> Thanks,
> > > >> Jakub
> > > >> 
> > > >> [1] https://lore.kernel.org/netdev/1667000674-13237-1-git-send-email-wangyufen@huawei.com/
> > > >> [2] https://lore.kernel.org/netdev/Y0xJUc%2FLRu8K%2FAf8@pop-os.localdomain/
> > > >
> > > > Cong, what do you think? I tend to agree [2] looks nicer to me.
> > > >
> > > > @Jakub,
> > > >
> > > > Also I think we could simply drop the proposed cancel_work_sync in
> > > > sock_map_close()?
> > > >
> > > >  }
> > > > @@ -1619,9 +1619,10 @@ void sock_map_close(struct sock *sk, long timeout)
> > > >  	saved_close = psock->saved_close;
> > > >  	sock_map_remove_links(sk, psock);
> > > >  	rcu_read_unlock();
> > > > -	sk_psock_stop(psock, true);
> > > > -	sk_psock_put(sk, psock);
> > > > +	sk_psock_stop(psock);
> > > >  	release_sock(sk);
> > > > +	cancel_work_sync(&psock->work);
> > > > +	sk_psock_put(sk, psock);
> > > >  	saved_close(sk, timeout);
> > > >  }
> > > >
> > > > The sk_psock_put is going to cancel the work before destroying the psock,
> > > >
> > > >  sk_psock_put()
> > > >    sk_psock_drop()
> > > >      queue_rcu_work(system_wq, psock->rwork)
> > > >
> > > > and then in callback we
> > > >
> > > >   sk_psock_destroy()
> > > >     cancel_work_sync(psock->work)
> > > >
> > > > although it might be nice to have the work cancelled earlier rather than
> > > > later, maybe.
> > > 
> > > Good point.
> > > 
> > > I kinda like the property that once close() returns we know there is no
> > > deferred work running for the socket.
> > > 
> > > I find the APIs where a deferred cleanup happens sometimes harder to
> > > write tests for.
> > > 
> > > But I don't really have a strong opinion here.
> > 
> > I don't either and Cong left it so I'm good with that.
> 
> It has been there because of the infamous warnings triggered in
> sk_stream_kill_queues(). We have to wait for in-flight packets, but this
> _may_ change after we switch to tcp_read_skb(), where we call
> skb_set_owner_sk_safe().
> 
> 
> > 
> > Reviewing backlog logic though I think there is another bug there, but
> > I haven't been able to trigger it in any of our tests.
> > 
> > The sk_psock_backlog() logic is,
> > 
> >  sk_psock_backlog(struct work_struct *work)
> >    mutex_lock()
> >    while (skb = ...)
> >    ...
> >    do {
> >      ret = sk_psock_handle_skb()
> >      if (ret <= 0) {
> >        if (ret == -EAGAIN) {
> >            sk_psock_skb_state()
> >            goto  end;
> >        } 
> >       ...
> >    } while (len);
> >    ...
> >   end:
> >    mutex_unlock()
> > 
> > what I'm not seeing is if we get an EAGAIN through sk_psock_handle_skb
> > how do we schedule the backlog again. For egress we would set the
> > SOCK_NOSPACE bit and then get a write space available callback which
> > would do the schedule(). The ingress side could fail with EAGAIN
> > through the alloc_sk_msg(GFP_ATOMIC) call. This is just a kzalloc,
> > 
> >    sk_psock_handle_skb()
> >     sk_psock_skb_ingress()
> >      sk_psock_skb_ingress_self()
> >        msg = alloc_sk_msg()
> >                kzalloc()          <- this can return NULL
> >        if (!msg)
> >           return -EAGAIN          <- could we stall now
> 
> Returning EAGAIN here makes little sense to me; it should be ENOMEM,
> which is not worth retrying.

The trouble with not retrying is that we would drop the skb, and unless
the application retries, this could hang the application. So we
really need to try hard to send the sk_buff.

> 
> For other EAGAIN cases, why not just reschedule this work since state is
> already saved in sk_psock_work_state?

For EAGAIN, sure. For the ENOMEM case above, I simply didn't want to get
into a loop where we hit it repeatedly with no backoff. I'm testing
a patch now and can send it tomorrow.
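
To make the "no backoff" concern concrete, here is a rough userspace
sketch of a capped exponential delay between retries after an allocation
failure. The function name, the 1 ms base and the 1000 ms cap are all
assumptions for illustration; this is not the patch being tested.

```c
#include <assert.h>

#define BACKOFF_BASE_MS	1u	/* assumed delay before the first retry */
#define BACKOFF_MAX_MS	1000u	/* assumed upper bound on the delay */

/*
 * Delay in milliseconds before retry number `attempt` (0-based).
 * The delay doubles with every attempt and is clamped to
 * BACKOFF_MAX_MS, so repeated -ENOMEM results cannot turn into a
 * tight retry loop. The loop stops doubling once the cap is reached,
 * so the value cannot overflow.
 */
unsigned int enomem_backoff_ms(unsigned int attempt)
{
	unsigned int delay = BACKOFF_BASE_MS;

	while (attempt-- > 0 && delay < BACKOFF_MAX_MS)
		delay *= 2;

	return delay < BACKOFF_MAX_MS ? delay : BACKOFF_MAX_MS;
}
```

In the kernel, such a delay would be converted with msecs_to_jiffies()
and handed to something like schedule_delayed_work(), assuming
psock->work were turned into a delayed_work.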

> 
> Thanks.

Patch

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 48f4b645193b..70d6cb94e580 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -376,7 +376,7 @@  static inline void sk_psock_report_error(struct sk_psock *psock, int err)
 }
 
 struct sk_psock *sk_psock_init(struct sock *sk, int node);
-void sk_psock_stop(struct sk_psock *psock, bool wait);
+void sk_psock_stop(struct sk_psock *psock);
 
 #if IS_ENABLED(CONFIG_BPF_STREAM_PARSER)
 int sk_psock_init_strp(struct sock *sk, struct sk_psock *psock);
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index ca70525621c7..c329e71ea924 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -647,6 +647,11 @@  static void sk_psock_backlog(struct work_struct *work)
 	int ret;
 
 	mutex_lock(&psock->work_mutex);
+	lock_sock(psock->sk);
+
+	if (!sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED))
+		goto end;
+
 	if (unlikely(state->skb)) {
 		spin_lock_bh(&psock->ingress_lock);
 		skb = state->skb;
@@ -672,9 +677,12 @@  static void sk_psock_backlog(struct work_struct *work)
 		skb_bpf_redirect_clear(skb);
 		do {
 			ret = -EIO;
-			if (!sock_flag(psock->sk, SOCK_DEAD))
+			if (!sock_flag(psock->sk, SOCK_DEAD)) {
+				release_sock(psock->sk);
 				ret = sk_psock_handle_skb(psock, skb, off,
 							  len, ingress);
+				lock_sock(psock->sk);
+			}
 			if (ret <= 0) {
 				if (ret == -EAGAIN) {
 					sk_psock_skb_state(psock, state, skb,
@@ -695,6 +703,7 @@  static void sk_psock_backlog(struct work_struct *work)
 			kfree_skb(skb);
 	}
 end:
+	release_sock(psock->sk);
 	mutex_unlock(&psock->work_mutex);
 }
 
@@ -803,16 +812,14 @@  static void sk_psock_link_destroy(struct sk_psock *psock)
 	}
 }
 
-void sk_psock_stop(struct sk_psock *psock, bool wait)
+void sk_psock_stop(struct sk_psock *psock)
 {
 	spin_lock_bh(&psock->ingress_lock);
 	sk_psock_clear_state(psock, SK_PSOCK_TX_ENABLED);
 	sk_psock_cork_free(psock);
 	__sk_psock_zap_ingress(psock);
 	spin_unlock_bh(&psock->ingress_lock);
-
-	if (wait)
-		cancel_work_sync(&psock->work);
+	cancel_work(&psock->work);
 }
 
 static void sk_psock_done_strp(struct sk_psock *psock);
@@ -850,7 +857,7 @@  void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
 		sk_psock_stop_verdict(sk, psock);
 	write_unlock_bh(&sk->sk_callback_lock);
 
-	sk_psock_stop(psock, false);
+	sk_psock_stop(psock);
 
 	INIT_RCU_WORK(&psock->rwork, sk_psock_destroy);
 	queue_rcu_work(system_wq, &psock->rwork);
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index a660baedd9e7..d4e11d7f459c 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -1596,7 +1596,7 @@  void sock_map_destroy(struct sock *sk)
 	saved_destroy = psock->saved_destroy;
 	sock_map_remove_links(sk, psock);
 	rcu_read_unlock();
-	sk_psock_stop(psock, false);
+	sk_psock_stop(psock);
 	sk_psock_put(sk, psock);
 	saved_destroy(sk);
 }
@@ -1619,7 +1619,7 @@  void sock_map_close(struct sock *sk, long timeout)
 	saved_close = psock->saved_close;
 	sock_map_remove_links(sk, psock);
 	rcu_read_unlock();
-	sk_psock_stop(psock, true);
+	sk_psock_stop(psock);
 	sk_psock_put(sk, psock);
 	release_sock(sk);
 	saved_close(sk, timeout);