Message ID:   20230223215311.926899-3-aditi.ghag@isovalent.com (mailing list archive)
State:        Changes Requested
Delegated to: BPF
Series:       Add socket destroy capability
On 02/23, Aditi Ghag wrote:
> The socket destroy kfunc is used to forcefully terminate sockets from
> certain BPF contexts. We plan to use the capability in Cilium to force
> client sockets to reconnect when their remote load-balancing backends are
> deleted. The other use case is on-the-fly policy enforcement where existing
> socket connections prevented by policies need to be forcefully terminated.
> The helper allows terminating sockets that may or may not be actively
> sending traffic.
>
> The helper is currently exposed to certain BPF iterators where users can
> filter, and terminate selected sockets. Additionally, the helper can only
> be called from these BPF contexts that ensure socket locking in order to
> allow synchronous execution of destroy helpers that also acquire socket
> locks. The previous commit that batches UDP sockets during iteration
> facilitated a synchronous invocation of the destroy helper from BPF context
> by skipping taking socket locks in the destroy handler. TCP iterators
> already supported batching.
>
> The helper takes `sock_common` type argument, even though it expects, and
> casts them to a `sock` pointer. This enables the verifier to allow the
> sock_destroy kfunc to be called for TCP with `sock_common` and UDP with
> `sock` structs. As a comparison, BPF helpers enable this behavior with the
> `ARG_PTR_TO_BTF_ID_SOCK_COMMON` argument type. However, there is no such
> option available with the verifier logic that handles kfuncs where BTF
> types are inferred. Furthermore, as `sock_common` only has a subset of
> certain fields of `sock`, casting pointer to the latter type might not
> always be safe. Hence, the BPF kfunc converts the argument to a full sock
> before casting.
>
> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
> ---
>  net/core/filter.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++
>  net/ipv4/tcp.c    | 17 ++++++++++-----
>  net/ipv4/udp.c    |  7 ++++--
>  3 files changed, 72 insertions(+), 7 deletions(-)
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 1d6f165923bf..79cd91ba13d0 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -11621,3 +11621,58 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id)
>  	return func;
>  }
> +
> +/* Disables missing prototype warnings */
> +__diag_push();
> +__diag_ignore_all("-Wmissing-prototypes",
> +		  "Global functions as their definitions will be in vmlinux BTF");
> +
> +/* bpf_sock_destroy: Destroy the given socket with ECONNABORTED error code.
> + *
> + * The helper expects a non-NULL pointer to a full socket. It invokes
> + * the protocol specific socket destroy handlers.
> + *
> + * The helper can only be called from BPF contexts that have acquired the socket
> + * locks.
> + *
> + * Parameters:
> + * @sock: Pointer to socket to be destroyed
> + *
> + * Return:
> + * On error, may return EPROTONOSUPPORT, EINVAL.
> + * EPROTONOSUPPORT if protocol specific destroy handler is not implemented.
> + * 0 otherwise
> + */
> +int bpf_sock_destroy(struct sock_common *sock)

Prefix with __bpf_kfunc (see other kfuncs).

> +{
> +	/* Validates the socket can be type casted to a full socket. */
> +	struct sock *sk = sk_to_full_sk((struct sock *)sock);
> +
> +	if (!sk)
> +		return -EINVAL;
> +
> +	/* The locking semantics that allow for synchronous execution of the
> +	 * destroy handlers are only supported for TCP and UDP.
> +	 */
> +	if (!sk->sk_prot->diag_destroy || sk->sk_protocol == IPPROTO_RAW)
> +		return -EOPNOTSUPP;
> +
> +	return sk->sk_prot->diag_destroy(sk, ECONNABORTED);
> +}
> +
> +__diag_pop()
> +
> +BTF_SET8_START(sock_destroy_kfunc_set)
> +BTF_ID_FLAGS(func, bpf_sock_destroy)
> +BTF_SET8_END(sock_destroy_kfunc_set)
> +
> +static const struct btf_kfunc_id_set bpf_sock_destroy_kfunc_set = {
> +	.owner = THIS_MODULE,
> +	.set   = &sock_destroy_kfunc_set,
> +};
> +
> +static int init_subsystem(void)
> +{
> +	return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_sock_destroy_kfunc_set);

Is it safe? Does it mean I can call bpf_sock_destroy from any tracing
program from anywhere? What if the socket is not locked?
IOW, do we have to constrain it to the iterator programs (at least for
now)?

> +}
> +late_initcall(init_subsystem);
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 33f559f491c8..8123c264d8ea 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -4678,8 +4678,10 @@ int tcp_abort(struct sock *sk, int err)
>  		return 0;
>  	}
>
> -	/* Don't race with userspace socket closes such as tcp_close. */
> -	lock_sock(sk);
> +	/* BPF context ensures sock locking. */
> +	if (!has_current_bpf_ctx())
> +		/* Don't race with userspace socket closes such as tcp_close. */
> +		lock_sock(sk);
>
>  	if (sk->sk_state == TCP_LISTEN) {
>  		tcp_set_state(sk, TCP_CLOSE);
> @@ -4688,7 +4690,9 @@ int tcp_abort(struct sock *sk, int err)
>
>  	/* Don't race with BH socket closes such as inet_csk_listen_stop. */
>  	local_bh_disable();
> -	bh_lock_sock(sk);
> +	if (!has_current_bpf_ctx())
> +		bh_lock_sock(sk);
> +
>  	if (!sock_flag(sk, SOCK_DEAD)) {
>  		sk->sk_err = err;
> @@ -4700,10 +4704,13 @@ int tcp_abort(struct sock *sk, int err)
>  		tcp_done(sk);
>  	}
>
> -	bh_unlock_sock(sk);
> +	if (!has_current_bpf_ctx())
> +		bh_unlock_sock(sk);
> +
>  	local_bh_enable();
>  	tcp_write_queue_purge(sk);
> -	release_sock(sk);
> +	if (!has_current_bpf_ctx())
> +		release_sock(sk);
>  	return 0;
>  }
>  EXPORT_SYMBOL_GPL(tcp_abort);
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 2f3978de45f2..1bc9ad92c3d4 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -2925,7 +2925,9 @@ EXPORT_SYMBOL(udp_poll);
>
>  int udp_abort(struct sock *sk, int err)
>  {
> -	lock_sock(sk);
> +	/* BPF context ensures sock locking. */
> +	if (!has_current_bpf_ctx())
> +		lock_sock(sk);
>
>  	/* udp{v6}_destroy_sock() sets it under the sk lock, avoid racing
>  	 * with close()
> @@ -2938,7 +2940,8 @@ int udp_abort(struct sock *sk, int err)
>  	__udp_disconnect(sk, 0);
>
>  out:
> -	release_sock(sk);
> +	if (!has_current_bpf_ctx())
> +		release_sock(sk);
>
>  	return 0;
>  }
> --
> 2.34.1
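[Editor's note: for context, a minimal sketch of how a BPF iterator program might invoke the proposed kfunc. The port filter and program name are illustrative, not from the series; the kfunc declaration matches the signature in the patch above.]

	/* SPDX-License-Identifier: GPL-2.0 */
	/* Illustrative only: destroy TCP sockets bound to local port 8000. */
	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>

	/* kfunc as proposed in this patch */
	extern int bpf_sock_destroy(struct sock_common *sk) __ksym;

	SEC("iter/tcp")
	int iter_tcp_destroy(struct bpf_iter__tcp *ctx)
	{
		struct sock_common *sk_common = ctx->sk_common;

		if (!sk_common)
			return 0;

		/* skc_num is the local port in host byte order */
		if (sk_common->skc_num == 8000)
			bpf_sock_destroy(sk_common);

		return 0;
	}

	char _license[] SEC("license") = "GPL";

Passing ctx->sk_common straight through is what lets the one `sock_common` kfunc signature serve both the TCP iterator (which yields `sock_common`) and the UDP iterator (which yields a full `sock`), per the commit message above.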
> On Feb 24, 2023, at 2:35 PM, Stanislav Fomichev <sdf@google.com> wrote:
>
> On 02/23, Aditi Ghag wrote:
>
> [...]
>
>> +int bpf_sock_destroy(struct sock_common *sock)
>
> Prefix with __bpf_kfunc (see other kfuncs).

Will do!

> [...]
>
>> +static int init_subsystem(void)
>> +{
>> +	return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_sock_destroy_kfunc_set);
>
> Is it safe? Does it mean I can call bpf_sock_destroy from any tracing
> program from anywhere? What if the socket is not locked?
> IOW, do we have to constrain it to the iterator programs (at least for
> now)?

Given kprobes are not considered as part of BPF_PROG_TYPE_TRACING, I'm not
sure if there are other tracing programs with sock/sock_common arguments.
Regardless, this is a valid point. I had brought up a similar topic earlier
during the v1 discussion -
https://lore.kernel.org/bpf/78E434B0-06A9-466F-A061-B9A05DC6DE6D@isovalent.com/.
I suppose you would have a similar problem in the case of setsockopt* helpers.
Is the general topic of limiting access for kfunc to a subset of BPF_PROG_*
programs being discussed?

> [...]
> On Feb 27, 2023, at 6:56 AM, Aditi Ghag <aditi.ghag@isovalent.com> wrote:
>
>> On Feb 24, 2023, at 2:35 PM, Stanislav Fomichev <sdf@google.com> wrote:
>>
>> On 02/23, Aditi Ghag wrote:
>>
>> [...]
>>
>>> +static int init_subsystem(void)
>>> +{
>>> +	return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_sock_destroy_kfunc_set);
>>
>> Is it safe? Does it mean I can call bpf_sock_destroy from any tracing
>> program from anywhere? What if the socket is not locked?
>> IOW, do we have to constrain it to the iterator programs (at least for
>> now)?
>
> Given kprobes are not considered as part of BPF_PROG_TYPE_TRACING, I'm not
> sure if there are other tracing programs with sock/sock_common arguments.
> Regardless, this is a valid point. [...]
> Is the general topic of limiting access for kfunc to a subset of BPF_PROG_*
> programs being discussed?

I've submitted this as a potential topic for the LSF/MM/BPF agenda.

Also, related to this discussion: could we have something similar to
lockdep_sock_is_held with less overhead?

> [...]
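[Editor's note: the locking question above is easier to follow with the helper's definition in view. A close approximation of has_current_bpf_ctx() (assuming CONFIG_BPF_SYSCALL; otherwise a stub returning false is used) shows it only reports that the task is executing some BPF program — it proves nothing about whether this particular socket's lock is held, which is exactly the gap the reviewers are probing:]

	static inline bool has_current_bpf_ctx(void)
	{
		/* True iff the current task carries a BPF run context, i.e.
		 * it is inside a BPF program; says nothing about sk locks.
		 */
		return !!current->bpf_ctx;
	}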
On Mon, Feb 27, 2023 at 6:56 AM Aditi Ghag <aditi.ghag@isovalent.com> wrote:
>
> > On Feb 24, 2023, at 2:35 PM, Stanislav Fomichev <sdf@google.com> wrote:
> >
> > [...]
> >
> > Is it safe? Does it mean I can call bpf_sock_destroy from any tracing
> > program from anywhere? What if the socket is not locked?
> > IOW, do we have to constrain it to the iterator programs (at least for
> > now)?
>
> Given kprobes are not considered as part of BPF_PROG_TYPE_TRACING, I'm not
> sure if there are other tracing programs with sock/sock_common arguments.
> Regardless, this is a valid point. I had brought up a similar topic earlier
> during the v1 discussion -
> https://lore.kernel.org/bpf/78E434B0-06A9-466F-A061-B9A05DC6DE6D@isovalent.com/.
> I suppose you would have a similar problem in the case of setsockopt* helpers.

Sure, the same problem exists for bpf_setsockopt helper, but these should
be exposed only in a handful of hooks (where we know that the socket is
either locked or not). See, for example, [0] as one of the recent fixes.

> Is the general topic of limiting access for kfunc to a subset of
> BPF_PROG_* programs being discussed?

Some of that discussion might have happened in [1] (or one of the earlier
respins). I think at that time I was thinking maybe we can use btf_tags to
annotate __locked/__unlocked socket arguments. Then, in those annotated
contexts, the verifier might allow you to call bpf_setsockopt.. But I
haven't really explored this too much.

0: https://lore.kernel.org/bpf/20230127001732.4162630-1-kuifeng@meta.com/
1: https://lore.kernel.org/bpf/20220622160346.967594-1-sdf@google.com/

> [...]
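[Editor's note: one hypothetical way to constrain the kfunc to iterator programs, sketched under the assumption that the kfunc registration machinery gains a per-program filter hook — no such hook exists at the time of this exchange, and the names below are invented for illustration:]

	/* Hypothetical: reject any caller that is not a BPF iterator program. */
	static int tracing_iter_filter(const struct bpf_prog *prog, u32 kfunc_id)
	{
		return prog->expected_attach_type == BPF_TRACE_ITER ? 0 : -EACCES;
	}

	static const struct btf_kfunc_id_set bpf_sock_destroy_kfunc_set = {
		.owner  = THIS_MODULE,
		.set    = &sock_destroy_kfunc_set,
		.filter = tracing_iter_filter,	/* hypothetical field */
	};

Keyed on expected_attach_type, such a check would admit iterator programs (which hold the required locks by construction) while rejecting fentry/fexit and other tracing attach points.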
On 2/23/23 1:53 PM, Aditi Ghag wrote:
> +/* bpf_sock_destroy: Destroy the given socket with ECONNABORTED error code.
> + *
> + * The helper expects a non-NULL pointer to a full socket. It invokes
> + * the protocol specific socket destroy handlers.
> + *
> + * The helper can only be called from BPF contexts that have acquired the socket
> + * locks.
> + *
> + * Parameters:
> + * @sock: Pointer to socket to be destroyed
> + *
> + * Return:
> + * On error, may return EPROTONOSUPPORT, EINVAL.
> + * EPROTONOSUPPORT if protocol specific destroy handler is not implemented.
> + * 0 otherwise
> + */
> +int bpf_sock_destroy(struct sock_common *sock)
> +{
> +	/* Validates the socket can be type casted to a full socket. */
> +	struct sock *sk = sk_to_full_sk((struct sock *)sock);

If sk != sock, sk is not locked.

Does it have to do sk_to_full_sk()? From looking at tcp_abort(), it can
handle TCP_NEW_SYN_RECV and TCP_TIME_WAIT. The bpf-tcp-iter is iterating
the hashtables that may have sk in different states. Ideally, bpf-tcp-iter
should be able to use bpf_sock_destroy() to abort the sk in different
states also. Otherwise, the bpf_sock_destroy kfunc is aborting the
listener while the bpf prog expects to abort a req_sk.

> +
> +	if (!sk)
> +		return -EINVAL;
> +
> +	/* The locking semantics that allow for synchronous execution of the
> +	 * destroy handlers are only supported for TCP and UDP.
> +	 */
> +	if (!sk->sk_prot->diag_destroy || sk->sk_protocol == IPPROTO_RAW)
> +		return -EOPNOTSUPP;
> +
> +	return sk->sk_prot->diag_destroy(sk, ECONNABORTED);
> +}
> +
> +__diag_pop()
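[Editor's note: the concern above follows directly from what sk_to_full_sk() does — approximately, it maps a request socket to its associated listener and passes everything else (including timewait sockets) through:]

	static inline struct sock *sk_to_full_sk(struct sock *sk)
	{
	#ifdef CONFIG_INET
		if (sk && sk->sk_state == TCP_NEW_SYN_RECV)
			sk = inet_reqsk(sk)->rsk_listener;
	#endif
		return sk;
	}

So for a req_sk the kfunc would operate on a different, unlocked socket — the listener — which is the mismatch Martin describes.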
> On Feb 28, 2023, at 2:55 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 2/23/23 1:53 PM, Aditi Ghag wrote:
>
>> [...]
>>
>> +int bpf_sock_destroy(struct sock_common *sock)
>> +{
>> +	/* Validates the socket can be type casted to a full socket. */
>> +	struct sock *sk = sk_to_full_sk((struct sock *)sock);
>
> If sk != sock, sk is not locked.
>
> Does it have to do sk_to_full_sk()? From looking at tcp_abort(), it can
> handle TCP_NEW_SYN_RECV and TCP_TIME_WAIT. The bpf-tcp-iter is iterating
> the hashtables that may have sk in different states. Ideally, bpf-tcp-iter
> should be able to use bpf_sock_destroy() to abort the sk in different
> states also.

I initially added the check for request sockets as tcp_abort references some
of the fields outside of `sock_common`, but tcp_abort does have special
handling for both req and time_wait sockets, as you pointed out. So I'll
remove the `sk_to_full_sk()` call.

Eric/Martin:

> Ideally, bpf-tcp-iter should be able to use bpf_sock_destroy() to abort
> the sk in different states also.

On somewhat of a related note, I ran into an interesting problem while adding
more tests to exercise the changes in the first commit ("Implement batching
for UDP sockets iterator"). As it turns out, UDP *listening* sockets weren't
getting destroyed, unlike client sockets.

So here is what the test does at a high level -
1) Start SO_REUSEPORT servers. (I hit the same issue for regular UDP
   listening sockets as well.)
2) Run BPF iterators that destroy sockets (there are only server sockets).
3) Start a regular server that binds on the same port as the ones from (1),
   with the expectation that it succeeds after the (1) sockets were
   destroyed. The server fails to bind!

When I debugged the issue, I found that the listening UDP sockets were still
lurking around in the hash table even after they were supposed to be
destroyed. With the help of kprobes and print statements in the BPF test
iterator program, I confirmed that udp_abort and the internal function calls
(specifically, `__udp_disconnect`) were getting invoked as expected, and the
`sk_state` was also being set to `TCP_CLOSE`. Upon further investigation, I
found that the socket unhash and source port reset weren't happening. This is
the relevant code -
https://github.com/torvalds/linux/blob/master/net/ipv4/udp.c#L1988 [1].
When I commented out the `SOCK_BINDPORT_LOCK` check, the new test passed as
sockets were getting destroyed correctly.

I didn't observe similar behavior with TCP: TCP listening sockets were
correctly getting destroyed, as `tcp_set_state` unhashes sockets
unconditionally for the `TCP_CLOSE` state.

Can you share insights into why the socket unhashing and source port reset
don't happen for bind'ed sockets? If that's expected, I suppose we'll need
to unhash and reset source ports for sockets getting destroyed, right?
(I wonder if this was an oversight when `__udp_disconnect` was introduced in
commit 286c72deabaa240b7eebbd99496ed3324d69f3c0.)

If it's easier, I can push the next version of the patches where I've
addressed review comments and added new tests. We can then continue this
discussion there. In the latest version, I've modified [1] with a `TCP_CLOSE`
state check.

[1]	if (!(sk->sk_userlocks & SOCK_BINDPORT_LOCK)) {
		sk->sk_prot->unhash(sk);
		inet->inet_sport = 0;
	}

> Otherwise, the bpf_sock_destroy kfunc is aborting the listener while the
> bpf prog expects to abort a req_sk.

> [...]
> On Mar 16, 2023, at 3:37 PM, Aditi Ghag <aditi.ghag@isovalent.com> wrote:
>
> [...]
>
> If it's easier, I can push the next version of the patches where I've
> addressed review comments and added new tests. We can then continue this
> discussion there. In the latest version, I've modified [1] with a
> `TCP_CLOSE` state check.
>
> [1]	if (!(sk->sk_userlocks & SOCK_BINDPORT_LOCK)) {
>		sk->sk_prot->unhash(sk);
>		inet->inet_sport = 0;
>	}

Scratch my comment about adding the state check. We can discuss the fix in
the v3 patch series -
https://lore.kernel.org/bpf/20230321184541.1857363-5-aditi.ghag@isovalent.com/T/#u.

> [...]
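[Editor's note: for reference, the tail of __udp_disconnect() under discussion, approximately as in net/ipv4/udp.c at the time. The SOCK_BINDPORT_LOCK test is why a UDP socket that locked its port via an explicit bind() keeps its hash-table slot and source port even after being "destroyed":]

	sk->sk_state = TCP_CLOSE;
	inet->inet_daddr = 0;
	inet->inet_dport = 0;
	...
	/* Sockets with an explicitly bound port skip the unhash, so they
	 * remain visible in the UDP hash table after disconnect.
	 */
	if (!(sk->sk_userlocks & SOCK_BINDPORT_LOCK)) {
		sk->sk_prot->unhash(sk);
		inet->inet_sport = 0;
	}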
diff --git a/net/core/filter.c b/net/core/filter.c
index 1d6f165923bf..79cd91ba13d0 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -11621,3 +11621,58 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id)
 	return func;
 }
+
+/* Disables missing prototype warnings */
+__diag_push();
+__diag_ignore_all("-Wmissing-prototypes",
+		  "Global functions as their definitions will be in vmlinux BTF");
+
+/* bpf_sock_destroy: Destroy the given socket with ECONNABORTED error code.
+ *
+ * The helper expects a non-NULL pointer to a full socket. It invokes
+ * the protocol specific socket destroy handlers.
+ *
+ * The helper can only be called from BPF contexts that have acquired the socket
+ * locks.
+ *
+ * Parameters:
+ * @sock: Pointer to socket to be destroyed
+ *
+ * Return:
+ * On error, may return EPROTONOSUPPORT, EINVAL.
+ * EPROTONOSUPPORT if protocol specific destroy handler is not implemented.
+ * 0 otherwise
+ */
+int bpf_sock_destroy(struct sock_common *sock)
+{
+	/* Validates the socket can be type casted to a full socket. */
+	struct sock *sk = sk_to_full_sk((struct sock *)sock);
+
+	if (!sk)
+		return -EINVAL;
+
+	/* The locking semantics that allow for synchronous execution of the
+	 * destroy handlers are only supported for TCP and UDP.
+	 */
+	if (!sk->sk_prot->diag_destroy || sk->sk_protocol == IPPROTO_RAW)
+		return -EOPNOTSUPP;
+
+	return sk->sk_prot->diag_destroy(sk, ECONNABORTED);
+}
+
+__diag_pop()
+
+BTF_SET8_START(sock_destroy_kfunc_set)
+BTF_ID_FLAGS(func, bpf_sock_destroy)
+BTF_SET8_END(sock_destroy_kfunc_set)
+
+static const struct btf_kfunc_id_set bpf_sock_destroy_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set   = &sock_destroy_kfunc_set,
+};
+
+static int init_subsystem(void)
+{
+	return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_sock_destroy_kfunc_set);
+}
+late_initcall(init_subsystem);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 33f559f491c8..8123c264d8ea 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4678,8 +4678,10 @@ int tcp_abort(struct sock *sk, int err)
 		return 0;
 	}
 
-	/* Don't race with userspace socket closes such as tcp_close. */
-	lock_sock(sk);
+	/* BPF context ensures sock locking. */
+	if (!has_current_bpf_ctx())
+		/* Don't race with userspace socket closes such as tcp_close. */
+		lock_sock(sk);
 
 	if (sk->sk_state == TCP_LISTEN) {
 		tcp_set_state(sk, TCP_CLOSE);
@@ -4688,7 +4690,9 @@ int tcp_abort(struct sock *sk, int err)
 
 	/* Don't race with BH socket closes such as inet_csk_listen_stop. */
 	local_bh_disable();
-	bh_lock_sock(sk);
+	if (!has_current_bpf_ctx())
+		bh_lock_sock(sk);
+
 	if (!sock_flag(sk, SOCK_DEAD)) {
 		sk->sk_err = err;
@@ -4700,10 +4704,13 @@ int tcp_abort(struct sock *sk, int err)
 		tcp_done(sk);
 	}
 
-	bh_unlock_sock(sk);
+	if (!has_current_bpf_ctx())
+		bh_unlock_sock(sk);
+
 	local_bh_enable();
 	tcp_write_queue_purge(sk);
-	release_sock(sk);
+	if (!has_current_bpf_ctx())
+		release_sock(sk);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(tcp_abort);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 2f3978de45f2..1bc9ad92c3d4 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2925,7 +2925,9 @@ EXPORT_SYMBOL(udp_poll);
 
 int udp_abort(struct sock *sk, int err)
 {
-	lock_sock(sk);
+	/* BPF context ensures sock locking. */
+	if (!has_current_bpf_ctx())
+		lock_sock(sk);
 
 	/* udp{v6}_destroy_sock() sets it under the sk lock, avoid racing
 	 * with close()
@@ -2938,7 +2940,8 @@ int udp_abort(struct sock *sk, int err)
 	__udp_disconnect(sk, 0);
 
 out:
-	release_sock(sk);
+	if (!has_current_bpf_ctx())
+		release_sock(sk);
 
 	return 0;
 }
The socket destroy kfunc is used to forcefully terminate sockets from
certain BPF contexts. We plan to use the capability in Cilium to force
client sockets to reconnect when their remote load-balancing backends are
deleted. The other use case is on-the-fly policy enforcement where existing
socket connections prevented by policies need to be forcefully terminated.
The helper allows terminating sockets that may or may not be actively
sending traffic.

The helper is currently exposed to certain BPF iterators where users can
filter, and terminate selected sockets. Additionally, the helper can only
be called from these BPF contexts that ensure socket locking in order to
allow synchronous execution of destroy helpers that also acquire socket
locks. The previous commit that batches UDP sockets during iteration
facilitated a synchronous invocation of the destroy helper from BPF context
by skipping taking socket locks in the destroy handler. TCP iterators
already supported batching.

The helper takes `sock_common` type argument, even though it expects, and
casts them to a `sock` pointer. This enables the verifier to allow the
sock_destroy kfunc to be called for TCP with `sock_common` and UDP with
`sock` structs. As a comparison, BPF helpers enable this behavior with the
`ARG_PTR_TO_BTF_ID_SOCK_COMMON` argument type. However, there is no such
option available with the verifier logic that handles kfuncs where BTF
types are inferred. Furthermore, as `sock_common` only has a subset of
certain fields of `sock`, casting pointer to the latter type might not
always be safe. Hence, the BPF kfunc converts the argument to a full sock
before casting.

Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
---
 net/core/filter.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++
 net/ipv4/tcp.c    | 17 ++++++++++-----
 net/ipv4/udp.c    |  7 ++++--
 3 files changed, 72 insertions(+), 7 deletions(-)
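[Editor's note: for completeness, a sketch of how such a destroy iterator would be driven from user space — iterator programs only run when their seq_file is read. The skeleton header and program name are hypothetical; the libbpf calls are the standard iterator APIs.]

	#include <unistd.h>
	#include <bpf/libbpf.h>

	/* Assumes a generated skeleton "sock_destroy.skel.h" exposing the
	 * iter_tcp_destroy program sketched earlier (hypothetical names).
	 */
	static void run_destroy_iter(struct sock_destroy *skel)
	{
		struct bpf_link *link;
		char buf[64];
		int iter_fd;

		link = bpf_program__attach_iter(skel->progs.iter_tcp_destroy, NULL);
		if (!link)
			return;

		iter_fd = bpf_iter_create(bpf_link__fd(link));
		if (iter_fd >= 0) {
			/* Reading the iterator fd walks the sockets and runs
			 * the BPF program on each, triggering the destroys.
			 */
			while (read(iter_fd, buf, sizeof(buf)) > 0)
				;
			close(iter_fd);
		}
		bpf_link__destroy(link);
	}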