diff mbox series

[1/2] bpf: Add socket destroy capability

Message ID c3b935a5a72b1371f9262348616a7fa84061b85f.1671242108.git.aditi.ghag@isovalent.com (mailing list archive)
State Changes Requested
Delegated to: BPF
Series bpf-next: Add socket destroy capability

Checks

Context Check Description
netdev/tree_selection success Not a local patch, async
bpf/vmtest-bpf-next-PR success PR summary
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-7 success Logs for llvm-toolchain
bpf/vmtest-bpf-next-VM_Test-8 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-2 success Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-3 success Logs for build for aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-5 success Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-6 success Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-4 success Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-9 success Logs for test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-10 success Logs for test_maps on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-12 success Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-13 success Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-14 success Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-15 success Logs for test_progs on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-17 success Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-18 success Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-19 success Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-20 success Logs for test_progs_no_alu32 on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-22 success Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-23 success Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-24 success Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-25 success Logs for test_progs_no_alu32_parallel on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-27 success Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-28 success Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-29 success Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-30 success Logs for test_progs_parallel on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-32 success Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-33 success Logs for test_progs_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-34 success Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-35 success Logs for test_verifier on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-36 success Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-37 success Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-38 success Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-26 success Logs for test_progs_no_alu32_parallel on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-21 success Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-31 success Logs for test_progs_parallel on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-16 success Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-11 success Logs for test_maps on s390x with gcc

Commit Message

Aditi Ghag Dec. 17, 2022, 1:57 a.m. UTC
The socket destroy helper is used to forcefully terminate sockets from
certain BPF contexts. We plan to use the capability in Cilium to force
client sockets to reconnect when their remote load-balancing backends
are deleted. The other use case is on-the-fly policy enforcement where
existing socket connections prevented by policies need to be terminated.

The helper is currently exposed to iterator type BPF programs where
users can filter and terminate a set of sockets.

Sockets are destroyed asynchronously using the work queue
infrastructure. This preserves the current locking semantics within the
socket destroy handlers, as BPF iterators invoking the helper acquire
*sock* locks. It also allows the helper to be invoked from
non-sleepable contexts. The other approach, skipping the locks by
passing an argument to the `diag_destroy` handler, didn't work out well
for UDP, as the UDP abort function internally invokes another function
that ends up acquiring the *sock* lock. While there are sleepable BPF
iterators, these are limited to only certain map types. Furthermore,
relying on them would prevent us from extending the helper to other
non-sleepable BPF programs.

The work queue infrastructure processes work items from per-cpu
structures. As the sock destroy work items are executed asynchronously,
we need to ref count sockets before they are added to the work queue.
The 'work_pending' check prevents duplicate ref counting of sockets in
case users invoke the destroy helper for a socket multiple times. The
`{READ,WRITE}_ONCE` macros ensure that the socket pointer stored in a
work queue item isn't clobbered while the item is being processed. As
BPF programs are non-preemptible, we can expect that once a socket is
ref counted, no other socket can sneak in before the ref counted socket
is added to the work queue for asynchronous destroy. Finally, users are
expected to retry when the helper fails to queue a work item because
another destroy operation is in progress.

Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
---
 include/linux/bpf.h            |  1 +
 include/uapi/linux/bpf.h       | 17 +++++++++
 kernel/bpf/core.c              |  1 +
 kernel/trace/bpf_trace.c       |  2 +
 net/core/filter.c              | 70 ++++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h | 17 +++++++++
 6 files changed, 108 insertions(+)
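
As an illustration only (not part of the patch), a rough sketch of how an
iterator program might use the helper, assuming the bpf_sock_destroy()
declaration generated from the updated UAPI header; the program name and the
port filter are made up for the example:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

char _license[] SEC("license") = "GPL";

/* Destroy TCP sockets connected to a (hypothetical) remote port 4242. */
SEC("iter/tcp")
int destroy_port_4242(struct bpf_iter__tcp *ctx)
{
	struct sock_common *sk_common = ctx->sk_common;

	if (!sk_common)
		return 0;

	/* Hypothetical filter: only abort connections to port 4242. */
	if (sk_common->skc_dport != bpf_htons(4242))
		return 0;

	/* ARG_PTR_TO_BTF_ID_SOCK_COMMON also accepts the sock_common
	 * pointer provided by the iterator context.
	 */
	bpf_sock_destroy((struct sock *)sk_common);
	return 0;
}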

Comments

Stanislav Fomichev Dec. 19, 2022, 6:22 p.m. UTC | #1
On 12/17, Aditi Ghag wrote:
> The socket destroy helper is used to
> forcefully terminate sockets from certain
> BPF contexts. We plan to use the capability
> in Cilium to force client sockets to reconnect
> when their remote load-balancing backends are
> deleted. The other use case is on-the-fly
> policy enforcement where existing socket
> connections prevented by policies need to
> be terminated.

> The helper is currently exposed to iterator
> type BPF programs where users can filter,
> and terminate a set of sockets.

> Sockets are destroyed asynchronously using
> the work queue infrastructure. This allows
> for current the locking semantics within

s/current the/the current/ ?

> socket destroy handlers, as BPF iterators
> invoking the helper acquire *sock* locks.
> This also allows the helper to be invoked
> from non-sleepable contexts.
> The other approach to skip acquiring locks
> by passing an argument to the `diag_destroy`
> handler didn't work out well for UDP, as
> the UDP abort function internally invokes
> another function that ends up acquiring
> *sock* lock.
> While there are sleepable BPF iterators,
> these are limited to only certain map types.
> Furthermore, it's limiting in the sense that
> it wouldn't allow us to extend the helper
> to other non-sleepable BPF programs.

> The work queue infrastructure processes work
> items from per-cpu structures. As the sock
> destroy work items are executed asynchronously,
> we need to ref count sockets before they are
> added to the work queue. The 'work_pending'
> check prevents duplicate ref counting of sockets
> in case users invoke the destroy helper for a
> socket multiple times. The `{READ,WRITE}_ONCE`
> macros ensure that the socket pointer stored
> in a work queue item isn't clobbered while
> the item is being processed. As BPF programs
> are non-preemptible, we can expect that once
> a socket is ref counted, no other socket can
> sneak in before the ref counted socket is
> added to the work queue for asynchronous destroy.
> Finally, users are expected to retry when the
> helper fails to queue a work item for a socket
> to be destroyed in case there is another destroy
> operation is in progress.

nit: maybe reformat to fit into 80 characters per line? A bit hard to
read with this narrow formatting..


> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
> ---
>   include/linux/bpf.h            |  1 +
>   include/uapi/linux/bpf.h       | 17 +++++++++
>   kernel/bpf/core.c              |  1 +
>   kernel/trace/bpf_trace.c       |  2 +
>   net/core/filter.c              | 70 ++++++++++++++++++++++++++++++++++
>   tools/include/uapi/linux/bpf.h | 17 +++++++++
>   6 files changed, 108 insertions(+)

> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 3de24cfb7a3d..60eaa05dfab3 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -2676,6 +2676,7 @@ extern const struct bpf_func_proto  
> bpf_get_retval_proto;
>   extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto;
>   extern const struct bpf_func_proto bpf_cgrp_storage_get_proto;
>   extern const struct bpf_func_proto bpf_cgrp_storage_delete_proto;
> +extern const struct bpf_func_proto bpf_sock_destroy_proto;

>   const struct bpf_func_proto *tracing_prog_func_proto(
>     enum bpf_func_id func_id, const struct bpf_prog *prog);
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 464ca3f01fe7..789ac7c59fdf 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -5484,6 +5484,22 @@ union bpf_attr {
>    *		0 on success.
>    *
>    *		**-ENOENT** if the bpf_local_storage cannot be found.
> + *
> + * int bpf_sock_destroy(struct sock *sk)
> + *	Description
> + *		Destroy the given socket with **ECONNABORTED** error code.
> + *
> + *		*sk* must be a non-**NULL** pointer to a socket.
> + *
> + *	Return
> + *		The socket is destroyed asynchronosuly, so 0 return value may
> + *		not suggest indicate that the socket was successfully destroyed.

s/suggest indicate/ with either suggest or indicate?

> + *
> + *		On error, may return **EPROTONOSUPPORT**, **EBUSY**, **EINVAL**.
> + *
> + *		**-EPROTONOSUPPORT** if protocol specific destroy handler is not  
> implemented.
> + *
> + *		**-EBUSY** if another socket destroy operation is in progress.
>    */
>   #define ___BPF_FUNC_MAPPER(FN, ctx...)			\
>   	FN(unspec, 0, ##ctx)				\
> @@ -5698,6 +5714,7 @@ union bpf_attr {
>   	FN(user_ringbuf_drain, 209, ##ctx)		\
>   	FN(cgrp_storage_get, 210, ##ctx)		\
>   	FN(cgrp_storage_delete, 211, ##ctx)		\
> +	FN(sock_destroy, 212, ##ctx)			\
>   	/* */

>   /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that  
> don't
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 7f98dec6e90f..c59bef9805e5 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -2651,6 +2651,7 @@ const struct bpf_func_proto bpf_snprintf_btf_proto  
> __weak;
>   const struct bpf_func_proto bpf_seq_printf_btf_proto __weak;
>   const struct bpf_func_proto bpf_set_retval_proto __weak;
>   const struct bpf_func_proto bpf_get_retval_proto __weak;
> +const struct bpf_func_proto bpf_sock_destroy_proto __weak;

>   const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
>   {
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index 3bbd3f0c810c..016dbee6b5e4 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -1930,6 +1930,8 @@ tracing_prog_func_proto(enum bpf_func_id func_id,  
> const struct bpf_prog *prog)
>   		return &bpf_get_socket_ptr_cookie_proto;
>   	case BPF_FUNC_xdp_get_buff_len:
>   		return &bpf_xdp_get_buff_len_trace_proto;
> +	case BPF_FUNC_sock_destroy:
> +		return &bpf_sock_destroy_proto;
>   #endif
>   	case BPF_FUNC_seq_printf:
>   		return prog->expected_attach_type == BPF_TRACE_ITER ?
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 929358677183..9753606ecc26 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -11569,6 +11569,8 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id)
>   		break;
>   	case BPF_FUNC_ktime_get_coarse_ns:
>   		return &bpf_ktime_get_coarse_ns_proto;
> +	case BPF_FUNC_sock_destroy:
> +		return &bpf_sock_destroy_proto;
>   	default:
>   		return bpf_base_func_proto(func_id);
>   	}
> @@ -11578,3 +11580,71 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id)

>   	return func;
>   }
> +
> +struct sock_destroy_work {
> +	struct sock *sk;
> +	struct work_struct destroy;
> +};
> +
> +static DEFINE_PER_CPU(struct sock_destroy_work, sock_destroy_workqueue);
> +
> +static void bpf_sock_destroy_fn(struct work_struct *work)
> +{
> +	struct sock_destroy_work *sd_work = container_of(work,
> +			struct sock_destroy_work, destroy);
> +	struct sock *sk = READ_ONCE(sd_work->sk);
> +
> +	sk->sk_prot->diag_destroy(sk, ECONNABORTED);
> +	sock_put(sk);
> +}
> +
> +static int __init bpf_sock_destroy_workqueue_init(void)
> +{
> +	int cpu;
> +	struct sock_destroy_work *work;
> +
> +	for_each_possible_cpu(cpu) {
> +		work = per_cpu_ptr(&sock_destroy_workqueue, cpu);
> +		INIT_WORK(&work->destroy, bpf_sock_destroy_fn);
> +	}
> +
> +	return 0;
> +}
> +subsys_initcall(bpf_sock_destroy_workqueue_init);
> +
> +BPF_CALL_1(bpf_sock_destroy, struct sock *, sk)
> +{
> +	struct sock_destroy_work *sd_work;
> +
> +	if (!sk->sk_prot->diag_destroy)
> +		return -EOPNOTSUPP;
> +
> +	sd_work = this_cpu_ptr(&sock_destroy_workqueue);

[..]

> +	/* This check prevents duplicate ref counting
> +	 * of sockets, in case the handler is invoked
> +	 * multiple times for the same socket.
> +	 */

This means this helper can only be called for a single socket during an
invocation; is it an ok compromise?

I'm also assuming it's still possible that this helper gets called for
the same socket on different cpus?

> +	if (work_pending(&sd_work->destroy))
> +		return -EBUSY;
> +
> +	/* Ref counting ensures that the socket
> +	 * isn't deleted from underneath us before
> +	 * the work queue item is processed.
> +	 */
> +	if (!refcount_inc_not_zero(&sk->sk_refcnt))
> +		return -EINVAL;
> +
> +	WRITE_ONCE(sd_work->sk, sk);
> +	if (!queue_work(system_wq, &sd_work->destroy)) {
> +		sock_put(sk);
> +		return -EBUSY;
> +	}
> +
> +	return 0;
> +}
> +
> +const struct bpf_func_proto bpf_sock_destroy_proto = {
> +	.func		= bpf_sock_destroy,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_PTR_TO_BTF_ID_SOCK_COMMON,
> +};
> diff --git a/tools/include/uapi/linux/bpf.h  
> b/tools/include/uapi/linux/bpf.h
> index 464ca3f01fe7..07154a4d92f9 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -5484,6 +5484,22 @@ union bpf_attr {
>    *		0 on success.
>    *
>    *		**-ENOENT** if the bpf_local_storage cannot be found.
> + *
> + * int bpf_sock_destroy(void *sk)
> + *	Description
> + *		Destroy the given socket with **ECONNABORTED** error code.
> + *
> + *		*sk* must be a non-**NULL** pointer to a socket.
> + *
> + *	Return
> + *		The socket is destroyed asynchronosuly, so 0 return value may
> + *		not indicate that the socket was successfully destroyed.
> + *
> + *		On error, may return **EPROTONOSUPPORT**, **EBUSY**, **EINVAL**.
> + *
> + *		**-EPROTONOSUPPORT** if protocol specific destroy handler is not  
> implemented.
> + *
> + *		**-EBUSY** if another socket destroy operation is in progress.
>    */
>   #define ___BPF_FUNC_MAPPER(FN, ctx...)			\
>   	FN(unspec, 0, ##ctx)				\
> @@ -5698,6 +5714,7 @@ union bpf_attr {
>   	FN(user_ringbuf_drain, 209, ##ctx)		\
>   	FN(cgrp_storage_get, 210, ##ctx)		\
>   	FN(cgrp_storage_delete, 211, ##ctx)		\
> +	FN(sock_destroy, 212, ##ctx)			\
>   	/* */

>   /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that  
> don't
> --
> 2.34.1
Alan Maguire Dec. 20, 2022, 10:26 a.m. UTC | #2
On 17/12/2022 01:57, Aditi Ghag wrote:
> The socket destroy helper is used to
> forcefully terminate sockets from certain
> BPF contexts. We plan to use the capability
> in Cilium to force client sockets to reconnect
> when their remote load-balancing backends are
> deleted. The other use case is on-the-fly
> policy enforcement where existing socket
> connections prevented by policies need to
> be terminated.
> 
> The helper is currently exposed to iterator
> type BPF programs where users can filter,
> and terminate a set of sockets.
> 
> Sockets are destroyed asynchronously using
> the work queue infrastructure. This allows
> for current the locking semantics within
> socket destroy handlers, as BPF iterators
> invoking the helper acquire *sock* locks.
> This also allows the helper to be invoked
> from non-sleepable contexts.
> The other approach to skip acquiring locks
> by passing an argument to the `diag_destroy`
> handler didn't work out well for UDP, as
> the UDP abort function internally invokes
> another function that ends up acquiring
> *sock* lock.
> While there are sleepable BPF iterators,
> these are limited to only certain map types.
> Furthermore, it's limiting in the sense that
> it wouldn't allow us to extend the helper
> to other non-sleepable BPF programs.
> 
> The work queue infrastructure processes work
> items from per-cpu structures. As the sock
> destroy work items are executed asynchronously,
> we need to ref count sockets before they are
> added to the work queue. The 'work_pending'
> check prevents duplicate ref counting of sockets
> in case users invoke the destroy helper for a
> socket multiple times. The `{READ,WRITE}_ONCE`
> macros ensure that the socket pointer stored
> in a work queue item isn't clobbered while
> the item is being processed. As BPF programs
> are non-preemptible, we can expect that once
> a socket is ref counted, no other socket can
> sneak in before the ref counted socket is
> added to the work queue for asynchronous destroy.
> Finally, users are expected to retry when the
> helper fails to queue a work item for a socket
> to be destroyed in case there is another destroy
> operation is in progress.
> 
> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
> ---
>  include/linux/bpf.h            |  1 +
>  include/uapi/linux/bpf.h       | 17 +++++++++
>  kernel/bpf/core.c              |  1 +
>  kernel/trace/bpf_trace.c       |  2 +
>  net/core/filter.c              | 70 ++++++++++++++++++++++++++++++++++
>  tools/include/uapi/linux/bpf.h | 17 +++++++++
>  6 files changed, 108 insertions(+)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 3de24cfb7a3d..60eaa05dfab3 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -2676,6 +2676,7 @@ extern const struct bpf_func_proto bpf_get_retval_proto;
>  extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto;
>  extern const struct bpf_func_proto bpf_cgrp_storage_get_proto;
>  extern const struct bpf_func_proto bpf_cgrp_storage_delete_proto;
> +extern const struct bpf_func_proto bpf_sock_destroy_proto;
>  
>  const struct bpf_func_proto *tracing_prog_func_proto(
>    enum bpf_func_id func_id, const struct bpf_prog *prog);
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 464ca3f01fe7..789ac7c59fdf 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -5484,6 +5484,22 @@ union bpf_attr {
>   *		0 on success.
>   *
>   *		**-ENOENT** if the bpf_local_storage cannot be found.
> + *
> + * int bpf_sock_destroy(struct sock *sk)
> + *	Description
> + *		Destroy the given socket with **ECONNABORTED** error code.
> + *
> + *		*sk* must be a non-**NULL** pointer to a socket.
> + *
> + *	Return
> + *		The socket is destroyed asynchronosuly, so 0 return value may
> + *		not suggest indicate that the socket was successfully destroyed.
> + *
> + *		On error, may return **EPROTONOSUPPORT**, **EBUSY**, **EINVAL**.
> + *
> + *		**-EPROTONOSUPPORT** if protocol specific destroy handler is not implemented.
> + *
> + *		**-EBUSY** if another socket destroy operation is in progress.
>   */
>  #define ___BPF_FUNC_MAPPER(FN, ctx...)			\
>  	FN(unspec, 0, ##ctx)				\
> @@ -5698,6 +5714,7 @@ union bpf_attr {
>  	FN(user_ringbuf_drain, 209, ##ctx)		\
>  	FN(cgrp_storage_get, 210, ##ctx)		\
>  	FN(cgrp_storage_delete, 211, ##ctx)		\
> +	FN(sock_destroy, 212, ##ctx)			\
>  	/* */
>  
>  /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 7f98dec6e90f..c59bef9805e5 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -2651,6 +2651,7 @@ const struct bpf_func_proto bpf_snprintf_btf_proto __weak;
>  const struct bpf_func_proto bpf_seq_printf_btf_proto __weak;
>  const struct bpf_func_proto bpf_set_retval_proto __weak;
>  const struct bpf_func_proto bpf_get_retval_proto __weak;
> +const struct bpf_func_proto bpf_sock_destroy_proto __weak;
>  
>  const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
>  {
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index 3bbd3f0c810c..016dbee6b5e4 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -1930,6 +1930,8 @@ tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>  		return &bpf_get_socket_ptr_cookie_proto;
>  	case BPF_FUNC_xdp_get_buff_len:
>  		return &bpf_xdp_get_buff_len_trace_proto;
> +	case BPF_FUNC_sock_destroy:
> +		return &bpf_sock_destroy_proto;
>  #endif
>  	case BPF_FUNC_seq_printf:
>  		return prog->expected_attach_type == BPF_TRACE_ITER ?
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 929358677183..9753606ecc26 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -11569,6 +11569,8 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id)
>  		break;
>  	case BPF_FUNC_ktime_get_coarse_ns:
>  		return &bpf_ktime_get_coarse_ns_proto;
> +	case BPF_FUNC_sock_destroy:
> +		return &bpf_sock_destroy_proto;
>  	default:
This is a really neat feature! One question though; above it seems to
suggest that the helper is only exposed to BPF iterators, but by
adding it to bpf_sk_base_func_proto() won't it be available to
other BPF programs like sock_addr, sk_filter etc? If I've got that
right, we'd definitely want to exclude potentially unprivileged
programs from having access to this helper. And to be clear, I'd
definitely see value in having it accessible to other BPF program
types too like sockops if possible, but just wanted to check. 

>  		return bpf_base_func_proto(func_id);
>  	}
> @@ -11578,3 +11580,71 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id)
>  
>  	return func;
>  }
> +
> +struct sock_destroy_work {
> +	struct sock *sk;
> +	struct work_struct destroy;
> +};
> +
> +static DEFINE_PER_CPU(struct sock_destroy_work, sock_destroy_workqueue);
> +
> +static void bpf_sock_destroy_fn(struct work_struct *work)
> +{
> +	struct sock_destroy_work *sd_work = container_of(work,
> +			struct sock_destroy_work, destroy);
> +	struct sock *sk = READ_ONCE(sd_work->sk);
> +
> +	sk->sk_prot->diag_destroy(sk, ECONNABORTED);
> +	sock_put(sk);
> +}
> +
> +static int __init bpf_sock_destroy_workqueue_init(void)
> +{
> +	int cpu;
> +	struct sock_destroy_work *work;
> +
> +	for_each_possible_cpu(cpu) {
> +		work = per_cpu_ptr(&sock_destroy_workqueue, cpu);
> +		INIT_WORK(&work->destroy, bpf_sock_destroy_fn);
> +	}
> +
> +	return 0;
> +}
> +subsys_initcall(bpf_sock_destroy_workqueue_init);
> +
> +BPF_CALL_1(bpf_sock_destroy, struct sock *, sk)
> +{
> +	struct sock_destroy_work *sd_work;
> +
> +	if (!sk->sk_prot->diag_destroy)

risk of a NULL sk here?

Thanks!

Alan
Martin KaFai Lau Dec. 22, 2022, 5:08 a.m. UTC | #3
On 12/16/22 5:57 PM, Aditi Ghag wrote:
> The socket destroy helper is used to
> forcefully terminate sockets from certain
> BPF contexts. We plan to use the capability
> in Cilium to force client sockets to reconnect
> when their remote load-balancing backends are
> deleted. The other use case is on-the-fly
> policy enforcement where existing socket
> connections prevented by policies need to
> be terminated.
> 
> The helper is currently exposed to iterator
> type BPF programs where users can filter,
> and terminate a set of sockets.
> 
> Sockets are destroyed asynchronously using
> the work queue infrastructure. This allows
> for current the locking semantics within
> socket destroy handlers, as BPF iterators
> invoking the helper acquire *sock* locks.
> This also allows the helper to be invoked
> from non-sleepable contexts.
> The other approach to skip acquiring locks
> by passing an argument to the `diag_destroy`
> handler didn't work out well for UDP, as
> the UDP abort function internally invokes
> another function that ends up acquiring
> *sock* lock.
> While there are sleepable BPF iterators,
> these are limited to only certain map types.

A bpf-iter program can be sleepable or non-sleepable. Both sleepable and non 
sleepable tcp/unix bpf-iter programs have been able to call bpf_setsockopt() 
synchronously. bpf_setsockopt() also requires the sock lock to be held first. 
The situation on calling '.diag_destroy' from bpf-iter should not be much 
different from calling bpf_setsockopt(). From a quick look at tcp_abort and 
udp_abort, I don't see that they might sleep either, but you may want to double 
check. Even if '.diag_destroy' were only usable in sleepable 'bpf-iter' because 
it might sleep, the common bpf map types are already available to sleepable programs.

At the kernel side, the tcp and unix iter acquire the lock_sock() first (eg. in 
bpf_iter_tcp_seq_show()) before calling the bpf-iter prog . At the kernel 
setsockopt code (eg. do_ip_setsockopt()), it uses sockopt_lock_sock() and avoids 
doing the lock if it has already been guaranteed by the bpf running context.

For udp, I don't see how the udp_abort acquires the sock lock differently from 
tcp_abort.  I assume the actual problem seen in udp_abort is related to the 
'->unhash()' part which acquires the udp_table's bucket lock.  This is a problem 
for udp bpf-iter only because the udp bpf-iter did not release the udp_table's 
bucket lock before calling the bpf prog.  The tcp (and unix) bpf-iter releases 
the bucket lock first before calling the bpf prog. This was done explicitly to 
allow acquiring the sock lock before calling the bpf prog, because otherwise 
there would be a lock ordering issue. For this reason, bpf_setsockopt() is 
only available to the tcp and unix bpf-iter but not the udp bpf-iter. The udp-iter 
needs a change similar to the tcp and unix iter 
(https://lore.kernel.org/bpf/20210701200535.1033513-1-kafai@fb.com/): batch, 
release the bucket's lock, lock the sock, and then call the bpf prog.  This will 
allow udp-iter to call bpf_setsockopt() like its tcp and unix counterparts.  That 
will also allow a udp bpf-iter prog to directly do '.diag_destroy'.
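
A simplified, illustrative sketch of that ordering (not the actual kernel code;
matches_filter() and run_bpf_iter_prog() are placeholders):

static void iter_one_udp_bucket(struct udp_hslot *hslot)
{
	struct sock *batch[64], *sk;
	int i, n = 0;

	/* Collect and refcount matching sockets under the bucket lock. */
	spin_lock_bh(&hslot->lock);
	sk_for_each(sk, &hslot->head) {
		if (n < ARRAY_SIZE(batch) && matches_filter(sk)) {
			sock_hold(sk);
			batch[n++] = sk;
		}
	}
	/* Release the bucket lock first ... */
	spin_unlock_bh(&hslot->lock);

	for (i = 0; i < n; i++) {
		sk = batch[i];
		/* ... so the sock lock can be taken without an ordering issue. */
		lock_sock(sk);
		run_bpf_iter_prog(sk);	/* prog may call bpf_setsockopt()/destroy */
		release_sock(sk);
		sock_put(sk);
	}
}
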
Daniel Borkmann Dec. 22, 2022, 10:10 a.m. UTC | #4
On 12/22/22 6:08 AM, Martin KaFai Lau wrote:
> On 12/16/22 5:57 PM, Aditi Ghag wrote:
>> The socket destroy helper is used to
>> forcefully terminate sockets from certain
>> BPF contexts. We plan to use the capability
>> in Cilium to force client sockets to reconnect
>> when their remote load-balancing backends are
>> deleted. The other use case is on-the-fly
>> policy enforcement where existing socket
>> connections prevented by policies need to
>> be terminated.
>>
>> The helper is currently exposed to iterator
>> type BPF programs where users can filter,
>> and terminate a set of sockets.
>>
>> Sockets are destroyed asynchronously using
>> the work queue infrastructure. This allows
>> for current the locking semantics within
>> socket destroy handlers, as BPF iterators
>> invoking the helper acquire *sock* locks.
>> This also allows the helper to be invoked
>> from non-sleepable contexts.
>> The other approach to skip acquiring locks
>> by passing an argument to the `diag_destroy`
>> handler didn't work out well for UDP, as
>> the UDP abort function internally invokes
>> another function that ends up acquiring
>> *sock* lock.
>> While there are sleepable BPF iterators,
>> these are limited to only certain map types.
> 
> bpf-iter program can be sleepable and non sleepable. Both sleepable and non sleepable tcp/unix bpf-iter programs have been able to call bpf_setsockopt() synchronously. bpf_setsockopt() also requires the sock lock to be held first. The situation on calling '.diag_destroy' from bpf-iter should not be much different from calling bpf_setsockopt(). From a quick look at tcp_abort and udp_abort, I don't see they might sleep also and you may want to double check. Even '.diag_destroy' was only usable in sleepable 'bpf-iter' because it might sleep, the common bpf map types are already available to the sleepable programs.
> 
> At the kernel side, the tcp and unix iter acquire the lock_sock() first (eg. in bpf_iter_tcp_seq_show()) before calling the bpf-iter prog . At the kernel setsockopt code (eg. do_ip_setsockopt()), it uses sockopt_lock_sock() and avoids doing the lock if it has already been guaranteed by the bpf running context.
> 
> For udp, I don't see how the udp_abort acquires the sock lock differently from tcp_abort.  I assume the actual problem seen in udp_abort is related to the '->unhash()' part which acquires the udp_table's bucket lock.  This is a problem for udp bpf-iter only because the udp bpf-iter did not release the udp_table's bucket lock before calling the bpf prog.  The tcp (and unix) bpf-iter releases the bucket lock first before calling the bpf prog. This was done explicitly to allow acquiring the sock lock before calling the bpf prog because otherwise it will have a lock ordering issue. Hence, for this reason, bpf_setsockopt() is only available to tcp and unix bpf-iter but not udp bpf-iter. The udp-iter needs to do similar change like the tcp and unix iter (https://lore.kernel.org/bpf/20210701200535.1033513-1-kafai@fb.com/): batch, release the bucket's lock, lock the sock, and then call bpf prog.  This will allow udp-iter to call bpf_setsockopt() like its tcp and unix counterpart.  
> That will also allow udp bpf-iter prog to directly do '.diag_destroy'.

Agree, that sounds like a very reasonable way forward and would also enable bpf_setsockopt()
support for UDP sks. Wrt the workqueue, I don't really like that there's a loophole with the
-EBUSY if the callback didn't trigger yet, while we need to be able to destroy multiple sks
in one iteration when they match the sk cookie previously recorded in the map... so agree,
this needs to happen in a synchronous manner.

Thanks,
Daniel
Aditi Ghag Jan. 2, 2023, 7:30 p.m. UTC | #5
Thanks for the reviews.

@martin: Thanks, I'll look into the missing batching-related logic for UDP. Appreciate
the pointers.

OT: While we can use the has_current_bpf_ctx macro to skip taking sock locks in the BPF
context, how is the assumption around locking enforced in the code (e.g., the setsockopt
helper assumes that the BPF iter prog acquires the relevant locks)? We could potentially
use lockdep_sock_is_held for the sock lock. It's a debug configuration, but I suppose it's
enabled in selftests.
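
(For reference, the pattern in question looks roughly like the following,
paraphrased from sockopt_lock_sock()/sockopt_release_sock() in net/core/sock.c:
a bpf running context is trusted to already hold the sock lock, so taking it is
skipped.)

static void sockopt_lock_sock(struct sock *sk)
{
	/* A bpf prog calling setsockopt has already acquired the sock lock. */
	if (has_current_bpf_ctx())
		return;

	lock_sock(sk);
}

static void sockopt_release_sock(struct sock *sk)
{
	if (has_current_bpf_ctx())
		return;

	release_sock(sk);
}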

> On Dec 21, 2022, at 9:08 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
> 
> On 12/16/22 5:57 PM, Aditi Ghag wrote:
>> The socket destroy helper is used to
>> forcefully terminate sockets from certain
>> BPF contexts. We plan to use the capability
>> in Cilium to force client sockets to reconnect
>> when their remote load-balancing backends are
>> deleted. The other use case is on-the-fly
>> policy enforcement where existing socket
>> connections prevented by policies need to
>> be terminated.
>> The helper is currently exposed to iterator
>> type BPF programs where users can filter,
>> and terminate a set of sockets.
>> Sockets are destroyed asynchronously using
>> the work queue infrastructure. This allows
>> for current the locking semantics within
>> socket destroy handlers, as BPF iterators
>> invoking the helper acquire *sock* locks.
>> This also allows the helper to be invoked
>> from non-sleepable contexts.
>> The other approach to skip acquiring locks
>> by passing an argument to the `diag_destroy`
>> handler didn't work out well for UDP, as
>> the UDP abort function internally invokes
>> another function that ends up acquiring
>> *sock* lock.
>> While there are sleepable BPF iterators,
>> these are limited to only certain map types.
> 
> bpf-iter program can be sleepable and non sleepable. Both sleepable and non sleepable tcp/unix bpf-iter programs have been able to call bpf_setsockopt() synchronously. bpf_setsockopt() also requires the sock lock to be held first. The situation on calling '.diag_destroy' from bpf-iter should not be much different from calling bpf_setsockopt(). From a quick look at tcp_abort and udp_abort, I don't see they might sleep also and you may want to double check. Even '.diag_destroy' was only usable in sleepable 'bpf-iter' because it might sleep, the common bpf map types are already available to the sleepable programs.
> 
> At the kernel side, the tcp and unix iter acquire the lock_sock() first (eg. in bpf_iter_tcp_seq_show()) before calling the bpf-iter prog . At the kernel setsockopt code (eg. do_ip_setsockopt()), it uses sockopt_lock_sock() and avoids doing the lock if it has already been guaranteed by the bpf running context.
> 
> For udp, I don't see how the udp_abort acquires the sock lock differently from tcp_abort.  I assume the actual problem seen in udp_abort is related to the '->unhash()' part which acquires the udp_table's bucket lock.  This is a problem for udp bpf-iter only because the udp bpf-iter did not release the udp_table's bucket lock before calling the bpf prog.  The tcp (and unix) bpf-iter releases the bucket lock first before calling the bpf prog. This was done explicitly to allow acquiring the sock lock before calling the bpf prog because otherwise it will have a lock ordering issue. Hence, for this reason, bpf_setsockopt() is only available to tcp and unix bpf-iter but not udp bpf-iter. The udp-iter needs to do similar change like the tcp and unix iter (https://lore.kernel.org/bpf/20210701200535.1033513-1-kafai@fb.com/): batch, release the bucket's lock, lock the sock, and then call bpf prog.  This will allow udp-iter to call bpf_setsockopt() like its tcp and unix counterpart.  That will also allow udp bpf-iter prog to directly do '.diag_destroy'.
Martin KaFai Lau Jan. 4, 2023, 2:37 a.m. UTC | #6
On 1/2/23 11:08 AM, Aditi Ghag wrote:
> OT: While we can use the has_current_bpf_ctx macro to skip taking sock
> locks in the BPF context, how is the assumption around locking enforced in
> the code (e.g., setsockopt helper assumes that BPF iter prog acquires the
> relevant locks)? 

The lock condition is enforced by the verifier when loading the bpf prog. The 
bpf_setsockopt helper is only available to the bpf hooks that already hold/own 
the sock lock. For example, if a bpf-tc prog tries to call bpf_setsockopt(), 
the verifier will reject loading the prog because the bpf-tc hook does not 
hold/own the sock lock.

> We could potentially use lockdep_sock_is_held for the sock lock. 
> It's a debug configuration, but I suppose it's enabled in selftests.

Not sure I totally get it. Meaning adding a sock_owned_by_me() check somewhere 
in the new bpf_sk_destroy() kfunc for debug purposes? hmm...yeah, why not, but 
not sure what that would look like. I think it is a minor implementation detail 
and will be easier to comment on with code.
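
(A minimal sketch of what such a debug check could look like; bpf_sk_destroy()
below is hypothetical and only shows where the assertion would sit.)

/* Hypothetical kfunc body: sock_owned_by_me() is a lockdep-only assertion
 * that the running context already owns the sock lock.
 */
int bpf_sk_destroy(struct sock *sk)
{
	sock_owned_by_me(sk);

	if (!sk->sk_prot->diag_destroy)
		return -EOPNOTSUPP;

	return sk->sk_prot->diag_destroy(sk, ECONNABORTED);
}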

[ btw, please reply inline instead of top posting. ]

> 
> On Wed, Dec 21, 2022 at 9:08 PM Martin KaFai Lau <martin.lau@linux.dev>
> wrote:
> 
>> On 12/16/22 5:57 PM, Aditi Ghag wrote:
>>> The socket destroy helper is used to
>>> forcefully terminate sockets from certain
>>> BPF contexts. We plan to use the capability
>>> in Cilium to force client sockets to reconnect
>>> when their remote load-balancing backends are
>>> deleted. The other use case is on-the-fly
>>> policy enforcement where existing socket
>>> connections prevented by policies need to
>>> be terminated.
>>>
>>> The helper is currently exposed to iterator
>>> type BPF programs where users can filter,
>>> and terminate a set of sockets.
>>>
>>> Sockets are destroyed asynchronously using
>>> the work queue infrastructure. This allows
>>> for current the locking semantics within
>>> socket destroy handlers, as BPF iterators
>>> invoking the helper acquire *sock* locks.
>>> This also allows the helper to be invoked
>>> from non-sleepable contexts.
>>> The other approach to skip acquiring locks
>>> by passing an argument to the `diag_destroy`
>>> handler didn't work out well for UDP, as
>>> the UDP abort function internally invokes
>>> another function that ends up acquiring
>>> *sock* lock.
>>> While there are sleepable BPF iterators,
>>> these are limited to only certain map types.
>>
>> bpf-iter program can be sleepable and non sleepable. Both sleepable and
>> non
>> sleepable tcp/unix bpf-iter programs have been able to call
>> bpf_setsockopt()
>> synchronously. bpf_setsockopt() also requires the sock lock to be held
>> first.
>> The situation on calling '.diag_destroy' from bpf-iter should not be much
>> different from calling bpf_setsockopt(). From a quick look at tcp_abort
>> and
>> udp_abort, I don't see they might sleep also and you may want to double
>> check.
>> Even '.diag_destroy' was only usable in sleepable 'bpf-iter' because it
>> might
>> sleep, the common bpf map types are already available to the sleepable
>> programs.
>>
>> At the kernel side, the tcp and unix iter acquire the lock_sock() first
>> (eg. in
>> bpf_iter_tcp_seq_show()) before calling the bpf-iter prog . At the kernel
>> setsockopt code (eg. do_ip_setsockopt()), it uses sockopt_lock_sock() and
>> avoids
>> doing the lock if it has already been guaranteed by the bpf running
>> context.
>>
>> For udp, I don't see how the udp_abort acquires the sock lock differently
>> from
>> tcp_abort.  I assume the actual problem seen in udp_abort is related to
>> the
>> '->unhash()' part which acquires the udp_table's bucket lock.  This is a
>> problem
>> for udp bpf-iter only because the udp bpf-iter did not release the
>> udp_table's
>> bucket lock before calling the bpf prog.  The tcp (and unix) bpf-iter
>> releases
>> the bucket lock first before calling the bpf prog. This was done
>> explicitly to
>> allow acquiring the sock lock before calling the bpf prog because
>> otherwise it
>> will have a lock ordering issue. Hence, for this reason, bpf_setsockopt()
>> is
>> only available to tcp and unix bpf-iter but not udp bpf-iter. The udp-iter
>> needs
>> to do similar change like the tcp and unix iter
>> (https://lore.kernel.org/bpf/20210701200535.1033513-1-kafai@fb.com/):
>> batch,
>> release the bucket's lock, lock the sock, and then call bpf prog.  This
>> will
>> allow udp-iter to call bpf_setsockopt() like its tcp and unix
>> counterpart.  That
>> will also allow udp bpf-iter prog to directly do '.diag_destroy'.
>>
>
Aditi Ghag Feb. 23, 2023, 10:02 p.m. UTC | #7
> On Dec 19, 2022, at 10:22 AM, sdf@google.com wrote:
> 
> On 12/17, Aditi Ghag wrote:
>> The socket destroy helper is used to
>> forcefully terminate sockets from certain
>> BPF contexts. We plan to use the capability
>> in Cilium to force client sockets to reconnect
>> when their remote load-balancing backends are
>> deleted. The other use case is on-the-fly
>> policy enforcement where existing socket
>> connections prevented by policies need to
>> be terminated.
> 
>> The helper is currently exposed to iterator
>> type BPF programs where users can filter,
>> and terminate a set of sockets.
> 
>> Sockets are destroyed asynchronously using
>> the work queue infrastructure. This allows
>> for current the locking semantics within
> 
> s/current the/the current/ ?
> 
>> socket destroy handlers, as BPF iterators
>> invoking the helper acquire *sock* locks.
>> This also allows the helper to be invoked
>> from non-sleepable contexts.
>> The other approach to skip acquiring locks
>> by passing an argument to the `diag_destroy`
>> handler didn't work out well for UDP, as
>> the UDP abort function internally invokes
>> another function that ends up acquiring
>> *sock* lock.
>> While there are sleepable BPF iterators,
>> these are limited to only certain map types.
>> Furthermore, it's limiting in the sense that
>> it wouldn't allow us to extend the helper
>> to other non-sleepable BPF programs.
> 
>> The work queue infrastructure processes work
>> items from per-cpu structures. As the sock
>> destroy work items are executed asynchronously,
>> we need to ref count sockets before they are
>> added to the work queue. The 'work_pending'
>> check prevents duplicate ref counting of sockets
>> in case users invoke the destroy helper for a
>> socket multiple times. The `{READ,WRITE}_ONCE`
>> macros ensure that the socket pointer stored
>> in a work queue item isn't clobbered while
>> the item is being processed. As BPF programs
>> are non-preemptible, we can expect that once
>> a socket is ref counted, no other socket can
>> sneak in before the ref counted socket is
>> added to the work queue for asynchronous destroy.
>> Finally, users are expected to retry when the
>> helper fails to queue a work item for a socket
>> to be destroyed in case there is another destroy
>> operation is in progress.
> 
> nit: maybe reformat to fit into 80 characters per line? A bit hard to
> read with this narrow formatting..
> 
> 
>> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
>> ---
>>  include/linux/bpf.h            |  1 +
>>  include/uapi/linux/bpf.h       | 17 +++++++++
>>  kernel/bpf/core.c              |  1 +
>>  kernel/trace/bpf_trace.c       |  2 +
>>  net/core/filter.c              | 70 ++++++++++++++++++++++++++++++++++
>>  tools/include/uapi/linux/bpf.h | 17 +++++++++
>>  6 files changed, 108 insertions(+)
> 
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index 3de24cfb7a3d..60eaa05dfab3 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -2676,6 +2676,7 @@ extern const struct bpf_func_proto bpf_get_retval_proto;
>>  extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto;
>>  extern const struct bpf_func_proto bpf_cgrp_storage_get_proto;
>>  extern const struct bpf_func_proto bpf_cgrp_storage_delete_proto;
>> +extern const struct bpf_func_proto bpf_sock_destroy_proto;
> 
>>  const struct bpf_func_proto *tracing_prog_func_proto(
>>    enum bpf_func_id func_id, const struct bpf_prog *prog);
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 464ca3f01fe7..789ac7c59fdf 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -5484,6 +5484,22 @@ union bpf_attr {
>>   *		0 on success.
>>   *
>>   *		**-ENOENT** if the bpf_local_storage cannot be found.
>> + *
>> + * int bpf_sock_destroy(struct sock *sk)
>> + *	Description
>> + *		Destroy the given socket with **ECONNABORTED** error code.
>> + *
>> + *		*sk* must be a non-**NULL** pointer to a socket.
>> + *
>> + *	Return
>> + *		The socket is destroyed asynchronosuly, so 0 return value may
>> + *		not suggest indicate that the socket was successfully destroyed.
> 
> s/suggest indicate/ with either suggest or indicate?
> 
>> + *
>> + *		On error, may return **EPROTONOSUPPORT**, **EBUSY**, **EINVAL**.
>> + *
>> + *		**-EPROTONOSUPPORT** if protocol specific destroy handler is not implemented.
>> + *
>> + *		**-EBUSY** if another socket destroy operation is in progress.
>>   */
>>  #define ___BPF_FUNC_MAPPER(FN, ctx...)			\
>>  	FN(unspec, 0, ##ctx)				\
>> @@ -5698,6 +5714,7 @@ union bpf_attr {
>>  	FN(user_ringbuf_drain, 209, ##ctx)		\
>>  	FN(cgrp_storage_get, 210, ##ctx)		\
>>  	FN(cgrp_storage_delete, 211, ##ctx)		\
>> +	FN(sock_destroy, 212, ##ctx)			\
>>  	/* */
> 
>>  /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't
>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>> index 7f98dec6e90f..c59bef9805e5 100644
>> --- a/kernel/bpf/core.c
>> +++ b/kernel/bpf/core.c
>> @@ -2651,6 +2651,7 @@ const struct bpf_func_proto bpf_snprintf_btf_proto __weak;
>>  const struct bpf_func_proto bpf_seq_printf_btf_proto __weak;
>>  const struct bpf_func_proto bpf_set_retval_proto __weak;
>>  const struct bpf_func_proto bpf_get_retval_proto __weak;
>> +const struct bpf_func_proto bpf_sock_destroy_proto __weak;
> 
>>  const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
>>  {
>> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
>> index 3bbd3f0c810c..016dbee6b5e4 100644
>> --- a/kernel/trace/bpf_trace.c
>> +++ b/kernel/trace/bpf_trace.c
>> @@ -1930,6 +1930,8 @@ tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>>  		return &bpf_get_socket_ptr_cookie_proto;
>>  	case BPF_FUNC_xdp_get_buff_len:
>>  		return &bpf_xdp_get_buff_len_trace_proto;
>> +	case BPF_FUNC_sock_destroy:
>> +		return &bpf_sock_destroy_proto;
>>  #endif
>>  	case BPF_FUNC_seq_printf:
>>  		return prog->expected_attach_type == BPF_TRACE_ITER ?
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 929358677183..9753606ecc26 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -11569,6 +11569,8 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id)
>>  		break;
>>  	case BPF_FUNC_ktime_get_coarse_ns:
>>  		return &bpf_ktime_get_coarse_ns_proto;
>> +	case BPF_FUNC_sock_destroy:
>> +		return &bpf_sock_destroy_proto;
>>  	default:
>>  		return bpf_base_func_proto(func_id);
>>  	}
>> @@ -11578,3 +11580,71 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id)
> 
>>  	return func;
>>  }
>> +
>> +struct sock_destroy_work {
>> +	struct sock *sk;
>> +	struct work_struct destroy;
>> +};
>> +
>> +static DEFINE_PER_CPU(struct sock_destroy_work, sock_destroy_workqueue);
>> +
>> +static void bpf_sock_destroy_fn(struct work_struct *work)
>> +{
>> +	struct sock_destroy_work *sd_work = container_of(work,
>> +			struct sock_destroy_work, destroy);
>> +	struct sock *sk = READ_ONCE(sd_work->sk);
>> +
>> +	sk->sk_prot->diag_destroy(sk, ECONNABORTED);
>> +	sock_put(sk);
>> +}
>> +
>> +static int __init bpf_sock_destroy_workqueue_init(void)
>> +{
>> +	int cpu;
>> +	struct sock_destroy_work *work;
>> +
>> +	for_each_possible_cpu(cpu) {
>> +		work = per_cpu_ptr(&sock_destroy_workqueue, cpu);
>> +		INIT_WORK(&work->destroy, bpf_sock_destroy_fn);
>> +	}
>> +
>> +	return 0;
>> +}
>> +subsys_initcall(bpf_sock_destroy_workqueue_init);
>> +
>> +BPF_CALL_1(bpf_sock_destroy, struct sock *, sk)
>> +{
>> +	struct sock_destroy_work *sd_work;
>> +
>> +	if (!sk->sk_prot->diag_destroy)
>> +		return -EOPNOTSUPP;
>> +
>> +	sd_work = this_cpu_ptr(&sock_destroy_workqueue);
> 
> [..]
> 
>> +	/* This check prevents duplicate ref counting
>> +	 * of sockets, in case the handler is invoked
>> +	 * multiple times for the same socket.
>> +	 */
> 
> This means this helper can also be called for a single socket during
> invocation; is it an ok compromise?
> 
> I'm also assuming it's still possible that this helper gets called for
> the same socket on different cpus?

Thanks for the review. The v2 patch series refactored this code path significantly, and it executes the destroy handlers synchronously. 

> 
>> +	if (work_pending(&sd_work->destroy))
>> +		return -EBUSY;
>> +
>> +	/* Ref counting ensures that the socket
>> +	 * isn't deleted from underneath us before
>> +	 * the work queue item is processed.
>> +	 */
>> +	if (!refcount_inc_not_zero(&sk->sk_refcnt))
>> +		return -EINVAL;
>> +
>> +	WRITE_ONCE(sd_work->sk, sk);
>> +	if (!queue_work(system_wq, &sd_work->destroy)) {
>> +		sock_put(sk);
>> +		return -EBUSY;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +const struct bpf_func_proto bpf_sock_destroy_proto = {
>> +	.func		= bpf_sock_destroy,
>> +	.ret_type	= RET_INTEGER,
>> +	.arg1_type	= ARG_PTR_TO_BTF_ID_SOCK_COMMON,
>> +};
>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
>> index 464ca3f01fe7..07154a4d92f9 100644
>> --- a/tools/include/uapi/linux/bpf.h
>> +++ b/tools/include/uapi/linux/bpf.h
>> @@ -5484,6 +5484,22 @@ union bpf_attr {
>>   *		0 on success.
>>   *
>>   *		**-ENOENT** if the bpf_local_storage cannot be found.
>> + *
>> + * int bpf_sock_destroy(void *sk)
>> + *	Description
>> + *		Destroy the given socket with **ECONNABORTED** error code.
>> + *
>> + *		*sk* must be a non-**NULL** pointer to a socket.
>> + *
>> + *	Return
>> + *		The socket is destroyed asynchronosuly, so 0 return value may
>> + *		not indicate that the socket was successfully destroyed.
>> + *
>> + *		On error, may return **EPROTONOSUPPORT**, **EBUSY**, **EINVAL**.
>> + *
>> + *		**-EPROTONOSUPPORT** if protocol specific destroy handler is not implemented.
>> + *
>> + *		**-EBUSY** if another socket destroy operation is in progress.
>>   */
>>  #define ___BPF_FUNC_MAPPER(FN, ctx...)			\
>>  	FN(unspec, 0, ##ctx)				\
>> @@ -5698,6 +5714,7 @@ union bpf_attr {
>>  	FN(user_ringbuf_drain, 209, ##ctx)		\
>>  	FN(cgrp_storage_get, 210, ##ctx)		\
>>  	FN(cgrp_storage_delete, 211, ##ctx)		\
>> +	FN(sock_destroy, 212, ##ctx)			\
>>  	/* */
> 
>>  /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't
>> --
>> 2.34.1
Aditi Ghag Feb. 23, 2023, 10:05 p.m. UTC | #8
> On Dec 21, 2022, at 9:08 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
> 
> On 12/16/22 5:57 PM, Aditi Ghag wrote:
>> The socket destroy helper is used to
>> forcefully terminate sockets from certain
>> BPF contexts. We plan to use the capability
>> in Cilium to force client sockets to reconnect
>> when their remote load-balancing backends are
>> deleted. The other use case is on-the-fly
>> policy enforcement where existing socket
>> connections prevented by policies need to
>> be terminated.
>> The helper is currently exposed to iterator
>> type BPF programs where users can filter,
>> and terminate a set of sockets.
>> Sockets are destroyed asynchronously using
>> the work queue infrastructure. This allows
>> for current the locking semantics within
>> socket destroy handlers, as BPF iterators
>> invoking the helper acquire *sock* locks.
>> This also allows the helper to be invoked
>> from non-sleepable contexts.
>> The other approach to skip acquiring locks
>> by passing an argument to the `diag_destroy`
>> handler didn't work out well for UDP, as
>> the UDP abort function internally invokes
>> another function that ends up acquiring
>> *sock* lock.
>> While there are sleepable BPF iterators,
>> these are limited to only certain map types.
> 
> bpf-iter program can be sleepable and non sleepable. Both sleepable and non sleepable tcp/unix bpf-iter programs have been able to call bpf_setsockopt() synchronously. bpf_setsockopt() also requires the sock lock to be held first. The situation on calling '.diag_destroy' from bpf-iter should not be much different from calling bpf_setsockopt(). From a quick look at tcp_abort and udp_abort, I don't see they might sleep also and you may want to double check. Even '.diag_destroy' was only usable in sleepable 'bpf-iter' because it might sleep, the common bpf map types are already available to the sleepable programs.
> 
> At the kernel side, the tcp and unix iter acquire the lock_sock() first (eg. in bpf_iter_tcp_seq_show()) before calling the bpf-iter prog . At the kernel setsockopt code (eg. do_ip_setsockopt()), it uses sockopt_lock_sock() and avoids doing the lock if it has already been guaranteed by the bpf running context.
> 
> For udp, I don't see how the udp_abort acquires the sock lock differently from tcp_abort.  I assume the actual problem seen in udp_abort is related to the '->unhash()' part which acquires the udp_table's bucket lock.  This is a problem for udp bpf-iter only because the udp bpf-iter did not release the udp_table's bucket lock before calling the bpf prog.  The tcp (and unix) bpf-iter releases the bucket lock first before calling the bpf prog. This was done explicitly to allow acquiring the sock lock before calling the bpf prog because otherwise it will have a lock ordering issue. Hence, for this reason, bpf_setsockopt() is only available to tcp and unix bpf-iter but not udp bpf-iter. The udp-iter needs to do similar change like the tcp and unix iter (https://lore.kernel.org/bpf/20210701200535.1033513-1-kafai@fb.com/): batch, release the bucket's lock, lock the sock, and then call bpf prog.  This will allow udp-iter to call bpf_setsockopt() like its tcp and unix counterpart.  That will also allow udp bpf-iter prog to directly do '.diag_destroy'.

Pushed v2 patch series that implements the missing batching support for UDP iterators. Sorry for the delay. 
Thanks for the well-documented batching code in TCP!
Aditi Ghag Feb. 23, 2023, 10:12 p.m. UTC | #9
> On Dec 20, 2022, at 2:26 AM, Alan Maguire <alan.maguire@oracle.com> wrote:
> 
> On 17/12/2022 01:57, Aditi Ghag wrote:
>> The socket destroy helper is used to
>> forcefully terminate sockets from certain
>> BPF contexts. We plan to use the capability
>> in Cilium to force client sockets to reconnect
>> when their remote load-balancing backends are
>> deleted. The other use case is on-the-fly
>> policy enforcement where existing socket
>> connections prevented by policies need to
>> be terminated.
>> 
>> The helper is currently exposed to iterator
>> type BPF programs where users can filter,
>> and terminate a set of sockets.
>> 
>> Sockets are destroyed asynchronously using
>> the work queue infrastructure. This allows
>> for current the locking semantics within
>> socket destroy handlers, as BPF iterators
>> invoking the helper acquire *sock* locks.
>> This also allows the helper to be invoked
>> from non-sleepable contexts.
>> The other approach to skip acquiring locks
>> by passing an argument to the `diag_destroy`
>> handler didn't work out well for UDP, as
>> the UDP abort function internally invokes
>> another function that ends up acquiring
>> *sock* lock.
>> While there are sleepable BPF iterators,
>> these are limited to only certain map types.
>> Furthermore, it's limiting in the sense that
>> it wouldn't allow us to extend the helper
>> to other non-sleepable BPF programs.
>> 
>> The work queue infrastructure processes work
>> items from per-cpu structures. As the sock
>> destroy work items are executed asynchronously,
>> we need to ref count sockets before they are
>> added to the work queue. The 'work_pending'
>> check prevents duplicate ref counting of sockets
>> in case users invoke the destroy helper for a
>> socket multiple times. The `{READ,WRITE}_ONCE`
>> macros ensure that the socket pointer stored
>> in a work queue item isn't clobbered while
>> the item is being processed. As BPF programs
>> are non-preemptible, we can expect that once
>> a socket is ref counted, no other socket can
>> sneak in before the ref counted socket is
>> added to the work queue for asynchronous destroy.
>> Finally, users are expected to retry when the
>> helper fails to queue a work item for a socket
>> to be destroyed in case there is another destroy
>> operation is in progress.
>> 
>> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
>> ---
>> include/linux/bpf.h            |  1 +
>> include/uapi/linux/bpf.h       | 17 +++++++++
>> kernel/bpf/core.c              |  1 +
>> kernel/trace/bpf_trace.c       |  2 +
>> net/core/filter.c              | 70 ++++++++++++++++++++++++++++++++++
>> tools/include/uapi/linux/bpf.h | 17 +++++++++
>> 6 files changed, 108 insertions(+)
>> 
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index 3de24cfb7a3d..60eaa05dfab3 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -2676,6 +2676,7 @@ extern const struct bpf_func_proto bpf_get_retval_proto;
>> extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto;
>> extern const struct bpf_func_proto bpf_cgrp_storage_get_proto;
>> extern const struct bpf_func_proto bpf_cgrp_storage_delete_proto;
>> +extern const struct bpf_func_proto bpf_sock_destroy_proto;
>> 
>> const struct bpf_func_proto *tracing_prog_func_proto(
>>   enum bpf_func_id func_id, const struct bpf_prog *prog);
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 464ca3f01fe7..789ac7c59fdf 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -5484,6 +5484,22 @@ union bpf_attr {
>>  *		0 on success.
>>  *
>>  *		**-ENOENT** if the bpf_local_storage cannot be found.
>> + *
>> + * int bpf_sock_destroy(struct sock *sk)
>> + *	Description
>> + *		Destroy the given socket with **ECONNABORTED** error code.
>> + *
>> + *		*sk* must be a non-**NULL** pointer to a socket.
>> + *
>> + *	Return
>> + *		The socket is destroyed asynchronosuly, so 0 return value may
>> + *		not suggest indicate that the socket was successfully destroyed.
>> + *
>> + *		On error, may return **EPROTONOSUPPORT**, **EBUSY**, **EINVAL**.
>> + *
>> + *		**-EPROTONOSUPPORT** if protocol specific destroy handler is not implemented.
>> + *
>> + *		**-EBUSY** if another socket destroy operation is in progress.
>>  */
>> #define ___BPF_FUNC_MAPPER(FN, ctx...)			\
>> 	FN(unspec, 0, ##ctx)				\
>> @@ -5698,6 +5714,7 @@ union bpf_attr {
>> 	FN(user_ringbuf_drain, 209, ##ctx)		\
>> 	FN(cgrp_storage_get, 210, ##ctx)		\
>> 	FN(cgrp_storage_delete, 211, ##ctx)		\
>> +	FN(sock_destroy, 212, ##ctx)			\
>> 	/* */
>> 
>> /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't
>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>> index 7f98dec6e90f..c59bef9805e5 100644
>> --- a/kernel/bpf/core.c
>> +++ b/kernel/bpf/core.c
>> @@ -2651,6 +2651,7 @@ const struct bpf_func_proto bpf_snprintf_btf_proto __weak;
>> const struct bpf_func_proto bpf_seq_printf_btf_proto __weak;
>> const struct bpf_func_proto bpf_set_retval_proto __weak;
>> const struct bpf_func_proto bpf_get_retval_proto __weak;
>> +const struct bpf_func_proto bpf_sock_destroy_proto __weak;
>> 
>> const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
>> {
>> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
>> index 3bbd3f0c810c..016dbee6b5e4 100644
>> --- a/kernel/trace/bpf_trace.c
>> +++ b/kernel/trace/bpf_trace.c
>> @@ -1930,6 +1930,8 @@ tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>> 		return &bpf_get_socket_ptr_cookie_proto;
>> 	case BPF_FUNC_xdp_get_buff_len:
>> 		return &bpf_xdp_get_buff_len_trace_proto;
>> +	case BPF_FUNC_sock_destroy:
>> +		return &bpf_sock_destroy_proto;
>> #endif
>> 	case BPF_FUNC_seq_printf:
>> 		return prog->expected_attach_type == BPF_TRACE_ITER ?
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 929358677183..9753606ecc26 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -11569,6 +11569,8 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id)
>> 		break;
>> 	case BPF_FUNC_ktime_get_coarse_ns:
>> 		return &bpf_ktime_get_coarse_ns_proto;
>> +	case BPF_FUNC_sock_destroy:
>> +		return &bpf_sock_destroy_proto;
>> 	default:
> This is a really neat feature! One question though; above it seems to
> suggest that the helper is only exposed to BPF iterators, but by

The v2 patch only exposes the capability to BPF iterators. The patch series
description covers the locking semantics that currently restrict the helper
to BPF iterators.

> adding it to bpf_sk_base_func_proto() won't it be available to
> other BPF programs like sock_addr, sk_filter etc? If I've got that
> right, we'd definitely want to exclude potentially unprivileged
> programs from having access to this helper. And to be clear, I'd
> definitely see value in having it accessible to other BPF program
> types too like sockops if possible, but just wanted to check. 
> 
>> 		return bpf_base_func_proto(func_id);
>> 	}
>> @@ -11578,3 +11580,71 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id)
>> 
>> 	return func;
>> }
>> +
>> +struct sock_destroy_work {
>> +	struct sock *sk;
>> +	struct work_struct destroy;
>> +};
>> +
>> +static DEFINE_PER_CPU(struct sock_destroy_work, sock_destroy_workqueue);
>> +
>> +static void bpf_sock_destroy_fn(struct work_struct *work)
>> +{
>> +	struct sock_destroy_work *sd_work = container_of(work,
>> +			struct sock_destroy_work, destroy);
>> +	struct sock *sk = READ_ONCE(sd_work->sk);
>> +
>> +	sk->sk_prot->diag_destroy(sk, ECONNABORTED);
>> +	sock_put(sk);
>> +}
>> +
>> +static int __init bpf_sock_destroy_workqueue_init(void)
>> +{
>> +	int cpu;
>> +	struct sock_destroy_work *work;
>> +
>> +	for_each_possible_cpu(cpu) {
>> +		work = per_cpu_ptr(&sock_destroy_workqueue, cpu);
>> +		INIT_WORK(&work->destroy, bpf_sock_destroy_fn);
>> +	}
>> +
>> +	return 0;
>> +}
>> +subsys_initcall(bpf_sock_destroy_workqueue_init);
>> +
>> +BPF_CALL_1(bpf_sock_destroy, struct sock *, sk)
>> +{
>> +	struct sock_destroy_work *sd_work;
>> +
>> +	if (!sk->sk_prot->diag_destroy)
> 
> risk of a NULL sk here?
> 
> Thanks!
> 
> Alan
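For the exposure question above: one way to keep the helper limited to
iterators would be to gate it on the attach type in tracing_prog_func_proto()
(mirroring how BPF_FUNC_seq_printf is handled in the same switch) and drop the
bpf_sk_base_func_proto() case. A rough fragment, offered as a sketch rather
than code from this series:

	case BPF_FUNC_sock_destroy:
		return prog->expected_attach_type == BPF_TRACE_ITER ?
		       &bpf_sock_destroy_proto : NULL;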
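On the NULL question: a defensive check at the top of bpf_sock_destroy() would
be cheap. Whether the verifier already guarantees a non-NULL *sk* for
ARG_PTR_TO_BTF_ID_SOCK_COMMON arguments is left to the series discussion, so
treat the following as illustrative only:

	/* Sketch: guard before dereferencing sk->sk_prot. */
	if (!sk)
		return -EINVAL;

	if (!sk->sk_prot->diag_destroy)
		return -EOPNOTSUPP;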

Patch

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 3de24cfb7a3d..60eaa05dfab3 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2676,6 +2676,7 @@  extern const struct bpf_func_proto bpf_get_retval_proto;
 extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto;
 extern const struct bpf_func_proto bpf_cgrp_storage_get_proto;
 extern const struct bpf_func_proto bpf_cgrp_storage_delete_proto;
+extern const struct bpf_func_proto bpf_sock_destroy_proto;
 
 const struct bpf_func_proto *tracing_prog_func_proto(
   enum bpf_func_id func_id, const struct bpf_prog *prog);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 464ca3f01fe7..789ac7c59fdf 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5484,6 +5484,22 @@  union bpf_attr {
  *		0 on success.
  *
  *		**-ENOENT** if the bpf_local_storage cannot be found.
+ *
+ * int bpf_sock_destroy(struct sock *sk)
+ *	Description
+ *		Destroy the given socket with **ECONNABORTED** error code.
+ *
+ *		*sk* must be a non-**NULL** pointer to a socket.
+ *
+ *	Return
+ *		The socket is destroyed asynchronously, so a 0 return value may
+ *		not indicate that the socket was successfully destroyed.
+ *
+ *		On error, may return **EPROTONOSUPPORT**, **EBUSY**, **EINVAL**.
+ *
+ *		**-EPROTONOSUPPORT** if the protocol-specific destroy handler is not implemented.
+ *
+ *		**-EBUSY** if another socket destroy operation is in progress.
  */
 #define ___BPF_FUNC_MAPPER(FN, ctx...)			\
 	FN(unspec, 0, ##ctx)				\
@@ -5698,6 +5714,7 @@  union bpf_attr {
 	FN(user_ringbuf_drain, 209, ##ctx)		\
 	FN(cgrp_storage_get, 210, ##ctx)		\
 	FN(cgrp_storage_delete, 211, ##ctx)		\
+	FN(sock_destroy, 212, ##ctx)			\
 	/* */
 
 /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 7f98dec6e90f..c59bef9805e5 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2651,6 +2651,7 @@  const struct bpf_func_proto bpf_snprintf_btf_proto __weak;
 const struct bpf_func_proto bpf_seq_printf_btf_proto __weak;
 const struct bpf_func_proto bpf_set_retval_proto __weak;
 const struct bpf_func_proto bpf_get_retval_proto __weak;
+const struct bpf_func_proto bpf_sock_destroy_proto __weak;
 
 const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
 {
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 3bbd3f0c810c..016dbee6b5e4 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1930,6 +1930,8 @@  tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_get_socket_ptr_cookie_proto;
 	case BPF_FUNC_xdp_get_buff_len:
 		return &bpf_xdp_get_buff_len_trace_proto;
+	case BPF_FUNC_sock_destroy:
+		return &bpf_sock_destroy_proto;
 #endif
 	case BPF_FUNC_seq_printf:
 		return prog->expected_attach_type == BPF_TRACE_ITER ?
diff --git a/net/core/filter.c b/net/core/filter.c
index 929358677183..9753606ecc26 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -11569,6 +11569,8 @@  bpf_sk_base_func_proto(enum bpf_func_id func_id)
 		break;
 	case BPF_FUNC_ktime_get_coarse_ns:
 		return &bpf_ktime_get_coarse_ns_proto;
+	case BPF_FUNC_sock_destroy:
+		return &bpf_sock_destroy_proto;
 	default:
 		return bpf_base_func_proto(func_id);
 	}
@@ -11578,3 +11580,71 @@  bpf_sk_base_func_proto(enum bpf_func_id func_id)
 
 	return func;
 }
+
+struct sock_destroy_work {
+	struct sock *sk;
+	struct work_struct destroy;
+};
+
+static DEFINE_PER_CPU(struct sock_destroy_work, sock_destroy_workqueue);
+
+static void bpf_sock_destroy_fn(struct work_struct *work)
+{
+	struct sock_destroy_work *sd_work = container_of(work,
+			struct sock_destroy_work, destroy);
+	struct sock *sk = READ_ONCE(sd_work->sk);
+
+	sk->sk_prot->diag_destroy(sk, ECONNABORTED);
+	sock_put(sk);
+}
+
+static int __init bpf_sock_destroy_workqueue_init(void)
+{
+	int cpu;
+	struct sock_destroy_work *work;
+
+	for_each_possible_cpu(cpu) {
+		work = per_cpu_ptr(&sock_destroy_workqueue, cpu);
+		INIT_WORK(&work->destroy, bpf_sock_destroy_fn);
+	}
+
+	return 0;
+}
+subsys_initcall(bpf_sock_destroy_workqueue_init);
+
+BPF_CALL_1(bpf_sock_destroy, struct sock *, sk)
+{
+	struct sock_destroy_work *sd_work;
+
+	if (!sk->sk_prot->diag_destroy)
+		return -EOPNOTSUPP;
+
+	sd_work = this_cpu_ptr(&sock_destroy_workqueue);
+	/* This check prevents duplicate ref counting
+	 * of sockets, in case the handler is invoked
+	 * multiple times for the same socket.
+	 */
+	if (work_pending(&sd_work->destroy))
+		return -EBUSY;
+
+	/* Ref counting ensures that the socket
+	 * isn't deleted from underneath us before
+	 * the work queue item is processed.
+	 */
+	if (!refcount_inc_not_zero(&sk->sk_refcnt))
+		return -EINVAL;
+
+	WRITE_ONCE(sd_work->sk, sk);
+	if (!queue_work(system_wq, &sd_work->destroy)) {
+		sock_put(sk);
+		return -EBUSY;
+	}
+
+	return 0;
+}
+
+const struct bpf_func_proto bpf_sock_destroy_proto = {
+	.func		= bpf_sock_destroy,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_BTF_ID_SOCK_COMMON,
+};
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 464ca3f01fe7..07154a4d92f9 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5484,6 +5484,22 @@  union bpf_attr {
  *		0 on success.
  *
  *		**-ENOENT** if the bpf_local_storage cannot be found.
+ *
+ * int bpf_sock_destroy(void *sk)
+ *	Description
+ *		Destroy the given socket with **ECONNABORTED** error code.
+ *
+ *		*sk* must be a non-**NULL** pointer to a socket.
+ *
+ *	Return
+ *		The socket is destroyed asynchronously, so a 0 return value may
+ *		not indicate that the socket was successfully destroyed.
+ *
+ *		On error, may return **EPROTONOSUPPORT**, **EBUSY**, **EINVAL**.
+ *
+ *		**-EPROTONOSUPPORT** if the protocol-specific destroy handler is not implemented.
+ *
+ *		**-EBUSY** if another socket destroy operation is in progress.
  */
 #define ___BPF_FUNC_MAPPER(FN, ctx...)			\
 	FN(unspec, 0, ##ctx)				\
@@ -5698,6 +5714,7 @@  union bpf_attr {
 	FN(user_ringbuf_drain, 209, ##ctx)		\
 	FN(cgrp_storage_get, 210, ##ctx)		\
 	FN(cgrp_storage_delete, 211, ##ctx)		\
+	FN(sock_destroy, 212, ##ctx)			\
 	/* */
 
 /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't