Message ID | c3b935a5a72b1371f9262348616a7fa84061b85f.1671242108.git.aditi.ghag@isovalent.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | BPF |
Headers | show |
Series | bpf-next: Add socket destroy capability | expand |
On 12/17, Aditi Ghag wrote: > The socket destroy helper is used to > forcefully terminate sockets from certain > BPF contexts. We plan to use the capability > in Cilium to force client sockets to reconnect > when their remote load-balancing backends are > deleted. The other use case is on-the-fly > policy enforcement where existing socket > connections prevented by policies need to > be terminated. > The helper is currently exposed to iterator > type BPF programs where users can filter, > and terminate a set of sockets. > Sockets are destroyed asynchronously using > the work queue infrastructure. This allows > for current the locking semantics within s/current the/the current/ ? > socket destroy handlers, as BPF iterators > invoking the helper acquire *sock* locks. > This also allows the helper to be invoked > from non-sleepable contexts. > The other approach to skip acquiring locks > by passing an argument to the `diag_destroy` > handler didn't work out well for UDP, as > the UDP abort function internally invokes > another function that ends up acquiring > *sock* lock. > While there are sleepable BPF iterators, > these are limited to only certain map types. > Furthermore, it's limiting in the sense that > it wouldn't allow us to extend the helper > to other non-sleepable BPF programs. > The work queue infrastructure processes work > items from per-cpu structures. As the sock > destroy work items are executed asynchronously, > we need to ref count sockets before they are > added to the work queue. The 'work_pending' > check prevents duplicate ref counting of sockets > in case users invoke the destroy helper for a > socket multiple times. The `{READ,WRITE}_ONCE` > macros ensure that the socket pointer stored > in a work queue item isn't clobbered while > the item is being processed. As BPF programs > are non-preemptible, we can expect that once > a socket is ref counted, no other socket can > sneak in before the ref counted socket is > added to the work queue for asynchronous destroy. > Finally, users are expected to retry when the > helper fails to queue a work item for a socket > to be destroyed in case there is another destroy > operation is in progress. nit: maybe reformat to fit into 80 characters per line? A bit hard to read with this narrow formatting.. > Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com> > --- > include/linux/bpf.h | 1 + > include/uapi/linux/bpf.h | 17 +++++++++ > kernel/bpf/core.c | 1 + > kernel/trace/bpf_trace.c | 2 + > net/core/filter.c | 70 ++++++++++++++++++++++++++++++++++ > tools/include/uapi/linux/bpf.h | 17 +++++++++ > 6 files changed, 108 insertions(+) > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > index 3de24cfb7a3d..60eaa05dfab3 100644 > --- a/include/linux/bpf.h > +++ b/include/linux/bpf.h > @@ -2676,6 +2676,7 @@ extern const struct bpf_func_proto > bpf_get_retval_proto; > extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto; > extern const struct bpf_func_proto bpf_cgrp_storage_get_proto; > extern const struct bpf_func_proto bpf_cgrp_storage_delete_proto; > +extern const struct bpf_func_proto bpf_sock_destroy_proto; > const struct bpf_func_proto *tracing_prog_func_proto( > enum bpf_func_id func_id, const struct bpf_prog *prog); > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index 464ca3f01fe7..789ac7c59fdf 100644 > --- a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -5484,6 +5484,22 @@ union bpf_attr { > * 0 on success. > * > * **-ENOENT** if the bpf_local_storage cannot be found. > + * > + * int bpf_sock_destroy(struct sock *sk) > + * Description > + * Destroy the given socket with **ECONNABORTED** error code. > + * > + * *sk* must be a non-**NULL** pointer to a socket. > + * > + * Return > + * The socket is destroyed asynchronosuly, so 0 return value may > + * not suggest indicate that the socket was successfully destroyed. s/suggest indicate/ with either suggest or indicate? > + * > + * On error, may return **EPROTONOSUPPORT**, **EBUSY**, **EINVAL**. > + * > + * **-EPROTONOSUPPORT** if protocol specific destroy handler is not > implemented. > + * > + * **-EBUSY** if another socket destroy operation is in progress. > */ > #define ___BPF_FUNC_MAPPER(FN, ctx...) \ > FN(unspec, 0, ##ctx) \ > @@ -5698,6 +5714,7 @@ union bpf_attr { > FN(user_ringbuf_drain, 209, ##ctx) \ > FN(cgrp_storage_get, 210, ##ctx) \ > FN(cgrp_storage_delete, 211, ##ctx) \ > + FN(sock_destroy, 212, ##ctx) \ > /* */ > /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that > don't > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c > index 7f98dec6e90f..c59bef9805e5 100644 > --- a/kernel/bpf/core.c > +++ b/kernel/bpf/core.c > @@ -2651,6 +2651,7 @@ const struct bpf_func_proto bpf_snprintf_btf_proto > __weak; > const struct bpf_func_proto bpf_seq_printf_btf_proto __weak; > const struct bpf_func_proto bpf_set_retval_proto __weak; > const struct bpf_func_proto bpf_get_retval_proto __weak; > +const struct bpf_func_proto bpf_sock_destroy_proto __weak; > const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void) > { > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c > index 3bbd3f0c810c..016dbee6b5e4 100644 > --- a/kernel/trace/bpf_trace.c > +++ b/kernel/trace/bpf_trace.c > @@ -1930,6 +1930,8 @@ tracing_prog_func_proto(enum bpf_func_id func_id, > const struct bpf_prog *prog) > return &bpf_get_socket_ptr_cookie_proto; > case BPF_FUNC_xdp_get_buff_len: > return &bpf_xdp_get_buff_len_trace_proto; > + case BPF_FUNC_sock_destroy: > + return &bpf_sock_destroy_proto; > #endif > case BPF_FUNC_seq_printf: > return prog->expected_attach_type == BPF_TRACE_ITER ? > diff --git a/net/core/filter.c b/net/core/filter.c > index 929358677183..9753606ecc26 100644 > --- a/net/core/filter.c > +++ b/net/core/filter.c > @@ -11569,6 +11569,8 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id) > break; > case BPF_FUNC_ktime_get_coarse_ns: > return &bpf_ktime_get_coarse_ns_proto; > + case BPF_FUNC_sock_destroy: > + return &bpf_sock_destroy_proto; > default: > return bpf_base_func_proto(func_id); > } > @@ -11578,3 +11580,71 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id) > return func; > } > + > +struct sock_destroy_work { > + struct sock *sk; > + struct work_struct destroy; > +}; > + > +static DEFINE_PER_CPU(struct sock_destroy_work, sock_destroy_workqueue); > + > +static void bpf_sock_destroy_fn(struct work_struct *work) > +{ > + struct sock_destroy_work *sd_work = container_of(work, > + struct sock_destroy_work, destroy); > + struct sock *sk = READ_ONCE(sd_work->sk); > + > + sk->sk_prot->diag_destroy(sk, ECONNABORTED); > + sock_put(sk); > +} > + > +static int __init bpf_sock_destroy_workqueue_init(void) > +{ > + int cpu; > + struct sock_destroy_work *work; > + > + for_each_possible_cpu(cpu) { > + work = per_cpu_ptr(&sock_destroy_workqueue, cpu); > + INIT_WORK(&work->destroy, bpf_sock_destroy_fn); > + } > + > + return 0; > +} > +subsys_initcall(bpf_sock_destroy_workqueue_init); > + > +BPF_CALL_1(bpf_sock_destroy, struct sock *, sk) > +{ > + struct sock_destroy_work *sd_work; > + > + if (!sk->sk_prot->diag_destroy) > + return -EOPNOTSUPP; > + > + sd_work = this_cpu_ptr(&sock_destroy_workqueue); [..] > + /* This check prevents duplicate ref counting > + * of sockets, in case the handler is invoked > + * multiple times for the same socket. > + */ This means this helper can also be called for a single socket during invocation; is it an ok compromise? I'm also assuming it's still possible that this helper gets called for the same socket on different cpus? > + if (work_pending(&sd_work->destroy)) > + return -EBUSY; > + > + /* Ref counting ensures that the socket > + * isn't deleted from underneath us before > + * the work queue item is processed. > + */ > + if (!refcount_inc_not_zero(&sk->sk_refcnt)) > + return -EINVAL; > + > + WRITE_ONCE(sd_work->sk, sk); > + if (!queue_work(system_wq, &sd_work->destroy)) { > + sock_put(sk); > + return -EBUSY; > + } > + > + return 0; > +} > + > +const struct bpf_func_proto bpf_sock_destroy_proto = { > + .func = bpf_sock_destroy, > + .ret_type = RET_INTEGER, > + .arg1_type = ARG_PTR_TO_BTF_ID_SOCK_COMMON, > +}; > diff --git a/tools/include/uapi/linux/bpf.h > b/tools/include/uapi/linux/bpf.h > index 464ca3f01fe7..07154a4d92f9 100644 > --- a/tools/include/uapi/linux/bpf.h > +++ b/tools/include/uapi/linux/bpf.h > @@ -5484,6 +5484,22 @@ union bpf_attr { > * 0 on success. > * > * **-ENOENT** if the bpf_local_storage cannot be found. > + * > + * int bpf_sock_destroy(void *sk) > + * Description > + * Destroy the given socket with **ECONNABORTED** error code. > + * > + * *sk* must be a non-**NULL** pointer to a socket. > + * > + * Return > + * The socket is destroyed asynchronosuly, so 0 return value may > + * not indicate that the socket was successfully destroyed. > + * > + * On error, may return **EPROTONOSUPPORT**, **EBUSY**, **EINVAL**. > + * > + * **-EPROTONOSUPPORT** if protocol specific destroy handler is not > implemented. > + * > + * **-EBUSY** if another socket destroy operation is in progress. > */ > #define ___BPF_FUNC_MAPPER(FN, ctx...) \ > FN(unspec, 0, ##ctx) \ > @@ -5698,6 +5714,7 @@ union bpf_attr { > FN(user_ringbuf_drain, 209, ##ctx) \ > FN(cgrp_storage_get, 210, ##ctx) \ > FN(cgrp_storage_delete, 211, ##ctx) \ > + FN(sock_destroy, 212, ##ctx) \ > /* */ > /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that > don't > -- > 2.34.1
On 17/12/2022 01:57, Aditi Ghag wrote: > The socket destroy helper is used to > forcefully terminate sockets from certain > BPF contexts. We plan to use the capability > in Cilium to force client sockets to reconnect > when their remote load-balancing backends are > deleted. The other use case is on-the-fly > policy enforcement where existing socket > connections prevented by policies need to > be terminated. > > The helper is currently exposed to iterator > type BPF programs where users can filter, > and terminate a set of sockets. > > Sockets are destroyed asynchronously using > the work queue infrastructure. This allows > for current the locking semantics within > socket destroy handlers, as BPF iterators > invoking the helper acquire *sock* locks. > This also allows the helper to be invoked > from non-sleepable contexts. > The other approach to skip acquiring locks > by passing an argument to the `diag_destroy` > handler didn't work out well for UDP, as > the UDP abort function internally invokes > another function that ends up acquiring > *sock* lock. > While there are sleepable BPF iterators, > these are limited to only certain map types. > Furthermore, it's limiting in the sense that > it wouldn't allow us to extend the helper > to other non-sleepable BPF programs. > > The work queue infrastructure processes work > items from per-cpu structures. As the sock > destroy work items are executed asynchronously, > we need to ref count sockets before they are > added to the work queue. The 'work_pending' > check prevents duplicate ref counting of sockets > in case users invoke the destroy helper for a > socket multiple times. The `{READ,WRITE}_ONCE` > macros ensure that the socket pointer stored > in a work queue item isn't clobbered while > the item is being processed. As BPF programs > are non-preemptible, we can expect that once > a socket is ref counted, no other socket can > sneak in before the ref counted socket is > added to the work queue for asynchronous destroy. > Finally, users are expected to retry when the > helper fails to queue a work item for a socket > to be destroyed in case there is another destroy > operation is in progress. > > Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com> > --- > include/linux/bpf.h | 1 + > include/uapi/linux/bpf.h | 17 +++++++++ > kernel/bpf/core.c | 1 + > kernel/trace/bpf_trace.c | 2 + > net/core/filter.c | 70 ++++++++++++++++++++++++++++++++++ > tools/include/uapi/linux/bpf.h | 17 +++++++++ > 6 files changed, 108 insertions(+) > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > index 3de24cfb7a3d..60eaa05dfab3 100644 > --- a/include/linux/bpf.h > +++ b/include/linux/bpf.h > @@ -2676,6 +2676,7 @@ extern const struct bpf_func_proto bpf_get_retval_proto; > extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto; > extern const struct bpf_func_proto bpf_cgrp_storage_get_proto; > extern const struct bpf_func_proto bpf_cgrp_storage_delete_proto; > +extern const struct bpf_func_proto bpf_sock_destroy_proto; > > const struct bpf_func_proto *tracing_prog_func_proto( > enum bpf_func_id func_id, const struct bpf_prog *prog); > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index 464ca3f01fe7..789ac7c59fdf 100644 > --- a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -5484,6 +5484,22 @@ union bpf_attr { > * 0 on success. > * > * **-ENOENT** if the bpf_local_storage cannot be found. > + * > + * int bpf_sock_destroy(struct sock *sk) > + * Description > + * Destroy the given socket with **ECONNABORTED** error code. > + * > + * *sk* must be a non-**NULL** pointer to a socket. > + * > + * Return > + * The socket is destroyed asynchronosuly, so 0 return value may > + * not suggest indicate that the socket was successfully destroyed. > + * > + * On error, may return **EPROTONOSUPPORT**, **EBUSY**, **EINVAL**. > + * > + * **-EPROTONOSUPPORT** if protocol specific destroy handler is not implemented. > + * > + * **-EBUSY** if another socket destroy operation is in progress. > */ > #define ___BPF_FUNC_MAPPER(FN, ctx...) \ > FN(unspec, 0, ##ctx) \ > @@ -5698,6 +5714,7 @@ union bpf_attr { > FN(user_ringbuf_drain, 209, ##ctx) \ > FN(cgrp_storage_get, 210, ##ctx) \ > FN(cgrp_storage_delete, 211, ##ctx) \ > + FN(sock_destroy, 212, ##ctx) \ > /* */ > > /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c > index 7f98dec6e90f..c59bef9805e5 100644 > --- a/kernel/bpf/core.c > +++ b/kernel/bpf/core.c > @@ -2651,6 +2651,7 @@ const struct bpf_func_proto bpf_snprintf_btf_proto __weak; > const struct bpf_func_proto bpf_seq_printf_btf_proto __weak; > const struct bpf_func_proto bpf_set_retval_proto __weak; > const struct bpf_func_proto bpf_get_retval_proto __weak; > +const struct bpf_func_proto bpf_sock_destroy_proto __weak; > > const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void) > { > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c > index 3bbd3f0c810c..016dbee6b5e4 100644 > --- a/kernel/trace/bpf_trace.c > +++ b/kernel/trace/bpf_trace.c > @@ -1930,6 +1930,8 @@ tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) > return &bpf_get_socket_ptr_cookie_proto; > case BPF_FUNC_xdp_get_buff_len: > return &bpf_xdp_get_buff_len_trace_proto; > + case BPF_FUNC_sock_destroy: > + return &bpf_sock_destroy_proto; > #endif > case BPF_FUNC_seq_printf: > return prog->expected_attach_type == BPF_TRACE_ITER ? > diff --git a/net/core/filter.c b/net/core/filter.c > index 929358677183..9753606ecc26 100644 > --- a/net/core/filter.c > +++ b/net/core/filter.c > @@ -11569,6 +11569,8 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id) > break; > case BPF_FUNC_ktime_get_coarse_ns: > return &bpf_ktime_get_coarse_ns_proto; > + case BPF_FUNC_sock_destroy: > + return &bpf_sock_destroy_proto; > default: This is a really neat feature! One question though; above it seems to suggest that the helper is only exposed to BPF iterators, but by adding it to bpf_sk_base_func_proto() won't it be available to other BPF programs like sock_addr, sk_filter etc? If I've got that right, we'd definitely want to exclude potentially unprivileged programs from having access to this helper. And to be clear, I'd definitely see value in having it accessible to other BPF program types too like sockops if possible, but just wanted to check. > return bpf_base_func_proto(func_id); > } > @@ -11578,3 +11580,71 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id) > > return func; > } > + > +struct sock_destroy_work { > + struct sock *sk; > + struct work_struct destroy; > +}; > + > +static DEFINE_PER_CPU(struct sock_destroy_work, sock_destroy_workqueue); > + > +static void bpf_sock_destroy_fn(struct work_struct *work) > +{ > + struct sock_destroy_work *sd_work = container_of(work, > + struct sock_destroy_work, destroy); > + struct sock *sk = READ_ONCE(sd_work->sk); > + > + sk->sk_prot->diag_destroy(sk, ECONNABORTED); > + sock_put(sk); > +} > + > +static int __init bpf_sock_destroy_workqueue_init(void) > +{ > + int cpu; > + struct sock_destroy_work *work; > + > + for_each_possible_cpu(cpu) { > + work = per_cpu_ptr(&sock_destroy_workqueue, cpu); > + INIT_WORK(&work->destroy, bpf_sock_destroy_fn); > + } > + > + return 0; > +} > +subsys_initcall(bpf_sock_destroy_workqueue_init); > + > +BPF_CALL_1(bpf_sock_destroy, struct sock *, sk) > +{ > + struct sock_destroy_work *sd_work; > + > + if (!sk->sk_prot->diag_destroy) risk of a NULL sk here? Thanks! Alan
On 12/16/22 5:57 PM, Aditi Ghag wrote: > The socket destroy helper is used to > forcefully terminate sockets from certain > BPF contexts. We plan to use the capability > in Cilium to force client sockets to reconnect > when their remote load-balancing backends are > deleted. The other use case is on-the-fly > policy enforcement where existing socket > connections prevented by policies need to > be terminated. > > The helper is currently exposed to iterator > type BPF programs where users can filter, > and terminate a set of sockets. > > Sockets are destroyed asynchronously using > the work queue infrastructure. This allows > for current the locking semantics within > socket destroy handlers, as BPF iterators > invoking the helper acquire *sock* locks. > This also allows the helper to be invoked > from non-sleepable contexts. > The other approach to skip acquiring locks > by passing an argument to the `diag_destroy` > handler didn't work out well for UDP, as > the UDP abort function internally invokes > another function that ends up acquiring > *sock* lock. > While there are sleepable BPF iterators, > these are limited to only certain map types. bpf-iter program can be sleepable and non sleepable. Both sleepable and non sleepable tcp/unix bpf-iter programs have been able to call bpf_setsockopt() synchronously. bpf_setsockopt() also requires the sock lock to be held first. The situation on calling '.diag_destroy' from bpf-iter should not be much different from calling bpf_setsockopt(). From a quick look at tcp_abort and udp_abort, I don't see they might sleep also and you may want to double check. Even '.diag_destroy' was only usable in sleepable 'bpf-iter' because it might sleep, the common bpf map types are already available to the sleepable programs. At the kernel side, the tcp and unix iter acquire the lock_sock() first (eg. in bpf_iter_tcp_seq_show()) before calling the bpf-iter prog . At the kernel setsockopt code (eg. do_ip_setsockopt()), it uses sockopt_lock_sock() and avoids doing the lock if it has already been guaranteed by the bpf running context. For udp, I don't see how the udp_abort acquires the sock lock differently from tcp_abort. I assume the actual problem seen in udp_abort is related to the '->unhash()' part which acquires the udp_table's bucket lock. This is a problem for udp bpf-iter only because the udp bpf-iter did not release the udp_table's bucket lock before calling the bpf prog. The tcp (and unix) bpf-iter releases the bucket lock first before calling the bpf prog. This was done explicitly to allow acquiring the sock lock before calling the bpf prog because otherwise it will have a lock ordering issue. Hence, for this reason, bpf_setsockopt() is only available to tcp and unix bpf-iter but not udp bpf-iter. The udp-iter needs to do similar change like the tcp and unix iter (https://lore.kernel.org/bpf/20210701200535.1033513-1-kafai@fb.com/): batch, release the bucket's lock, lock the sock, and then call bpf prog. This will allow udp-iter to call bpf_setsockopt() like its tcp and unix counterpart. That will also allow udp bpf-iter prog to directly do '.diag_destroy'.
On 12/22/22 6:08 AM, Martin KaFai Lau wrote: > On 12/16/22 5:57 PM, Aditi Ghag wrote: >> The socket destroy helper is used to >> forcefully terminate sockets from certain >> BPF contexts. We plan to use the capability >> in Cilium to force client sockets to reconnect >> when their remote load-balancing backends are >> deleted. The other use case is on-the-fly >> policy enforcement where existing socket >> connections prevented by policies need to >> be terminated. >> >> The helper is currently exposed to iterator >> type BPF programs where users can filter, >> and terminate a set of sockets. >> >> Sockets are destroyed asynchronously using >> the work queue infrastructure. This allows >> for current the locking semantics within >> socket destroy handlers, as BPF iterators >> invoking the helper acquire *sock* locks. >> This also allows the helper to be invoked >> from non-sleepable contexts. >> The other approach to skip acquiring locks >> by passing an argument to the `diag_destroy` >> handler didn't work out well for UDP, as >> the UDP abort function internally invokes >> another function that ends up acquiring >> *sock* lock. >> While there are sleepable BPF iterators, >> these are limited to only certain map types. > > bpf-iter program can be sleepable and non sleepable. Both sleepable and non sleepable tcp/unix bpf-iter programs have been able to call bpf_setsockopt() synchronously. bpf_setsockopt() also requires the sock lock to be held first. The situation on calling '.diag_destroy' from bpf-iter should not be much different from calling bpf_setsockopt(). From a quick look at tcp_abort and udp_abort, I don't see they might sleep also and you may want to double check. Even '.diag_destroy' was only usable in sleepable 'bpf-iter' because it might sleep, the common bpf map types are already available to the sleepable programs. > > At the kernel side, the tcp and unix iter acquire the lock_sock() first (eg. in bpf_iter_tcp_seq_show()) before calling the bpf-iter prog . At the kernel setsockopt code (eg. do_ip_setsockopt()), it uses sockopt_lock_sock() and avoids doing the lock if it has already been guaranteed by the bpf running context. > > For udp, I don't see how the udp_abort acquires the sock lock differently from tcp_abort. I assume the actual problem seen in udp_abort is related to the '->unhash()' part which acquires the udp_table's bucket lock. This is a problem for udp bpf-iter only because the udp bpf-iter did not release the udp_table's bucket lock before calling the bpf prog. The tcp (and unix) bpf-iter releases the bucket lock first before calling the bpf prog. This was done explicitly to allow acquiring the sock lock before calling the bpf prog because otherwise it will have a lock ordering issue. Hence, for this reason, bpf_setsockopt() is only available to tcp and unix bpf-iter but not udp bpf-iter. The udp-iter needs to do similar change like the tcp and unix iter (https://lore.kernel.org/bpf/20210701200535.1033513-1-kafai@fb.com/): batch, release the bucket's lock, lock the sock, and then call bpf prog. This will allow udp-iter to call bpf_setsockopt() like its tcp and unix counterpart. > That will also allow udp bpf-iter prog to directly do '.diag_destroy'. Agree, that sounds very reasonable way forward and would also enable bpf_setsockopt() support for UDP sks. Wrt the workqueue I don't really like that there's a loophole with the -EBUSY if the callback didn't trigger yet, while we need to be able to destroy multiple sks in one iteration when they match the sk cookie previously recorded in the map.. so agree, this needs to happen in a synchronous manner. Thanks, Daniel
Thanks for the reviews. @martin: Thanks, I'll look into the missing batching related logic for UDP. Appreciate the pointers. OT: While we can use the has_current_bpf_ctx macro to skip taking sock locks in the BPF context, how is the assumption around locking enforced in the code (e.g., setsockopt helper assumes that BPF iter prog acquires the relevant locks)? We could potentially use lockdep_sock_is_held for the sock lock. It's a debug configuration, but I suppose it's enabled in selftests. > On Dec 21, 2022, at 9:08 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote: > > On 12/16/22 5:57 PM, Aditi Ghag wrote: >> The socket destroy helper is used to >> forcefully terminate sockets from certain >> BPF contexts. We plan to use the capability >> in Cilium to force client sockets to reconnect >> when their remote load-balancing backends are >> deleted. The other use case is on-the-fly >> policy enforcement where existing socket >> connections prevented by policies need to >> be terminated. >> The helper is currently exposed to iterator >> type BPF programs where users can filter, >> and terminate a set of sockets. >> Sockets are destroyed asynchronously using >> the work queue infrastructure. This allows >> for current the locking semantics within >> socket destroy handlers, as BPF iterators >> invoking the helper acquire *sock* locks. >> This also allows the helper to be invoked >> from non-sleepable contexts. >> The other approach to skip acquiring locks >> by passing an argument to the `diag_destroy` >> handler didn't work out well for UDP, as >> the UDP abort function internally invokes >> another function that ends up acquiring >> *sock* lock. >> While there are sleepable BPF iterators, >> these are limited to only certain map types. > > bpf-iter program can be sleepable and non sleepable. Both sleepable and non sleepable tcp/unix bpf-iter programs have been able to call bpf_setsockopt() synchronously. bpf_setsockopt() also requires the sock lock to be held first. The situation on calling '.diag_destroy' from bpf-iter should not be much different from calling bpf_setsockopt(). From a quick look at tcp_abort and udp_abort, I don't see they might sleep also and you may want to double check. Even '.diag_destroy' was only usable in sleepable 'bpf-iter' because it might sleep, the common bpf map types are already available to the sleepable programs. > > At the kernel side, the tcp and unix iter acquire the lock_sock() first (eg. in bpf_iter_tcp_seq_show()) before calling the bpf-iter prog . At the kernel setsockopt code (eg. do_ip_setsockopt()), it uses sockopt_lock_sock() and avoids doing the lock if it has already been guaranteed by the bpf running context. > > For udp, I don't see how the udp_abort acquires the sock lock differently from tcp_abort. I assume the actual problem seen in udp_abort is related to the '->unhash()' part which acquires the udp_table's bucket lock. This is a problem for udp bpf-iter only because the udp bpf-iter did not release the udp_table's bucket lock before calling the bpf prog. The tcp (and unix) bpf-iter releases the bucket lock first before calling the bpf prog. This was done explicitly to allow acquiring the sock lock before calling the bpf prog because otherwise it will have a lock ordering issue. Hence, for this reason, bpf_setsockopt() is only available to tcp and unix bpf-iter but not udp bpf-iter. The udp-iter needs to do similar change like the tcp and unix iter (https://lore.kernel.org/bpf/20210701200535.1033513-1-kafai@fb.com/): batch, release the bucket's lock, lock the sock, and then call bpf prog. This will allow udp-iter to call bpf_setsockopt() like its tcp and unix counterpart. That will also allow udp bpf-iter prog to directly do '.diag_destroy'.
On 1/2/23 11:08 AM, Aditi Ghag wrote: > OT: While we can use the has_current_bpf_ctx macro to skip taking sock > locks in the BPF context, how is the assumption around locking enforced in > the code (e.g., setsockopt helper assumes that BPF iter prog acquires the > relevant locks)? The lock condition is enforced by the verifier when loading bpf prog. The bpf_setsockopt helper is only available to the bpf hooks that have already held/owned the sock lock. For example, if a bpf-tc prog tries to call bpf_setsockopt(), the verifier will reject loading the prog because the bpf-tc hook does not hold/own the sock lock. > We could potentially use lockdep_sock_is_held for the sock lock. > It's a debug configuration, but I suppose it's enabled in selftests. Not sure I totally get it. Meaning adding a sock_owned_by_me() check somewhere in the new bpf_sk_destroy() kfunc for debug purpose? hmm...yeah, why not but not sure how that may look like. I think it is some minor implementation details and will be easier to comment with code. [ btw, please reply inline instead of top posting. ] > > On Wed, Dec 21, 2022 at 9:08 PM Martin KaFai Lau <martin.lau@linux.dev> > wrote: > >> On 12/16/22 5:57 PM, Aditi Ghag wrote: >>> The socket destroy helper is used to >>> forcefully terminate sockets from certain >>> BPF contexts. We plan to use the capability >>> in Cilium to force client sockets to reconnect >>> when their remote load-balancing backends are >>> deleted. The other use case is on-the-fly >>> policy enforcement where existing socket >>> connections prevented by policies need to >>> be terminated. >>> >>> The helper is currently exposed to iterator >>> type BPF programs where users can filter, >>> and terminate a set of sockets. >>> >>> Sockets are destroyed asynchronously using >>> the work queue infrastructure. This allows >>> for current the locking semantics within >>> socket destroy handlers, as BPF iterators >>> invoking the helper acquire *sock* locks. >>> This also allows the helper to be invoked >>> from non-sleepable contexts. >>> The other approach to skip acquiring locks >>> by passing an argument to the `diag_destroy` >>> handler didn't work out well for UDP, as >>> the UDP abort function internally invokes >>> another function that ends up acquiring >>> *sock* lock. >>> While there are sleepable BPF iterators, >>> these are limited to only certain map types. >> >> bpf-iter program can be sleepable and non sleepable. Both sleepable and >> non >> sleepable tcp/unix bpf-iter programs have been able to call >> bpf_setsockopt() >> synchronously. bpf_setsockopt() also requires the sock lock to be held >> first. >> The situation on calling '.diag_destroy' from bpf-iter should not be much >> different from calling bpf_setsockopt(). From a quick look at tcp_abort >> and >> udp_abort, I don't see they might sleep also and you may want to double >> check. >> Even '.diag_destroy' was only usable in sleepable 'bpf-iter' because it >> might >> sleep, the common bpf map types are already available to the sleepable >> programs. >> >> At the kernel side, the tcp and unix iter acquire the lock_sock() first >> (eg. in >> bpf_iter_tcp_seq_show()) before calling the bpf-iter prog . At the kernel >> setsockopt code (eg. do_ip_setsockopt()), it uses sockopt_lock_sock() and >> avoids >> doing the lock if it has already been guaranteed by the bpf running >> context. >> >> For udp, I don't see how the udp_abort acquires the sock lock differently >> from >> tcp_abort. I assume the actual problem seen in udp_abort is related to >> the >> '->unhash()' part which acquires the udp_table's bucket lock. This is a >> problem >> for udp bpf-iter only because the udp bpf-iter did not release the >> udp_table's >> bucket lock before calling the bpf prog. The tcp (and unix) bpf-iter >> releases >> the bucket lock first before calling the bpf prog. This was done >> explicitly to >> allow acquiring the sock lock before calling the bpf prog because >> otherwise it >> will have a lock ordering issue. Hence, for this reason, bpf_setsockopt() >> is >> only available to tcp and unix bpf-iter but not udp bpf-iter. The udp-iter >> needs >> to do similar change like the tcp and unix iter >> (https://lore.kernel.org/bpf/20210701200535.1033513-1-kafai@fb.com/): >> batch, >> release the bucket's lock, lock the sock, and then call bpf prog. This >> will >> allow udp-iter to call bpf_setsockopt() like its tcp and unix >> counterpart. That >> will also allow udp bpf-iter prog to directly do '.diag_destroy'. >> >
> On Dec 19, 2022, at 10:22 AM, sdf@google.com wrote: > > On 12/17, Aditi Ghag wrote: >> The socket destroy helper is used to >> forcefully terminate sockets from certain >> BPF contexts. We plan to use the capability >> in Cilium to force client sockets to reconnect >> when their remote load-balancing backends are >> deleted. The other use case is on-the-fly >> policy enforcement where existing socket >> connections prevented by policies need to >> be terminated. > >> The helper is currently exposed to iterator >> type BPF programs where users can filter, >> and terminate a set of sockets. > >> Sockets are destroyed asynchronously using >> the work queue infrastructure. This allows >> for current the locking semantics within > > s/current the/the current/ ? > >> socket destroy handlers, as BPF iterators >> invoking the helper acquire *sock* locks. >> This also allows the helper to be invoked >> from non-sleepable contexts. >> The other approach to skip acquiring locks >> by passing an argument to the `diag_destroy` >> handler didn't work out well for UDP, as >> the UDP abort function internally invokes >> another function that ends up acquiring >> *sock* lock. >> While there are sleepable BPF iterators, >> these are limited to only certain map types. >> Furthermore, it's limiting in the sense that >> it wouldn't allow us to extend the helper >> to other non-sleepable BPF programs. > >> The work queue infrastructure processes work >> items from per-cpu structures. As the sock >> destroy work items are executed asynchronously, >> we need to ref count sockets before they are >> added to the work queue. The 'work_pending' >> check prevents duplicate ref counting of sockets >> in case users invoke the destroy helper for a >> socket multiple times. The `{READ,WRITE}_ONCE` >> macros ensure that the socket pointer stored >> in a work queue item isn't clobbered while >> the item is being processed. As BPF programs >> are non-preemptible, we can expect that once >> a socket is ref counted, no other socket can >> sneak in before the ref counted socket is >> added to the work queue for asynchronous destroy. >> Finally, users are expected to retry when the >> helper fails to queue a work item for a socket >> to be destroyed in case there is another destroy >> operation is in progress. > > nit: maybe reformat to fit into 80 characters per line? A bit hard to > read with this narrow formatting.. > > >> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com> >> --- >> include/linux/bpf.h | 1 + >> include/uapi/linux/bpf.h | 17 +++++++++ >> kernel/bpf/core.c | 1 + >> kernel/trace/bpf_trace.c | 2 + >> net/core/filter.c | 70 ++++++++++++++++++++++++++++++++++ >> tools/include/uapi/linux/bpf.h | 17 +++++++++ >> 6 files changed, 108 insertions(+) > >> diff --git a/include/linux/bpf.h b/include/linux/bpf.h >> index 3de24cfb7a3d..60eaa05dfab3 100644 >> --- a/include/linux/bpf.h >> +++ b/include/linux/bpf.h >> @@ -2676,6 +2676,7 @@ extern const struct bpf_func_proto bpf_get_retval_proto; >> extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto; >> extern const struct bpf_func_proto bpf_cgrp_storage_get_proto; >> extern const struct bpf_func_proto bpf_cgrp_storage_delete_proto; >> +extern const struct bpf_func_proto bpf_sock_destroy_proto; > >> const struct bpf_func_proto *tracing_prog_func_proto( >> enum bpf_func_id func_id, const struct bpf_prog *prog); >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h >> index 464ca3f01fe7..789ac7c59fdf 100644 >> --- a/include/uapi/linux/bpf.h >> +++ b/include/uapi/linux/bpf.h >> @@ -5484,6 +5484,22 @@ union bpf_attr { >> * 0 on success. >> * >> * **-ENOENT** if the bpf_local_storage cannot be found. >> + * >> + * int bpf_sock_destroy(struct sock *sk) >> + * Description >> + * Destroy the given socket with **ECONNABORTED** error code. >> + * >> + * *sk* must be a non-**NULL** pointer to a socket. >> + * >> + * Return >> + * The socket is destroyed asynchronosuly, so 0 return value may >> + * not suggest indicate that the socket was successfully destroyed. > > s/suggest indicate/ with either suggest or indicate? > >> + * >> + * On error, may return **EPROTONOSUPPORT**, **EBUSY**, **EINVAL**. >> + * >> + * **-EPROTONOSUPPORT** if protocol specific destroy handler is not implemented. >> + * >> + * **-EBUSY** if another socket destroy operation is in progress. >> */ >> #define ___BPF_FUNC_MAPPER(FN, ctx...) \ >> FN(unspec, 0, ##ctx) \ >> @@ -5698,6 +5714,7 @@ union bpf_attr { >> FN(user_ringbuf_drain, 209, ##ctx) \ >> FN(cgrp_storage_get, 210, ##ctx) \ >> FN(cgrp_storage_delete, 211, ##ctx) \ >> + FN(sock_destroy, 212, ##ctx) \ >> /* */ > >> /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't >> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c >> index 7f98dec6e90f..c59bef9805e5 100644 >> --- a/kernel/bpf/core.c >> +++ b/kernel/bpf/core.c >> @@ -2651,6 +2651,7 @@ const struct bpf_func_proto bpf_snprintf_btf_proto __weak; >> const struct bpf_func_proto bpf_seq_printf_btf_proto __weak; >> const struct bpf_func_proto bpf_set_retval_proto __weak; >> const struct bpf_func_proto bpf_get_retval_proto __weak; >> +const struct bpf_func_proto bpf_sock_destroy_proto __weak; > >> const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void) >> { >> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c >> index 3bbd3f0c810c..016dbee6b5e4 100644 >> --- a/kernel/trace/bpf_trace.c >> +++ b/kernel/trace/bpf_trace.c >> @@ -1930,6 +1930,8 @@ tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) >> return &bpf_get_socket_ptr_cookie_proto; >> case BPF_FUNC_xdp_get_buff_len: >> return &bpf_xdp_get_buff_len_trace_proto; >> + case BPF_FUNC_sock_destroy: >> + return &bpf_sock_destroy_proto; >> #endif >> case BPF_FUNC_seq_printf: >> return prog->expected_attach_type == BPF_TRACE_ITER ? >> diff --git a/net/core/filter.c b/net/core/filter.c >> index 929358677183..9753606ecc26 100644 >> --- a/net/core/filter.c >> +++ b/net/core/filter.c >> @@ -11569,6 +11569,8 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id) >> break; >> case BPF_FUNC_ktime_get_coarse_ns: >> return &bpf_ktime_get_coarse_ns_proto; >> + case BPF_FUNC_sock_destroy: >> + return &bpf_sock_destroy_proto; >> default: >> return bpf_base_func_proto(func_id); >> } >> @@ -11578,3 +11580,71 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id) > >> return func; >> } >> + >> +struct sock_destroy_work { >> + struct sock *sk; >> + struct work_struct destroy; >> +}; >> + >> +static DEFINE_PER_CPU(struct sock_destroy_work, sock_destroy_workqueue); >> + >> +static void bpf_sock_destroy_fn(struct work_struct *work) >> +{ >> + struct sock_destroy_work *sd_work = container_of(work, >> + struct sock_destroy_work, destroy); >> + struct sock *sk = READ_ONCE(sd_work->sk); >> + >> + sk->sk_prot->diag_destroy(sk, ECONNABORTED); >> + sock_put(sk); >> +} >> + >> +static int __init bpf_sock_destroy_workqueue_init(void) >> +{ >> + int cpu; >> + struct sock_destroy_work *work; >> + >> + for_each_possible_cpu(cpu) { >> + work = per_cpu_ptr(&sock_destroy_workqueue, cpu); >> + INIT_WORK(&work->destroy, bpf_sock_destroy_fn); >> + } >> + >> + return 0; >> +} >> +subsys_initcall(bpf_sock_destroy_workqueue_init); >> + >> +BPF_CALL_1(bpf_sock_destroy, struct sock *, sk) >> +{ >> + struct sock_destroy_work *sd_work; >> + >> + if (!sk->sk_prot->diag_destroy) >> + return -EOPNOTSUPP; >> + >> + sd_work = this_cpu_ptr(&sock_destroy_workqueue); > > [..] > >> + /* This check prevents duplicate ref counting >> + * of sockets, in case the handler is invoked >> + * multiple times for the same socket. >> + */ > > This means this helper can also be called for a single socket during > invocation; is it an ok compromise? > > I'm also assuming it's still possible that this helper gets called for > the same socket on different cpus? Thanks for the review. The v2 patch series refactored this code path significantly, and it executes the destroy handlers synchronously. > >> + if (work_pending(&sd_work->destroy)) >> + return -EBUSY; >> + >> + /* Ref counting ensures that the socket >> + * isn't deleted from underneath us before >> + * the work queue item is processed. >> + */ >> + if (!refcount_inc_not_zero(&sk->sk_refcnt)) >> + return -EINVAL; >> + >> + WRITE_ONCE(sd_work->sk, sk); >> + if (!queue_work(system_wq, &sd_work->destroy)) { >> + sock_put(sk); >> + return -EBUSY; >> + } >> + >> + return 0; >> +} >> + >> +const struct bpf_func_proto bpf_sock_destroy_proto = { >> + .func = bpf_sock_destroy, >> + .ret_type = RET_INTEGER, >> + .arg1_type = ARG_PTR_TO_BTF_ID_SOCK_COMMON, >> +}; >> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h >> index 464ca3f01fe7..07154a4d92f9 100644 >> --- a/tools/include/uapi/linux/bpf.h >> +++ b/tools/include/uapi/linux/bpf.h >> @@ -5484,6 +5484,22 @@ union bpf_attr { >> * 0 on success. >> * >> * **-ENOENT** if the bpf_local_storage cannot be found. >> + * >> + * int bpf_sock_destroy(void *sk) >> + * Description >> + * Destroy the given socket with **ECONNABORTED** error code. >> + * >> + * *sk* must be a non-**NULL** pointer to a socket. >> + * >> + * Return >> + * The socket is destroyed asynchronosuly, so 0 return value may >> + * not indicate that the socket was successfully destroyed. >> + * >> + * On error, may return **EPROTONOSUPPORT**, **EBUSY**, **EINVAL**. >> + * >> + * **-EPROTONOSUPPORT** if protocol specific destroy handler is not implemented. >> + * >> + * **-EBUSY** if another socket destroy operation is in progress. >> */ >> #define ___BPF_FUNC_MAPPER(FN, ctx...) \ >> FN(unspec, 0, ##ctx) \ >> @@ -5698,6 +5714,7 @@ union bpf_attr { >> FN(user_ringbuf_drain, 209, ##ctx) \ >> FN(cgrp_storage_get, 210, ##ctx) \ >> FN(cgrp_storage_delete, 211, ##ctx) \ >> + FN(sock_destroy, 212, ##ctx) \ >> /* */ > >> /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't >> -- >> 2.34.1
> On Dec 21, 2022, at 9:08 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote: > > On 12/16/22 5:57 PM, Aditi Ghag wrote: >> The socket destroy helper is used to >> forcefully terminate sockets from certain >> BPF contexts. We plan to use the capability >> in Cilium to force client sockets to reconnect >> when their remote load-balancing backends are >> deleted. The other use case is on-the-fly >> policy enforcement where existing socket >> connections prevented by policies need to >> be terminated. >> The helper is currently exposed to iterator >> type BPF programs where users can filter, >> and terminate a set of sockets. >> Sockets are destroyed asynchronously using >> the work queue infrastructure. This allows >> for current the locking semantics within >> socket destroy handlers, as BPF iterators >> invoking the helper acquire *sock* locks. >> This also allows the helper to be invoked >> from non-sleepable contexts. >> The other approach to skip acquiring locks >> by passing an argument to the `diag_destroy` >> handler didn't work out well for UDP, as >> the UDP abort function internally invokes >> another function that ends up acquiring >> *sock* lock. >> While there are sleepable BPF iterators, >> these are limited to only certain map types. > > bpf-iter program can be sleepable and non sleepable. Both sleepable and non sleepable tcp/unix bpf-iter programs have been able to call bpf_setsockopt() synchronously. bpf_setsockopt() also requires the sock lock to be held first. The situation on calling '.diag_destroy' from bpf-iter should not be much different from calling bpf_setsockopt(). From a quick look at tcp_abort and udp_abort, I don't see they might sleep also and you may want to double check. Even '.diag_destroy' was only usable in sleepable 'bpf-iter' because it might sleep, the common bpf map types are already available to the sleepable programs. > > At the kernel side, the tcp and unix iter acquire the lock_sock() first (eg. in bpf_iter_tcp_seq_show()) before calling the bpf-iter prog . At the kernel setsockopt code (eg. do_ip_setsockopt()), it uses sockopt_lock_sock() and avoids doing the lock if it has already been guaranteed by the bpf running context. > > For udp, I don't see how the udp_abort acquires the sock lock differently from tcp_abort. I assume the actual problem seen in udp_abort is related to the '->unhash()' part which acquires the udp_table's bucket lock. This is a problem for udp bpf-iter only because the udp bpf-iter did not release the udp_table's bucket lock before calling the bpf prog. The tcp (and unix) bpf-iter releases the bucket lock first before calling the bpf prog. This was done explicitly to allow acquiring the sock lock before calling the bpf prog because otherwise it will have a lock ordering issue. Hence, for this reason, bpf_setsockopt() is only available to tcp and unix bpf-iter but not udp bpf-iter. The udp-iter needs to do similar change like the tcp and unix iter (https://lore.kernel.org/bpf/20210701200535.1033513-1-kafai@fb.com/): batch, release the bucket's lock, lock the sock, and then call bpf prog. This will allow udp-iter to call bpf_setsockopt() like its tcp and unix counterpart. That will also allow udp bpf-iter prog to directly do '.diag_destroy'. Pushed v2 patch series that implements the missing batching support for UDP iterators. Sorry for the delay. Thanks for the well-documented batching code in TCP!
> On Dec 20, 2022, at 2:26 AM, Alan Maguire <alan.maguire@oracle.com> wrote: > > On 17/12/2022 01:57, Aditi Ghag wrote: >> The socket destroy helper is used to >> forcefully terminate sockets from certain >> BPF contexts. We plan to use the capability >> in Cilium to force client sockets to reconnect >> when their remote load-balancing backends are >> deleted. The other use case is on-the-fly >> policy enforcement where existing socket >> connections prevented by policies need to >> be terminated. >> >> The helper is currently exposed to iterator >> type BPF programs where users can filter, >> and terminate a set of sockets. >> >> Sockets are destroyed asynchronously using >> the work queue infrastructure. This allows >> for current the locking semantics within >> socket destroy handlers, as BPF iterators >> invoking the helper acquire *sock* locks. >> This also allows the helper to be invoked >> from non-sleepable contexts. >> The other approach to skip acquiring locks >> by passing an argument to the `diag_destroy` >> handler didn't work out well for UDP, as >> the UDP abort function internally invokes >> another function that ends up acquiring >> *sock* lock. >> While there are sleepable BPF iterators, >> these are limited to only certain map types. >> Furthermore, it's limiting in the sense that >> it wouldn't allow us to extend the helper >> to other non-sleepable BPF programs. >> >> The work queue infrastructure processes work >> items from per-cpu structures. As the sock >> destroy work items are executed asynchronously, >> we need to ref count sockets before they are >> added to the work queue. The 'work_pending' >> check prevents duplicate ref counting of sockets >> in case users invoke the destroy helper for a >> socket multiple times. The `{READ,WRITE}_ONCE` >> macros ensure that the socket pointer stored >> in a work queue item isn't clobbered while >> the item is being processed. As BPF programs >> are non-preemptible, we can expect that once >> a socket is ref counted, no other socket can >> sneak in before the ref counted socket is >> added to the work queue for asynchronous destroy. >> Finally, users are expected to retry when the >> helper fails to queue a work item for a socket >> to be destroyed in case there is another destroy >> operation is in progress. >> >> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com> >> --- >> include/linux/bpf.h | 1 + >> include/uapi/linux/bpf.h | 17 +++++++++ >> kernel/bpf/core.c | 1 + >> kernel/trace/bpf_trace.c | 2 + >> net/core/filter.c | 70 ++++++++++++++++++++++++++++++++++ >> tools/include/uapi/linux/bpf.h | 17 +++++++++ >> 6 files changed, 108 insertions(+) >> >> diff --git a/include/linux/bpf.h b/include/linux/bpf.h >> index 3de24cfb7a3d..60eaa05dfab3 100644 >> --- a/include/linux/bpf.h >> +++ b/include/linux/bpf.h >> @@ -2676,6 +2676,7 @@ extern const struct bpf_func_proto bpf_get_retval_proto; >> extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto; >> extern const struct bpf_func_proto bpf_cgrp_storage_get_proto; >> extern const struct bpf_func_proto bpf_cgrp_storage_delete_proto; >> +extern const struct bpf_func_proto bpf_sock_destroy_proto; >> >> const struct bpf_func_proto *tracing_prog_func_proto( >> enum bpf_func_id func_id, const struct bpf_prog *prog); >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h >> index 464ca3f01fe7..789ac7c59fdf 100644 >> --- a/include/uapi/linux/bpf.h >> +++ b/include/uapi/linux/bpf.h >> @@ -5484,6 +5484,22 @@ union bpf_attr { >> * 0 on success. >> * >> * **-ENOENT** if the bpf_local_storage cannot be found. >> + * >> + * int bpf_sock_destroy(struct sock *sk) >> + * Description >> + * Destroy the given socket with **ECONNABORTED** error code. >> + * >> + * *sk* must be a non-**NULL** pointer to a socket. >> + * >> + * Return >> + * The socket is destroyed asynchronosuly, so 0 return value may >> + * not suggest indicate that the socket was successfully destroyed. >> + * >> + * On error, may return **EPROTONOSUPPORT**, **EBUSY**, **EINVAL**. >> + * >> + * **-EPROTONOSUPPORT** if protocol specific destroy handler is not implemented. >> + * >> + * **-EBUSY** if another socket destroy operation is in progress. >> */ >> #define ___BPF_FUNC_MAPPER(FN, ctx...) \ >> FN(unspec, 0, ##ctx) \ >> @@ -5698,6 +5714,7 @@ union bpf_attr { >> FN(user_ringbuf_drain, 209, ##ctx) \ >> FN(cgrp_storage_get, 210, ##ctx) \ >> FN(cgrp_storage_delete, 211, ##ctx) \ >> + FN(sock_destroy, 212, ##ctx) \ >> /* */ >> >> /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't >> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c >> index 7f98dec6e90f..c59bef9805e5 100644 >> --- a/kernel/bpf/core.c >> +++ b/kernel/bpf/core.c >> @@ -2651,6 +2651,7 @@ const struct bpf_func_proto bpf_snprintf_btf_proto __weak; >> const struct bpf_func_proto bpf_seq_printf_btf_proto __weak; >> const struct bpf_func_proto bpf_set_retval_proto __weak; >> const struct bpf_func_proto bpf_get_retval_proto __weak; >> +const struct bpf_func_proto bpf_sock_destroy_proto __weak; >> >> const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void) >> { >> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c >> index 3bbd3f0c810c..016dbee6b5e4 100644 >> --- a/kernel/trace/bpf_trace.c >> +++ b/kernel/trace/bpf_trace.c >> @@ -1930,6 +1930,8 @@ tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) >> return &bpf_get_socket_ptr_cookie_proto; >> case BPF_FUNC_xdp_get_buff_len: >> return &bpf_xdp_get_buff_len_trace_proto; >> + case BPF_FUNC_sock_destroy: >> + return &bpf_sock_destroy_proto; >> #endif >> case BPF_FUNC_seq_printf: >> return prog->expected_attach_type == BPF_TRACE_ITER ? >> diff --git a/net/core/filter.c b/net/core/filter.c >> index 929358677183..9753606ecc26 100644 >> --- a/net/core/filter.c >> +++ b/net/core/filter.c >> @@ -11569,6 +11569,8 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id) >> break; >> case BPF_FUNC_ktime_get_coarse_ns: >> return &bpf_ktime_get_coarse_ns_proto; >> + case BPF_FUNC_sock_destroy: >> + return &bpf_sock_destroy_proto; >> default: > This is a really neat feature! One question though; above it seems to > suggest that the helper is only exposed to BPF iterators, but by The v2 patch only exposes the capability to BPF iterators. The patch series has details involving locking semantics due to which the kfunc is only available for BPF iterators at the moment. > adding it to bpf_sk_base_func_proto() won't it be available to > other BPF programs like sock_addr, sk_filter etc? If I've got that > right, we'd definitely want to exclude potentially unprivileged > programs from having access to this helper. And to be clear, I'd > definitely see value in having it accessible to other BPF program > types too like sockops if possible, but just wanted to check. > >> return bpf_base_func_proto(func_id); >> } >> @@ -11578,3 +11580,71 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id) >> >> return func; >> } >> + >> +struct sock_destroy_work { >> + struct sock *sk; >> + struct work_struct destroy; >> +}; >> + >> +static DEFINE_PER_CPU(struct sock_destroy_work, sock_destroy_workqueue); >> + >> +static void bpf_sock_destroy_fn(struct work_struct *work) >> +{ >> + struct sock_destroy_work *sd_work = container_of(work, >> + struct sock_destroy_work, destroy); >> + struct sock *sk = READ_ONCE(sd_work->sk); >> + >> + sk->sk_prot->diag_destroy(sk, ECONNABORTED); >> + sock_put(sk); >> +} >> + >> +static int __init bpf_sock_destroy_workqueue_init(void) >> +{ >> + int cpu; >> + struct sock_destroy_work *work; >> + >> + for_each_possible_cpu(cpu) { >> + work = per_cpu_ptr(&sock_destroy_workqueue, cpu); >> + INIT_WORK(&work->destroy, bpf_sock_destroy_fn); >> + } >> + >> + return 0; >> +} >> +subsys_initcall(bpf_sock_destroy_workqueue_init); >> + >> +BPF_CALL_1(bpf_sock_destroy, struct sock *, sk) >> +{ >> + struct sock_destroy_work *sd_work; >> + >> + if (!sk->sk_prot->diag_destroy) > > risk of a NULL sk here? > > Thanks! > > Alan
diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 3de24cfb7a3d..60eaa05dfab3 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -2676,6 +2676,7 @@ extern const struct bpf_func_proto bpf_get_retval_proto; extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto; extern const struct bpf_func_proto bpf_cgrp_storage_get_proto; extern const struct bpf_func_proto bpf_cgrp_storage_delete_proto; +extern const struct bpf_func_proto bpf_sock_destroy_proto; const struct bpf_func_proto *tracing_prog_func_proto( enum bpf_func_id func_id, const struct bpf_prog *prog); diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 464ca3f01fe7..789ac7c59fdf 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -5484,6 +5484,22 @@ union bpf_attr { * 0 on success. * * **-ENOENT** if the bpf_local_storage cannot be found. + * + * int bpf_sock_destroy(struct sock *sk) + * Description + * Destroy the given socket with **ECONNABORTED** error code. + * + * *sk* must be a non-**NULL** pointer to a socket. + * + * Return + * The socket is destroyed asynchronosuly, so 0 return value may + * not suggest indicate that the socket was successfully destroyed. + * + * On error, may return **EPROTONOSUPPORT**, **EBUSY**, **EINVAL**. + * + * **-EPROTONOSUPPORT** if protocol specific destroy handler is not implemented. + * + * **-EBUSY** if another socket destroy operation is in progress. */ #define ___BPF_FUNC_MAPPER(FN, ctx...) \ FN(unspec, 0, ##ctx) \ @@ -5698,6 +5714,7 @@ union bpf_attr { FN(user_ringbuf_drain, 209, ##ctx) \ FN(cgrp_storage_get, 210, ##ctx) \ FN(cgrp_storage_delete, 211, ##ctx) \ + FN(sock_destroy, 212, ##ctx) \ /* */ /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index 7f98dec6e90f..c59bef9805e5 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -2651,6 +2651,7 @@ const struct bpf_func_proto bpf_snprintf_btf_proto __weak; const struct bpf_func_proto bpf_seq_printf_btf_proto __weak; const struct bpf_func_proto bpf_set_retval_proto __weak; const struct bpf_func_proto bpf_get_retval_proto __weak; +const struct bpf_func_proto bpf_sock_destroy_proto __weak; const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void) { diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 3bbd3f0c810c..016dbee6b5e4 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -1930,6 +1930,8 @@ tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_get_socket_ptr_cookie_proto; case BPF_FUNC_xdp_get_buff_len: return &bpf_xdp_get_buff_len_trace_proto; + case BPF_FUNC_sock_destroy: + return &bpf_sock_destroy_proto; #endif case BPF_FUNC_seq_printf: return prog->expected_attach_type == BPF_TRACE_ITER ? diff --git a/net/core/filter.c b/net/core/filter.c index 929358677183..9753606ecc26 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -11569,6 +11569,8 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id) break; case BPF_FUNC_ktime_get_coarse_ns: return &bpf_ktime_get_coarse_ns_proto; + case BPF_FUNC_sock_destroy: + return &bpf_sock_destroy_proto; default: return bpf_base_func_proto(func_id); } @@ -11578,3 +11580,71 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id) return func; } + +struct sock_destroy_work { + struct sock *sk; + struct work_struct destroy; +}; + +static DEFINE_PER_CPU(struct sock_destroy_work, sock_destroy_workqueue); + +static void bpf_sock_destroy_fn(struct work_struct *work) +{ + struct sock_destroy_work *sd_work = container_of(work, + struct sock_destroy_work, destroy); + struct sock *sk = READ_ONCE(sd_work->sk); + + sk->sk_prot->diag_destroy(sk, ECONNABORTED); + sock_put(sk); +} + +static int __init bpf_sock_destroy_workqueue_init(void) +{ + int cpu; + struct sock_destroy_work *work; + + for_each_possible_cpu(cpu) { + work = per_cpu_ptr(&sock_destroy_workqueue, cpu); + INIT_WORK(&work->destroy, bpf_sock_destroy_fn); + } + + return 0; +} +subsys_initcall(bpf_sock_destroy_workqueue_init); + +BPF_CALL_1(bpf_sock_destroy, struct sock *, sk) +{ + struct sock_destroy_work *sd_work; + + if (!sk->sk_prot->diag_destroy) + return -EOPNOTSUPP; + + sd_work = this_cpu_ptr(&sock_destroy_workqueue); + /* This check prevents duplicate ref counting + * of sockets, in case the handler is invoked + * multiple times for the same socket. + */ + if (work_pending(&sd_work->destroy)) + return -EBUSY; + + /* Ref counting ensures that the socket + * isn't deleted from underneath us before + * the work queue item is processed. + */ + if (!refcount_inc_not_zero(&sk->sk_refcnt)) + return -EINVAL; + + WRITE_ONCE(sd_work->sk, sk); + if (!queue_work(system_wq, &sd_work->destroy)) { + sock_put(sk); + return -EBUSY; + } + + return 0; +} + +const struct bpf_func_proto bpf_sock_destroy_proto = { + .func = bpf_sock_destroy, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_BTF_ID_SOCK_COMMON, +}; diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 464ca3f01fe7..07154a4d92f9 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -5484,6 +5484,22 @@ union bpf_attr { * 0 on success. * * **-ENOENT** if the bpf_local_storage cannot be found. + * + * int bpf_sock_destroy(void *sk) + * Description + * Destroy the given socket with **ECONNABORTED** error code. + * + * *sk* must be a non-**NULL** pointer to a socket. + * + * Return + * The socket is destroyed asynchronosuly, so 0 return value may + * not indicate that the socket was successfully destroyed. + * + * On error, may return **EPROTONOSUPPORT**, **EBUSY**, **EINVAL**. + * + * **-EPROTONOSUPPORT** if protocol specific destroy handler is not implemented. + * + * **-EBUSY** if another socket destroy operation is in progress. */ #define ___BPF_FUNC_MAPPER(FN, ctx...) \ FN(unspec, 0, ##ctx) \ @@ -5698,6 +5714,7 @@ union bpf_attr { FN(user_ringbuf_drain, 209, ##ctx) \ FN(cgrp_storage_get, 210, ##ctx) \ FN(cgrp_storage_delete, 211, ##ctx) \ + FN(sock_destroy, 212, ##ctx) \ /* */ /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't
The socket destroy helper is used to forcefully terminate sockets from certain BPF contexts. We plan to use the capability in Cilium to force client sockets to reconnect when their remote load-balancing backends are deleted. The other use case is on-the-fly policy enforcement where existing socket connections prevented by policies need to be terminated. The helper is currently exposed to iterator type BPF programs where users can filter, and terminate a set of sockets. Sockets are destroyed asynchronously using the work queue infrastructure. This allows for current the locking semantics within socket destroy handlers, as BPF iterators invoking the helper acquire *sock* locks. This also allows the helper to be invoked from non-sleepable contexts. The other approach to skip acquiring locks by passing an argument to the `diag_destroy` handler didn't work out well for UDP, as the UDP abort function internally invokes another function that ends up acquiring *sock* lock. While there are sleepable BPF iterators, these are limited to only certain map types. Furthermore, it's limiting in the sense that it wouldn't allow us to extend the helper to other non-sleepable BPF programs. The work queue infrastructure processes work items from per-cpu structures. As the sock destroy work items are executed asynchronously, we need to ref count sockets before they are added to the work queue. The 'work_pending' check prevents duplicate ref counting of sockets in case users invoke the destroy helper for a socket multiple times. The `{READ,WRITE}_ONCE` macros ensure that the socket pointer stored in a work queue item isn't clobbered while the item is being processed. As BPF programs are non-preemptible, we can expect that once a socket is ref counted, no other socket can sneak in before the ref counted socket is added to the work queue for asynchronous destroy. Finally, users are expected to retry when the helper fails to queue a work item for a socket to be destroyed in case there is another destroy operation is in progress. Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com> --- include/linux/bpf.h | 1 + include/uapi/linux/bpf.h | 17 +++++++++ kernel/bpf/core.c | 1 + kernel/trace/bpf_trace.c | 2 + net/core/filter.c | 70 ++++++++++++++++++++++++++++++++++ tools/include/uapi/linux/bpf.h | 17 +++++++++ 6 files changed, 108 insertions(+)