Message ID | 20210108180333.180906-2-sdf@google.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | BPF |
Series | bpf: misc performance improvements for cgroup hooks |
On Fri, Jan 8, 2021 at 7:03 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> Add custom implementation of getsockopt hook for TCP_ZEROCOPY_RECEIVE.
> We skip generic hooks for TCP_ZEROCOPY_RECEIVE and have a custom
> call in do_tcp_getsockopt using the on-stack data. This removes
> 3% overhead for locking/unlocking the socket.
>
> Without this patch:
>      3.38%     0.07%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt
>             |
>              --3.30%--__cgroup_bpf_run_filter_getsockopt
>                        |
>                         --0.81%--__kmalloc
>
> With the patch applied:
>      0.52%     0.12%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt_kern
>

OK but we are adding yet another indirect call.

Can you add a patch on top of it adding INDIRECT_CALL_INET() avoidance ?

Thanks.
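[For context: the kernel's indirect-call wrappers in include/linux/indirect_call_wrapper.h sidestep the retpoline penalty by comparing a function pointer against its one expected target and, on a match, making a direct call. Below is a minimal userspace model of that pattern, not the kernel's exact definition; the SOL_TCP and TCP_ZEROCOPY_RECEIVE values are hardcoded purely for illustration.]

#include <stdbool.h>
#include <stdio.h>

/* Userspace model of INDIRECT_CALL_1(): if the function pointer matches
 * the one expected target, make a cheap direct call; otherwise fall back
 * to the (retpoline-protected) indirect call.
 */
#define INDIRECT_CALL_1(f, f1, ...) \
	((f) == (f1) ? f1(__VA_ARGS__) : (f)(__VA_ARGS__))

static bool tcp_bpf_bypass_getsockopt(int level, int optname)
{
	return level == 6 /* SOL_TCP */ && optname == 35 /* TCP_ZEROCOPY_RECEIVE */;
}

int main(void)
{
	bool (*cb)(int, int) = tcp_bpf_bypass_getsockopt;

	/* The pointer matches the expected target, so this takes the
	 * direct-call branch of the macro.
	 */
	printf("bypass: %d\n", INDIRECT_CALL_1(cb, tcp_bpf_bypass_getsockopt, 6, 35));
	return 0;
}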
On Fri, Jan 8, 2021 at 10:10 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Jan 8, 2021 at 7:03 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > Add custom implementation of getsockopt hook for TCP_ZEROCOPY_RECEIVE.
> > [...]
>
> OK but we are adding yet another indirect call.
>
> Can you add a patch on top of it adding INDIRECT_CALL_INET() avoidance ?
Sure, but do you think it will bring any benefit?

We don't have any indirect avoidance in __sys_getsockopt for the
sock->ops->getsockopt() call.
If we add it for this new bpf_bypass_getsockopt, we might as well add
it for sock->ops->getsockopt?

And we need some new INDIRECT_CALL_INET2 such that f2 doesn't get
disabled when ipv6 is disabled :-/
On Fri, Jan 8, 2021 at 7:26 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On Fri, Jan 8, 2021 at 10:10 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > [...]
> > Can you add a patch on top of it adding INDIRECT_CALL_INET() avoidance ?
> Sure, but do you think it will bring any benefit?

Sure, avoiding an indirect call might be the same gain as the
lock_sock() avoidance :)

> We don't have any indirect avoidance in __sys_getsockopt for the
> sock->ops->getsockopt() call.
> If we add it for this new bpf_bypass_getsockopt, we might as well add
> it for sock->ops->getsockopt?

Well, that is orthogonal to this patch.
As you may know, Google kernels do have a mitigation there already and
Brian may upstream it.

> And we need some new INDIRECT_CALL_INET2 such that f2 doesn't get
> disabled when ipv6 is disabled :-/

The same handler is called for IPv4 and IPv6, so you need the variant
with only one known handler (tcp_bpf_bypass_getsockopt).

Only it needs to make sure CONFIG_INET is enabled.
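[To unpack this point: unlike many proto callbacks, tcp_prot (IPv4) and tcpv6_prot (IPv6) both install the same tcp_bpf_bypass_getsockopt, so a wrapper with a single expected target covers both families and only needs a plain-call fallback when CONFIG_INET is off. A sketch of such a wrapper, building on the existing INDIRECT_CALL_1() helper; the name is a placeholder, since naming is still being discussed below.]

/* Sketch only: single-known-target indirect call wrapper, gated on
 * CONFIG_INET. The name INDIRECT_CALL_INET_1 is illustrative, not final.
 */
#if IS_ENABLED(CONFIG_INET)
#define INDIRECT_CALL_INET_1(f, f1, ...) INDIRECT_CALL_1(f, f1, __VA_ARGS__)
#else
#define INDIRECT_CALL_INET_1(f, f1, ...) f(__VA_ARGS__)
#endif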
On Fri, Jan 8, 2021 at 10:41 AM Eric Dumazet <edumazet@google.com> wrote:
>
> [...]
> > We don't have any indirect avoidance in __sys_getsockopt for the
> > sock->ops->getsockopt() call.
> > If we add it for this new bpf_bypass_getsockopt, we might as well add
> > it for sock->ops->getsockopt?
>
> Well, that is orthogonal to this patch.
> As you may know, Google kernels do have a mitigation there already and
> Brian may upstream it.
I guess my point here was that if I send it out only for bpf_bypass_getsockopt
it might look a bit strange because the rest of the getsockopt path still
suffers the indirect costs. If Brian has plans to upstream the rest, maybe
it's better to upstream everything together with some numbers?
CC'ing him for his opinion.

I'm happy to follow up in whatever form is best. I can also resend
with INDIRECT_CALL_INET2 if there are no objections to including
this version from the start.

> > And we need some new INDIRECT_CALL_INET2 such that f2 doesn't get
> > disabled when ipv6 is disabled :-/
>
> The same handler is called for IPv4 and IPv6, so you need the variant
> with only one known handler (tcp_bpf_bypass_getsockopt).
>
> Only it needs to make sure CONFIG_INET is enabled.
On Fri, Jan 8, 2021 at 8:08 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> [...]
> I guess my point here was that if I send it out only for bpf_bypass_getsockopt
> it might look a bit strange because the rest of the getsockopt path still
> suffers the indirect costs.

Each new indirect call adds a cost. If you focus on optimizing
TCP_ZEROCOPY_RECEIVE, it is counter-intuitive to add an expensive
indirect call.

> If Brian has plans to upstream the rest, maybe
> it's better to upstream everything together with some numbers?
> CC'ing him for his opinion.

I am just saying your point about the other indirect call is already
taken care of.

> I'm happy to follow up in whatever form is best. I can also resend
> with INDIRECT_CALL_INET2 if there are no objections to including
> this version from the start.

INDIRECT_CALL_INET2 seems a strange name to me.
On Fri, Jan 8, 2021 at 11:23 AM Eric Dumazet <edumazet@google.com> wrote:
>
> [...]
> Each new indirect call adds a cost. If you focus on optimizing
> TCP_ZEROCOPY_RECEIVE, it is counter-intuitive to add an expensive
> indirect call.
Ok, then let me resend with a mitigation in place and a note
that the rest will be added later.

> [...]
> INDIRECT_CALL_INET2 seems a strange name to me.
Any suggestion for a better name? I did play with the following:

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index cbba9c9ab073..f7342a30284c 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -371,7 +371,9 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key,
 	int __ret = retval;						       \
 	if (cgroup_bpf_enabled(BPF_CGROUP_GETSOCKOPT))			       \
 		if (!(sock)->sk_prot->bpf_bypass_getsockopt ||		       \
-		    !(sock)->sk_prot->bpf_bypass_getsockopt(level, optname))   \
+		    !INDIRECT_CALL_INET1((sock)->sk_prot->bpf_bypass_getsockopt, \
+					 tcp_bpf_bypass_getsockopt,	       \
+					 level, optname))		       \
 			__ret = __cgroup_bpf_run_filter_getsockopt(	       \
 				sock, level, optname, optval, optlen,	       \
 				max_optlen, retval);			       \
diff --git a/include/linux/indirect_call_wrapper.h b/include/linux/indirect_call_wrapper.h
index 54c02c84906a..9c3252f7e9bb 100644
--- a/include/linux/indirect_call_wrapper.h
+++ b/include/linux/indirect_call_wrapper.h
@@ -54,10 +54,13 @@
 #if IS_BUILTIN(CONFIG_IPV6)
 #define INDIRECT_CALL_INET(f, f2, f1, ...) \
 	INDIRECT_CALL_2(f, f2, f1, __VA_ARGS__)
+#define INDIRECT_CALL_INET1(f, f1, ...) INDIRECT_CALL_1(f, f1, __VA_ARGS__)
 #elif IS_ENABLED(CONFIG_INET)
 #define INDIRECT_CALL_INET(f, f2, f1, ...) INDIRECT_CALL_1(f, f1, __VA_ARGS__)
+#define INDIRECT_CALL_INET1(f, f1, ...) INDIRECT_CALL_1(f, f1, __VA_ARGS__)
 #else
 #define INDIRECT_CALL_INET(f, f2, f1, ...) f(__VA_ARGS__)
+#define INDIRECT_CALL_INET1(f, f1, ...) f(__VA_ARGS__)
 #endif

 #endif
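[What this buys, concretely: with retpolines enabled, INDIRECT_CALL_1() turns the guarded call into a pointer compare plus a direct call. Roughly, and as an illustration rather than literal preprocessor output, the bpf-cgroup.h hunk above makes the check behave like:]

if (cgroup_bpf_enabled(BPF_CGROUP_GETSOCKOPT))
	if (!(sock)->sk_prot->bpf_bypass_getsockopt ||
	    !((sock)->sk_prot->bpf_bypass_getsockopt == tcp_bpf_bypass_getsockopt ?
	      tcp_bpf_bypass_getsockopt(level, optname) :			/* direct call */
	      (sock)->sk_prot->bpf_bypass_getsockopt(level, optname)))		/* retpoline */
		__ret = __cgroup_bpf_run_filter_getsockopt(sock, level, optname,
							   optval, optlen,
							   max_optlen, retval);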
On Fri, Jan 8, 2021 at 8:27 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> [...]
> > INDIRECT_CALL_INET2 seems a strange name to me.
> Any suggestion for a better name? I did play with the following:
>
> [...]
> diff --git a/include/linux/indirect_call_wrapper.h b/include/linux/indirect_call_wrapper.h
> index 54c02c84906a..9c3252f7e9bb 100644
> --- a/include/linux/indirect_call_wrapper.h
> +++ b/include/linux/indirect_call_wrapper.h
> @@ -54,10 +54,13 @@
>  #if IS_BUILTIN(CONFIG_IPV6)
>  #define INDIRECT_CALL_INET(f, f2, f1, ...) \
>  	INDIRECT_CALL_2(f, f2, f1, __VA_ARGS__)
> +#define INDIRECT_CALL_INET1(f, f1, ...) INDIRECT_CALL_1(f, f1, __VA_ARGS__)
> #elif IS_ENABLED(CONFIG_INET)
>  #define INDIRECT_CALL_INET(f, f2, f1, ...) INDIRECT_CALL_1(f, f1, __VA_ARGS__)
> +#define INDIRECT_CALL_INET1(f, f1, ...) INDIRECT_CALL_1(f, f1, __VA_ARGS__)
>  #else
>  #define INDIRECT_CALL_INET(f, f2, f1, ...) f(__VA_ARGS__)
> +#define INDIRECT_CALL_INET1(f, f1, ...) f(__VA_ARGS__)
>  #endif

Yes, or maybe something only focusing on CONFIG_INET to make it clear.

diff --git a/include/linux/indirect_call_wrapper.h b/include/linux/indirect_call_wrapper.h
index 54c02c84906ab2548a93bacb46f7795a8e136d83..d082aa4bd3ecae52e5998b3ac05deffafcb45de0 100644
--- a/include/linux/indirect_call_wrapper.h
+++ b/include/linux/indirect_call_wrapper.h
@@ -46,6 +46,12 @@
 #define INDIRECT_CALLABLE_SCOPE static
 #endif

+#elif IS_ENABLED(CONFIG_INET)
+#define INDIRECT_CALL_INET1(f, f2, f1, ...) INDIRECT_CALL_1(f, f1, __VA_ARGS__)
+#else
+#define INDIRECT_CALL_INET1(f, f2, f1, ...) f(__VA_ARGS__)
+#endif
+
 /*
  * We can use INDIRECT_CALL_$NR for ipv6 related functions only if ipv6 is
  * builtin, this macro simplify dealing with indirect calls with only ipv4/ipv6
diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 72e69a0e1e8c..6adaee018c63 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -147,6 +147,10 @@ int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level,
 				       int __user *optlen, int max_optlen,
 				       int retval);

+int __cgroup_bpf_run_filter_getsockopt_kern(struct sock *sk, int level,
+					    int optname, void *optval,
+					    int *optlen, int retval);
+
 static inline enum bpf_cgroup_storage_type cgroup_storage_type(
 	struct bpf_map *map)
 {
@@ -364,10 +368,21 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key,
 ({									       \
 	int __ret = retval;						       \
 	if (cgroup_bpf_enabled)						       \
-		__ret = __cgroup_bpf_run_filter_getsockopt(sock, level,       \
-							   optname, optval,   \
-							   optlen, max_optlen, \
-							   retval);	       \
+		if (!(sock)->sk_prot->bpf_bypass_getsockopt ||		       \
+		    !(sock)->sk_prot->bpf_bypass_getsockopt(level, optname))  \
+			__ret = __cgroup_bpf_run_filter_getsockopt(	       \
+				sock, level, optname, optval, optlen,	       \
+				max_optlen, retval);			       \
+	__ret;								       \
+})
+
+#define BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sock, level, optname, optval,     \
+					    optlen, retval)		       \
+({									       \
+	int __ret = retval;						       \
+	if (cgroup_bpf_enabled)						       \
+		__ret = __cgroup_bpf_run_filter_getsockopt_kern(	       \
+			sock, level, optname, optval, optlen, retval);	       \
 	__ret;								       \
 })

@@ -452,6 +467,8 @@ static inline int bpf_percpu_cgroup_storage_update(struct bpf_map *map,
 #define BPF_CGROUP_GETSOCKOPT_MAX_OPTLEN(optlen) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock, level, optname, optval, \
 				       optlen, max_optlen, retval) ({ retval; })
+#define BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sock, level, optname, optval, \
+					    optlen, retval) ({ retval; })
 #define BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock, level, optname, optval, optlen, \
 				       kernel_optval) ({ 0; })

diff --git a/include/net/sock.h b/include/net/sock.h
index bdc4323ce53c..ebf44d724845 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1174,6 +1174,8 @@ struct proto {

 	int			(*backlog_rcv) (struct sock *sk,
 						struct sk_buff *skb);
+	bool			(*bpf_bypass_getsockopt)(int level,
+							 int optname);

 	void		(*release_cb)(struct sock *sk);

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 78d13c88720f..4bb42fb19711 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -403,6 +403,7 @@ __poll_t tcp_poll(struct file *file, struct socket *sock,
 		      struct poll_table_struct *wait);
 int tcp_getsockopt(struct sock *sk, int level, int optname,
 		   char __user *optval, int __user *optlen);
+bool tcp_bpf_bypass_getsockopt(int level, int optname);
 int tcp_setsockopt(struct sock *sk, int level, int optname, sockptr_t optval,
 		   unsigned int optlen);
 void tcp_set_keepalive(struct sock *sk, int val);
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 6ec088a96302..c41bb2f34013 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -1485,6 +1485,44 @@ int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level,
 	sockopt_free_buf(&ctx);
 	return ret;
 }
+
+int __cgroup_bpf_run_filter_getsockopt_kern(struct sock *sk, int level,
+					    int optname, void *optval,
+					    int *optlen, int retval)
+{
+	struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
+	struct bpf_sockopt_kern ctx = {
+		.sk = sk,
+		.level = level,
+		.optname = optname,
+		.retval = retval,
+		.optlen = *optlen,
+		.optval = optval,
+		.optval_end = optval + *optlen,
+	};
+	int ret;
+
+	ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[BPF_CGROUP_GETSOCKOPT],
+				 &ctx, BPF_PROG_RUN);
+	if (!ret)
+		return -EPERM;
+
+	if (ctx.optlen > *optlen)
+		return -EFAULT;
+
+	/* BPF programs only allowed to set retval to 0, not some
+	 * arbitrary value.
+	 */
+	if (ctx.retval != 0 && ctx.retval != retval)
+		return -EFAULT;
+
+	/* BPF programs can shrink the buffer, export the modifications.
+	 */
+	if (ctx.optlen != 0)
+		*optlen = ctx.optlen;
+
+	return ctx.retval;
+}
 #endif

 static ssize_t sysctl_cpy_dir(const struct ctl_dir *dir, char **bufp,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ed42d2193c5c..ef3c895b66c1 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4098,6 +4098,8 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 			return -EFAULT;
 		lock_sock(sk);
 		err = tcp_zerocopy_receive(sk, &zc);
+		err = BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sk, level, optname,
+							  &zc, &len, err);
 		release_sock(sk);
 		if (len >= offsetofend(struct tcp_zerocopy_receive, err))
 			goto zerocopy_rcv_sk_err;
@@ -4132,6 +4134,18 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 	return 0;
 }

+bool tcp_bpf_bypass_getsockopt(int level, int optname)
+{
+	/* TCP do_tcp_getsockopt has optimized getsockopt implementation
+	 * to avoid extra socket lock for TCP_ZEROCOPY_RECEIVE.
+	 */
+	if (level == SOL_TCP && optname == TCP_ZEROCOPY_RECEIVE)
+		return true;
+
+	return false;
+}
+EXPORT_SYMBOL(tcp_bpf_bypass_getsockopt);
+
 int tcp_getsockopt(struct sock *sk, int level, int optname,
 		   char __user *optval, int __user *optlen)
 {
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 58207c7769d0..8b4906980fce 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2792,6 +2792,7 @@ struct proto tcp_prot = {
 	.shutdown		= tcp_shutdown,
 	.setsockopt		= tcp_setsockopt,
 	.getsockopt		= tcp_getsockopt,
+	.bpf_bypass_getsockopt	= tcp_bpf_bypass_getsockopt,
 	.keepalive		= tcp_set_keepalive,
 	.recvmsg		= tcp_recvmsg,
 	.sendmsg		= tcp_sendmsg,
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 0e1509b02cb3..8539715ff035 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -2121,6 +2121,7 @@ struct proto tcpv6_prot = {
 	.shutdown		= tcp_shutdown,
 	.setsockopt		= tcp_setsockopt,
 	.getsockopt		= tcp_getsockopt,
+	.bpf_bypass_getsockopt	= tcp_bpf_bypass_getsockopt,
 	.keepalive		= tcp_set_keepalive,
 	.recvmsg		= tcp_recvmsg,
 	.sendmsg		= tcp_sendmsg,
diff --git a/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c b/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c
index b25c9c45c148..6bb18b1d8578 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c
+++ b/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c
@@ -11,6 +11,7 @@ static int getsetsockopt(void)
 		char u8[4];
 		__u32 u32;
 		char cc[16]; /* TCP_CA_NAME_MAX */
+		struct tcp_zerocopy_receive zc;
 	} buf = {};
 	socklen_t optlen;
 	char *big_buf = NULL;
@@ -154,6 +155,27 @@ static int getsetsockopt(void)
 		goto err;
 	}

+	/* TCP_ZEROCOPY_RECEIVE triggers */
+	memset(&buf, 0, sizeof(buf));
+	optlen = sizeof(buf.zc);
+	err = getsockopt(fd, SOL_TCP, TCP_ZEROCOPY_RECEIVE, &buf, &optlen);
+	if (err) {
+		log_err("Unexpected getsockopt(TCP_ZEROCOPY_RECEIVE) err=%d errno=%d",
+			err, errno);
+		goto err;
+	}
+
+	memset(&buf, 0, sizeof(buf));
+	buf.zc.address = 12345; /* rejected by BPF */
+	optlen = sizeof(buf.zc);
+	errno = 0;
+	err = getsockopt(fd, SOL_TCP, TCP_ZEROCOPY_RECEIVE, &buf, &optlen);
+	if (errno != EPERM) {
+		log_err("Unexpected getsockopt(TCP_ZEROCOPY_RECEIVE) err=%d errno=%d",
+			err, errno);
+		goto err;
+	}
+
 	free(big_buf);
 	close(fd);
 	return 0;
diff --git a/tools/testing/selftests/bpf/progs/sockopt_sk.c b/tools/testing/selftests/bpf/progs/sockopt_sk.c
index 712df7b49cb1..c726f0763a13 100644
--- a/tools/testing/selftests/bpf/progs/sockopt_sk.c
+++ b/tools/testing/selftests/bpf/progs/sockopt_sk.c
@@ -57,6 +57,21 @@ int _getsockopt(struct bpf_sockopt *ctx)
 		return 1;
 	}

+	if (ctx->level == SOL_TCP && ctx->optname == TCP_ZEROCOPY_RECEIVE) {
+		/* Verify that TCP_ZEROCOPY_RECEIVE triggers.
+		 * It has a custom implementation for performance
+		 * reasons.
+		 */
+
+		if (optval + sizeof(struct tcp_zerocopy_receive) > optval_end)
+			return 0; /* EPERM, bounds check */
+
+		if (((struct tcp_zerocopy_receive *)optval)->address != 0)
+			return 0; /* EPERM, unexpected data */
+
+		return 1;
+	}
+
 	if (ctx->level == SOL_IP && ctx->optname == IP_FREEBIND) {
 		if (optval + 1 > optval_end)
 			return 0; /* EPERM, bounds check */
Add custom implementation of getsockopt hook for TCP_ZEROCOPY_RECEIVE.
We skip generic hooks for TCP_ZEROCOPY_RECEIVE and have a custom
call in do_tcp_getsockopt using the on-stack data. This removes
3% overhead for locking/unlocking the socket.

Without this patch:
     3.38%     0.07%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt
            |
             --3.30%--__cgroup_bpf_run_filter_getsockopt
                       |
                        --0.81%--__kmalloc

With the patch applied:
     0.52%     0.12%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt_kern

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
---
 include/linux/bpf-cgroup.h                    | 25 ++++++++++--
 include/net/sock.h                            |  2 +
 include/net/tcp.h                             |  1 +
 kernel/bpf/cgroup.c                           | 38 +++++++++++++++++++
 net/ipv4/tcp.c                                | 14 +++++++
 net/ipv4/tcp_ipv4.c                           |  1 +
 net/ipv6/tcp_ipv6.c                           |  1 +
 .../selftests/bpf/prog_tests/sockopt_sk.c     | 22 +++++++++++
 .../testing/selftests/bpf/progs/sockopt_sk.c  | 15 ++++++++
 9 files changed, 115 insertions(+), 4 deletions(-)
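[Background on the workload being optimized: tcp_mmap-style receivers call getsockopt(TCP_ZEROCOPY_RECEIVE) once per batch of received data to map payload pages into userspace, so a fixed per-call cost, such as the lock_sock()/release_sock() round trip in the generic cgroup BPF hook, is paid on every batch. A minimal, hedged userspace sketch of that call follows; error handling is trimmed, and it assumes fd is a connected TCP socket with a VMA previously created by mmap() on the socket at addr, as in the tcp_mmap selftest.]

#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <netinet/in.h>		/* IPPROTO_TCP */
#include <linux/tcp.h>		/* TCP_ZEROCOPY_RECEIVE, struct tcp_zerocopy_receive */

/* Sketch: ask the kernel to map up to `len` bytes of received payload
 * into the region at `addr` (created earlier with mmap() on the socket).
 * Returns the number of bytes now readable at `addr`, or -1 on error.
 */
static ssize_t zerocopy_receive(int fd, void *addr, unsigned int len)
{
	struct tcp_zerocopy_receive zc;
	socklen_t zc_len = sizeof(zc);

	memset(&zc, 0, sizeof(zc));
	zc.address = (__u64)(unsigned long)addr;
	zc.length = len;

	/* This getsockopt() is the hot path the patch keeps out of the
	 * generic __cgroup_bpf_run_filter_getsockopt() slow path.
	 */
	if (getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &zc, &zc_len))
		return -1;

	return zc.length;
}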