[bpf-next,v5,1/3] bpf: remove extra lock_sock for TCP_ZEROCOPY_RECEIVE

Message ID 20210108180333.180906-2-sdf@google.com (mailing list archive)
State Superseded
Delegated to: BPF
Series bpf: misc performance improvements for cgroup hooks

Checks

Context Check Description
netdev/cover_letter success
netdev/fixes_present success
netdev/patch_count success
netdev/tree_selection success Clearly marked for bpf-next
netdev/subject_prefix success
netdev/cc_maintainers warning 12 maintainers not CCed: kpsingh@kernel.org yhs@fb.com linux-kselftest@vger.kernel.org andrii@kernel.org kuba@kernel.org shuah@kernel.org john.fastabend@gmail.com iii@linux.ibm.com yoshfuji@linux-ipv6.org toke@redhat.com davem@davemloft.net kuznet@ms2.inr.ac.ru
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 12193 this patch: 12193
netdev/kdoc success Errors and warnings before: 3 this patch: 3
netdev/verify_fixes success
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 197 lines checked
netdev/build_allmodconfig_warn success Errors and warnings before: 12846 this patch: 12846
netdev/header_inline success
netdev/stable success Stable not CCed

Commit Message

Stanislav Fomichev Jan. 8, 2021, 6:03 p.m. UTC
Add a custom implementation of the getsockopt hook for TCP_ZEROCOPY_RECEIVE.
We skip the generic hooks for TCP_ZEROCOPY_RECEIVE and instead make a custom
call from do_tcp_getsockopt that runs the BPF program on on-stack data. This
removes the 3% overhead of the extra socket lock/unlock.

Without this patch:
     3.38%     0.07%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt
            |
             --3.30%--__cgroup_bpf_run_filter_getsockopt
                       |
                        --0.81%--__kmalloc

With the patch applied:
     0.52%     0.12%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt_kern
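
For context, a minimal sketch of the userspace call this fast path serves
(struct tcp_zerocopy_receive and TCP_ZEROCOPY_RECEIVE come from linux/tcp.h;
the wrapper function and its error handling are illustrative only, not part
of this patch):

#include <linux/tcp.h>      /* struct tcp_zerocopy_receive */
#include <netinet/in.h>     /* IPPROTO_TCP */
#include <sys/socket.h>     /* getsockopt(), socklen_t */

/* Hypothetical helper: receive one batch of bytes into a region
 * previously mmap()ed over the socket. */
static int zc_receive(int fd, void *addr, unsigned int len)
{
	struct tcp_zerocopy_receive zc = {
		.address = (__u64)(unsigned long)addr,
		.length  = len,
	};
	socklen_t zc_len = sizeof(zc);

	/* With a cgroup BPF getsockopt hook attached, this call previously
	 * went through the generic hook path: an extra lock_sock() /
	 * release_sock() pair plus a kmalloc()ed bounce buffer per call.
	 * The patch instead runs the hook inline on the on-stack zc. */
	return getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &zc, &zc_len);
}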

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
---
 include/linux/bpf-cgroup.h                    | 25 ++++++++++--
 include/net/sock.h                            |  2 +
 include/net/tcp.h                             |  1 +
 kernel/bpf/cgroup.c                           | 38 +++++++++++++++++++
 net/ipv4/tcp.c                                | 14 +++++++
 net/ipv4/tcp_ipv4.c                           |  1 +
 net/ipv6/tcp_ipv6.c                           |  1 +
 .../selftests/bpf/prog_tests/sockopt_sk.c     | 22 +++++++++++
 .../testing/selftests/bpf/progs/sockopt_sk.c  | 15 ++++++++
 9 files changed, 115 insertions(+), 4 deletions(-)

Comments

Eric Dumazet Jan. 8, 2021, 6:09 p.m. UTC | #1
On Fri, Jan 8, 2021 at 7:03 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> Add custom implementation of getsockopt hook for TCP_ZEROCOPY_RECEIVE.
> We skip generic hooks for TCP_ZEROCOPY_RECEIVE and have a custom
> call in do_tcp_getsockopt using the on-stack data. This removes
> 3% overhead for locking/unlocking the socket.
>
> Without this patch:
>      3.38%     0.07%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt
>             |
>              --3.30%--__cgroup_bpf_run_filter_getsockopt
>                        |
>                         --0.81%--__kmalloc
>
> With the patch applied:
>      0.52%     0.12%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt_kern
>


OK but we are adding yet another indirect call.

Can you add a patch on top of it adding INDIRECT_CALL_INET() avoidance ?

Thanks.
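
(Background on the helper Eric suggests: the INDIRECT_CALL_*() macros in
include/linux/indirect_call_wrapper.h compare a function pointer against its
most likely target and turn the common case into a direct call, avoiding a
retpoline-slowed indirect call. A minimal standalone sketch of the idea,
simplified from the kernel macro:)

#include <stdio.h>

#define likely(x) __builtin_expect(!!(x), 1)

/* Simplified INDIRECT_CALL_1(): if the function pointer matches the
 * expected target, emit a direct call; fall back to the indirect call
 * otherwise. */
#define INDIRECT_CALL_1(f, f1, ...) \
	(likely((f) == (f1)) ? f1(__VA_ARGS__) : (f)(__VA_ARGS__))

static int add(int a, int b) { return a + b; }

int main(void)
{
	int (*op)(int, int) = add;

	/* The pointer matches, so this compiles to a compare plus a
	 * direct call to add(). */
	printf("%d\n", INDIRECT_CALL_1(op, add, 2, 3));
	return 0;
}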
Stanislav Fomichev Jan. 8, 2021, 6:26 p.m. UTC | #2
On Fri, Jan 8, 2021 at 10:10 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Jan 8, 2021 at 7:03 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > [...]
>
>
> OK but we are adding yet another indirect call.
>
> Can you add a patch on top of it adding INDIRECT_CALL_INET() avoidance ?
Sure, but do you think it will bring any benefit?
We don't have any indirect avoidance in __sys_getsockopt for the
sock->ops->getsockopt() call.
If we add it for this new bpf_bypass_getsockopt, we might as well add
it for sock->ops->getsockopt?
And we need some new INDIRECT_CALL_INET2 such that f2 doesn't get
disabled when ipv6 is disabled :-/
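
(The macro in question, from include/linux/indirect_call_wrapper.h; note how
the IPv6 target f2 drops out of the fast-path comparison unless IPv6 is built
in, which is why a single-target variant needs its own definition:)

#if IS_BUILTIN(CONFIG_IPV6)
#define INDIRECT_CALL_INET(f, f2, f1, ...) \
	INDIRECT_CALL_2(f, f2, f1, __VA_ARGS__)
#elif IS_ENABLED(CONFIG_INET)
#define INDIRECT_CALL_INET(f, f2, f1, ...) INDIRECT_CALL_1(f, f1, __VA_ARGS__)
#else
#define INDIRECT_CALL_INET(f, f2, f1, ...) f(__VA_ARGS__)
#endif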
Eric Dumazet Jan. 8, 2021, 6:40 p.m. UTC | #3
On Fri, Jan 8, 2021 at 7:26 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On Fri, Jan 8, 2021 at 10:10 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Fri, Jan 8, 2021 at 7:03 PM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > [...]
> >
> >
> > OK but we are adding yet another indirect call.
> >
> > Can you add a patch on top of it adding INDIRECT_CALL_INET() avoidance ?
> Sure, but do you think it will bring any benefit?

Sure, avoiding an indirect call might be the same gain as the
lock_sock() avoidance :)

> We don't have any indirect avoidance in __sys_getsockopt for the
> sock->ops->getsockopt() call.
> If we add it for this new bpf_bypass_getsockopt, we might as well add
> it for sock->ops->getsockopt?

Well, that is orthogonal to this patch.
As you may know, Google kernels do have a mitigation there already and
Brian may upstream it.

> And we need some new INDIRECT_CALL_INET2 such that f2 doesn't get
> disabled when ipv6 is disabled :-/

The same handler is called for IPv4 and IPv6, so you need the variant
with only one known handler (tcp_bpf_bypass_getsockopt).

It only needs to make sure CONFIG_INET is enabled.
Stanislav Fomichev Jan. 8, 2021, 7:08 p.m. UTC | #4
On Fri, Jan 8, 2021 at 10:41 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Jan 8, 2021 at 7:26 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On Fri, Jan 8, 2021 at 10:10 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Fri, Jan 8, 2021 at 7:03 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > >
> > > > [...]
> > >
> > >
> > > OK but we are adding yet another indirect call.
> > >
> > > Can you add a patch on top of it adding INDIRECT_CALL_INET() avoidance ?
> > Sure, but do you think it will bring any benefit?
>
> Sure, avoiding an indirect call might be the same gain as the
> lock_sock() avoidance :)
>
> > We don't have any indirect avoidance in __sys_getsockopt for the
> > sock->ops->getsockopt() call.
> > If we add it for this new bpf_bypass_getsockopt, we might as well add
> > it for sock->ops->getsockopt?
>
> Well, that is orthogonal to this patch.
> As you may know, Google kernels do have a mitigation there already and
> Brian may upstream it.
I guess my point here was that if I send it out only for bpf_bypass_getsockopt
it might look a bit strange because the rest of the getsockopt still
suffers the indirect costs. If Brian has plans to upstream the rest, maybe
it's better to upstream everything together with some numbers?
CC'ing him for his opinion.

I'm happy to follow up in whatever form is best. I can also resend
with INDIRECT_CALL_INET2 if there are no objections to including
this version from the start.

> > And we need some new INDIRECT_CALL_INET2 such that f2 doesn't get
> > disabled when ipv6 is disabled :-/
>
> The same handler is called for IPv4 and IPv6, so you need the variant
> with only one known handler (tcp_bpf_bypass_getsockopt).
>
> It only needs to make sure CONFIG_INET is enabled.
Eric Dumazet Jan. 8, 2021, 7:23 p.m. UTC | #5
On Fri, Jan 8, 2021 at 8:08 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On Fri, Jan 8, 2021 at 10:41 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Fri, Jan 8, 2021 at 7:26 PM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > On Fri, Jan 8, 2021 at 10:10 AM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Fri, Jan 8, 2021 at 7:03 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > > >
> > > > > [...]
> > > >
> > > >
> > > > OK but we are adding yet another indirect call.
> > > >
> > > > Can you add a patch on top of it adding INDIRECT_CALL_INET() avoidance ?
> > > Sure, but do you think it will bring any benefit?
> >
> > Sure, avoiding an indirect call might be the same gain as the
> > lock_sock() avoidance :)
> >
> > > We don't have any indirect avoidance in __sys_getsockopt for the
> > > sock->ops->getsockopt() call.
> > > If we add it for this new bpf_bypass_getsockopt, we might as well add
> > > it for sock->ops->getsockopt?
> >
> > Well, that is orthogonal to this patch.
> > As you may know, Google kernels do have a mitigation there already and
> > Brian may upstream it.
> I guess my point here was that if I send it out only for bpf_bypass_getsockopt
> it might look a bit strange because the rest of the getsockopt still
> suffers the indirect costs.


Each new indirect call adds a cost. If you focus on optimizing
TCP_ZEROCOPY_RECEIVE, it is counterintuitive to add an expensive indirect call.

> If Brian has plans to upstream the rest, maybe
> it's better to upstream everything together with some numbers?
> CC'ing him for his opinion.

I am just saying your point about the other indirect call is already taken care of.

>
> I'm happy to follow up in whatever form is best. I can also resend
> with INDIRECT_CALL_INET2 if there are no objections to including
> this version from the start.
>

INDIRECT_CALL_INET2 seems a strange name to me.
Stanislav Fomichev Jan. 8, 2021, 7:26 p.m. UTC | #6
On Fri, Jan 8, 2021 at 11:23 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Jan 8, 2021 at 8:08 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On Fri, Jan 8, 2021 at 10:41 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Fri, Jan 8, 2021 at 7:26 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > >
> > > > On Fri, Jan 8, 2021 at 10:10 AM Eric Dumazet <edumazet@google.com> wrote:
> > > > >
> > > > > On Fri, Jan 8, 2021 at 7:03 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > > > >
> > > > > > [...]
> > > > >
> > > > >
> > > > > OK but we are adding yet another indirect call.
> > > > >
> > > > > Can you add a patch on top of it adding INDIRECT_CALL_INET() avoidance ?
> > > > Sure, but do you think it will bring any benefit?
> > >
> > > Sure, avoiding an indirect call might be the same gain as the
> > > lock_sock() avoidance :)
> > >
> > > > We don't have any indirect avoidance in __sys_getsockopt for the
> > > > sock->ops->getsockopt() call.
> > > > If we add it for this new bpf_bypass_getsockopt, we might as well add
> > > > it for sock->ops->getsockopt?
> > >
> > > Well, that is orthogonal to this patch.
> > > As you may know, Google kernels do have a mitigation there already and
> > > Brian may upstream it.
> > I guess my point here was that if I send it out only for bpf_bypass_getsockopt
> > it might look a bit strange because the rest of the getsockopt still
> > suffers the indirect costs.
>
>
> > Each new indirect call adds a cost. If you focus on optimizing
> > TCP_ZEROCOPY_RECEIVE, it is counterintuitive to add an expensive indirect call.
Ok, then let me resend with a mitigation in place and a note
that the rest will be added later.

> > If Brian has plans to upstream the rest, maybe
> > it's better to upstream everything together with some numbers?
> > CC'ing him for his opinion.
>
> I am just saying your point about the other indirect call is already taken care of.
>
> >
> > I'm happy to follow up in whatever form is best. I can also resend
> > with INDIRECT_CALL_INET2 if there are no objections to including
> > this version from the start.
> >
>
> INDIRECT_CALL_INET2 seems a strange name to me.
Any suggestion for a better name? I did play with the following:
diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index cbba9c9ab073..f7342a30284c 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -371,7 +371,9 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key,
        int __ret = retval;                                                    \
        if (cgroup_bpf_enabled(BPF_CGROUP_GETSOCKOPT))                         \
                if (!(sock)->sk_prot->bpf_bypass_getsockopt ||                 \
-                   !(sock)->sk_prot->bpf_bypass_getsockopt(level, optname))   \
+                   !INDIRECT_CALL_INET1((sock)->sk_prot->bpf_bypass_getsockopt, \
+                                       tcp_bpf_bypass_getsockopt,             \
+                                       level, optname))                       \
                        __ret = __cgroup_bpf_run_filter_getsockopt(            \
                                sock, level, optname, optval, optlen,          \
                                max_optlen, retval);                           \
diff --git a/include/linux/indirect_call_wrapper.h b/include/linux/indirect_call_wrapper.h
index 54c02c84906a..9c3252f7e9bb 100644
--- a/include/linux/indirect_call_wrapper.h
+++ b/include/linux/indirect_call_wrapper.h
@@ -54,10 +54,13 @@
 #if IS_BUILTIN(CONFIG_IPV6)
 #define INDIRECT_CALL_INET(f, f2, f1, ...) \
        INDIRECT_CALL_2(f, f2, f1, __VA_ARGS__)
+#define INDIRECT_CALL_INET1(f, f1, ...) INDIRECT_CALL_1(f, f1, __VA_ARGS__)
 #elif IS_ENABLED(CONFIG_INET)
 #define INDIRECT_CALL_INET(f, f2, f1, ...) INDIRECT_CALL_1(f, f1, __VA_ARGS__)
+#define INDIRECT_CALL_INET1(f, f1, ...) INDIRECT_CALL_1(f, f1, __VA_ARGS__)
 #else
 #define INDIRECT_CALL_INET(f, f2, f1, ...) f(__VA_ARGS__)
+#define INDIRECT_CALL_INET1(f, f1, ...) f(__VA_ARGS__)
 #endif

 #endif
Eric Dumazet Jan. 8, 2021, 8:08 p.m. UTC | #7
On Fri, Jan 8, 2021 at 8:27 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On Fri, Jan 8, 2021 at 11:23 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Fri, Jan 8, 2021 at 8:08 PM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > On Fri, Jan 8, 2021 at 10:41 AM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Fri, Jan 8, 2021 at 7:26 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > > >
> > > > > On Fri, Jan 8, 2021 at 10:10 AM Eric Dumazet <edumazet@google.com> wrote:
> > > > > >
> > > > > > On Fri, Jan 8, 2021 at 7:03 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > > > > >
> > > > > > > [...]
> > > > > >
> > > > > >
> > > > > > OK but we are adding yet another indirect call.
> > > > > >
> > > > > > Can you add a patch on top of it adding INDIRECT_CALL_INET() avoidance ?
> > > > > Sure, but do you think it will bring any benefit?
> > > >
> > > > Sure, avoiding an indirect call might be the same gain as the
> > > > lock_sock() avoidance :)
> > > >
> > > > > We don't have any indirect avoidance in __sys_getsockopt for the
> > > > > sock->ops->getsockopt() call.
> > > > > If we add it for this new bpf_bypass_getsockopt, we might as well add
> > > > > it for sock->ops->getsockopt?
> > > >
> > > > Well, that is orthogonal to this patch.
> > > > As you may know, Google kernels do have a mitigation there already and
> > > > Brian may upstream it.
> > > I guess my point here was that if I send it out only for bpf_bypass_getsockopt
> > > it might look a bit strange because the rest of the getsockopt still
> > > suffers the indirect costs.
> >
> >
> > Each new indirect call adds a cost. If you focus on optimizing
> > TCP_ZEROCOPY_RECEIVE, it is counterintuitive to add an expensive indirect call.
> Ok, then let me resend with a mitigation in place and a note
> that the rest will be added later.
>
> > > If Brian has plans to upstream the rest, maybe
> > > it's better to upstream everything together with some numbers?
> > > CC'ing him for his opinion.
> >
> > I am just saying your point about the other indirect call is already taken care of.
> >
> > >
> > > I'm happy to follow up in whatever form is best. I can also resend
> > > with INDIRECT_CALL_INET2 if there are no objections to including
> > > this version from the start.
> > >
> >
> > INDIRECT_CALL_INET2 seems a strange name to me.
> Any suggestion for a better name? I did play with the following:
> [...]

Yes, or maybe something only focusing on CONFIG_INET to make it clear.

diff --git a/include/linux/indirect_call_wrapper.h b/include/linux/indirect_call_wrapper.h
index 54c02c84906ab2548a93bacb46f7795a8e136d83..d082aa4bd3ecae52e5998b3ac05deffafcb45de0 100644
--- a/include/linux/indirect_call_wrapper.h
+++ b/include/linux/indirect_call_wrapper.h
@@ -46,6 +46,12 @@
 #define INDIRECT_CALLABLE_SCOPE                static
 #endif

+#if IS_ENABLED(CONFIG_INET)
+#define INDIRECT_CALL_INET1(f, f1, ...) INDIRECT_CALL_1(f, f1, __VA_ARGS__)
+#else
+#define INDIRECT_CALL_INET1(f, f1, ...) f(__VA_ARGS__)
+#endif
+
 /*
  * We can use INDIRECT_CALL_$NR for ipv6 related functions only if ipv6 is
  * builtin, this macro simplify dealing with indirect calls with only ipv4/ipv6

Patch

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 72e69a0e1e8c..6adaee018c63 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -147,6 +147,10 @@  int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level,
 				       int __user *optlen, int max_optlen,
 				       int retval);
 
+int __cgroup_bpf_run_filter_getsockopt_kern(struct sock *sk, int level,
+					    int optname, void *optval,
+					    int *optlen, int retval);
+
 static inline enum bpf_cgroup_storage_type cgroup_storage_type(
 	struct bpf_map *map)
 {
@@ -364,10 +368,21 @@  int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key,
 ({									       \
 	int __ret = retval;						       \
 	if (cgroup_bpf_enabled)						       \
-		__ret = __cgroup_bpf_run_filter_getsockopt(sock, level,	       \
-							   optname, optval,    \
-							   optlen, max_optlen, \
-							   retval);	       \
+		if (!(sock)->sk_prot->bpf_bypass_getsockopt ||		       \
+		    !(sock)->sk_prot->bpf_bypass_getsockopt(level, optname))   \
+			__ret = __cgroup_bpf_run_filter_getsockopt(	       \
+				sock, level, optname, optval, optlen,	       \
+				max_optlen, retval);			       \
+	__ret;								       \
+})
+
+#define BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sock, level, optname, optval,      \
+					    optlen, retval)		       \
+({									       \
+	int __ret = retval;						       \
+	if (cgroup_bpf_enabled)						       \
+		__ret = __cgroup_bpf_run_filter_getsockopt_kern(	       \
+			sock, level, optname, optval, optlen, retval);	       \
 	__ret;								       \
 })
 
@@ -452,6 +467,8 @@  static inline int bpf_percpu_cgroup_storage_update(struct bpf_map *map,
 #define BPF_CGROUP_GETSOCKOPT_MAX_OPTLEN(optlen) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock, level, optname, optval, \
 				       optlen, max_optlen, retval) ({ retval; })
+#define BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sock, level, optname, optval, \
+					    optlen, retval) ({ retval; })
 #define BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock, level, optname, optval, optlen, \
 				       kernel_optval) ({ 0; })
 
diff --git a/include/net/sock.h b/include/net/sock.h
index bdc4323ce53c..ebf44d724845 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1174,6 +1174,8 @@  struct proto {
 
 	int			(*backlog_rcv) (struct sock *sk,
 						struct sk_buff *skb);
+	bool			(*bpf_bypass_getsockopt)(int level,
+							 int optname);
 
 	void		(*release_cb)(struct sock *sk);
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 78d13c88720f..4bb42fb19711 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -403,6 +403,7 @@  __poll_t tcp_poll(struct file *file, struct socket *sock,
 		      struct poll_table_struct *wait);
 int tcp_getsockopt(struct sock *sk, int level, int optname,
 		   char __user *optval, int __user *optlen);
+bool tcp_bpf_bypass_getsockopt(int level, int optname);
 int tcp_setsockopt(struct sock *sk, int level, int optname, sockptr_t optval,
 		   unsigned int optlen);
 void tcp_set_keepalive(struct sock *sk, int val);
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 6ec088a96302..c41bb2f34013 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -1485,6 +1485,44 @@  int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level,
 	sockopt_free_buf(&ctx);
 	return ret;
 }
+
+int __cgroup_bpf_run_filter_getsockopt_kern(struct sock *sk, int level,
+					    int optname, void *optval,
+					    int *optlen, int retval)
+{
+	struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
+	struct bpf_sockopt_kern ctx = {
+		.sk = sk,
+		.level = level,
+		.optname = optname,
+		.retval = retval,
+		.optlen = *optlen,
+		.optval = optval,
+		.optval_end = optval + *optlen,
+	};
+	int ret;
+
+	ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[BPF_CGROUP_GETSOCKOPT],
+				 &ctx, BPF_PROG_RUN);
+	if (!ret)
+		return -EPERM;
+
+	if (ctx.optlen > *optlen)
+		return -EFAULT;
+
+	/* BPF programs only allowed to set retval to 0, not some
+	 * arbitrary value.
+	 */
+	if (ctx.retval != 0 && ctx.retval != retval)
+		return -EFAULT;
+
+	/* BPF programs can shrink the buffer, export the modifications.
+	 */
+	if (ctx.optlen != 0)
+		*optlen = ctx.optlen;
+
+	return ctx.retval;
+}
 #endif
 
 static ssize_t sysctl_cpy_dir(const struct ctl_dir *dir, char **bufp,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ed42d2193c5c..ef3c895b66c1 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4098,6 +4098,8 @@  static int do_tcp_getsockopt(struct sock *sk, int level,
 			return -EFAULT;
 		lock_sock(sk);
 		err = tcp_zerocopy_receive(sk, &zc);
+		err = BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sk, level, optname,
+							  &zc, &len, err);
 		release_sock(sk);
 		if (len >= offsetofend(struct tcp_zerocopy_receive, err))
 			goto zerocopy_rcv_sk_err;
@@ -4132,6 +4134,18 @@  static int do_tcp_getsockopt(struct sock *sk, int level,
 	return 0;
 }
 
+bool tcp_bpf_bypass_getsockopt(int level, int optname)
+{
+	/* TCP do_tcp_getsockopt has optimized getsockopt implementation
+	 * to avoid extra socket lock for TCP_ZEROCOPY_RECEIVE.
+	 */
+	if (level == SOL_TCP && optname == TCP_ZEROCOPY_RECEIVE)
+		return true;
+
+	return false;
+}
+EXPORT_SYMBOL(tcp_bpf_bypass_getsockopt);
+
 int tcp_getsockopt(struct sock *sk, int level, int optname, char __user *optval,
 		   int __user *optlen)
 {
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 58207c7769d0..8b4906980fce 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2792,6 +2792,7 @@  struct proto tcp_prot = {
 	.shutdown		= tcp_shutdown,
 	.setsockopt		= tcp_setsockopt,
 	.getsockopt		= tcp_getsockopt,
+	.bpf_bypass_getsockopt	= tcp_bpf_bypass_getsockopt,
 	.keepalive		= tcp_set_keepalive,
 	.recvmsg		= tcp_recvmsg,
 	.sendmsg		= tcp_sendmsg,
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 0e1509b02cb3..8539715ff035 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -2121,6 +2121,7 @@  struct proto tcpv6_prot = {
 	.shutdown		= tcp_shutdown,
 	.setsockopt		= tcp_setsockopt,
 	.getsockopt		= tcp_getsockopt,
+	.bpf_bypass_getsockopt	= tcp_bpf_bypass_getsockopt,
 	.keepalive		= tcp_set_keepalive,
 	.recvmsg		= tcp_recvmsg,
 	.sendmsg		= tcp_sendmsg,
diff --git a/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c b/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c
index b25c9c45c148..6bb18b1d8578 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c
+++ b/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c
@@ -11,6 +11,7 @@  static int getsetsockopt(void)
 		char u8[4];
 		__u32 u32;
 		char cc[16]; /* TCP_CA_NAME_MAX */
+		struct tcp_zerocopy_receive zc;
 	} buf = {};
 	socklen_t optlen;
 	char *big_buf = NULL;
@@ -154,6 +155,27 @@  static int getsetsockopt(void)
 		goto err;
 	}
 
+	/* TCP_ZEROCOPY_RECEIVE triggers */
+	memset(&buf, 0, sizeof(buf));
+	optlen = sizeof(buf.zc);
+	err = getsockopt(fd, SOL_TCP, TCP_ZEROCOPY_RECEIVE, &buf, &optlen);
+	if (err) {
+		log_err("Unexpected getsockopt(TCP_ZEROCOPY_RECEIVE) err=%d errno=%d",
+			err, errno);
+		goto err;
+	}
+
+	memset(&buf, 0, sizeof(buf));
+	buf.zc.address = 12345; /* rejected by BPF */
+	optlen = sizeof(buf.zc);
+	errno = 0;
+	err = getsockopt(fd, SOL_TCP, TCP_ZEROCOPY_RECEIVE, &buf, &optlen);
+	if (errno != EPERM) {
+		log_err("Unexpected getsockopt(TCP_ZEROCOPY_RECEIVE) err=%d errno=%d",
+			err, errno);
+		goto err;
+	}
+
 	free(big_buf);
 	close(fd);
 	return 0;
diff --git a/tools/testing/selftests/bpf/progs/sockopt_sk.c b/tools/testing/selftests/bpf/progs/sockopt_sk.c
index 712df7b49cb1..c726f0763a13 100644
--- a/tools/testing/selftests/bpf/progs/sockopt_sk.c
+++ b/tools/testing/selftests/bpf/progs/sockopt_sk.c
@@ -57,6 +57,21 @@  int _getsockopt(struct bpf_sockopt *ctx)
 		return 1;
 	}
 
+	if (ctx->level == SOL_TCP && ctx->optname == TCP_ZEROCOPY_RECEIVE) {
+		/* Verify that TCP_ZEROCOPY_RECEIVE triggers.
+		 * It has a custom implementation for performance
+		 * reasons.
+		 */
+
+		if (optval + sizeof(struct tcp_zerocopy_receive) > optval_end)
+			return 0; /* EPERM, bounds check */
+
+		if (((struct tcp_zerocopy_receive *)optval)->address != 0)
+			return 0; /* EPERM, unexpected data */
+
+		return 1;
+	}
+
 	if (ctx->level == SOL_IP && ctx->optname == IP_FREEBIND) {
 		if (optval + 1 > optval_end)
 			return 0; /* EPERM, bounds check */