
[bpf-next,v4,18/27] bpf, x64: Store properly return value for trampoline with multi func programs

Message ID 20210826193922.66204-19-jolsa@kernel.org (mailing list archive)
State Changes Requested
Delegated to: BPF
Series x86/ftrace/bpf: Add batch support for direct/tracing attach

Checks

Context Check Description
bpf/vmtest fail Kernel LATEST + selftests
netdev/cover_letter success Link
netdev/fixes_present success Link
netdev/patch_count fail Series longer than 15 patches
netdev/tree_selection success Clearly marked for bpf-next
netdev/subject_prefix success Link
netdev/cc_maintainers warning 10 maintainers not CCed: hpa@zytor.com mingo@redhat.com andrii@kernel.org tglx@linutronix.de kpsingh@kernel.org davem@davemloft.net yoshfuji@linux-ipv6.org bp@alien8.de x86@kernel.org dsahern@kernel.org
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 11867 this patch: 11867
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch warning WARNING: From:/Signed-off-by: email address mismatch: 'From: Jiri Olsa <jolsa@redhat.com>' != 'Signed-off-by: Jiri Olsa <jolsa@kernel.org>' WARNING: line length of 82 exceeds 80 columns WARNING: line length of 88 exceeds 80 columns WARNING: line length of 90 exceeds 80 columns
netdev/build_allmodconfig_warn success Errors and warnings before: 11388 this patch: 11388
netdev/header_inline success Link
bpf/vmtest-bpf-next fail Kernel LATEST + selftests

Commit Message

Jiri Olsa Aug. 26, 2021, 7:39 p.m. UTC
When a multi func program is attached, the trampoline
switches to the function model of the multi func program.

This breaks already attached standard programs, for example
when we attach the following program:

  SEC("fexit/bpf_fentry_test2")
  int BPF_PROG(test1, int a, __u64 b, int ret)

the trampoline pushes args 'a' and 'b' and the return
value 'ret' on the stack.

When the following multi func program is attached to bpf_fentry_test2:

  SEC("fexit.multi/bpf_fentry_test*")
  int BPF_PROG(test2, __u64 a, __u64 b, __u64 c, __u64 d,
                       __u64 e, __u64 f, int ret)

the trampoline takes this program's model and pushes all 6 args
and the return value on the stack.

But we still have the original 'test1' program attached, which
expects the 'ret' value where the 'c' argument now is:

  test1(a, b, c)

To fix that we simply overwrite the 'c' argument with the 'ret' value,
so test1 is called as expected and test2 gets called as:

  test2(a, b, ret, d, e, f, ret)

which is ok, because 'c' is not defined for bpf_fentry_test2
anyway.
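
For illustration, with the fix the trampoline stack for bpf_fentry_test2
(2 real args) under the 6-arg multi model roughly looks like this
(slot numbers as seen by the programs, offsets illustrative only):

  ctx[0]: a    <- arg1
  ctx[1]: b    <- arg2
  ctx[2]: c    <- unused by bpf_fentry_test2, overwritten with 'ret'
  ctx[3]: d
  ctx[4]: e
  ctx[5]: f
  ctx[6]: ret  <- return value slot

so test1 finds 'ret' at ctx[2] as before, while test2 reads it at ctx[6].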

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 arch/x86/net/bpf_jit_comp.c | 40 ++++++++++++++++++++++++++++++-------
 include/linux/bpf.h         |  1 +
 kernel/bpf/trampoline.c     |  1 +
 3 files changed, 35 insertions(+), 7 deletions(-)

Comments

Andrii Nakryiko Aug. 31, 2021, 11:51 p.m. UTC | #1
On Thu, Aug 26, 2021 at 12:41 PM Jiri Olsa <jolsa@redhat.com> wrote:
>
> When we have multi func program attached, the trampoline
> switched to the function model of the multi func program.
>
> This breaks already attached standard programs, for example
> when we attach following program:
>
>   SEC("fexit/bpf_fentry_test2")
>   int BPF_PROG(test1, int a, __u64 b, int ret)
>
> the trampoline pushes on stack args 'a' and 'b' and return
> value 'ret'.
>
> When following multi func program is attached to bpf_fentry_test2:
>
>   SEC("fexit.multi/bpf_fentry_test*")
>   int BPF_PROG(test2, __u64 a, __u64 b, __u64 c, __u64 d,
>                        __u64 e, __u64 f, int ret)
>
> the trampoline takes this program model and pushes all 6 args
> and return value on stack.
>
> But we still have the original 'test1' program attached, that
> expects 'ret' value where there's 'c' argument now:
>
>   test1(a, b, c)
>
> To fix that we simply overwrite 'c' argument with 'ret' value,
> so test1 is called as expected and test2 gets called as:
>
>   test2(a, b, ret, d, e, f, ret)
>
> which is ok, because 'c' is not defined for bpf_fentry_test2
> anyway.
>

What if we change the order on the stack to be the return value first,
followed by the input arguments? That would get us a bit closer to
unifying multi-trampoline and the normal one, right? BPF verifier
should be able to rewrite access to the last argument (i.e., return
value) for fexit programs to actually be at offset 0, and shift all
other arguments by 8 bytes. For fentry, if that helps to keep things
more aligned, we'd just skip the first 8 bytes on the stack and store
all the input arguments at the same offsets. So BPF verifier rewriting
logic stays consistent (except offset 0 will be disallowed).
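
Roughly, the layout I have in mind would be something like this (just to
illustrate the idea, exact offsets TBD):

  ctx[0]: return value  (meaningful for fexit; skipped/unused for fentry)
  ctx[1]: arg1
  ctx[2]: arg2
  ...
  ctx[N]: argN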

Basically, I'm thinking about how we can make normal and multi trampolines
more interoperable to remove the limitation that two
multi-trampolines can't be attached to the same function, which seems
like a pretty annoying restriction that will be easy to hit in
practice. Alexei previously proposed (as an optimization) grouping all
to-be-attached functions by number of arguments, so that
we can have up to 6 different trampolines tailored to the actual functions
being attached. That way we don't do unnecessary saving of extra input
arguments, which will be even more important once we allow more
than 6 arguments in the future.

With such logic, we should be able to split all the functions into
multiple underlying trampolines, so it seems like it should be
possible to also allow multiple multi-fentry programs to be attached
to the same function by having a separate bpf_trampoline just for
those functions. It will be just an extension of the above "just 6
trampolines" strategy to "as much as we need trampolines".

It's just a vague idea, sorry, I don't understand all the code yet.
But the limitation outlined in one of the previous patches seems very
restrictive and unpleasant. I can totally see some 24/7 running BPF
tracing app using multi-fentry to trace a small subset of kernel
functions non-stop, and then someone trying to use bpftrace or
retsnoop to trace an overlapping set of functions. And it immediately
fails. Very frustrating.

> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  arch/x86/net/bpf_jit_comp.c | 40 ++++++++++++++++++++++++++++++-------
>  include/linux/bpf.h         |  1 +
>  kernel/bpf/trampoline.c     |  1 +
>  3 files changed, 35 insertions(+), 7 deletions(-)
>

[...]
Jiri Olsa Sept. 1, 2021, 3:15 p.m. UTC | #2
On Tue, Aug 31, 2021 at 04:51:18PM -0700, Andrii Nakryiko wrote:
> On Thu, Aug 26, 2021 at 12:41 PM Jiri Olsa <jolsa@redhat.com> wrote:
> >
> > When we have multi func program attached, the trampoline
> > switched to the function model of the multi func program.
> >
> > This breaks already attached standard programs, for example
> > when we attach following program:
> >
> >   SEC("fexit/bpf_fentry_test2")
> >   int BPF_PROG(test1, int a, __u64 b, int ret)
> >
> > the trampoline pushes on stack args 'a' and 'b' and return
> > value 'ret'.
> >
> > When following multi func program is attached to bpf_fentry_test2:
> >
> >   SEC("fexit.multi/bpf_fentry_test*")
> >   int BPF_PROG(test2, __u64 a, __u64 b, __u64 c, __u64 d,
> >                        __u64 e, __u64 f, int ret)
> >
> > the trampoline takes this program model and pushes all 6 args
> > and return value on stack.
> >
> > But we still have the original 'test1' program attached, that
> > expects 'ret' value where there's 'c' argument now:
> >
> >   test1(a, b, c)
> >
> > To fix that we simply overwrite 'c' argument with 'ret' value,
> > so test1 is called as expected and test2 gets called as:
> >
> >   test2(a, b, ret, d, e, f, ret)
> >
> > which is ok, because 'c' is not defined for bpf_fentry_test2
> > anyway.
> >
> 
> What if we change the order on the stack to be the return value first,
> followed by input arguments. That would get us a bit closer to
> unifying multi-trampoline and the normal one, right? BPF verifier
> should be able to rewrite access to the last argument (i.e., return
> value) for fexit programs to actually be at offset 0, and shift all
> other arguments by 8 bytes. For fentry, if that helps to keep things
> more aligned, we'd just skip the first 8 bytes on the stack and store
> all the input arguments in the same offsets. So BPF verifier rewriting
> logic stays consistent (except offset 0 will be disallowed).

nice idea, with this in place we could cut that args re-arranging code

> 
> Basically, I'm thinking how we can make normal and multi trampolines
> more interoperable to remove those limitations that two
> multi-trampolines can't be attached to the same function, which seems
> like a pretty annoying limitation which will be easy to hit in
> practice. Alexei previously proposed (as an optimization) to group all
> to-be-attached functions into groups by number of arguments, so that
> we can have up to 6 different trampolines tailored to actual functions
> being attached. So that we don't save unnecessary extra input
> arguments saving, which will be even more important once we allow more
> than 6 arguments in the future.
> 
> With such logic, we should be able to split all the functions into
> multiple underlying trampolines, so it seems like it should be
> possible to also allow multiple multi-fentry programs to be attached
> to the same function by having a separate bpf_trampoline just for
> those functions. It will be just an extension of the above "just 6
> trampolines" strategy to "as much as we need trampolines".

I'm probably missing something here.. say we have 2 functions with a single
argument:

  foo1(int a)
  foo2(int b)

then having 2 programs:

  A - attaching to foo1
  B - attaching to foo2

then you need to have 2 different trampolines instead of a single 'generic-1-argument-trampoline'

> 
> It's just a vague idea, sorry, I don't understand all the code yet.
> But the limitation outlined in one of the previous patches seems very
> limiting and unpleasant. I can totally see that some 24/7 running BPF
> tracing app uses multi-fentry for tracing a small subset of kernel
> functions non-stop, and then someone is trying to use bpftrace or
> retsnoop to trace overlapping set of functions. And it immediately
> fails. Very frustrating.

so the current approach is to some extent driven by the direct ftrace
batch API:

  you have an ftrace_ops object and set it up with the functions you want
  to change by calling:

  ftrace_set_filter_ip(ops, ip1);
  ftrace_set_filter_ip(ops, ip2);
  ...

and then register the trampoline with those functions:

  register_ftrace_direct_multi(ops, tramp_addr);

and since this call is the expensive one (it does the actual work
and the synchronous waiting), my objective was to call it just once per update
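
roughly, the whole batch update then looks like this (simplified sketch,
ips[]/nr_ips are just placeholders and the extra flag arguments of
ftrace_set_filter_ip are omitted as above):

  struct ftrace_ops *ops = ...;  /* ops backing the trampoline */
  int i;

  /* collect all functions the trampoline should be attached to */
  for (i = 0; i < nr_ips; i++)
          ftrace_set_filter_ip(ops, ips[i]);

  /* one expensive call that patches all of them to call the trampoline */
  register_ftrace_direct_multi(ops, tramp_addr);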

now with 2 intersecting multi trampolines we end up with 3 function
sets:

  A - functions for the first multi trampoline
  B - functions for the second multi trampoline
  C - the intersection of them

each set needs a different trampoline:

  tramp A - calls the program for the first multi trampoline
  tramp B - calls the program for the second multi trampoline
  tramp C - calls both programs

so we need to call register_ftrace_direct_multi 3 times

if we also allow standard trampolines to be attached, it gets
even more complicated and ultimately breaks down into
1-function/1-trampoline pairs, ending up with the attach speed
that we have now

...

I have test code for the ftrace direct interface that would
allow registering/changing separate function/addr pairs,
so in one call you can change multiple ips, each to a
different tramp address

but even with that, I ended up with a lot of new complexity
on the bpf side keeping track of multi trampoline intersections,
so I thought I'd start with something limited and simpler

perhaps I should move back to that approach and see how bad
it ends up ;-)

or this could be a next step on top of the current work, which should
get simpler with the args re-arranging you proposed

jirka
Andrii Nakryiko Sept. 2, 2021, 3:56 a.m. UTC | #3
On Wed, Sep 1, 2021 at 8:15 AM Jiri Olsa <jolsa@redhat.com> wrote:
>
> On Tue, Aug 31, 2021 at 04:51:18PM -0700, Andrii Nakryiko wrote:
> > On Thu, Aug 26, 2021 at 12:41 PM Jiri Olsa <jolsa@redhat.com> wrote:
> > >
> > > When we have multi func program attached, the trampoline
> > > switched to the function model of the multi func program.
> > >
> > > This breaks already attached standard programs, for example
> > > when we attach following program:
> > >
> > >   SEC("fexit/bpf_fentry_test2")
> > >   int BPF_PROG(test1, int a, __u64 b, int ret)
> > >
> > > the trampoline pushes on stack args 'a' and 'b' and return
> > > value 'ret'.
> > >
> > > When following multi func program is attached to bpf_fentry_test2:
> > >
> > >   SEC("fexit.multi/bpf_fentry_test*")
> > >   int BPF_PROG(test2, __u64 a, __u64 b, __u64 c, __u64 d,
> > >                        __u64 e, __u64 f, int ret)
> > >
> > > the trampoline takes this program model and pushes all 6 args
> > > and return value on stack.
> > >
> > > But we still have the original 'test1' program attached, that
> > > expects 'ret' value where there's 'c' argument now:
> > >
> > >   test1(a, b, c)
> > >
> > > To fix that we simply overwrite 'c' argument with 'ret' value,
> > > so test1 is called as expected and test2 gets called as:
> > >
> > >   test2(a, b, ret, d, e, f, ret)
> > >
> > > which is ok, because 'c' is not defined for bpf_fentry_test2
> > > anyway.
> > >
> >
> > What if we change the order on the stack to be the return value first,
> > followed by input arguments. That would get us a bit closer to
> > unifying multi-trampoline and the normal one, right? BPF verifier
> > should be able to rewrite access to the last argument (i.e., return
> > value) for fexit programs to actually be at offset 0, and shift all
> > other arguments by 8 bytes. For fentry, if that helps to keep things
> > more aligned, we'd just skip the first 8 bytes on the stack and store
> > all the input arguments in the same offsets. So BPF verifier rewriting
> > logic stays consistent (except offset 0 will be disallowed).
>
> nice idea, with this in place we could cut that args re-arranging code
>
> >
> > Basically, I'm thinking how we can make normal and multi trampolines
> > more interoperable to remove those limitations that two
> > multi-trampolines can't be attached to the same function, which seems
> > like a pretty annoying limitation which will be easy to hit in
> > practice. Alexei previously proposed (as an optimization) to group all
> > to-be-attached functions into groups by number of arguments, so that
> > we can have up to 6 different trampolines tailored to actual functions
> > being attached. So that we don't save unnecessary extra input
> > arguments saving, which will be even more important once we allow more
> > than 6 arguments in the future.
> >
> > With such logic, we should be able to split all the functions into
> > multiple underlying trampolines, so it seems like it should be
> > possible to also allow multiple multi-fentry programs to be attached
> > to the same function by having a separate bpf_trampoline just for
> > those functions. It will be just an extension of the above "just 6
> > trampolines" strategy to "as much as we need trampolines".
>
> I'm probably missing something here.. say we have 2 functions with single
> argument:
>
>   foo1(int a)
>   foo2(int b)
>
> then having 2 programs:
>
>   A - attaching to foo1
>   B - attaching to foo2
>
> then you need to have 2 different trampolines instead of single 'generic-1-argument-trampoline'

right, you have two different BPF progs attached to two different
functions. You have to have 2 trampolines, not sure what's
confusing?..

>
> >
> > It's just a vague idea, sorry, I don't understand all the code yet.
> > But the limitation outlined in one of the previous patches seems very
> > limiting and unpleasant. I can totally see that some 24/7 running BPF
> > tracing app uses multi-fentry for tracing a small subset of kernel
> > functions non-stop, and then someone is trying to use bpftrace or
> > retsnoop to trace overlapping set of functions. And it immediately
> > fails. Very frustrating.
>
> so the current approach is to some extent driven by the direct ftrace
> batch API:
>
>   you have ftrace_ops object and set it up with functions you want
>   to change with calling:
>
>   ftrace_set_filter_ip(ops, ip1);
>   ftrace_set_filter_ip(ops, ip2);
>   ...
>
> and then register trampoline with those functions:
>
>   register_ftrace_direct_multi(ops, tramp_addr);
>
> and with this call being the expensive one (it does the actual work
> and sync waiting), my objective was to call it just once for update
>
> now with 2 intersecting multi trampolines we end up with 3 functions
> sets:
>
>   A - functions for first multi trampoline
>   B - functions for second multi trampoline
>   C - intersection of them
>
> each set needs different trampoline:
>
>   tramp A - calls program for first multi trampoline
>   tramp B - calls program for second multi trampoline
>   tramp C - calls both programs
>
> so we need to call register_ftrace_direct_multi 3 times

Yes, that's the minimal number of trampolines you need. Calling
register_ftrace_direct_multi() three times is not that bad at all,
compared to calling it 1000s of times. If you are worried about 1 vs 3
calls, I think you are over-optimizing here. I'd rather have no
restrictions on what can be attached to what and in which sequence,
even if attachment then takes 3ms, than obscure (for uninitiated users)
restrictions that in some cases allow attachment to happen in 1ms.

The goal with multi-attach is to make it decently fast when attaching
to a lot of functions, but if attachment speed is fast enough, then such
small performance differences don't matter anymore.

>
> if we allow also standard trampolines being attached, it makes
> it even more complicated and ultimatelly gets broken to
> 1-function/1-trampoline pairs, ending up with attach speed
> that we have now
>

So let's make sure that we are on the same page. Let me write out an example.

Let's say we have 5 kernel functions: a, b, c, d, e. Say a, b, c all
have 1 input arg, and d and e have 2.

Now let's say we attach just normal fentry program A to function a.
Also we attach normal fexit program E to func e.

We'll have A  attached to a with trampoline T1. We'll also have E
attached to e with trampoline T2. Right?

And now we try to attach generic fentry (fentry.multi in your
terminology) prog X to all 5 of them. If A and E weren't attached,
we'd need two generic trampolines, one for a, b, c (because 1 input
argument) and another for d,e (because 2 input arguments). But because
we already have A and E attached, we'll end up needing 4:

T1 (1 arg)  for func a calling progs A and X
T2 (2 args) for func e calling progs E and X
T3 (1 arg)  for func b and c calling X
T4 (2 args) for func d calling X

We can't have less than that and satisfy all the constraints. But 4 is
not that bad. If the example has 1000s of functions, you'd still need
between 4 and 8 trampolines (if we had 3, 4, 5, and 6 input args for
kernel functions). That's way less than 1000s of trampolines needed
today. And it's still fast enough on the attachment.

The good thing about what we discussed, making the current trampoline
co-exist with generic (multi) fentry/fexit, is that we'll still have
just one trampoline, saving exactly as many input arguments as
the attached function(s) have. So at least we don't have to maintain two
separate pieces of logic for that. Then the only added complexity
would be breaking up all to-be-attached kernel functions into groups,
as described in the example.

It sounds a bit more complicated in writing than it will be in
practice, probably. I think the critical part is unification of
trampoline to work with fentry/fexit and fentry.multi/fexit.multi
simultaneously, which seems like you agreed above is achievable.

> ...
>
> I have test code for ftrace direct interface that would
> allow to register/change separate function/addr pairs,
> so in one call you can change multiple ips each to
> different tramp addresss
>
> but even with that, I ended up with lot of new complexity
> on bpf side keeping track of multi trampolines intersections,
> so I thought I'd start with something limited and simpler
>
> perhaps I should move back to that approach and see how bad
> it ends ;-)
>
> or this could be next step on top of current work, that should
> get simpler with the args re-arranging you proposed
>
> jirka
>
Jiri Olsa Sept. 2, 2021, 12:57 p.m. UTC | #4
On Wed, Sep 01, 2021 at 08:56:19PM -0700, Andrii Nakryiko wrote:
> On Wed, Sep 1, 2021 at 8:15 AM Jiri Olsa <jolsa@redhat.com> wrote:
> >
> > On Tue, Aug 31, 2021 at 04:51:18PM -0700, Andrii Nakryiko wrote:
> > > On Thu, Aug 26, 2021 at 12:41 PM Jiri Olsa <jolsa@redhat.com> wrote:
> > > >
> > > > When we have multi func program attached, the trampoline
> > > > switched to the function model of the multi func program.
> > > >
> > > > This breaks already attached standard programs, for example
> > > > when we attach following program:
> > > >
> > > >   SEC("fexit/bpf_fentry_test2")
> > > >   int BPF_PROG(test1, int a, __u64 b, int ret)
> > > >
> > > > the trampoline pushes on stack args 'a' and 'b' and return
> > > > value 'ret'.
> > > >
> > > > When following multi func program is attached to bpf_fentry_test2:
> > > >
> > > >   SEC("fexit.multi/bpf_fentry_test*")
> > > >   int BPF_PROG(test2, __u64 a, __u64 b, __u64 c, __u64 d,
> > > >                        __u64 e, __u64 f, int ret)
> > > >
> > > > the trampoline takes this program model and pushes all 6 args
> > > > and return value on stack.
> > > >
> > > > But we still have the original 'test1' program attached, that
> > > > expects 'ret' value where there's 'c' argument now:
> > > >
> > > >   test1(a, b, c)
> > > >
> > > > To fix that we simply overwrite 'c' argument with 'ret' value,
> > > > so test1 is called as expected and test2 gets called as:
> > > >
> > > >   test2(a, b, ret, d, e, f, ret)
> > > >
> > > > which is ok, because 'c' is not defined for bpf_fentry_test2
> > > > anyway.
> > > >
> > >
> > > What if we change the order on the stack to be the return value first,
> > > followed by input arguments. That would get us a bit closer to
> > > unifying multi-trampoline and the normal one, right? BPF verifier
> > > should be able to rewrite access to the last argument (i.e., return
> > > value) for fexit programs to actually be at offset 0, and shift all
> > > other arguments by 8 bytes. For fentry, if that helps to keep things
> > > more aligned, we'd just skip the first 8 bytes on the stack and store
> > > all the input arguments in the same offsets. So BPF verifier rewriting
> > > logic stays consistent (except offset 0 will be disallowed).
> >
> > nice idea, with this in place we could cut that args re-arranging code
> >
> > >
> > > Basically, I'm thinking how we can make normal and multi trampolines
> > > more interoperable to remove those limitations that two
> > > multi-trampolines can't be attached to the same function, which seems
> > > like a pretty annoying limitation which will be easy to hit in
> > > practice. Alexei previously proposed (as an optimization) to group all
> > > to-be-attached functions into groups by number of arguments, so that
> > > we can have up to 6 different trampolines tailored to actual functions
> > > being attached. So that we don't save unnecessary extra input
> > > arguments saving, which will be even more important once we allow more
> > > than 6 arguments in the future.
> > >
> > > With such logic, we should be able to split all the functions into
> > > multiple underlying trampolines, so it seems like it should be
> > > possible to also allow multiple multi-fentry programs to be attached
> > > to the same function by having a separate bpf_trampoline just for
> > > those functions. It will be just an extension of the above "just 6
> > > trampolines" strategy to "as much as we need trampolines".
> >
> > I'm probably missing something here.. say we have 2 functions with single
> > argument:
> >
> >   foo1(int a)
> >   foo2(int b)
> >
> > then having 2 programs:
> >
> >   A - attaching to foo1
> >   B - attaching to foo2
> >
> > then you need to have 2 different trampolines instead of single 'generic-1-argument-trampoline'
> 
> right, you have two different BPF progs attached to two different
> functions. You have to have 2 trampolines, not sure what's
> confusing?..

I misunderstood the statement above:

> > > practice. Alexei previously proposed (as an optimization) to group all
> > > to-be-attached functions into groups by number of arguments, so that
> > > we can have up to 6 different trampolines tailored to actual functions
> > > being attached. So that we don't save unnecessary extra input

you meant just the functions to be attached at that moment, not all of them, ok

> 
> >
> > >
> > > It's just a vague idea, sorry, I don't understand all the code yet.
> > > But the limitation outlined in one of the previous patches seems very
> > > limiting and unpleasant. I can totally see that some 24/7 running BPF
> > > tracing app uses multi-fentry for tracing a small subset of kernel
> > > functions non-stop, and then someone is trying to use bpftrace or
> > > retsnoop to trace overlapping set of functions. And it immediately
> > > fails. Very frustrating.
> >
> > so the current approach is to some extent driven by the direct ftrace
> > batch API:
> >
> >   you have ftrace_ops object and set it up with functions you want
> >   to change with calling:
> >
> >   ftrace_set_filter_ip(ops, ip1);
> >   ftrace_set_filter_ip(ops, ip2);
> >   ...
> >
> > and then register trampoline with those functions:
> >
> >   register_ftrace_direct_multi(ops, tramp_addr);
> >
> > and with this call being the expensive one (it does the actual work
> > and sync waiting), my objective was to call it just once for update
> >
> > now with 2 intersecting multi trampolines we end up with 3 functions
> > sets:
> >
> >   A - functions for first multi trampoline
> >   B - functions for second multi trampoline
> >   C - intersection of them
> >
> > each set needs different trampoline:
> >
> >   tramp A - calls program for first multi trampoline
> >   tramp B - calls program for second multi trampoline
> >   tramp C - calls both programs
> >
> > so we need to call register_ftrace_direct_multi 3 times
> 
> Yes, that's the minimal amount of trampolines you need. Calling
> register_ftrace_direct_multi() three times is not that bad at all,
> compared to calling it 1000s of times. If you are worried about 1 vs 3
> calls, I think you are over-optimizing here. I'd rather take no
> restrictions on what can be attached to what and in which sequences
> but taking 3ms vs having obscure (for uninitiated users) restrictions,
> but in some cases allowing attachment to happen in 1ms.
> 
> The goal with multi-attach is to make it decently fast when attaching
> to a lot functions, but if attachment speed is fast enough, then such
> small performance differences don't matter anymore.

true, I might have been focused on the worst possible case here ;-)

> 
> >
> > if we allow also standard trampolines being attached, it makes
> > it even more complicated and ultimatelly gets broken to
> > 1-function/1-trampoline pairs, ending up with attach speed
> > that we have now
> >
> 
> So let's make sure that we are on the same page. Let me write out an example.
> 
> Let's say we have 5 kernel functions: a, b, c, d, e. Say a, b, c all
> have 1 input args, and d and e have 2.
> 
> Now let's say we attach just normal fentry program A to function a.
> Also we attach normal fexit program E to func e.
> 
> We'll have A  attached to a with trampoline T1. We'll also have E
> attached to e with trampoline T2. Right?
> 
> And now we try to attach generic fentry (fentry.multi in your
> terminology) prog X to all 5 of them. If A and E weren't attached,
> we'd need two generic trampolines, one for a, b, c (because 1 input
> argument) and another for d,e (because 2 input arguments). But because
> we already have A and B attached, we'll end up needing 4:
> 
> T1 (1 arg)  for func a calling progs A and X
> T2 (2 args) for func e calling progs E and X
> T3 (1 arg)  for func b and c calling X
> T4 (2 args) for func d calling X

so the current code would group T3/T4 together, but if we keep
them separated, then we won't need to use the new model and
can cut off some of the code, ok

together with that args shifting we could end up with almost
untouched trampoline generation code ;-)

> 
> We can't have less than that and satisfy all the constraints. But 4 is
> not that bad. If the example has 1000s of functions, you'd still need
> between 4 and 8 trampolines (if we had 3, 4, 5, and 6 input args for
> kernel functions). That's way less than 1000s of trampolines needed
> today. And it's still fast enough on the attachment.
> 
> The good thing with what we discussed with making current trampoline
> co-exist with generic (multi) fentry/fexit, is that we'll still have
> just one trampoline, saving exactly as many input arguments as
> attached function(s) have. So at least we don't have to maintain two
> separate pieces of logic for that. Then the only added complexity
> would be breaking up all to-be-attached kernel functions into groups,
> as described in the example.
> 
> It sounds a bit more complicated in writing than it will be in
> practice, probably. I think the critical part is unification of
> trampoline to work with fentry/fexit and fentry.multi/fexit.multi
> simultaneously, which seems like you agreed above is achievable.

ok, I hadn't considered this approach, but I think it's doable

thanks,
jirka
Andrii Nakryiko Sept. 2, 2021, 4:54 p.m. UTC | #5
On Thu, Sep 2, 2021 at 5:57 AM Jiri Olsa <jolsa@redhat.com> wrote:
>
> On Wed, Sep 01, 2021 at 08:56:19PM -0700, Andrii Nakryiko wrote:
> > On Wed, Sep 1, 2021 at 8:15 AM Jiri Olsa <jolsa@redhat.com> wrote:
> > >
> > > On Tue, Aug 31, 2021 at 04:51:18PM -0700, Andrii Nakryiko wrote:
> > > > On Thu, Aug 26, 2021 at 12:41 PM Jiri Olsa <jolsa@redhat.com> wrote:
> > > > >
> > > > > When we have multi func program attached, the trampoline
> > > > > switched to the function model of the multi func program.
> > > > >
> > > > > This breaks already attached standard programs, for example
> > > > > when we attach following program:
> > > > >
> > > > >   SEC("fexit/bpf_fentry_test2")
> > > > >   int BPF_PROG(test1, int a, __u64 b, int ret)
> > > > >
> > > > > the trampoline pushes on stack args 'a' and 'b' and return
> > > > > value 'ret'.
> > > > >
> > > > > When following multi func program is attached to bpf_fentry_test2:
> > > > >
> > > > >   SEC("fexit.multi/bpf_fentry_test*")
> > > > >   int BPF_PROG(test2, __u64 a, __u64 b, __u64 c, __u64 d,
> > > > >                        __u64 e, __u64 f, int ret)
> > > > >
> > > > > the trampoline takes this program model and pushes all 6 args
> > > > > and return value on stack.
> > > > >
> > > > > But we still have the original 'test1' program attached, that
> > > > > expects 'ret' value where there's 'c' argument now:
> > > > >
> > > > >   test1(a, b, c)
> > > > >
> > > > > To fix that we simply overwrite 'c' argument with 'ret' value,
> > > > > so test1 is called as expected and test2 gets called as:
> > > > >
> > > > >   test2(a, b, ret, d, e, f, ret)
> > > > >
> > > > > which is ok, because 'c' is not defined for bpf_fentry_test2
> > > > > anyway.
> > > > >
> > > >
> > > > What if we change the order on the stack to be the return value first,
> > > > followed by input arguments. That would get us a bit closer to
> > > > unifying multi-trampoline and the normal one, right? BPF verifier
> > > > should be able to rewrite access to the last argument (i.e., return
> > > > value) for fexit programs to actually be at offset 0, and shift all
> > > > other arguments by 8 bytes. For fentry, if that helps to keep things
> > > > more aligned, we'd just skip the first 8 bytes on the stack and store
> > > > all the input arguments in the same offsets. So BPF verifier rewriting
> > > > logic stays consistent (except offset 0 will be disallowed).
> > >
> > > nice idea, with this in place we could cut that args re-arranging code
> > >
> > > >
> > > > Basically, I'm thinking how we can make normal and multi trampolines
> > > > more interoperable to remove those limitations that two
> > > > multi-trampolines can't be attached to the same function, which seems
> > > > like a pretty annoying limitation which will be easy to hit in
> > > > practice. Alexei previously proposed (as an optimization) to group all
> > > > to-be-attached functions into groups by number of arguments, so that
> > > > we can have up to 6 different trampolines tailored to actual functions
> > > > being attached. So that we don't save unnecessary extra input
> > > > arguments saving, which will be even more important once we allow more
> > > > than 6 arguments in the future.
> > > >
> > > > With such logic, we should be able to split all the functions into
> > > > multiple underlying trampolines, so it seems like it should be
> > > > possible to also allow multiple multi-fentry programs to be attached
> > > > to the same function by having a separate bpf_trampoline just for
> > > > those functions. It will be just an extension of the above "just 6
> > > > trampolines" strategy to "as much as we need trampolines".
> > >
> > > I'm probably missing something here.. say we have 2 functions with single
> > > argument:
> > >
> > >   foo1(int a)
> > >   foo2(int b)
> > >
> > > then having 2 programs:
> > >
> > >   A - attaching to foo1
> > >   B - attaching to foo2
> > >
> > > then you need to have 2 different trampolines instead of single 'generic-1-argument-trampoline'
> >
> > right, you have two different BPF progs attached to two different
> > functions. You have to have 2 trampolines, not sure what's
> > confusing?..
>
> I misunderstood the statement above:
>
> > > > practice. Alexei previously proposed (as an optimization) to group all
> > > > to-be-attached functions into groups by number of arguments, so that
> > > > we can have up to 6 different trampolines tailored to actual functions
> > > > being attached. So that we don't save unnecessary extra input
>
> you meant just functions to be attached at that moment, not all, ok
>
> >
> > >
> > > >
> > > > It's just a vague idea, sorry, I don't understand all the code yet.
> > > > But the limitation outlined in one of the previous patches seems very
> > > > limiting and unpleasant. I can totally see that some 24/7 running BPF
> > > > tracing app uses multi-fentry for tracing a small subset of kernel
> > > > functions non-stop, and then someone is trying to use bpftrace or
> > > > retsnoop to trace overlapping set of functions. And it immediately
> > > > fails. Very frustrating.
> > >
> > > so the current approach is to some extent driven by the direct ftrace
> > > batch API:
> > >
> > >   you have ftrace_ops object and set it up with functions you want
> > >   to change with calling:
> > >
> > >   ftrace_set_filter_ip(ops, ip1);
> > >   ftrace_set_filter_ip(ops, ip2);
> > >   ...
> > >
> > > and then register trampoline with those functions:
> > >
> > >   register_ftrace_direct_multi(ops, tramp_addr);
> > >
> > > and with this call being the expensive one (it does the actual work
> > > and sync waiting), my objective was to call it just once for update
> > >
> > > now with 2 intersecting multi trampolines we end up with 3 functions
> > > sets:
> > >
> > >   A - functions for first multi trampoline
> > >   B - functions for second multi trampoline
> > >   C - intersection of them
> > >
> > > each set needs different trampoline:
> > >
> > >   tramp A - calls program for first multi trampoline
> > >   tramp B - calls program for second multi trampoline
> > >   tramp C - calls both programs
> > >
> > > so we need to call register_ftrace_direct_multi 3 times
> >
> > Yes, that's the minimal amount of trampolines you need. Calling
> > register_ftrace_direct_multi() three times is not that bad at all,
> > compared to calling it 1000s of times. If you are worried about 1 vs 3
> > calls, I think you are over-optimizing here. I'd rather take no
> > restrictions on what can be attached to what and in which sequences
> > but taking 3ms vs having obscure (for uninitiated users) restrictions,
> > but in some cases allowing attachment to happen in 1ms.
> >
> > The goal with multi-attach is to make it decently fast when attaching
> > to a lot functions, but if attachment speed is fast enough, then such
> > small performance differences don't matter anymore.
>
> true, I might have been focused on the worst possible case here ;-)
>
> >
> > >
> > > if we allow also standard trampolines being attached, it makes
> > > it even more complicated and ultimatelly gets broken to
> > > 1-function/1-trampoline pairs, ending up with attach speed
> > > that we have now
> > >
> >
> > So let's make sure that we are on the same page. Let me write out an example.
> >
> > Let's say we have 5 kernel functions: a, b, c, d, e. Say a, b, c all
> > have 1 input args, and d and e have 2.
> >
> > Now let's say we attach just normal fentry program A to function a.
> > Also we attach normal fexit program E to func e.
> >
> > We'll have A  attached to a with trampoline T1. We'll also have E
> > attached to e with trampoline T2. Right?
> >
> > And now we try to attach generic fentry (fentry.multi in your
> > terminology) prog X to all 5 of them. If A and E weren't attached,
> > we'd need two generic trampolines, one for a, b, c (because 1 input
> > argument) and another for d,e (because 2 input arguments). But because
> > we already have A and B attached, we'll end up needing 4:
> >
> > T1 (1 arg)  for func a calling progs A and X
> > T2 (2 args) for func e calling progs E and X
> > T3 (1 arg)  for func b and c calling X
> > T4 (2 args) for func d calling X
>
> so current code would group T3/T4 together, but if we keep
> them separated, then we won't need to use new model and
> cut off some of the code, ok
>
> together with that args shifting we could endup with almost
> untouched trampoline generation code ;-)

exactly, and thus remove those limitations you've described

>
> >
> > We can't have less than that and satisfy all the constraints. But 4 is
> > not that bad. If the example has 1000s of functions, you'd still need
> > between 4 and 8 trampolines (if we had 3, 4, 5, and 6 input args for
> > kernel functions). That's way less than 1000s of trampolines needed
> > today. And it's still fast enough on the attachment.
> >
> > The good thing with what we discussed with making current trampoline
> > co-exist with generic (multi) fentry/fexit, is that we'll still have
> > just one trampoline, saving exactly as many input arguments as
> > attached function(s) have. So at least we don't have to maintain two
> > separate pieces of logic for that. Then the only added complexity
> > would be breaking up all to-be-attached kernel functions into groups,
> > as described in the example.
> >
> > It sounds a bit more complicated in writing than it will be in
> > practice, probably. I think the critical part is unification of
> > trampoline to work with fentry/fexit and fentry.multi/fexit.multi
> > simultaneously, which seems like you agreed above is achievable.
>
> ok, I haven't considered this way, but I think it's doable
>

awesome, give it a try!


> thanks,
> jirka
>
Alexei Starovoitov Sept. 2, 2021, 9:55 p.m. UTC | #6
On Thu, Sep 02, 2021 at 02:57:11PM +0200, Jiri Olsa wrote:
> > 
> > Let's say we have 5 kernel functions: a, b, c, d, e. Say a, b, c all
> > have 1 input args, and d and e have 2.
> > 
> > Now let's say we attach just normal fentry program A to function a.
> > Also we attach normal fexit program E to func e.
> > 
> > We'll have A  attached to a with trampoline T1. We'll also have E
> > attached to e with trampoline T2. Right?
> > 
> > And now we try to attach generic fentry (fentry.multi in your
> > terminology) prog X to all 5 of them. If A and E weren't attached,
> > we'd need two generic trampolines, one for a, b, c (because 1 input
> > argument) and another for d,e (because 2 input arguments). But because
> > we already have A and B attached, we'll end up needing 4:
> > 
> > T1 (1 arg)  for func a calling progs A and X
> > T2 (2 args) for func e calling progs E and X
> > T3 (1 arg)  for func b and c calling X
> > T4 (2 args) for func d calling X
> 
> so current code would group T3/T4 together, but if we keep
> them separated, then we won't need to use new model and
> cut off some of the code, ok

We've brainstormed this idea further with Andrii.
(thankfully we could do it in-person now ;) which saved a ton of time)

It seems the following should work:
5 kernel functions: a(int), b(long), c(void*), d(int, int), e(long, long).
fentry prog A is attached to 'a'.
fexit prog E is attached to 'e'.
multi-prog X wants to attach to all of them.
It can be achieved with 4 trampolines.

The trampolines called from funcs 'a' and 'e' can be patched to
call the A+X and E+X programs respectively.
The multi program X needs to be able to access return values
and arguments of all functions it was attached to.
We can achieve that by always generating a trampoline (both multi and normal)
with an extra constant stored on the stack. This constant is the number of
arguments served by this trampoline.
The trampoline 'a' will store nr_args=1.
The trampoline 'e' will store nr_args=2.
We need two multi trampolines.
The multi trampoline X1 will serve 'b' and 'c' and store nr_args=1,
and multi trampoline X2 will serve 'd' and store nr_args=2,
both into a hidden stack location (like ctx[-2]).

The multi prog X can look like:
int BPF_PROG(x, __u64 arg1, __u64 arg2, __u64 ret)
in which case it will read correct args and ret when called from 'd' and 'e',
and only a correct arg1 when called from 'a', 'b', 'c'.

To always correctly access arguments and the return value
the program can use two new helpers: bpf_arg(ctx, N) and bpf_ret_value(ctx).
Both will be fully inlined helpers similar to bpf_get_func_ip().
u64 bpf_arg(ctx, int n)
{
  u64 nr_args = ctx[-2]; /* that's the place where _all_ trampoline will store nr_args */
  if (n > nr_args)
    return 0;
  return ctx[n];
}
u64 bpf_ret_value(ctx)
{
  u64 nr_args = ctx[-2];
  return ctx[nr_args];
}

These helpers will be the only recommended way to access args and ret value
in multi progs.
The nice advantage is that normal fentry/fexit progs can use them too.
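
e.g. a multi prog could then do something like this (just a sketch, the
helpers are the proposal above and don't exist yet, and I'm assuming
1-based arg numbering):

  SEC("fexit.multi/bpf_fentry_test*")
  int BPF_PROG(test_multi)
  {
          /* proposed helpers, inlined by the verifier */
          __u64 arg1 = bpf_arg(ctx, 1);     /* 0 if the traced func has no args */
          __u64 ret  = bpf_ret_value(ctx);  /* return value of the traced func */

          bpf_printk("arg1 %llu ret %llu", arg1, ret);
          return 0;
  }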

We can rearrange ctx[-1] /* func_ip */ and ctx[-2] /* nr_args */
if it makes things easier.

If a multi prog knows that it is attaching to 100 kernel functions
and all of them have 2 arguments, it can still do
int BPF_PROG(x, __u64 arg1, __u64 arg2, __u64 ret)
{ // access arg1, arg2, ret directly
and it will work correctly.

We can make it really strict in the verifier and disallow such
direct access to args from the multi prog and only allow
access via the bpf_arg/bpf_ret_value helpers, but I think it's overkill.
Reading garbage values from the stack isn't great, but it's not a safety issue.
It means that the verifier will allow something like 16 u64 args
in a multi program. It cannot allow a large number, since ctx[1024]
might become a safety issue, while ctx[4] could be garbage
or a valid value depending on the call site.

Thoughts?
Jiri Olsa Sept. 3, 2021, 9:50 a.m. UTC | #7
On Thu, Sep 02, 2021 at 02:55:38PM -0700, Alexei Starovoitov wrote:
> On Thu, Sep 02, 2021 at 02:57:11PM +0200, Jiri Olsa wrote:
> > > 
> > > Let's say we have 5 kernel functions: a, b, c, d, e. Say a, b, c all
> > > have 1 input args, and d and e have 2.
> > > 
> > > Now let's say we attach just normal fentry program A to function a.
> > > Also we attach normal fexit program E to func e.
> > > 
> > > We'll have A  attached to a with trampoline T1. We'll also have E
> > > attached to e with trampoline T2. Right?
> > > 
> > > And now we try to attach generic fentry (fentry.multi in your
> > > terminology) prog X to all 5 of them. If A and E weren't attached,
> > > we'd need two generic trampolines, one for a, b, c (because 1 input
> > > argument) and another for d,e (because 2 input arguments). But because
> > > we already have A and B attached, we'll end up needing 4:
> > > 
> > > T1 (1 arg)  for func a calling progs A and X
> > > T2 (2 args) for func e calling progs E and X
> > > T3 (1 arg)  for func b and c calling X
> > > T4 (2 args) for func d calling X
> > 
> > so current code would group T3/T4 together, but if we keep
> > them separated, then we won't need to use new model and
> > cut off some of the code, ok
> 
> We've brainstormed this idea further with Andrii.
> (thankfully we could do it in-person now ;) which saved a ton of time)
> 
> It seems the following should work:
> 5 kernel functions: a(int), b(long), c(void*), d(int, int), e(long, long).
> fentry prog A is attached to 'a'.
> fexit prog E is attached to 'e'.
> multi-prog X wants to attach to all of them.
> It can be achieved with 4 trampolines.
> 
> The trampolines called from funcs 'a' and 'e' can be patched to
> call A+X and E+X programs correspondingly.
> The multi program X needs to be able to access return values
> and arguments of all functions it was attached to.
> We can achieve that by always generating a trampoline (both multi and normal)
> with extra constant stored in the stack. This constant is the number of
> arguments served by this trampoline.
> The trampoline 'a' will store nr_args=1.
> The tramopline 'e' will store nr_args=2.
> We need two multi trampolines.
> The multi tramopline X1 that will serve 'b' and 'c' and store nr_args=1
> and multi-tramopline X2 that will serve 'd' and store nr_args=2
> into hidden stack location (like ctx[-2]).
> 
> The multi prog X can look like:
> int BPF_PROG(x, __u64 arg1, __u64 arg2, __u64 ret)
> in such case it will read correct args and ret when called from 'd' and 'e'
> and only correct arg1 when called from 'a', 'b', 'c'.
> 
> To always correctly access arguments and the return value
> the program can use two new helpers: bpf_arg(ctx, N) and bpf_ret_value(ctx).
> Both will be fully inlined helpers similar to bpf_get_func_ip().
> u64 bpf_arg(ctx, int n)
> {
>   u64 nr_args = ctx[-2]; /* that's the place where _all_ trampoline will store nr_args */
>   if (n > nr_args)
>     return 0;
>   return ctx[n];
> }
> u64 bpf_ret_value(ctx)
> {
>   u64 nr_args = ctx[-2];
>   return ctx[nr_args];
> }

ok, this is much better than rewiring args access in the verifier

> 
> These helpers will be the only recommended way to access args and ret value
> in multi progs.
> The nice advantage is that normal fentry/fexit progs can use them too.
> 
> We can rearrange ctx[-1] /* func_ip */ and ctx[-2] /* nr_args */
> if it makes things easier.

so nr_args will be there all the time, while func_ip is optional
at the moment (based on the presence of the get_func_ip helper in the program),
so we can either switch that:

   func_ip in ctx[-2]
   nr_args in ctx[-1]

or make func_ip not optional to avoid confusion

I think pushing func_ip to ctx[-2] is ok

> 
> If multi prog knows that it is attaching to 100 kernel functions
> and all of them have 2 arguments it can still do
> int BPF_PROG(x, __u64 arg1, __u64 arg2, __u64 ret)
> { // access arg1, arg2, ret directly
> and it will work correctly.

ok, it's the user's decision, because at load time we don't know the
functions it will be attached to, so the verifier can't do anything

> 
> We can make it really strict in the verifier and disallow such
> direct access to args from the multi prog and only allow
> access via bpf_arg/bpf_ret_value helpers, but I think it's overkill.
> Reading garbage values from stack isn't great, but it's not a safety issue.

we could also check it at attach time and forbid the attachment if there
are attach functions with different nr_args and the program does not use
the arg helpers


> It means that the verifier will allow something like 16 u64-s args
> in multi program. It cannot allow large number, since ctx[1024]
> might become a safety issue, while ctx[4] could be a garbage
> or a valid value depending on the call site.
> 
> Thoughts?
> 

looks good, thanks for solving this ;-)

jirka

Patch

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 9f31197780ae..3f7911a92d62 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1744,7 +1744,7 @@  static void restore_regs(const struct btf_func_model *m, u8 **prog, int nr_args,
 }
 
 static int invoke_bpf_prog(const struct btf_func_model *m, u8 **pprog,
-			   struct bpf_prog *p, int stack_size, bool mod_ret)
+			   struct bpf_prog *p, int stack_size, bool mod_ret, int args_off)
 {
 	u8 *prog = *pprog;
 	u8 *jmp_insn;
@@ -1780,9 +1780,14 @@  static int invoke_bpf_prog(const struct btf_func_model *m, u8 **pprog,
 	/* BPF_TRAMP_MODIFY_RETURN trampolines can modify the return
 	 * of the previous call which is then passed on the stack to
 	 * the next BPF program.
+	 * Store the return value also to original args' end in case
+	 * we have multi func programs in trampoline.
 	 */
-	if (mod_ret)
+	if (mod_ret) {
 		emit_stx(&prog, BPF_DW, BPF_REG_FP, BPF_REG_0, -8);
+		if (args_off)
+			emit_stx(&prog, BPF_DW, BPF_REG_FP, BPF_REG_0, -args_off);
+	}
 
 	/* replace 2 nops with JE insn, since jmp target is known */
 	jmp_insn[0] = X86_JE;
@@ -1834,7 +1839,7 @@  static int invoke_bpf(const struct btf_func_model *m, u8 **pprog,
 	u8 *prog = *pprog;
 
 	for (i = 0; i < tp->nr_progs; i++) {
-		if (invoke_bpf_prog(m, &prog, tp->progs[i], stack_size, false))
+		if (invoke_bpf_prog(m, &prog, tp->progs[i], stack_size, false, 0))
 			return -EINVAL;
 	}
 	*pprog = prog;
@@ -1843,7 +1848,7 @@  static int invoke_bpf(const struct btf_func_model *m, u8 **pprog,
 
 static int invoke_bpf_mod_ret(const struct btf_func_model *m, u8 **pprog,
 			      struct bpf_tramp_progs *tp, int stack_size,
-			      u8 **branches)
+			      u8 **branches, int args_off)
 {
 	u8 *prog = *pprog;
 	int i;
@@ -1853,8 +1858,15 @@  static int invoke_bpf_mod_ret(const struct btf_func_model *m, u8 **pprog,
 	 */
 	emit_mov_imm32(&prog, false, BPF_REG_0, 0);
 	emit_stx(&prog, BPF_DW, BPF_REG_FP, BPF_REG_0, -8);
+
+	/* Store the return value also to original args' end in case
+	 * we have multi func programs in trampoline.
+	 */
+	if (args_off)
+		emit_stx(&prog, BPF_DW, BPF_REG_FP, BPF_REG_0, -args_off);
+
 	for (i = 0; i < tp->nr_progs; i++) {
-		if (invoke_bpf_prog(m, &prog, tp->progs[i], stack_size, true))
+		if (invoke_bpf_prog(m, &prog, tp->progs[i], stack_size, true, args_off))
 			return -EINVAL;
 
 		/* mod_ret prog stored return value into [rbp - 8]. Emit:
@@ -1942,7 +1954,7 @@  int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *i
 				struct bpf_tramp_progs *tprogs,
 				void *orig_call)
 {
-	int ret, i, nr_args = m->nr_args;
+	int ret, i, args_off = 0, nr_args = m->nr_args;
 	int stack_size = nr_args * 8;
 	struct bpf_tramp_progs *fentry = &tprogs[BPF_TRAMP_FENTRY];
 	struct bpf_tramp_progs *fexit = &tprogs[BPF_TRAMP_FEXIT];
@@ -1958,6 +1970,13 @@  int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *i
 	    (flags & BPF_TRAMP_F_SKIP_FRAME))
 		return -EINVAL;
 
+	/* if m->nr_args_orig != 0, then we have multi prog model and
+	 * we need to also store return value at the end of standard
+	 * trampoline's arguments
+	 */
+	if (m->nr_args_orig && m->nr_args > m->nr_args_orig)
+		args_off = (m->nr_args - m->nr_args_orig) * 8 + 8;
+
 	if (flags & BPF_TRAMP_F_CALL_ORIG)
 		stack_size += 8; /* room for return value of orig_call */
 
@@ -2015,7 +2034,7 @@  int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *i
 			return -ENOMEM;
 
 		if (invoke_bpf_mod_ret(m, &prog, fmod_ret, stack_size,
-				       branches)) {
+				       branches, args_off)) {
 			ret = -EINVAL;
 			goto cleanup;
 		}
@@ -2036,6 +2055,13 @@  int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *i
 		}
 		/* remember return value in a stack for bpf prog to access */
 		emit_stx(&prog, BPF_DW, BPF_REG_FP, BPF_REG_0, -8);
+
+		/* store return value also to original args' end in case we have
+		 * multi func programs in trampoline
+		 */
+		if (args_off)
+			emit_stx(&prog, BPF_DW, BPF_REG_FP, BPF_REG_0, -args_off);
+
 		im->ip_after_call = prog;
 		memcpy(prog, x86_nops[5], X86_PATCH_SIZE);
 		prog += X86_PATCH_SIZE;
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 3ce4656e2057..373f45ae7dce 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -563,6 +563,7 @@  struct btf_func_model {
 	u8 ret_size;
 	u8 nr_args;
 	u8 arg_size[MAX_BPF_FUNC_ARGS];
+	u8 nr_args_orig;
 };
 
 /* Restore arguments before returning from trampoline to let original function
diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
index 6ff5c2512f91..308d58e698be 100644
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -398,6 +398,7 @@  static bool needs_multi_model(struct bpf_trampoline *tr, struct btf_func_model *
 	if (tr->func.model.nr_args >= multi->func.model.nr_args)
 		return false;
 	memcpy(new, &multi->func.model, sizeof(*new));
+	new->nr_args_orig = tr->func.model.nr_args;
 	return true;
 }