[bpf-next,v2,0/4] Add ftrace direct call for arm64

Message ID 20220913162732.163631-1-xukuohai@huaweicloud.com

Message

Xu Kuohai Sept. 13, 2022, 4:27 p.m. UTC
This series adds ftrace direct call for arm64, which is required to attach
bpf trampoline to fentry.

Although there is no agreement on how to support ftrace direct call on arm64,
no patch has been posted except the one I posted in [1], so this series
continues the work of [1] with the addition of long jump support. Now ftrace
direct call works regardless of the distance between the callsite and custom
trampoline.

[1] https://lore.kernel.org/bpf/20220518131638.3401509-2-xukuohai@huawei.com/

v2:
- Fix compile and runtime errors caused by ftrace_rec_arch_init

v1: https://lore.kernel.org/bpf/20220913063146.74750-1-xukuohai@huaweicloud.com/

Xu Kuohai (4):
  ftrace: Allow users to disable ftrace direct call
  arm64: ftrace: Support long jump for ftrace direct call
  arm64: ftrace: Add ftrace direct call support
  ftrace: Fix dead loop caused by direct call in ftrace selftest

 arch/arm64/Kconfig                |   2 +
 arch/arm64/Makefile               |   4 +
 arch/arm64/include/asm/ftrace.h   |  35 ++++--
 arch/arm64/include/asm/patching.h |   2 +
 arch/arm64/include/asm/ptrace.h   |   6 +-
 arch/arm64/kernel/asm-offsets.c   |   1 +
 arch/arm64/kernel/entry-ftrace.S  |  39 ++++--
 arch/arm64/kernel/ftrace.c        | 198 ++++++++++++++++++++++++++++--
 arch/arm64/kernel/patching.c      |  14 +++
 arch/arm64/net/bpf_jit_comp.c     |   4 +
 include/linux/ftrace.h            |   2 +
 kernel/trace/Kconfig              |   7 +-
 kernel/trace/ftrace.c             |   9 +-
 kernel/trace/trace_selftest.c     |   2 +
 14 files changed, 296 insertions(+), 29 deletions(-)

Comments

Daniel Borkmann Sept. 22, 2022, 6:01 p.m. UTC | #1
On 9/13/22 6:27 PM, Xu Kuohai wrote:
> This series adds ftrace direct call for arm64, which is required to attach
> bpf trampoline to fentry.
> 
> Although there is no agreement on how to support ftrace direct call on arm64,
> no patch has been posted except the one I posted in [1], so this series
> continues the work of [1] with the addition of long jump support. Now ftrace
> direct call works regardless of the distance between the callsite and custom
> trampoline.
> 
> [1] https://lore.kernel.org/bpf/20220518131638.3401509-2-xukuohai@huawei.com/
> 
> v2:
> - Fix compile and runtime errors caused by ftrace_rec_arch_init
> 
> v1: https://lore.kernel.org/bpf/20220913063146.74750-1-xukuohai@huaweicloud.com/
> 
> Xu Kuohai (4):
>    ftrace: Allow users to disable ftrace direct call
>    arm64: ftrace: Support long jump for ftrace direct call
>    arm64: ftrace: Add ftrace direct call support
>    ftrace: Fix dead loop caused by direct call in ftrace selftest

Given there's just a tiny fraction touching BPF JIT and most are around core arm64,
it probably makes sense that this series goes via Catalin/Will through arm64 tree
instead of bpf-next if it looks good to them. Catalin/Will, thoughts (Ack + bpf-next
could work too, but I'd presume this just results in merge conflicts)?

>   arch/arm64/Kconfig                |   2 +
>   arch/arm64/Makefile               |   4 +
>   arch/arm64/include/asm/ftrace.h   |  35 ++++--
>   arch/arm64/include/asm/patching.h |   2 +
>   arch/arm64/include/asm/ptrace.h   |   6 +-
>   arch/arm64/kernel/asm-offsets.c   |   1 +
>   arch/arm64/kernel/entry-ftrace.S  |  39 ++++--
>   arch/arm64/kernel/ftrace.c        | 198 ++++++++++++++++++++++++++++--
>   arch/arm64/kernel/patching.c      |  14 +++
>   arch/arm64/net/bpf_jit_comp.c     |   4 +
>   include/linux/ftrace.h            |   2 +
>   kernel/trace/Kconfig              |   7 +-
>   kernel/trace/ftrace.c             |   9 +-
>   kernel/trace/trace_selftest.c     |   2 +
>   14 files changed, 296 insertions(+), 29 deletions(-)

Thanks,
Daniel
Catalin Marinas Sept. 26, 2022, 2:40 p.m. UTC | #2
On Thu, Sep 22, 2022 at 08:01:16PM +0200, Daniel Borkmann wrote:
> On 9/13/22 6:27 PM, Xu Kuohai wrote:
> > This series adds ftrace direct call for arm64, which is required to attach
> > bpf trampoline to fentry.
> > 
> > Although there is no agreement on how to support ftrace direct call on arm64,
> > no patch has been posted except the one I posted in [1], so this series
> > continues the work of [1] with the addition of long jump support. Now ftrace
> > direct call works regardless of the distance between the callsite and custom
> > trampoline.
> > 
> > [1] https://lore.kernel.org/bpf/20220518131638.3401509-2-xukuohai@huawei.com/
> > 
> > v2:
> > - Fix compile and runtime errors caused by ftrace_rec_arch_init
> > 
> > v1: https://lore.kernel.org/bpf/20220913063146.74750-1-xukuohai@huaweicloud.com/
> > 
> > Xu Kuohai (4):
> >    ftrace: Allow users to disable ftrace direct call
> >    arm64: ftrace: Support long jump for ftrace direct call
> >    arm64: ftrace: Add ftrace direct call support
> >    ftrace: Fix dead loop caused by direct call in ftrace selftest
> 
> Given there's just a tiny fraction touching BPF JIT and most are around core arm64,
> it probably makes sense that this series goes via Catalin/Will through arm64 tree
> instead of bpf-next if it looks good to them. Catalin/Will, thoughts (Ack + bpf-next
> could work too, but I'd presume this just results in merge conflicts)?

I think it makes sense for the series to go via the arm64 tree but I'd
like Mark to have a look at the ftrace changes first.

Thanks.
Mark Rutland Sept. 26, 2022, 5:43 p.m. UTC | #3
On Mon, Sep 26, 2022 at 03:40:20PM +0100, Catalin Marinas wrote:
> On Thu, Sep 22, 2022 at 08:01:16PM +0200, Daniel Borkmann wrote:
> > On 9/13/22 6:27 PM, Xu Kuohai wrote:
> > > This series adds ftrace direct call for arm64, which is required to attach
> > > bpf trampoline to fentry.
> > > 
> > > Although there is no agreement on how to support ftrace direct call on arm64,
> > > no patch has been posted except the one I posted in [1], so this series
> > > continues the work of [1] with the addition of long jump support. Now ftrace
> > > direct call works regardless of the distance between the callsite and custom
> > > trampoline.
> > > 
> > > [1] https://lore.kernel.org/bpf/20220518131638.3401509-2-xukuohai@huawei.com/
> > > 
> > > v2:
> > > - Fix compile and runtime errors caused by ftrace_rec_arch_init
> > > 
> > > v1: https://lore.kernel.org/bpf/20220913063146.74750-1-xukuohai@huaweicloud.com/
> > > 
> > > Xu Kuohai (4):
> > >    ftrace: Allow users to disable ftrace direct call
> > >    arm64: ftrace: Support long jump for ftrace direct call
> > >    arm64: ftrace: Add ftrace direct call support
> > >    ftrace: Fix dead loop caused by direct call in ftrace selftest
> > 
> > Given there's just a tiny fraction touching BPF JIT and most are around core arm64,
> > it probably makes sense that this series goes via Catalin/Will through arm64 tree
> > instead of bpf-next if it looks good to them. Catalin/Will, thoughts (Ack + bpf-next
> > could work too, but I'd presume this just results in merge conflicts)?
> 
> I think it makes sense for the series to go via the arm64 tree but I'd
> like Mark to have a look at the ftrace changes first.

From a quick scan, I still don't think this is quite right, and as it stands I
believe this will break backtracing (as the instructions before the function
entry point will not be symbolized correctly, getting in the way of
RELIABLE_STACKTRACE). I think I was insufficiently clear with my earlier
feedback there, as I have a mechanism in mind that was a little simpler.

I'll try to reply with some more detail tomorrow, but I don't think this is the
right approach, and as mentioned previously (and e.g. at LPC) I'd strongly
prefer to *not* implement direct calls, so that we can have more consistent
entry/exit handling.

Thanks,
Mark.
Xu Kuohai Sept. 27, 2022, 4:49 a.m. UTC | #4
On 9/27/2022 1:43 AM, Mark Rutland wrote:
> On Mon, Sep 26, 2022 at 03:40:20PM +0100, Catalin Marinas wrote:
>> On Thu, Sep 22, 2022 at 08:01:16PM +0200, Daniel Borkmann wrote:
>>> On 9/13/22 6:27 PM, Xu Kuohai wrote:
>>>> This series adds ftrace direct call for arm64, which is required to attach
>>>> bpf trampoline to fentry.
>>>>
>>>> Although there is no agreement on how to support ftrace direct call on arm64,
>>>> no patch has been posted except the one I posted in [1], so this series
>>>> continues the work of [1] with the addition of long jump support. Now ftrace
>>>> direct call works regardless of the distance between the callsite and custom
>>>> trampoline.
>>>>
>>>> [1] https://lore.kernel.org/bpf/20220518131638.3401509-2-xukuohai@huawei.com/
>>>>
>>>> v2:
>>>> - Fix compile and runtime errors caused by ftrace_rec_arch_init
>>>>
>>>> v1: https://lore.kernel.org/bpf/20220913063146.74750-1-xukuohai@huaweicloud.com/
>>>>
>>>> Xu Kuohai (4):
>>>>     ftrace: Allow users to disable ftrace direct call
>>>>     arm64: ftrace: Support long jump for ftrace direct call
>>>>     arm64: ftrace: Add ftrace direct call support
>>>>     ftrace: Fix dead loop caused by direct call in ftrace selftest
>>>
>>> Given there's just a tiny fraction touching BPF JIT and most are around core arm64,
>>> it probably makes sense that this series goes via Catalin/Will through arm64 tree
>>> instead of bpf-next if it looks good to them. Catalin/Will, thoughts (Ack + bpf-next
>>> could work too, but I'd presume this just results in merge conflicts)?
>>
>> I think it makes sense for the series to go via the arm64 tree but I'd
>> like Mark to have a look at the ftrace changes first.
> 
> From a quick scan, I still don't think this is quite right, and as it stands I
> believe this will break backtracing (as the instructions before the function
> entry point will not be symbolized correctly, getting in the way of
> RELIABLE_STACKTRACE). I think I was insufficiently clear with my earlier
> feedback there, as I have a mechanism in mind that wa a little simpler.
> 

Thanks for the review. I have some thoughts about reliable stacktrace.

If PC is not in the range of literal_call, stacktrace works as before without
changes.

If PC is in the range of literal_call, for example when interrupted by an
irq, I think there are 2 problems:

1. Caller LR is not pushed to the stack yet, so caller's address and name
    will be missing from the backtrace.

2. Since PC is not in func's address range, no symbol name will be found, so
    func name is also missing.

Problem 1 is not introduced by this patchset, but the probability of it occurring
may be increased by this patchset. I think this problem should be addressed by
a reliable stacktrace scheme, such as ORC on x86.

Problem 2 is indeed introduced by this patchset. I think there are at least 3
ways to deal with it:

1. Add a symbol name for literal_call.

2. Hack the backtrace routine: if no symbol name is found for a PC during backtrace,
    we can check whether the PC is in literal_call, then adjust the PC and try again.

3. Move literal_call to the func's address range, for example:

         a. Compile with -fpatchable-function-entry=7
         func:
                 BTI C
                 NOP
                 NOP
                 NOP
                 NOP
                 NOP
                 NOP
                 NOP
         func_body:
                 ...


         b. When disabled, patch it to
         func:
                 BTI C
                 B func_body
         literal:
                 .quad dummy_tramp
         literal_call:
                 LDR X16, literal
                 MOV X9, LR
                 BLR X16
         func_body:
                 ...


         c. When enabled and target is out-of-range, patch it to
         func:
                 BTI C
                 B literal_call
         literal:
                 .quad custom_trampoline
         literal_call:
                 LDR X16, literal
                 MOV X9, LR
                 BLR X16
         func_body:
                 ...


         d. When enabled and target is in range, patch it to
         func:
                 BTI C
                 B direct_call
         literal:
                 .quad dummy_tramp
                 LDR X16, literal
         direct_call:
                 MOV X9, LR
                 BL custom_trampoline
         func_body:
                 ...


> I'll try to reply with some more detail tomorrow, but I don't think this is the
> right approach, and as mentioned previously (and e.g. at LPC) I'd strongly
> prefer to *not* implement direct calls, so that we can have more consistent
> entry/exit handling.
> 
> Thanks,
> Mark.
> .
Mark Rutland Sept. 28, 2022, 4:42 p.m. UTC | #5
On Tue, Sep 27, 2022 at 12:49:58PM +0800, Xu Kuohai wrote:
> On 9/27/2022 1:43 AM, Mark Rutland wrote:
> > On Mon, Sep 26, 2022 at 03:40:20PM +0100, Catalin Marinas wrote:
> > > On Thu, Sep 22, 2022 at 08:01:16PM +0200, Daniel Borkmann wrote:
> > > > On 9/13/22 6:27 PM, Xu Kuohai wrote:
> > > > > This series adds ftrace direct call for arm64, which is required to attach
> > > > > bpf trampoline to fentry.
> > > > > 
> > > > > Although there is no agreement on how to support ftrace direct call on arm64,
> > > > > no patch has been posted except the one I posted in [1], so this series
> > > > > continues the work of [1] with the addition of long jump support. Now ftrace
> > > > > direct call works regardless of the distance between the callsite and custom
> > > > > trampoline.
> > > > > 
> > > > > [1] https://lore.kernel.org/bpf/20220518131638.3401509-2-xukuohai@huawei.com/
> > > > > 
> > > > > v2:
> > > > > - Fix compile and runtime errors caused by ftrace_rec_arch_init
> > > > > 
> > > > > v1: https://lore.kernel.org/bpf/20220913063146.74750-1-xukuohai@huaweicloud.com/
> > > > > 
> > > > > Xu Kuohai (4):
> > > > >     ftrace: Allow users to disable ftrace direct call
> > > > >     arm64: ftrace: Support long jump for ftrace direct call
> > > > >     arm64: ftrace: Add ftrace direct call support
> > > > >     ftrace: Fix dead loop caused by direct call in ftrace selftest
> > > > 
> > > > Given there's just a tiny fraction touching BPF JIT and most are around core arm64,
> > > > it probably makes sense that this series goes via Catalin/Will through arm64 tree
> > > > instead of bpf-next if it looks good to them. Catalin/Will, thoughts (Ack + bpf-next
> > > > could work too, but I'd presume this just results in merge conflicts)?
> > > 
> > > I think it makes sense for the series to go via the arm64 tree but I'd
> > > like Mark to have a look at the ftrace changes first.
> > 
> > From a quick scan, I still don't think this is quite right, and as it stands I
> > believe this will break backtracing (as the instructions before the function
> > entry point will not be symbolized correctly, getting in the way of
> > RELIABLE_STACKTRACE). I think I was insufficiently clear with my earlier
> > feedback there, as I have a mechanism in mind that wa a little simpler.
> 
> Thanks for the review. I have some thoughts about reliable stacktrace.
> 
> If PC is not in the range of literal_call, stacktrace works as before without
> changes.
> 
> If PC is in the range of literal_call, for example, interrupted by an
> irq, I think there are 2 problems:
> 
> 1. Caller LR is not pushed to the stack yet, so caller's address and name
>    will be missing from the backtrace.
> 
> 2. Since PC is not in func's address range, no symbol name will be found, so
>    func name is also missing.
> 
> Problem 1 is not introduced by this patchset, but the occurring probability
> may be increased by this patchset. I think this problem should be addressed by
> a reliable stacktrace scheme, such as ORC on x86.

I agree problem 1 is not introduced by this patch set; I have plans for how to
address that for reliable stacktrace based on identifying the ftrace
trampoline. This is one of the reasons I do not want direct calls, as
identifying all direct call trampolines is going to be very painful and slow,
whereas identifying a statically allocated ftrace trampoline is far simpler.
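
To illustrate the distinction (a sketch only; the symbol names below are
assumptions, not real arm64 symbols or existing code): with a single
statically allocated trampoline, a reliable unwinder can classify an
interrupted PC with one range check, whereas with direct calls it would have
to track every trampoline that is currently live.

	/* Illustrative sketch; ftrace_tramp_start/end are assumed markers. */
	extern char ftrace_tramp_start[], ftrace_tramp_end[];

	static bool pc_in_static_ftrace_trampoline(unsigned long pc)
	{
		return pc >= (unsigned long)ftrace_tramp_start &&
		       pc <  (unsigned long)ftrace_tramp_end;
	}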

> Problem 2 is indeed introduced by this patchset. I think there are at least 3
> ways to deal with it:

What I would like to do here, as mentioned previously in other threads, is to
avoid direct calls, and implement "FTRACE_WITH_OPS", where we can associate
each patch-site with a specific set of ops, and invoke that directly from the
regular ftrace trampoline.

With that, the patch site would look like:

	pre_func_literal:
		NOP		// Patched to a pointer to 
		NOP		// ftrace_ops
	func:
		< optional BTI here >
		NOP		// Patched to MOV X9, LR
		NOP		// Patched to a BL to the ftrace trampoline

... then in the ftrace trampoline we can recover the ops pointer at a fixed
negative offset from the LR, and invoke the ops from there (passing a
struct ftrace_regs with the saved regs).

That way the patch-site is less significantly affected, and there's no impact
to backtracing. That gets most of the benefit of the direct calls avoiding the
ftrace ops list traversal, without having to do anything special at all. That
should be much easier to maintain, too.
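
For illustration, a minimal sketch of that ops-recovery step (assuming the
layout above with no BTI landing pad; recover_ops_from_lr() is a made-up
name, not code from the branches below):

	/*
	 * Sketch only: the BL in the last patched NOP leaves the LR at
	 * func + 8, and the two NOPs holding the ftrace_ops pointer start
	 * at func - 8, i.e. 16 bytes before the LR. A BTI C at func would
	 * shift this by one instruction.
	 */
	struct ftrace_ops;

	static inline struct ftrace_ops *recover_ops_from_lr(unsigned long lr)
	{
		return *(struct ftrace_ops **)(lr - 16);
	}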

I started implementing that before LPC (and you can find some branches on my
kernel.org repo), but I haven't yet had the time to rebase those and sort out
the remaining issues:

  https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/per-callsite-ops

Note that as a prerequisite for that I also want to reduce the set of registers
we save/restore down to the set required by our calling convention, as the
existing pt_regs is both large and generally unsound (since we can not and do
not fill in many of the fields we only acquire at an exception boundary).
That'll further reduce the ftrace overhead generally, and remove the need for
the two trampolines we currently have. I have a WIP at:

  https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/minimal-regs

I intend to get back to both of those shortly (along with some related bits for
kretprobes and stacktracing); I just haven't had much time recently due to
other work and illness.

> 1. Add a symbol name for literal_call.

That'll require a number of invasive changes to make RELIABLE_STACKTRACE work,
so I don't think we want to do that.

> 2. Hack the backtrace routine, if no symbol name found for a PC during backtrace,
>    we can check if the PC is in literal_call, then adjust PC and try again.

The problem is that the existing symbolization code doesn't know the length of
the prior symbol, so it will find *some* symbol associated with the previous
function rather than finding no symbol.

To bodge around this we'd need to special-case each patchable-function-entry
site in symbolization, which is going to be painful and slow down unwinding
unless we try to fix this up at boot-time or compile time.

> 3. Move literal_call to the func's address range, for example:
> 
>         a. Compile with -fpatchable-function-entry=7
>         func:
>                 BTI C
>                 NOP
>                 NOP
>                 NOP
>                 NOP
>                 NOP
>                 NOP
>                 NOP

This is a non-starter. We are not going to add 7 NOPs at the start of every
function.

Thanks,
Mark.
Xu Kuohai Sept. 30, 2022, 4:07 a.m. UTC | #6
On 9/29/2022 12:42 AM, Mark Rutland wrote:
> On Tue, Sep 27, 2022 at 12:49:58PM +0800, Xu Kuohai wrote:
>> On 9/27/2022 1:43 AM, Mark Rutland wrote:
>>> On Mon, Sep 26, 2022 at 03:40:20PM +0100, Catalin Marinas wrote:
>>>> On Thu, Sep 22, 2022 at 08:01:16PM +0200, Daniel Borkmann wrote:
>>>>> On 9/13/22 6:27 PM, Xu Kuohai wrote:
>>>>>> This series adds ftrace direct call for arm64, which is required to attach
>>>>>> bpf trampoline to fentry.
>>>>>>
>>>>>> Although there is no agreement on how to support ftrace direct call on arm64,
>>>>>> no patch has been posted except the one I posted in [1], so this series
>>>>>> continues the work of [1] with the addition of long jump support. Now ftrace
>>>>>> direct call works regardless of the distance between the callsite and custom
>>>>>> trampoline.
>>>>>>
>>>>>> [1] https://lore.kernel.org/bpf/20220518131638.3401509-2-xukuohai@huawei.com/
>>>>>>
>>>>>> v2:
>>>>>> - Fix compile and runtime errors caused by ftrace_rec_arch_init
>>>>>>
>>>>>> v1: https://lore.kernel.org/bpf/20220913063146.74750-1-xukuohai@huaweicloud.com/
>>>>>>
>>>>>> Xu Kuohai (4):
>>>>>>      ftrace: Allow users to disable ftrace direct call
>>>>>>      arm64: ftrace: Support long jump for ftrace direct call
>>>>>>      arm64: ftrace: Add ftrace direct call support
>>>>>>      ftrace: Fix dead loop caused by direct call in ftrace selftest
>>>>>
>>>>> Given there's just a tiny fraction touching BPF JIT and most are around core arm64,
>>>>> it probably makes sense that this series goes via Catalin/Will through arm64 tree
>>>>> instead of bpf-next if it looks good to them. Catalin/Will, thoughts (Ack + bpf-next
>>>>> could work too, but I'd presume this just results in merge conflicts)?
>>>>
>>>> I think it makes sense for the series to go via the arm64 tree but I'd
>>>> like Mark to have a look at the ftrace changes first.
>>>
>>> From a quick scan, I still don't think this is quite right, and as it stands I
>>> believe this will break backtracing (as the instructions before the function
>>> entry point will not be symbolized correctly, getting in the way of
>>> RELIABLE_STACKTRACE). I think I was insufficiently clear with my earlier
>>> feedback there, as I have a mechanism in mind that wa a little simpler.
>>
>> Thanks for the review. I have some thoughts about reliable stacktrace.
>>
>> If PC is not in the range of literal_call, stacktrace works as before without
>> changes.
>>
>> If PC is in the range of literal_call, for example, interrupted by an
>> irq, I think there are 2 problems:
>>
>> 1. Caller LR is not pushed to the stack yet, so caller's address and name
>>     will be missing from the backtrace.
>>
>> 2. Since PC is not in func's address range, no symbol name will be found, so
>>     func name is also missing.
>>
>> Problem 1 is not introduced by this patchset, but the occurring probability
>> may be increased by this patchset. I think this problem should be addressed by
>> a reliable stacktrace scheme, such as ORC on x86.
> 
> I agree problem 1 is not introduced by this patch set; I have plans fo how to
> address that for reliable stacktrace based on identifying the ftrace
> trampoline. This is one of the reasons I do not want direct calls, as
> identifying all direct call trampolines is going to be very painful and slow,
> whereas identifying a statically allocated ftrace trampoline is far simpler.
> 
>> Problem 2 is indeed introduced by this patchset. I think there are at least 3
>> ways to deal with it:
> 
> What I would like to do here, as mentioned previously in other threads, is to
> avoid direct calls, and implement "FTRACE_WITH_OPS", where we can associate
> each patch-site with a specific set of ops, and invoke that directly from the
> regular ftrace trampoline.
> 
> With that, the patch site would look like:
> 
> 	pre_func_literal:
> 		NOP		// Patched to a pointer to
> 		NOP		// ftrace_ops
> 	func:
> 		< optional BTI here >
> 		NOP		// Patched to MOV X9, LR
> 		NOP		// Patched to a BL to the ftrace trampoline
> 
> ... then in the ftrace trampoline we can recover the ops pointer at a negative
> offset from the LR based on the LR, and invoke the ops from there (passing a
> struct ftrace_regs with the saved regs).
> 
> That way the patch-site is less significantly affected, and there's no impact
> to backtracing. That gets most of the benefit of the direct calls avoiding the
> ftrace ops list traversal, without having to do anything special at all. That
> should be much easier to maintain, too.
> 
> I started implementing that before LPC (and you can find some branches on my
> kernel.org repo), but I haven't yet had the time to rebase those and sort out
> the remaining issues:
> 
>    https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/per-callsite-ops
>

I've read this code before, but it doesn't run and since you haven't updated
it, I assumed you dropped it :(

This approach seems appropriate for dynamic ftrace trampolines, but I think
there are two more issues for bpf.

1. The bpf trampoline was designed to be called directly from fentry (located in
    a kernel function or bpf prog). So to make it work as an ftrace_op, we may end
    up with two different bpf trampoline types on arm64, one for bpf prog and
    the other for ftrace.

2. Performance overhead: we always jump to a static ftrace trampoline to
    construct the execution environment for the bpf trampoline, then jump to the
    bpf trampoline to construct the execution environment for the bpf prog, then
    jump to the bpf prog. For some small bpf progs or hot functions, the calling
    overhead may be unacceptable.

> Note that as a prerequisite for that I also want to reduce the set of registers
> we save/restore down to the set required by our calling convention, as the
> existing pt_regs is both large and generally unsound (since we can not and do
> not fill in many of the fields we only acquire at an exception boundary).
> That'll further reduce the ftrace overhead generally, and remove the needs for
> the two trampolines we currently have. I have a WIP at:
> 
>    https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/minimal-regs
> 
> I intend to get back to both of those shortly (along with some related bits for
> kretprobes and stacktracing); I just haven't had much time recently due to
> other work and illness.
> 

Sorry to hear that, hope you get better soon.

>> 1. Add a symbol name for literal_call.
> 
> That'll require a number of invasive changes to make RELIABLE_STACKTRACE work,
> so I don't think we want to do that.
> 
>> 2. Hack the backtrace routine, if no symbol name found for a PC during backtrace,
>>     we can check if the PC is in literal_call, then adjust PC and try again.
> 
> The problem is that the existing symbolization code doesn't know the length of
> the prior symbol, so it will find *some* symbol associated with the previous
> function rather than finding no symbol.
> 
> To bodge around this we'dd need to special-case each patchable-function-entry
> site in symbolization, which is going to be painful and slow down unwinding
> unless we try to fix this up at boot-time or compile time.
>
>> 3. Move literal_call to the func's address range, for example:
>>
>>          a. Compile with -fpatchable-function-entry=7
>>          func:
>>                  BTI C
>>                  NOP
>>                  NOP
>>                  NOP
>>                  NOP
>>                  NOP
>>                  NOP
>>                  NOP
> 
> This is a non-starter. We are not going to add 7 NOPs at the start of every
> function.
> 
> Thanks,
> Mark.
> 
> .
Florent Revest Oct. 4, 2022, 4:06 p.m. UTC | #7
On Fri, Sep 30, 2022 at 6:07 AM Xu Kuohai <xukuohai@huawei.com> wrote:
>
> On 9/29/2022 12:42 AM, Mark Rutland wrote:
> > On Tue, Sep 27, 2022 at 12:49:58PM +0800, Xu Kuohai wrote:
> >> On 9/27/2022 1:43 AM, Mark Rutland wrote:
> >>> On Mon, Sep 26, 2022 at 03:40:20PM +0100, Catalin Marinas wrote:
> >>>> On Thu, Sep 22, 2022 at 08:01:16PM +0200, Daniel Borkmann wrote:
> >>>>> On 9/13/22 6:27 PM, Xu Kuohai wrote:
> >>>>>> This series adds ftrace direct call for arm64, which is required to attach
> >>>>>> bpf trampoline to fentry.
> >>>>>>
> >>>>>> Although there is no agreement on how to support ftrace direct call on arm64,
> >>>>>> no patch has been posted except the one I posted in [1], so this series

Hey Xu :) Sorry I wasn't more pro-active about communicating what I
was experimenting with! A lot of conversations happened off-the-list
at LPC and LSS, so I was playing on the side with the ideas that got
suggested to me. I'm starting to have a little something to share.
Hopefully if we work closer together now we can get quicker results.

> >>>>>> continues the work of [1] with the addition of long jump support. Now ftrace
> >>>>>> direct call works regardless of the distance between the callsite and custom
> >>>>>> trampoline.
> >>>>>>
> >>>>>> [1] https://lore.kernel.org/bpf/20220518131638.3401509-2-xukuohai@huawei.com/
> >>>>>>
> >>>>>> v2:
> >>>>>> - Fix compile and runtime errors caused by ftrace_rec_arch_init
> >>>>>>
> >>>>>> v1: https://lore.kernel.org/bpf/20220913063146.74750-1-xukuohai@huaweicloud.com/
> >>>>>>
> >>>>>> Xu Kuohai (4):
> >>>>>>      ftrace: Allow users to disable ftrace direct call
> >>>>>>      arm64: ftrace: Support long jump for ftrace direct call
> >>>>>>      arm64: ftrace: Add ftrace direct call support
> >>>>>>      ftrace: Fix dead loop caused by direct call in ftrace selftest
> >>>>>
> >>>>> Given there's just a tiny fraction touching BPF JIT and most are around core arm64,
> >>>>> it probably makes sense that this series goes via Catalin/Will through arm64 tree
> >>>>> instead of bpf-next if it looks good to them. Catalin/Will, thoughts (Ack + bpf-next
> >>>>> could work too, but I'd presume this just results in merge conflicts)?
> >>>>
> >>>> I think it makes sense for the series to go via the arm64 tree but I'd
> >>>> like Mark to have a look at the ftrace changes first.
> >>>
> >>> From a quick scan, I still don't think this is quite right, and as it stands I
> >>> believe this will break backtracing (as the instructions before the function
> >>> entry point will not be symbolized correctly, getting in the way of
> >>> RELIABLE_STACKTRACE). I think I was insufficiently clear with my earlier
> >>> feedback there, as I have a mechanism in mind that wa a little simpler.
> >>
> >> Thanks for the review. I have some thoughts about reliable stacktrace.
> >>
> >> If PC is not in the range of literal_call, stacktrace works as before without
> >> changes.
> >>
> >> If PC is in the range of literal_call, for example, interrupted by an
> >> irq, I think there are 2 problems:
> >>
> >> 1. Caller LR is not pushed to the stack yet, so caller's address and name
> >>     will be missing from the backtrace.
> >>
> >> 2. Since PC is not in func's address range, no symbol name will be found, so
> >>     func name is also missing.
> >>
> >> Problem 1 is not introduced by this patchset, but the occurring probability
> >> may be increased by this patchset. I think this problem should be addressed by
> >> a reliable stacktrace scheme, such as ORC on x86.
> >
> > I agree problem 1 is not introduced by this patch set; I have plans fo how to
> > address that for reliable stacktrace based on identifying the ftrace
> > trampoline. This is one of the reasons I do not want direct calls, as
> > identifying all direct call trampolines is going to be very painful and slow,
> > whereas identifying a statically allocated ftrace trampoline is far simpler.
> >
> >> Problem 2 is indeed introduced by this patchset. I think there are at least 3
> >> ways to deal with it:
> >
> > What I would like to do here, as mentioned previously in other threads, is to
> > avoid direct calls, and implement "FTRACE_WITH_OPS", where we can associate
> > each patch-site with a specific set of ops, and invoke that directly from the
> > regular ftrace trampoline.
> >
> > With that, the patch site would look like:
> >
> >       pre_func_literal:
> >               NOP             // Patched to a pointer to
> >               NOP             // ftrace_ops
> >       func:
> >               < optional BTI here >
> >               NOP             // Patched to MOV X9, LR
> >               NOP             // Patched to a BL to the ftrace trampoline
> >
> > ... then in the ftrace trampoline we can recover the ops pointer at a negative
> > offset from the LR based on the LR, and invoke the ops from there (passing a
> > struct ftrace_regs with the saved regs).
> >
> > That way the patch-site is less significantly affected, and there's no impact
> > to backtracing. That gets most of the benefit of the direct calls avoiding the
> > ftrace ops list traversal, without having to do anything special at all. That
> > should be much easier to maintain, too.
> >
> > I started implementing that before LPC (and you can find some branches on my
> > kernel.org repo), but I haven't yet had the time to rebase those and sort out
> > the remaining issues:
> >
> >    https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/per-callsite-ops
> >
>
> I've read this code before, but it doesn't run and since you haven't updated

I also tried to use this but indeed the "TODO: mess with protection to
set this" in 5437aa788d needs to be addressed before we can use it.

> it, I assumed you dropped it :(
>
> This approach seems appropriate for dynamic ftrace trampolines, but I think
> there are two more issues for bpf.
>
> 1. bpf trampoline was designed to be called directly from fentry (located in
>     kernel function or bpf prog). So to make it work as ftrace_op, we may end
>     up with two different bpf trampoline types on arm64, one for bpf prog and
>     the other for ftrace.
>
> 2. Performance overhead, as we always jump to a static ftrace trampoline to
>     construct execution environment for bpf trampoline, then jump to the bpf
>     trampoline to construct execution environment for bpf prog, then jump to
>     the bpf prog, so for some small bpf progs or hot functions, the calling
>     overhead may be unacceptable.

From the conversations I've had at LPC, Steven, Mark, Jiri and Masami
(all in CC) would like to see an ftrace ops based solution (or rather,
something that doesn't require direct calls) for invoking BPF tracing
programs. I figured that the best way to move forward on the question
of whether the performance impact of that would be acceptable or not
is to just build it and measure it. I understand you're testing your
work on real hardware (I work on an emulator at the moment); would
you be able to compare the impact of my proof of concept branch with
your direct call based approach?

https://github.com/FlorentRevest/linux/commits/fprobe-min-args

I first tried to implement this as an ftrace op myself but realized I
was re-implementing a lot of the function graph tracer. So I then
tried to use the function graph tracer API but realized I was missing
some features which Steven had addressed in an RFC few years back. So
I rebuilt on that until I realized Masami has been upstreaming the
fprobe and rethook APIs as spiritual successors of Steven's RFC... So
I've now rebuilt yet another proof of concept based on fprobe and
rethook.

That branch is still very much WIP and there are a few things I'd like
to address before sending even an RFC (for example, when kretprobe is
built on rethook, I construct pt_regs on the stack into which I copy
the contents of ftrace_regs... and program linking/unlinking is racy
right now), but I think it's good enough for performance measurements
already. (fentry_fexit and lsm tests pass)

> > Note that as a prerequisite for that I also want to reduce the set of registers
> > we save/restore down to the set required by our calling convention, as the
> > existing pt_regs is both large and generally unsound (since we can not and do
> > not fill in many of the fields we only acquire at an exception boundary).
> > That'll further reduce the ftrace overhead generally, and remove the needs for
> > the two trampolines we currently have. I have a WIP at:
> >
> >    https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/minimal-regs

Note that I integrated this work into my branch too. I extended it to
also have fprobe and rethook save and pass ftrace_regs structures to
their callbacks. Most performance improvements would come from your
arm64/ftrace/per-callsite-ops branch but we'd need to fix the above
TODO for it to work.

> > I intend to get back to both of those shortly (along with some related bits for
> > kretprobes and stacktracing); I just haven't had much time recently due to
> > other work and illness.
> >
>
> Sorry for that, hope you getting better soon.

Oh, that sucks. Get better Mark!
Xu Kuohai Oct. 5, 2022, 2:54 p.m. UTC | #8
On 10/5/2022 12:06 AM, Florent Revest wrote:
> On Fri, Sep 30, 2022 at 6:07 AM Xu Kuohai <xukuohai@huawei.com> wrote:
>>
>> On 9/29/2022 12:42 AM, Mark Rutland wrote:
>>> On Tue, Sep 27, 2022 at 12:49:58PM +0800, Xu Kuohai wrote:
>>>> On 9/27/2022 1:43 AM, Mark Rutland wrote:
>>>>> On Mon, Sep 26, 2022 at 03:40:20PM +0100, Catalin Marinas wrote:
>>>>>> On Thu, Sep 22, 2022 at 08:01:16PM +0200, Daniel Borkmann wrote:
>>>>>>> On 9/13/22 6:27 PM, Xu Kuohai wrote:
>>>>>>>> This series adds ftrace direct call for arm64, which is required to attach
>>>>>>>> bpf trampoline to fentry.
>>>>>>>>
>>>>>>>> Although there is no agreement on how to support ftrace direct call on arm64,
>>>>>>>> no patch has been posted except the one I posted in [1], so this series
> 
> Hey Xu :) Sorry I wasn't more pro-active about communicating what i
> was experimenting with! A lot of conversations happened off-the-list
> at LPC and LSS so I was playing on the side with the ideas that got
> suggested to me. I start to have a little something to share.
> Hopefully if we work closer together now we can get quicker results.
> 
>>>>>>>> continues the work of [1] with the addition of long jump support. Now ftrace
>>>>>>>> direct call works regardless of the distance between the callsite and custom
>>>>>>>> trampoline.
>>>>>>>>
>>>>>>>> [1] https://lore.kernel.org/bpf/20220518131638.3401509-2-xukuohai@huawei.com/
>>>>>>>>
>>>>>>>> v2:
>>>>>>>> - Fix compile and runtime errors caused by ftrace_rec_arch_init
>>>>>>>>
>>>>>>>> v1: https://lore.kernel.org/bpf/20220913063146.74750-1-xukuohai@huaweicloud.com/
>>>>>>>>
>>>>>>>> Xu Kuohai (4):
>>>>>>>>       ftrace: Allow users to disable ftrace direct call
>>>>>>>>       arm64: ftrace: Support long jump for ftrace direct call
>>>>>>>>       arm64: ftrace: Add ftrace direct call support
>>>>>>>>       ftrace: Fix dead loop caused by direct call in ftrace selftest
>>>>>>>
>>>>>>> Given there's just a tiny fraction touching BPF JIT and most are around core arm64,
>>>>>>> it probably makes sense that this series goes via Catalin/Will through arm64 tree
>>>>>>> instead of bpf-next if it looks good to them. Catalin/Will, thoughts (Ack + bpf-next
>>>>>>> could work too, but I'd presume this just results in merge conflicts)?
>>>>>>
>>>>>> I think it makes sense for the series to go via the arm64 tree but I'd
>>>>>> like Mark to have a look at the ftrace changes first.
>>>>>
>>>>> From a quick scan, I still don't think this is quite right, and as it stands I
>>>>> believe this will break backtracing (as the instructions before the function
>>>>> entry point will not be symbolized correctly, getting in the way of
>>>>> RELIABLE_STACKTRACE). I think I was insufficiently clear with my earlier
>>>>> feedback there, as I have a mechanism in mind that wa a little simpler.
>>>>
>>>> Thanks for the review. I have some thoughts about reliable stacktrace.
>>>>
>>>> If PC is not in the range of literal_call, stacktrace works as before without
>>>> changes.
>>>>
>>>> If PC is in the range of literal_call, for example, interrupted by an
>>>> irq, I think there are 2 problems:
>>>>
>>>> 1. Caller LR is not pushed to the stack yet, so caller's address and name
>>>>      will be missing from the backtrace.
>>>>
>>>> 2. Since PC is not in func's address range, no symbol name will be found, so
>>>>      func name is also missing.
>>>>
>>>> Problem 1 is not introduced by this patchset, but the occurring probability
>>>> may be increased by this patchset. I think this problem should be addressed by
>>>> a reliable stacktrace scheme, such as ORC on x86.
>>>
>>> I agree problem 1 is not introduced by this patch set; I have plans fo how to
>>> address that for reliable stacktrace based on identifying the ftrace
>>> trampoline. This is one of the reasons I do not want direct calls, as
>>> identifying all direct call trampolines is going to be very painful and slow,
>>> whereas identifying a statically allocated ftrace trampoline is far simpler.
>>>
>>>> Problem 2 is indeed introduced by this patchset. I think there are at least 3
>>>> ways to deal with it:
>>>
>>> What I would like to do here, as mentioned previously in other threads, is to
>>> avoid direct calls, and implement "FTRACE_WITH_OPS", where we can associate
>>> each patch-site with a specific set of ops, and invoke that directly from the
>>> regular ftrace trampoline.
>>>
>>> With that, the patch site would look like:
>>>
>>>        pre_func_literal:
>>>                NOP             // Patched to a pointer to
>>>                NOP             // ftrace_ops
>>>        func:
>>>                < optional BTI here >
>>>                NOP             // Patched to MOV X9, LR
>>>                NOP             // Patched to a BL to the ftrace trampoline
>>>
>>> ... then in the ftrace trampoline we can recover the ops pointer at a negative
>>> offset from the LR based on the LR, and invoke the ops from there (passing a
>>> struct ftrace_regs with the saved regs).
>>>
>>> That way the patch-site is less significantly affected, and there's no impact
>>> to backtracing. That gets most of the benefit of the direct calls avoiding the
>>> ftrace ops list traversal, without having to do anything special at all. That
>>> should be much easier to maintain, too.
>>>
>>> I started implementing that before LPC (and you can find some branches on my
>>> kernel.org repo), but I haven't yet had the time to rebase those and sort out
>>> the remaining issues:
>>>
>>>     https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/per-callsite-ops
>>>
>>
>> I've read this code before, but it doesn't run and since you haven't updated
> 
> I also tried to use this but indeed the "TODO: mess with protection to
> set this" in  5437aa788d needs to be addressed before we can use it.
> 
>> it, I assumed you dropped it :(
>>
>> This approach seems appropriate for dynamic ftrace trampolines, but I think
>> there are two more issues for bpf.
>>
>> 1. bpf trampoline was designed to be called directly from fentry (located in
>>      kernel function or bpf prog). So to make it work as ftrace_op, we may end
>>      up with two different bpf trampoline types on arm64, one for bpf prog and
>>      the other for ftrace.
>>
>> 2. Performance overhead, as we always jump to a static ftrace trampoline to
>>      construct execution environment for bpf trampoline, then jump to the bpf
>>      trampoline to construct execution environment for bpf prog, then jump to
>>      the bpf prog, so for some small bpf progs or hot functions, the calling
>>      overhead may be unacceptable.
> 
> From the conversations I've had at LPC, Steven, Mark, Jiri and Masami
> (all in CC) would like to see an ftrace ops based solution (or rather,
> something that doesn't require direct calls) for invoking BPF tracing
> programs. I figured that the best way to move forward on the question
> of whether the performance impact of that would be acceptable or not
> is to just build it and measure it. I understand you're testing your
> work on real hardware (I work on an emulator at the moment) , would
> you be able to compare the impact of my proof of concept branch with
> your direct call based approach ?
> 
> https://github.com/FlorentRevest/linux/commits/fprobe-min-args
>

Tested on my pi4, here is the result.

1. test with dd

1.1 when no bpf prog attached to vfs_write

# dd if=/dev/zero of=/dev/null count=1000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 1.56858 s, 326 MB/s


1.2 attach bpf prog with kprobe, bpftrace -e 'kprobe:vfs_write {}'

# dd if=/dev/zero of=/dev/null count=1000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 2.33439 s, 219 MB/s


1.3 attach bpf prog with direct call, bpftrace -e 'kfunc:vfs_write {}'

# dd if=/dev/zero of=/dev/null count=1000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 1.72973 s, 296 MB/s


1.4 attach bpf prog with indirect call, bpftrace -e 'kfunc:vfs_write {}'

# dd if=/dev/zero of=/dev/null count=1000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 1.99179 s, 257 MB/s


2. test with bpf/bench

2.1 bench trig-base
Iter   0 ( 65.259us): hits    1.774M/s (  1.774M/prod), drops    0.000M/s, total operations    1.774M/s
Iter   1 (-17.075us): hits    1.790M/s (  1.790M/prod), drops    0.000M/s, total operations    1.790M/s
Iter   2 (  0.388us): hits    1.790M/s (  1.790M/prod), drops    0.000M/s, total operations    1.790M/s
Iter   3 ( -1.759us): hits    1.790M/s (  1.790M/prod), drops    0.000M/s, total operations    1.790M/s
Iter   4 (  1.980us): hits    1.790M/s (  1.790M/prod), drops    0.000M/s, total operations    1.790M/s
Iter   5 ( -2.222us): hits    1.790M/s (  1.790M/prod), drops    0.000M/s, total operations    1.790M/s
Iter   6 (  0.869us): hits    1.790M/s (  1.790M/prod), drops    0.000M/s, total operations    1.790M/s
Summary: hits    1.790 ± 0.000M/s (  1.790M/prod), drops    0.000 ± 0.000M/s, total operations    1.790 ± 0.000M/s

2.2 bench trig-kprobe
Iter   0 ( 50.703us): hits    0.765M/s (  0.765M/prod), drops    0.000M/s, total operations    0.765M/s
Iter   1 (-15.056us): hits    0.771M/s (  0.771M/prod), drops    0.000M/s, total operations    0.771M/s
Iter   2 (  2.981us): hits    0.771M/s (  0.771M/prod), drops    0.000M/s, total operations    0.771M/s
Iter   3 ( -3.834us): hits    0.771M/s (  0.771M/prod), drops    0.000M/s, total operations    0.771M/s
Iter   4 ( -1.964us): hits    0.771M/s (  0.771M/prod), drops    0.000M/s, total operations    0.771M/s
Iter   5 (  0.426us): hits    0.770M/s (  0.770M/prod), drops    0.000M/s, total operations    0.770M/s
Iter   6 ( -1.297us): hits    0.771M/s (  0.771M/prod), drops    0.000M/s, total operations    0.771M/s
Summary: hits    0.771 ± 0.000M/s (  0.771M/prod), drops    0.000 ± 0.000M/s, total operations    0.771 ± 0.000M/s

2.3 bench trig-fentry, with direct call
Iter   0 ( 49.981us): hits    1.357M/s (  1.357M/prod), drops    0.000M/s, total operations    1.357M/s
Iter   1 (  2.184us): hits    1.363M/s (  1.363M/prod), drops    0.000M/s, total operations    1.363M/s
Iter   2 (-14.167us): hits    1.358M/s (  1.358M/prod), drops    0.000M/s, total operations    1.358M/s
Iter   3 ( -4.890us): hits    1.362M/s (  1.362M/prod), drops    0.000M/s, total operations    1.362M/s
Iter   4 (  5.759us): hits    1.362M/s (  1.362M/prod), drops    0.000M/s, total operations    1.362M/s
Iter   5 ( -4.389us): hits    1.362M/s (  1.362M/prod), drops    0.000M/s, total operations    1.362M/s
Iter   6 ( -0.594us): hits    1.364M/s (  1.364M/prod), drops    0.000M/s, total operations    1.364M/s
Summary: hits    1.362 ± 0.002M/s (  1.362M/prod), drops    0.000 ± 0.000M/s, total operations    1.362 ± 0.002M/s

2.4 bench trig-fentry, with indirect call
Iter   0 ( 49.148us): hits    1.014M/s (  1.014M/prod), drops    0.000M/s, total operations    1.014M/s
Iter   1 (-13.816us): hits    1.021M/s (  1.021M/prod), drops    0.000M/s, total operations    1.021M/s
Iter   2 (  0.648us): hits    1.021M/s (  1.021M/prod), drops    0.000M/s, total operations    1.021M/s
Iter   3 (  3.370us): hits    1.021M/s (  1.021M/prod), drops    0.000M/s, total operations    1.021M/s
Iter   4 ( 11.388us): hits    1.021M/s (  1.021M/prod), drops    0.000M/s, total operations    1.021M/s
Iter   5 (-17.242us): hits    1.022M/s (  1.022M/prod), drops    0.000M/s, total operations    1.022M/s
Iter   6 (  1.815us): hits    1.021M/s (  1.021M/prod), drops    0.000M/s, total operations    1.021M/s
Summary: hits    1.021 ± 0.000M/s (  1.021M/prod), drops    0.000 ± 0.000M/s, total operations    1.021 ± 0.000M/s

> I first tried to implement this as an ftrace op myself but realized I
> was re-implementing a lot of the function graph tracer. So I then
> tried to use the function graph tracer API but realized I was missing
> some features which Steven had addressed in an RFC few years back. So
> I rebuilt on that until I realized Masami has been upstreaming the
> fprobe and rethook APIs as spiritual successors of Steven's RFC... So
> I've now rebuilt yet another proof of concept based on fprobe and
> rethook.
> 
> That branch is still very much WIP and there are a few things I'd like
> to address before sending even an RFC (when kretprobe is built on
> rethook for example, I construct pt_regs on the stack in which I copy
> the content of ftrace_regs... or program linking/unlinking is racy
> right now) but I think it's good enough for performance measurements
> already. (fentry_fexit and lsm tests pass)
> 
>>> Note that as a prerequisite for that I also want to reduce the set of registers
>>> we save/restore down to the set required by our calling convention, as the
>>> existing pt_regs is both large and generally unsound (since we can not and do
>>> not fill in many of the fields we only acquire at an exception boundary).
>>> That'll further reduce the ftrace overhead generally, and remove the needs for
>>> the two trampolines we currently have. I have a WIP at:
>>>
>>>     https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/minimal-regs
> 
> Note that I integrated this work to my branch too. I extended it to
> also have fprobe and rethook save and pass ftrace_regs structures to
> their callbacks. Most performance improvements would come from your
> arm64/ftrace/per-callsite-ops branch but we'd need to fix the above
> TODO for it to work.
> 
>>> I intend to get back to both of those shortly (along with some related bits for
>>> kretprobes and stacktracing); I just haven't had much time recently due to
>>> other work and illness.
>>>
>>
>> Sorry for that, hope you getting better soon.
> 
> Oh, that sucks. Get better Mark!
> .
Steven Rostedt Oct. 5, 2022, 3:07 p.m. UTC | #9
On Wed, 5 Oct 2022 22:54:15 +0800
Xu Kuohai <xukuohai@huawei.com> wrote:

> 1.3 attach bpf prog with with direct call, bpftrace -e 'kfunc:vfs_write {}'
> 
> # dd if=/dev/zero of=/dev/null count=1000000
> 1000000+0 records in
> 1000000+0 records out
> 512000000 bytes (512 MB, 488 MiB) copied, 1.72973 s, 296 MB/s
> 
> 
> 1.4 attach bpf prog with with indirect call, bpftrace -e 'kfunc:vfs_write {}'
> 
> # dd if=/dev/zero of=/dev/null count=1000000
> 1000000+0 records in
> 1000000+0 records out
> 512000000 bytes (512 MB, 488 MiB) copied, 1.99179 s, 257 MB/s

Can you show the implementation of the indirect call you used?

Thanks,

-- Steve
Florent Revest Oct. 5, 2022, 3:10 p.m. UTC | #10
On Wed, Oct 5, 2022 at 5:07 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed, 5 Oct 2022 22:54:15 +0800
> Xu Kuohai <xukuohai@huawei.com> wrote:
>
> > 1.3 attach bpf prog with with direct call, bpftrace -e 'kfunc:vfs_write {}'
> >
> > # dd if=/dev/zero of=/dev/null count=1000000
> > 1000000+0 records in
> > 1000000+0 records out
> > 512000000 bytes (512 MB, 488 MiB) copied, 1.72973 s, 296 MB/s
> >
> >
> > 1.4 attach bpf prog with with indirect call, bpftrace -e 'kfunc:vfs_write {}'
> >
> > # dd if=/dev/zero of=/dev/null count=1000000
> > 1000000+0 records in
> > 1000000+0 records out
> > 512000000 bytes (512 MB, 488 MiB) copied, 1.99179 s, 257 MB/s

Thanks for the measurements Xu!

> Can you show the implementation of the indirect call you used?

Xu used my development branch here
https://github.com/FlorentRevest/linux/commits/fprobe-min-args

As it stands, the performance impact of the fprobe based
implementation would be too high for us. I wonder how much Mark's idea
here https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/per-callsite-ops
would help but it doesn't work right now.
Steven Rostedt Oct. 5, 2022, 3:30 p.m. UTC | #11
On Wed, 5 Oct 2022 17:10:33 +0200
Florent Revest <revest@chromium.org> wrote:

> On Wed, Oct 5, 2022 at 5:07 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Wed, 5 Oct 2022 22:54:15 +0800
> > Xu Kuohai <xukuohai@huawei.com> wrote:
> >  
> > > 1.3 attach bpf prog with with direct call, bpftrace -e 'kfunc:vfs_write {}'
> > >
> > > # dd if=/dev/zero of=/dev/null count=1000000
> > > 1000000+0 records in
> > > 1000000+0 records out
> > > 512000000 bytes (512 MB, 488 MiB) copied, 1.72973 s, 296 MB/s
> > >
> > >
> > > 1.4 attach bpf prog with with indirect call, bpftrace -e 'kfunc:vfs_write {}'
> > >
> > > # dd if=/dev/zero of=/dev/null count=1000000
> > > 1000000+0 records in
> > > 1000000+0 records out
> > > 512000000 bytes (512 MB, 488 MiB) copied, 1.99179 s, 257 MB/s  
> 
> Thanks for the measurements Xu!
> 
> > Can you show the implementation of the indirect call you used?  
> 
> Xu used my development branch here
> https://github.com/FlorentRevest/linux/commits/fprobe-min-args

That looks like it could be optimized quite a bit too.

Specifically this part:

static bool bpf_fprobe_entry(struct fprobe *fp, unsigned long ip, struct ftrace_regs *regs, void *private)
{
	struct bpf_fprobe_call_context *call_ctx = private;
	struct bpf_fprobe_context *fprobe_ctx = fp->ops.private;
	struct bpf_tramp_links *links = fprobe_ctx->links;
	struct bpf_tramp_links *fentry = &links[BPF_TRAMP_FENTRY];
	struct bpf_tramp_links *fmod_ret = &links[BPF_TRAMP_MODIFY_RETURN];
	struct bpf_tramp_links *fexit = &links[BPF_TRAMP_FEXIT];
	int i, ret;

	memset(&call_ctx->ctx, 0, sizeof(call_ctx->ctx));
	call_ctx->ip = ip;
	for (i = 0; i < fprobe_ctx->nr_args; i++)
		call_ctx->args[i] = ftrace_regs_get_argument(regs, i);

	for (i = 0; i < fentry->nr_links; i++)
		call_bpf_prog(fentry->links[i], &call_ctx->ctx, call_ctx->args);

	call_ctx->args[fprobe_ctx->nr_args] = 0;
	for (i = 0; i < fmod_ret->nr_links; i++) {
		ret = call_bpf_prog(fmod_ret->links[i], &call_ctx->ctx,
				      call_ctx->args);

		if (ret) {
			ftrace_regs_set_return_value(regs, ret);
			ftrace_override_function_with_return(regs);

			bpf_fprobe_exit(fp, ip, regs, private);
			return false;
		}
	}

	return fexit->nr_links;
}

There's a lot of low hanging fruit to speed up there. I wouldn't be too
fast to throw out this solution if it hasn't had the care that direct calls
have had to speed that up.

For example, trampolines currently only allow attaching to functions with 6
parameters or fewer (3 on x86_32). You could make 7 specific callbacks, with
zero to 6 parameters, and unroll the argument loop.
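
For illustration only (a sketch of that idea reusing the names from the
snippet quoted above, not code from the branch; the fmod_ret/fexit handling
is elided), the two-argument case could look like:

static bool bpf_fprobe_entry_arg2(struct fprobe *fp, unsigned long ip,
				  struct ftrace_regs *regs, void *private)
{
	struct bpf_fprobe_call_context *call_ctx = private;
	struct bpf_fprobe_context *fprobe_ctx = fp->ops.private;
	struct bpf_tramp_links *links = fprobe_ctx->links;
	struct bpf_tramp_links *fentry = &links[BPF_TRAMP_FENTRY];
	int i;

	memset(&call_ctx->ctx, 0, sizeof(call_ctx->ctx));
	call_ctx->ip = ip;
	/* unrolled: exactly two arguments, no per-argument copy loop */
	call_ctx->args[0] = ftrace_regs_get_argument(regs, 0);
	call_ctx->args[1] = ftrace_regs_get_argument(regs, 1);

	for (i = 0; i < fentry->nr_links; i++)
		call_bpf_prog(fentry->links[i], &call_ctx->ctx,
			      call_ctx->args);

	/* fmod_ret/fexit handling elided in this sketch */
	return false;
}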

Would also be interesting to run perf to see where the overhead is. There
may be other locations to work on to make it almost as fast as direct
callers without the other baggage.

-- Steve

> 
> As it stands, the performance impact of the fprobe based
> implementation would be too high for us. I wonder how much Mark's idea
> here https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/per-callsite-ops
> would help but it doesn't work right now.
Jiri Olsa Oct. 5, 2022, 10:12 p.m. UTC | #12
On Wed, Oct 05, 2022 at 11:30:19AM -0400, Steven Rostedt wrote:
> On Wed, 5 Oct 2022 17:10:33 +0200
> Florent Revest <revest@chromium.org> wrote:
> 
> > On Wed, Oct 5, 2022 at 5:07 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> > >
> > > On Wed, 5 Oct 2022 22:54:15 +0800
> > > Xu Kuohai <xukuohai@huawei.com> wrote:
> > >  
> > > > 1.3 attach bpf prog with with direct call, bpftrace -e 'kfunc:vfs_write {}'
> > > >
> > > > # dd if=/dev/zero of=/dev/null count=1000000
> > > > 1000000+0 records in
> > > > 1000000+0 records out
> > > > 512000000 bytes (512 MB, 488 MiB) copied, 1.72973 s, 296 MB/s
> > > >
> > > >
> > > > 1.4 attach bpf prog with with indirect call, bpftrace -e 'kfunc:vfs_write {}'
> > > >
> > > > # dd if=/dev/zero of=/dev/null count=1000000
> > > > 1000000+0 records in
> > > > 1000000+0 records out
> > > > 512000000 bytes (512 MB, 488 MiB) copied, 1.99179 s, 257 MB/s  
> > 
> > Thanks for the measurements Xu!
> > 
> > > Can you show the implementation of the indirect call you used?  
> > 
> > Xu used my development branch here
> > https://github.com/FlorentRevest/linux/commits/fprobe-min-args

nice :) I guess you did not try to run it on x86; I had to add some small
changes and disable HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS to compile it

> 
> That looks like it could be optimized quite a bit too.
> 
> Specifically this part:
> 
> static bool bpf_fprobe_entry(struct fprobe *fp, unsigned long ip, struct ftrace_regs *regs, void *private)
> {
> 	struct bpf_fprobe_call_context *call_ctx = private;
> 	struct bpf_fprobe_context *fprobe_ctx = fp->ops.private;
> 	struct bpf_tramp_links *links = fprobe_ctx->links;
> 	struct bpf_tramp_links *fentry = &links[BPF_TRAMP_FENTRY];
> 	struct bpf_tramp_links *fmod_ret = &links[BPF_TRAMP_MODIFY_RETURN];
> 	struct bpf_tramp_links *fexit = &links[BPF_TRAMP_FEXIT];
> 	int i, ret;
> 
> 	memset(&call_ctx->ctx, 0, sizeof(call_ctx->ctx));
> 	call_ctx->ip = ip;
> 	for (i = 0; i < fprobe_ctx->nr_args; i++)
> 		call_ctx->args[i] = ftrace_regs_get_argument(regs, i);
> 
> 	for (i = 0; i < fentry->nr_links; i++)
> 		call_bpf_prog(fentry->links[i], &call_ctx->ctx, call_ctx->args);
> 
> 	call_ctx->args[fprobe_ctx->nr_args] = 0;
> 	for (i = 0; i < fmod_ret->nr_links; i++) {
> 		ret = call_bpf_prog(fmod_ret->links[i], &call_ctx->ctx,
> 				      call_ctx->args);
> 
> 		if (ret) {
> 			ftrace_regs_set_return_value(regs, ret);
> 			ftrace_override_function_with_return(regs);
> 
> 			bpf_fprobe_exit(fp, ip, regs, private);
> 			return false;
> 		}
> 	}
> 
> 	return fexit->nr_links;
> }
> 
> There's a lot of low hanging fruit to speed up there. I wouldn't be too
> fast to throw out this solution if it hasn't had the care that direct calls
> have had to speed that up.
> 
> For example, trampolines currently only allow attaching to functions with 6
> parameters or less (3 on x86_32). You could make 7 specific callbacks, with
> zero to 6 parameters, and unroll the argument loop.
> 
> Would also be interesting to run perf to see where the overhead is. There
> may be other locations to work on to make it almost as fast as direct
> callers without the other baggage.

I can boot the change and run tests in qemu but for some reason it
won't boot on hw, so I have just the perf report from qemu so far

there's fprobe/rethook machinery showing up as expected

jirka


---
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 23K of event 'cpu-clock:k'
# Event count (approx.): 5841250000
#
# Overhead  Command  Shared Object                                   Symbol                                            
# ........  .......  ..............................................  ..................................................
#
    18.65%  bench    [kernel.kallsyms]                               [k] syscall_enter_from_user_mode
            |
            ---syscall_enter_from_user_mode
               do_syscall_64
               entry_SYSCALL_64_after_hwframe
               syscall

    13.03%  bench    [kernel.kallsyms]                               [k] seqcount_lockdep_reader_access.constprop.0
            |
            ---seqcount_lockdep_reader_access.constprop.0
               ktime_get_coarse_real_ts64
               syscall_trace_enter.constprop.0
               do_syscall_64
               entry_SYSCALL_64_after_hwframe
               syscall

     9.49%  bench    [kernel.kallsyms]                               [k] rethook_try_get
            |
            ---rethook_try_get
               fprobe_handler
               ftrace_trampoline
               __x64_sys_getpgid
               do_syscall_64
               entry_SYSCALL_64_after_hwframe
               syscall

     8.71%  bench    [kernel.kallsyms]                               [k] rethook_recycle
            |
            ---rethook_recycle
               fprobe_handler
               ftrace_trampoline
               __x64_sys_getpgid
               do_syscall_64
               entry_SYSCALL_64_after_hwframe
               syscall

     4.31%  bench    [kernel.kallsyms]                               [k] rcu_is_watching
            |
            ---rcu_is_watching
               |          
               |--1.49%--rethook_try_get
               |          fprobe_handler
               |          ftrace_trampoline
               |          __x64_sys_getpgid
               |          do_syscall_64
               |          entry_SYSCALL_64_after_hwframe
               |          syscall
               |          
               |--1.10%--do_getpgid
               |          __x64_sys_getpgid
               |          do_syscall_64
               |          entry_SYSCALL_64_after_hwframe
               |          syscall
               |          
               |--1.02%--__bpf_prog_exit
               |          call_bpf_prog.isra.0
               |          bpf_fprobe_entry
               |          fprobe_handler
               |          ftrace_trampoline
               |          __x64_sys_getpgid
               |          do_syscall_64
               |          entry_SYSCALL_64_after_hwframe
               |          syscall
               |          
                --0.70%--__bpf_prog_enter
                          call_bpf_prog.isra.0
                          bpf_fprobe_entry
                          fprobe_handler
                          ftrace_trampoline
                          __x64_sys_getpgid
                          do_syscall_64
                          entry_SYSCALL_64_after_hwframe
                          syscall

     2.94%  bench    [kernel.kallsyms]                               [k] lock_release
            |
            ---lock_release
               |          
               |--1.51%--call_bpf_prog.isra.0
               |          bpf_fprobe_entry
               |          fprobe_handler
               |          ftrace_trampoline
               |          __x64_sys_getpgid
               |          do_syscall_64
               |          entry_SYSCALL_64_after_hwframe
               |          syscall
               |          
                --1.43%--do_getpgid
                          __x64_sys_getpgid
                          do_syscall_64
                          entry_SYSCALL_64_after_hwframe
                          syscall

     2.91%  bench    bpf_prog_21856463590f61f1_bench_trigger_fentry  [k] bpf_prog_21856463590f61f1_bench_trigger_fentry
            |
            ---bpf_prog_21856463590f61f1_bench_trigger_fentry
               |          
                --2.66%--call_bpf_prog.isra.0
                          bpf_fprobe_entry
                          fprobe_handler
                          ftrace_trampoline
                          __x64_sys_getpgid
                          do_syscall_64
                          entry_SYSCALL_64_after_hwframe
                          syscall

     2.69%  bench    [kernel.kallsyms]                               [k] bpf_fprobe_entry
            |
            ---bpf_fprobe_entry
               fprobe_handler
               ftrace_trampoline
               __x64_sys_getpgid
               do_syscall_64
               entry_SYSCALL_64_after_hwframe
               syscall

     2.60%  bench    [kernel.kallsyms]                               [k] lock_acquire
            |
            ---lock_acquire
               |          
               |--1.34%--__bpf_prog_enter
               |          call_bpf_prog.isra.0
               |          bpf_fprobe_entry
               |          fprobe_handler
               |          ftrace_trampoline
               |          __x64_sys_getpgid
               |          do_syscall_64
               |          entry_SYSCALL_64_after_hwframe
               |          syscall
               |          
                --1.24%--do_getpgid
                          __x64_sys_getpgid
                          do_syscall_64
                          entry_SYSCALL_64_after_hwframe
                          syscall

     2.42%  bench    [kernel.kallsyms]                               [k] syscall_exit_to_user_mode_prepare
            |
            ---syscall_exit_to_user_mode_prepare
               syscall_exit_to_user_mode
               do_syscall_64
               entry_SYSCALL_64_after_hwframe
               syscall

     2.37%  bench    [kernel.kallsyms]                               [k] __audit_syscall_entry
            |
            ---__audit_syscall_entry
               syscall_trace_enter.constprop.0
               do_syscall_64
               entry_SYSCALL_64_after_hwframe
               |          
                --2.36%--syscall

     2.35%  bench    [kernel.kallsyms]                               [k] syscall_trace_enter.constprop.0
            |
            ---syscall_trace_enter.constprop.0
               do_syscall_64
               entry_SYSCALL_64_after_hwframe
               syscall

     2.12%  bench    [kernel.kallsyms]                               [k] check_preemption_disabled
            |
            ---check_preemption_disabled
               |          
                --1.55%--rcu_is_watching
                          |          
                           --0.59%--do_getpgid
                                     __x64_sys_getpgid
                                     do_syscall_64
                                     entry_SYSCALL_64_after_hwframe
                                     syscall

     2.00%  bench    [kernel.kallsyms]                               [k] fprobe_handler
            |
            ---fprobe_handler
               ftrace_trampoline
               __x64_sys_getpgid
               do_syscall_64
               entry_SYSCALL_64_after_hwframe
               syscall

     1.94%  bench    [kernel.kallsyms]                               [k] local_irq_disable_exit_to_user
            |
            ---local_irq_disable_exit_to_user
               syscall_exit_to_user_mode
               do_syscall_64
               entry_SYSCALL_64_after_hwframe
               syscall

     1.84%  bench    [kernel.kallsyms]                               [k] rcu_read_lock_sched_held
            |
            ---rcu_read_lock_sched_held
               |          
               |--0.93%--lock_acquire
               |          
                --0.90%--lock_release

     1.71%  bench    [kernel.kallsyms]                               [k] migrate_enable
            |
            ---migrate_enable
               __bpf_prog_exit
               call_bpf_prog.isra.0
               bpf_fprobe_entry
               fprobe_handler
               ftrace_trampoline
               __x64_sys_getpgid
               do_syscall_64
               entry_SYSCALL_64_after_hwframe
               syscall

     1.66%  bench    [kernel.kallsyms]                               [k] call_bpf_prog.isra.0
            |
            ---call_bpf_prog.isra.0
               bpf_fprobe_entry
               fprobe_handler
               ftrace_trampoline
               __x64_sys_getpgid
               do_syscall_64
               entry_SYSCALL_64_after_hwframe
               syscall

     1.53%  bench    [kernel.kallsyms]                               [k] __rcu_read_unlock
            |
            ---__rcu_read_unlock
               |          
               |--0.86%--__bpf_prog_exit
               |          call_bpf_prog.isra.0
               |          bpf_fprobe_entry
               |          fprobe_handler
               |          ftrace_trampoline
               |          __x64_sys_getpgid
               |          do_syscall_64
               |          entry_SYSCALL_64_after_hwframe
               |          syscall
               |          
                --0.66%--do_getpgid
                          __x64_sys_getpgid
                          do_syscall_64
                          entry_SYSCALL_64_after_hwframe
                          syscall

     1.31%  bench    [kernel.kallsyms]                               [k] debug_smp_processor_id
            |
            ---debug_smp_processor_id
               |          
                --0.77%--rcu_is_watching

     1.22%  bench    [kernel.kallsyms]                               [k] migrate_disable
            |
            ---migrate_disable
               __bpf_prog_enter
               call_bpf_prog.isra.0
               bpf_fprobe_entry
               fprobe_handler
               ftrace_trampoline
               __x64_sys_getpgid
               do_syscall_64
               entry_SYSCALL_64_after_hwframe
               syscall

     1.19%  bench    [kernel.kallsyms]                               [k] __bpf_prog_enter
            |
            ---__bpf_prog_enter
               call_bpf_prog.isra.0
               bpf_fprobe_entry
               fprobe_handler
               ftrace_trampoline
               __x64_sys_getpgid
               do_syscall_64
               entry_SYSCALL_64_after_hwframe
               syscall

     0.84%  bench    [kernel.kallsyms]                               [k] __radix_tree_lookup
            |
            ---__radix_tree_lookup
               find_task_by_pid_ns
               do_getpgid
               __x64_sys_getpgid
               do_syscall_64
               entry_SYSCALL_64_after_hwframe
               syscall

     0.82%  bench    [kernel.kallsyms]                               [k] do_getpgid
            |
            ---do_getpgid
               __x64_sys_getpgid
               do_syscall_64
               entry_SYSCALL_64_after_hwframe
               syscall

     0.78%  bench    [kernel.kallsyms]                               [k] debug_lockdep_rcu_enabled
            |
            ---debug_lockdep_rcu_enabled
               |          
                --0.63%--rcu_read_lock_sched_held

     0.74%  bench    ftrace_trampoline                               [k] ftrace_trampoline
            |
            ---ftrace_trampoline
               __x64_sys_getpgid
               do_syscall_64
               entry_SYSCALL_64_after_hwframe
               syscall

     0.72%  bench    [kernel.kallsyms]                               [k] preempt_count_add
            |
            ---preempt_count_add

     0.71%  bench    [kernel.kallsyms]                               [k] ktime_get_coarse_real_ts64
            |
            ---ktime_get_coarse_real_ts64
               syscall_trace_enter.constprop.0
               do_syscall_64
               entry_SYSCALL_64_after_hwframe
               syscall

     0.69%  bench    [kernel.kallsyms]                               [k] do_syscall_64
            |
            ---do_syscall_64
               entry_SYSCALL_64_after_hwframe
               |          
                --0.68%--syscall

     0.60%  bench    [kernel.kallsyms]                               [k] preempt_count_sub
            |
            ---preempt_count_sub

     0.59%  bench    [kernel.kallsyms]                               [k] __rcu_read_lock
            |
            ---__rcu_read_lock

     0.59%  bench    [kernel.kallsyms]                               [k] __x64_sys_getpgid
            |
            ---__x64_sys_getpgid
               do_syscall_64
               entry_SYSCALL_64_after_hwframe
               syscall

     0.58%  bench    [kernel.kallsyms]                               [k] __audit_syscall_exit
            |
            ---__audit_syscall_exit
               syscall_exit_to_user_mode_prepare
               syscall_exit_to_user_mode
               do_syscall_64
               entry_SYSCALL_64_after_hwframe
               syscall

     0.53%  bench    [kernel.kallsyms]                               [k] audit_reset_context
            |
            ---audit_reset_context
               syscall_exit_to_user_mode_prepare
               syscall_exit_to_user_mode
               do_syscall_64
               entry_SYSCALL_64_after_hwframe
               syscall

     0.45%  bench    [kernel.kallsyms]                               [k] rcu_read_lock_held
     0.36%  bench    [kernel.kallsyms]                               [k] find_task_by_vpid
     0.32%  bench    [kernel.kallsyms]                               [k] __bpf_prog_exit
     0.26%  bench    [kernel.kallsyms]                               [k] syscall_exit_to_user_mode
     0.20%  bench    [kernel.kallsyms]                               [k] idr_find
     0.18%  bench    [kernel.kallsyms]                               [k] find_task_by_pid_ns
     0.17%  bench    [kernel.kallsyms]                               [k] update_prog_stats
     0.16%  bench    [kernel.kallsyms]                               [k] _raw_spin_unlock_irqrestore
     0.14%  bench    [kernel.kallsyms]                               [k] pid_task
     0.04%  bench    [kernel.kallsyms]                               [k] memchr_inv
     0.04%  bench    [kernel.kallsyms]                               [k] smp_call_function_many_cond
     0.03%  bench    [kernel.kallsyms]                               [k] do_user_addr_fault
     0.03%  bench    [kernel.kallsyms]                               [k] kallsyms_expand_symbol.constprop.0
     0.03%  bench    [kernel.kallsyms]                               [k] native_flush_tlb_global
     0.03%  bench    [kernel.kallsyms]                               [k] __change_page_attr_set_clr
     0.02%  bench    [kernel.kallsyms]                               [k] memcpy_erms
     0.02%  bench    [kernel.kallsyms]                               [k] unwind_next_frame
     0.02%  bench    [kernel.kallsyms]                               [k] copy_user_enhanced_fast_string
     0.01%  bench    [kernel.kallsyms]                               [k] __orc_find
     0.01%  bench    [kernel.kallsyms]                               [k] call_rcu
     0.01%  bench    [kernel.kallsyms]                               [k] __alloc_pages
     0.01%  bench    [kernel.kallsyms]                               [k] __purge_vmap_area_lazy
     0.01%  bench    [kernel.kallsyms]                               [k] __softirqentry_text_start
     0.01%  bench    [kernel.kallsyms]                               [k] __stack_depot_save
     0.01%  bench    [kernel.kallsyms]                               [k] __up_read
     0.01%  bench    [kernel.kallsyms]                               [k] __virt_addr_valid
     0.01%  bench    [kernel.kallsyms]                               [k] clear_page_erms
     0.01%  bench    [kernel.kallsyms]                               [k] deactivate_slab
     0.01%  bench    [kernel.kallsyms]                               [k] do_check_common
     0.01%  bench    [kernel.kallsyms]                               [k] finish_task_switch.isra.0
     0.01%  bench    [kernel.kallsyms]                               [k] free_unref_page_list
     0.01%  bench    [kernel.kallsyms]                               [k] ftrace_rec_iter_next
     0.01%  bench    [kernel.kallsyms]                               [k] handle_mm_fault
     0.01%  bench    [kernel.kallsyms]                               [k] orc_find.part.0
     0.01%  bench    [kernel.kallsyms]                               [k] try_charge_memcg
     0.00%  bench    [kernel.kallsyms]                               [k] ___slab_alloc
     0.00%  bench    [kernel.kallsyms]                               [k] __fdget_pos
     0.00%  bench    [kernel.kallsyms]                               [k] __handle_mm_fault
     0.00%  bench    [kernel.kallsyms]                               [k] __is_insn_slot_addr
     0.00%  bench    [kernel.kallsyms]                               [k] __kmalloc
     0.00%  bench    [kernel.kallsyms]                               [k] __mod_lruvec_page_state
     0.00%  bench    [kernel.kallsyms]                               [k] __mod_node_page_state
     0.00%  bench    [kernel.kallsyms]                               [k] __mutex_lock
     0.00%  bench    [kernel.kallsyms]                               [k] __raw_spin_lock_init
     0.00%  bench    [kernel.kallsyms]                               [k] alloc_vmap_area
     0.00%  bench    [kernel.kallsyms]                               [k] allocate_slab
     0.00%  bench    [kernel.kallsyms]                               [k] audit_get_tty
     0.00%  bench    [kernel.kallsyms]                               [k] bpf_ksym_find
     0.00%  bench    [kernel.kallsyms]                               [k] btf_check_all_metas
     0.00%  bench    [kernel.kallsyms]                               [k] btf_put
     0.00%  bench    [kernel.kallsyms]                               [k] cmpxchg_double_slab.constprop.0.isra.0
     0.00%  bench    [kernel.kallsyms]                               [k] do_fault
     0.00%  bench    [kernel.kallsyms]                               [k] do_raw_spin_trylock
     0.00%  bench    [kernel.kallsyms]                               [k] find_vma
     0.00%  bench    [kernel.kallsyms]                               [k] fs_reclaim_release
     0.00%  bench    [kernel.kallsyms]                               [k] ftrace_check_record
     0.00%  bench    [kernel.kallsyms]                               [k] ftrace_replace_code
     0.00%  bench    [kernel.kallsyms]                               [k] get_mem_cgroup_from_mm
     0.00%  bench    [kernel.kallsyms]                               [k] get_page_from_freelist
     0.00%  bench    [kernel.kallsyms]                               [k] in_gate_area_no_mm
     0.00%  bench    [kernel.kallsyms]                               [k] in_task_stack
     0.00%  bench    [kernel.kallsyms]                               [k] kernel_text_address
     0.00%  bench    [kernel.kallsyms]                               [k] kernfs_fop_read_iter
     0.00%  bench    [kernel.kallsyms]                               [k] kernfs_put_active
     0.00%  bench    [kernel.kallsyms]                               [k] kfree
     0.00%  bench    [kernel.kallsyms]                               [k] kmem_cache_alloc
     0.00%  bench    [kernel.kallsyms]                               [k] ksys_read
     0.00%  bench    [kernel.kallsyms]                               [k] lookup_address_in_pgd
     0.00%  bench    [kernel.kallsyms]                               [k] mlock_page_drain_local
     0.00%  bench    [kernel.kallsyms]                               [k] page_remove_rmap
     0.00%  bench    [kernel.kallsyms]                               [k] post_alloc_hook
     0.00%  bench    [kernel.kallsyms]                               [k] preempt_schedule_irq
     0.00%  bench    [kernel.kallsyms]                               [k] queue_work_on
     0.00%  bench    [kernel.kallsyms]                               [k] stack_trace_save
     0.00%  bench    [kernel.kallsyms]                               [k] within_error_injection_list


#
# (Tip: To record callchains for each sample: perf record -g)
#
Xu Kuohai Oct. 6, 2022, 10:09 a.m. UTC | #13
On 9/29/2022 12:42 AM, Mark Rutland wrote:
> On Tue, Sep 27, 2022 at 12:49:58PM +0800, Xu Kuohai wrote:
>> On 9/27/2022 1:43 AM, Mark Rutland wrote:
>>> On Mon, Sep 26, 2022 at 03:40:20PM +0100, Catalin Marinas wrote:
>>>> On Thu, Sep 22, 2022 at 08:01:16PM +0200, Daniel Borkmann wrote:
>>>>> On 9/13/22 6:27 PM, Xu Kuohai wrote:
>>>>>> This series adds ftrace direct call for arm64, which is required to attach
>>>>>> bpf trampoline to fentry.
>>>>>>
>>>>>> Although there is no agreement on how to support ftrace direct call on arm64,
>>>>>> no patch has been posted except the one I posted in [1], so this series
>>>>>> continues the work of [1] with the addition of long jump support. Now ftrace
>>>>>> direct call works regardless of the distance between the callsite and custom
>>>>>> trampoline.
>>>>>>
>>>>>> [1] https://lore.kernel.org/bpf/20220518131638.3401509-2-xukuohai@huawei.com/
>>>>>>
>>>>>> v2:
>>>>>> - Fix compile and runtime errors caused by ftrace_rec_arch_init
>>>>>>
>>>>>> v1: https://lore.kernel.org/bpf/20220913063146.74750-1-xukuohai@huaweicloud.com/
>>>>>>
>>>>>> Xu Kuohai (4):
>>>>>>      ftrace: Allow users to disable ftrace direct call
>>>>>>      arm64: ftrace: Support long jump for ftrace direct call
>>>>>>      arm64: ftrace: Add ftrace direct call support
>>>>>>      ftrace: Fix dead loop caused by direct call in ftrace selftest
>>>>>
>>>>> Given there's just a tiny fraction touching BPF JIT and most are around core arm64,
>>>>> it probably makes sense that this series goes via Catalin/Will through arm64 tree
>>>>> instead of bpf-next if it looks good to them. Catalin/Will, thoughts (Ack + bpf-next
>>>>> could work too, but I'd presume this just results in merge conflicts)?
>>>>
>>>> I think it makes sense for the series to go via the arm64 tree but I'd
>>>> like Mark to have a look at the ftrace changes first.
>>>
>>>>  From a quick scan, I still don't think this is quite right, and as it stands I
>>> believe this will break backtracing (as the instructions before the function
>>> entry point will not be symbolized correctly, getting in the way of
>>> RELIABLE_STACKTRACE). I think I was insufficiently clear with my earlier
>>> feedback there, as I have a mechanism in mind that was a little simpler.
>>
>> Thanks for the review. I have some thoughts about reliable stacktrace.
>>
>> If PC is not in the range of literal_call, stacktrace works as before without
>> changes.
>>
>> If PC is in the range of literal_call, for example, interrupted by an
>> irq, I think there are 2 problems:
>>
>> 1. Caller LR is not pushed to the stack yet, so caller's address and name
>>     will be missing from the backtrace.
>>
>> 2. Since PC is not in func's address range, no symbol name will be found, so
>>     func name is also missing.
>>
>> Problem 1 is not introduced by this patchset, but the probability of it
>> occurring may be increased by it. I think this problem should be addressed by
>> a reliable stacktrace scheme, such as ORC on x86.
> 
> I agree problem 1 is not introduced by this patch set; I have plans for how to
> address that for reliable stacktrace based on identifying the ftrace
> trampoline. This is one of the reasons I do not want direct calls, as
> identifying all direct call trampolines is going to be very painful and slow,
> whereas identifying a statically allocated ftrace trampoline is far simpler.
> 
>> Problem 2 is indeed introduced by this patchset. I think there are at least 3
>> ways to deal with it:
> 
> What I would like to do here, as mentioned previously in other threads, is to
> avoid direct calls, and implement "FTRACE_WITH_OPS", where we can associate
> each patch-site with a specific set of ops, and invoke that directly from the
> regular ftrace trampoline.
> 
> With that, the patch site would look like:
> 
> 	pre_func_literal:
> 		NOP		// Patched to a pointer to
> 		NOP		// ftrace_ops
> 	func:
> 		< optional BTI here >
> 		NOP		// Patched to MOV X9, LR
> 		NOP		// Patched to a BL to the ftrace trampoline
> 
> ... then in the ftrace trampoline we can recover the ops pointer at a negative
> offset from the LR, and invoke the ops from there (passing a
> struct ftrace_regs with the saved regs).
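
(To make that recovery step concrete: below is a rough C sketch of how the
common trampoline could locate the ops pointer from nothing but the LR, for
the layout quoted above. This is not code from Mark's branch; the offsets,
the BTI handling and the function name are assumptions for illustration.)

#include <stdbool.h>

#define AARCH64_INSN_SIZE       4

struct ftrace_ops;              /* opaque for the purpose of this sketch */

struct ftrace_ops *recover_ops_from_lr(unsigned long lr, bool entry_has_bti)
{
        unsigned long entry, literal;

        /* LR points just past the BL: step back over the BL and MOV X9, LR */
        entry = lr - 2 * AARCH64_INSN_SIZE;
        /* ... and over the BTI C, if the kernel places one at the entry */
        if (entry_has_bti)
                entry -= AARCH64_INSN_SIZE;

        /*
         * The ops pointer occupies the 8 bytes immediately before the
         * function entry in the layout above (a real implementation would
         * also have to keep this literal 8-byte aligned, which is glossed
         * over here).
         */
        literal = entry - sizeof(struct ftrace_ops *);

        return *(struct ftrace_ops **)literal;
}

Fetching the ops this way is what would let one shared trampoline skip the
ftrace ops list traversal mentioned above, without any per-callsite
trampoline that a backtrace would then have to symbolize.
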
> 
> That way the patch-site is less significantly affected, and there's no impact
> to backtracing. That gets most of the benefit of the direct calls avoiding the
> ftrace ops list traversal, without having to do anything special at all. That
> should be much easier to maintain, too.
> 
> I started implementing that before LPC (and you can find some branches on my
> kernel.org repo), but I haven't yet had the time to rebase those and sort out
> the remaining issues:
> 
>    https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/per-callsite-ops
> 
> Note that as a prerequisite for that I also want to reduce the set of registers
> we save/restore down to the set required by our calling convention, as the
> existing pt_regs is both large and generally unsound (since we can not and do
> not fill in many of the fields we only acquire at an exception boundary).
> That'll further reduce the ftrace overhead generally, and remove the need for
> the two trampolines we currently have. I have a WIP at:
> 
>    https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/minimal-regs
> 
> I intend to get back to both of those shortly (along with some related bits for
> kretprobes and stacktracing); I just haven't had much time recently due to
> other work and illness.
> 
>> 1. Add a symbol name for literal_call.
> 
> That'll require a number of invasive changes to make RELIABLE_STACKTRACE work,
> so I don't think we want to do that.
> 
>> 2. Hack the backtrace routine: if no symbol name is found for a PC during backtrace,
>>     we can check whether the PC is in literal_call, then adjust the PC and try again.
> 
> The problem is that the existing symbolization code doesn't know the length of
> the prior symbol, so it will find *some* symbol associated with the previous
> function rather than finding no symbol.
> 
> To bodge around this we'd need to special-case each patchable-function-entry
> site in symbolization, which is going to be painful and slow down unwinding
> unless we try to fix this up at boot-time or compile time.
> 
>> 3. Move literal_call to the func's address range, for example:
>>
>>          a. Compile with -fpatchable-function-entry=7
>>          func:
>>                  BTI C
>>                  NOP
>>                  NOP
>>                  NOP
>>                  NOP
>>                  NOP
>>                  NOP
>>                  NOP
> 
> This is a non-starter. We are not going to add 7 NOPs at the start of every
> function.
> 

Looks like we could just add 3 NOPs to the function entry, like this:

1. At startup, or when nothing is attached, patch the callsite to:

         literal:
                 .quad dummy_tramp
         func:
                 BTI C
                 MOV X9, LR
                 NOP
                 NOP
                 ...

2. When the target is in range, patch the callsite to

         literal:
                 .quad dummy_tramp
         func:
                 BTI C
                 MOV X9, LR
                 NOP
                 BL custom_trampoline
                 ...


3. When the target is out of range, patch the callsite to

         literal:
                 .quad custom_trampoline
         func:
                 BTI C
                 MOV X9, LR
                 LDR X16, literal
                 BLR X16
                 ...
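
For the choice between layouts 2 and 3, the deciding factor is whether a
single BL can reach the trampoline; an AArch64 BL immediate covers roughly
+/-128MiB from the branch. A hedged sketch of that decision (the helper names
and the enum are invented here, not taken from the posted patches):

#include <stdbool.h>

#define SZ_128M         (128UL * 1024 * 1024)

enum patch_kind {
        PATCH_DIRECT_BL,        /* layout 2: NOP + BL custom_trampoline  */
        PATCH_LITERAL_BLR,      /* layout 3: LDR X16, literal + BLR X16  */
};

/* can a BL placed at @bl_pc reach @target directly? */
static bool target_in_bl_range(unsigned long bl_pc, unsigned long target)
{
        long offset = (long)target - (long)bl_pc;

        return offset >= -(long)SZ_128M && offset < (long)SZ_128M;
}

enum patch_kind pick_patch_kind(unsigned long bl_pc, unsigned long tramp)
{
        return target_in_bl_range(bl_pc, tramp) ? PATCH_DIRECT_BL
                                                : PATCH_LITERAL_BLR;
}

Presumably the dummy_tramp literal in layout 1 is there so that a caller
racing with repatching never loads a garbage target through the LDR/BLR
path; the sketch above only covers the range decision, not that ordering.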


> Thanks,
> Mark.
> 
> .
Xu Kuohai Oct. 6, 2022, 10:09 a.m. UTC | #14
On 10/5/2022 11:30 PM, Steven Rostedt wrote:
> On Wed, 5 Oct 2022 17:10:33 +0200
> Florent Revest <revest@chromium.org> wrote:
> 
>> On Wed, Oct 5, 2022 at 5:07 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>>>
>>> On Wed, 5 Oct 2022 22:54:15 +0800
>>> Xu Kuohai <xukuohai@huawei.com> wrote:
>>>   
>>>> 1.3 attach bpf prog with direct call, bpftrace -e 'kfunc:vfs_write {}'
>>>>
>>>> # dd if=/dev/zero of=/dev/null count=1000000
>>>> 1000000+0 records in
>>>> 1000000+0 records out
>>>> 512000000 bytes (512 MB, 488 MiB) copied, 1.72973 s, 296 MB/s
>>>>
>>>>
>>>> 1.4 attach bpf prog with indirect call, bpftrace -e 'kfunc:vfs_write {}'
>>>>
>>>> # dd if=/dev/zero of=/dev/null count=1000000
>>>> 1000000+0 records in
>>>> 1000000+0 records out
>>>> 512000000 bytes (512 MB, 488 MiB) copied, 1.99179 s, 257 MB/s
>>
>> Thanks for the measurements Xu!
>>
>>> Can you show the implementation of the indirect call you used?
>>
>> Xu used my development branch here
>> https://github.com/FlorentRevest/linux/commits/fprobe-min-args
> 
> That looks like it could be optimized quite a bit too.
> 
> Specifically this part:
> 
> static bool bpf_fprobe_entry(struct fprobe *fp, unsigned long ip, struct ftrace_regs *regs, void *private)
> {
> 	struct bpf_fprobe_call_context *call_ctx = private;
> 	struct bpf_fprobe_context *fprobe_ctx = fp->ops.private;
> 	struct bpf_tramp_links *links = fprobe_ctx->links;
> 	struct bpf_tramp_links *fentry = &links[BPF_TRAMP_FENTRY];
> 	struct bpf_tramp_links *fmod_ret = &links[BPF_TRAMP_MODIFY_RETURN];
> 	struct bpf_tramp_links *fexit = &links[BPF_TRAMP_FEXIT];
> 	int i, ret;
> 
> 	memset(&call_ctx->ctx, 0, sizeof(call_ctx->ctx));
> 	call_ctx->ip = ip;
> 	for (i = 0; i < fprobe_ctx->nr_args; i++)
> 		call_ctx->args[i] = ftrace_regs_get_argument(regs, i);
> 
> 	for (i = 0; i < fentry->nr_links; i++)
> 		call_bpf_prog(fentry->links[i], &call_ctx->ctx, call_ctx->args);
> 
> 	call_ctx->args[fprobe_ctx->nr_args] = 0;
> 	for (i = 0; i < fmod_ret->nr_links; i++) {
> 		ret = call_bpf_prog(fmod_ret->links[i], &call_ctx->ctx,
> 				      call_ctx->args);
> 
> 		if (ret) {
> 			ftrace_regs_set_return_value(regs, ret);
> 			ftrace_override_function_with_return(regs);
> 
> 			bpf_fprobe_exit(fp, ip, regs, private);
> 			return false;
> 		}
> 	}
> 
> 	return fexit->nr_links;
> }
> 
> There's a lot of low hanging fruit to speed up there. I wouldn't be too
> fast to throw out this solution if it hasn't had the care that direct calls
> have had to speed that up.
> 
> For example, trampolines currently only allow attaching to functions with 6
> parameters or less (3 on x86_32). You could make 7 specific callbacks, with
> zero to 6 parameters, and unroll the argument loop.
> 
> Would also be interesting to run perf to see where the overhead is. There
> may be other locations to work on to make it almost as fast as direct
> callers without the other baggage.
> 

There is something wrong with my pi4 perf; I'll send the perf report after
I fix it.

> -- Steve
> 
>>
>> As it stands, the performance impact of the fprobe based
>> implementation would be too high for us. I wonder how much Mark's idea
>> here https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/per-callsite-ops
>> would help but it doesn't work right now.
> 
> 
> .
Florent Revest Oct. 6, 2022, 4:19 p.m. UTC | #15
On Wed, Oct 5, 2022 at 5:30 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed, 5 Oct 2022 17:10:33 +0200
> Florent Revest <revest@chromium.org> wrote:
>
> > On Wed, Oct 5, 2022 at 5:07 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> > >
> > > Can you show the implementation of the indirect call you used?
> >
> > Xu used my development branch here
> > https://github.com/FlorentRevest/linux/commits/fprobe-min-args
>
> That looks like it could be optimized quite a bit too.
>
> Specifically this part:
>
> static bool bpf_fprobe_entry(struct fprobe *fp, unsigned long ip, struct ftrace_regs *regs, void *private)
> {
>         struct bpf_fprobe_call_context *call_ctx = private;
>         struct bpf_fprobe_context *fprobe_ctx = fp->ops.private;
>         struct bpf_tramp_links *links = fprobe_ctx->links;
>         struct bpf_tramp_links *fentry = &links[BPF_TRAMP_FENTRY];
>         struct bpf_tramp_links *fmod_ret = &links[BPF_TRAMP_MODIFY_RETURN];
>         struct bpf_tramp_links *fexit = &links[BPF_TRAMP_FEXIT];
>         int i, ret;
>
>         memset(&call_ctx->ctx, 0, sizeof(call_ctx->ctx));
>         call_ctx->ip = ip;
>         for (i = 0; i < fprobe_ctx->nr_args; i++)
>                 call_ctx->args[i] = ftrace_regs_get_argument(regs, i);
>
>         for (i = 0; i < fentry->nr_links; i++)
>                 call_bpf_prog(fentry->links[i], &call_ctx->ctx, call_ctx->args);
>
>         call_ctx->args[fprobe_ctx->nr_args] = 0;
>         for (i = 0; i < fmod_ret->nr_links; i++) {
>                 ret = call_bpf_prog(fmod_ret->links[i], &call_ctx->ctx,
>                                       call_ctx->args);
>
>                 if (ret) {
>                         ftrace_regs_set_return_value(regs, ret);
>                         ftrace_override_function_with_return(regs);
>
>                         bpf_fprobe_exit(fp, ip, regs, private);
>                         return false;
>                 }
>         }
>
>         return fexit->nr_links;
> }
>
> There's a lot of low hanging fruit to speed up there. I wouldn't be too
> fast to throw out this solution if it hasn't had the care that direct calls
> have had to speed that up.
>
> For example, trampolines currently only allow attaching to functions with 6
> parameters or less (3 on x86_32). You could make 7 specific callbacks, with
> zero to 6 parameters, and unroll the argument loop.

Sure, we can give this a try: I'll work on a macro that generates the
7 callbacks and we can check how much that helps. My belief right now
is that ftrace's iteration over all ops on arm64 is where we lose most
time, but now that we have numbers it's pretty easy to check the hypothesis
:)
Steven Rostedt Oct. 6, 2022, 4:29 p.m. UTC | #16
On Thu, 6 Oct 2022 18:19:12 +0200
Florent Revest <revest@chromium.org> wrote:

> Sure, we can give this a try: I'll work on a macro that generates the
> 7 callbacks and we can check how much that helps. My belief right now
> is that ftrace's iteration over all ops on arm64 is where we lose most
> time, but now that we have numbers it's pretty easy to check the hypothesis
> :)

Ah, I forgot that's what Mark's code is doing. But yes, that needs to be
fixed first. I forget that arm64 doesn't have the dedicated trampolines yet.

So, let's hold off until that is complete.

-- Steve
Florent Revest Oct. 6, 2022, 4:35 p.m. UTC | #17
On Thu, Oct 6, 2022 at 12:12 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Wed, Oct 05, 2022 at 11:30:19AM -0400, Steven Rostedt wrote:
> > On Wed, 5 Oct 2022 17:10:33 +0200
> > Florent Revest <revest@chromium.org> wrote:
> >
> > > On Wed, Oct 5, 2022 at 5:07 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> > > >
> > > > On Wed, 5 Oct 2022 22:54:15 +0800
> > > > Xu Kuohai <xukuohai@huawei.com> wrote:
> > > >
> > > > > 1.3 attach bpf prog with direct call, bpftrace -e 'kfunc:vfs_write {}'
> > > > >
> > > > > # dd if=/dev/zero of=/dev/null count=1000000
> > > > > 1000000+0 records in
> > > > > 1000000+0 records out
> > > > > 512000000 bytes (512 MB, 488 MiB) copied, 1.72973 s, 296 MB/s
> > > > >
> > > > >
> > > > > 1.4 attach bpf prog with indirect call, bpftrace -e 'kfunc:vfs_write {}'
> > > > >
> > > > > # dd if=/dev/zero of=/dev/null count=1000000
> > > > > 1000000+0 records in
> > > > > 1000000+0 records out
> > > > > 512000000 bytes (512 MB, 488 MiB) copied, 1.99179 s, 257 MB/s
> > >
> > > Thanks for the measurements Xu!
> > >
> > > > Can you show the implementation of the indirect call you used?
> > >
> > > Xu used my development branch here
> > > https://github.com/FlorentRevest/linux/commits/fprobe-min-args
>
> nice :) I guess you did not try to run it on x86, I had to add some small
> changes and disable HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS to compile it

Indeed, I haven't tried building on x86 yet, I'll have a look at what
I broke, thanks. :)
That branch is just an outline of the idea at this point anyway. Just
enough for performance measurements, not particularly ready for
review.

> >
> > That looks like it could be optimized quite a bit too.
> >
> > Specifically this part:
> >
> > static bool bpf_fprobe_entry(struct fprobe *fp, unsigned long ip, struct ftrace_regs *regs, void *private)
> > {
> >       struct bpf_fprobe_call_context *call_ctx = private;
> >       struct bpf_fprobe_context *fprobe_ctx = fp->ops.private;
> >       struct bpf_tramp_links *links = fprobe_ctx->links;
> >       struct bpf_tramp_links *fentry = &links[BPF_TRAMP_FENTRY];
> >       struct bpf_tramp_links *fmod_ret = &links[BPF_TRAMP_MODIFY_RETURN];
> >       struct bpf_tramp_links *fexit = &links[BPF_TRAMP_FEXIT];
> >       int i, ret;
> >
> >       memset(&call_ctx->ctx, 0, sizeof(call_ctx->ctx));
> >       call_ctx->ip = ip;
> >       for (i = 0; i < fprobe_ctx->nr_args; i++)
> >               call_ctx->args[i] = ftrace_regs_get_argument(regs, i);
> >
> >       for (i = 0; i < fentry->nr_links; i++)
> >               call_bpf_prog(fentry->links[i], &call_ctx->ctx, call_ctx->args);
> >
> >       call_ctx->args[fprobe_ctx->nr_args] = 0;
> >       for (i = 0; i < fmod_ret->nr_links; i++) {
> >               ret = call_bpf_prog(fmod_ret->links[i], &call_ctx->ctx,
> >                                     call_ctx->args);
> >
> >               if (ret) {
> >                       ftrace_regs_set_return_value(regs, ret);
> >                       ftrace_override_function_with_return(regs);
> >
> >                       bpf_fprobe_exit(fp, ip, regs, private);
> >                       return false;
> >               }
> >       }
> >
> >       return fexit->nr_links;
> > }
> >
> > There's a lot of low hanging fruit to speed up there. I wouldn't be too
> > fast to throw out this solution if it hasn't had the care that direct calls
> > have had to speed that up.
> >
> > For example, trampolines currently only allow attaching to functions with 6
> > parameters or less (3 on x86_32). You could make 7 specific callbacks, with
> > zero to 6 parameters, and unroll the argument loop.
> >
> > Would also be interesting to run perf to see where the overhead is. There
> > may be other locations to work on to make it almost as fast as direct
> > callers without the other baggage.
>
> I can boot the change and run tests in qemu but for some reason it
> won't boot on hw, so I have just the perf report from qemu so far

Oh, ok, that's interesting. The changes look pretty benign (only
fprobe and arm64 specific code), so I'm curious how that would break
the boot :p

>
> there's fprobe/rethook machinery showing up as expected
>
> jirka
>
>
> ---
> # To display the perf.data header info, please use --header/--header-only options.
> #
> #
> # Total Lost Samples: 0
> #
> # Samples: 23K of event 'cpu-clock:k'
> # Event count (approx.): 5841250000
> #
> # Overhead  Command  Shared Object                                   Symbol
> # ........  .......  ..............................................  ..................................................
> #
>     18.65%  bench    [kernel.kallsyms]                               [k] syscall_enter_from_user_mode
>             |
>             ---syscall_enter_from_user_mode
>                do_syscall_64
>                entry_SYSCALL_64_after_hwframe
>                syscall
>
>     13.03%  bench    [kernel.kallsyms]                               [k] seqcount_lockdep_reader_access.constprop.0
>             |
>             ---seqcount_lockdep_reader_access.constprop.0
>                ktime_get_coarse_real_ts64
>                syscall_trace_enter.constprop.0
>                do_syscall_64
>                entry_SYSCALL_64_after_hwframe
>                syscall
>
>      9.49%  bench    [kernel.kallsyms]                               [k] rethook_try_get
>             |
>             ---rethook_try_get
>                fprobe_handler
>                ftrace_trampoline
>                __x64_sys_getpgid
>                do_syscall_64
>                entry_SYSCALL_64_after_hwframe
>                syscall
>
>      8.71%  bench    [kernel.kallsyms]                               [k] rethook_recycle
>             |
>             ---rethook_recycle
>                fprobe_handler
>                ftrace_trampoline
>                __x64_sys_getpgid
>                do_syscall_64
>                entry_SYSCALL_64_after_hwframe
>                syscall
>
>      4.31%  bench    [kernel.kallsyms]                               [k] rcu_is_watching
>             |
>             ---rcu_is_watching
>                |
>                |--1.49%--rethook_try_get
>                |          fprobe_handler
>                |          ftrace_trampoline
>                |          __x64_sys_getpgid
>                |          do_syscall_64
>                |          entry_SYSCALL_64_after_hwframe
>                |          syscall
>                |
>                |--1.10%--do_getpgid
>                |          __x64_sys_getpgid
>                |          do_syscall_64
>                |          entry_SYSCALL_64_after_hwframe
>                |          syscall
>                |
>                |--1.02%--__bpf_prog_exit
>                |          call_bpf_prog.isra.0
>                |          bpf_fprobe_entry
>                |          fprobe_handler
>                |          ftrace_trampoline
>                |          __x64_sys_getpgid
>                |          do_syscall_64
>                |          entry_SYSCALL_64_after_hwframe
>                |          syscall
>                |
>                 --0.70%--__bpf_prog_enter
>                           call_bpf_prog.isra.0
>                           bpf_fprobe_entry
>                           fprobe_handler
>                           ftrace_trampoline
>                           __x64_sys_getpgid
>                           do_syscall_64
>                           entry_SYSCALL_64_after_hwframe
>                           syscall
>
>      2.94%  bench    [kernel.kallsyms]                               [k] lock_release
>             |
>             ---lock_release
>                |
>                |--1.51%--call_bpf_prog.isra.0
>                |          bpf_fprobe_entry
>                |          fprobe_handler
>                |          ftrace_trampoline
>                |          __x64_sys_getpgid
>                |          do_syscall_64
>                |          entry_SYSCALL_64_after_hwframe
>                |          syscall
>                |
>                 --1.43%--do_getpgid
>                           __x64_sys_getpgid
>                           do_syscall_64
>                           entry_SYSCALL_64_after_hwframe
>                           syscall
>
>      2.91%  bench    bpf_prog_21856463590f61f1_bench_trigger_fentry  [k] bpf_prog_21856463590f61f1_bench_trigger_fentry
>             |
>             ---bpf_prog_21856463590f61f1_bench_trigger_fentry
>                |
>                 --2.66%--call_bpf_prog.isra.0
>                           bpf_fprobe_entry
>                           fprobe_handler
>                           ftrace_trampoline
>                           __x64_sys_getpgid
>                           do_syscall_64
>                           entry_SYSCALL_64_after_hwframe
>                           syscall
>
>      2.69%  bench    [kernel.kallsyms]                               [k] bpf_fprobe_entry
>             |
>             ---bpf_fprobe_entry
>                fprobe_handler
>                ftrace_trampoline
>                __x64_sys_getpgid
>                do_syscall_64
>                entry_SYSCALL_64_after_hwframe
>                syscall
>
>      2.60%  bench    [kernel.kallsyms]                               [k] lock_acquire
>             |
>             ---lock_acquire
>                |
>                |--1.34%--__bpf_prog_enter
>                |          call_bpf_prog.isra.0
>                |          bpf_fprobe_entry
>                |          fprobe_handler
>                |          ftrace_trampoline
>                |          __x64_sys_getpgid
>                |          do_syscall_64
>                |          entry_SYSCALL_64_after_hwframe
>                |          syscall
>                |
>                 --1.24%--do_getpgid
>                           __x64_sys_getpgid
>                           do_syscall_64
>                           entry_SYSCALL_64_after_hwframe
>                           syscall
>
>      2.42%  bench    [kernel.kallsyms]                               [k] syscall_exit_to_user_mode_prepare
>             |
>             ---syscall_exit_to_user_mode_prepare
>                syscall_exit_to_user_mode
>                do_syscall_64
>                entry_SYSCALL_64_after_hwframe
>                syscall
>
>      2.37%  bench    [kernel.kallsyms]                               [k] __audit_syscall_entry
>             |
>             ---__audit_syscall_entry
>                syscall_trace_enter.constprop.0
>                do_syscall_64
>                entry_SYSCALL_64_after_hwframe
>                |
>                 --2.36%--syscall
>
>      2.35%  bench    [kernel.kallsyms]                               [k] syscall_trace_enter.constprop.0
>             |
>             ---syscall_trace_enter.constprop.0
>                do_syscall_64
>                entry_SYSCALL_64_after_hwframe
>                syscall
>
>      2.12%  bench    [kernel.kallsyms]                               [k] check_preemption_disabled
>             |
>             ---check_preemption_disabled
>                |
>                 --1.55%--rcu_is_watching
>                           |
>                            --0.59%--do_getpgid
>                                      __x64_sys_getpgid
>                                      do_syscall_64
>                                      entry_SYSCALL_64_after_hwframe
>                                      syscall
>
>      2.00%  bench    [kernel.kallsyms]                               [k] fprobe_handler
>             |
>             ---fprobe_handler
>                ftrace_trampoline
>                __x64_sys_getpgid
>                do_syscall_64
>                entry_SYSCALL_64_after_hwframe
>                syscall
>
>      1.94%  bench    [kernel.kallsyms]                               [k] local_irq_disable_exit_to_user
>             |
>             ---local_irq_disable_exit_to_user
>                syscall_exit_to_user_mode
>                do_syscall_64
>                entry_SYSCALL_64_after_hwframe
>                syscall
>
>      1.84%  bench    [kernel.kallsyms]                               [k] rcu_read_lock_sched_held
>             |
>             ---rcu_read_lock_sched_held
>                |
>                |--0.93%--lock_acquire
>                |
>                 --0.90%--lock_release
>
>      1.71%  bench    [kernel.kallsyms]                               [k] migrate_enable
>             |
>             ---migrate_enable
>                __bpf_prog_exit
>                call_bpf_prog.isra.0
>                bpf_fprobe_entry
>                fprobe_handler
>                ftrace_trampoline
>                __x64_sys_getpgid
>                do_syscall_64
>                entry_SYSCALL_64_after_hwframe
>                syscall
>
>      1.66%  bench    [kernel.kallsyms]                               [k] call_bpf_prog.isra.0
>             |
>             ---call_bpf_prog.isra.0
>                bpf_fprobe_entry
>                fprobe_handler
>                ftrace_trampoline
>                __x64_sys_getpgid
>                do_syscall_64
>                entry_SYSCALL_64_after_hwframe
>                syscall
>
>      1.53%  bench    [kernel.kallsyms]                               [k] __rcu_read_unlock
>             |
>             ---__rcu_read_unlock
>                |
>                |--0.86%--__bpf_prog_exit
>                |          call_bpf_prog.isra.0
>                |          bpf_fprobe_entry
>                |          fprobe_handler
>                |          ftrace_trampoline
>                |          __x64_sys_getpgid
>                |          do_syscall_64
>                |          entry_SYSCALL_64_after_hwframe
>                |          syscall
>                |
>                 --0.66%--do_getpgid
>                           __x64_sys_getpgid
>                           do_syscall_64
>                           entry_SYSCALL_64_after_hwframe
>                           syscall
>
>      1.31%  bench    [kernel.kallsyms]                               [k] debug_smp_processor_id
>             |
>             ---debug_smp_processor_id
>                |
>                 --0.77%--rcu_is_watching
>
>      1.22%  bench    [kernel.kallsyms]                               [k] migrate_disable
>             |
>             ---migrate_disable
>                __bpf_prog_enter
>                call_bpf_prog.isra.0
>                bpf_fprobe_entry
>                fprobe_handler
>                ftrace_trampoline
>                __x64_sys_getpgid
>                do_syscall_64
>                entry_SYSCALL_64_after_hwframe
>                syscall
>
>      1.19%  bench    [kernel.kallsyms]                               [k] __bpf_prog_enter
>             |
>             ---__bpf_prog_enter
>                call_bpf_prog.isra.0
>                bpf_fprobe_entry
>                fprobe_handler
>                ftrace_trampoline
>                __x64_sys_getpgid
>                do_syscall_64
>                entry_SYSCALL_64_after_hwframe
>                syscall
>
>      0.84%  bench    [kernel.kallsyms]                               [k] __radix_tree_lookup
>             |
>             ---__radix_tree_lookup
>                find_task_by_pid_ns
>                do_getpgid
>                __x64_sys_getpgid
>                do_syscall_64
>                entry_SYSCALL_64_after_hwframe
>                syscall
>
>      0.82%  bench    [kernel.kallsyms]                               [k] do_getpgid
>             |
>             ---do_getpgid
>                __x64_sys_getpgid
>                do_syscall_64
>                entry_SYSCALL_64_after_hwframe
>                syscall
>
>      0.78%  bench    [kernel.kallsyms]                               [k] debug_lockdep_rcu_enabled
>             |
>             ---debug_lockdep_rcu_enabled
>                |
>                 --0.63%--rcu_read_lock_sched_held
>
>      0.74%  bench    ftrace_trampoline                               [k] ftrace_trampoline
>             |
>             ---ftrace_trampoline
>                __x64_sys_getpgid
>                do_syscall_64
>                entry_SYSCALL_64_after_hwframe
>                syscall
>
>      0.72%  bench    [kernel.kallsyms]                               [k] preempt_count_add
>             |
>             ---preempt_count_add
>
>      0.71%  bench    [kernel.kallsyms]                               [k] ktime_get_coarse_real_ts64
>             |
>             ---ktime_get_coarse_real_ts64
>                syscall_trace_enter.constprop.0
>                do_syscall_64
>                entry_SYSCALL_64_after_hwframe
>                syscall
>
>      0.69%  bench    [kernel.kallsyms]                               [k] do_syscall_64
>             |
>             ---do_syscall_64
>                entry_SYSCALL_64_after_hwframe
>                |
>                 --0.68%--syscall
>
>      0.60%  bench    [kernel.kallsyms]                               [k] preempt_count_sub
>             |
>             ---preempt_count_sub
>
>      0.59%  bench    [kernel.kallsyms]                               [k] __rcu_read_lock
>             |
>             ---__rcu_read_lock
>
>      0.59%  bench    [kernel.kallsyms]                               [k] __x64_sys_getpgid
>             |
>             ---__x64_sys_getpgid
>                do_syscall_64
>                entry_SYSCALL_64_after_hwframe
>                syscall
>
>      0.58%  bench    [kernel.kallsyms]                               [k] __audit_syscall_exit
>             |
>             ---__audit_syscall_exit
>                syscall_exit_to_user_mode_prepare
>                syscall_exit_to_user_mode
>                do_syscall_64
>                entry_SYSCALL_64_after_hwframe
>                syscall
>
>      0.53%  bench    [kernel.kallsyms]                               [k] audit_reset_context
>             |
>             ---audit_reset_context
>                syscall_exit_to_user_mode_prepare
>                syscall_exit_to_user_mode
>                do_syscall_64
>                entry_SYSCALL_64_after_hwframe
>                syscall
>
>      0.45%  bench    [kernel.kallsyms]                               [k] rcu_read_lock_held
>      0.36%  bench    [kernel.kallsyms]                               [k] find_task_by_vpid
>      0.32%  bench    [kernel.kallsyms]                               [k] __bpf_prog_exit
>      0.26%  bench    [kernel.kallsyms]                               [k] syscall_exit_to_user_mode
>      0.20%  bench    [kernel.kallsyms]                               [k] idr_find
>      0.18%  bench    [kernel.kallsyms]                               [k] find_task_by_pid_ns
>      0.17%  bench    [kernel.kallsyms]                               [k] update_prog_stats
>      0.16%  bench    [kernel.kallsyms]                               [k] _raw_spin_unlock_irqrestore
>      0.14%  bench    [kernel.kallsyms]                               [k] pid_task
>      0.04%  bench    [kernel.kallsyms]                               [k] memchr_inv
>      0.04%  bench    [kernel.kallsyms]                               [k] smp_call_function_many_cond
>      0.03%  bench    [kernel.kallsyms]                               [k] do_user_addr_fault
>      0.03%  bench    [kernel.kallsyms]                               [k] kallsyms_expand_symbol.constprop.0
>      0.03%  bench    [kernel.kallsyms]                               [k] native_flush_tlb_global
>      0.03%  bench    [kernel.kallsyms]                               [k] __change_page_attr_set_clr
>      0.02%  bench    [kernel.kallsyms]                               [k] memcpy_erms
>      0.02%  bench    [kernel.kallsyms]                               [k] unwind_next_frame
>      0.02%  bench    [kernel.kallsyms]                               [k] copy_user_enhanced_fast_string
>      0.01%  bench    [kernel.kallsyms]                               [k] __orc_find
>      0.01%  bench    [kernel.kallsyms]                               [k] call_rcu
>      0.01%  bench    [kernel.kallsyms]                               [k] __alloc_pages
>      0.01%  bench    [kernel.kallsyms]                               [k] __purge_vmap_area_lazy
>      0.01%  bench    [kernel.kallsyms]                               [k] __softirqentry_text_start
>      0.01%  bench    [kernel.kallsyms]                               [k] __stack_depot_save
>      0.01%  bench    [kernel.kallsyms]                               [k] __up_read
>      0.01%  bench    [kernel.kallsyms]                               [k] __virt_addr_valid
>      0.01%  bench    [kernel.kallsyms]                               [k] clear_page_erms
>      0.01%  bench    [kernel.kallsyms]                               [k] deactivate_slab
>      0.01%  bench    [kernel.kallsyms]                               [k] do_check_common
>      0.01%  bench    [kernel.kallsyms]                               [k] finish_task_switch.isra.0
>      0.01%  bench    [kernel.kallsyms]                               [k] free_unref_page_list
>      0.01%  bench    [kernel.kallsyms]                               [k] ftrace_rec_iter_next
>      0.01%  bench    [kernel.kallsyms]                               [k] handle_mm_fault
>      0.01%  bench    [kernel.kallsyms]                               [k] orc_find.part.0
>      0.01%  bench    [kernel.kallsyms]                               [k] try_charge_memcg
>      0.00%  bench    [kernel.kallsyms]                               [k] ___slab_alloc
>      0.00%  bench    [kernel.kallsyms]                               [k] __fdget_pos
>      0.00%  bench    [kernel.kallsyms]                               [k] __handle_mm_fault
>      0.00%  bench    [kernel.kallsyms]                               [k] __is_insn_slot_addr
>      0.00%  bench    [kernel.kallsyms]                               [k] __kmalloc
>      0.00%  bench    [kernel.kallsyms]                               [k] __mod_lruvec_page_state
>      0.00%  bench    [kernel.kallsyms]                               [k] __mod_node_page_state
>      0.00%  bench    [kernel.kallsyms]                               [k] __mutex_lock
>      0.00%  bench    [kernel.kallsyms]                               [k] __raw_spin_lock_init
>      0.00%  bench    [kernel.kallsyms]                               [k] alloc_vmap_area
>      0.00%  bench    [kernel.kallsyms]                               [k] allocate_slab
>      0.00%  bench    [kernel.kallsyms]                               [k] audit_get_tty
>      0.00%  bench    [kernel.kallsyms]                               [k] bpf_ksym_find
>      0.00%  bench    [kernel.kallsyms]                               [k] btf_check_all_metas
>      0.00%  bench    [kernel.kallsyms]                               [k] btf_put
>      0.00%  bench    [kernel.kallsyms]                               [k] cmpxchg_double_slab.constprop.0.isra.0
>      0.00%  bench    [kernel.kallsyms]                               [k] do_fault
>      0.00%  bench    [kernel.kallsyms]                               [k] do_raw_spin_trylock
>      0.00%  bench    [kernel.kallsyms]                               [k] find_vma
>      0.00%  bench    [kernel.kallsyms]                               [k] fs_reclaim_release
>      0.00%  bench    [kernel.kallsyms]                               [k] ftrace_check_record
>      0.00%  bench    [kernel.kallsyms]                               [k] ftrace_replace_code
>      0.00%  bench    [kernel.kallsyms]                               [k] get_mem_cgroup_from_mm
>      0.00%  bench    [kernel.kallsyms]                               [k] get_page_from_freelist
>      0.00%  bench    [kernel.kallsyms]                               [k] in_gate_area_no_mm
>      0.00%  bench    [kernel.kallsyms]                               [k] in_task_stack
>      0.00%  bench    [kernel.kallsyms]                               [k] kernel_text_address
>      0.00%  bench    [kernel.kallsyms]                               [k] kernfs_fop_read_iter
>      0.00%  bench    [kernel.kallsyms]                               [k] kernfs_put_active
>      0.00%  bench    [kernel.kallsyms]                               [k] kfree
>      0.00%  bench    [kernel.kallsyms]                               [k] kmem_cache_alloc
>      0.00%  bench    [kernel.kallsyms]                               [k] ksys_read
>      0.00%  bench    [kernel.kallsyms]                               [k] lookup_address_in_pgd
>      0.00%  bench    [kernel.kallsyms]                               [k] mlock_page_drain_local
>      0.00%  bench    [kernel.kallsyms]                               [k] page_remove_rmap
>      0.00%  bench    [kernel.kallsyms]                               [k] post_alloc_hook
>      0.00%  bench    [kernel.kallsyms]                               [k] preempt_schedule_irq
>      0.00%  bench    [kernel.kallsyms]                               [k] queue_work_on
>      0.00%  bench    [kernel.kallsyms]                               [k] stack_trace_save
>      0.00%  bench    [kernel.kallsyms]                               [k] within_error_injection_list
>
>
> #
> # (Tip: To record callchains for each sample: perf record -g)
> #
>

Thanks for the measurements Jiri! :) At this point, my hypothesis is
that the biggest part of the performance hit comes from arm64-specific
code in ftrace, so I would rather wait to see what Xu finds out on his
Pi 4. Also, I found an arm64 board today, so I should soon be able to
take measurements there too.
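
To make the hypothesis concrete, the difference I have in mind is roughly
the one in the sketch below (a standalone userspace model with invented
names, not the actual kernel code): without direct calls, every traced
callsite goes through one generic callback that walks the whole ftrace_ops
list, whereas a direct call branches straight to the single trampoline
that cares about this callsite.

#include <stdio.h>

/* Simplified stand-in for struct ftrace_ops. */
struct fake_ops {
	void (*func)(unsigned long ip);
	struct fake_ops *next;
};

static void bpf_trampoline_stub(unsigned long ip)
{
	printf("bpf trampoline for ip %#lx\n", ip);
}

static struct fake_ops bpf_op = { .func = bpf_trampoline_stub };
static struct fake_ops *ops_list = &bpf_op;

/* What the generic ops-list func roughly does: walk and call every op. */
static void list_dispatch(unsigned long ip)
{
	struct fake_ops *op;

	for (op = ops_list; op; op = op->next)
		op->func(ip);
}

/* What a direct call buys: the patched callsite branches straight here. */
static void direct_dispatch(unsigned long ip)
{
	bpf_trampoline_stub(ip);
}

int main(void)
{
	list_dispatch(0x1234UL);
	direct_dispatch(0x1234UL);
	return 0;
}
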
Xu Kuohai Oct. 7, 2022, 10:13 a.m. UTC | #18
On 10/7/2022 12:29 AM, Steven Rostedt wrote:
> On Thu, 6 Oct 2022 18:19:12 +0200
> Florent Revest <revest@chromium.org> wrote:
> 
>> Sure, we can give this a try; I'll work on a macro that generates the
>> 7 callbacks and we can check how much that helps. My belief right now
>> is that ftrace's iteration over all ops on arm64 is where we lose most
>> time, but now that we have numbers it's pretty easy to check
>> hypotheses :)
> 
> Ah, I forgot that's what Mark's code is doing. But yes, that needs to be
> fixed first. I forget that arm64 doesn't have the dedicated trampolines yet.
> 
> So, let's hold off until that is complete.
> 
> -- Steve
> 


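(For context, my understanding of the macro-generated callbacks Florent
mentions is something like the userspace sketch below, i.e. one callback
per supported argument count so that the per-call copy loop gets a
compile-time bound. This is only a guess at the shape, with made-up
names, not the actual patch.)

#include <stdio.h>

/* Toy register file standing in for ftrace_regs. */
struct fake_regs {
	unsigned long regs[8];
};

static void run_bpf_prog(unsigned long *args, int nr_args)
{
	int i;

	for (i = 0; i < nr_args; i++)
		printf("arg%d = %lu\n", i, args[i]);
}

/*
 * Generate one entry callback per argument count. Each callback copies
 * a fixed number of arguments, so the compiler can unroll the copy and
 * no runtime nr_args lookup is needed on the hot path.
 */
#define DEFINE_ENTRY_CB(N)						\
static void entry_cb_##N(struct fake_regs *regs)			\
{									\
	unsigned long args[(N) ? (N) : 1];				\
	int i;								\
									\
	for (i = 0; i < (N); i++)					\
		args[i] = regs->regs[i];				\
	run_bpf_prog(args, (N));					\
}

DEFINE_ENTRY_CB(0)
DEFINE_ENTRY_CB(1)
DEFINE_ENTRY_CB(2)
DEFINE_ENTRY_CB(3)
DEFINE_ENTRY_CB(4)
DEFINE_ENTRY_CB(5)
DEFINE_ENTRY_CB(6)

/* The specialized callback would be picked once, at attach time. */
static void (* const entry_cbs[])(struct fake_regs *) = {
	entry_cb_0, entry_cb_1, entry_cb_2, entry_cb_3,
	entry_cb_4, entry_cb_5, entry_cb_6,
};

int main(void)
{
	struct fake_regs regs = { .regs = { 1, 2, 3, 4, 5, 6, 7, 8 } };

	entry_cbs[3](&regs);	/* e.g. a 3-argument traced function */
	return 0;
}
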
Here is the perf data I captured.

1. perf report

     99.94%     0.00%  ld-linux-aarch6  bench                                           [.] trigger_producer
             |
             ---trigger_producer
                |
                |--98.04%--syscall
                |          |
                |           --81.35%--el0t_64_sync
                |                     el0t_64_sync_handler
                |                     el0_svc
                |                     do_el0_svc
                |                     |
                |                     |--80.75%--el0_svc_common.constprop.0
                |                     |          |
                |                     |          |--49.70%--invoke_syscall
                |                     |          |          |
                |                     |          |           --46.66%--__arm64_sys_getpgid
                |                     |          |                     |
                |                     |          |                     |--40.73%--ftrace_call
                |                     |          |                     |          |
                |                     |          |                     |          |--38.71%--ftrace_ops_list_func
                |                     |          |                     |          |          |
                |                     |          |                     |          |          |--25.06%--fprobe_handler
                |                     |          |                     |          |          |          |
                |                     |          |                     |          |          |          |--13.20%--bpf_fprobe_entry
                |                     |          |                     |          |          |          |          |
                |                     |          |                     |          |          |          |           --11.47%--call_bpf_prog.isra.0
                |                     |          |                     |          |          |          |                     |
                |                     |          |                     |          |          |          |                     |--4.08%--__bpf_prog_exit
                |                     |          |                     |          |          |          |                     |          |
                |                     |          |                     |          |          |          |                     |           --0.87%--migrate_enable
                |                     |          |                     |          |          |          |                     |
                |                     |          |                     |          |          |          |                     |--2.46%--__bpf_prog_enter
                |                     |          |                     |          |          |          |                     |
                |                     |          |                     |          |          |          |                      --2.18%--bpf_prog_21856463590f61f1_bench_trigger_fentry
                |                     |          |                     |          |          |          |
                |                     |          |                     |          |          |          |--8.68%--rethook_trampoline_handler
                |                     |          |                     |          |          |          |
                |                     |          |                     |          |          |           --1.59%--rethook_try_get
                |                     |          |                     |          |          |                     |
                |                     |          |                     |          |          |                      --0.58%--rcu_is_watching
                |                     |          |                     |          |          |
                |                     |          |                     |          |          |--6.65%--rethook_trampoline_handler
                |                     |          |                     |          |          |
                |                     |          |                     |          |           --0.77%--rethook_recycle
                |                     |          |                     |          |
                |                     |          |                     |           --1.74%--hash_contains_ip.isra.0
                |                     |          |                     |
                |                     |          |                      --3.62%--find_task_by_vpid
                |                     |          |                                |
                |                     |          |                                 --2.75%--idr_find
                |                     |          |                                           |
                |                     |          |                                            --2.17%--__radix_tree_lookup
                |                     |          |
                |                     |           --1.30%--ftrace_caller
                |                     |
                |                      --0.60%--invoke_syscall
                |
                |--0.88%--0xffffb2807594
                |
                 --0.87%--syscall@plt


2. perf annotate

2.1 ftrace_caller

          : 39               SYM_CODE_START(ftrace_caller)
          : 40               bti     c
     0.00 :   ffff80000802e0c4:       bti     c
          :
          : 39               /* Save original SP */
          : 40               mov     x10, sp
     0.00 :   ffff80000802e0c8:       mov     x10, sp
          :
          : 42               /* Make room for pt_regs, plus two frame records */
          : 43               sub     sp, sp, #(FREGS_SIZE + 32)
     0.00 :   ffff80000802e0cc:       sub     sp, sp, #0x90
          :
          : 45               /* Save function arguments */
          : 46               stp     x0, x1, [sp, #FREGS_X0]
     0.00 :   ffff80000802e0d0:       stp     x0, x1, [sp]
          : 45               stp     x2, x3, [sp, #FREGS_X2]
     0.00 :   ffff80000802e0d4:       stp     x2, x3, [sp, #16]
          : 46               stp     x4, x5, [sp, #FREGS_X4]
    16.67 :   ffff80000802e0d8:       stp     x4, x5, [sp, #32] // entry-ftrace.S:46
          : 47               stp     x6, x7, [sp, #FREGS_X6]
     8.33 :   ffff80000802e0dc:       stp     x6, x7, [sp, #48] // entry-ftrace.S:47
          : 48               str     x8,     [sp, #FREGS_X8]
     0.00 :   ffff80000802e0e0:       str     x8, [sp, #64]
          :
          : 52               /* Save the callsite's FP, LR, SP */
          : 53               str     x29, [sp, #FREGS_FP]
     8.33 :   ffff80000802e0e4:       str     x29, [sp, #80] // entry-ftrace.S:51
          : 52               str     x9,  [sp, #FREGS_LR]
     8.33 :   ffff80000802e0e8:       str     x9, [sp, #88] // entry-ftrace.S:52
          : 53               str     x10, [sp, #FREGS_SP]
     0.00 :   ffff80000802e0ec:       str     x10, [sp, #96]
          :
          : 57               /* Save the PC after the ftrace callsite */
          : 58               str     x30, [sp, #FREGS_PC]
    16.67 :   ffff80000802e0f0:       str     x30, [sp, #104] // entry-ftrace.S:56
          :
          : 60               /* Create a frame record for the callsite above the ftrace regs */
          : 61               stp     x29, x9, [sp, #FREGS_SIZE + 16]
    16.67 :   ffff80000802e0f4:       stp     x29, x9, [sp, #128] // entry-ftrace.S:59
          : 60               add     x29, sp, #FREGS_SIZE + 16
     0.00 :   ffff80000802e0f8:       add     x29, sp, #0x80
          :
          : 64               /* Create our frame record above the ftrace regs */
          : 65               stp     x29, x30, [sp, #FREGS_SIZE]
    16.67 :   ffff80000802e0fc:       stp     x29, x30, [sp, #112] // entry-ftrace.S:63
          : 64               add     x29, sp, #FREGS_SIZE
     0.00 :   ffff80000802e100:       add     x29, sp, #0x70
          :
          : 67               sub     x0, x30, #AARCH64_INSN_SIZE     // ip (callsite's BL insn)
     0.00 :   ffff80000802e104:       sub     x0, x30, #0x4
          : 67               mov     x1, x9                          // parent_ip (callsite's LR)
     0.00 :   ffff80000802e108:       mov     x1, x9
          : 68               ldr_l   x2, function_trace_op           // op
     0.00 :   ffff80000802e10c:       adrp    x2, ffff800009638000 <folio_wait_table+0x14c0>
     0.00 :   ffff80000802e110:       ldr     x2, [x2, #3320]
          : 69               mov     x3, sp                          // regs
     0.00 :   ffff80000802e114:       mov     x3, sp
          :
          : 72               ffff80000802e118 <ftrace_call>:
          :
          : 73               SYM_INNER_LABEL(ftrace_call, SYM_L_GLOBAL)
          : 74               bl      ftrace_stub
     0.00 :   ffff80000802e118:       bl      ffff80000802e144 <ftrace_stub>
          : 80               * At the callsite x0-x8 and x19-x30 were live. Any C code will have preserved
          : 81               * x19-x29 per the AAPCS, and we created frame records upon entry, so we need
          : 82               * to restore x0-x8, x29, and x30.
          : 83               */
          : 84               /* Restore function arguments */
          : 85               ldp     x0, x1, [sp, #FREGS_X0]
     8.33 :   ffff80000802e11c:       ldp     x0, x1, [sp] // entry-ftrace.S:80
          : 81               ldp     x2, x3, [sp, #FREGS_X2]
     0.00 :   ffff80000802e120:       ldp     x2, x3, [sp, #16]
          : 82               ldp     x4, x5, [sp, #FREGS_X4]
     0.00 :   ffff80000802e124:       ldp     x4, x5, [sp, #32]
          : 83               ldp     x6, x7, [sp, #FREGS_X6]
     0.00 :   ffff80000802e128:       ldp     x6, x7, [sp, #48]
          : 84               ldr     x8,     [sp, #FREGS_X8]
     0.00 :   ffff80000802e12c:       ldr     x8, [sp, #64]
          :
          : 88               /* Restore the callsite's FP, LR, PC */
          : 89               ldr     x29, [sp, #FREGS_FP]
     0.00 :   ffff80000802e130:       ldr     x29, [sp, #80]
          : 88               ldr     x30, [sp, #FREGS_LR]
     0.00 :   ffff80000802e134:       ldr     x30, [sp, #88]
          : 89               ldr     x9,  [sp, #FREGS_PC]
     0.00 :   ffff80000802e138:       ldr     x9, [sp, #104]
          :
          : 93               /* Restore the callsite's SP */
          : 94               add     sp, sp, #FREGS_SIZE + 32
     0.00 :   ffff80000802e13c:       add     sp, sp, #0x90
          :
          : 95               ret     x9
     0.00 :   ffff80000802e140:       ret     x9


2.2 arch_ftrace_ops_list_func

          : 7554             void arch_ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
          : 7555             struct ftrace_ops *op, struct ftrace_regs *fregs)
          : 7556             {
     0.00 :   ffff80000815bdf0:       paciasp
     4.65 :   ffff80000815bdf4:       stp     x29, x30, [sp, #-144]! // ftrace.c:7551
     0.00 :   ffff80000815bdf8:       mrs     x2, sp_el0
     0.00 :   ffff80000815bdfc:       mov     x29, sp
     2.32 :   ffff80000815be00:       stp     x19, x20, [sp, #16]
     0.00 :   ffff80000815be04:       mov     x20, x1
          : 7563             trace_test_and_set_recursion():
          : 147              int start)
          : 148              {
          : 149              unsigned int val = READ_ONCE(current->trace_recursion);
          : 150              int bit;
          :
          : 152              bit = trace_get_context_bit() + start;
     0.00 :   ffff80000815be08:       mov     w5, #0x8                        // #8
          : 154              arch_ftrace_ops_list_func():
     0.00 :   ffff80000815be0c:       stp     x21, x22, [sp, #32]
     0.00 :   ffff80000815be10:       mov     x21, x3
     2.32 :   ffff80000815be14:       stp     x23, x24, [sp, #48]
     0.00 :   ffff80000815be18:       mov     x23, x0
     0.00 :   ffff80000815be1c:       ldr     x4, [x2, #1168]
     2.32 :   ffff80000815be20:       str     x4, [sp, #136]
     0.00 :   ffff80000815be24:       mov     x4, #0x0                        // #0
          : 7558             trace_test_and_set_recursion():
          : 148              if (unlikely(val & (1 << bit))) {
     0.00 :   ffff80000815be28:       mov     w2, #0x1                        // #1
          : 150              get_current():
          : 19               */
          : 20               static __always_inline struct task_struct *get_current(void)
          : 21               {
          : 22               unsigned long sp_el0;
          :
          : 24               asm ("mrs %0, sp_el0" : "=r" (sp_el0));
     0.00 :   ffff80000815be2c:       mrs     x4, sp_el0
          : 26               trace_test_and_set_recursion():
          : 144              unsigned int val = READ_ONCE(current->trace_recursion);
     0.00 :   ffff80000815be30:       ldr     x7, [x4, #2520]
          : 146              preempt_count():
          : 13               #define PREEMPT_NEED_RESCHED    BIT(32)
          : 14               #define PREEMPT_ENABLED (PREEMPT_NEED_RESCHED)
          :
          : 16               static inline int preempt_count(void)
          : 17               {
          : 18               return READ_ONCE(current_thread_info()->preempt.count);
     0.00 :   ffff80000815be34:       ldr     w6, [x4, #8]
          : 20               interrupt_context_level():
          : 94               static __always_inline unsigned char interrupt_context_level(void)
          : 95               {
          : 96               unsigned long pc = preempt_count();
          : 97               unsigned char level = 0;
          :
          : 99               level += !!(pc & (NMI_MASK));
     0.00 :   ffff80000815be38:       tst     w6, #0xf00000
          : 96               level += !!(pc & (NMI_MASK | HARDIRQ_MASK));
          : 97               level += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));
     0.00 :   ffff80000815be3c:       and     w1, w6, #0xffff00
          : 94               level += !!(pc & (NMI_MASK));
     0.00 :   ffff80000815be40:       cset    w4, ne  // ne = any
          : 96               level += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));
     0.00 :   ffff80000815be44:       and     w1, w1, #0xffff01ff
          : 95               level += !!(pc & (NMI_MASK | HARDIRQ_MASK));
     0.00 :   ffff80000815be48:       tst     w6, #0xff0000
     0.00 :   ffff80000815be4c:       cinc    w4, w4, ne      // ne = any
          : 96               level += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));
     0.00 :   ffff80000815be50:       cmp     w1, #0x0
          : 98               trace_get_context_bit():
          : 121              return TRACE_CTX_NORMAL - bit;
     0.00 :   ffff80000815be54:       cinc    w4, w4, ne      // ne = any
          : 123              trace_test_and_set_recursion():
          : 147              bit = trace_get_context_bit() + start;
     0.00 :   ffff80000815be58:       sub     w5, w5, w4
          : 148              if (unlikely(val & (1 << bit))) {
     0.00 :   ffff80000815be5c:       lsl     w2, w2, w5
     0.00 :   ffff80000815be60:       tst     w2, w7
     0.00 :   ffff80000815be64:       b.ne    ffff80000815bf84 <arch_ftrace_ops_list_func+0x194>  // b.any
          : 152              trace_clear_recursion():
          : 180              */
          : 181              static __always_inline void trace_clear_recursion(int bit)
          : 182              {
          : 183              preempt_enable_notrace();
          : 184              barrier();
          : 185              trace_recursion_clear(bit);
     4.65 :   ffff80000815be68:       mvn     w22, w2 // trace_recursion.h:180
     0.00 :   ffff80000815be6c:       str     x25, [sp, #64]
     0.00 :   ffff80000815be70:       sxtw    x22, w22
          : 189              trace_test_and_set_recursion():
          : 165              current->trace_recursion = val;
     0.00 :   ffff80000815be74:       orr     w2, w2, w7
          : 167              get_current():
     0.00 :   ffff80000815be78:       mrs     x4, sp_el0
          : 20               trace_test_and_set_recursion():
     2.32 :   ffff80000815be7c:       str     x2, [x4, #2520] // trace_recursion.h:165
          : 166              __preempt_count_add():
          : 47               return !current_thread_info()->preempt.need_resched;
          : 48               }
          :
          : 50               static inline void __preempt_count_add(int val)
          : 51               {
          : 52               u32 pc = READ_ONCE(current_thread_info()->preempt.count);
     0.00 :   ffff80000815be80:       ldr     w1, [x4, #8]
          : 48               pc += val;
     0.00 :   ffff80000815be84:       add     w1, w1, #0x1
          : 49               WRITE_ONCE(current_thread_info()->preempt.count, pc);
     2.32 :   ffff80000815be88:       str     w1, [x4, #8] // preempt.h:49
          : 51               __ftrace_ops_list_func():
          : 7506             do_for_each_ftrace_op(op, ftrace_ops_list) {
     0.00 :   ffff80000815be8c:       adrp    x0, ffff800009638000 <folio_wait_table+0x14c0>
     0.00 :   ffff80000815be90:       add     x25, x0, #0xc28
          : 7527             } while_for_each_ftrace_op(op);
     0.00 :   ffff80000815be94:       add     x24, x25, #0x8
          : 7506             do_for_each_ftrace_op(op, ftrace_ops_list) {
     0.00 :   ffff80000815be98:       ldr     x19, [x0, #3112]
          : 7508             if (op->flags & FTRACE_OPS_FL_STUB)
     4.72 :   ffff80000815be9c:       ldr     x0, [x19, #16] // ftrace.c:7508
     0.00 :   ffff80000815bea0:       tbnz    w0, #5, ffff80000815bef8 <arch_ftrace_ops_list_func+0x108>
          : 7519             if ((!(op->flags & FTRACE_OPS_FL_RCU) || rcu_is_watching()) &&
     2.32 :   ffff80000815bea4:       tbnz    w0, #14, ffff80000815bf74 <arch_ftrace_ops_list_func+0x184> // ftrace.c:7519
          : 7521             ftrace_ops_test():
          : 1486             rcu_assign_pointer(hash.filter_hash, ops->func_hash->filter_hash);
     2.32 :   ffff80000815bea8:       ldr     x0, [x19, #88] // ftrace.c:1486
     0.00 :   ffff80000815beac:       add     x1, sp, #0x60
     0.00 :   ffff80000815beb0:       ldr     x0, [x0, #8]
     0.00 :   ffff80000815beb4:       stlr    x0, [x1]
          : 1487             rcu_assign_pointer(hash.notrace_hash, ops->func_hash->notrace_hash);
     0.00 :   ffff80000815beb8:       ldr     x0, [x19, #88]
     0.00 :   ffff80000815bebc:       add     x1, sp, #0x58
     0.00 :   ffff80000815bec0:       ldr     x0, [x0]
     2.32 :   ffff80000815bec4:       stlr    x0, [x1] // ftrace.c:1487
          : 1489             if (hash_contains_ip(ip, &hash))
    44.15 :   ffff80000815bec8:       ldp     x1, x2, [sp, #88] // ftrace.c:1489
     0.00 :   ffff80000815becc:       mov     x0, x23
     0.00 :   ffff80000815bed0:       bl      ffff80000815b530 <hash_contains_ip.isra.0>
     0.00 :   ffff80000815bed4:       tst     w0, #0xff
     0.00 :   ffff80000815bed8:       b.eq    ffff80000815bef8 <arch_ftrace_ops_list_func+0x108>  // b.none
          : 1495             __ftrace_ops_list_func():
          : 7521             if (FTRACE_WARN_ON(!op->func)) {
     0.00 :   ffff80000815bedc:       ldr     x4, [x19]
     0.00 :   ffff80000815bee0:       cbz     x4, ffff80000815bfa0 <arch_ftrace_ops_list_func+0x1b0>
          : 7525             op->func(ip, parent_ip, op, fregs);
     0.00 :   ffff80000815bee4:       mov     x3, x21
     0.00 :   ffff80000815bee8:       mov     x2, x19
     0.00 :   ffff80000815beec:       mov     x1, x20
     0.00 :   ffff80000815bef0:       mov     x0, x23
     0.00 :   ffff80000815bef4:       blr     x4
          : 7527             } while_for_each_ftrace_op(op);
     0.00 :   ffff80000815bef8:       ldr     x19, [x19, #8]
     0.00 :   ffff80000815befc:       cmp     x19, #0x0
     0.00 :   ffff80000815bf00:       ccmp    x19, x24, #0x4, ne      // ne = any
     0.00 :   ffff80000815bf04:       b.ne    ffff80000815be9c <arch_ftrace_ops_list_func+0xac>  // b.any
          : 7532             get_current():
     0.00 :   ffff80000815bf08:       mrs     x1, sp_el0
          : 20               __preempt_count_dec_and_test():
          : 62               }
          :
          : 64               static inline bool __preempt_count_dec_and_test(void)
          : 65               {
          : 66               struct thread_info *ti = current_thread_info();
          : 67               u64 pc = READ_ONCE(ti->preempt_count);
     0.00 :   ffff80000815bf0c:       ldr     x0, [x1, #8]
          :
          : 66               /* Update only the count field, leaving need_resched unchanged */
          : 67               WRITE_ONCE(ti->preempt.count, --pc);
     0.00 :   ffff80000815bf10:       sub     x0, x0, #0x1
     0.00 :   ffff80000815bf14:       str     w0, [x1, #8]
          : 74               * need of a reschedule. Otherwise, we need to reload the
          : 75               * preempt_count in case the need_resched flag was cleared by an
          : 76               * interrupt occurring between the non-atomic READ_ONCE/WRITE_ONCE
          : 77               * pair.
          : 78               */
          : 79               return !pc || !READ_ONCE(ti->preempt_count);
     0.00 :   ffff80000815bf18:       cbnz    x0, ffff80000815bf64 <arch_ftrace_ops_list_func+0x174>
          : 81               trace_clear_recursion():
          : 178              preempt_enable_notrace();
     0.00 :   ffff80000815bf1c:       bl      ffff800008ae88d0 <preempt_schedule_notrace>
          : 180              get_current():
     2.32 :   ffff80000815bf20:       mrs     x1, sp_el0 // current.h:19
          : 20               trace_clear_recursion():
          : 180              trace_recursion_clear(bit);
     0.00 :   ffff80000815bf24:       ldr     x0, [x1, #2520]
     0.00 :   ffff80000815bf28:       and     x0, x0, x22
     2.32 :   ffff80000815bf2c:       str     x0, [x1, #2520] // trace_recursion.h:180
          : 184              arch_ftrace_ops_list_func():
          : 7553             __ftrace_ops_list_func(ip, parent_ip, NULL, fregs);
          : 7554             }
     0.00 :   ffff80000815bf30:       ldr     x25, [sp, #64]
     0.00 :   ffff80000815bf34:       mrs     x0, sp_el0
     2.32 :   ffff80000815bf38:       ldr     x2, [sp, #136] // ftrace.c:7553
     0.00 :   ffff80000815bf3c:       ldr     x1, [x0, #1168]
     0.00 :   ffff80000815bf40:       subs    x2, x2, x1
     0.00 :   ffff80000815bf44:       mov     x1, #0x0                        // #0
     0.00 :   ffff80000815bf48:       b.ne    ffff80000815bf98 <arch_ftrace_ops_list_func+0x1a8>  // b.any
     2.32 :   ffff80000815bf4c:       ldp     x19, x20, [sp, #16]
     0.00 :   ffff80000815bf50:       ldp     x21, x22, [sp, #32]
     2.32 :   ffff80000815bf54:       ldp     x23, x24, [sp, #48]
     0.00 :   ffff80000815bf58:       ldp     x29, x30, [sp], #144
     0.00 :   ffff80000815bf5c:       autiasp
     0.00 :   ffff80000815bf60:       ret
          : 7568             __preempt_count_dec_and_test():
    11.62 :   ffff80000815bf64:       ldr     x0, [x1, #8] // preempt.h:74
     0.00 :   ffff80000815bf68:       cbnz    x0, ffff80000815bf20 <arch_ftrace_ops_list_func+0x130>
          : 76               trace_clear_recursion():
          : 178              preempt_enable_notrace();
     0.00 :   ffff80000815bf6c:       bl      ffff800008ae88d0 <preempt_schedule_notrace>
     0.00 :   ffff80000815bf70:       b       ffff80000815bf20 <arch_ftrace_ops_list_func+0x130>
          : 181              __ftrace_ops_list_func():
          : 7519             if ((!(op->flags & FTRACE_OPS_FL_RCU) || rcu_is_watching()) &&
     0.00 :   ffff80000815bf74:       bl      ffff8000080e5770 <rcu_is_watching>
     0.00 :   ffff80000815bf78:       tst     w0, #0xff
     0.00 :   ffff80000815bf7c:       b.ne    ffff80000815bea8 <arch_ftrace_ops_list_func+0xb8>  // b.any
     0.00 :   ffff80000815bf80:       b       ffff80000815bef8 <arch_ftrace_ops_list_func+0x108>
          : 7524             trace_test_and_set_recursion():
          : 158              if (val & (1 << bit)) {
     0.00 :   ffff80000815bf84:       tbnz    w7, #9, ffff80000815bf34 <arch_ftrace_ops_list_func+0x144>
     0.00 :   ffff80000815bf88:       mov     x22, #0xfffffffffffffdff        // #-513
     0.00 :   ffff80000815bf8c:       mov     w2, #0x200                      // #512
     0.00 :   ffff80000815bf90:       str     x25, [sp, #64]
     0.00 :   ffff80000815bf94:       b       ffff80000815be74 <arch_ftrace_ops_list_func+0x84>
     0.00 :   ffff80000815bf98:       str     x25, [sp, #64]
          : 165              arch_ftrace_ops_list_func():
          : 7553             }
     0.00 :   ffff80000815bf9c:       bl      ffff800008ae5de0 <__stack_chk_fail>
          : 7555             __ftrace_ops_list_func():
          : 7521             if (FTRACE_WARN_ON(!op->func)) {
     0.00 :   ffff80000815bfa0:       brk     #0x800
          : 7523             ftrace_kill():
          : 8040             */
          : 8041             void ftrace_kill(void)
          : 8042             {
          : 8043             ftrace_disabled = 1;
          : 8044             ftrace_enabled = 0;
          : 8045             ftrace_trace_function = ftrace_stub;
     0.00 :   ffff80000815bfa4:       adrp    x3, ffff80000802e000 <arch_ftrace_update_code+0x10>
     0.00 :   ffff80000815bfa8:       add     x3, x3, #0x144
          : 8038             ftrace_disabled = 1;
     0.00 :   ffff80000815bfac:       mov     w4, #0x1                        // #1
          : 8040             __ftrace_ops_list_func():
          : 7522             pr_warn("op=%p %pS\n", op, op);
     0.00 :   ffff80000815bfb0:       mov     x2, x19
     0.00 :   ffff80000815bfb4:       mov     x1, x19
     0.00 :   ffff80000815bfb8:       adrp    x0, ffff800008d80000 <kallsyms_token_index+0x17f60>
     0.00 :   ffff80000815bfbc:       add     x0, x0, #0x678
          : 7527             ftrace_kill():
          : 8040             ftrace_trace_function = ftrace_stub;
     0.00 :   ffff80000815bfc0:       str     x3, [x25, #192]
          : 8039             ftrace_enabled = 0;
     0.00 :   ffff80000815bfc4:       stp     w4, wzr, [x25, #200]
          : 8041             __ftrace_ops_list_func():
          : 7522             pr_warn("op=%p %pS\n", op, op);
     0.00 :   ffff80000815bfc8:       bl      ffff800008ad5220 <_printk>
          : 7523             goto out;
     0.00 :   ffff80000815bfcc:       b       ffff80000815bf08 <arch_ftrace_ops_list_func+0x118>


2.3 fprobe_handler

          : 28               static void fprobe_handler(unsigned long ip, unsigned long parent_ip,
          : 29               struct ftrace_ops *ops, struct ftrace_regs *fregs)
          : 30               {
     0.00 :   ffff8000081a2020:       paciasp
     0.00 :   ffff8000081a2024:       stp     x29, x30, [sp, #-64]!
     0.00 :   ffff8000081a2028:       mov     x29, sp
     0.00 :   ffff8000081a202c:       stp     x19, x20, [sp, #16]
     0.00 :   ffff8000081a2030:       mov     x19, x2
     0.00 :   ffff8000081a2034:       stp     x21, x22, [sp, #32]
     0.00 :   ffff8000081a2038:       mov     x22, x3
     0.00 :   ffff8000081a203c:       str     x23, [sp, #48]
     0.00 :   ffff8000081a2040:       mov     x23, x0
          : 40               fprobe_disabled():
          : 49               */
          : 50               #define FPROBE_FL_KPROBE_SHARED 2
          :
          : 52               static inline bool fprobe_disabled(struct fprobe *fp)
          : 53               {
          : 54               return (fp) ? fp->flags & FPROBE_FL_DISABLED : false;
     0.00 :   ffff8000081a2044:       cbz     x2, ffff8000081a2050 <fprobe_handler+0x30>
    20.00 :   ffff8000081a2048:       ldr     w0, [x2, #192] // fprobe.h:49
     0.00 :   ffff8000081a204c:       tbnz    w0, #0, ffff8000081a2128 <fprobe_handler+0x108>
          : 58               get_current():
          : 19               */
          : 20               static __always_inline struct task_struct *get_current(void)
          : 21               {
          : 22               unsigned long sp_el0;
          :
          : 24               asm ("mrs %0, sp_el0" : "=r" (sp_el0));
     0.00 :   ffff8000081a2050:       mrs     x0, sp_el0
          : 26               trace_test_and_set_recursion():
          : 144              * Preemption is promised to be disabled when return bit >= 0.
          : 145              */
          : 146              static __always_inline int trace_test_and_set_recursion(unsigned long ip, unsigned long pip,
          : 147              int start)
          : 148              {
          : 149              unsigned int val = READ_ONCE(current->trace_recursion);
    10.00 :   ffff8000081a2054:       ldr     x9, [x0, #2520] // trace_recursion.h:144
          : 151              trace_get_context_bit():
          : 121              return TRACE_CTX_NORMAL - bit;
     0.00 :   ffff8000081a2058:       mov     w6, #0x3                        // #3
          : 123              preempt_count():
          : 13               #define PREEMPT_NEED_RESCHED    BIT(32)
          : 14               #define PREEMPT_ENABLED (PREEMPT_NEED_RESCHED)
          :
          : 16               static inline int preempt_count(void)
          : 17               {
          : 18               return READ_ONCE(current_thread_info()->preempt.count);
     0.00 :   ffff8000081a205c:       ldr     w8, [x0, #8]
          : 20               trace_test_and_set_recursion():
          : 148              int bit;
          :
          : 150              bit = trace_get_context_bit() + start;
          : 151              if (unlikely(val & (1 << bit))) {
     0.00 :   ffff8000081a2060:       mov     w4, #0x1                        // #1
          : 153              interrupt_context_level():
          : 94               static __always_inline unsigned char interrupt_context_level(void)
          : 95               {
          : 96               unsigned long pc = preempt_count();
          : 97               unsigned char level = 0;
          :
          : 99               level += !!(pc & (NMI_MASK));
     0.00 :   ffff8000081a2064:       tst     w8, #0xf00000
          : 96               level += !!(pc & (NMI_MASK | HARDIRQ_MASK));
          : 97               level += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));
     0.00 :   ffff8000081a2068:       and     w7, w8, #0xffff00
          : 94               level += !!(pc & (NMI_MASK));
     0.00 :   ffff8000081a206c:       cset    w5, ne  // ne = any
          : 96               level += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));
     0.00 :   ffff8000081a2070:       and     w7, w7, #0xffff01ff
          : 95               level += !!(pc & (NMI_MASK | HARDIRQ_MASK));
     0.00 :   ffff8000081a2074:       tst     w8, #0xff0000
     0.00 :   ffff8000081a2078:       cinc    w5, w5, ne      // ne = any
          : 96               level += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));
     0.00 :   ffff8000081a207c:       cmp     w7, #0x0
          : 98               trace_get_context_bit():
          : 121              return TRACE_CTX_NORMAL - bit;
     0.00 :   ffff8000081a2080:       cinc    w5, w5, ne      // ne = any
     0.00 :   ffff8000081a2084:       sub     w5, w6, w5
          : 124              trace_test_and_set_recursion():
          : 148              if (unlikely(val & (1 << bit))) {
     0.00 :   ffff8000081a2088:       lsl     w4, w4, w5
          : 150              trace_clear_recursion():
          : 180              */
          : 181              static __always_inline void trace_clear_recursion(int bit)
          : 182              {
          : 183              preempt_enable_notrace();
          : 184              barrier();
          : 185              trace_recursion_clear(bit);
    10.00 :   ffff8000081a208c:       mvn     w20, w4 // trace_recursion.h:180
     0.00 :   ffff8000081a2090:       sxtw    x20, w20
          : 188              trace_test_and_set_recursion():
          : 148              if (unlikely(val & (1 << bit))) {
     0.00 :   ffff8000081a2094:       tst     w4, w9
     0.00 :   ffff8000081a2098:       b.ne    ffff8000081a2194 <fprobe_handler+0x174>  // b.any
          : 165              current->trace_recursion = val;
     0.00 :   ffff8000081a209c:       orr     w4, w4, w9
          : 167              get_current():
     0.00 :   ffff8000081a20a0:       mrs     x5, sp_el0
          : 20               trace_test_and_set_recursion():
     0.00 :   ffff8000081a20a4:       str     x4, [x5, #2520]
          : 166              __preempt_count_add():
          : 47               return !current_thread_info()->preempt.need_resched;
          : 48               }
          :
          : 50               static inline void __preempt_count_add(int val)
          : 51               {
          : 52               u32 pc = READ_ONCE(current_thread_info()->preempt.count);
     0.00 :   ffff8000081a20a8:       ldr     w4, [x5, #8]
          : 48               pc += val;
     0.00 :   ffff8000081a20ac:       add     w4, w4, #0x1
          : 49               WRITE_ONCE(current_thread_info()->preempt.count, pc);
     0.00 :   ffff8000081a20b0:       str     w4, [x5, #8]
          : 51               fprobe_handler():
          : 43               if (bit < 0) {
          : 44               fp->nmissed++;
          : 45               return;
          : 46               }
          :
          : 48               if (fp->exit_handler) {
     0.00 :   ffff8000081a20b4:       ldr     x0, [x19, #224]
     0.00 :   ffff8000081a20b8:       cbz     x0, ffff8000081a2140 <fprobe_handler+0x120>
          : 44               rh = rethook_try_get(fp->rethook);
    10.00 :   ffff8000081a20bc:       ldr     x0, [x19, #200] // fprobe.c:44
     0.00 :   ffff8000081a20c0:       bl      ffff8000081a2a54 <rethook_try_get>
     0.00 :   ffff8000081a20c4:       mov     x21, x0
          : 45               if (!rh) {
     0.00 :   ffff8000081a20c8:       cbz     x0, ffff8000081a21a4 <fprobe_handler+0x184>
          : 50               fp->nmissed++;
          : 51               goto out;
          : 52               }
          : 53               fpr = container_of(rh, struct fprobe_rethook_node, node);
          : 54               fpr->entry_ip = ip;
     0.00 :   ffff8000081a20cc:       str     x23, [x0, #48]
          : 54               private = fpr->private;
          : 55               }
          :
          : 57               if (fp->entry_handler)
     0.00 :   ffff8000081a20d0:       ldr     x4, [x19, #216]
     0.00 :   ffff8000081a20d4:       cbz     x4, ffff8000081a2180 <fprobe_handler+0x160>
          : 55               should_rethook = fp->entry_handler(fp, ip, fregs, fpr->private);
     0.00 :   ffff8000081a20d8:       mov     x1, x23
     0.00 :   ffff8000081a20dc:       mov     x0, x19
     0.00 :   ffff8000081a20e0:       add     x3, x21, #0x38
     0.00 :   ffff8000081a20e4:       mov     x2, x22
     0.00 :   ffff8000081a20e8:       blr     x4
          :
          : 59               if (rh) {
          : 60               if (should_rethook)
     0.00 :   ffff8000081a20ec:       tst     w0, #0xff
     0.00 :   ffff8000081a20f0:       b.ne    ffff8000081a2180 <fprobe_handler+0x160>  // b.any
          : 61               rethook_hook(rh, fregs, true);
          : 62               else
          : 63               rethook_recycle(rh);
     0.00 :   ffff8000081a20f4:       mov     x0, x21
     0.00 :   ffff8000081a20f8:       bl      ffff8000081a2bf0 <rethook_recycle>
          : 66               get_current():
     0.00 :   ffff8000081a20fc:       mrs     x1, sp_el0
          : 20               __preempt_count_dec_and_test():
          : 62               }
          :
          : 64               static inline bool __preempt_count_dec_and_test(void)
          : 65               {
          : 66               struct thread_info *ti = current_thread_info();
          : 67               u64 pc = READ_ONCE(ti->preempt_count);
     0.00 :   ffff8000081a2100:       ldr     x0, [x1, #8]
          :
          : 66               /* Update only the count field, leaving need_resched unchanged */
          : 67               WRITE_ONCE(ti->preempt.count, --pc);
     0.00 :   ffff8000081a2104:       sub     x0, x0, #0x1
     0.00 :   ffff8000081a2108:       str     w0, [x1, #8]
          : 74               * need of a reschedule. Otherwise, we need to reload the
          : 75               * preempt_count in case the need_resched flag was cleared by an
          : 76               * interrupt occurring between the non-atomic READ_ONCE/WRITE_ONCE
          : 77               * pair.
          : 78               */
          : 79               return !pc || !READ_ONCE(ti->preempt_count);
     0.00 :   ffff8000081a210c:       cbnz    x0, ffff8000081a2170 <fprobe_handler+0x150>
          : 81               trace_clear_recursion():
          : 178              preempt_enable_notrace();
     0.00 :   ffff8000081a2110:       bl      ffff800008ae88d0 <preempt_schedule_notrace>
     0.00 :   ffff8000081a2114:       nop
          : 181              get_current():
    10.00 :   ffff8000081a2118:       mrs     x1, sp_el0 // current.h:19
          : 20               trace_clear_recursion():
          : 180              trace_recursion_clear(bit);
     0.00 :   ffff8000081a211c:       ldr     x0, [x1, #2520]
     0.00 :   ffff8000081a2120:       and     x0, x0, x20
    10.00 :   ffff8000081a2124:       str     x0, [x1, #2520] // trace_recursion.h:180
          : 184              fprobe_handler():
          : 66               }
          :
          : 68               out:
          : 69               ftrace_test_recursion_unlock(bit);
          : 70               }
     0.00 :   ffff8000081a2128:       ldp     x19, x20, [sp, #16]
     0.00 :   ffff8000081a212c:       ldp     x21, x22, [sp, #32]
     0.00 :   ffff8000081a2130:       ldr     x23, [sp, #48]
    20.00 :   ffff8000081a2134:       ldp     x29, x30, [sp], #64 // fprobe.c:66
     0.00 :   ffff8000081a2138:       autiasp
    10.00 :   ffff8000081a213c:       ret
          : 54               if (fp->entry_handler)
     0.00 :   ffff8000081a2140:       ldr     x4, [x19, #216]
     0.00 :   ffff8000081a2144:       cbz     x4, ffff8000081a215c <fprobe_handler+0x13c>
          : 55               should_rethook = fp->entry_handler(fp, ip, fregs, fpr->private);
     0.00 :   ffff8000081a2148:       mov     x2, x22
     0.00 :   ffff8000081a214c:       mov     x1, x23
     0.00 :   ffff8000081a2150:       mov     x0, x19
     0.00 :   ffff8000081a2154:       mov     x3, #0x38                       // #56
     0.00 :   ffff8000081a2158:       blr     x4
          : 61               get_current():
     0.00 :   ffff8000081a215c:       mrs     x1, sp_el0
          : 20               __preempt_count_dec_and_test():
          : 62               u64 pc = READ_ONCE(ti->preempt_count);
     0.00 :   ffff8000081a2160:       ldr     x0, [x1, #8]
          : 65               WRITE_ONCE(ti->preempt.count, --pc);
     0.00 :   ffff8000081a2164:       sub     x0, x0, #0x1
     0.00 :   ffff8000081a2168:       str     w0, [x1, #8]
          : 74               return !pc || !READ_ONCE(ti->preempt_count);
     0.00 :   ffff8000081a216c:       cbz     x0, ffff8000081a2110 <fprobe_handler+0xf0>
     0.00 :   ffff8000081a2170:       ldr     x0, [x1, #8]
     0.00 :   ffff8000081a2174:       cbnz    x0, ffff8000081a2118 <fprobe_handler+0xf8>
          : 78               trace_clear_recursion():
          : 178              preempt_enable_notrace();
     0.00 :   ffff8000081a2178:       bl      ffff800008ae88d0 <preempt_schedule_notrace>
     0.00 :   ffff8000081a217c:       b       ffff8000081a2118 <fprobe_handler+0xf8>
          : 181              fprobe_handler():
          : 59               rethook_hook(rh, fregs, true);
     0.00 :   ffff8000081a2180:       mov     x1, x22
     0.00 :   ffff8000081a2184:       mov     x0, x21
     0.00 :   ffff8000081a2188:       mov     w2, #0x1                        // #1
     0.00 :   ffff8000081a218c:       bl      ffff8000081a27d0 <rethook_hook>
     0.00 :   ffff8000081a2190:       b       ffff8000081a215c <fprobe_handler+0x13c>
          : 65               trace_test_and_set_recursion():
          : 158              if (val & (1 << bit)) {
     0.00 :   ffff8000081a2194:       tbnz    w9, #4, ffff8000081a21b4 <fprobe_handler+0x194>
     0.00 :   ffff8000081a2198:       mov     x20, #0xffffffffffffffef        // #-17
     0.00 :   ffff8000081a219c:       mov     w4, #0x10                       // #16
     0.00 :   ffff8000081a21a0:       b       ffff8000081a209c <fprobe_handler+0x7c>
          : 163              fprobe_handler():
          : 46               fp->nmissed++;
     0.00 :   ffff8000081a21a4:       ldr     x0, [x19, #184]
     0.00 :   ffff8000081a21a8:       add     x0, x0, #0x1
     0.00 :   ffff8000081a21ac:       str     x0, [x19, #184]
          : 47               goto out;
     0.00 :   ffff8000081a21b0:       b       ffff8000081a215c <fprobe_handler+0x13c>
          : 39               fp->nmissed++;
     0.00 :   ffff8000081a21b4:       ldr     x0, [x19, #184]
     0.00 :   ffff8000081a21b8:       add     x0, x0, #0x1
     0.00 :   ffff8000081a21bc:       str     x0, [x19, #184]
          : 40               return;
     0.00 :   ffff8000081a21c0:       b       ffff8000081a2128 <fprobe_handler+0x108>


2.4 bpf_fprobe_entry

          : 5                ffff8000081e19f0 <bpf_fprobe_entry>:
          : 6                bpf_fprobe_entry():
          : 1057             flags = u64_stats_update_begin_irqsave(&stats->syncp);
          : 1058             u64_stats_inc(&stats->cnt);
          : 1059             u64_stats_add(&stats->nsecs, sched_clock() - start);
          : 1060             u64_stats_update_end_irqrestore(&stats->syncp, flags);
          : 1061             }
          : 1062             }
     0.00 :   ffff8000081e19f0:       bti     c
     0.00 :   ffff8000081e19f4:       nop
     0.00 :   ffff8000081e19f8:       nop
          : 165              {
     0.00 :   ffff8000081e19fc:       paciasp
     0.00 :   ffff8000081e1a00:       stp     x29, x30, [sp, #-80]!
     0.00 :   ffff8000081e1a04:       mov     w4, #0x0                        // #0
     0.00 :   ffff8000081e1a08:       mov     x29, sp
     0.00 :   ffff8000081e1a0c:       stp     x19, x20, [sp, #16]
     0.00 :   ffff8000081e1a10:       mov     x19, x3
     0.00 :   ffff8000081e1a14:       stp     x21, x22, [sp, #32]
     0.00 :   ffff8000081e1a18:       mov     x22, x0
     0.00 :   ffff8000081e1a1c:       mov     x21, x2
     0.00 :   ffff8000081e1a20:       stp     x23, x24, [sp, #48]
     0.00 :   ffff8000081e1a24:       str     x25, [sp, #64]
          : 167              struct bpf_fprobe_context *fprobe_ctx = fp->ops.private;
     0.00 :   ffff8000081e1a28:       ldr     x24, [x0, #24]
          : 168              struct bpf_tramp_links *links = fprobe_ctx->links;
     0.00 :   ffff8000081e1a2c:       ldr     x23, [x24]
          : 174              memset(&call_ctx->ctx, 0, sizeof(call_ctx->ctx));
     0.00 :   ffff8000081e1a30:       stp     xzr, xzr, [x3]
          : 175              call_ctx->ip = ip;
     0.00 :   ffff8000081e1a34:       str     x1, [x3, #16]
          : 176              for (i = 0; i < fprobe_ctx->nr_args; i++)
     0.00 :   ffff8000081e1a38:       ldr     w0, [x24, #8]
     0.00 :   ffff8000081e1a3c:       cmp     w0, #0x0
     0.00 :   ffff8000081e1a40:       b.gt    ffff8000081e1a64 <bpf_fprobe_entry+0x74>
     0.00 :   ffff8000081e1a44:       b       ffff8000081e1a90 <bpf_fprobe_entry+0xa0>
          : 177              call_ctx->args[i] = ftrace_regs_get_argument(regs, i);
     0.00 :   ffff8000081e1a48:       ldr     x0, [x21, x1, lsl #3]
     0.00 :   ffff8000081e1a4c:       add     x1, x19, x1, lsl #3
          : 176              for (i = 0; i < fprobe_ctx->nr_args; i++)
     0.00 :   ffff8000081e1a50:       add     w4, w4, #0x1
          : 177              call_ctx->args[i] = ftrace_regs_get_argument(regs, i);
    16.67 :   ffff8000081e1a54:       str     x0, [x1, #24] // trampoline.c:177
          : 176              for (i = 0; i < fprobe_ctx->nr_args; i++)
     0.00 :   ffff8000081e1a58:       ldr     w0, [x24, #8]
     0.00 :   ffff8000081e1a5c:       cmp     w0, w4
     0.00 :   ffff8000081e1a60:       b.le    ffff8000081e1a90 <bpf_fprobe_entry+0xa0>
          : 177              call_ctx->args[i] = ftrace_regs_get_argument(regs, i);
     8.33 :   ffff8000081e1a64:       sxtw    x1, w4
     0.00 :   ffff8000081e1a68:       mov     x0, #0x0                        // #0
     0.00 :   ffff8000081e1a6c:       cmp     w4, #0x7
     0.00 :   ffff8000081e1a70:       b.le    ffff8000081e1a48 <bpf_fprobe_entry+0x58>
     0.00 :   ffff8000081e1a74:       sxtw    x1, w4
          : 176              for (i = 0; i < fprobe_ctx->nr_args; i++)
     0.00 :   ffff8000081e1a78:       add     w4, w4, #0x1
          : 177              call_ctx->args[i] = ftrace_regs_get_argument(regs, i);
     0.00 :   ffff8000081e1a7c:       add     x1, x19, x1, lsl #3
     0.00 :   ffff8000081e1a80:       str     x0, [x1, #24]
          : 176              for (i = 0; i < fprobe_ctx->nr_args; i++)
     0.00 :   ffff8000081e1a84:       ldr     w0, [x24, #8]
     0.00 :   ffff8000081e1a88:       cmp     w0, w4
     0.00 :   ffff8000081e1a8c:       b.gt    ffff8000081e1a64 <bpf_fprobe_entry+0x74>
          : 179              for (i = 0; i < fentry->nr_links; i++)
     0.00 :   ffff8000081e1a90:       ldr     w1, [x23, #304]
          : 185              call_ctx->args);
     0.00 :   ffff8000081e1a94:       add     x25, x19, #0x18
     0.00 :   ffff8000081e1a98:       mov     x20, #0x0                       // #0
          : 179              for (i = 0; i < fentry->nr_links; i++)
     0.00 :   ffff8000081e1a9c:       cmp     w1, #0x0
     0.00 :   ffff8000081e1aa0:       b.le    ffff8000081e1ad4 <bpf_fprobe_entry+0xe4>
     0.00 :   ffff8000081e1aa4:       nop
          : 180              call_bpf_prog(fentry->links[i], &call_ctx->ctx, call_ctx->args);
     0.00 :   ffff8000081e1aa8:       ldr     x1, [x23, x20, lsl #3]
     0.00 :   ffff8000081e1aac:       mov     x3, x25
     0.00 :   ffff8000081e1ab0:       mov     x2, x19
          : 179              for (i = 0; i < fentry->nr_links; i++)
     0.00 :   ffff8000081e1ab4:       add     x20, x20, #0x1
          : 180              call_bpf_prog(fentry->links[i], &call_ctx->ctx, call_ctx->args);
    16.67 :   ffff8000081e1ab8:       ldr     x0, [x1, #24] // trampoline.c:180
     0.00 :   ffff8000081e1abc:       ldr     x1, [x1, #80]
     0.00 :   ffff8000081e1ac0:       bl      ffff8000081e1800 <call_bpf_prog.isra.0>
          : 179              for (i = 0; i < fentry->nr_links; i++)
     0.00 :   ffff8000081e1ac4:       ldr     w0, [x23, #304]
     0.00 :   ffff8000081e1ac8:       cmp     w0, w20
     0.00 :   ffff8000081e1acc:       b.gt    ffff8000081e1aa8 <bpf_fprobe_entry+0xb8>
     0.00 :   ffff8000081e1ad0:       ldr     w0, [x24, #8]
          : 182              call_ctx->args[fprobe_ctx->nr_args] = 0;
     0.00 :   ffff8000081e1ad4:       add     x0, x19, w0, sxtw #3
          : 183              for (i = 0; i < fmod_ret->nr_links; i++) {
     0.00 :   ffff8000081e1ad8:       add     x25, x23, #0x270
          : 185              call_ctx->args);
     0.00 :   ffff8000081e1adc:       add     x24, x19, #0x18
     0.00 :   ffff8000081e1ae0:       mov     x20, #0x0                       // #0
          : 182              call_ctx->args[fprobe_ctx->nr_args] = 0;
    25.00 :   ffff8000081e1ae4:       str     xzr, [x0, #24] // trampoline.c:182
          : 183              for (i = 0; i < fmod_ret->nr_links; i++) {
     0.00 :   ffff8000081e1ae8:       ldr     w0, [x25, #304]
     0.00 :   ffff8000081e1aec:       cmp     w0, #0x0
     0.00 :   ffff8000081e1af0:       b.gt    ffff8000081e1b04 <bpf_fprobe_entry+0x114>
    16.67 :   ffff8000081e1af4:       b       ffff8000081e1ba8 <bpf_fprobe_entry+0x1b8> // trampoline.c:183
     0.00 :   ffff8000081e1af8:       ldr     w0, [x25, #304]
     0.00 :   ffff8000081e1afc:       cmp     w0, w20
     0.00 :   ffff8000081e1b00:       b.le    ffff8000081e1ba8 <bpf_fprobe_entry+0x1b8>
          : 184              ret = call_bpf_prog(fmod_ret->links[i], &call_ctx->ctx,
     0.00 :   ffff8000081e1b04:       ldr     x1, [x25, x20, lsl #3]
     0.00 :   ffff8000081e1b08:       mov     x3, x24
     0.00 :   ffff8000081e1b0c:       mov     x2, x19
          : 183              for (i = 0; i < fmod_ret->nr_links; i++) {
     0.00 :   ffff8000081e1b10:       add     x20, x20, #0x1
          : 184              ret = call_bpf_prog(fmod_ret->links[i], &call_ctx->ctx,
     0.00 :   ffff8000081e1b14:       ldr     x0, [x1, #24]
     0.00 :   ffff8000081e1b18:       ldr     x1, [x1, #80]
     0.00 :   ffff8000081e1b1c:       bl      ffff8000081e1800 <call_bpf_prog.isra.0>
          : 187              if (ret) {
     0.00 :   ffff8000081e1b20:       cbz     w0, ffff8000081e1af8 <bpf_fprobe_entry+0x108>
          : 189              ftrace_override_function_with_return(regs);
     0.00 :   ffff8000081e1b24:       ldr     x2, [x21, #88]
          : 188              ftrace_regs_set_return_value(regs, ret);
     0.00 :   ffff8000081e1b28:       sxtw    x1, w0
     0.00 :   ffff8000081e1b2c:       str     x1, [x21]
          : 191              bpf_fprobe_exit():
          : 160              for (i = 0; i < fexit->nr_links; i++)
     0.00 :   ffff8000081e1b30:       mov     x20, #0x0                       // #0
          : 162              bpf_fprobe_entry():
          : 189              ftrace_override_function_with_return(regs);
     0.00 :   ffff8000081e1b34:       str     x2, [x21, #104]
          : 191              bpf_fprobe_exit():
          : 153              struct bpf_fprobe_context *fprobe_ctx = fp->ops.private;
     0.00 :   ffff8000081e1b38:       ldr     x2, [x22, #24]
          : 158              call_ctx->args[fprobe_ctx->nr_args] = ftrace_regs_return_value(regs);
     0.00 :   ffff8000081e1b3c:       ldrsw   x0, [x2, #8]
          : 154              struct bpf_tramp_links *links = fprobe_ctx->links;
     0.00 :   ffff8000081e1b40:       ldr     x21, [x2]
          : 158              call_ctx->args[fprobe_ctx->nr_args] = ftrace_regs_return_value(regs);
     0.00 :   ffff8000081e1b44:       add     x0, x19, x0, lsl #3
          : 160              for (i = 0; i < fexit->nr_links; i++)
     0.00 :   ffff8000081e1b48:       add     x21, x21, #0x138
          : 158              call_ctx->args[fprobe_ctx->nr_args] = ftrace_regs_return_value(regs);
     0.00 :   ffff8000081e1b4c:       str     x1, [x0, #24]
          : 160              for (i = 0; i < fexit->nr_links; i++)
     0.00 :   ffff8000081e1b50:       ldr     w0, [x21, #304]
     0.00 :   ffff8000081e1b54:       cmp     w0, #0x0
     0.00 :   ffff8000081e1b58:       b.le    ffff8000081e1b88 <bpf_fprobe_entry+0x198>
     0.00 :   ffff8000081e1b5c:       nop
          : 161              call_bpf_prog(fexit->links[i], &call_ctx->ctx, call_ctx->args);
     0.00 :   ffff8000081e1b60:       ldr     x1, [x21, x20, lsl #3]
     0.00 :   ffff8000081e1b64:       mov     x3, x24
     0.00 :   ffff8000081e1b68:       mov     x2, x19
          : 160              for (i = 0; i < fexit->nr_links; i++)
     0.00 :   ffff8000081e1b6c:       add     x20, x20, #0x1
          : 161              call_bpf_prog(fexit->links[i], &call_ctx->ctx, call_ctx->args);
     0.00 :   ffff8000081e1b70:       ldr     x0, [x1, #24]
     0.00 :   ffff8000081e1b74:       ldr     x1, [x1, #80]
     0.00 :   ffff8000081e1b78:       bl      ffff8000081e1800 <call_bpf_prog.isra.0>
          : 160              for (i = 0; i < fexit->nr_links; i++)
     0.00 :   ffff8000081e1b7c:       ldr     w0, [x21, #304]
     0.00 :   ffff8000081e1b80:       cmp     w0, w20
     0.00 :   ffff8000081e1b84:       b.gt    ffff8000081e1b60 <bpf_fprobe_entry+0x170>
          : 164              bpf_fprobe_entry():
          : 192              return false;
     0.00 :   ffff8000081e1b88:       mov     w0, #0x0                        // #0
          : 197              }
     0.00 :   ffff8000081e1b8c:       ldp     x19, x20, [sp, #16]
     0.00 :   ffff8000081e1b90:       ldp     x21, x22, [sp, #32]
     0.00 :   ffff8000081e1b94:       ldp     x23, x24, [sp, #48]
     0.00 :   ffff8000081e1b98:       ldr     x25, [sp, #64]
     0.00 :   ffff8000081e1b9c:       ldp     x29, x30, [sp], #80
     0.00 :   ffff8000081e1ba0:       autiasp
     0.00 :   ffff8000081e1ba4:       ret
          : 196              return fexit->nr_links;
     0.00 :   ffff8000081e1ba8:       ldr     w0, [x23, #616]
          : 197              }
     0.00 :   ffff8000081e1bac:       ldp     x19, x20, [sp, #16]
          : 196              return fexit->nr_links;
     0.00 :   ffff8000081e1bb0:       cmp     w0, #0x0
     0.00 :   ffff8000081e1bb4:       cset    w0, ne  // ne = any
          : 197              }
     0.00 :   ffff8000081e1bb8:       ldp     x21, x22, [sp, #32]
     0.00 :   ffff8000081e1bbc:       ldp     x23, x24, [sp, #48]
     0.00 :   ffff8000081e1bc0:       ldr     x25, [sp, #64]
     0.00 :   ffff8000081e1bc4:       ldp     x29, x30, [sp], #80
     0.00 :   ffff8000081e1bc8:       autiasp
    16.67 :   ffff8000081e1bcc:       ret // trampoline.c:197

2.5 call_bpf_prog

          : 5                ffff8000081e1800 <call_bpf_prog.isra.0>:
          : 6                call_bpf_prog.isra.0():
          :
          : 200              if (oldp)
          : 201              *oldp = old;
          :
          : 203              if (unlikely(!old))
          : 204              refcount_warn_saturate(r, REFCOUNT_ADD_UAF);
    13.33 :   ffff8000081e1800:       nop // refcount.h:199
     0.00 :   ffff8000081e1804:       nop
          : 207              call_bpf_prog():
          :
          : 108              mutex_unlock(&tr->mutex);
          : 109              return ret;
          : 110              }
          : 111              #else
          : 112              static unsigned int call_bpf_prog(struct bpf_tramp_link *l,
     0.00 :   ffff8000081e1808:       paciasp
     0.00 :   ffff8000081e180c:       stp     x29, x30, [sp, #-64]!
     0.00 :   ffff8000081e1810:       mov     x29, sp
     0.00 :   ffff8000081e1814:       stp     x19, x20, [sp, #16]
     0.00 :   ffff8000081e1818:       mov     x19, x0
     0.00 :   ffff8000081e181c:       mov     x20, x2
     0.00 :   ffff8000081e1820:       stp     x21, x22, [sp, #32]
     6.67 :   ffff8000081e1824:       stp     x23, x24, [sp, #48] // trampoline.c:107
     0.00 :   ffff8000081e1828:       mov     x24, x3
          : 118              struct bpf_tramp_run_ctx *run_ctx) = __bpf_prog_exit;
          : 119              struct bpf_prog *p = l->link.prog;
          : 120              unsigned int ret;
          : 121              u64 start_time;
          :
          : 123              if (p->aux->sleepable) {
    60.00 :   ffff8000081e182c:       ldr     x0, [x0, #56] // trampoline.c:118
    13.33 :   ffff8000081e1830:       ldrb    w0, [x0, #140]
     0.00 :   ffff8000081e1834:       cbnz    w0, ffff8000081e1858 <call_bpf_prog.isra.0+0x58>
          : 121              enter = __bpf_prog_enter_sleepable;
          : 122              exit = __bpf_prog_exit_sleepable;
          : 123              } else if (p->expected_attach_type == BPF_LSM_CGROUP) {
     0.00 :   ffff8000081e1838:       ldr     w0, [x19, #8]
     0.00 :   ffff8000081e183c:       cmp     w0, #0x2b
     0.00 :   ffff8000081e1840:       b.eq    ffff8000081e18c4 <call_bpf_prog.isra.0+0xc4>  // b.none
          : 112              void (*exit)(struct bpf_prog *prog, u64 start,
     0.00 :   ffff8000081e1844:       adrp    x22, ffff8000081e1000 <print_bpf_insn+0x580>
          : 110              u64 (*enter)(struct bpf_prog *prog,
     0.00 :   ffff8000081e1848:       adrp    x2, ffff8000081e1000 <print_bpf_insn+0x580>
          : 112              void (*exit)(struct bpf_prog *prog, u64 start,
     0.00 :   ffff8000081e184c:       add     x22, x22, #0xbd0
          : 110              u64 (*enter)(struct bpf_prog *prog,
     0.00 :   ffff8000081e1850:       add     x2, x2, #0xd20
     0.00 :   ffff8000081e1854:       b       ffff8000081e1868 <call_bpf_prog.isra.0+0x68>
          : 120              exit = __bpf_prog_exit_sleepable;
     0.00 :   ffff8000081e1858:       adrp    x22, ffff8000081e1000 <print_bpf_insn+0x580>
          : 119              enter = __bpf_prog_enter_sleepable;
     0.00 :   ffff8000081e185c:       adrp    x2, ffff8000081e1000 <print_bpf_insn+0x580>
          : 120              exit = __bpf_prog_exit_sleepable;
     0.00 :   ffff8000081e1860:       add     x22, x22, #0xc60
          : 119              enter = __bpf_prog_enter_sleepable;
     0.00 :   ffff8000081e1864:       add     x2, x2, #0xe10
          : 126              enter = __bpf_prog_enter_lsm_cgroup;
          : 127              exit = __bpf_prog_exit_lsm_cgroup;
          : 128              }
          :
          : 130              ctx->bpf_cookie = l->cookie;
     0.00 :   ffff8000081e1868:       str     x1, [x20]
          :
          : 129              start_time = enter(p, ctx);
     0.00 :   ffff8000081e186c:       mov     x0, x19
     0.00 :   ffff8000081e1870:       mov     x1, x20
          : 130              if (!start_time)
          : 131              return 0;
     0.00 :   ffff8000081e1874:       mov     w23, #0x0                       // #0
          : 128              start_time = enter(p, ctx);
     0.00 :   ffff8000081e1878:       blr     x2
     0.00 :   ffff8000081e187c:       mov     x21, x0
          : 129              if (!start_time)
     0.00 :   ffff8000081e1880:       cbz     x0, ffff8000081e18a8 <call_bpf_prog.isra.0+0xa8>
          :
          : 133              ret = p->bpf_func(args, p->insnsi);
     0.00 :   ffff8000081e1884:       ldr     x2, [x19, #48]
     0.00 :   ffff8000081e1888:       add     x1, x19, #0x48
     0.00 :   ffff8000081e188c:       mov     x0, x24
     0.00 :   ffff8000081e1890:       blr     x2
     0.00 :   ffff8000081e1894:       mov     w23, w0
          :
          : 135              exit(p, start_time, ctx);
     0.00 :   ffff8000081e1898:       mov     x2, x20
     0.00 :   ffff8000081e189c:       mov     x1, x21
     0.00 :   ffff8000081e18a0:       mov     x0, x19
     0.00 :   ffff8000081e18a4:       blr     x22
          :
          : 138              return ret;
          : 139              }
     0.00 :   ffff8000081e18a8:       mov     w0, w23
     0.00 :   ffff8000081e18ac:       ldp     x19, x20, [sp, #16]
     0.00 :   ffff8000081e18b0:       ldp     x21, x22, [sp, #32]
     0.00 :   ffff8000081e18b4:       ldp     x23, x24, [sp, #48]
     6.67 :   ffff8000081e18b8:       ldp     x29, x30, [sp], #64 // trampoline.c:137
     0.00 :   ffff8000081e18bc:       autiasp
     0.00 :   ffff8000081e18c0:       ret
          : 123              exit = __bpf_prog_exit_lsm_cgroup;
     0.00 :   ffff8000081e18c4:       adrp    x22, ffff8000081e1000 <print_bpf_insn+0x580>
          : 122              enter = __bpf_prog_enter_lsm_cgroup;
     0.00 :   ffff8000081e18c8:       adrp    x2, ffff8000081e1000 <print_bpf_insn+0x580>
          : 123              exit = __bpf_prog_exit_lsm_cgroup;
     0.00 :   ffff8000081e18cc:       add     x22, x22, #0x200
          : 122              enter = __bpf_prog_enter_lsm_cgroup;
     0.00 :   ffff8000081e18d0:       add     x2, x2, #0x1c0
     0.00 :   ffff8000081e18d4:       b       ffff8000081e1868 <call_bpf_prog.isra.0+0x68>
Florent Revest Oct. 17, 2022, 5:55 p.m. UTC | #19
On Thu, Oct 6, 2022 at 6:29 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Thu, 6 Oct 2022 18:19:12 +0200
> Florent Revest <revest@chromium.org> wrote:
>
> > Sure, we can give this a try, I'll work on a macro that generates the
> > 7 callbacks and we can check how much that helps. My belief right now
> > is that ftrace's iteration over all ops on arm64 is where we lose most
> > time but now that we have numbers it's pretty easy to check hypothesis
> > :)
>
> Ah, I forgot that's what Mark's code is doing. But yes, that needs to be
> fixed first. I forget that arm64 doesn't have the dedicated trampolines yet.
>
> So, let's hold off until that is complete.
>
> -- Steve

Mark finished an implementation of his per-callsite-ops and min-args
branches (meaning that we can now skip ftrace's expensive saving of all
registers and its iteration over all ops if only one is attached):
- https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64-ftrace-call-ops-20221017

And Masami wrote similar patches to what I had originally done to
fprobe in my branch:
- https://github.com/mhiramat/linux/commits/kprobes/fprobe-update

So I could rebase my previous "bpf on fprobe" branch on top of these
(as before, it's just good enough for benchmarking and to give a
general sense of the idea, not for a thorough code review):
- https://github.com/FlorentRevest/linux/commits/fprobe-min-args-3

And I could run the benchmarks against my rpi4. I have different
baseline numbers from Xu, so I ran everything again and tried to keep
the format the same. "indirect call" refers to the branch I just linked
and "direct call" refers to the series this is a reply to (Xu's work).

1. test with dd

1.1 when no bpf prog attached to vfs_write

# dd if=/dev/zero of=/dev/null count=1000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 3.94315 s, 130 MB/s


1.2 attach bpf prog with kprobe, bpftrace -e kprobe:vfs_write {}

# dd if=/dev/zero of=/dev/null count=1000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 5.80493 s, 88.2 MB/s


1.3 attach bpf prog with with direct call, bpftrace -e kfunc:vfs_write {}

# dd if=/dev/zero of=/dev/null count=1000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 4.18579 s, 122 MB/s


1.4 attach bpf prog with with indirect call, bpftrace -e kfunc:vfs_write {}

# dd if=/dev/zero of=/dev/null count=1000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 4.92616 s, 104 MB/s


2. test with bpf/bench

2.1 bench trig-base
Iter   0 ( 86.518us): hits    0.700M/s (  0.700M/prod), drops    0.000M/s, total operations    0.700M/s
Iter   1 (-26.352us): hits    0.701M/s (  0.701M/prod), drops    0.000M/s, total operations    0.701M/s
Iter   2 (  1.092us): hits    0.701M/s (  0.701M/prod), drops    0.000M/s, total operations    0.701M/s
Iter   3 ( -1.890us): hits    0.701M/s (  0.701M/prod), drops    0.000M/s, total operations    0.701M/s
Iter   4 ( -2.315us): hits    0.701M/s (  0.701M/prod), drops    0.000M/s, total operations    0.701M/s
Iter   5 (  4.184us): hits    0.701M/s (  0.701M/prod), drops    0.000M/s, total operations    0.701M/s
Iter   6 ( -3.241us): hits    0.701M/s (  0.701M/prod), drops    0.000M/s, total operations    0.701M/s
Summary: hits    0.701 ± 0.000M/s (  0.701M/prod), drops    0.000 ± 0.000M/s, total operations    0.701 ± 0.000M/s

2.2 bench trig-kprobe
Iter   0 ( 96.833us): hits    0.290M/s (  0.290M/prod), drops    0.000M/s, total operations    0.290M/s
Iter   1 (-20.834us): hits    0.291M/s (  0.291M/prod), drops    0.000M/s, total operations    0.291M/s
Iter   2 ( -2.426us): hits    0.291M/s (  0.291M/prod), drops    0.000M/s, total operations    0.291M/s
Iter   3 ( 22.332us): hits    0.292M/s (  0.292M/prod), drops    0.000M/s, total operations    0.292M/s
Iter   4 (-18.204us): hits    0.292M/s (  0.292M/prod), drops    0.000M/s, total operations    0.292M/s
Iter   5 (  5.370us): hits    0.292M/s (  0.292M/prod), drops    0.000M/s, total operations    0.292M/s
Iter   6 ( -7.853us): hits    0.290M/s (  0.290M/prod), drops    0.000M/s, total operations    0.290M/s
Summary: hits    0.291 ± 0.001M/s (  0.291M/prod), drops    0.000 ± 0.000M/s, total operations    0.291 ± 0.001M/s

2.3 bench trig-fentry, with direct call
Iter   0 ( 86.481us): hits    0.530M/s (  0.530M/prod), drops    0.000M/s, total operations    0.530M/s
Iter   1 (-12.593us): hits    0.536M/s (  0.536M/prod), drops    0.000M/s, total operations    0.536M/s
Iter   2 ( -5.760us): hits    0.532M/s (  0.532M/prod), drops    0.000M/s, total operations    0.532M/s
Iter   3 (  1.629us): hits    0.532M/s (  0.532M/prod), drops    0.000M/s, total operations    0.532M/s
Iter   4 ( -1.945us): hits    0.533M/s (  0.533M/prod), drops    0.000M/s, total operations    0.533M/s
Iter   5 ( -1.297us): hits    0.532M/s (  0.532M/prod), drops    0.000M/s, total operations    0.532M/s
Iter   6 (  0.444us): hits    0.535M/s (  0.535M/prod), drops    0.000M/s, total operations    0.535M/s
Summary: hits    0.533 ± 0.002M/s (  0.533M/prod), drops    0.000 ± 0.000M/s, total operations    0.533 ± 0.002M/s

2.4 bench trig-fentry, with indirect call
Iter   0 ( 84.463us): hits    0.404M/s (  0.404M/prod), drops    0.000M/s, total operations    0.404M/s
Iter   1 (-16.260us): hits    0.405M/s (  0.405M/prod), drops    0.000M/s, total operations    0.405M/s
Iter   2 ( -1.038us): hits    0.405M/s (  0.405M/prod), drops    0.000M/s, total operations    0.405M/s
Iter   3 ( -3.797us): hits    0.405M/s (  0.405M/prod), drops    0.000M/s, total operations    0.405M/s
Iter   4 ( -0.537us): hits    0.402M/s (  0.402M/prod), drops    0.000M/s, total operations    0.402M/s
Iter   5 (  3.536us): hits    0.403M/s (  0.403M/prod), drops    0.000M/s, total operations    0.403M/s
Iter   6 ( 12.203us): hits    0.404M/s (  0.404M/prod), drops    0.000M/s, total operations    0.404M/s
Summary: hits    0.404 ± 0.001M/s (  0.404M/prod), drops    0.000 ± 0.000M/s, total operations    0.404 ± 0.001M/s


3. perf report of bench trig-fentry

3.1 with direct call

    98.67%     0.27%  bench    bench    [.] trigger_producer
       --98.40%-- trigger_producer
          --96.63%-- syscall
             --71.90%-- el0t_64_sync
                        el0t_64_sync_handler
                        el0_svc
                        do_el0_svc
                --70.94%-- el0_svc_common
                   --29.55%-- invoke_syscall
                      --26.23%-- __arm64_sys_getpgid
                         --18.88%-- bpf_trampoline_6442462665_0
                            --6.85%-- __bpf_prog_enter
                               --2.68%-- migrate_disable
                            --5.28%-- __bpf_prog_exit
                               --1.29%-- migrate_enable
                            --3.96%-- bpf_prog_21856463590f61f1_bench_trigger_fentry
                            --0.61%-- __rcu_read_lock
                         --4.42%-- find_task_by_vpid
                            --2.53%-- radix_tree_lookup
                            --0.61%-- idr_find
                      --0.81%-- pid_vnr
                   --0.53%-- __arm64_sys_getpgid
                --0.95%-- invoke_syscall
          --0.99%-- syscall@plt


3.2 with indirect call

    98.68%     0.20%  bench    bench    [.] trigger_producer
       --98.48%-- trigger_producer
          --97.47%-- syscall
             --76.11%-- el0t_64_sync
                        el0t_64_sync_handler
                        el0_svc
                        do_el0_svc
                --75.52%-- el0_svc_common
                   --46.35%-- invoke_syscall
                      --44.06%-- __arm64_sys_getpgid
                         --35.40%-- ftrace_caller
                            --34.04%-- fprobe_handler
                               --15.61%-- bpf_fprobe_entry
                                  --3.79%-- __bpf_prog_enter
                                     --0.80%-- migrate_disable
                                  --3.74%-- __bpf_prog_exit
                                     --0.77%-- migrate_enable
                                  --2.65%-- bpf_prog_21856463590f61f1_bench_trigger_fentry
                               --12.65%-- rethook_trampoline_handler
                               --1.70%-- rethook_try_get
                                  --1.48%-- rcu_is_watching
                               --1.46%-- freelist_try_get
                               --0.65%-- rethook_recycle
                         --6.36%-- find_task_by_vpid
                            --3.64%-- radix_tree_lookup
                            --1.74%-- idr_find
                   --1.05%-- ftrace_caller
                --0.59%-- invoke_syscall

This looks slightly better than before but it is actually still a
pretty significant performance hit compared to direct calls.

Note that I can't really make sense of the perf report with indirect
calls. It always reports that 12% of the time is spent in
rethook_trampoline_handler, but I verified with both a WARN in that
function and a breakpoint with a debugger that this function does *not*
get called when running this "bench trig-fentry" benchmark. Also, it
wouldn't make sense for fprobe_handler to call it, so I'm quite
confused why perf would report this call and such a long time spent
there. Does anyone know what I could be missing here?
Steven Rostedt Oct. 17, 2022, 6:49 p.m. UTC | #20
On Mon, 17 Oct 2022 19:55:06 +0200
Florent Revest <revest@chromium.org> wrote:

> Note that I can't really make sense of the perf report with indirect
> calls. it always reports it spent 12% of the time in
> rethook_trampoline_handler but I verified with both a WARN in that
> function and a breakpoint with a debugger, this function does *not*
> get called when running this "bench trig-fentry" benchmark. Also it
> wouldn't make sense for fprobe_handler to call it so I'm quite
> confused why perf would report this call and such a long time spent
> there. Anyone know what I could be missing here ?

The trace shows __bpf_prog_exit, which I'm guessing is tracing the end of
the function. Right?

In which case I believe it must call rethook_trampoline_handler:

 -> fprobe_handler() /* Which could use some "unlikely()" to move disabled
                        paths out of the hot path */

       /* And also calls rethook_try_get () which does a cmpxchg! */

	-> rethook_hook()
		-> arch_rethook_prepare()
			Sets regs->lr = arch_rethook_trampoline

On return of the function, it jumps to arch_rethook_trampoline()

  -> arch_rethook_trampoline()
	-> arch_rethook_trampoline_callback()
		-> rethook_trampoline_handler()

So I do not know how it wouldn't trigger the WARNING or breakpoint if you
added it there.
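
To make that flow concrete, here is a heavily simplified C sketch of the
path described above. It is an illustration only, assuming the fprobe and
rethook structures of the branches discussed in this thread; it is not the
actual kernel/trace/fprobe.c or arch/arm64 rethook code:

#include <linux/fprobe.h>
#include <linux/ftrace.h>
#include <linux/rethook.h>

/* Illustration only: simplified fprobe entry path with a return hook. */
static void fprobe_handler_sketch(unsigned long ip, struct ftrace_regs *fregs,
				  struct fprobe *fp)
{
	struct rethook_node *rh;

	if (fp->entry_handler)
		fp->entry_handler(fp, ip, ftrace_get_regs(fregs));

	rh = rethook_try_get(fp->rethook);	/* cmpxchg on a shared pool */
	if (!rh)
		return;

	/*
	 * rethook_hook() -> arch_rethook_prepare() rewrites regs->lr so that
	 * the traced function "returns" into arch_rethook_trampoline(),
	 * which then calls rethook_trampoline_handler().
	 */
	rethook_hook(rh, ftrace_get_regs(fregs), false);
}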

-- Steve
Florent Revest Oct. 17, 2022, 7:10 p.m. UTC | #21
Uhuh, apologies for my perf report formatting! I'll try to figure it
out for next time; meanwhile you can find it better formatted here:
https://paste.debian.net/1257405/

On Mon, Oct 17, 2022 at 8:49 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Mon, 17 Oct 2022 19:55:06 +0200
> Florent Revest <revest@chromium.org> wrote:
>
> > Note that I can't really make sense of the perf report with indirect
> > calls. it always reports it spent 12% of the time in
> > rethook_trampoline_handler but I verified with both a WARN in that
> > function and a breakpoint with a debugger, this function does *not*
> > get called when running this "bench trig-fentry" benchmark. Also it
> > wouldn't make sense for fprobe_handler to call it so I'm quite
> > confused why perf would report this call and such a long time spent
> > there. Anyone know what I could be missing here ?
>
> The trace shows __bpf_prog_exit, which I'm guessing is tracing the end of
> the function. Right?

Actually no, this function is called to end the context of a BPF
program execution. Here it is called at the end of the fentry program
(so still before the traced function). I hope the pastebin helps
clarify this!
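
For reference, this is the shape of call_bpf_prog() as it appears in the
annotated assembly earlier in the thread (trampoline.c:118-137). It is a
simplified reconstruction from those annotations, not the exact
out-of-tree code; it only shows that __bpf_prog_enter()/__bpf_prog_exit()
bracket the BPF program run itself, which for an fentry program happens
before the traced function body:

#include <linux/bpf.h>

/* Simplified reconstruction of call_bpf_prog(), from the annotations above. */
static unsigned int call_bpf_prog_sketch(struct bpf_tramp_link *l,
					 struct bpf_tramp_run_ctx *ctx,
					 u64 *args)
{
	struct bpf_prog *p = l->link.prog;
	unsigned int ret;
	u64 start_time;

	ctx->bpf_cookie = l->cookie;

	start_time = __bpf_prog_enter(p, ctx);	/* or the sleepable/LSM variant */
	if (!start_time)
		return 0;

	ret = p->bpf_func(args, p->insnsi);	/* the fentry program itself */

	__bpf_prog_exit(p, start_time, ctx);	/* ends the BPF run context */
	return ret;
}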

> In which case I believe it must call rethook_trampoline_handler:
>
>  -> fprobe_handler() /* Which could use some "unlikely()" to move disabled
>                         paths out of the hot path */
>
>        /* And also calls rethook_try_get () which does a cmpxchg! */
>
>         -> ret_hook()
>                 -> arch_rethook_prepare()
>                         Sets regs->lr = arch_rethook_trampoline
>
> On return of the function, it jumps to arch_rethook_trampoline()
>
>   -> arch_rethook_trampoline()
>         -> arch_rethook_trampoline_callback()
>                 -> rethook_trampoline_handler()

This is indeed what happens when an fexit program is also attached.
But when running "bench trig-fentry", only an fentry program is
attached so bpf_fprobe_entry returns a non-zero value and fprobe
doesn't call rethook_hook.

Also, in this situation arch_rethook_trampoline is called on the
traced function's return, but in the perf report, iiuc, it shows up as
being called from fprobe_handler, and that should never happen. I
wonder if this is some sort of stack unwinding artifact during the
perf record?

> So I do not know how it wouldn't trigger the WARNING or breakpoint if you
> added it there.

By the way, the WARNING does trigger if I also attach an fexit program
(then rethook_hook is called). But I made sure we skip the whole
rethook logic if no fexit program is attached, so bench trig-fentry
should not go through rethook_trampoline_handler.
Masami Hiramatsu (Google) Oct. 21, 2022, 11:31 a.m. UTC | #22
Hi Florent,

On Mon, 17 Oct 2022 19:55:06 +0200
Florent Revest <revest@chromium.org> wrote:

> On Thu, Oct 6, 2022 at 6:29 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Thu, 6 Oct 2022 18:19:12 +0200
> > Florent Revest <revest@chromium.org> wrote:
> >
> > > Sure, we can give this a try, I'll work on a macro that generates the
> > > 7 callbacks and we can check how much that helps. My belief right now
> > > is that ftrace's iteration over all ops on arm64 is where we lose most
> > > time but now that we have numbers it's pretty easy to check hypothesis
> > > :)
> >
> > Ah, I forgot that's what Mark's code is doing. But yes, that needs to be
> > fixed first. I forget that arm64 doesn't have the dedicated trampolines yet.
> >
> > So, let's hold off until that is complete.
> >
> > -- Steve
> 
> Mark finished an implementation of his per-callsite-ops and min-args
> branches (meaning that we can now skip the expensive ftrace's saving
> of all registers and iteration over all ops if only one is attached)
> - https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64-ftrace-call-ops-20221017
> 
> And Masami wrote similar patches to what I had originally done to
> fprobe in my branch:
> - https://github.com/mhiramat/linux/commits/kprobes/fprobe-update
> 
> So I could rebase my previous "bpf on fprobe" branch on top of these:
> (as before, it's just good enough for benchmarking and to give a
> general sense of the idea, not for a thorough code review):
> - https://github.com/FlorentRevest/linux/commits/fprobe-min-args-3
> 
> And I could run the benchmarks against my rpi4. I have different
> baseline numbers as Xu so I ran everything again and tried to keep the
> format the same. "indirect call" refers to my branch I just linked and
> "direct call" refers to the series this is a reply to (Xu's work)

Thanks for sharing the measurement results. Yes, the fprobe/rethook
implementation is just a port of the kretprobes implementation, thus
it may not be so optimized.

BTW, I remember Wuqiang's patch for kretprobes.

https://lore.kernel.org/all/20210830173324.32507-1-wuqiang.matt@bytedance.com/T/#u

This is for fixing scalability, but it may also improve the performance
a bit. It is not hard to port to a recent kernel.
Can you try it too?

Anyway, eventually, I would like to remove the current kretprobe-based
implementation and unify the fexit hook with the function-graph tracer.
That should give better performance.

Thank you,


Florent Revest Oct. 21, 2022, 4:49 p.m. UTC | #23
On Fri, Oct 21, 2022 at 1:32 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
> On Mon, 17 Oct 2022 19:55:06 +0200
> Florent Revest <revest@chromium.org> wrote:
> > Mark finished an implementation of his per-callsite-ops and min-args
> > branches (meaning that we can now skip the expensive ftrace's saving
> > of all registers and iteration over all ops if only one is attached)
> > - https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64-ftrace-call-ops-20221017
> >
> > And Masami wrote similar patches to what I had originally done to
> > fprobe in my branch:
> > - https://github.com/mhiramat/linux/commits/kprobes/fprobe-update
> >
> > So I could rebase my previous "bpf on fprobe" branch on top of these:
> > (as before, it's just good enough for benchmarking and to give a
> > general sense of the idea, not for a thorough code review):
> > - https://github.com/FlorentRevest/linux/commits/fprobe-min-args-3
> >
> > And I could run the benchmarks against my rpi4. I have different
> > baseline numbers as Xu so I ran everything again and tried to keep the
> > format the same. "indirect call" refers to my branch I just linked and
> > "direct call" refers to the series this is a reply to (Xu's work)
>
> Thanks for sharing the measurement results. Yes, fprobes/rethook
> implementation is just porting the kretprobes implementation, thus
> it may not be so optimized.
>
> BTW, I remember Wuqiang's patch for kretprobes.
>
> https://lore.kernel.org/all/20210830173324.32507-1-wuqiang.matt@bytedance.com/T/#u

Oh that's a great idea, thanks for pointing it out Masami!

> This is for the scalability fixing, but may possible to improve
> the performance a bit. It is not hard to port to the recent kernel.
> Can you try it too?

I rebased it on my branch
https://github.com/FlorentRevest/linux/commits/fprobe-min-args-3

And I got measurements again. Unfortunately it looks like this does not help :/

New benchmark results: https://paste.debian.net/1257856/
New perf report: https://paste.debian.net/1257859/

The fprobe-based approach is still significantly slower than the
direct-call approach.

> Anyway, eventually, I would like to remove the current kretprobe
> based implementation and unify fexit hook with function-graph
> tracer. It should make more better perfromance on it.

That makes sense. :) How do you imagine the unified solution?
Would both the fgraph and fprobe APIs keep existing, but under the hood
one would be implemented on top of the other? (Or would one be gone?)
Would we replace the rethook freelist with the function graph's per-task
shadow stacks? (Or the other way around?)

> > Note that I can't really make sense of the perf report with indirect
> > calls. it always reports it spent 12% of the time in
> > rethook_trampoline_handler but I verified with both a WARN in that
> > function and a breakpoint with a debugger, this function does *not*
> > get called when running this "bench trig-fentry" benchmark. Also it
> > wouldn't make sense for fprobe_handler to call it so I'm quite
> > confused why perf would report this call and such a long time spent
> > there. Anyone know what I could be missing here ?

I made slight progress on this. If I put the vmlinux file in the cwd
where I run perf report, the reports no longer contain references to
rethook_trampoline_handler. Instead, they have a few
0xffff800008xxxxxx addresses under fprobe_handler. (like in the
pastebin I just linked)

It's still pretty weird, because that range is the vmalloc area on
arm64 and I don't understand why anything under fprobe_handler would
execute there. However, I'm also definitely sure that these 12% are
actually spent getting buffers from the rethook memory pool, because if
I replace the rethook_try_get and rethook_recycle calls with a dummy
static bss buffer (for the sake of benchmarking the "theoretical best
case scenario"), these weird perf report traces are gone and the 12%
are saved: https://paste.debian.net/1257862/
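
For clarity, a sketch of that benchmarking hack (deliberately not correct
code: a single static node is neither re-entrant nor per-task, it only
serves to measure what the path costs when getting and recycling a
rethook node is ~free):

#include <linux/rethook.h>

/* Benchmark-only stand-ins for rethook_try_get()/rethook_recycle(). */
static struct rethook_node bench_dummy_node;	/* static .bss buffer */

static struct rethook_node *bench_try_get(struct rethook *rh)
{
	return &bench_dummy_node;	/* "allocation" is free */
}

static void bench_recycle(struct rethook_node *node)
{
	/* nothing to return to a pool */
}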

This is why I would be interested in seeing rethook's memory pool
reimplemented on top of something like
https://lwn.net/Articles/788923/. If we get closer to the performance
of the theoretical best-case scenario where getting a blob of memory is
~free (and I think it could be the case with a per-task shadow stack
like fgraph's), then a bpf-on-fprobe implementation would start to
approach the performance of a direct-called trampoline on arm64:
https://paste.debian.net/1257863/
Masami Hiramatsu (Google) Oct. 24, 2022, 1 p.m. UTC | #24
On Fri, 21 Oct 2022 18:49:38 +0200
Florent Revest <revest@chromium.org> wrote:

> On Fri, Oct 21, 2022 at 1:32 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
> > On Mon, 17 Oct 2022 19:55:06 +0200
> > Florent Revest <revest@chromium.org> wrote:
> > > Mark finished an implementation of his per-callsite-ops and min-args
> > > branches (meaning that we can now skip the expensive ftrace's saving
> > > of all registers and iteration over all ops if only one is attached)
> > > - https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64-ftrace-call-ops-20221017
> > >
> > > And Masami wrote similar patches to what I had originally done to
> > > fprobe in my branch:
> > > - https://github.com/mhiramat/linux/commits/kprobes/fprobe-update
> > >
> > > So I could rebase my previous "bpf on fprobe" branch on top of these:
> > > (as before, it's just good enough for benchmarking and to give a
> > > general sense of the idea, not for a thorough code review):
> > > - https://github.com/FlorentRevest/linux/commits/fprobe-min-args-3
> > >
> > > And I could run the benchmarks against my rpi4. I have different
> > > baseline numbers as Xu so I ran everything again and tried to keep the
> > > format the same. "indirect call" refers to my branch I just linked and
> > > "direct call" refers to the series this is a reply to (Xu's work)
> >
> > Thanks for sharing the measurement results. Yes, fprobes/rethook
> > implementation is just porting the kretprobes implementation, thus
> > it may not be so optimized.
> >
> > BTW, I remember Wuqiang's patch for kretprobes.
> >
> > https://lore.kernel.org/all/20210830173324.32507-1-wuqiang.matt@bytedance.com/T/#u
> 
> Oh that's a great idea, thanks for pointing it out Masami!
> 
> > This is for the scalability fixing, but may possible to improve
> > the performance a bit. It is not hard to port to the recent kernel.
> > Can you try it too?
> 
> I rebased it on my branch
> https://github.com/FlorentRevest/linux/commits/fprobe-min-args-3
> 
> And I got measurements again. Unfortunately it looks like this does not help :/
> 
> New benchmark results: https://paste.debian.net/1257856/
> New perf report: https://paste.debian.net/1257859/

Hmm, OK. So that patch is only for scalability.

> 
> The fprobe based approach is still significantly slower than the
> direct call approach.
> 
> > Anyway, eventually, I would like to remove the current kretprobe
> > based implementation and unify fexit hook with function-graph
> > tracer. It should make more better perfromance on it.
> 
> That makes sense. :) How do you imagine the unified solution ?
> Would both the fgraph and fprobe APIs keep existing but under the hood
> one would be implemented on the other ? (or would one be gone ?) Would
> we replace the rethook freelist with the function graph's per-task
> shadow stacks ? (or the other way around ?))

Yes, that's right. As long as we use a global object pool, there will
be a performance bottleneck in picking up an object and returning it
to the pool. A per-CPU pool may give better performance, but it is more
complicated because the pools have to be balanced. A per-task shadow
stack will solve that. So I plan to expand the fgraph API and use it in
fprobe instead of rethook. (I planned to re-implement rethook, but I
realized that it has more issues than I thought.)
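
To illustrate the difference, here is a conceptual sketch only (not the
fgraph API; the types and depth limit are made up): a global pool needs
atomic operations on shared state on every function entry and exit,
while a per-task shadow stack is just an index bump on memory owned by
current, with no cross-CPU contention:

#include <linux/kernel.h>

/* Conceptual per-task shadow stack, e.g. hung off task_struct. */
struct shadow_frame {
	unsigned long ret_addr;
	unsigned long data;
};

struct shadow_stack {
	int top;
	struct shadow_frame frames[64];		/* depth limit, like fgraph's */
};

static struct shadow_frame *shadow_push(struct shadow_stack *ss)
{
	if (ss->top >= ARRAY_SIZE(ss->frames))
		return NULL;			/* too deep, skip the hook */
	return &ss->frames[ss->top++];		/* no atomics, no shared cache lines */
}

static struct shadow_frame *shadow_pop(struct shadow_stack *ss)
{
	return &ss->frames[--ss->top];		/* returns unwind in LIFO order */
}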

> > > Note that I can't really make sense of the perf report with indirect
> > > calls. it always reports it spent 12% of the time in
> > > rethook_trampoline_handler but I verified with both a WARN in that
> > > function and a breakpoint with a debugger, this function does *not*
> > > get called when running this "bench trig-fentry" benchmark. Also it
> > > wouldn't make sense for fprobe_handler to call it so I'm quite
> > > confused why perf would report this call and such a long time spent
> > > there. Anyone know what I could be missing here ?
> 
> I made slight progress on this. If I put the vmlinux file in the cwd
> where I run perf report, the reports no longer contain references to
> rethook_trampoline_handler. Instead, they have a few
> 0xffff800008xxxxxx addresses under fprobe_handler. (like in the
> pastebin I just linked)
> 
> It's still pretty weird because that range is the vmalloc area on
> arm64 and I don't understand why anything under fprobe_handler would
> execute there. However, I'm also definitely sure that these 12% are
> actually spent getting buffers from the rethook memory pool because if
> I replace rethook_try_get and rethook_recycle calls with the usage of
> a dummy static bss buffer (for the sake of benchmarking the
> "theoretical best case scenario") these weird perf report traces are
> gone and the 12% are saved. https://paste.debian.net/1257862/

Yeah, I understand that. Rethook (and kretprobes) is not designed
for such a heavy workload.

> This is why I would be interested in seeing rethook's memory pool
> reimplemented on top of something like
> https://lwn.net/Articles/788923/ If we get closer to the performance
> of the the theoretical best case scenario where getting a blob of
> memory is ~free (and I think it could be the case with a per task
> shadow stack like fgraph's), then a bpf on fprobe implementation would
> start to approach the performances of a direct called trampoline on
> arm64: https://paste.debian.net/1257863/

OK, I think we are on the same page and heading in the same direction.

Thank you,
wuqiang.matt Nov. 10, 2022, 4:58 a.m. UTC | #25
On 2022/10/22 00:49, Florent Revest wrote:
> On Fri, Oct 21, 2022 at 1:32 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>> On Mon, 17 Oct 2022 19:55:06 +0200
>> Florent Revest <revest@chromium.org> wrote:
>>> Mark finished an implementation of his per-callsite-ops and min-args
>>> branches (meaning that we can now skip the expensive ftrace's saving
>>> of all registers and iteration over all ops if only one is attached)
>>> - https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64-ftrace-call-ops-20221017
>>>
>>> And Masami wrote similar patches to what I had originally done to
>>> fprobe in my branch:
>>> - https://github.com/mhiramat/linux/commits/kprobes/fprobe-update
>>>
>>> So I could rebase my previous "bpf on fprobe" branch on top of these:
>>> (as before, it's just good enough for benchmarking and to give a
>>> general sense of the idea, not for a thorough code review):
>>> - https://github.com/FlorentRevest/linux/commits/fprobe-min-args-3
>>>
>>> And I could run the benchmarks against my rpi4. I have different
>>> baseline numbers as Xu so I ran everything again and tried to keep the
>>> format the same. "indirect call" refers to my branch I just linked and
>>> "direct call" refers to the series this is a reply to (Xu's work)
>>
>> Thanks for sharing the measurement results. Yes, fprobes/rethook
>> implementation is just porting the kretprobes implementation, thus
>> it may not be so optimized.
>>
>> BTW, I remember Wuqiang's patch for kretprobes.
>>
>> https://lore.kernel.org/all/20210830173324.32507-1-wuqiang.matt@bytedance.com/T/#u
> 
> Oh that's a great idea, thanks for pointing it out Masami!
> 
>> This is for the scalability fixing, but may possible to improve
>> the performance a bit. It is not hard to port to the recent kernel.
>> Can you try it too?
> 
> I rebased it on my branch
> https://github.com/FlorentRevest/linux/commits/fprobe-min-args-3
> 
> And I got measurements again. Unfortunately it looks like this does not help :/
> 
> New benchmark results: https://paste.debian.net/1257856/
> New perf report: https://paste.debian.net/1257859/
> 
> The fprobe based approach is still significantly slower than the
> direct call approach.

FYI, a new version was released, based on a ring array, which brings a
6.96% increase in throughput for the 1-thread case on ARM64:

https://lore.kernel.org/all/20221108071443.258794-1-wuqiang.matt@bytedance.com/

Could you share more details of the test? I'll give it a try.

>> Anyway, eventually, I would like to remove the current kretprobe
>> based implementation and unify fexit hook with function-graph
>> tracer. It should make more better perfromance on it.
> 
> That makes sense. :) How do you imagine the unified solution ?
> Would both the fgraph and fprobe APIs keep existing but under the hood
> one would be implemented on the other ? (or would one be gone ?) Would
> we replace the rethook freelist with the function graph's per-task
> shadow stacks ? (or the other way around ?))

How about a private pool designated for the local CPU? If the fprobed
routine returns on the same CPU, object allocation and reclaim can take
a quick path, which should give the same performance as a shadow stack.
Otherwise the return of an object will take a slow path (as slow as the
current freelist or objpool).
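
A rough sketch of that idea follows (hypothetical types; shared_pool_get()
and shared_pool_put() are placeholders standing in for the existing
freelist/objpool slow path; it assumes preemption is disabled in the
handler, as it is under ftrace):

#include <linux/percpu.h>
#include <linux/smp.h>

struct ret_obj {
	int home_cpu;
	struct ret_obj *next;
};

static DEFINE_PER_CPU(struct ret_obj *, local_pool);

struct ret_obj *shared_pool_get(void);		/* placeholder: shared slow pool */
void shared_pool_put(struct ret_obj *obj);	/* placeholder: shared slow pool */

static struct ret_obj *obj_get(void)
{
	struct ret_obj *obj = this_cpu_read(local_pool);

	if (obj) {					/* quick path: pop the local list */
		this_cpu_write(local_pool, obj->next);
		obj->home_cpu = smp_processor_id();
		return obj;
	}
	return shared_pool_get();			/* slow path, freelist/objpool-like */
}

static void obj_put(struct ret_obj *obj)
{
	if (obj->home_cpu == smp_processor_id()) {	/* returned on the same CPU */
		obj->next = this_cpu_read(local_pool);
		this_cpu_write(local_pool, obj);
		return;
	}
	shared_pool_put(obj);				/* cross-CPU return: slow path */
}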
