diff mbox series

[RFC] rethook: inline arch_rethook_trampoline_callback() in assembly code

Message ID 20240425000211.708557-1-andrii@kernel.org (mailing list archive)
State New
Headers show
Series [RFC] rethook: inline arch_rethook_trampoline_callback() in assembly code | expand

Commit Message

Andrii Nakryiko April 25, 2024, 12:02 a.m. UTC
At the lowest level, rethook-based kretprobes on x86-64 architecture go
through arch_rethoook_trampoline() function, manually written in
assembly, which calls into a simple arch_rethook_trampoline_callback()
function, written in C, and only doing a few straightforward field
assignments, before calling further into rethook_trampoline_handler(),
which handles kretprobe callbacks generically.

Looking at simplicity of arch_rethook_trampoline_callback(), it seems
not really worthwhile to spend an extra function call just to do 4 or
5 assignments. As such, this patch proposes to "inline"
arch_rethook_trampoline_callback() into arch_rethook_trampoline() by
manually implementing it in an assembly code.

This has two motivations. First, we do get a bit of runtime speed up by
avoiding function calls. Using BPF selftests's bench tool, we see
0.6%-0.8% throughput improvement for kretprobe/multi-kretprobe
triggering code path:

BEFORE (latest probes/for-next)
===============================
kretprobe      :   10.455 ± 0.024M/s
kretprobe-multi:   11.150 ± 0.012M/s

AFTER (probes/for-next + this patch)
====================================
kretprobe      :   10.540 ± 0.009M/s (+0.8%)
kretprobe-multi:   11.219 ± 0.042M/s (+0.6%)

Second, and no less importantly for some specialized use cases, this
avoids unnecessarily "polluting" LBR records with an extra function call
(recorded as a jump by CPU). This is the case for the retsnoop ([0])
tool, which relies havily on capturing LBR records to provide users with
lots of insight into kernel internals.

This RFC patch is only inlining this function for x86-64, but it's
possible to do that for 32-bit x86 arch as well and then remove
arch_rethook_trampoline_callback() implementation altogether. Please let
me know if this change is acceptable and whether I should complete it
with 32-bit "inlining" as well. Thanks!

  [0] https://nakryiko.com/posts/retsnoop-intro/#peering-deep-into-functions-with-lbr

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 arch/x86/kernel/asm-offsets_64.c |  4 ++++
 arch/x86/kernel/rethook.c        | 37 +++++++++++++++++++++++++++-----
 2 files changed, 36 insertions(+), 5 deletions(-)

Comments

Andrii Nakryiko April 29, 2024, 10:38 p.m. UTC | #1
On Wed, Apr 24, 2024 at 5:02 PM Andrii Nakryiko <andrii@kernel.org> wrote:
>
> At the lowest level, rethook-based kretprobes on x86-64 architecture go
> through arch_rethoook_trampoline() function, manually written in
> assembly, which calls into a simple arch_rethook_trampoline_callback()
> function, written in C, and only doing a few straightforward field
> assignments, before calling further into rethook_trampoline_handler(),
> which handles kretprobe callbacks generically.
>
> Looking at simplicity of arch_rethook_trampoline_callback(), it seems
> not really worthwhile to spend an extra function call just to do 4 or
> 5 assignments. As such, this patch proposes to "inline"
> arch_rethook_trampoline_callback() into arch_rethook_trampoline() by
> manually implementing it in an assembly code.
>
> This has two motivations. First, we do get a bit of runtime speed up by
> avoiding function calls. Using BPF selftests's bench tool, we see
> 0.6%-0.8% throughput improvement for kretprobe/multi-kretprobe
> triggering code path:
>
> BEFORE (latest probes/for-next)
> ===============================
> kretprobe      :   10.455 ± 0.024M/s
> kretprobe-multi:   11.150 ± 0.012M/s
>
> AFTER (probes/for-next + this patch)
> ====================================
> kretprobe      :   10.540 ± 0.009M/s (+0.8%)
> kretprobe-multi:   11.219 ± 0.042M/s (+0.6%)
>
> Second, and no less importantly for some specialized use cases, this
> avoids unnecessarily "polluting" LBR records with an extra function call
> (recorded as a jump by CPU). This is the case for the retsnoop ([0])
> tool, which relies havily on capturing LBR records to provide users with
> lots of insight into kernel internals.
>
> This RFC patch is only inlining this function for x86-64, but it's
> possible to do that for 32-bit x86 arch as well and then remove
> arch_rethook_trampoline_callback() implementation altogether. Please let
> me know if this change is acceptable and whether I should complete it
> with 32-bit "inlining" as well. Thanks!
>
>   [0] https://nakryiko.com/posts/retsnoop-intro/#peering-deep-into-functions-with-lbr
>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  arch/x86/kernel/asm-offsets_64.c |  4 ++++
>  arch/x86/kernel/rethook.c        | 37 +++++++++++++++++++++++++++-----
>  2 files changed, 36 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
> index bb65371ea9df..5c444abc540c 100644
> --- a/arch/x86/kernel/asm-offsets_64.c
> +++ b/arch/x86/kernel/asm-offsets_64.c
> @@ -42,6 +42,10 @@ int main(void)
>         ENTRY(r14);
>         ENTRY(r15);
>         ENTRY(flags);
> +       ENTRY(ip);
> +       ENTRY(cs);
> +       ENTRY(ss);
> +       ENTRY(orig_ax);
>         BLANK();
>  #undef ENTRY
>
> diff --git a/arch/x86/kernel/rethook.c b/arch/x86/kernel/rethook.c
> index 8a1c0111ae79..3e1c01beebd1 100644
> --- a/arch/x86/kernel/rethook.c
> +++ b/arch/x86/kernel/rethook.c
> @@ -6,6 +6,7 @@
>  #include <linux/rethook.h>
>  #include <linux/kprobes.h>
>  #include <linux/objtool.h>
> +#include <asm/asm-offsets.h>
>
>  #include "kprobes/common.h"
>
> @@ -34,10 +35,36 @@ asm(
>         "       pushq %rsp\n"
>         "       pushfq\n"
>         SAVE_REGS_STRING
> -       "       movq %rsp, %rdi\n"
> -       "       call arch_rethook_trampoline_callback\n"
> +       "       movq %rsp, %rdi\n" /* $rdi points to regs */
> +       /* fixup registers */
> +       /* regs->cs = __KERNEL_CS; */
> +       "       movq $" __stringify(__KERNEL_CS) ", " __stringify(pt_regs_cs) "(%rdi)\n"
> +       /* regs->ip = (unsigned long)&arch_rethook_trampoline; */
> +       "       movq $arch_rethook_trampoline, " __stringify(pt_regs_ip) "(%rdi)\n"
> +       /* regs->orig_ax = ~0UL; */
> +       "       movq $0xffffffffffffffff, " __stringify(pt_regs_orig_ax) "(%rdi)\n"
> +       /* regs->sp += 2*sizeof(long); */
> +       "       addq $16, " __stringify(pt_regs_sp) "(%rdi)\n"
> +       /* 2nd arg is frame_pointer = (long *)(regs + 1); */
> +       "       lea " __stringify(PTREGS_SIZE) "(%rdi), %rsi\n"

BTW, all this __stringify() ugliness can be avoided if we move this
assembly into its own .S file, like lots of other assembly functions
in arch/x86/kernel subdir. That has another benefit of generating
better line information in DWARF for those assembly instructions. It's
lots more work, so before I do this, I'd like to get confirmation that
this change is acceptable in principle.

> +       /*
> +        * The return address at 'frame_pointer' is recovered by the
> +        * arch_rethook_fixup_return() which called from this
> +        * rethook_trampoline_handler().
> +        */
> +       "       call rethook_trampoline_handler\n"
> +       /*
> +        * Copy FLAGS to 'pt_regs::ss' so we can do RET right after POPF.
> +        *
> +        * We don't save/restore %rax below, because we ignore
> +        * rethook_trampoline_handler result.
> +        *
> +        * *(unsigned long *)&regs->ss = regs->flags;
> +        */
> +       "       mov " __stringify(pt_regs_flags) "(%rsp), %rax\n"
> +       "       mov %rax, " __stringify(pt_regs_ss) "(%rsp)\n"
>         RESTORE_REGS_STRING
> -       /* In the callback function, 'regs->flags' is copied to 'regs->ss'. */
> +       /* We just copied 'regs->flags' into 'regs->ss'. */
>         "       addq $16, %rsp\n"
>         "       popfq\n"
>  #else
> @@ -61,6 +88,7 @@ asm(
>  );
>  NOKPROBE_SYMBOL(arch_rethook_trampoline);
>
> +#ifdef CONFIG_X86_32
>  /*
>   * Called from arch_rethook_trampoline
>   */
> @@ -70,9 +98,7 @@ __used __visible void arch_rethook_trampoline_callback(struct pt_regs *regs)
>
>         /* fixup registers */
>         regs->cs = __KERNEL_CS;
> -#ifdef CONFIG_X86_32
>         regs->gs = 0;
> -#endif
>         regs->ip = (unsigned long)&arch_rethook_trampoline;
>         regs->orig_ax = ~0UL;
>         regs->sp += 2*sizeof(long);
> @@ -92,6 +118,7 @@ __used __visible void arch_rethook_trampoline_callback(struct pt_regs *regs)
>         *(unsigned long *)&regs->ss = regs->flags;
>  }
>  NOKPROBE_SYMBOL(arch_rethook_trampoline_callback);
> +#endif
>
>  /*
>   * arch_rethook_trampoline() skips updating frame pointer. The frame pointer
> --
> 2.43.0
>
diff mbox series

Patch

diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index bb65371ea9df..5c444abc540c 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -42,6 +42,10 @@  int main(void)
 	ENTRY(r14);
 	ENTRY(r15);
 	ENTRY(flags);
+	ENTRY(ip);
+	ENTRY(cs);
+	ENTRY(ss);
+	ENTRY(orig_ax);
 	BLANK();
 #undef ENTRY
 
diff --git a/arch/x86/kernel/rethook.c b/arch/x86/kernel/rethook.c
index 8a1c0111ae79..3e1c01beebd1 100644
--- a/arch/x86/kernel/rethook.c
+++ b/arch/x86/kernel/rethook.c
@@ -6,6 +6,7 @@ 
 #include <linux/rethook.h>
 #include <linux/kprobes.h>
 #include <linux/objtool.h>
+#include <asm/asm-offsets.h>
 
 #include "kprobes/common.h"
 
@@ -34,10 +35,36 @@  asm(
 	"	pushq %rsp\n"
 	"	pushfq\n"
 	SAVE_REGS_STRING
-	"	movq %rsp, %rdi\n"
-	"	call arch_rethook_trampoline_callback\n"
+	"	movq %rsp, %rdi\n" /* $rdi points to regs */
+	/* fixup registers */
+	/* regs->cs = __KERNEL_CS; */
+	"	movq $" __stringify(__KERNEL_CS) ", " __stringify(pt_regs_cs) "(%rdi)\n"
+	/* regs->ip = (unsigned long)&arch_rethook_trampoline; */
+	"	movq $arch_rethook_trampoline, " __stringify(pt_regs_ip) "(%rdi)\n"
+	/* regs->orig_ax = ~0UL; */
+	"	movq $0xffffffffffffffff, " __stringify(pt_regs_orig_ax) "(%rdi)\n"
+	/* regs->sp += 2*sizeof(long); */
+	"	addq $16, " __stringify(pt_regs_sp) "(%rdi)\n"
+	/* 2nd arg is frame_pointer = (long *)(regs + 1); */
+	"	lea " __stringify(PTREGS_SIZE) "(%rdi), %rsi\n"
+	/*
+	 * The return address at 'frame_pointer' is recovered by the
+	 * arch_rethook_fixup_return() which called from this
+	 * rethook_trampoline_handler().
+	 */
+	"	call rethook_trampoline_handler\n"
+	/*
+	 * Copy FLAGS to 'pt_regs::ss' so we can do RET right after POPF.
+	 *
+	 * We don't save/restore %rax below, because we ignore
+	 * rethook_trampoline_handler result.
+	 *
+	 * *(unsigned long *)&regs->ss = regs->flags;
+	 */
+	"	mov " __stringify(pt_regs_flags) "(%rsp), %rax\n"
+	"	mov %rax, " __stringify(pt_regs_ss) "(%rsp)\n"
 	RESTORE_REGS_STRING
-	/* In the callback function, 'regs->flags' is copied to 'regs->ss'. */
+	/* We just copied 'regs->flags' into 'regs->ss'. */
 	"	addq $16, %rsp\n"
 	"	popfq\n"
 #else
@@ -61,6 +88,7 @@  asm(
 );
 NOKPROBE_SYMBOL(arch_rethook_trampoline);
 
+#ifdef CONFIG_X86_32
 /*
  * Called from arch_rethook_trampoline
  */
@@ -70,9 +98,7 @@  __used __visible void arch_rethook_trampoline_callback(struct pt_regs *regs)
 
 	/* fixup registers */
 	regs->cs = __KERNEL_CS;
-#ifdef CONFIG_X86_32
 	regs->gs = 0;
-#endif
 	regs->ip = (unsigned long)&arch_rethook_trampoline;
 	regs->orig_ax = ~0UL;
 	regs->sp += 2*sizeof(long);
@@ -92,6 +118,7 @@  __used __visible void arch_rethook_trampoline_callback(struct pt_regs *regs)
 	*(unsigned long *)&regs->ss = regs->flags;
 }
 NOKPROBE_SYMBOL(arch_rethook_trampoline_callback);
+#endif
 
 /*
  * arch_rethook_trampoline() skips updating frame pointer. The frame pointer