[PATCHv4,bpf-next,2/7] uprobe: Add uretprobe syscall to speed up return probe

Message ID 20240502122313.1579719-3-jolsa@kernel.org (mailing list archive)
State Superseded
Series uprobe: uretprobe speed up

Commit Message

Jiri Olsa May 2, 2024, 12:23 p.m. UTC
Adding uretprobe syscall instead of trap to speed up return probe.

At the moment the uretprobe setup/path is:

  - install entry uprobe

  - when the uprobe is hit, it overwrites probed function's return address
    on stack with address of the trampoline that contains breakpoint
    instruction

  - the breakpoint trap code handles the uretprobe consumers execution and
    jumps back to original return address

This patch replaces the above trampoline's breakpoint instruction with new
uretprobe syscall. This syscall does exactly the same job as the trap
with some more extra work:

  - syscall trampoline must save original value for rax/r11/rcx registers
    on stack - rax is set to syscall number and r11/rcx are changed and
    used by syscall instruction

  - the syscall code reads the original values of those registers and
    restores those values in task's pt_regs area

  - only a call from the trampoline exposed in '[uprobes]' is allowed,
    the process will receive SIGILL signal otherwise
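
The register handshake described in the list above can be modeled in plain
userspace C. This is only a toy sketch, not the patch's code: the toy_*
names, the 8-slot stack and the TOY_NR_URETPROBE placeholder are all
hypothetical, and the model assumes the original rax comes back to user
space as the syscall's return value.

```c
#include <stdint.h>

/* Toy model only: these types and names are hypothetical, not kernel code. */
#define TOY_STACK_SLOTS 8
#define TOY_NR_URETPROBE 0xffffu	/* placeholder, not the real syscall number */

struct toy_cpu {
	uint64_t rax, rcx, r11, rip;
	uint64_t stack[TOY_STACK_SLOTS];
	int sp;				/* index of the top slot, grows down */
};

static void toy_push(struct toy_cpu *c, uint64_t v) { c->stack[--c->sp] = v; }
static uint64_t toy_pop(struct toy_cpu *c) { return c->stack[c->sp++]; }

/* Trampoline up to and including the syscall instruction. */
static void toy_tramp_entry(struct toy_cpu *c)
{
	toy_push(c, c->rax);
	toy_push(c, c->rcx);
	toy_push(c, c->r11);
	c->rax = TOY_NR_URETPROBE;	/* rax now carries the syscall number */
	c->rcx = 0xdead;		/* the syscall instruction clobbers rcx and r11 */
	c->r11 = 0xbeef;
}

/* Kernel side: read the saved rax, then store the return address in its slot. */
static void toy_sys_uretprobe(struct toy_cpu *c, uint64_t orig_ret)
{
	uint64_t saved_rax = c->stack[c->sp + 2];

	c->stack[c->sp + 2] = orig_ret;
	c->rax = saved_rax;		/* assumed: rax is restored via the return value */
}

/* Trampoline after the syscall: pop r11, pop rcx, ret. */
static void toy_tramp_exit(struct toy_cpu *c)
{
	c->r11 = toy_pop(c);
	c->rcx = toy_pop(c);
	c->rip = toy_pop(c);		/* ret consumes the slot that used to hold rax */
}
```

Running entry, syscall and exit in that order leaves rax/rcx/r11 at their
original values and rip at the original return address, which is why the
trampoline does not need a separate pop for rax.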

Even with some extra work, using the uretprobes syscall shows speed
improvement (compared to using standard breakpoint):

  On Intel (11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz)

  current:
    uretprobe-nop  :    1.498 ± 0.000M/s
    uretprobe-push :    1.448 ± 0.001M/s
    uretprobe-ret  :    0.816 ± 0.001M/s

  with the fix:
    uretprobe-nop  :    1.969 ± 0.002M/s  < 31% speed up
    uretprobe-push :    1.910 ± 0.000M/s  < 31% speed up
    uretprobe-ret  :    0.934 ± 0.000M/s  < 14% speed up

  On AMD (AMD Ryzen 7 5700U)

  current:
    uretprobe-nop  :    0.778 ± 0.001M/s
    uretprobe-push :    0.744 ± 0.001M/s
    uretprobe-ret  :    0.540 ± 0.001M/s

  with the fix:
    uretprobe-nop  :    0.860 ± 0.001M/s  < 10% speed up
    uretprobe-push :    0.818 ± 0.001M/s  < 10% speed up
    uretprobe-ret  :    0.578 ± 0.000M/s  <  7% speed up
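
For readers who want to recompute the speed-up column, the relative gain is
(after - before) / before. The helper below is a hypothetical sketch, not
part of the benchmark scripts; the percentages in the tables appear to be
rounded.

```c
/* Relative throughput gain in percent: (after - before) / before * 100. */
static double speedup_pct(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

For example, speedup_pct(1.498, 1.969) evaluates to roughly 31.4, in line
with the uretprobe-nop row for the Intel machine.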

The performance test spawns a thread that runs a loop which triggers a
uprobe with an attached bpf program. The program increments the counter
that is reported in the results above.

The uprobe (and uretprobe) kind is determined by which instruction
is being patched with the breakpoint instruction. That's also important
for uretprobes, because an entry uprobe is installed for each uretprobe.

The performance test is part of bpf selftests:
  tools/testing/selftests/bpf/run_bench_uprobes.sh

Note that at the moment the uretprobe syscall is supported only for native
64-bit processes; compat processes still use the standard breakpoint.

Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 arch/x86/kernel/uprobes.c | 115 ++++++++++++++++++++++++++++++++++++++
 include/linux/uprobes.h   |   3 +
 kernel/events/uprobes.c   |  24 +++++---
 3 files changed, 135 insertions(+), 7 deletions(-)

Comments

Peter Zijlstra May 3, 2024, 11:34 a.m. UTC | #1
On Thu, May 02, 2024 at 02:23:08PM +0200, Jiri Olsa wrote:
> Adding uretprobe syscall instead of trap to speed up return probe.
> 
> At the moment the uretprobe setup/path is:
> 
>   - install entry uprobe
> 
>   - when the uprobe is hit, it overwrites probed function's return address
>     on stack with address of the trampoline that contains breakpoint
>     instruction
> 
>   - the breakpoint trap code handles the uretprobe consumers execution and
>     jumps back to original return address
> 
> This patch replaces the above trampoline's breakpoint instruction with new
> uretprobe syscall. This syscall does exactly the same job as the trap
> with some more extra work:
> 
>   - syscall trampoline must save original value for rax/r11/rcx registers
>     on stack - rax is set to syscall number and r11/rcx are changed and
>     used by syscall instruction
> 
>   - the syscall code reads the original values of those registers and
>     restore those values in task's pt_regs area
> 
>   - only caller from trampoline exposed in '[uprobes]' is allowed,
>     the process will receive SIGILL signal otherwise
> 

Did you consider shadow stacks? IIRC we currently have userspace shadow
stack support available, and that will utterly break all of this.

It would be really nice if the new scheme would consider shadow stacks.
Jiri Olsa May 3, 2024, 1:04 p.m. UTC | #2
On Fri, May 03, 2024 at 01:34:53PM +0200, Peter Zijlstra wrote:
> On Thu, May 02, 2024 at 02:23:08PM +0200, Jiri Olsa wrote:
> > Adding uretprobe syscall instead of trap to speed up return probe.
> > 
> > At the moment the uretprobe setup/path is:
> > 
> >   - install entry uprobe
> > 
> >   - when the uprobe is hit, it overwrites probed function's return address
> >     on stack with address of the trampoline that contains breakpoint
> >     instruction
> > 
> >   - the breakpoint trap code handles the uretprobe consumers execution and
> >     jumps back to original return address
> > 
> > This patch replaces the above trampoline's breakpoint instruction with new
> > uretprobe syscall. This syscall does exactly the same job as the trap
> > with some more extra work:
> > 
> >   - syscall trampoline must save original value for rax/r11/rcx registers
> >     on stack - rax is set to syscall number and r11/rcx are changed and
> >     used by syscall instruction
> > 
> >   - the syscall code reads the original values of those registers and
> >     restore those values in task's pt_regs area
> > 
> >   - only caller from trampoline exposed in '[uprobes]' is allowed,
> >     the process will receive SIGILL signal otherwise
> > 
> 
> Did you consider shadow stacks? IIRC we currently have userspace shadow
> stack support available, and that will utterly break all of this.

nope.. I guess it's the extra ret instruction in the trampoline that would
make it crash?

> 
> It would be really nice if the new scheme would consider shadow stacks.

I seem to have the hw with support for user_shstk, let me test that

thanks,
jirka
Edgecombe, Rick P May 3, 2024, 3:53 p.m. UTC | #3
On Fri, 2024-05-03 at 15:04 +0200, Jiri Olsa wrote:
> On Fri, May 03, 2024 at 01:34:53PM +0200, Peter Zijlstra wrote:
> > On Thu, May 02, 2024 at 02:23:08PM +0200, Jiri Olsa wrote:
> > > Adding uretprobe syscall instead of trap to speed up return probe.
> > > 
> > > At the moment the uretprobe setup/path is:
> > > 
> > >    - install entry uprobe
> > > 
> > >    - when the uprobe is hit, it overwrites probed function's return address
> > >      on stack with address of the trampoline that contains breakpoint
> > >      instruction
> > > 
> > >    - the breakpoint trap code handles the uretprobe consumers execution and
> > >      jumps back to original return address

Hi,

I worked on the x86 shadow stack support.

I didn't know uprobes did anything like this. In hindsight I should have looked
more closely. The current upstream behavior is to overwrite the return address
on the stack?

Stupid uprobes question - what is actually overwriting the return address on the
stack? Is it the kernel? If so perhaps the kernel could just update the shadow
stack at the same time.

> > > 
> > > This patch replaces the above trampoline's breakpoint instruction with new
> > > uretprobe syscall. This syscall does exactly the same job as the trap
> > > with some more extra work:
> > > 
> > >    - syscall trampoline must save original value for rax/r11/rcx registers
> > >      on stack - rax is set to syscall number and r11/rcx are changed and
> > >      used by syscall instruction
> > > 
> > >    - the syscall code reads the original values of those registers and
> > >      restore those values in task's pt_regs area
> > > 
> > >    - only caller from trampoline exposed in '[uprobes]' is allowed,
> > >      the process will receive SIGILL signal otherwise
> > > 
> > 
> > Did you consider shadow stacks? IIRC we currently have userspace shadow
> > stack support available, and that will utterly break all of this.
> 
> nope.. I guess it's the extra ret instruction in the trampoline that would
> make it crash?

The original behavior seems problematic for shadow stack IIUC. I'm not sure of
the additional breakage with the new behavior.

Roughly, how shadow stack works is there is an additional protected stack for
the app thread. The HW pushes to the shadow stack with CALL, and pops from
it with RET. But it also continues to push and pop from the normal stack. On
pop, if the values don't match between the two stacks, an exception is
generated. The whole point is to prevent the app from overwriting its stack
return address to return to random places.

Userspace cannot (normally) write to the shadow stack, but the kernel can do
this or adjust the SSP (shadow stack pointer). So in the kernel (for things like
sigreturn) there is an ability to do what is needed. Ptracers also can do things
like this.
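
The CALL/RET behavior described above can be illustrated with a small
userspace model. This is a hedged sketch only: the toy_thread type, the
16-slot depth and the boolean "fault" result are assumptions made for
illustration, and nothing here touches the real hardware feature.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_DEPTH 16

/* Hypothetical model of one thread with a normal and a shadow stack. */
struct toy_thread {
	uint64_t stack[TOY_DEPTH];	/* normal stack, writable by the app */
	uint64_t shstk[TOY_DEPTH];	/* shadow stack, protected from the app */
	int sp, ssp;			/* both grow down */
};

static void toy_init(struct toy_thread *t)
{
	t->sp = TOY_DEPTH;
	t->ssp = TOY_DEPTH;
}

/* CALL pushes the return address to both stacks. */
static void toy_call(struct toy_thread *t, uint64_t ret_addr)
{
	t->stack[--t->sp] = ret_addr;
	t->shstk[--t->ssp] = ret_addr;
}

/* RET pops both stacks; a mismatch models the exception (returns false). */
static bool toy_ret(struct toy_thread *t, uint64_t *target)
{
	uint64_t normal = t->stack[t->sp++];
	uint64_t shadow = t->shstk[t->ssp++];

	if (normal != shadow)
		return false;
	*target = normal;
	return true;
}
```

In this model, overwriting only stack[sp] after a toy_call() makes the
following toy_ret() fail, while updating shstk[ssp] with the same value lets
it succeed. That is the situation a return-address hijack creates, and why
the kernel would have to update the shadow stack at the same time.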
Jiri Olsa May 3, 2024, 7:18 p.m. UTC | #4
On Fri, May 03, 2024 at 03:53:15PM +0000, Edgecombe, Rick P wrote:
> On Fri, 2024-05-03 at 15:04 +0200, Jiri Olsa wrote:
> > On Fri, May 03, 2024 at 01:34:53PM +0200, Peter Zijlstra wrote:
> > > On Thu, May 02, 2024 at 02:23:08PM +0200, Jiri Olsa wrote:
> > > > Adding uretprobe syscall instead of trap to speed up return probe.
> > > > 
> > > > At the moment the uretprobe setup/path is:
> > > > 
> > > >    - install entry uprobe
> > > > 
> > > >    - when the uprobe is hit, it overwrites probed function's return address
> > > >      on stack with address of the trampoline that contains breakpoint
> > > >      instruction
> > > > 
> > > >    - the breakpoint trap code handles the uretprobe consumers execution and
> > > >      jumps back to original return address
> 
> Hi,
> 
> I worked on the x86 shadow stack support.
> 
> I didn't know uprobes did anything like this. In hindsight I should have looked
> more closely. The current upstream behavior is to overwrite the return address
> on the stack?
> 
> Stupid uprobes question - what is actually overwriting the return address on the
> stack? Is it the kernel? If so perhaps the kernel could just update the shadow
> stack at the same time.

yes, it's in kernel - arch_uretprobe_hijack_return_addr .. so I guess
we need to update the shadow stack with the new return value as well

> 
> > > > 
> > > > This patch replaces the above trampoline's breakpoint instruction with new
> > > > uretprobe syscall. This syscall does exactly the same job as the trap
> > > > with some more extra work:
> > > > 
> > > >    - syscall trampoline must save original value for rax/r11/rcx registers
> > > >      on stack - rax is set to syscall number and r11/rcx are changed and
> > > >      used by syscall instruction
> > > > 
> > > >    - the syscall code reads the original values of those registers and
> > > >      restore those values in task's pt_regs area
> > > > 
> > > >    - only caller from trampoline exposed in '[uprobes]' is allowed,
> > > >      the process will receive SIGILL signal otherwise
> > > > 
> > > 
> > > Did you consider shadow stacks? IIRC we currently have userspace shadow
> > > stack support available, and that will utterly break all of this.
> > 
> > nope.. I guess it's the extra ret instruction in the trampoline that would
> > make it crash?
> 
> The original behavior seems problematic for shadow stack IIUC. I'm not sure of
> the additional breakage with the new behavior.

I can see it's broken also for current uprobes

> 
> Roughly, how shadow stack works is there is an additional protected stack for
> the app thread. The HW pushes to the shadow stack with CALL, and pops from
> it with RET. But it also continues to push and pop from the normal stack. On
> pop, if the values don't match between the two stacks, an exception is
> generated. The whole point is to prevent the app from overwriting its stack
> return address to return to random places.
> 
> Userspace cannot (normally) write to the shadow stack, but the kernel can do
> this or adjust the SSP (shadow stack pointer). So in the kernel (for things like
> sigreturn) there is an ability to do what is needed. Ptracers also can do things
> like this.

hack below seems to fix it for the current uprobe setup,
we need similar fix for the uretprobe syscall trampoline setup

jirka


---
diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
index 42fee8959df7..99a0948a3b79 100644
--- a/arch/x86/include/asm/shstk.h
+++ b/arch/x86/include/asm/shstk.h
@@ -21,6 +21,7 @@ unsigned long shstk_alloc_thread_stack(struct task_struct *p, unsigned long clon
 void shstk_free(struct task_struct *p);
 int setup_signal_shadow_stack(struct ksignal *ksig);
 int restore_signal_shadow_stack(void);
+void uprobe_change_stack(unsigned long addr);
 #else
 static inline long shstk_prctl(struct task_struct *task, int option,
 			       unsigned long arg2) { return -EINVAL; }
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 59e15dd8d0f8..d2c4dbe5843c 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -577,3 +577,11 @@ long shstk_prctl(struct task_struct *task, int option, unsigned long arg2)
 		return wrss_control(true);
 	return -EINVAL;
 }
+
+void uprobe_change_stack(unsigned long addr)
+{
+	unsigned long ssp;
+
+	ssp = get_user_shstk_addr();
+	write_user_shstk_64((u64 __user *)ssp, (u64)addr);
+}
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 81e6ee95784d..88afbeaacb8f 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -348,7 +348,7 @@ void *arch_uprobe_trampoline(unsigned long *psize)
 	 * only for native 64-bit process, the compat process still uses
 	 * standard breakpoint.
 	 */
-	if (user_64bit_mode(regs)) {
+	if (0 && user_64bit_mode(regs)) {
 		*psize = uretprobe_syscall_end - uretprobe_syscall_entry;
 		return uretprobe_syscall_entry;
 	}
@@ -1191,8 +1191,10 @@ arch_uretprobe_hijack_return_addr(unsigned long trampoline_vaddr, struct pt_regs
 		return orig_ret_vaddr;
 
 	nleft = copy_to_user((void __user *)regs->sp, &trampoline_vaddr, rasize);
-	if (likely(!nleft))
+	if (likely(!nleft)) {
+		uprobe_change_stack(trampoline_vaddr);
 		return orig_ret_vaddr;
+	}
 
 	if (nleft != rasize) {
 		pr_err("return address clobbered: pid=%d, %%sp=%#lx, %%ip=%#lx\n",
Edgecombe, Rick P May 3, 2024, 7:38 p.m. UTC | #5
+Some more shadow stack folks from other archs. We are discussing how uretprobes
work with shadow stack.

Context:
https://lore.kernel.org/lkml/ZjU4ganRF1Cbiug6@krava/

On Fri, 2024-05-03 at 21:18 +0200, Jiri Olsa wrote:
> 
> hack below seems to fix it for the current uprobe setup,
> we need similar fix for the uretprobe syscall trampoline setup

It seems like a reasonable direction.

Security-wise, applications cannot do this on themselves, or it is an otherwise
privileged thing right?
Jiri Olsa May 3, 2024, 8:17 p.m. UTC | #6
On Fri, May 03, 2024 at 07:38:18PM +0000, Edgecombe, Rick P wrote:
> +Some more shadow stack folks from other archs. We are discussing how uretprobes
> work with shadow stack.
> 
> Context:
> https://lore.kernel.org/lkml/ZjU4ganRF1Cbiug6@krava/
> 
> On Fri, 2024-05-03 at 21:18 +0200, Jiri Olsa wrote:
> > 
> > hack below seems to fix it for the current uprobe setup,
> > we need similar fix for the uretprobe syscall trampoline setup
> 
> It seems like a reasonable direction.
> 
> Security-wise, applications cannot do this on themselves, or it is an otherwise
> privileged thing right?

when uretprobe is created, kernel overwrites the return address on user
stack to point to user space trampoline, so the setup is in kernel hands

with the hack below on top of this patchset I'm no longer seeing shadow
stack app crash on uretprobe.. I'll try to polish it and send out next
week, any suggestions are welcome ;-)

thanks,
jirka


---
diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
index 42fee8959df7..d374305a6851 100644
--- a/arch/x86/include/asm/shstk.h
+++ b/arch/x86/include/asm/shstk.h
@@ -21,6 +21,8 @@ unsigned long shstk_alloc_thread_stack(struct task_struct *p, unsigned long clon
 void shstk_free(struct task_struct *p);
 int setup_signal_shadow_stack(struct ksignal *ksig);
 int restore_signal_shadow_stack(void);
+void uprobe_change_stack(unsigned long addr);
+void uprobe_push_stack(unsigned long addr);
 #else
 static inline long shstk_prctl(struct task_struct *task, int option,
 			       unsigned long arg2) { return -EINVAL; }
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 59e15dd8d0f8..804c446231d9 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -577,3 +577,24 @@ long shstk_prctl(struct task_struct *task, int option, unsigned long arg2)
 		return wrss_control(true);
 	return -EINVAL;
 }
+
+void uprobe_change_stack(unsigned long addr)
+{
+	unsigned long ssp;
+
+	ssp = get_user_shstk_addr();
+	write_user_shstk_64((u64 __user *)ssp, (u64)addr);
+}
+
+void uprobe_push_stack(unsigned long addr)
+{
+	unsigned long ssp;
+
+	ssp = get_user_shstk_addr();
+	ssp -= SS_FRAME_SIZE;
+	write_user_shstk_64((u64 __user *)ssp, (u64)addr);
+
+	fpregs_lock_and_load();
+	wrmsrl(MSR_IA32_PL3_SSP, ssp);
+	fpregs_unlock();
+}
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 81e6ee95784d..259457838020 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -416,6 +416,7 @@ SYSCALL_DEFINE0(uretprobe)
 	regs->r11 = regs->flags;
 	regs->cx  = regs->ip;
 
+	uprobe_push_stack(r11_cx_ax[2]);
 	return regs->ax;
 
 sigill:
@@ -1191,8 +1192,10 @@ arch_uretprobe_hijack_return_addr(unsigned long trampoline_vaddr, struct pt_regs
 		return orig_ret_vaddr;
 
 	nleft = copy_to_user((void __user *)regs->sp, &trampoline_vaddr, rasize);
-	if (likely(!nleft))
+	if (likely(!nleft)) {
+		uprobe_change_stack(trampoline_vaddr);
 		return orig_ret_vaddr;
+	}
 
 	if (nleft != rasize) {
 		pr_err("return address clobbered: pid=%d, %%sp=%#lx, %%ip=%#lx\n",
Edgecombe, Rick P May 3, 2024, 8:35 p.m. UTC | #7
On Fri, 2024-05-03 at 22:17 +0200, Jiri Olsa wrote:
> when uretprobe is created, kernel overwrites the return address on user
> stack to point to user space trampoline, so the setup is in kernel hands

I mean for uprobes in general. I didn't have any specific ideas in mind, but
in general when we give the kernel more abilities around shadow stack we have to
think if attackers could use it to work around shadow stack protections.

> 
> with the hack below on top of this patchset I'm no longer seeing shadow
> stack app crash on uretprobe.. I'll try to polish it and send out next
> week, any suggestions are welcome ;-)

Thanks. Some comments below.

> 
> thanks,
> jirka
> 
> 
> ---
> diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
> index 42fee8959df7..d374305a6851 100644
> --- a/arch/x86/include/asm/shstk.h
> +++ b/arch/x86/include/asm/shstk.h
> @@ -21,6 +21,8 @@ unsigned long shstk_alloc_thread_stack(struct task_struct *p, unsigned long clon
>  void shstk_free(struct task_struct *p);
>  int setup_signal_shadow_stack(struct ksignal *ksig);
>  int restore_signal_shadow_stack(void);
> +void uprobe_change_stack(unsigned long addr);
> +void uprobe_push_stack(unsigned long addr);

Maybe name them:
shstk_update_last_frame();
shstk_push_frame();


>  #else
>  static inline long shstk_prctl(struct task_struct *task, int option,
>                                unsigned long arg2) { return -EINVAL; }
> diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
> index 59e15dd8d0f8..804c446231d9 100644
> --- a/arch/x86/kernel/shstk.c
> +++ b/arch/x86/kernel/shstk.c
> @@ -577,3 +577,24 @@ long shstk_prctl(struct task_struct *task, int option, unsigned long arg2)
>                 return wrss_control(true);
>         return -EINVAL;
>  }
> +
> +void uprobe_change_stack(unsigned long addr)
> +{
> +       unsigned long ssp;

Probably want something like:

	if (!features_enabled(ARCH_SHSTK_SHSTK))
		return;

So this doesn't try the below if shadow stack is disabled.

> +
> +       ssp = get_user_shstk_addr();
> +       write_user_shstk_64((u64 __user *)ssp, (u64)addr);
> +}

Can we know that there was a valid return address just before this point on the
stack? Or could it be a sigframe or something?

> +
> +void uprobe_push_stack(unsigned long addr)
> +{
> +       unsigned long ssp;

	if (!features_enabled(ARCH_SHSTK_SHSTK))
		return;

> +
> +       ssp = get_user_shstk_addr();
> +       ssp -= SS_FRAME_SIZE;
> +       write_user_shstk_64((u64 __user *)ssp, (u64)addr);
> +
> +       fpregs_lock_and_load();
> +       wrmsrl(MSR_IA32_PL3_SSP, ssp);
> +       fpregs_unlock();
> +}
> diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> index 81e6ee95784d..259457838020 100644
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -416,6 +416,7 @@ SYSCALL_DEFINE0(uretprobe)
>         regs->r11 = regs->flags;
>         regs->cx  = regs->ip;
>  
> +       uprobe_push_stack(r11_cx_ax[2]);

I'm concerned this could be used to push arbitrary frames to the shadow stack.
Couldn't an attacker do a jump to the point that calls this syscall? Maybe this
is what peterz was raising.

>         return regs->ax;
>  
>  sigill:
> @@ -1191,8 +1192,10 @@ arch_uretprobe_hijack_return_addr(unsigned long trampoline_vaddr, struct pt_regs
>                 return orig_ret_vaddr;
>  
>         nleft = copy_to_user((void __user *)regs->sp, &trampoline_vaddr, rasize);
> -       if (likely(!nleft))
> +       if (likely(!nleft)) {
> +               uprobe_change_stack(trampoline_vaddr);
>                 return orig_ret_vaddr;
> +       }
>  
>         if (nleft != rasize) {
>                 pr_err("return address clobbered: pid=%d, %%sp=%#lx, %%ip=%#lx\n",
Deepak Gupta May 3, 2024, 11:01 p.m. UTC | #8
On Fri, May 03, 2024 at 07:38:18PM +0000, Edgecombe, Rick P wrote:
>+Some more shadow stack folks from other archs. We are discussing how uretprobes
>work with shadow stack.
>
>Context:
>https://lore.kernel.org/lkml/ZjU4ganRF1Cbiug6@krava/

Thanks Rick.

Yeah I didn't give enough attention to uprobes either.
Although now that I think about it, for RISC-V shadow stack it shouldn't be an issue.
On RISC-V, return addresses don't get pushed as part of the call instruction.
There is a distinct instruction, "shadow stack push of return address", in the prolog.
Similarly, in the epilog there is a distinct instruction, "shadow stack pop and check with
link register".

On RISC-V, uretprobe would install a uprobe on function start, and when it's hit
it'll replace pt_regs->ra = trampoline_handler. As the function resumes, the trampoline
addr will get pushed and popped. Although trampoline_handler would have to be enlightened
to eventually return to the original return site.
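
A rough userspace model of the RISC-V flow described above, for comparison
with x86. Everything here is a hypothetical sketch: the rv_toy type and the
helper names are invented for illustration, and the model assumes ra holds
the same value in the epilog as it did in the prolog.

```c
#include <stdbool.h>
#include <stdint.h>

#define RV_SHSTK_DEPTH 16

/* Hypothetical model: link register plus a shadow stack. */
struct rv_toy {
	uint64_t ra;			/* link register */
	uint64_t shstk[RV_SHSTK_DEPTH];
	int ssp;			/* grows down */
};

/* Prolog: explicit "shadow stack push of return address". */
static void rv_prolog_push(struct rv_toy *t)
{
	t->shstk[--t->ssp] = t->ra;
}

/* Epilog: explicit "pop and check with link register"; false models a fault. */
static bool rv_epilog_pop_check(struct rv_toy *t)
{
	return t->shstk[t->ssp++] == t->ra;
}

/* uprobe at function entry: swap ra before the prolog has run. */
static uint64_t rv_hijack_ra(struct rv_toy *t, uint64_t trampoline)
{
	uint64_t orig = t->ra;

	t->ra = trampoline;
	return orig;
}
```

Because the hijack happens before the prolog's push, the trampoline address
is what lands on the shadow stack, so the epilog check passes without any
extra shadow stack write in this model.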

>
>On Fri, 2024-05-03 at 21:18 +0200, Jiri Olsa wrote:
>>
>> hack below seems to fix it for the current uprobe setup,
>> we need similar fix for the uretprobe syscall trampoline setup
>
>It seems like a reasonable direction.
>
>Security-wise, applications cannot do this on themselves, or it is an otherwise
>privileged thing right?
>
>
Jiri Olsa May 6, 2024, 10:56 a.m. UTC | #9
On Fri, May 03, 2024 at 08:35:24PM +0000, Edgecombe, Rick P wrote:
> On Fri, 2024-05-03 at 22:17 +0200, Jiri Olsa wrote:
> > when uretprobe is created, kernel overwrites the return address on user
> > stack to point to user space trampoline, so the setup is in kernel hands
> 
> I mean for uprobes in general. I didn't have any specific ideas in mind, but
> in general when we give the kernel more abilities around shadow stack we have to
> think if attackers could use it to work around shadow stack protections.
> 
> > 
> > with the hack below on top of this patchset I'm no longer seeing shadow
> > stack app crash on uretprobe.. I'll try to polish it and send out next
> > week, any suggestions are welcome ;-)
> 
> Thanks. Some comments below.
> 
> > 
> > thanks,
> > jirka
> > 
> > 
> > ---
> > diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
> > index 42fee8959df7..d374305a6851 100644
> > --- a/arch/x86/include/asm/shstk.h
> > +++ b/arch/x86/include/asm/shstk.h
> > @@ -21,6 +21,8 @@ unsigned long shstk_alloc_thread_stack(struct task_struct *p, unsigned long clon
> >  void shstk_free(struct task_struct *p);
> >  int setup_signal_shadow_stack(struct ksignal *ksig);
> >  int restore_signal_shadow_stack(void);
> > +void uprobe_change_stack(unsigned long addr);
> > +void uprobe_push_stack(unsigned long addr);
> 
> Maybe name them:
> shstk_update_last_frame();
> shstk_push_frame();

ok

> 
> 
> >  #else
> >  static inline long shstk_prctl(struct task_struct *task, int option,
> >                                unsigned long arg2) { return -EINVAL; }
> > diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
> > index 59e15dd8d0f8..804c446231d9 100644
> > --- a/arch/x86/kernel/shstk.c
> > +++ b/arch/x86/kernel/shstk.c
> > @@ -577,3 +577,24 @@ long shstk_prctl(struct task_struct *task, int option, unsigned long arg2)
> >                 return wrss_control(true);
> >         return -EINVAL;
> >  }
> > +
> > +void uprobe_change_stack(unsigned long addr)
> > +{
> > +       unsigned long ssp;
> 
> Probably want something like:
> 
> 	if (!features_enabled(ARCH_SHSTK_SHSTK))
> 		return;

ok

> 
> So this doesn't try the below if shadow stack is disabled.
> 
> > +
> > +       ssp = get_user_shstk_addr();
> > +       write_user_shstk_64((u64 __user *)ssp, (u64)addr);
> > +}
> 
> Can we know that there was a valid return address just before this point on the
> stack? Or could it be a sigframe or something?

when uprobe hijacks the return address it assumes it's on the top of the stack,
so it's saved and replaced with address of the user space trampoline

> 
> > +
> > +void uprobe_push_stack(unsigned long addr)
> > +{
> > +       unsigned long ssp;
> 
> 	if (!features_enabled(ARCH_SHSTK_SHSTK))
> 		return;
> 
> > +
> > +       ssp = get_user_shstk_addr();
> > +       ssp -= SS_FRAME_SIZE;
> > +       write_user_shstk_64((u64 __user *)ssp, (u64)addr);
> > +
> > +       fpregs_lock_and_load();
> > +       wrmsrl(MSR_IA32_PL3_SSP, ssp);
> > +       fpregs_unlock();
> > +}
> > diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> > index 81e6ee95784d..259457838020 100644
> > --- a/arch/x86/kernel/uprobes.c
> > +++ b/arch/x86/kernel/uprobes.c
> > @@ -416,6 +416,7 @@ SYSCALL_DEFINE0(uretprobe)
> >         regs->r11 = regs->flags;
> >         regs->cx  = regs->ip;
> >  
> > +       uprobe_push_stack(r11_cx_ax[2]);
> 
> I'm concerned this could be used to push arbitrary frames to the shadow stack.
> Couldn't an attacker do a jump to the point that calls this syscall? Maybe this
> is what peterz was raising.

of course never say never, but here's my reasoning why I think it's ok

the page with the syscall trampoline is mapped in user space and can be
found in procfs maps file under '[uprobes]' name

the syscall can be called only from this trampoline, if it's called from
anywhere else the calling process receives SIGILL

now if you run the uretprobe syscall without any pending uretprobe for
the task it will receive SIGILL before it gets to the point of pushing
address on the shadow stack

and to configure the uretprobe you need to have CAP_PERFMON or CAP_SYS_ADMIN

if you'd actually managed to get the pending uretprobe instance, the shadow
stack entry is going to be used/pop-ed right away in the trampoline with
the ret instruction

and as I mentioned above it's ensured that the syscall is returning to the
trampoline and it can't be called from any other place

> 
> >         return regs->ax;
> >  
> >  sigill:
> > @@ -1191,8 +1192,10 @@ arch_uretprobe_hijack_return_addr(unsigned long trampoline_vaddr, struct pt_regs
> >                 return orig_ret_vaddr;
> >  
> >         nleft = copy_to_user((void __user *)regs->sp, &trampoline_vaddr, rasize);
> > -       if (likely(!nleft))
> > +       if (likely(!nleft)) {
> > +               uprobe_change_stack(trampoline_vaddr);
> >                 return orig_ret_vaddr;
> > +       }
> >  
> >         if (nleft != rasize) {
> >                 pr_err("return address clobbered: pid=%d, %%sp=%#lx, %%ip=%#lx\n",
> 

I'll try to add uprobe test under tools/testing/selftests/x86/test_shadow_stack.c
and send that and change below as part of new version

thanks for the comments,
jirka


---
diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
index 42fee8959df7..2e1ddcf98242 100644
--- a/arch/x86/include/asm/shstk.h
+++ b/arch/x86/include/asm/shstk.h
@@ -21,6 +21,8 @@ unsigned long shstk_alloc_thread_stack(struct task_struct *p, unsigned long clon
 void shstk_free(struct task_struct *p);
 int setup_signal_shadow_stack(struct ksignal *ksig);
 int restore_signal_shadow_stack(void);
+int shstk_update_last_frame(unsigned long val);
+int shstk_push_frame(unsigned long val);
 #else
 static inline long shstk_prctl(struct task_struct *task, int option,
 			       unsigned long arg2) { return -EINVAL; }
@@ -31,6 +33,8 @@ static inline unsigned long shstk_alloc_thread_stack(struct task_struct *p,
 static inline void shstk_free(struct task_struct *p) {}
 static inline int setup_signal_shadow_stack(struct ksignal *ksig) { return 0; }
 static inline int restore_signal_shadow_stack(void) { return 0; }
+static inline int shstk_update_last_frame(unsigned long val) { return 0; }
+static inline int shstk_push_frame(unsigned long val) { return 0; }
 #endif /* CONFIG_X86_USER_SHADOW_STACK */
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 59e15dd8d0f8..66434dfde52e 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -577,3 +577,32 @@ long shstk_prctl(struct task_struct *task, int option, unsigned long arg2)
 		return wrss_control(true);
 	return -EINVAL;
 }
+
+int shstk_update_last_frame(unsigned long val)
+{
+	unsigned long ssp;
+
+	if (!features_enabled(ARCH_SHSTK_SHSTK))
+		return 0;
+
+	ssp = get_user_shstk_addr();
+	return write_user_shstk_64((u64 __user *)ssp, (u64)val);
+}
+
+int shstk_push_frame(unsigned long val)
+{
+	unsigned long ssp;
+
+	if (!features_enabled(ARCH_SHSTK_SHSTK))
+		return 0;
+
+	ssp = get_user_shstk_addr();
+	ssp -= SS_FRAME_SIZE;
+	if (write_user_shstk_64((u64 __user *)ssp, (u64)val))
+		return -EFAULT;
+
+	fpregs_lock_and_load();
+	wrmsrl(MSR_IA32_PL3_SSP, ssp);
+	fpregs_unlock();
+	return 0;
+}
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 81e6ee95784d..ae6c3458a675 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -406,6 +406,11 @@ SYSCALL_DEFINE0(uretprobe)
 	 * trampoline's ret instruction
 	 */
 	r11_cx_ax[2] = regs->ip;
+
+	/* make the shadow stack follow that */
+	if (shstk_push_frame(regs->ip))
+		goto sigill;
+
 	regs->ip = ip;
 
 	err = copy_to_user((void __user *)regs->sp, r11_cx_ax, sizeof(r11_cx_ax));
@@ -1191,8 +1196,13 @@ arch_uretprobe_hijack_return_addr(unsigned long trampoline_vaddr, struct pt_regs
 		return orig_ret_vaddr;
 
 	nleft = copy_to_user((void __user *)regs->sp, &trampoline_vaddr, rasize);
-	if (likely(!nleft))
+	if (likely(!nleft)) {
+		if (shstk_update_last_frame(trampoline_vaddr)) {
+			force_sig(SIGSEGV);
+			return -1;
+		}
 		return orig_ret_vaddr;
+	}
 
 	if (nleft != rasize) {
 		pr_err("return address clobbered: pid=%d, %%sp=%#lx, %%ip=%#lx\n",
Patch

diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 6c07f6daaa22..81e6ee95784d 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -12,6 +12,7 @@ 
 #include <linux/ptrace.h>
 #include <linux/uprobes.h>
 #include <linux/uaccess.h>
+#include <linux/syscalls.h>
 
 #include <linux/kdebug.h>
 #include <asm/processor.h>
@@ -308,6 +309,120 @@  static int uprobe_init_insn(struct arch_uprobe *auprobe, struct insn *insn, bool
 }
 
 #ifdef CONFIG_X86_64
+
+asm (
+	".pushsection .rodata\n"
+	".global uretprobe_syscall_entry\n"
+	"uretprobe_syscall_entry:\n"
+	"pushq %rax\n"
+	"pushq %rcx\n"
+	"pushq %r11\n"
+	"movq $" __stringify(__NR_uretprobe) ", %rax\n"
+	"syscall\n"
+	".global uretprobe_syscall_check\n"
+	"uretprobe_syscall_check:\n"
+	"popq %r11\n"
+	"popq %rcx\n"
+
+	/*
+	 * The uretprobe syscall replaces the stored %rax value with the
+	 * final return address, so we don't restore %rax here and just ret.
+	 */
+	"retq\n"
+	".global uretprobe_syscall_end\n"
+	"uretprobe_syscall_end:\n"
+	".popsection\n"
+);
+
+extern u8 uretprobe_syscall_entry[];
+extern u8 uretprobe_syscall_check[];
+extern u8 uretprobe_syscall_end[];
+
+void *arch_uprobe_trampoline(unsigned long *psize)
+{
+	static uprobe_opcode_t insn = UPROBE_SWBP_INSN;
+	struct pt_regs *regs = task_pt_regs(current);
+
+	/*
+	 * At the moment the uretprobe syscall trampoline is supported
+	 * only for native 64-bit processes; compat processes still use
+	 * the standard breakpoint.
+	 */
+	if (user_64bit_mode(regs)) {
+		*psize = uretprobe_syscall_end - uretprobe_syscall_entry;
+		return uretprobe_syscall_entry;
+	}
+
+	*psize = UPROBE_SWBP_INSN_SIZE;
+	return &insn;
+}
+
+static unsigned long trampoline_check_ip(void)
+{
+	unsigned long tramp = uprobe_get_trampoline_vaddr();
+
+	return tramp + (uretprobe_syscall_check - uretprobe_syscall_entry);
+}
+
+SYSCALL_DEFINE0(uretprobe)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+	unsigned long err, ip, sp, r11_cx_ax[3];
+
+	if (regs->ip != trampoline_check_ip())
+		goto sigill;
+
+	err = copy_from_user(r11_cx_ax, (void __user *)regs->sp, sizeof(r11_cx_ax));
+	if (err)
+		goto sigill;
+
+	/* expose the "right" values of r11/cx/ax/sp to uprobe_consumer/s */
+	regs->r11 = r11_cx_ax[0];
+	regs->cx  = r11_cx_ax[1];
+	regs->ax  = r11_cx_ax[2];
+	regs->sp += sizeof(r11_cx_ax);
+	regs->orig_ax = -1;
+
+	ip = regs->ip;
+	sp = regs->sp;
+
+	uprobe_handle_trampoline(regs);
+
+	/*
+	 * A uprobe_consumer has changed sp; there is nothing more we can
+	 * do here, just return via iret.
+	 */
+	if (regs->sp != sp)
+		return regs->ax;
+	regs->sp -= sizeof(r11_cx_ax);
+
+	/* in case a uprobe_consumer has changed r11/cx */
+	r11_cx_ax[0] = regs->r11;
+	r11_cx_ax[1] = regs->cx;
+
+	/*
+	 * ax register is passed through as return value, so we can use
+	 * its space on stack for ip value and jump to it through the
+	 * trampoline's ret instruction
+	 */
+	r11_cx_ax[2] = regs->ip;
+	regs->ip = ip;
+
+	err = copy_to_user((void __user *)regs->sp, r11_cx_ax, sizeof(r11_cx_ax));
+	if (err)
+		goto sigill;
+
+	/* ensure sysret, see do_syscall_64() */
+	regs->r11 = regs->flags;
+	regs->cx  = regs->ip;
+
+	return regs->ax;
+
+sigill:
+	force_sig(SIGILL);
+	return -1;
+}
+
 /*
  * If arch_uprobe->insn doesn't use rip-relative addressing, return
  * immediately.  Otherwise, rewrite the instruction so that it accesses
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index f46e0ca0169c..b503fafb7fb3 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -138,6 +138,9 @@  extern bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check c
 extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs);
 extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
 					 void *src, unsigned long len);
+extern void uprobe_handle_trampoline(struct pt_regs *regs);
+extern void *arch_uprobe_trampoline(unsigned long *psize);
+extern unsigned long uprobe_get_trampoline_vaddr(void);
 #else /* !CONFIG_UPROBES */
 struct uprobes_state {
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index e4834d23e1d1..c550449d66be 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1474,11 +1474,20 @@  static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
 	return ret;
 }
 
+void * __weak arch_uprobe_trampoline(unsigned long *psize)
+{
+	static uprobe_opcode_t insn = UPROBE_SWBP_INSN;
+
+	*psize = UPROBE_SWBP_INSN_SIZE;
+	return &insn;
+}
+
 static struct xol_area *__create_xol_area(unsigned long vaddr)
 {
 	struct mm_struct *mm = current->mm;
-	uprobe_opcode_t insn = UPROBE_SWBP_INSN;
+	unsigned long insns_size;
 	struct xol_area *area;
+	void *insns;
 
 	area = kmalloc(sizeof(*area), GFP_KERNEL);
 	if (unlikely(!area))
@@ -1502,7 +1511,8 @@  static struct xol_area *__create_xol_area(unsigned long vaddr)
 	/* Reserve the 1st slot for get_trampoline_vaddr() */
 	set_bit(0, area->bitmap);
 	atomic_set(&area->slot_count, 1);
-	arch_uprobe_copy_ixol(area->pages[0], 0, &insn, UPROBE_SWBP_INSN_SIZE);
+	insns = arch_uprobe_trampoline(&insns_size);
+	arch_uprobe_copy_ixol(area->pages[0], 0, insns, insns_size);
 
 	if (!xol_add_vma(mm, area))
 		return area;
@@ -1827,7 +1837,7 @@  void uprobe_copy_process(struct task_struct *t, unsigned long flags)
  *
  * Returns -1 in case the xol_area is not allocated.
  */
-static unsigned long get_trampoline_vaddr(void)
+unsigned long uprobe_get_trampoline_vaddr(void)
 {
 	struct xol_area *area;
 	unsigned long trampoline_vaddr = -1;
@@ -1878,7 +1888,7 @@  static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
 	if (!ri)
 		return;
 
-	trampoline_vaddr = get_trampoline_vaddr();
+	trampoline_vaddr = uprobe_get_trampoline_vaddr();
 	orig_ret_vaddr = arch_uretprobe_hijack_return_addr(trampoline_vaddr, regs);
 	if (orig_ret_vaddr == -1)
 		goto fail;
@@ -2123,7 +2133,7 @@  static struct return_instance *find_next_ret_chain(struct return_instance *ri)
 	return ri;
 }
 
-static void handle_trampoline(struct pt_regs *regs)
+void uprobe_handle_trampoline(struct pt_regs *regs)
 {
 	struct uprobe_task *utask;
 	struct return_instance *ri, *next;
@@ -2187,8 +2197,8 @@  static void handle_swbp(struct pt_regs *regs)
 	int is_swbp;
 
 	bp_vaddr = uprobe_get_swbp_addr(regs);
-	if (bp_vaddr == get_trampoline_vaddr())
-		return handle_trampoline(regs);
+	if (bp_vaddr == uprobe_get_trampoline_vaddr())
+		return uprobe_handle_trampoline(regs);
 
 	uprobe = find_active_uprobe(bp_vaddr, &is_swbp);
 	if (!uprobe) {
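
As a rough illustration of the stack protocol between the trampoline and
SYSCALL_DEFINE0(uretprobe) above, the following user-space C sketch
replays the push/syscall/pop/ret sequence on an array. It is a model
under stated assumptions, not kernel code: struct sim_regs, every sim_*
name, the SIM_* constants, and the clobber values are hypothetical, sp is
an array index rather than an address, and the consumer is reduced to
"restore the original return address".

```c
#include <assert.h>

#define SIM_STACK_WORDS	16
#define SIM_CHECK_IP	0x7008UL	/* placeholder for trampoline_check_ip() */
#define SIM_CLOBBER	0xdeadUL	/* placeholder for values the syscall insn leaves */

/* Hypothetical stand-in for the few pt_regs fields the syscall touches. */
struct sim_regs {
	unsigned long ax, cx, r11, sp, ip;
};

static unsigned long sim_stack[SIM_STACK_WORDS];

/* The stack grows down; sp is an index into sim_stack. */
static void sim_push(struct sim_regs *r, unsigned long v) { sim_stack[--r->sp] = v; }
static unsigned long sim_pop(struct sim_regs *r) { return sim_stack[r->sp++]; }

/* Models the syscall body: restore regs, run the handler, rewrite the frame. */
static void sim_uretprobe_syscall(struct sim_regs *r, unsigned long orig_ret)
{
	unsigned long r11_cx_ax[3], ip = r->ip;
	int i;

	for (i = 0; i < 3; i++)			/* copy_from_user() */
		r11_cx_ax[i] = sim_stack[r->sp + i];
	r->r11 = r11_cx_ax[0];
	r->cx  = r11_cx_ax[1];
	r->ax  = r11_cx_ax[2];
	r->sp += 3;

	r->ip = orig_ret;			/* uprobe_handle_trampoline() */

	r->sp -= 3;
	r11_cx_ax[0] = r->r11;
	r11_cx_ax[1] = r->cx;
	r11_cx_ax[2] = r->ip;			/* ax slot now carries the return address */
	r->ip = ip;
	for (i = 0; i < 3; i++)			/* copy_to_user() */
		sim_stack[r->sp + i] = r11_cx_ax[i];

	r->r11 = SIM_CLOBBER;			/* sysret path: r11 = flags, cx = ip */
	r->cx  = r->ip;
}

/* Replays the trampoline around the syscall and returns the final registers. */
static struct sim_regs sim_trampoline(unsigned long ax, unsigned long cx,
				      unsigned long r11, unsigned long orig_ret)
{
	struct sim_regs r = { .ax = ax, .cx = cx, .r11 = r11,
			      .sp = SIM_STACK_WORDS, .ip = 0 };

	sim_push(&r, r.ax);			/* pushq %rax */
	sim_push(&r, r.cx);			/* pushq %rcx */
	sim_push(&r, r.r11);			/* pushq %r11 */
	r.ax = SIM_CLOBBER;			/* movq $__NR_uretprobe, %rax */
	r.ip = SIM_CHECK_IP;			/* ip just after the syscall insn */
	sim_uretprobe_syscall(&r, orig_ret);	/* rax comes back as regs->ax */
	r.r11 = sim_pop(&r);			/* popq %r11 */
	r.cx  = sim_pop(&r);			/* popq %rcx */
	r.ip  = sim_pop(&r);			/* retq consumes the old rax slot */
	return r;
}
```

Calling sim_trampoline(1, 2, 3, 0x401000) in this model ends with ax, cx
and r11 holding their original values, ip at the original return address,
and sp back where it started, which is the property the real trampoline
relies on when it skips restoring %rax.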