[RFC,00/11] uprobes: Add support to optimize usdt probes on x86_64

Message ID	20241105133405.2703607-1-jolsa@kernel.org (mailing list archive)
Headers	show Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 52BE71D63F4; Tue, 5 Nov 2024 13:34:11 +0000 (UTC) From: Jiri Olsa <jolsa@kernel.org> To: Oleg Nesterov <oleg@redhat.com>, Peter Zijlstra <peterz@infradead.org>, Andrii Nakryiko <andrii@kernel.org> Cc: bpf@vger.kernel.org, Song Liu <songliubraving@fb.com>, Yonghong Song <yhs@fb.com>, John Fastabend <john.fastabend@gmail.com>, Hao Luo <haoluo@google.com>, Steven Rostedt <rostedt@goodmis.org>, Masami Hiramatsu <mhiramat@kernel.org>, Alan Maguire <alan.maguire@oracle.com>, linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org Subject: [RFC 00/11] uprobes: Add support to optimize usdt probes on x86_64 Date: Tue, 5 Nov 2024 14:33:54 +0100 Message-ID: <20241105133405.2703607-1-jolsa@kernel.org> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	uprobes: Add support to optimize usdt probes on x86_64 \| expand [RFC,00/11] uprobes: Add support to optimize usdt probes on x86_64 [RFC,perf/core,01/11] uprobes: Rename arch_uretprobe_trampoline function [RFC,perf/core,02/11] uprobes: Make copy_from_page global [RFC,perf/core,03/11] uprobes: Add len argument to uprobe_write_opcode [RFC,perf/core,04/11] uprobes: Add data argument to uprobe_write_opcode function [RFC,perf/core,05/11] uprobes: Add mapping for optimized uprobe trampolines [RFC,perf/core,06/11] uprobes: Add uprobe syscall to speed up uprobe [RFC,perf/core,07/11] uprobes/x86: Add support to optimize uprobes [RFC,bpf-next,08/11] selftests/bpf: Use 5-byte nop for x86 usdt probes [RFC,bpf-next,09/11] selftests/bpf: Add usdt trigger bench [RFC,bpf-next,10/11] selftests/bpf: Add uprobe/usdt optimized test [RFC,bpf-next,11/11] selftests/bpf: Add hit/attach/detach race optimized uprobe test

Jiri Olsa Nov. 5, 2024, 1:33 p.m. UTC

hi,
this patchset adds support to optimize usdt probes on top of 5-byte
nop instruction.

The generic approach (optimize all uprobes) is hard due to emulating
possible multiple original instructions and its related issues. The
usdt case, which stores 5-byte nop seems much easier, so starting
with that.

The basic idea is to replace breakpoint exception with syscall which
is faster on x86_64. For more details please see changelog of patch 7.

The first benchmark shows about 68% speed up (see below). The benchmark
triggers usdt probe in a loop and counts how many of those happened
per second.

It's still rfc state with some loose ends, but I'd be interested in
any feedback about the direction of this.

It's based on tip/perf/core with bpf-next/master merged on top of
that together with uprobe session patchset.

thanks,
jirka


current:
        # ./bench -w2 -d5 -a  trig-usdt
        Setting up benchmark 'trig-usdt'...
        Benchmark 'trig-usdt' started.
        Iter   0 ( 46.982us): hits    4.893M/s (  4.893M/prod), drops    0.000M/s, total operations    4.893M/s
        Iter   1 ( -5.967us): hits    4.892M/s (  4.892M/prod), drops    0.000M/s, total operations    4.892M/s
        Iter   2 ( -2.771us): hits    4.899M/s (  4.899M/prod), drops    0.000M/s, total operations    4.899M/s
        Iter   3 (  1.286us): hits    4.889M/s (  4.889M/prod), drops    0.000M/s, total operations    4.889M/s
        Iter   4 ( -2.871us): hits    4.881M/s (  4.881M/prod), drops    0.000M/s, total operations    4.881M/s
        Iter   5 (  1.005us): hits    4.886M/s (  4.886M/prod), drops    0.000M/s, total operations    4.886M/s
        Iter   6 ( 11.626us): hits    4.906M/s (  4.906M/prod), drops    0.000M/s, total operations    4.906M/s
        Iter   7 ( -6.638us): hits    4.896M/s (  4.896M/prod), drops    0.000M/s, total operations    4.896M/s
        Summary: hits    4.893 +- 0.009M/s (  4.893M/prod), drops    0.000 +- 0.000M/s, total operations    4.893 +- 0.009M/s

optimized:
        # ./bench -w2 -d5 -a  trig-usdt
        Setting up benchmark 'trig-usdt'...
        Benchmark 'trig-usdt' started.
        Iter   0 ( 46.073us): hits    8.258M/s (  8.258M/prod), drops    0.000M/s, total operations    8.258M/s
        Iter   1 ( -5.752us): hits    8.264M/s (  8.264M/prod), drops    0.000M/s, total operations    8.264M/s
        Iter   2 ( -1.333us): hits    8.263M/s (  8.263M/prod), drops    0.000M/s, total operations    8.263M/s
        Iter   3 ( -2.996us): hits    8.265M/s (  8.265M/prod), drops    0.000M/s, total operations    8.265M/s
        Iter   4 ( -0.620us): hits    8.264M/s (  8.264M/prod), drops    0.000M/s, total operations    8.264M/s
        Iter   5 ( -2.624us): hits    8.236M/s (  8.236M/prod), drops    0.000M/s, total operations    8.236M/s
        Iter   6 ( -0.840us): hits    8.232M/s (  8.232M/prod), drops    0.000M/s, total operations    8.232M/s
        Iter   7 ( -1.783us): hits    8.235M/s (  8.235M/prod), drops    0.000M/s, total operations    8.235M/s
        Summary: hits    8.249 +- 0.016M/s (  8.249M/prod), drops    0.000 +- 0.000M/s, total operations    8.249 +- 0.016M/s

---
Jiri Olsa (11):
      uprobes: Rename arch_uretprobe_trampoline function
      uprobes: Make copy_from_page global
      uprobes: Add len argument to uprobe_write_opcode
      uprobes: Add data argument to uprobe_write_opcode function
      uprobes: Add mapping for optimized uprobe trampolines
      uprobes: Add uprobe syscall to speed up uprobe
      uprobes/x86: Add support to optimize uprobes
      selftests/bpf: Use 5-byte nop for x86 usdt probes
      selftests/bpf: Add usdt trigger bench
      selftests/bpf: Add uprobe/usdt optimized test
      selftests/bpf: Add hit/attach/detach race optimized uprobe test

 arch/x86/entry/syscalls/syscall_64.tbl                    |   1 +
 arch/x86/include/asm/uprobes.h                            |   7 +++
 arch/x86/kernel/uprobes.c                                 | 180 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/syscalls.h                                  |   2 +
 include/linux/uprobes.h                                   |  25 +++++++++-
 kernel/events/uprobes.c                                   | 222 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------
 kernel/fork.c                                             |   2 +
 kernel/sys_ni.c                                           |   1 +
 tools/testing/selftests/bpf/bench.c                       |   2 +
 tools/testing/selftests/bpf/benchs/bench_trigger.c        |  45 +++++++++++++++++
 tools/testing/selftests/bpf/prog_tests/uprobe_optimized.c | 252 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 tools/testing/selftests/bpf/progs/trigger_bench.c         |  10 +++-
 tools/testing/selftests/bpf/progs/uprobe_optimized.c      |  29 +++++++++++
 tools/testing/selftests/bpf/sdt.h                         |   9 +++-
 14 files changed, 768 insertions(+), 19 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/uprobe_optimized.c
 create mode 100644 tools/testing/selftests/bpf/progs/uprobe_optimized.c

Peter Zijlstra Nov. 17, 2024, 11:49 a.m. UTC | #1

On Tue, Nov 05, 2024 at 02:33:54PM +0100, Jiri Olsa wrote:
> hi,
> this patchset adds support to optimize usdt probes on top of 5-byte
> nop instruction.
> 
> The generic approach (optimize all uprobes) is hard due to emulating
> possible multiple original instructions and its related issues. The
> usdt case, which stores 5-byte nop seems much easier, so starting
> with that.
> 
> The basic idea is to replace breakpoint exception with syscall which
> is faster on x86_64. For more details please see changelog of patch 7.

So this is really about the fact that syscalls are faster than traps on
x86_64? Is there something similar on ARM64, or are they roughly the
same speed there?

That is, I don't think this scheme will work for the various RISC
architectures, given their very limited immediate range turns a typical
call into a multi-instruction trainwreck real quick.

Now, that isn't a problem if their exceptions and syscalls are of equal
speed.

Masami Hiramatsu (Google) Nov. 18, 2024, 8:04 a.m. UTC | #2

Hi Jiri,

On Tue,  5 Nov 2024 14:33:54 +0100
Jiri Olsa <jolsa@kernel.org> wrote:

> hi,
> this patchset adds support to optimize usdt probes on top of 5-byte
> nop instruction.
> 
> The generic approach (optimize all uprobes) is hard due to emulating
> possible multiple original instructions and its related issues. The
> usdt case, which stores 5-byte nop seems much easier, so starting
> with that.
> 
> The basic idea is to replace breakpoint exception with syscall which
> is faster on x86_64. For more details please see changelog of patch 7.

This looks like a great idea!

> 
> The first benchmark shows about 68% speed up (see below). The benchmark
> triggers usdt probe in a loop and counts how many of those happened
> per second.

Hmm, interesting result. I'd like to compare it with user-space event,
which is also use "write" syscall to write the pre-defined events
in the ftrace trace buffer.

But if uprobe trampoline can run in the comparable speed, user may
want to use uprobes because it is based on widely used usdt and
avoid accessing tracefs from application. (The user event user
application has to setup their events via tracefs interface)

> 
> It's still rfc state with some loose ends, but I'd be interested in
> any feedback about the direction of this.

So does this change the usdt macro? Or it just reuse the usdt so that
user applications does not need to be recompiled?

Thank you,

> 
> It's based on tip/perf/core with bpf-next/master merged on top of
> that together with uprobe session patchset.
> 
> thanks,
> jirka
> 
> 
> current:
>         # ./bench -w2 -d5 -a  trig-usdt
>         Setting up benchmark 'trig-usdt'...
>         Benchmark 'trig-usdt' started.
>         Iter   0 ( 46.982us): hits    4.893M/s (  4.893M/prod), drops    0.000M/s, total operations    4.893M/s
>         Iter   1 ( -5.967us): hits    4.892M/s (  4.892M/prod), drops    0.000M/s, total operations    4.892M/s
>         Iter   2 ( -2.771us): hits    4.899M/s (  4.899M/prod), drops    0.000M/s, total operations    4.899M/s
>         Iter   3 (  1.286us): hits    4.889M/s (  4.889M/prod), drops    0.000M/s, total operations    4.889M/s
>         Iter   4 ( -2.871us): hits    4.881M/s (  4.881M/prod), drops    0.000M/s, total operations    4.881M/s
>         Iter   5 (  1.005us): hits    4.886M/s (  4.886M/prod), drops    0.000M/s, total operations    4.886M/s
>         Iter   6 ( 11.626us): hits    4.906M/s (  4.906M/prod), drops    0.000M/s, total operations    4.906M/s
>         Iter   7 ( -6.638us): hits    4.896M/s (  4.896M/prod), drops    0.000M/s, total operations    4.896M/s
>         Summary: hits    4.893 +- 0.009M/s (  4.893M/prod), drops    0.000 +- 0.000M/s, total operations    4.893 +- 0.009M/s
> 
> optimized:
>         # ./bench -w2 -d5 -a  trig-usdt
>         Setting up benchmark 'trig-usdt'...
>         Benchmark 'trig-usdt' started.
>         Iter   0 ( 46.073us): hits    8.258M/s (  8.258M/prod), drops    0.000M/s, total operations    8.258M/s
>         Iter   1 ( -5.752us): hits    8.264M/s (  8.264M/prod), drops    0.000M/s, total operations    8.264M/s
>         Iter   2 ( -1.333us): hits    8.263M/s (  8.263M/prod), drops    0.000M/s, total operations    8.263M/s
>         Iter   3 ( -2.996us): hits    8.265M/s (  8.265M/prod), drops    0.000M/s, total operations    8.265M/s
>         Iter   4 ( -0.620us): hits    8.264M/s (  8.264M/prod), drops    0.000M/s, total operations    8.264M/s
>         Iter   5 ( -2.624us): hits    8.236M/s (  8.236M/prod), drops    0.000M/s, total operations    8.236M/s
>         Iter   6 ( -0.840us): hits    8.232M/s (  8.232M/prod), drops    0.000M/s, total operations    8.232M/s
>         Iter   7 ( -1.783us): hits    8.235M/s (  8.235M/prod), drops    0.000M/s, total operations    8.235M/s
>         Summary: hits    8.249 +- 0.016M/s (  8.249M/prod), drops    0.000 +- 0.000M/s, total operations    8.249 +- 0.016M/s
> 
> ---
> Jiri Olsa (11):
>       uprobes: Rename arch_uretprobe_trampoline function
>       uprobes: Make copy_from_page global
>       uprobes: Add len argument to uprobe_write_opcode
>       uprobes: Add data argument to uprobe_write_opcode function
>       uprobes: Add mapping for optimized uprobe trampolines
>       uprobes: Add uprobe syscall to speed up uprobe
>       uprobes/x86: Add support to optimize uprobes
>       selftests/bpf: Use 5-byte nop for x86 usdt probes
>       selftests/bpf: Add usdt trigger bench
>       selftests/bpf: Add uprobe/usdt optimized test
>       selftests/bpf: Add hit/attach/detach race optimized uprobe test
> 
>  arch/x86/entry/syscalls/syscall_64.tbl                    |   1 +
>  arch/x86/include/asm/uprobes.h                            |   7 +++
>  arch/x86/kernel/uprobes.c                                 | 180 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  include/linux/syscalls.h                                  |   2 +
>  include/linux/uprobes.h                                   |  25 +++++++++-
>  kernel/events/uprobes.c                                   | 222 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------
>  kernel/fork.c                                             |   2 +
>  kernel/sys_ni.c                                           |   1 +
>  tools/testing/selftests/bpf/bench.c                       |   2 +
>  tools/testing/selftests/bpf/benchs/bench_trigger.c        |  45 +++++++++++++++++
>  tools/testing/selftests/bpf/prog_tests/uprobe_optimized.c | 252 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  tools/testing/selftests/bpf/progs/trigger_bench.c         |  10 +++-
>  tools/testing/selftests/bpf/progs/uprobe_optimized.c      |  29 +++++++++++
>  tools/testing/selftests/bpf/sdt.h                         |   9 +++-
>  14 files changed, 768 insertions(+), 19 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/uprobe_optimized.c
>  create mode 100644 tools/testing/selftests/bpf/progs/uprobe_optimized.c

Jiri Olsa Nov. 18, 2024, 9:29 a.m. UTC | #3

On Sun, Nov 17, 2024 at 12:49:46PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 05, 2024 at 02:33:54PM +0100, Jiri Olsa wrote:
> > hi,
> > this patchset adds support to optimize usdt probes on top of 5-byte
> > nop instruction.
> > 
> > The generic approach (optimize all uprobes) is hard due to emulating
> > possible multiple original instructions and its related issues. The
> > usdt case, which stores 5-byte nop seems much easier, so starting
> > with that.
> > 
> > The basic idea is to replace breakpoint exception with syscall which
> > is faster on x86_64. For more details please see changelog of patch 7.
> 
> So this is really about the fact that syscalls are faster than traps on
> x86_64? Is there something similar on ARM64, or are they roughly the
> same speed there?

yes, I recall somebody was porting uretprobe syscall to arm, but there was
no speed up IIRC, so looks like it's not the case on arm

I can't find the post atm, I'll keep digging

jirka

> 
> That is, I don't think this scheme will work for the various RISC
> architectures, given their very limited immediate range turns a typical
> call into a multi-instruction trainwreck real quick.
> 
> Now, that isn't a problem if their exceptions and syscalls are of equal
> speed.

Jiri Olsa Nov. 18, 2024, 9:52 a.m. UTC | #4

On Mon, Nov 18, 2024 at 05:04:58PM +0900, Masami Hiramatsu wrote:
> Hi Jiri,
> 
> On Tue,  5 Nov 2024 14:33:54 +0100
> Jiri Olsa <jolsa@kernel.org> wrote:
> 
> > hi,
> > this patchset adds support to optimize usdt probes on top of 5-byte
> > nop instruction.
> > 
> > The generic approach (optimize all uprobes) is hard due to emulating
> > possible multiple original instructions and its related issues. The
> > usdt case, which stores 5-byte nop seems much easier, so starting
> > with that.
> > 
> > The basic idea is to replace breakpoint exception with syscall which
> > is faster on x86_64. For more details please see changelog of patch 7.
> 
> This looks like a great idea!
> 
> > 
> > The first benchmark shows about 68% speed up (see below). The benchmark
> > triggers usdt probe in a loop and counts how many of those happened
> > per second.
> 
> Hmm, interesting result. I'd like to compare it with user-space event,
> which is also use "write" syscall to write the pre-defined events
> in the ftrace trace buffer.
> 
> But if uprobe trampoline can run in the comparable speed, user may
> want to use uprobes because it is based on widely used usdt and
> avoid accessing tracefs from application. (The user event user
> application has to setup their events via tracefs interface)

I see, will check more details on user events

> 
> > 
> > It's still rfc state with some loose ends, but I'd be interested in
> > any feedback about the direction of this.
> 
> So does this change the usdt macro? Or it just reuse the usdt so that
> user applications does not need to be recompiled?

yes, ideally apps keeps using the current usdt macro (or switch to [1])
which would unwind to nop5 (or whatever we will end up using)

there's an issue wrt to old kernels running apps compiled with the new usdt macro,
but it's solvable as Andrii pointed out in the other reply [2]

thanks,
jirka

[1] https://github.com/libbpf/usdt
[2] https://lore.kernel.org/bpf/20241118170458.c825bf255c2fb93f2e6a3519@kernel.org/T/#m20da8469e210880cae07f01783f4a51817ffbe4d

> 
> Thank you,
> 
> > 
> > It's based on tip/perf/core with bpf-next/master merged on top of
> > that together with uprobe session patchset.
> > 
> > thanks,
> > jirka
> > 
> > 
> > current:
> >         # ./bench -w2 -d5 -a  trig-usdt
> >         Setting up benchmark 'trig-usdt'...
> >         Benchmark 'trig-usdt' started.
> >         Iter   0 ( 46.982us): hits    4.893M/s (  4.893M/prod), drops    0.000M/s, total operations    4.893M/s
> >         Iter   1 ( -5.967us): hits    4.892M/s (  4.892M/prod), drops    0.000M/s, total operations    4.892M/s
> >         Iter   2 ( -2.771us): hits    4.899M/s (  4.899M/prod), drops    0.000M/s, total operations    4.899M/s
> >         Iter   3 (  1.286us): hits    4.889M/s (  4.889M/prod), drops    0.000M/s, total operations    4.889M/s
> >         Iter   4 ( -2.871us): hits    4.881M/s (  4.881M/prod), drops    0.000M/s, total operations    4.881M/s
> >         Iter   5 (  1.005us): hits    4.886M/s (  4.886M/prod), drops    0.000M/s, total operations    4.886M/s
> >         Iter   6 ( 11.626us): hits    4.906M/s (  4.906M/prod), drops    0.000M/s, total operations    4.906M/s
> >         Iter   7 ( -6.638us): hits    4.896M/s (  4.896M/prod), drops    0.000M/s, total operations    4.896M/s
> >         Summary: hits    4.893 +- 0.009M/s (  4.893M/prod), drops    0.000 +- 0.000M/s, total operations    4.893 +- 0.009M/s
> > 
> > optimized:
> >         # ./bench -w2 -d5 -a  trig-usdt
> >         Setting up benchmark 'trig-usdt'...
> >         Benchmark 'trig-usdt' started.
> >         Iter   0 ( 46.073us): hits    8.258M/s (  8.258M/prod), drops    0.000M/s, total operations    8.258M/s
> >         Iter   1 ( -5.752us): hits    8.264M/s (  8.264M/prod), drops    0.000M/s, total operations    8.264M/s
> >         Iter   2 ( -1.333us): hits    8.263M/s (  8.263M/prod), drops    0.000M/s, total operations    8.263M/s
> >         Iter   3 ( -2.996us): hits    8.265M/s (  8.265M/prod), drops    0.000M/s, total operations    8.265M/s
> >         Iter   4 ( -0.620us): hits    8.264M/s (  8.264M/prod), drops    0.000M/s, total operations    8.264M/s
> >         Iter   5 ( -2.624us): hits    8.236M/s (  8.236M/prod), drops    0.000M/s, total operations    8.236M/s
> >         Iter   6 ( -0.840us): hits    8.232M/s (  8.232M/prod), drops    0.000M/s, total operations    8.232M/s
> >         Iter   7 ( -1.783us): hits    8.235M/s (  8.235M/prod), drops    0.000M/s, total operations    8.235M/s
> >         Summary: hits    8.249 +- 0.016M/s (  8.249M/prod), drops    0.000 +- 0.000M/s, total operations    8.249 +- 0.016M/s
> > 
> > ---
> > Jiri Olsa (11):
> >       uprobes: Rename arch_uretprobe_trampoline function
> >       uprobes: Make copy_from_page global
> >       uprobes: Add len argument to uprobe_write_opcode
> >       uprobes: Add data argument to uprobe_write_opcode function
> >       uprobes: Add mapping for optimized uprobe trampolines
> >       uprobes: Add uprobe syscall to speed up uprobe
> >       uprobes/x86: Add support to optimize uprobes
> >       selftests/bpf: Use 5-byte nop for x86 usdt probes
> >       selftests/bpf: Add usdt trigger bench
> >       selftests/bpf: Add uprobe/usdt optimized test
> >       selftests/bpf: Add hit/attach/detach race optimized uprobe test
> > 
> >  arch/x86/entry/syscalls/syscall_64.tbl                    |   1 +
> >  arch/x86/include/asm/uprobes.h                            |   7 +++
> >  arch/x86/kernel/uprobes.c                                 | 180 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
> >  include/linux/syscalls.h                                  |   2 +
> >  include/linux/uprobes.h                                   |  25 +++++++++-
> >  kernel/events/uprobes.c                                   | 222 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------
> >  kernel/fork.c                                             |   2 +
> >  kernel/sys_ni.c                                           |   1 +
> >  tools/testing/selftests/bpf/bench.c                       |   2 +
> >  tools/testing/selftests/bpf/benchs/bench_trigger.c        |  45 +++++++++++++++++
> >  tools/testing/selftests/bpf/prog_tests/uprobe_optimized.c | 252 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  tools/testing/selftests/bpf/progs/trigger_bench.c         |  10 +++-
> >  tools/testing/selftests/bpf/progs/uprobe_optimized.c      |  29 +++++++++++
> >  tools/testing/selftests/bpf/sdt.h                         |   9 +++-
> >  14 files changed, 768 insertions(+), 19 deletions(-)
> >  create mode 100644 tools/testing/selftests/bpf/prog_tests/uprobe_optimized.c
> >  create mode 100644 tools/testing/selftests/bpf/progs/uprobe_optimized.c
> 
> 
> -- 
> Masami Hiramatsu (Google) <mhiramat@kernel.org>

Mark Rutland Nov. 18, 2024, 10:06 a.m. UTC | #5

On Sun, Nov 17, 2024 at 12:49:46PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 05, 2024 at 02:33:54PM +0100, Jiri Olsa wrote:
> > hi,
> > this patchset adds support to optimize usdt probes on top of 5-byte
> > nop instruction.
> > 
> > The generic approach (optimize all uprobes) is hard due to emulating
> > possible multiple original instructions and its related issues. The
> > usdt case, which stores 5-byte nop seems much easier, so starting
> > with that.
> > 
> > The basic idea is to replace breakpoint exception with syscall which
> > is faster on x86_64. For more details please see changelog of patch 7.
> 
> So this is really about the fact that syscalls are faster than traps on
> x86_64? Is there something similar on ARM64, or are they roughly the
> same speed there?

From the hardware side I would expect those to be the same speed.

From the software side, there might be a difference, but in theory we
should be able to make the non-syscall case faster because we don't have
syscall tracing there.

> That is, I don't think this scheme will work for the various RISC
> architectures, given their very limited immediate range turns a typical
> call into a multi-instruction trainwreck real quick.
> 
> Now, that isn't a problem if their exceptions and syscalls are of equal
> speed.

Yep, on arm64 we definitely can't patch in branches reliably; using BRK
(as we do today) is the only reliable option, and it *shouldn't* be
slower than a syscall.

Looking around, we have a different latent issue with uprobes on arm64
in that only certain instructions can be modified while being
concurrently executed (in addition to the atomictiy of updating the
bytes in memory), and for everything else we need to stop-the-world. We
handle that for kprobes but it looks like we don't have any
infrastructure to handle that for uprobes.

Mark.

Andrii Nakryiko Nov. 19, 2024, 6:13 a.m. UTC | #6

On Mon, Nov 18, 2024 at 2:06 AM Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Sun, Nov 17, 2024 at 12:49:46PM +0100, Peter Zijlstra wrote:
> > On Tue, Nov 05, 2024 at 02:33:54PM +0100, Jiri Olsa wrote:
> > > hi,
> > > this patchset adds support to optimize usdt probes on top of 5-byte
> > > nop instruction.
> > >
> > > The generic approach (optimize all uprobes) is hard due to emulating
> > > possible multiple original instructions and its related issues. The
> > > usdt case, which stores 5-byte nop seems much easier, so starting
> > > with that.
> > >
> > > The basic idea is to replace breakpoint exception with syscall which
> > > is faster on x86_64. For more details please see changelog of patch 7.
> >
> > So this is really about the fact that syscalls are faster than traps on
> > x86_64? Is there something similar on ARM64, or are they roughly the
> > same speed there?
>
> From the hardware side I would expect those to be the same speed.
>
> From the software side, there might be a difference, but in theory we
> should be able to make the non-syscall case faster because we don't have
> syscall tracing there.
>
> > That is, I don't think this scheme will work for the various RISC
> > architectures, given their very limited immediate range turns a typical
> > call into a multi-instruction trainwreck real quick.
> >
> > Now, that isn't a problem if their exceptions and syscalls are of equal
> > speed.
>
> Yep, on arm64 we definitely can't patch in branches reliably; using BRK
> (as we do today) is the only reliable option, and it *shouldn't* be
> slower than a syscall.
>
> Looking around, we have a different latent issue with uprobes on arm64
> in that only certain instructions can be modified while being
> concurrently executed (in addition to the atomictiy of updating the

What does this mean for the application in practical terms? Will it
crash? Or will there be some corruption? Just curious how this can
manifest.

> bytes in memory), and for everything else we need to stop-the-world. We
> handle that for kprobes but it looks like we don't have any
> infrastructure to handle that for uprobes.
>
> Mark.

Mark Rutland Nov. 21, 2024, 6:18 p.m. UTC | #7

On Mon, Nov 18, 2024 at 10:13:04PM -0800, Andrii Nakryiko wrote:
> On Mon, Nov 18, 2024 at 2:06 AM Mark Rutland <mark.rutland@arm.com> wrote:
> > Yep, on arm64 we definitely can't patch in branches reliably; using BRK
> > (as we do today) is the only reliable option, and it *shouldn't* be
> > slower than a syscall.
> >
> > Looking around, we have a different latent issue with uprobes on arm64
> > in that only certain instructions can be modified while being
> > concurrently executed (in addition to the atomictiy of updating the
> 
> What does this mean for the application in practical terms? Will it
> crash? Or will there be some corruption? Just curious how this can
> manifest.

It can result in a variety of effects including crashes, corruption of
memory, registers, issuing random syscalls, etc.

The ARM ARM (ARM DDI 0487K.a [1]) says in section B2.2.5:

  Concurrent modification and execution of instructions can lead to the
  resulting instruction performing any behavior that can be achieved by
  executing any sequence of instructions that can be executed from the
  same Exception level [...]

Which is to say basically anything might happen, except that this can't
corrupt any state userspace cannot access, and cannot provide a
mechanism to escalate privilege to a higher exception level.

So that's potentially *very bad*, and we're just getting lucky that most
implementations don't happen to do that for most instructions, though
I'm fairly certain there are implementations out there which do exhibit
this behaviour (and it gets more likely as implementations get more
aggressive).

Mark.

[1] https://developer.arm.com/documentation/ddi0487/ka/?lang=en

Andrii Nakryiko Nov. 26, 2024, 7:13 p.m. UTC | #8

On Thu, Nov 21, 2024 at 10:18 AM Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Mon, Nov 18, 2024 at 10:13:04PM -0800, Andrii Nakryiko wrote:
> > On Mon, Nov 18, 2024 at 2:06 AM Mark Rutland <mark.rutland@arm.com> wrote:
> > > Yep, on arm64 we definitely can't patch in branches reliably; using BRK
> > > (as we do today) is the only reliable option, and it *shouldn't* be
> > > slower than a syscall.
> > >
> > > Looking around, we have a different latent issue with uprobes on arm64
> > > in that only certain instructions can be modified while being
> > > concurrently executed (in addition to the atomictiy of updating the
> >
> > What does this mean for the application in practical terms? Will it
> > crash? Or will there be some corruption? Just curious how this can
> > manifest.
>
> It can result in a variety of effects including crashes, corruption of
> memory, registers, issuing random syscalls, etc.
>
> The ARM ARM (ARM DDI 0487K.a [1]) says in section B2.2.5:
>
>   Concurrent modification and execution of instructions can lead to the
>   resulting instruction performing any behavior that can be achieved by
>   executing any sequence of instructions that can be executed from the
>   same Exception level [...]
>
> Which is to say basically anything might happen, except that this can't
> corrupt any state userspace cannot access, and cannot provide a
> mechanism to escalate privilege to a higher exception level.
>
> So that's potentially *very bad*, and we're just getting lucky that most
> implementations don't happen to do that for most instructions, though
> I'm fairly certain there are implementations out there which do exhibit
> this behaviour (and it gets more likely as implementations get more
> aggressive).
>

I see. I wonder if the fact that we do __replace_page() saves us here?
Either way, if that's a problem, it would be good for someone familiar
with ARM64 to try to address it. Ideally in a way that won't ruin the
multi-uprobe attachment speeds (i.e., not doing stop-the-world for
each of many uprobe locations to be attached, but rather do that once
for all uprobes).

> Mark.
>
> [1] https://developer.arm.com/documentation/ddi0487/ka/?lang=en

[RFC,00/11] uprobes: Add support to optimize usdt probes on x86_64

Message

Comments