diff mbox series

[7/8] uprobes: perform lockless SRCU-protected uprobes_tree lookup

Message ID 20240731214256.3588718-8-andrii@kernel.org (mailing list archive)
State Superseded
Delegated to: Masami Hiramatsu
Headers show
Series uprobes: RCU-protected hot path optimizations | expand

Commit Message

Andrii Nakryiko July 31, 2024, 9:42 p.m. UTC
Another big bottleneck to scalability is uprobe_treelock, which is taken in
a very hot path in handle_swbp(). Now that uprobes are SRCU-protected,
take advantage of that and make the uprobes_tree RB-tree lookup lockless.

To make the RCU-protected lockless RB-tree lookup correct, we need to take
into account that such a lookup can return false negatives if there
are parallel RB-tree modifications (rotations) going on. We use a seqcount
to detect whether the RB-tree changed, and if we find nothing while the
RB-tree got modified in between, we just retry. If the uprobe was found,
the lookup is guaranteed to be correct.
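
Condensed, the retry loop added to find_uprobe_rcu() looks like this
(names as in the full diff at the bottom of this page):

	do {
		seq = read_seqcount_begin(&uprobes_seqcount);
		node = rb_find_rcu(&key, &uprobes_tree, __uprobe_cmp_key);
		if (node)	/* a hit is always a valid uprobe under SRCU */
			return __node_2_uprobe(node);
		/* a miss is trusted only if no writer raced with the walk */
	} while (read_seqcount_retry(&uprobes_seqcount, seq));

	return NULL;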

With all the lock-avoiding changes done, we get a pretty decent
improvement in performance and scalability of uprobes with the number of
CPUs, even though we are still nowhere near linear scalability. This is
partly due to SRCU not scaling very well with the number of CPUs on
the particular hardware used for testing (80-core Intel Xeon Gold
6138 CPU @ 2.00GHz), but also due to the remaining mmap_lock, which is
currently taken to resolve the interrupt address to inode+offset and then
to the uprobe instance. And, of course, uretprobes still need similar RCU
protection to avoid refcounting in the hot path, which will be addressed
in follow-up patches.

Nevertheless, the improvement is good. We used BPF selftest-based
uprobe-nop and uretprobe-nop benchmarks to get the numbers below,
varying the number of CPUs on which uprobes and uretprobes are triggered.

BASELINE
========
uprobe-nop      ( 1 cpus):    3.032 ± 0.023M/s  (  3.032M/s/cpu)
uprobe-nop      ( 2 cpus):    3.452 ± 0.005M/s  (  1.726M/s/cpu)
uprobe-nop      ( 4 cpus):    3.663 ± 0.005M/s  (  0.916M/s/cpu)
uprobe-nop      ( 8 cpus):    3.718 ± 0.038M/s  (  0.465M/s/cpu)
uprobe-nop      (16 cpus):    3.344 ± 0.008M/s  (  0.209M/s/cpu)
uprobe-nop      (32 cpus):    2.288 ± 0.021M/s  (  0.071M/s/cpu)
uprobe-nop      (64 cpus):    3.205 ± 0.004M/s  (  0.050M/s/cpu)

uretprobe-nop   ( 1 cpus):    1.979 ± 0.005M/s  (  1.979M/s/cpu)
uretprobe-nop   ( 2 cpus):    2.361 ± 0.005M/s  (  1.180M/s/cpu)
uretprobe-nop   ( 4 cpus):    2.309 ± 0.002M/s  (  0.577M/s/cpu)
uretprobe-nop   ( 8 cpus):    2.253 ± 0.001M/s  (  0.282M/s/cpu)
uretprobe-nop   (16 cpus):    2.007 ± 0.000M/s  (  0.125M/s/cpu)
uretprobe-nop   (32 cpus):    1.624 ± 0.003M/s  (  0.051M/s/cpu)
uretprobe-nop   (64 cpus):    2.149 ± 0.001M/s  (  0.034M/s/cpu)

SRCU CHANGES
============
uprobe-nop      ( 1 cpus):    3.276 ± 0.005M/s  (  3.276M/s/cpu)
uprobe-nop      ( 2 cpus):    4.125 ± 0.002M/s  (  2.063M/s/cpu)
uprobe-nop      ( 4 cpus):    7.713 ± 0.002M/s  (  1.928M/s/cpu)
uprobe-nop      ( 8 cpus):    8.097 ± 0.006M/s  (  1.012M/s/cpu)
uprobe-nop      (16 cpus):    6.501 ± 0.056M/s  (  0.406M/s/cpu)
uprobe-nop      (32 cpus):    4.398 ± 0.084M/s  (  0.137M/s/cpu)
uprobe-nop      (64 cpus):    6.452 ± 0.000M/s  (  0.101M/s/cpu)

uretprobe-nop   ( 1 cpus):    2.055 ± 0.001M/s  (  2.055M/s/cpu)
uretprobe-nop   ( 2 cpus):    2.677 ± 0.000M/s  (  1.339M/s/cpu)
uretprobe-nop   ( 4 cpus):    4.561 ± 0.003M/s  (  1.140M/s/cpu)
uretprobe-nop   ( 8 cpus):    5.291 ± 0.002M/s  (  0.661M/s/cpu)
uretprobe-nop   (16 cpus):    5.065 ± 0.019M/s  (  0.317M/s/cpu)
uretprobe-nop   (32 cpus):    3.622 ± 0.003M/s  (  0.113M/s/cpu)
uretprobe-nop   (64 cpus):    3.723 ± 0.002M/s  (  0.058M/s/cpu)

Peak throughput increased from 3.7 M/s (uprobe triggerings) to about
8 M/s. For uretprobes the improvement is a bit more modest, with a bump
from 2.4 M/s to 5 M/s.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 30 ++++++++++++++++++++++++------
 1 file changed, 24 insertions(+), 6 deletions(-)

Comments

Andrii Nakryiko Aug. 7, 2024, 5:14 p.m. UTC | #1
On Wed, Jul 31, 2024 at 2:43 PM Andrii Nakryiko <andrii@kernel.org> wrote:
>
> [ ... full commit message and benchmark tables quoted verbatim; snipped, see above ... ]
>

Ok, so it seems like rb_find_rcu() and rb_find_add_rcu() are not
enough or are buggy. I managed to more or less reliably reproduce
a crash, which was bisected to exactly this change. My wild guess is
that we'd need an rb_erase_rcu() variant or something, because what
I'm seeing is a corrupted rb_root node while performing a lockless
rb_find_rcu() operation.

The debugging info below (which Gmail will butcher here) is also
available at [0], for convenience.

So, I got the below crash when running `sudo ./uprobe-stress -a20 -f3
-m5 -t4` on a 16-core QEMU VM.

[  179.375551] BUG: unable to handle page fault for address: 0000441f0f660097
[  179.376612] #PF: supervisor read access in kernel mode
[  179.377334] #PF: error_code(0x0000) - not-present page
[  179.378037] PGD 0 P4D 0
[  179.378391] Oops: Oops: 0000 [#1] PREEMPT SMP
[  179.378992] CPU: 5 UID: 0 PID: 2292 Comm: uprobe-stress Tainted: G            E      6.11.0-rc1-00025-g6f8e0d8d5b55-dirty #181
[  179.380514] Tainted: [E]=UNSIGNED_MODULE
[  179.381022] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[  179.382475] RIP: 0010:uprobe_notify_resume+0x3db/0xc40
[  179.383148] Code: c1 e1 0c 48 2b 08 48 8b 44 24 08 48 01 c1 8b 35 bb 36 0d 03 40 f6 c6 01 0f 85 be 00 00 00 48 8b 2d ba 36 0d 03 48 85 ed 74 29 <48> 8b 85 90 00 00 00 48 39 c2 72 2d 48 39 d0 72 0f 48 3b 8d 98 00
[  179.385639] RSP: 0000:ffffc90000a93e78 EFLAGS: 00010202
[  179.386338] RAX: 74c2394808478d48 RBX: ffff888105619800 RCX: 0000000000004118
[  179.387480] RDX: ffff888109c18eb0 RSI: 00000000000a6130 RDI: ffff888105619800
[  179.388677] RBP: 0000441f0f660007 R08: ffff8881098f1300 R09: 0000000000000000
[  179.389729] R10: 0000000000000000 R11: 0000000000000001 R12: ffff888105680fc0
[  179.390694] R13: 0000000000000000 R14: ffff888105681068 R15: ffff888105681000
[  179.391717] FS:  00007f10690006c0(0000) GS:ffff88881fb40000(0000) knlGS:0000000000000000
[  179.392800] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  179.393582] CR2: 0000441f0f660097 CR3: 00000001049bb004 CR4: 0000000000370ef0
[  179.394536] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  179.395485] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  179.396464] Call Trace:
[  179.396796]  <TASK>
[  179.397133]  ? __die+0x1f/0x60
[  179.397640]  ? page_fault_oops+0x14c/0x440
[  179.398316]  ? do_user_addr_fault+0x5f/0x6c0
[  179.398899]  ? kvm_read_and_reset_apf_flags+0x3c/0x50
[  179.399730]  ? exc_page_fault+0x66/0x130
[  179.400361]  ? asm_exc_page_fault+0x22/0x30
[  179.400912]  ? uprobe_notify_resume+0x3db/0xc40
[  179.401500]  ? uprobe_notify_resume+0x11a/0xc40
[  179.402089]  ? arch_uprobe_exception_notify+0x39/0x40
[  179.402736]  ? notifier_call_chain+0x55/0xc0
[  179.403293]  irqentry_exit_to_user_mode+0x98/0x140
[  179.403917]  asm_exc_int3+0x35/0x40
[  179.404394] RIP: 0033:0x404119
[  179.404831] Code: 8b 45 fc 89 c7 e8 43 09 00 00 83 c0 01 c9 c3 55 48 89 e5 48 83 ec 10 89 7d fc 8b 45 fc 89 c7 e8 29 09 00 00 83 c0 01 c9 c3 cc <48> 89 e5 48 83 ec 10 89 7d fc 8b 45 fc 89 c7 e8 0f 09 00 00 83 c0
[  179.407350] RSP: 002b:00007f1068fffab8 EFLAGS: 00000206
[  179.408083] RAX: 0000000000404118 RBX: 00007f1069000cdc RCX: 000000000000000a
[  179.409020] RDX: 0000000000000012 RSI: 0000000000000064 RDI: 0000000000000012
[  179.409965] RBP: 00007f1068fffae0 R08: 00007f106a84403c R09: 00007f106a8440a0
[  179.410864] R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffff80
[  179.411768] R13: 0000000000000016 R14: 00007ffc3dc8f070 R15: 00007f1068800000
[  179.412672]  </TASK>
[  179.412965] Modules linked in: aesni_intel(E) crypto_simd(E) cryptd(E) kvm_intel(E) kvm(E) i2c_piix4(E) i2c_smbus(E) i2c_core(E) i6300esb(E) crc32c_intel(E) floppy(E) pcspkr(E) button(E) serio_raw(E)
[  179.415227] CR2: 0000441f0f660097
[  179.415683] ---[ end trace 0000000000000000 ]---


Note RBP: 0000441f0f660007 and RIP: 0010:uprobe_notify_resume+0x3db/0xc40

Decoding:

[ 179.417075] Code: c1 e1 0c 48 2b 08 48 8b 44 24 08 48 01 c1 8b 35 bb 36 0d 03 40 f6 c6 01 0f 85 be 00 00 00 48 8b 2d ba 36 0d 03 48 85 ed 74 29 <48> 8b 85 90 00 00 00 48 39 c2 72 2d 48 39 d0 72 0f 48 3b 8d 98 00
All code
========
0: c1 e1 0c                shl $0xc,%ecx
3: 48 2b 08                sub (%rax),%rcx
6: 48 8b 44 24 08          mov 0x8(%rsp),%rax
b: 48 01 c1                add %rax,%rcx
e: 8b 35 bb 36 0d 03       mov 0x30d36bb(%rip),%esi # 0x30d36cf
14: 40 f6 c6 01            test $0x1,%sil
18: 0f 85 be 00 00 00      jne 0xdc
1e: 48 8b 2d ba 36 0d 03   mov 0x30d36ba(%rip),%rbp # 0x30d36df
25: 48 85 ed               test %rbp,%rbp
28: 74 29                  je 0x53
2a:* 48 8b 85 90 00 00 00  mov 0x90(%rbp),%rax <-- trapping instruction
31: 48 39 c2               cmp %rax,%rdx
34: 72 2d                  jb 0x63
36: 48 39 d0               cmp %rdx,%rax
39: 72 0f                  jb 0x4a
3b: 48                     rex.W
3c: 3b                     .byte 0x3b
3d: 8d                     .byte 0x8d
3e: 98                     cwtl
...


Let's also look at uprobe_notify_resume+0x3db:

(gdb) list *(uprobe_notify_resume+0x3db)
0xffffffff81242f5b is in uprobe_notify_resume
(/data/users/andriin/linux/kernel/events/uprobes.c:662).
657
658 static __always_inline
659 int uprobe_cmp(const struct inode *l_inode, const loff_t l_offset,
660                const struct uprobe *r)
661 {
662     if (l_inode < r->inode)
663         return -1;
664
665     if (l_inode > r->inode)
666         return 1;

So this is most probably the access to the uprobe through r->inode.

Looking at uprobe_notify_resume disassembly:

0xffffffff81242f25 <+933>:  mov 0xa8(%rcx),%rdx
0xffffffff81242f2c <+940>:  mov 0x78(%rax),%rcx
0xffffffff81242f30 <+944>:  shl $0xc,%rcx
0xffffffff81242f34 <+948>:  sub (%rax),%rcx
0xffffffff81242f37 <+951>:  mov 0x8(%rsp),%rax
0xffffffff81242f3c <+956>:  add %rax,%rcx
0xffffffff81242f3f <+959>:  mov 0x30d36bb(%rip),%esi # 0xffffffff84316600 <uprobes_seqcount>
0xffffffff81242f45 <+965>:  test $0x1,%sil
0xffffffff81242f49 <+969>:  jne 0xffffffff8124300d <uprobe_notify_resume+1165>
0xffffffff81242f4f <+975>:  mov 0x30d36ba(%rip),%rbp # 0xffffffff84316610 <uprobes_tree>
0xffffffff81242f56 <+982>:  test %rbp,%rbp
0xffffffff81242f59 <+985>:  je 0xffffffff81242f84 <uprobe_notify_resume+1028>
0xffffffff81242f5b <+987>:  mov 0x90(%rbp),%rax
0xffffffff81242f62 <+994>:  cmp %rax,%rdx
0xffffffff81242f65 <+997>:  jb 0xffffffff81242f94 <uprobe_notify_resume+1044>
0xffffffff81242f67 <+999>:  cmp %rdx,%rax
0xffffffff81242f6a <+1002>: jb 0xffffffff81242f7b <uprobe_notify_resume+1019>
0xffffffff81242f6c <+1004>: cmp 0x98(%rbp),%rcx
0xffffffff81242f73 <+1011>: jl 0xffffffff81242f94 <uprobe_notify_resume+1044>
0xffffffff81242f75 <+1013>: jle 0xffffffff81242d0e <uprobe_notify_resume+398>
0xffffffff81242f7b <+1019>: mov 0x8(%rbp),%rbp
0xffffffff81242f7f <+1023>: test %rbp,%rbp

Note how in mov 0x90(%rbp),%rax, %rbp comes from

mov 0x30d36ba(%rip),%rbp # 0xffffffff84316610 <uprobes_tree>

So. This is the inlined uprobe_cmp() on uprobes_tree.rb_node (i.e., the
root of the RB-tree). And we know that rbp was set to 0x0000441f0f660007
(so the faulting load at 0x90(%rbp) matches CR2 = 0x0000441f0f660097),
which doesn't look like a valid kernel address to me. 0x90 is
offsetof(struct uprobe, inode), and offsetof(struct uprobe, rb_node)
== 0. So we load the rb_node/uprobe pointer from uprobes_tree (struct
rb_root), get 0x0000441f0f660007, and then r->inode crashes the kernel
because r looks like a user-space address.

Note that we never modify the RB-tree root (or any node) directly
anywhere. We only mutate the tree with rb_find_add_rcu() and rb_erase(),
both under uprobes_treelock.


So, any ideas how we can end up with a "corrupted" root on a lockless
lookup with rb_find_rcu()? This seems to be the very first lockless
RB-tree lookup use case in the tree, so perhaps it's not that safe
after all (certainly rb_erase() is non-trivial enough not to be
"obviously correct" w.r.t. RCU, no)?

  [0] https://gist.github.com/anakryiko/df31ab75f25544af93cd41273056ee88

> [ ... patch body quoted verbatim; snipped, see the full diff at the bottom of this page ... ]
Oleg Nesterov Aug. 8, 2024, 10:04 a.m. UTC | #2
On 08/07, Andrii Nakryiko wrote:
>
> Ok, so it seems like rb_find_rcu() and rb_find_add_rcu() are not
> enough or are buggy. I managed to more or less reliably start
> reproducing a crash, which was bisected to exactly this change. My
> wild guess is that we'd need an rb_erase_rcu() variant or something,

And then I think it is not safe to put uprobe->rb_node and uprobe->rcu
in the union, sorry...

Did you get the crash with this change?

Oleg.
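
(For context, at this point in the series struct uprobe overlaid the RCU
head and the tree node; a rough sketch of the layout Oleg refers to, not
the exact patch:)

	struct uprobe {
		union {
			struct rb_node	rb_node;	/* link words while in uprobes_tree */
			struct rcu_head	rcu;		/* overlaid once queued for freeing */
		};
		/* ... refcount, inode, offset, consumers, etc. ... */
	};

The hazard with such an overlay: call_rcu() immediately writes its
callback linkage into rcu, clobbering rb_node's parent/child words, while
a lockless rb_find_rcu() walker that raced with the preceding rb_erase()
may still be reading them -- a plausible source of the garbage pointer
dereferenced in the crash above.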
Oleg Nesterov Aug. 8, 2024, 1:40 p.m. UTC | #3
On 07/31, Andrii Nakryiko wrote:
>
>  static DEFINE_RWLOCK(uprobes_treelock);	/* serialize rbtree access */
> +static seqcount_rwlock_t uprobes_seqcount = SEQCNT_RWLOCK_ZERO(uprobes_seqcount, &uprobes_treelock);

Just noticed... Why seqcount_rwlock_t?

find_uprobe_rcu() doesn't use read_seqbegin_or_lock(),
seqcount_t should work just fine.

Oleg.
Oleg Nesterov Aug. 8, 2024, 2:29 p.m. UTC | #4
On 08/07, Andrii Nakryiko wrote:
>
> So, any ideas how we can end up with "corrupted" root on lockless
> lookup with rb_find_rcu()?

I certainly can't help ;) I know ABSOLUTELY NOTHING about rb or any
other tree.

But,

> This seems to be the very first lockless
> RB-tree lookup use case in the tree,

Well, latch_tree_find() is supposed to be rcu-safe afaics, and
__lt_erase() is just rb_erase(). So it is not the 1st use case.

See also the "Notes on lockless lookups" comment in lib/rbtree.c.

So it seems that rb_erase() is supposed to be rcu-safe. However
it uses __rb_change_child(), not __rb_change_child_rcu().

Not that I think this can explain the problem, and on x86
__smp_store_release() is just WRITE_ONCE, but looks confusing...

Oleg.
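
(For reference, the two helpers Oleg contrasts differ only in the store
used to publish the new child/root pointer; condensed from
include/linux/rbtree_augmented.h of that era, so check the tree for the
exact code:)

	static inline void
	__rb_change_child(struct rb_node *old, struct rb_node *new,
			  struct rb_node *parent, struct rb_root *root)
	{
		if (parent) {
			if (parent->rb_left == old)
				WRITE_ONCE(parent->rb_left, new);
			else
				WRITE_ONCE(parent->rb_right, new);
		} else
			WRITE_ONCE(root->rb_node, new);
	}

	static inline void
	__rb_change_child_rcu(struct rb_node *old, struct rb_node *new,
			      struct rb_node *parent, struct rb_root *root)
	{
		if (parent) {
			if (parent->rb_left == old)
				rcu_assign_pointer(parent->rb_left, new);
			else
				rcu_assign_pointer(parent->rb_right, new);
		} else
			rcu_assign_pointer(root->rb_node, new);
	}

rcu_assign_pointer() adds release ordering via smp_store_release(), which
on x86 compiles to the same plain store as WRITE_ONCE() -- hence Oleg's
remark above that this difference alone cannot explain the x86 crash.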
Andrii Nakryiko Aug. 8, 2024, 5 p.m. UTC | #5
On Thu, Aug 8, 2024 at 7:29 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 08/07, Andrii Nakryiko wrote:
> >
> > So, any ideas how we can end up with "corrupted" root on lockless
> > lookup with rb_find_rcu()?
>
> I certainly can't help ;) I know ABSOLUTELY NOTHING about rb or any
> other tree.
>
> But,
>
> > This seems to be the very first lockless
> > RB-tree lookup use case in the tree,
>
> Well, latch_tree_find() is supposed to be rcu-safe afaics, and
> __lt_erase() is just rb_erase(). So it is not the 1st use case.
>
> See also the "Notes on lockless lookups" comment in lib/rbtree.c.
>
> So it seems that rb_erase() is supposed to be rcu-safe. However
> it uses __rb_change_child(), not __rb_change_child_rcu().
>

While trying to mitigate the crash locally I noticed
__rb_change_child() and manually changed all the cases to
__rb_change_child_rcu(). That didn't help :) But I think your guess
about sharing rcu and rb_node is the right one, so hopefully fixing
that will solve the issue.

> Not that I think this can explain the problem, and on x86
> __smp_store_release() is just WRITE_ONCE, but looks confusing...
>
> Oleg.
>
Oleg Nesterov Aug. 10, 2024, 2 p.m. UTC | #6
On 08/08, Oleg Nesterov wrote:
>
> On 07/31, Andrii Nakryiko wrote:
> >
> >  static DEFINE_RWLOCK(uprobes_treelock);	/* serialize rbtree access */
> > +static seqcount_rwlock_t uprobes_seqcount = SEQCNT_RWLOCK_ZERO(uprobes_seqcount, &uprobes_treelock);
>
> Just noticed... Why seqcount_rwlock_t?
>
> find_uprobe_rcu() doesn't use read_seqbegin_or_lock(),
> seqcount_t should work just fine.

Please ignore... I forgot that seqcount_t is not CONFIG_PREEMPT_RT-friendly.

Hmm. __seqprop_preemptible() returns 0, this doesn't look right... Nevermind.

Oleg.
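
(A note on the flavor choice discussed above: seqcount_rwlock_t
associates uprobes_treelock with the seqcount, which lets lockdep check
that writers hold the lock and, on CONFIG_PREEMPT_RT, lets the read side
acquire the lock instead of spinning against a preempted writer. The
declaration, as in the patch:)

	static DEFINE_RWLOCK(uprobes_treelock);	/* serialize rbtree access */
	static seqcount_rwlock_t uprobes_seqcount =
		SEQCNT_RWLOCK_ZERO(uprobes_seqcount, &uprobes_treelock);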
diff mbox series

Patch

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index b0488d356399..d03962cc96de 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -40,6 +40,7 @@  static struct rb_root uprobes_tree = RB_ROOT;
 #define no_uprobe_events()	RB_EMPTY_ROOT(&uprobes_tree)
 
 static DEFINE_RWLOCK(uprobes_treelock);	/* serialize rbtree access */
+static seqcount_rwlock_t uprobes_seqcount = SEQCNT_RWLOCK_ZERO(uprobes_seqcount, &uprobes_treelock);
 
 DEFINE_STATIC_SRCU(uprobes_srcu);
 
@@ -629,8 +630,11 @@  static void put_uprobe(struct uprobe *uprobe)
 
 	write_lock(&uprobes_treelock);
 
-	if (uprobe_is_active(uprobe))
+	if (uprobe_is_active(uprobe)) {
+		write_seqcount_begin(&uprobes_seqcount);
 		rb_erase(&uprobe->rb_node, &uprobes_tree);
+		write_seqcount_end(&uprobes_seqcount);
+	}
 
 	write_unlock(&uprobes_treelock);
 
@@ -696,14 +700,26 @@  static struct uprobe *find_uprobe_rcu(struct inode *inode, loff_t offset)
 		.offset = offset,
 	};
 	struct rb_node *node;
+	unsigned int seq;
 
 	lockdep_assert(srcu_read_lock_held(&uprobes_srcu));
 
-	read_lock(&uprobes_treelock);
-	node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
-	read_unlock(&uprobes_treelock);
+	do {
+		seq = read_seqcount_begin(&uprobes_seqcount);
+		node = rb_find_rcu(&key, &uprobes_tree, __uprobe_cmp_key);
+		/*
+		 * Lockless RB-tree lookups can result only in false negatives.
+		 * If the element is found, it is correct and can be returned
+		 * under RCU protection. If we find nothing, we need to
+		 * validate that seqcount didn't change. If it did, we have to
+		 * try again as we might have missed the element (false
+		 * negative). If seqcount is unchanged, search truly failed.
+		 */
+		if (node)
+			return __node_2_uprobe(node);
+	} while (read_seqcount_retry(&uprobes_seqcount, seq));
 
-	return node ? __node_2_uprobe(node) : NULL;
+	return NULL;
 }
 
 /*
@@ -725,7 +741,7 @@  static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
 {
 	struct rb_node *node;
 again:
-	node = rb_find_add(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
+	node = rb_find_add_rcu(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
 	if (node) {
 		struct uprobe *u = __node_2_uprobe(node);
 
@@ -750,7 +766,9 @@  static struct uprobe *insert_uprobe(struct uprobe *uprobe)
 	struct uprobe *u;
 
 	write_lock(&uprobes_treelock);
+	write_seqcount_begin(&uprobes_seqcount);
 	u = __insert_uprobe(uprobe);
+	write_seqcount_end(&uprobes_seqcount);
 	write_unlock(&uprobes_treelock);
 
 	return u;