Message ID | 20240731214256.3588718-8-andrii@kernel.org (mailing list archive) |
---|---
State | Superseded |
Delegated to: | Masami Hiramatsu |
Series | uprobes: RCU-protected hot path optimizations |
On Wed, Jul 31, 2024 at 2:43 PM Andrii Nakryiko <andrii@kernel.org> wrote:
>
> Another big bottleneck to scalability is uprobe_treelock that's taken in
> a very hot path in handle_swbp(). Now that uprobes are SRCU-protected,
> take advantage of that and make uprobes_tree RB-tree lookups lockless.
>
> To make RB-tree RCU-protected lockless lookups correct, we need to take
> into account that such RB-tree lookups can return false negatives if
> there are parallel RB-tree modifications (rotations) going on. We use a
> seqcount lock to detect whether the RB-tree changed, and if we find
> nothing while the RB-tree got modified in between, we just retry. If the
> uprobe was found, then it's guaranteed to be a correct lookup.
>
> [... rest of the commit message, benchmark numbers and sign-offs trimmed;
> the full patch is shown below ...]
Ok, so it seems like rb_find_rcu() and rb_find_add_rcu() are not enough
or are buggy. I managed to more or less reliably start reproducing a
crash, which was bisected to exactly this change. My wild guess is that
we'd need an rb_erase_rcu() variant or something, because what I'm
seeing is a corrupted rb_root node while performing a lockless
rb_find_rcu() operation.

You can find the debugging info below (which Gmail will butcher here)
in [0], for convenience.

So, I got the below crash when running `sudo ./uprobe-stress -a20 -f3 -m5 -t4`
on a 16-core QEMU VM.

[ 179.375551] BUG: unable to handle page fault for address: 0000441f0f660097
[ 179.376612] #PF: supervisor read access in kernel mode
[ 179.377334] #PF: error_code(0x0000) - not-present page
[ 179.378037] PGD 0 P4D 0
[ 179.378391] Oops: Oops: 0000 [#1] PREEMPT SMP
[ 179.378992] CPU: 5 UID: 0 PID: 2292 Comm: uprobe-stress Tainted: G E 6.11.0-rc1-00025-g6f8e0d8d5b55-dirty #181
[ 179.380514] Tainted: [E]=UNSIGNED_MODULE
[ 179.381022] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 179.382475] RIP: 0010:uprobe_notify_resume+0x3db/0xc40
[ 179.383148] Code: c1 e1 0c 48 2b 08 48 8b 44 24 08 48 01 c1 8b 35 bb 36 0d 03 40 f6 c6 01 0f 85 be 00 00 00 48 8b 2d ba 36 0d 03 48 85 ed 74 29 <48> 8b 85 90 00 00 00 48 39 c2 72 2d 48 39 d0 72 0f 48 3b 8d 98 00
[ 179.385639] RSP: 0000:ffffc90000a93e78 EFLAGS: 00010202
[ 179.386338] RAX: 74c2394808478d48 RBX: ffff888105619800 RCX: 0000000000004118
[ 179.387480] RDX: ffff888109c18eb0 RSI: 00000000000a6130 RDI: ffff888105619800
[ 179.388677] RBP: 0000441f0f660007 R08: ffff8881098f1300 R09: 0000000000000000
[ 179.389729] R10: 0000000000000000 R11: 0000000000000001 R12: ffff888105680fc0
[ 179.390694] R13: 0000000000000000 R14: ffff888105681068 R15: ffff888105681000
[ 179.391717] FS: 00007f10690006c0(0000) GS:ffff88881fb40000(0000) knlGS:0000000000000000
[ 179.392800] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 179.393582] CR2: 0000441f0f660097 CR3: 00000001049bb004 CR4: 0000000000370ef0
[ 179.394536] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 179.395485] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 179.396464] Call Trace:
[ 179.396796]  <TASK>
[ 179.397133]  ? __die+0x1f/0x60
[ 179.397640]  ? page_fault_oops+0x14c/0x440
[ 179.398316]  ? do_user_addr_fault+0x5f/0x6c0
[ 179.398899]  ? kvm_read_and_reset_apf_flags+0x3c/0x50
[ 179.399730]  ? exc_page_fault+0x66/0x130
[ 179.400361]  ? asm_exc_page_fault+0x22/0x30
[ 179.400912]  ? uprobe_notify_resume+0x3db/0xc40
[ 179.401500]  ? uprobe_notify_resume+0x11a/0xc40
[ 179.402089]  ? arch_uprobe_exception_notify+0x39/0x40
[ 179.402736]  ? notifier_call_chain+0x55/0xc0
[ 179.403293]  irqentry_exit_to_user_mode+0x98/0x140
[ 179.403917]  asm_exc_int3+0x35/0x40
[ 179.404394] RIP: 0033:0x404119
[ 179.404831] Code: 8b 45 fc 89 c7 e8 43 09 00 00 83 c0 01 c9 c3 55 48 89 e5 48 83 ec 10 89 7d fc 8b 45 fc 89 c7 e8 29 09 00 00 83 c0 01 c9 c3 cc <48> 89 e5 48 83 ec 10 89 7d fc 8b 45 fc 89 c7 e8 0f 09 00 00 83 c0
[ 179.407350] RSP: 002b:00007f1068fffab8 EFLAGS: 00000206
[ 179.408083] RAX: 0000000000404118 RBX: 00007f1069000cdc RCX: 000000000000000a
[ 179.409020] RDX: 0000000000000012 RSI: 0000000000000064 RDI: 0000000000000012
[ 179.409965] RBP: 00007f1068fffae0 R08: 00007f106a84403c R09: 00007f106a8440a0
[ 179.410864] R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffff80
[ 179.411768] R13: 0000000000000016 R14: 00007ffc3dc8f070 R15: 00007f1068800000
[ 179.412672]  </TASK>
[ 179.412965] Modules linked in: aesni_intel(E) crypto_simd(E) cryptd(E) kvm_intel(E) kvm(E) i2c_piix4(E) i2c_smbus(E) i2c_core(E) i6300esb(E) crc32c_intel(E) floppy(E) pcspkr(E) button(E) serio_raw(E)
[ 179.415227] CR2: 0000441f0f660097
[ 179.415683] ---[ end trace 0000000000000000 ]---

Note RBP: 0000441f0f660007 and RIP: 0010:uprobe_notify_resume+0x3db/0xc40.

Decoding:

[ 179.417075] Code: c1 e1 0c 48 2b 08 48 8b 44 24 08 48 01 c1 8b 35 bb 36 0d 03 40 f6 c6 01 0f 85 be 00 00 00 48 8b 2d ba 36 0d 03 48 85 ed 74 29 <48> 8b 85 90 00 00 00 48 39 c2 72 2d 48 39 d0 72 0f 48 3b 8d 98 00

All code
========
   0:   c1 e1 0c                shl    $0xc,%ecx
   3:   48 2b 08                sub    (%rax),%rcx
   6:   48 8b 44 24 08          mov    0x8(%rsp),%rax
   b:   48 01 c1                add    %rax,%rcx
   e:   8b 35 bb 36 0d 03       mov    0x30d36bb(%rip),%esi        # 0x30d36cf
  14:   40 f6 c6 01             test   $0x1,%sil
  18:   0f 85 be 00 00 00       jne    0xdc
  1e:   48 8b 2d ba 36 0d 03    mov    0x30d36ba(%rip),%rbp        # 0x30d36df
  25:   48 85 ed                test   %rbp,%rbp
  28:   74 29                   je     0x53
  2a:*  48 8b 85 90 00 00 00    mov    0x90(%rbp),%rax          <-- trapping instruction
  31:   48 39 c2                cmp    %rax,%rdx
  34:   72 2d                   jb     0x63
  36:   48 39 d0                cmp    %rdx,%rax
  39:   72 0f                   jb     0x4a
  3b:   48                      rex.W
  3c:   3b                      .byte 0x3b
  3d:   8d                      .byte 0x8d
  3e:   98                      cwtl
        ...

Let's also look at uprobe_notify_resume+0x3db:

(gdb) list *(uprobe_notify_resume+0x3db)
0xffffffff81242f5b is in uprobe_notify_resume (/data/users/andriin/linux/kernel/events/uprobes.c:662).
657
658     static __always_inline
659     int uprobe_cmp(const struct inode *l_inode, const loff_t l_offset,
660                    const struct uprobe *r)
661     {
662             if (l_inode < r->inode)
663                     return -1;
664
665             if (l_inode > r->inode)
666                     return 1;

So this is most probably when accessing uprobe through r->inode.
Looking at the uprobe_notify_resume disassembly:

0xffffffff81242f2c <+940>:   mov    0x78(%rax),%rcx
0xffffffff81242f30 <+944>:   shl    $0xc,%rcx
0xffffffff81242f34 <+948>:   sub    (%rax),%rcx
0xffffffff81242f37 <+951>:   mov    0x8(%rsp),%rax
0xffffffff81242f3c <+956>:   add    %rax,%rcx
0xffffffff81242f3f <+959>:   mov    0x30d36bb(%rip),%esi        # 0xffffffff84316600 <uprobes_seqcount>
0xffffffff81242f45 <+965>:   test   $0x1,%sil
0xffffffff81242f49 <+969>:   jne    0xffffffff8124300d <uprobe_notify_resume+1165>
0xffffffff81242f4f <+975>:   mov    0x30d36ba(%rip),%rbp        # 0xffffffff84316610 <uprobes_tree>
0xffffffff81242f56 <+982>:   test   %rbp,%rbp
0xffffffff81242f59 <+985>:   je     0xffffffff81242f84 <uprobe_notify_resume+1028>
0xffffffff81242f5b <+987>:   mov    0x90(%rbp),%rax
0xffffffff81242f62 <+994>:   cmp    %rax,%rdx
0xffffffff81242f65 <+997>:   jb     0xffffffff81242f94 <uprobe_notify_resume+1044>
0xffffffff81242f67 <+999>:   cmp    %rdx,%rax
0xffffffff81242f6a <+1002>:  jb     0xffffffff81242f7b <uprobe_notify_resume+1019>
0xffffffff81242f6c <+1004>:  cmp    0x98(%rbp),%rcx
0xffffffff81242f73 <+1011>:  jl     0xffffffff81242f94 <uprobe_notify_resume+1044>
0xffffffff81242f75 <+1013>:  jle    0xffffffff81242d0e <uprobe_notify_resume+398>
0xffffffff81242f7b <+1019>:  mov    0x8(%rbp),%rbp
0xffffffff81242f7f <+1023>:  test   %rbp,%rbp
0xffffffff81242f25 <+933>:   mov    0xa8(%rcx),%rdx

Note how in

  mov    0x90(%rbp),%rax

%rbp is coming from

  mov    0x30d36ba(%rip),%rbp        # 0xffffffff84316610 <uprobes_tree>

So. This is the inlined uprobe_cmp() on uprobes_tree.rb_node (i.e., the
root of the RB-tree). And we know that rbp was set to 0x0000441f0f660007,
which doesn't look like a valid kernel address to me.

0x90 is offsetof(struct uprobe, inode), and offsetof(struct uprobe,
rb_node) == 0. So we load the rb_node/uprobe pointer from uprobes_tree
(struct rb_root), get 0x0000441f0f660007, and then reading r->inode (at
0x0000441f0f660097) crashes the kernel because r looks like a user space
address.

Note that we never modify the RB-tree root (or any node) directly
anywhere. We only mutate it with rb_find_add_rcu() and rb_erase(), both
under uprobes_treelock.

So, any ideas how we can end up with a "corrupted" root on lockless
lookup with rb_find_rcu()? This seems to be the very first lockless
RB-tree lookup use case in the tree, so perhaps it's not that safe after
all (certainly rb_erase() is non-trivial enough to not be "obviously
correct" w.r.t. RCU, no)?
[0] https://gist.github.com/anakryiko/df31ab75f25544af93cd41273056ee88
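For readers following the analysis above: the lockless descent that loaded
the bogus root pointer has roughly the shape below. This is a sketch of
what an RCU-flavoured rb_find() does, not a verbatim copy of
include/linux/rbtree.h; the _sketch name is made up for illustration.

#include <linux/rbtree.h>
#include <linux/rcupdate.h>

/*
 * Sketch of an RCU-safe descent: every pointer is (re)loaded with
 * rcu_dereference_raw(), so a racing writer can at worst send the walk
 * down a stale branch, never through freed memory, provided nothing
 * scribbles over the nodes (or the root) themselves.
 */
static struct rb_node *
rb_find_rcu_sketch(const void *key, const struct rb_root *tree,
		   int (*cmp)(const void *key, const struct rb_node *node))
{
	struct rb_node *node = rcu_dereference_raw(tree->rb_node);

	while (node) {
		int c = cmp(key, node);

		if (c < 0)
			node = rcu_dereference_raw(node->rb_left);
		else if (c > 0)
			node = rcu_dereference_raw(node->rb_right);
		else
			return node;
	}
	return NULL;
}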
On 08/07, Andrii Nakryiko wrote:
>
> Ok, so it seems like rb_find_rcu() and rb_find_add_rcu() are not
> enough or are buggy. I managed to more or less reliably start
> reproducing a crash, which was bisected to exactly this change. My
> wild guess is that we'd need an rb_erase_rcu() variant or something,

And then I think it is not safe to put uprobe->rb_node and uprobe->rcu
in the union, sorry...

Did you get the crash with this change?

Oleg.
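To make Oleg's concern concrete: the discussion confirms the series puts
uprobe->rb_node and uprobe->rcu in a union, but the layout below is
otherwise a made-up sketch, not the actual struct uprobe. The hazard is
that once put_uprobe() does rb_erase() followed by call_rcu(), the RCU
callback linkage is written over rb_node while SRCU-protected walkers
(which may still reach the node through stale pointers) expect to find
child/parent pointers there.

#include <linux/rbtree.h>
#include <linux/rcupdate.h>
#include <linux/types.h>

/*
 * Hypothetical layout only; field names mirror the discussion. Because
 * rcu shares storage with rb_node, queueing the RCU callback reuses the
 * very words a lockless lookup may still be dereferencing.
 */
struct uprobe_sketch {
	union {
		struct rb_node	rb_node;  /* while linked in uprobes_tree */
		struct rcu_head	rcu;      /* reused once the node is erased */
	};
	struct inode		*inode;
	loff_t			offset;
};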
On 07/31, Andrii Nakryiko wrote:
>
>  static DEFINE_RWLOCK(uprobes_treelock);	/* serialize rbtree access */
> +static seqcount_rwlock_t uprobes_seqcount = SEQCNT_RWLOCK_ZERO(uprobes_seqcount, &uprobes_treelock);

Just noticed... Why seqcount_rwlock_t?

find_uprobe_rcu() doesn't use read_seqbegin_or_lock(), seqcount_t should
work just fine.

Oleg.
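For reference, the alternative Oleg has in mind would look roughly like
the sketch below (he withdraws the suggestion further down because of
PREEMPT_RT); the reader side, read_seqcount_begin()/read_seqcount_retry(),
is written the same way for both variants.

/* Plain seqcount_t variant; only the lockdep/PREEMPT_RT association
 * with uprobes_treelock is lost compared to seqcount_rwlock_t. */
static DEFINE_RWLOCK(uprobes_treelock);	/* serialize rbtree access */
static seqcount_t uprobes_seqcount = SEQCNT_ZERO(uprobes_seqcount);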
On 08/07, Andrii Nakryiko wrote:
>
> So, any ideas how we can end up with "corrupted" root on lockless
> lookup with rb_find_rcu()?

I certainly can't help ;) I know ABSOLUTELY NOTHING about rb or any
other tree.

But,

> This seems to be the very first lockless
> RB-tree lookup use case in the tree,

Well, latch_tree_find() is supposed to be rcu-safe afaics, and
__lt_erase() is just rb_erase(). So it is not the 1st use case.

See also the "Notes on lockless lookups" comment in lib/rbtree.c.

So it seems that rb_erase() is supposed to be rcu-safe. However
it uses __rb_change_child(), not __rb_change_child_rcu().

Not that I think this can explain the problem, and on x86
__smp_store_release() is just WRITE_ONCE, but looks confusing...

Oleg.
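The latch-tree scheme Oleg points at is worth sketching, since it shows
what "rcu-safe" means there. This is a simplified paraphrase of the idea
behind include/linux/rbtree_latch.h, not its actual code; the _sketch
names are made up, and the real implementation keeps a node copy per
tree rather than reusing rb_find_rcu().

#include <linux/rbtree.h>
#include <linux/seqlock.h>

/* Writers keep two copies of the tree and flip a latch sequence count
 * between updating them, so readers are steered towards the copy that
 * is not currently being rotated. */
struct latch_root_sketch {
	seqcount_latch_t	seq;
	struct rb_root		tree[2];
};

static struct rb_node *
latch_find_sketch(struct latch_root_sketch *root, const void *key,
		  int (*cmp)(const void *key, const struct rb_node *node))
{
	struct rb_node *node;
	unsigned int seq;

	do {
		seq = raw_read_seqcount_latch(&root->seq);
		node = rb_find_rcu(key, &root->tree[seq & 1], cmp);
		/*
		 * If a writer raced with this walk, the retry below discards
		 * the result; the walk itself only has to be crash-safe,
		 * which is what the "Notes on lockless lookups" comment in
		 * lib/rbtree.c is about.
		 */
	} while (read_seqcount_latch_retry(&root->seq, seq));

	return node;
}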
On Thu, Aug 8, 2024 at 7:29 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 08/07, Andrii Nakryiko wrote:
> >
> > So, any ideas how we can end up with "corrupted" root on lockless
> > lookup with rb_find_rcu()?
>
> I certainly can't help ;) I know ABSOLUTELY NOTHING about rb or any
> other tree.
>
> But,
>
> > This seems to be the very first lockless
> > RB-tree lookup use case in the tree,
>
> Well, latch_tree_find() is supposed to be rcu-safe afaics, and
> __lt_erase() is just rb_erase(). So it is not the 1st use case.
>
> See also the "Notes on lockless lookups" comment in lib/rbtree.c.
>
> So it seems that rb_erase() is supposed to be rcu-safe. However
> it uses __rb_change_child(), not __rb_change_child_rcu().
>

While trying to mitigate the crash locally I noticed __rb_change_child()
and manually changed all the cases to __rb_change_child_rcu(). That
didn't help :) But I think your guess about sharing rcu and rb_node is
the right one, so hopefully that will solve the issue.

> Not that I think this can explain the problem, and on x86
> __smp_store_release() is just WRITE_ONCE, but looks confusing...
>
> Oleg.
>
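For context, the difference between the two helpers Andrii swapped is
small. Roughly (a from-memory paraphrase of include/linux/rbtree_augmented.h,
not a verbatim copy, with a made-up _sketch name): __rb_change_child()
publishes the new child with WRITE_ONCE(), while the _rcu flavour uses
rcu_assign_pointer(), i.e. a release store, so a lockless reader that
follows the freshly installed pointer also sees the node's other fields
initialized. As the thread notes, that ordering detail did not turn out
to be the culprit here.

#include <linux/rbtree.h>
#include <linux/rcupdate.h>

/* Sketch of the _rcu flavour: same reparenting logic, but the child
 * slot is published with a release store instead of a plain store. */
static inline void
change_child_rcu_sketch(struct rb_node *old, struct rb_node *new,
			struct rb_node *parent, struct rb_root *root)
{
	if (parent) {
		if (parent->rb_left == old)
			rcu_assign_pointer(parent->rb_left, new);
		else
			rcu_assign_pointer(parent->rb_right, new);
	} else {
		rcu_assign_pointer(root->rb_node, new);
	}
}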
On 08/08, Oleg Nesterov wrote:
>
> On 07/31, Andrii Nakryiko wrote:
> >
> >  static DEFINE_RWLOCK(uprobes_treelock);	/* serialize rbtree access */
> > +static seqcount_rwlock_t uprobes_seqcount = SEQCNT_RWLOCK_ZERO(uprobes_seqcount, &uprobes_treelock);
>
> Just noticed... Why seqcount_rwlock_t?
>
> find_uprobe_rcu() doesn't use read_seqbegin_or_lock(),
> seqcount_t should work just fine.

Please ignore... I forgot that seqcount_t is not CONFIG_PREEMPT_RT-friendly.

Hmm. __seqprop_preemptible() returns 0, this doesn't look right...
Nevermind.

Oleg.
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index b0488d356399..d03962cc96de 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -40,6 +40,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
 #define no_uprobe_events()	RB_EMPTY_ROOT(&uprobes_tree)
 
 static DEFINE_RWLOCK(uprobes_treelock);	/* serialize rbtree access */
+static seqcount_rwlock_t uprobes_seqcount = SEQCNT_RWLOCK_ZERO(uprobes_seqcount, &uprobes_treelock);
 
 DEFINE_STATIC_SRCU(uprobes_srcu);
 
@@ -629,8 +630,11 @@ static void put_uprobe(struct uprobe *uprobe)
 
 	write_lock(&uprobes_treelock);
 
-	if (uprobe_is_active(uprobe))
+	if (uprobe_is_active(uprobe)) {
+		write_seqcount_begin(&uprobes_seqcount);
 		rb_erase(&uprobe->rb_node, &uprobes_tree);
+		write_seqcount_end(&uprobes_seqcount);
+	}
 
 	write_unlock(&uprobes_treelock);
 
@@ -696,14 +700,26 @@ static struct uprobe *find_uprobe_rcu(struct inode *inode, loff_t offset)
 		.offset = offset,
 	};
 	struct rb_node *node;
+	unsigned int seq;
 
 	lockdep_assert(srcu_read_lock_held(&uprobes_srcu));
 
-	read_lock(&uprobes_treelock);
-	node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
-	read_unlock(&uprobes_treelock);
+	do {
+		seq = read_seqcount_begin(&uprobes_seqcount);
+		node = rb_find_rcu(&key, &uprobes_tree, __uprobe_cmp_key);
+		/*
+		 * Lockless RB-tree lookups can result only in false negatives.
+		 * If the element is found, it is correct and can be returned
+		 * under RCU protection. If we find nothing, we need to
+		 * validate that seqcount didn't change. If it did, we have to
+		 * try again as we might have missed the element (false
+		 * negative). If seqcount is unchanged, search truly failed.
+		 */
+		if (node)
+			return __node_2_uprobe(node);
+	} while (read_seqcount_retry(&uprobes_seqcount, seq));
 
-	return node ? __node_2_uprobe(node) : NULL;
+	return NULL;
 }
 
 /*
@@ -725,7 +741,7 @@ static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
 {
 	struct rb_node *node;
 again:
-	node = rb_find_add(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
+	node = rb_find_add_rcu(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
 	if (node) {
 		struct uprobe *u = __node_2_uprobe(node);
 
@@ -750,7 +766,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
 	struct uprobe *u;
 
 	write_lock(&uprobes_treelock);
+	write_seqcount_begin(&uprobes_seqcount);
 	u = __insert_uprobe(uprobe);
+	write_seqcount_end(&uprobes_seqcount);
 	write_unlock(&uprobes_treelock);
 
 	return u;
Another big bottleneck to scalability is uprobe_treelock that's taken in
a very hot path in handle_swbp(). Now that uprobes are SRCU-protected,
take advantage of that and make uprobes_tree RB-tree lookups lockless.

To make RB-tree RCU-protected lockless lookups correct, we need to take
into account that such RB-tree lookups can return false negatives if
there are parallel RB-tree modifications (rotations) going on. We use a
seqcount lock to detect whether the RB-tree changed, and if we find
nothing while the RB-tree got modified in between, we just retry. If the
uprobe was found, then it's guaranteed to be a correct lookup.

With all the lock-avoiding changes done, we get a pretty decent
improvement in performance and scalability of uprobes with the number of
CPUs, even though we are still nowhere near linear scalability. This is
due to SRCU not really scaling very well with the number of CPUs on the
particular hardware that was used for testing (80-core Intel Xeon Gold
6138 CPU @ 2.00GHz), but also due to the remaining mmap_lock, which is
currently taken to resolve the interrupt address to inode+offset and
then to the uprobe instance. And, of course, uretprobes still need
similar RCU protection to avoid refcounting in the hot path, which will
be addressed in follow-up patches.

Nevertheless, the improvement is good. We used BPF selftest-based
uprobe-nop and uretprobe-nop benchmarks to get the numbers below, varying
the number of CPUs on which uprobes and uretprobes are triggered.

BASELINE
========
uprobe-nop      ( 1 cpus):    3.032 ± 0.023M/s  (  3.032M/s/cpu)
uprobe-nop      ( 2 cpus):    3.452 ± 0.005M/s  (  1.726M/s/cpu)
uprobe-nop      ( 4 cpus):    3.663 ± 0.005M/s  (  0.916M/s/cpu)
uprobe-nop      ( 8 cpus):    3.718 ± 0.038M/s  (  0.465M/s/cpu)
uprobe-nop      (16 cpus):    3.344 ± 0.008M/s  (  0.209M/s/cpu)
uprobe-nop      (32 cpus):    2.288 ± 0.021M/s  (  0.071M/s/cpu)
uprobe-nop      (64 cpus):    3.205 ± 0.004M/s  (  0.050M/s/cpu)

uretprobe-nop   ( 1 cpus):    1.979 ± 0.005M/s  (  1.979M/s/cpu)
uretprobe-nop   ( 2 cpus):    2.361 ± 0.005M/s  (  1.180M/s/cpu)
uretprobe-nop   ( 4 cpus):    2.309 ± 0.002M/s  (  0.577M/s/cpu)
uretprobe-nop   ( 8 cpus):    2.253 ± 0.001M/s  (  0.282M/s/cpu)
uretprobe-nop   (16 cpus):    2.007 ± 0.000M/s  (  0.125M/s/cpu)
uretprobe-nop   (32 cpus):    1.624 ± 0.003M/s  (  0.051M/s/cpu)
uretprobe-nop   (64 cpus):    2.149 ± 0.001M/s  (  0.034M/s/cpu)

SRCU CHANGES
============
uprobe-nop      ( 1 cpus):    3.276 ± 0.005M/s  (  3.276M/s/cpu)
uprobe-nop      ( 2 cpus):    4.125 ± 0.002M/s  (  2.063M/s/cpu)
uprobe-nop      ( 4 cpus):    7.713 ± 0.002M/s  (  1.928M/s/cpu)
uprobe-nop      ( 8 cpus):    8.097 ± 0.006M/s  (  1.012M/s/cpu)
uprobe-nop      (16 cpus):    6.501 ± 0.056M/s  (  0.406M/s/cpu)
uprobe-nop      (32 cpus):    4.398 ± 0.084M/s  (  0.137M/s/cpu)
uprobe-nop      (64 cpus):    6.452 ± 0.000M/s  (  0.101M/s/cpu)

uretprobe-nop   ( 1 cpus):    2.055 ± 0.001M/s  (  2.055M/s/cpu)
uretprobe-nop   ( 2 cpus):    2.677 ± 0.000M/s  (  1.339M/s/cpu)
uretprobe-nop   ( 4 cpus):    4.561 ± 0.003M/s  (  1.140M/s/cpu)
uretprobe-nop   ( 8 cpus):    5.291 ± 0.002M/s  (  0.661M/s/cpu)
uretprobe-nop   (16 cpus):    5.065 ± 0.019M/s  (  0.317M/s/cpu)
uretprobe-nop   (32 cpus):    3.622 ± 0.003M/s  (  0.113M/s/cpu)
uretprobe-nop   (64 cpus):    3.723 ± 0.002M/s  (  0.058M/s/cpu)

Peak throughput increased from 3.7 mln/s (uprobe triggerings) up to about
8 mln/s. For uretprobes it's a bit more modest, with a bump from 2.4 mln/s
to 5 mln/s.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 30 ++++++++++++++++++++++++------
 1 file changed, 24 insertions(+), 6 deletions(-)