
[bpf-next,3/3] uprobes: add speculative lockless system-wide uprobe filter check

Message ID 20240312210233.1941599-4-andrii@kernel.org
State Superseded
Series uprobes: two common case speed ups

Commit Message

Andrii Nakryiko March 12, 2024, 9:02 p.m. UTC
It's very common for BPF-based uprobe/uretprobe use cases to use
system-wide (not PID-specific) probes. In this case uprobe's
trace_uprobe_filter->nr_systemwide counter is bumped at registration
time, and the actual filtering is short-circuited at the time the
uprobe/uretprobe is triggered.
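
For context, a simplified sketch of the relevant filter state and the
existing lock-protected check, roughly following what is in
kernel/trace/trace_uprobe.c (unrelated details omitted):

	struct trace_uprobe_filter {
		rwlock_t		rwlock;
		int			nr_systemwide;	/* system-wide (non-PID) events */
		struct list_head	perf_events;	/* PID-specific events */
	};

	/* called with filter->rwlock held for reading */
	static bool __uprobe_perf_filter(struct trace_uprobe_filter *filter,
					 struct mm_struct *mm)
	{
		struct perf_event *event;

		/* any system-wide event means every task passes the filter */
		if (filter->nr_systemwide)
			return true;

		/* otherwise pass only if some event targets this mm */
		list_for_each_entry(event, &filter->perf_events, hw.tp_list) {
			if (event->hw.target->mm == mm)
				return true;
		}
		return false;
	}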

This is a great optimization, and the only issue with it is that to even
get to checking this counter the uprobe subsystem takes the read side of
trace_uprobe_filter->rwlock. This is actually noticeable in profiles and
is just another point of contention when a uprobe is triggered on
multiple CPUs simultaneously.

This patch adds a speculative check before grabbing that rwlock. If
nr_systemwide is non-zero, the lock is skipped and the event is passed
through. From examining the existing logic this looks correct and safe
to do. If nr_systemwide is being modified under the rwlock in parallel,
we have to consider basically just one important race condition: the
case when nr_systemwide is dropped from one to zero (by
trace_uprobe_filter_remove()) under filter->rwlock, but
uprobe_perf_filter() raced with it and saw it as >0.

In this case, we'll proceed with uprobe/uretprobe execution, while
uprobe_perf_close() and uprobe_apply() will be blocked trying to grab
uprobe->register_rwsem as writers. They will be blocked because
uprobe_dispatcher() (and, similarly, uretprobe_dispatcher()) runs with
uprobe->register_rwsem taken as a reader. So there is no real race beyond
the possibility that the uprobe/uretprobe executes one last time before
it's removed, which is fine because from the user space perspective the
uprobe/uretprobe hasn't been deactivated yet.
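
For reference, the reader/writer relationship this argument relies on looks
roughly like this (a simplified sketch; bodies elided):

	/* handler side: uprobe_dispatcher()/uretprobe_dispatcher() are invoked
	 * from handler_chain(), which holds the rwsem as a reader */
	down_read(&uprobe->register_rwsem);
	/* ... consumer handlers and filters run here ... */
	up_read(&uprobe->register_rwsem);

	/* (un)registration side: uprobe_perf_close()/uprobe_apply() paths take
	 * the same rwsem as a writer, so they wait for running handlers */
	down_write(&uprobe->register_rwsem);
	/* ... update consumers, re-evaluate filters, (un)install breakpoints ... */
	up_write(&uprobe->register_rwsem);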

In case we speculatively read nr_systemwide as zero while it was
incremented in parallel, we'll proceed to grab filter->rwlock and redo
the check, this time in a lock-protected and non-racy way.

As such, it looks safe to do a quick short-circuiting check to save some
performance in the very common system-wide case, without sacrificing hot
path performance for the much rarer possibility of uprobe registration or
unregistration.

Again, confirming the improvement with benchmarks based on BPF selftests.

BEFORE (based on changes in previous patch)
===========================================
uprobe-nop     :    2.732 ± 0.022M/s
uprobe-push    :    2.621 ± 0.016M/s
uprobe-ret     :    1.105 ± 0.007M/s
uretprobe-nop  :    1.396 ± 0.007M/s
uretprobe-push :    1.347 ± 0.008M/s
uretprobe-ret  :    0.800 ± 0.006M/s

AFTER
=====
uprobe-nop     :    2.878 ± 0.017M/s (+5.5%, total +8.3%)
uprobe-push    :    2.753 ± 0.013M/s (+5.3%, total +10.2%)
uprobe-ret     :    1.142 ± 0.010M/s (+3.8%, total +3.8%)
uretprobe-nop  :    1.444 ± 0.008M/s (+3.5%, total +6.5%)
uretprobe-push :    1.410 ± 0.010M/s (+4.8%, total +7.1%)
uretprobe-ret  :    0.816 ± 0.002M/s (+2.0%, total +3.9%)

In the above, the first percentage value is relative to the previous patch
(lazy uprobe buffer optimization), while the "total" percentage is
relative to the kernel without any of the changes in this patch set.

As can be seen, we get about a 4%-10% speedup, in total, with both the
lazy uprobe buffer and speculative filter check optimizations.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/trace/trace_uprobe.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Oleg Nesterov March 13, 2024, 1:19 p.m. UTC | #1
I forgot everything about this code, plus it has changed a lot since
I looked at it many years ago, but ...

I think this change is fine but the changelog looks a bit confusing
(overcomplicated) to me.

On 03/12, Andrii Nakryiko wrote:
>
> This patch adds a speculative check before grabbing that rwlock. If
> nr_systemwide is non-zero, the lock is skipped and the event is passed
> through. From examining the existing logic this looks correct and safe
> to do. If nr_systemwide is being modified under the rwlock in parallel,
> we have to consider basically just one important race condition: the
> case when nr_systemwide is dropped from one to zero (by
> trace_uprobe_filter_remove()) under filter->rwlock, but
> uprobe_perf_filter() raced with it and saw it as >0.

Unless I am totally confused, there is nothing new. Even without
this change trace_uprobe_filter_remove() can clear nr_systemwide
right after uprobe_perf_filter() drops filter->rwlock.

And of course, trace_uprobe_filter_add() can change nr_systemwide
from 0 to 1. In this case uprobe_perf_func() can "wrongly" return
UPROBE_HANDLER_REMOVE, but we can't avoid this, and afaics this is
fine: even if handler_chain() does unapply_uprobe(), uprobe_perf_open()
will do uprobe_apply() after that; we can rely on ->register_rwsem.

> In case we speculatively read nr_systemwide as zero while it was
> incremented in parallel, we'll proceed to grab filter->rwlock and redo
> the check, this time in a lock-protected and non-racy way.

See above...


So I think uprobe_perf_filter() needs filter->rwlock only to iterate
the list; it can check nr_systemwide locklessly, and this means that you
can also remove the same check from __uprobe_perf_filter(), since the
other callers, trace_uprobe_filter_add/remove, check it themselves.
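
For illustration, a rough sketch of what that could look like (argument
list abbreviated; assumes the nr_systemwide check is dropped from
__uprobe_perf_filter()):

	static bool uprobe_perf_filter(struct uprobe_consumer *uc,
				       struct mm_struct *mm /* other args omitted */)
	{
		struct trace_uprobe_filter *filter;
		struct trace_uprobe *tu;
		bool ret;

		tu = container_of(uc, struct trace_uprobe, consumer);
		filter = tu->tp.event->filter;

		/* no rwlock needed just to look at the counter */
		if (READ_ONCE(filter->nr_systemwide))
			return true;

		/* the rwlock is only needed to walk the perf_events list */
		read_lock(&filter->rwlock);
		ret = __uprobe_perf_filter(filter, mm);
		read_unlock(&filter->rwlock);

		return ret;
	}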


> --- a/kernel/trace/trace_uprobe.c
> +++ b/kernel/trace/trace_uprobe.c
> @@ -1351,6 +1351,10 @@ static bool uprobe_perf_filter(struct uprobe_consumer *uc,
>  	tu = container_of(uc, struct trace_uprobe, consumer);
>  	filter = tu->tp.event->filter;
>
> +	/* speculative check */
> +	if (READ_ONCE(filter->nr_systemwide))
> +		return true;
> +
>  	read_lock(&filter->rwlock);
>  	ret = __uprobe_perf_filter(filter, mm);
>  	read_unlock(&filter->rwlock);

ACK,

but see above. I think the changelog should be simplified and the
filter->nr_systemwide check in __uprobe_perf_filter() should be
removed. But I won't insist and perhaps I missed something...

Oleg.
Andrii Nakryiko March 13, 2024, 5:01 p.m. UTC | #2
On Wed, Mar 13, 2024 at 6:20 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> I forgot everything about this code, plus it has changed a lot since
> I looked at it many years ago, but ...
>
> I think this change is fine but the changelog looks a bit confusing
> (overcomplicated) to me.

It's a new piece of code and logic, so I tried to do my due diligence
and argue why I think it's fine. I'll drop the overcomplicated
explanation, as I agree with you that it's inherently racy even
without my changes (and use-after-free safety is provided by
uprobe->register_rwsem independently of all this).

>
> On 03/12, Andrii Nakryiko wrote:
> >
> > This patch adds a speculative check before grabbing that rwlock. If
> > nr_systemwide is non-zero, the lock is skipped and the event is passed
> > through. From examining the existing logic this looks correct and safe
> > to do. If nr_systemwide is being modified under the rwlock in parallel,
> > we have to consider basically just one important race condition: the
> > case when nr_systemwide is dropped from one to zero (by
> > trace_uprobe_filter_remove()) under filter->rwlock, but
> > uprobe_perf_filter() raced with it and saw it as >0.
>
> Unless I am totally confused, there is nothing new. Even without
> this change trace_uprobe_filter_remove() can clear nr_systemwide
> right after uprobe_perf_filter() drops filter->rwlock.
>
> And of course, trace_uprobe_filter_add() can change nr_systemwide
> from 0 to 1. In this case uprobe_perf_func() can "wrongly" return
> UPROBE_HANDLER_REMOVE, but we can't avoid this, and afaics this is
> fine: even if handler_chain() does unapply_uprobe(), uprobe_perf_open()
> will do uprobe_apply() after that; we can rely on ->register_rwsem.
>

yep, agreed

> > In case we speculatively read nr_systemwide as zero while it was
> > incremented in parallel, we'll proceed to grab filter->rwlock and redo
> > the check, this time in a lock-protected and non-racy way.
>
> See above...
>
>
> So I think uprobe_perf_filter() needs filter->rwlock only to iterate
> the list; it can check nr_systemwide locklessly, and this means that you
> can also remove the same check from __uprobe_perf_filter(), since the
> other callers, trace_uprobe_filter_add/remove, check it themselves.
>

makes sense, will do

>
> > --- a/kernel/trace/trace_uprobe.c
> > +++ b/kernel/trace/trace_uprobe.c
> > @@ -1351,6 +1351,10 @@ static bool uprobe_perf_filter(struct uprobe_consumer *uc,
> >       tu = container_of(uc, struct trace_uprobe, consumer);
> >       filter = tu->tp.event->filter;
> >
> > +     /* speculative check */
> > +     if (READ_ONCE(filter->nr_systemwide))
> > +             return true;
> > +
> >       read_lock(&filter->rwlock);
> >       ret = __uprobe_perf_filter(filter, mm);
> >       read_unlock(&filter->rwlock);
>
> ACK,
>
> but see above. I think the changelog should be simplified and the
> filter->nr_systemwide check in __uprobe_perf_filter() should be
> removed. But I won't insist and perhaps I missed something...
>

I think you are right, I'll move the check

> Oleg.
>

Patch

diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index f2875349d124..be28e6d0578e 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1351,6 +1351,10 @@  static bool uprobe_perf_filter(struct uprobe_consumer *uc,
 	tu = container_of(uc, struct trace_uprobe, consumer);
 	filter = tu->tp.event->filter;
 
+	/* speculative check */
+	if (READ_ONCE(filter->nr_systemwide))
+		return true;
+
 	read_lock(&filter->rwlock);
 	ret = __uprobe_perf_filter(filter, mm);
 	read_unlock(&filter->rwlock);