[RFC] tracing: Fix syscall tracepoint use-after-free

Message ID 20241022151804.284424-1-mathieu.desnoyers@efficios.com (mailing list archive)
State Superseded
Series: [RFC] tracing: Fix syscall tracepoint use-after-free

Commit Message

Mathieu Desnoyers Oct. 22, 2024, 3:18 p.m. UTC
The grace period used internally within tracepoint.c:release_probes()
uses call_rcu() to batch waiting for quiescence of old probe arrays,
rather than using the tracepoint_synchronize_unregister() which blocks
while waiting for quiescence.

This causes use-after-free issues reproduced with syzkaller.

Fix this by introducing the following call_rcu chaining:

   call_rcu() -> rcu_free_old_probes -> call_rcu_tasks_trace() -> rcu_tasks_trace_free_old_probes.

This mirrors what was done when SRCU was used.
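
Condensed, the resulting reclaim path in kernel/tracepoint.c looks like
this (a sketch of the diff shown at the bottom of this page):

   static void rcu_tasks_trace_free_old_probes(struct rcu_head *head)
   {
           /* Both grace periods have now elapsed; free the old probe array. */
           kfree(container_of(head, struct tp_probes, rcu));
   }

   static void rcu_free_old_probes(struct rcu_head *head)
   {
           /* The classic RCU grace period has elapsed; chain an RCU Tasks
            * Trace grace period before the final kfree().
            */
           call_rcu_tasks_trace(head, rcu_tasks_trace_free_old_probes);
   }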

Reported-by: syzbot+b390c8062d8387b6272a@syzkaller.appspotmail.com
Fixes: a363d27cdbc2 ("tracing: Allow system call tracepoints to handle page faults")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michael Jeanson <mjeanson@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Jordan Rife <jrife@google.com>
---
 kernel/tracepoint.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Comments

Jordan Rife Oct. 22, 2024, 4:14 p.m. UTC | #1
I assume this patch isn't meant to fix the related issues with freeing
BPF programs/links with call_rcu?

On the BPF side I think there needs to be some smarter handling of
when to use call_rcu or call_rcu_tasks_trace to free links/programs
based on whether or not the program type can be executed in this
context. Right now call_rcu_tasks_trace is used if the program is
sleepable, but that isn't necessarily the case here. Off the top of my
head this would be BPF_PROG_TYPE_RAW_TRACEPOINT and
BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, but may extend to
BPF_PROG_TYPE_TRACEPOINT? I'll let some of the BPF folks chime in
here, as I'm not entirely sure.

-Jordan
Mathieu Desnoyers Oct. 22, 2024, 5:54 p.m. UTC | #2
On 2024-10-22 12:14, Jordan Rife wrote:
> I assume this patch isn't meant to fix the related issues with freeing
> BPF programs/links with call_rcu?

No, indeed. I notice that bpf_link_free() uses a prog->sleepable flag to
choose between:

                 if (sleepable)
                         call_rcu_tasks_trace(&link->rcu, bpf_link_defer_dealloc_mult_rcu_gp);
                 else
                         call_rcu(&link->rcu, bpf_link_defer_dealloc_rcu_gp);

But the faultable syscall tracepoint series does not require syscall programs
to be sleepable. So some changes may be needed on the eBPF side there.

> 
> On the BPF side I think there needs to be some smarter handling of
> when to use call_rcu or call_rcu_tasks_trace to free links/programs
> based on whether or not the program type can be executed in this
> context. Right now call_rcu_tasks_trace is used if the program is
> sleepable, but that isn't necessarily the case here. Off the top of my
> head this would be BPF_PROG_TYPE_RAW_TRACEPOINT and
> BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, but may extend to
> BPF_PROG_TYPE_TRACEPOINT? I'll let some of the BPF folks chime in
> here, as I'm not entirely sure.

A big-hammer solution would be to chain call_rcu and call_rcu_tasks_trace
for all grace periods waited for after a bpf tracepoint probe unregister.

Else, if we properly tag all programs attached to syscall tracepoints as
sleepable, then keeping the call_rcu_tasks_trace() only for those would
work.

Thanks,

Mathieu
Andrii Nakryiko Oct. 22, 2024, 7:53 p.m. UTC | #3
On Tue, Oct 22, 2024 at 10:55 AM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> On 2024-10-22 12:14, Jordan Rife wrote:
> > I assume this patch isn't meant to fix the related issues with freeing
> > BPF programs/links with call_rcu?
>
> No, indeed. I notice that bpf_link_free() uses a prog->sleepable flag to
> choose between:
>
>                  if (sleepable)
>                          call_rcu_tasks_trace(&link->rcu, bpf_link_defer_dealloc_mult_rcu_gp);
>                  else
>                          call_rcu(&link->rcu, bpf_link_defer_dealloc_rcu_gp);
>
> But the faultable syscall tracepoint series does not require syscall programs
> to be sleepable. So some changes may be needed on the eBPF side there.

Your fix now adds a chain of call_rcu -> call_rcu_tasks_trace ->
kfree, which should work regardless of sleepable/non-sleepable. For
the BPF-side, yes, we do different things depending on prog->sleepable
(adding extra call_rcu_tasks_trace for sleepable, while still keeping
call_rcu in the chain), so the BPF side should be good, I think.

>
> >
> > On the BPF side I think there needs to be some smarter handling of
> > when to use call_rcu or call_rcu_tasks_trace to free links/programs
> > based on whether or not the program type can be executed in this
> > context. Right now call_rcu_tasks_trace is used if the program is
> > sleepable, but that isn't necessarily the case here. Off the top of my
> > head this would be BPF_PROG_TYPE_RAW_TRACEPOINT and
> > BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, but may extend to
> > BPF_PROG_TYPE_TRACEPOINT? I'll let some of the BPF folks chime in
> > here, as I'm not entirely sure.
>

From the BPF standpoint, as of right now, neither RAW_TRACEPOINT nor
TRACEPOINT programs are sleepable. So a single RCU grace period is
fine. But even if they were (and we'll allow that later on), we handle
sleepable programs with the same call_rcu_tasks_trace -> call_rcu
chain.
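
(For reference, that chain is the existing
bpf_link_defer_dealloc_mult_rcu_gp() callback used in the bpf_link_free()
snippet quoted above -- roughly:

    static void bpf_link_defer_dealloc_mult_rcu_gp(struct rcu_head *rcu)
    {
            /* If the RCU Tasks Trace grace period also implies a classic
             * RCU grace period, free right away; otherwise chain a
             * call_rcu() before the final dealloc.
             */
            if (rcu_trace_implies_rcu_gp())
                    bpf_link_defer_dealloc_rcu_gp(rcu);
            else
                    call_rcu(rcu, bpf_link_defer_dealloc_rcu_gp);
    }
)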

That's just to say that I don't think that we need any BPF-specific
fix beyond what Mathieu is doing in this patch, so:

Acked-by: Andrii Nakryiko <andrii@kernel.org>


> A big-hammer solution would be to chain call_rcu and call_rcu_tasks_trace
> for all grace periods waited for after a bpf tracepoint probe unregister.
>
> Else, if we properly tag all programs attached to syscall tracepoints as
> sleepable, then keeping the call_rcu_tasks_trace() only for those would
> work.
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>
Mathieu Desnoyers Oct. 22, 2024, 8:04 p.m. UTC | #4
On 2024-10-22 15:53, Andrii Nakryiko wrote:
> On Tue, Oct 22, 2024 at 10:55 AM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> On 2024-10-22 12:14, Jordan Rife wrote:
>>> I assume this patch isn't meant to fix the related issues with freeing
>>> BPF programs/links with call_rcu?
>>
>> No, indeed. I notice that bpf_link_free() uses a prog->sleepable flag to
>> choose between:
>>
>>                   if (sleepable)
>>                           call_rcu_tasks_trace(&link->rcu, bpf_link_defer_dealloc_mult_rcu_gp);
>>                   else
>>                           call_rcu(&link->rcu, bpf_link_defer_dealloc_rcu_gp);
>>
>> But the faultable syscall tracepoint series does not require syscall programs
>> to be sleepable. So some changes may be needed on the eBPF side there.
> 
> Your fix now adds a chain of call_rcu -> call_rcu_tasks_trace ->
> kfree, which should work regardless of sleepable/non-sleepable. For
> the BPF-side, yes, we do different things depending on prog->sleepable
> (adding extra call_rcu_tasks_trace for sleepable, while still keeping
> call_rcu in the chain), so the BPF side should be good, I think.
> 
>>
>>>
>>> On the BPF side I think there needs to be some smarter handling of
>>> when to use call_rcu or call_rcu_tasks_trace to free links/programs
>>> based on whether or not the program type can be executed in this
>>> context. Right now call_rcu_tasks_trace is used if the program is
>>> sleepable, but that isn't necessarily the case here. Off the top of my
>>> head this would be BPF_PROG_TYPE_RAW_TRACEPOINT and
>>> BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, but may extend to
>>> BPF_PROG_TYPE_TRACEPOINT? I'll let some of the BPF folks chime in
>>> here, as I'm not entirely sure.
>>
> 
>  From the BPF standpoint, as of right now, neither RAW_TRACEPOINT nor
> TRACEPOINT programs are sleepable. So a single RCU grace period is
> fine. But even if they were (and we'll allow that later on), we handle
> sleepable programs with the same call_rcu_tasks_trace -> call_rcu
> chain.

Good points. In commit 4aadde89d8 ("tracing/bpf: disable preemption in
syscall probe"), I took care to disable preemption around use of the bpf
program attached to a syscall tracepoint, which makes this change a
no-op from the tracers' perspective.

It's only when you decide to remove this preempt-off region and allow
syscall tracepoints to sleep in bpf that you'll need to tweak this.

> 
> That's just to say that I don't think that we need any BPF-specific
> fix beyond what Mathieu is doing in this patch, so:
> 
> Acked-by: Andrii Nakryiko <andrii@kernel.org>

Thanks!

Mathieu

> 
> 
>> A big-hammer solution would be to chain call_rcu and call_rcu_tasks_trace
>> for all grace periods waited for after a bpf tracepoint probe unregister.
>>
>> Else, if we properly tag all programs attached to syscall tracepoints as
>> sleepable, then keeping the call_rcu_tasks_trace() only for those would
>> work.
>>
>> Thanks,
>>
>> Mathieu
>>
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> https://www.efficios.com
>>
Steven Rostedt Oct. 23, 2024, 12:20 a.m. UTC | #5
On Tue, 22 Oct 2024 16:04:49 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> > 
> > That's just to say that I don't think that we need any BPF-specific
> > fix beyond what Mathieu is doing in this patch, so:
> > 
> > Acked-by: Andrii Nakryiko <andrii@kernel.org>  
> 
> Thanks!

Does this mean I can pull this patch as is? I don't usually take RFCs.

-- Steve
Mathieu Desnoyers Oct. 23, 2024, 1:11 a.m. UTC | #6
On 2024-10-22 20:20, Steven Rostedt wrote:
> On Tue, 22 Oct 2024 16:04:49 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>>>
>>> That's just to say that I don't think that we need any BPF-specific
>>> fix beyond what Mathieu is doing in this patch, so:
>>>
>>> Acked-by: Andrii Nakryiko <andrii@kernel.org>
>>
>> Thanks!
> 
> Does this mean I can pull this patch as is? I don't usually take RFCs.

It's only compile-tested on my end, so we should at least get a
Tested-by before I feel confident posting this as a non-RFC patch.

Thanks,

Mathieu
Jordan Rife Oct. 23, 2024, 1:24 a.m. UTC | #7
> so we should at least get a Tested-by

I could apply this and run the syz repro script on my end if that
helps, since I already have everything set up.
Jordan Rife Oct. 23, 2024, 2:56 p.m. UTC | #8
Mathieu's patch alone does not seem to be enough to prevent the
use-after-free issue reported by syzbot.

Link: https://lore.kernel.org/bpf/67121037.050a0220.10f4f4.000f.GAE@google.com/T/#u

I reran the repro script with his patch applied to my tree and was
still able to get the same KASAN crash to occur.

In this case, when bpf_link_free is invoked it kicks off three instances
of call_rcu*.

bpf_link_free()
  ops->release()
     bpf_raw_tp_link_release()
       bpf_probe_unregister()
         tracepoint_probe_unregister()
           tracepoint_remove_func()
             release_probes()
               call_rcu()               [1]
  bpf_prog_put()
    __bpf_prog_put()
      bpf_prog_put_deferred()
        __bpf_prog_put_noref()
           call_rcu()                   [2]
  call_rcu()                            [3]

With Mathieu's patch, [1] is chained with call_rcu_tasks_trace(),
making the grace period sufficiently long to safely free the probe itself.
The callbacks for [2] and [3] may be invoked before the
call_rcu_tasks_trace() grace period has elapsed, leading to the link or
program itself being freed while still in use. I was able to prevent
any crashes with the patch below, which also chains
call_rcu_tasks_trace() and call_rcu() at [2] and [3].

---
 kernel/bpf/syscall.c | 24 ++++++++++--------------
 1 file changed, 10 insertions(+), 14 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 59de664e580d..5290eccb465e 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2200,6 +2200,14 @@ static void __bpf_prog_put_rcu(struct rcu_head *rcu)
 	bpf_prog_free(aux->prog);
 }
 
+static void __bpf_prog_put_tasks_trace_rcu(struct rcu_head *rcu)
+{
+	if (rcu_trace_implies_rcu_gp())
+		__bpf_prog_put_rcu(rcu);
+	else
+		call_rcu(rcu, __bpf_prog_put_rcu);
+}
+
 static void __bpf_prog_put_noref(struct bpf_prog *prog, bool deferred)
 {
 	bpf_prog_kallsyms_del_all(prog);
@@ -2212,10 +2220,7 @@ static void __bpf_prog_put_noref(struct bpf_prog *prog, bool deferred)
 		btf_put(prog->aux->attach_btf);
 
 	if (deferred) {
-		if (prog->sleepable)
-			call_rcu_tasks_trace(&prog->aux->rcu, __bpf_prog_put_rcu);
-		else
-			call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu);
+		call_rcu_tasks_trace(&prog->aux->rcu, __bpf_prog_put_tasks_trace_rcu);
 	} else {
 		__bpf_prog_put_rcu(&prog->aux->rcu);
 	}
@@ -2996,24 +3001,15 @@ static void bpf_link_defer_dealloc_mult_rcu_gp(struct rcu_head *rcu)
 static void bpf_link_free(struct bpf_link *link)
 {
 	const struct bpf_link_ops *ops = link->ops;
-	bool sleepable = false;
 
 	bpf_link_free_id(link->id);
 	if (link->prog) {
-		sleepable = link->prog->sleepable;
 		/* detach BPF program, clean up used resources */
 		ops->release(link);
 		bpf_prog_put(link->prog);
 	}
 	if (ops->dealloc_deferred) {
-		/* schedule BPF link deallocation; if underlying BPF program
-		 * is sleepable, we need to first wait for RCU tasks trace
-		 * sync, then go through "classic" RCU grace period
-		 */
-		if (sleepable)
-			call_rcu_tasks_trace(&link->rcu, bpf_link_defer_dealloc_mult_rcu_gp);
-		else
-			call_rcu(&link->rcu, bpf_link_defer_dealloc_rcu_gp);
+		call_rcu_tasks_trace(&link->rcu, bpf_link_defer_dealloc_mult_rcu_gp);
 	} else if (ops->dealloc)
 		ops->dealloc(link);
 }
Mathieu Desnoyers Oct. 23, 2024, 3:13 p.m. UTC | #9
On 2024-10-23 10:56, Jordan Rife wrote:
> Mathieu's patch alone does not seem to be enough to prevent the
> use-after-free issue reported by syzbot.
> 
> Link: https://lore.kernel.org/bpf/67121037.050a0220.10f4f4.000f.GAE@google.com/T/#u
> 
> I reran the repro script with his patch applied to my tree and was
> still able to get the same KASAN crash to occur.
> 
> In this case, when bpf_link_free is invoked it kicks off three instances
> of call_rcu*.
> 
> bpf_link_free()
>    ops->release()
>       bpf_raw_tp_link_release()
>         bpf_probe_unregister()
>           tracepoint_probe_unregister()
>             tracepoint_remove_func()
>               release_probes()
>                 call_rcu()               [1]
>    bpf_prog_put()
>      __bpf_prog_put()
>        bpf_prog_put_deferred()
>          __bpf_prog_put_noref()
>             call_rcu()                   [2]
>    call_rcu()                            [3]
> 
> With Mathieu's patch, [1] is chained with call_rcu_tasks_trace(),
> making the grace period sufficiently long to safely free the probe itself.
> The callbacks for [2] and [3] may be invoked before the
> call_rcu_tasks_trace() grace period has elapsed, leading to the link or
> program itself being freed while still in use. I was able to prevent
> any crashes with the patch below, which also chains
> call_rcu_tasks_trace() and call_rcu() at [2] and [3].

Right, so removal of the tracepoint probe is done by
tracepoint_probe_unregister() by effectively removing the
probe function from the array. The read-side counterpart
of that is in __DO_TRACE(), where the RCU dereference is
now protected by rcu_read_lock_trace for syscall
tracepoints.

We cannot expect that surrounding the eBPF probe execution
with preempt disable like so:

#define __BPF_DECLARE_TRACE_SYSCALL(call, proto, args)                  \
static notrace void                                                     \
__bpf_trace_##call(void *__data, proto)                                 \
{                                                                       \
         might_fault();                                                  \
         preempt_disable_notrace();                                      \
         CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(__data, CAST_TO_U64(args));        \
         preempt_enable_notrace();                                       \
}

is sufficient to delay reclaim with call_rcu() after a tracepoint
unregister, because the preempt-disable critical section does not
include the RCU dereference done by the tracepoint.

So relying on a call_rcu() to delay reclaim of the bpf objects
after unregistering their associated tracepoint is indeed not
enough. Chaining call_rcu with call_rcu_tasks_trace works though.

That question is relevant for ftrace and perf too: are there data
structures that are reclaimed with call_rcu() after being unregistered
from syscall tracepoints?

Thanks Jordan for your thorough analysis,

Mathieu

> 
> ---
>   kernel/bpf/syscall.c | 24 ++++++++++--------------
>   1 file changed, 10 insertions(+), 14 deletions(-)
> 
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 59de664e580d..5290eccb465e 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2200,6 +2200,14 @@ static void __bpf_prog_put_rcu(struct rcu_head *rcu)
>   	bpf_prog_free(aux->prog);
>   }
>   
> +static void __bpf_prog_put_tasks_trace_rcu(struct rcu_head *rcu)
> +{
> +	if (rcu_trace_implies_rcu_gp())
> +		__bpf_prog_put_rcu(rcu);
> +	else
> +		call_rcu(rcu, __bpf_prog_put_rcu);
> +}
> +
>   static void __bpf_prog_put_noref(struct bpf_prog *prog, bool deferred)
>   {
>   	bpf_prog_kallsyms_del_all(prog);
> @@ -2212,10 +2220,7 @@ static void __bpf_prog_put_noref(struct bpf_prog *prog, bool deferred)
>   		btf_put(prog->aux->attach_btf);
>   
>   	if (deferred) {
> -		if (prog->sleepable)
> -			call_rcu_tasks_trace(&prog->aux->rcu, __bpf_prog_put_rcu);
> -		else
> -			call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu);
> +		call_rcu_tasks_trace(&prog->aux->rcu, __bpf_prog_put_tasks_trace_rcu);
>   	} else {
>   		__bpf_prog_put_rcu(&prog->aux->rcu);
>   	}
> @@ -2996,24 +3001,15 @@ static void bpf_link_defer_dealloc_mult_rcu_gp(struct rcu_head *rcu)
>   static void bpf_link_free(struct bpf_link *link)
>   {
>   	const struct bpf_link_ops *ops = link->ops;
> -	bool sleepable = false;
>   
>   	bpf_link_free_id(link->id);
>   	if (link->prog) {
> -		sleepable = link->prog->sleepable;
>   		/* detach BPF program, clean up used resources */
>   		ops->release(link);
>   		bpf_prog_put(link->prog);
>   	}
>   	if (ops->dealloc_deferred) {
> -		/* schedule BPF link deallocation; if underlying BPF program
> -		 * is sleepable, we need to first wait for RCU tasks trace
> -		 * sync, then go through "classic" RCU grace period
> -		 */
> -		if (sleepable)
> -			call_rcu_tasks_trace(&link->rcu, bpf_link_defer_dealloc_mult_rcu_gp);
> -		else
> -			call_rcu(&link->rcu, bpf_link_defer_dealloc_rcu_gp);
> +		call_rcu_tasks_trace(&link->rcu, bpf_link_defer_dealloc_mult_rcu_gp);
>   	} else if (ops->dealloc)
>   		ops->dealloc(link);
>   }
Alexei Starovoitov Oct. 23, 2024, 3:14 p.m. UTC | #10
On Wed, Oct 23, 2024 at 7:56 AM Jordan Rife <jrife@google.com> wrote:
>
> Mathieu's patch alone does not seem to be enough to prevent the
> use-after-free issue reported by syzbot.
>
> Link: https://lore.kernel.org/bpf/67121037.050a0220.10f4f4.000f.GAE@google.com/T/#u
>
> I reran the repro script with his patch applied to my tree and was
> still able to get the same KASAN crash to occur.
>
> In this case, when bpf_link_free is invoked it kicks off three instances
> of call_rcu*.
>
> bpf_link_free()
>   ops->release()
>      bpf_raw_tp_link_release()
>        bpf_probe_unregister()
>          tracepoint_probe_unregister()
>            tracepoint_remove_func()
>              release_probes()
>                call_rcu()               [1]
>   bpf_prog_put()
>     __bpf_prog_put()
>       bpf_prog_put_deferred()
>         __bpf_prog_put_noref()
>            call_rcu()                   [2]
>   call_rcu()                            [3]
>
> With Mathieu's patch, [1] is chained with call_rcu_tasks_trace(),
> making the grace period sufficiently long to safely free the probe itself.
> The callbacks for [2] and [3] may be invoked before the
> call_rcu_tasks_trace() grace period has elapsed, leading to the link or
> program itself being freed while still in use. I was able to prevent
> any crashes with the patch below, which also chains
> call_rcu_tasks_trace() and call_rcu() at [2] and [3].
>
> ---
>  kernel/bpf/syscall.c | 24 ++++++++++--------------
>  1 file changed, 10 insertions(+), 14 deletions(-)
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 59de664e580d..5290eccb465e 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2200,6 +2200,14 @@ static void __bpf_prog_put_rcu(struct rcu_head *rcu)
>         bpf_prog_free(aux->prog);
>  }
>
> +static void __bpf_prog_put_tasks_trace_rcu(struct rcu_head *rcu)
> +{
> +       if (rcu_trace_implies_rcu_gp())
> +               __bpf_prog_put_rcu(rcu);
> +       else
> +               call_rcu(rcu, __bpf_prog_put_rcu);
> +}
> +
>  static void __bpf_prog_put_noref(struct bpf_prog *prog, bool deferred)
>  {
>         bpf_prog_kallsyms_del_all(prog);
> @@ -2212,10 +2220,7 @@ static void __bpf_prog_put_noref(struct bpf_prog *prog, bool deferred)
>                 btf_put(prog->aux->attach_btf);
>
>         if (deferred) {
> -               if (prog->sleepable)
> -                       call_rcu_tasks_trace(&prog->aux->rcu, __bpf_prog_put_rcu);
> -               else
> -                       call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu);
> +               call_rcu_tasks_trace(&prog->aux->rcu, __bpf_prog_put_tasks_trace_rcu);
>         } else {
>                 __bpf_prog_put_rcu(&prog->aux->rcu);
>         }
> @@ -2996,24 +3001,15 @@ static void bpf_link_defer_dealloc_mult_rcu_gp(struct rcu_head *rcu)
>  static void bpf_link_free(struct bpf_link *link)
>  {
>         const struct bpf_link_ops *ops = link->ops;
> -       bool sleepable = false;
>
>         bpf_link_free_id(link->id);
>         if (link->prog) {
> -               sleepable = link->prog->sleepable;
>                 /* detach BPF program, clean up used resources */
>                 ops->release(link);
>                 bpf_prog_put(link->prog);
>         }
>         if (ops->dealloc_deferred) {
> -               /* schedule BPF link deallocation; if underlying BPF program
> -                * is sleepable, we need to first wait for RCU tasks trace
> -                * sync, then go through "classic" RCU grace period
> -                */
> -               if (sleepable)
> -                       call_rcu_tasks_trace(&link->rcu, bpf_link_defer_dealloc_mult_rcu_gp);
> -               else
> -                       call_rcu(&link->rcu, bpf_link_defer_dealloc_rcu_gp);
> +               call_rcu_tasks_trace(&link->rcu, bpf_link_defer_dealloc_mult_rcu_gp);

This patch is completely wrong.
Looks like Mathieu's patch broke the bpf program contract somewhere.
The tracepoint type bpf programs must be called with rcu_read_lock held.
Looks like it's not happening anymore.
Mathieu Desnoyers Oct. 23, 2024, 3:19 p.m. UTC | #11
On 2024-10-23 11:14, Alexei Starovoitov wrote:
> On Wed, Oct 23, 2024 at 7:56 AM Jordan Rife <jrife@google.com> wrote:
>>
>> Mathieu's patch alone does not seem to be enough to prevent the
>> use-after-free issue reported by syzbot.
>>
>> Link: https://lore.kernel.org/bpf/67121037.050a0220.10f4f4.000f.GAE@google.com/T/#u
>>
>> I reran the repro script with his patch applied to my tree and was
>> still able to get the same KASAN crash to occur.
>>
>> In this case, when bpf_link_free is invoked it kicks off three instances
>> of call_rcu*.
>>
>> bpf_link_free()
>>    ops->release()
>>       bpf_raw_tp_link_release()
>>         bpf_probe_unregister()
>>           tracepoint_probe_unregister()
>>             tracepoint_remove_func()
>>               release_probes()
>>                 call_rcu()               [1]
>>    bpf_prog_put()
>>      __bpf_prog_put()
>>        bpf_prog_put_deferred()
>>          __bpf_prog_put_noref()
>>             call_rcu()                   [2]
>>    call_rcu()                            [3]
>>
>> With Mathieu's patch, [1] is chained with call_rcu_tasks_trace(),
>> making the grace period sufficiently long to safely free the probe itself.
>> The callbacks for [2] and [3] may be invoked before the
>> call_rcu_tasks_trace() grace period has elapsed, leading to the link or
>> program itself being freed while still in use. I was able to prevent
>> any crashes with the patch below, which also chains
>> call_rcu_tasks_trace() and call_rcu() at [2] and [3].
>>
>> ---
>>   kernel/bpf/syscall.c | 24 ++++++++++--------------
>>   1 file changed, 10 insertions(+), 14 deletions(-)
>>
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index 59de664e580d..5290eccb465e 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -2200,6 +2200,14 @@ static void __bpf_prog_put_rcu(struct rcu_head *rcu)
>>          bpf_prog_free(aux->prog);
>>   }
>>
>> +static void __bpf_prog_put_tasks_trace_rcu(struct rcu_head *rcu)
>> +{
>> +       if (rcu_trace_implies_rcu_gp())
>> +               __bpf_prog_put_rcu(rcu);
>> +       else
>> +               call_rcu(rcu, __bpf_prog_put_rcu);
>> +}
>> +
>>   static void __bpf_prog_put_noref(struct bpf_prog *prog, bool deferred)
>>   {
>>          bpf_prog_kallsyms_del_all(prog);
>> @@ -2212,10 +2220,7 @@ static void __bpf_prog_put_noref(struct bpf_prog *prog, bool deferred)
>>                  btf_put(prog->aux->attach_btf);
>>
>>          if (deferred) {
>> -               if (prog->sleepable)
>> -                       call_rcu_tasks_trace(&prog->aux->rcu, __bpf_prog_put_rcu);
>> -               else
>> -                       call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu);
>> +               call_rcu_tasks_trace(&prog->aux->rcu, __bpf_prog_put_tasks_trace_rcu);
>>          } else {
>>                  __bpf_prog_put_rcu(&prog->aux->rcu);
>>          }
>> @@ -2996,24 +3001,15 @@ static void bpf_link_defer_dealloc_mult_rcu_gp(struct rcu_head *rcu)
>>   static void bpf_link_free(struct bpf_link *link)
>>   {
>>          const struct bpf_link_ops *ops = link->ops;
>> -       bool sleepable = false;
>>
>>          bpf_link_free_id(link->id);
>>          if (link->prog) {
>> -               sleepable = link->prog->sleepable;
>>                  /* detach BPF program, clean up used resources */
>>                  ops->release(link);
>>                  bpf_prog_put(link->prog);
>>          }
>>          if (ops->dealloc_deferred) {
>> -               /* schedule BPF link deallocation; if underlying BPF program
>> -                * is sleepable, we need to first wait for RCU tasks trace
>> -                * sync, then go through "classic" RCU grace period
>> -                */
>> -               if (sleepable)
>> -                       call_rcu_tasks_trace(&link->rcu, bpf_link_defer_dealloc_mult_rcu_gp);
>> -               else
>> -                       call_rcu(&link->rcu, bpf_link_defer_dealloc_rcu_gp);
>> +               call_rcu_tasks_trace(&link->rcu, bpf_link_defer_dealloc_mult_rcu_gp);
> 
> This patch is completely wrong.

Actually I suspect Jordan's patch works.

> Looks like Mathieu's patch broke the bpf program contract somewhere.

My patch series introduced this in the probe:

#define __BPF_DECLARE_TRACE_SYSCALL(call, proto, args)                  \
static notrace void                                                     \
__bpf_trace_##call(void *__data, proto)                                 \
{                                                                       \
         might_fault();                                                  \
         preempt_disable_notrace();                                      \
         CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(__data, CAST_TO_U64(args));        \
         preempt_enable_notrace();                                       \
}

To ensure we'd call the bpf program from preempt-off context.

> The tracepoint type bpf programs must be called with rcu_read_lock held.

Calling the bpf program with preempt off is equivalent. __DO_TRACE() calls
the probes with preempt_disable_notrace() in the normal case.

> Looks like it's not happening anymore.

The issue here is not about the context in which the bpf program runs; that's
still preempt-off. The problem is the expectation that a call_rcu grace period
is enough to delay reclaim after unregistration of the tracepoint. Now that
__DO_TRACE() uses rcu_read_lock_trace() to protect the RCU dereference, it's not
sufficient anymore, at least for syscall tracepoints.
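
(For context, the faultable-tracepoints series selects the read-side
guard in __DO_TRACE() along these lines -- a sketch, not the verbatim
macro:

    /* sketch: __DO_TRACE() read-side protection after the series */
    if (syscall)
            rcu_read_lock_trace();          /* probes may take page faults */
    else
            preempt_disable_notrace();
    /* ... rcu dereference of the probe array, probe calls ... */
    if (syscall)
            rcu_read_unlock_trace();
    else
            preempt_enable_notrace();
)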

Thanks,

Mathieu
Steven Rostedt Oct. 24, 2024, 12:40 a.m. UTC | #12
On Wed, 23 Oct 2024 11:13:53 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> That question is relevant for ftrace and perf too: are there data
> structures that are reclaimed with call_rcu() after being unregistered
> from syscall tracepoints?

ftrace doesn't assume RCU for when to synchronize with tracepoints
being unregistered. It uses:

  tracepoint_synchronize_unregister()
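
(tracepoint_synchronize_unregister() blocks until the tracepoint
read-side grace periods have elapsed, so the caller can free its probe
data directly afterwards -- a sketch of the pattern, with "data" standing
in for whatever the probe dereferences:

    tracepoint_probe_unregister(tp, probe, data);
    tracepoint_synchronize_unregister(); /* blocks for the grace period(s) */
    kfree(data);                         /* safe: no probe can still run */
)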

-- Steve
Alexei Starovoitov Oct. 24, 2024, 1:28 a.m. UTC | #13
On Wed, Oct 23, 2024 at 8:21 AM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> On 2024-10-23 11:14, Alexei Starovoitov wrote:
> > On Wed, Oct 23, 2024 at 7:56 AM Jordan Rife <jrife@google.com> wrote:
> >>
> >> Mathieu's patch alone does not seem to be enough to prevent the
> >> use-after-free issue reported by syzbot.
> >>
> >> Link: https://lore.kernel.org/bpf/67121037.050a0220.10f4f4.000f.GAE@google.com/T/#u
> >>
> >> I reran the repro script with his patch applied to my tree and was
> >> still able to get the same KASAN crash to occur.
> >>
> >> In this case, when bpf_link_free is invoked it kicks off three instances
> >> of call_rcu*.
> >>
> >> bpf_link_free()
> >>    ops->release()
> >>       bpf_raw_tp_link_release()
> >>         bpf_probe_unregister()
> >>           tracepoint_probe_unregister()
> >>             tracepoint_remove_func()
> >>               release_probes()
> >>                 call_rcu()               [1]
> >>    bpf_prog_put()
> >>      __bpf_prog_put()
> >>        bpf_prog_put_deferred()
> >>          __bpf_prog_put_noref()
> >>             call_rcu()                   [2]
> >>    call_rcu()                            [3]
> >>
> >> With Mathieu's patch, [1] is chained with call_rcu_tasks_trace(),
> >> making the grace period sufficiently long to safely free the probe itself.
> >> The callbacks for [2] and [3] may be invoked before the
> >> call_rcu_tasks_trace() grace period has elapsed, leading to the link or
> >> program itself being freed while still in use. I was able to prevent
> >> any crashes with the patch below, which also chains
> >> call_rcu_tasks_trace() and call_rcu() at [2] and [3].
> >>
> >> ---
> >>   kernel/bpf/syscall.c | 24 ++++++++++--------------
> >>   1 file changed, 10 insertions(+), 14 deletions(-)
> >>
> >> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> >> index 59de664e580d..5290eccb465e 100644
> >> --- a/kernel/bpf/syscall.c
> >> +++ b/kernel/bpf/syscall.c
> >> @@ -2200,6 +2200,14 @@ static void __bpf_prog_put_rcu(struct rcu_head *rcu)
> >>          bpf_prog_free(aux->prog);
> >>   }
> >>
> >> +static void __bpf_prog_put_tasks_trace_rcu(struct rcu_head *rcu)
> >> +{
> >> +       if (rcu_trace_implies_rcu_gp())
> >> +               __bpf_prog_put_rcu(rcu);
> >> +       else
> >> +               call_rcu(rcu, __bpf_prog_put_rcu);
> >> +}
> >> +
> >>   static void __bpf_prog_put_noref(struct bpf_prog *prog, bool deferred)
> >>   {
> >>          bpf_prog_kallsyms_del_all(prog);
> >> @@ -2212,10 +2220,7 @@ static void __bpf_prog_put_noref(struct bpf_prog *prog, bool deferred)
> >>                  btf_put(prog->aux->attach_btf);
> >>
> >>          if (deferred) {
> >> -               if (prog->sleepable)
> >> -                       call_rcu_tasks_trace(&prog->aux->rcu, __bpf_prog_put_rcu);
> >> -               else
> >> -                       call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu);
> >> +               call_rcu_tasks_trace(&prog->aux->rcu, __bpf_prog_put_tasks_trace_rcu);
> >>          } else {
> >>                  __bpf_prog_put_rcu(&prog->aux->rcu);
> >>          }
> >> @@ -2996,24 +3001,15 @@ static void bpf_link_defer_dealloc_mult_rcu_gp(struct rcu_head *rcu)
> >>   static void bpf_link_free(struct bpf_link *link)
> >>   {
> >>          const struct bpf_link_ops *ops = link->ops;
> >> -       bool sleepable = false;
> >>
> >>          bpf_link_free_id(link->id);
> >>          if (link->prog) {
> >> -               sleepable = link->prog->sleepable;
> >>                  /* detach BPF program, clean up used resources */
> >>                  ops->release(link);
> >>                  bpf_prog_put(link->prog);
> >>          }
> >>          if (ops->dealloc_deferred) {
> >> -               /* schedule BPF link deallocation; if underlying BPF program
> >> -                * is sleepable, we need to first wait for RCU tasks trace
> >> -                * sync, then go through "classic" RCU grace period
> >> -                */
> >> -               if (sleepable)
> >> -                       call_rcu_tasks_trace(&link->rcu, bpf_link_defer_dealloc_mult_rcu_gp);
> >> -               else
> >> -                       call_rcu(&link->rcu, bpf_link_defer_dealloc_rcu_gp);
> >> +               call_rcu_tasks_trace(&link->rcu, bpf_link_defer_dealloc_mult_rcu_gp);
> >
> > This patch is completely wrong.
>
> Actually I suspect Jordan's patch works.

We're not going to penalize all bpf progs for that.
This patch is a non-starter.

> > Looks like Mathieu's patch broke the bpf program contract somewhere.
>
> My patch series introduced this in the probe:
>
> #define __BPF_DECLARE_TRACE_SYSCALL(call, proto, args)                  \
> static notrace void                                                     \
> __bpf_trace_##call(void *__data, proto)                                 \
> {                                                                       \
>          might_fault();                                                  \
>          preempt_disable_notrace();                                      \
>          CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(__data, CAST_TO_U64(args));        \
>          preempt_enable_notrace();                                       \
> }
>
> To ensure we'd call the bpf program from preempt-off context.
>
> > The tracepoint type bpf programs must be called with rcu_read_lock held.
>
> Calling the bpf program with preempt off is equivalent. __DO_TRACE() calls
> the probes with preempt_disable_notrace() in the normal case.
>
> > Looks like it's not happening anymore.
>
> The issue here is not about the context in which the bpf program runs, that's
> still preempt off. The problem is about expectations that a call_rcu grace period
> is enough to delay reclaim after unregistration of the tracepoint. Now that
> __DO_TRACE() uses rcu_read_lock_trace() to protect RCU dereference, it's not
> sufficient anymore, at least for syscall tracepoints.

I don't see how preempt_disable vs preempt_disable_notrace and
preemption in general are relevant here.

The prog dereference needs to happen under rcu_read_lock.

But since you've switched to use rcu_read_lock_trace,
this part:
> >> bpf_link_free()
> >>    ops->release()
> >>       bpf_raw_tp_link_release()
> >>         bpf_probe_unregister()
> >>           tracepoint_probe_unregister()
> >>             tracepoint_remove_func()
> >>               release_probes()
> >>                 call_rcu()               [1]

is probably incorrect. It should be call_rcu_tasks_trace?

and in addition all tracepoints (both sleepable and not)
should deref __data under a normal rcu_read_lock() before
passing that pointer into __bpf_trace_##call,
because that's how the bpf link and prog are RCU protected.

I don't have patches in front of me, so I'm guessing a bit.
So pls remind me where all these patches are?
What tree/branch are we talking about?
Steven Rostedt Oct. 24, 2024, 2:05 a.m. UTC | #14
On Wed, 23 Oct 2024 11:19:40 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> > Looks like Mathieu's patch broke the bpf program contract somewhere.
> 
> My patch series introduced this in the probe:
> 
> #define __BPF_DECLARE_TRACE_SYSCALL(call, proto, args)                  \
> static notrace void                                                     \
> __bpf_trace_##call(void *__data, proto)                                 \
> {                                                                       \
>          might_fault();                                                  \
>          preempt_disable_notrace();                                      \

Is the problem that we can call this function *after* the prog has been
freed? That is, the preempt_disable_notrace() here is meaningless.

Is there a way to add something here to make sure the program is still
valid? Like set a flag in the link structure?

(I don't know how BPF works well enough to know what is involved here,
so excuse me if this is totally off)

-- Steve


>          CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(__data, CAST_TO_U64(args));        \
>          preempt_enable_notrace();                                       \
> }
>
Jordan Rife Oct. 24, 2024, 5:12 p.m. UTC | #15
> I don't have patches in front of me, so I'm guessing a bit.
> So pls remind me where all these patches are?
> What tree/branch are we talking about?

I tested this in linux-next. Here is the patch series.
Link: https://lore.kernel.org/bpf/20241009010718.2050182-1-mathieu.desnoyers@efficios.com/T/#u

> and in addition all tracepoints (both sleepable and not)
> should deref __data under a normal rcu_read_lock() before
> passing that pointer into __bpf_trace_##call,
> because that's how the bpf link and prog are RCU protected.

I think this is the crux of the issue.

>  #define __DO_TRACE_CALL(name, args) \
>  do { \
>     struct tracepoint_func *it_func_ptr; \
>     void *__data; \
>     it_func_ptr = \
>       rcu_dereference_raw((&__tracepoint_##name)->funcs); \
>     if (it_func_ptr) { \
>       __data = (it_func_ptr)->data; \
>       static_call(tp_func_##name)(__data, args); \
>     } \
>   } while (0)
> #else

This code now executes under rcu_read_lock_trace for syscall
tracepoints, where __data gets dereferenced, but call_rcu is used to
free non-sleepable programs/links inside bpf_link_free. Is it viable
to just use rcu_read_lock()/rcu_read_unlock() here to denote an RCU
read-side critical section around where __data is used?
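
(Illustrative only -- the idea would look roughly like this in the macro
quoted above, with a classic RCU read-side section covering both the
__data dereference and the probe call:

    #define __DO_TRACE_CALL(name, args) \
    do { \
            struct tracepoint_func *it_func_ptr; \
            void *__data; \
            rcu_read_lock(); \
            it_func_ptr = \
                    rcu_dereference_raw((&__tracepoint_##name)->funcs); \
            if (it_func_ptr) { \
                    __data = (it_func_ptr)->data; \
                    static_call(tp_func_##name)(__data, args); \
            } \
            rcu_read_unlock(); \
    } while (0)

Note that this would hold rcu_read_lock() across a probe that may fault
for syscall tracepoints, the very constraint the faultable series lifts,
which may be part of why the question is open.)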

> We're not going to penalize all bpf progs for that.
> This patch is a non-starter.

Fair enough. This could probably be more targeted, using
call_rcu_tasks_trace only for raw tracepoint programs or those
attached to syscall tracepoints instead of doing this chaining for
everything. That is, assuming a change here is even needed at all and
that this issue can't be fixed outside of the BPF stuff. I'm not sure
whether or not that's the case.

-Jordan
Andrii Nakryiko Oct. 24, 2024, 5:12 p.m. UTC | #16
On Wed, Oct 23, 2024 at 7:05 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed, 23 Oct 2024 11:19:40 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> >
> > > Looks like Mathieu's patch broke the bpf program contract somewhere.
> >
> > My patch series introduced this in the probe:
> >
> > #define __BPF_DECLARE_TRACE_SYSCALL(call, proto, args)                  \
> > static notrace void                                                     \
> > __bpf_trace_##call(void *__data, proto)                                 \
> > {                                                                       \
> >          might_fault();                                                  \
> >          preempt_disable_notrace();                                      \
>
> Is the problem that we can call this function *after* the prog has been
> freed? That is, the preempt_disable_notrace() here is meaningless.
>

Yes, I think so.

> Is there a way to add something here to make sure the program is still
> valid? Like set a flag in the link structure?

So I think a big assumption right now is that waiting for an RCU grace
period is enough to make sure that the BPF link (and thus its underlying
BPF program) is not referenced from that tracepoint anymore, and so
we can proceed with freeing the memory.

Now that some tracepoints are sleepable and are RCU Tasks Trace
protected, this assumption is wrong.

One solution might be to teach BPF raw tracepoint link to recognize
sleepable tracepoints, and then go through call_rcu_tasks_trace ->
call_rcu chain instead of normal call_rcu. Similarly, for such cases
we'd need to do the same chain for underlying BPF program, even if BPF
program itself is not sleepable.

Alternatively, we'd need to add synchronize_rcu() +
synchronize_rcu_tasks_trace() somewhere inside
tracepoint_probe_unregister() or bpf_probe_unregister(), which is just
a thin wrapper around the former. Which would make detaching multiple
tracepoints extremely slow (just like the problem we had with kprobe
detachment before multi-kprobes were added).
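
(A sketch of that synchronous alternative, with the hypothetical
synchronize calls placed in the release path quoted below:

    static void bpf_raw_tp_link_release(struct bpf_link *link)
    {
            struct bpf_raw_tp_link *raw_tp =
                    container_of(link, struct bpf_raw_tp_link, link);

            bpf_probe_unregister(raw_tp->btp, raw_tp);
            /* hypothetical: block for both grace periods so the link/prog
             * freed later via call_rcu() cannot still be referenced
             */
            synchronize_rcu_tasks_trace();
            synchronize_rcu();
            bpf_put_raw_tracepoint(raw_tp->btp);
    }
)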

The fundamental reason for this is how we do lifetime tracking between
tracepoint object and bpf_link/bpf_program. tracepoint doesn't hold a
refcount on bpf_link. It's rather that when bpf_link's last refcount
drops to zero, we go and do this:

static void bpf_raw_tp_link_release(struct bpf_link *link)
{
        struct bpf_raw_tp_link *raw_tp =
                container_of(link, struct bpf_raw_tp_link, link);

        bpf_probe_unregister(raw_tp->btp, raw_tp);
        bpf_put_raw_tracepoint(raw_tp->btp);
}

And the assumption is that after bpf_probe_unregister() it should be
fine to free BPF link and program after call_rcu(). So we either make
bpf_probe_unregister() synchronously wait for
synchronize_rcu[_tasks_trace](), or we make sure to not free link/prog
until call_rcu_tasks_trace->call_rcu.

I'd prefer the former (add call_rcu_tasks_trace for sleepable BPF raw
tracepoint link).

You guys said you have a reproducer, right? Can you please share
details (I know it's somewhere on another thread, but let's put all
this in this thread). And as Alexei asked, where are the applied
patches adding faultable tracepoints?

>
> (I don't know how BPF works well enough to know what is involved here,
> so excuse me if this is totally off)

it's not off, but I don't think we want the tracepoint to hold a refcount
on bpf_link (and thus bpf_program), because that invalidates how we do
detachment.

>
> -- Steve
>
>
> >          CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(__data, CAST_TO_U64(args));        \
> >          preempt_enable_notrace();                                       \
> > }
> >
Jordan Rife Oct. 24, 2024, 5:50 p.m. UTC | #17
> You guys said you have a reproducer, right? Can you please share
> details (I know it's somewhere on another thread, but let's put all
> this in this thread).

For reference, the original syzbot report is here along with links to artifacts.
Link: https://lore.kernel.org/bpf/67121037.050a0220.10f4f4.000f.GAE@google.com/

syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=153ef887980000
disk image: https://storage.googleapis.com/syzbot-assets/cf2ad43c81cc/disk-15e7d45e.raw.xz

The steps I performed to reproduce locally are roughly as follows:

1. Copy the syz repro script to a file, repro.syz.txt
2. Download the disk image, disk.img
3. Build syzkaller (https://github.com/google/syzkaller)
4. Start up QEMU using disk.img: qemu-system-x86_64 -m 2G -smp
2,sockets=2,cores=1 -drive file=./disk.raw,format=raw -net
nic,model=e1000 -net user,host=10.0.2.10,hostfwd=tcp::10022-:22
-enable-kvm -nographic
5. SCP syzkaller/bin/linux_amd64/syz-execprog and
syzkaller/bin/linux_amd64/syz-executor to root@127.0.0.1:/root/
6. SCP repro.syz.txt to root@127.0.0.1:/root/
7. Run './syz-execprog -repeat=0 -procs=5 ./repro.syz.txt' over SSH on
root@127.0.0.1

This typically crashes things within 20 seconds or so on my machine.

-Jordan
Mathieu Desnoyers Oct. 24, 2024, 6:18 p.m. UTC | #18
On 2024-10-24 13:12, Andrii Nakryiko wrote:
[...]

> And as Alexei asked, where are the applied
> patches adding faultable tracepoints?
> 

On linux-next next-20241024 :

0850e1bc88b1 tracing/bpf: Add might_fault check to syscall probes
cdb537ac4179 tracing/perf: Add might_fault check to syscall probes
a3204c740a59 tracing/ftrace: Add might_fault check to syscall probes
a363d27cdbc2 tracing: Allow system call tracepoints to handle page faults   <---- This is where bisection points as the first offending commit.
4aadde89d81f tracing/bpf: disable preemption in syscall probe
65e7462a16ce tracing/perf: disable preemption in syscall probe
13d750c2c03e tracing/ftrace: disable preemption in syscall probe
0e6caab8db8b tracing: Declare system call tracepoints with TRACE_EVENT_SYSCALL

Thanks,

Mathieu
Jordan Rife Oct. 25, 2024, 3:01 p.m. UTC | #19
> One solution might be to teach BPF raw tracepoint link to recognize
> sleepable tracepoints, and then go through call_rcu_tasks_trace ->
> call_rcu chain instead of normal call_rcu. Similarly, for such cases
> we'd need to do the same chain for underlying BPF program, even if BPF
> program itself is not sleepable.

I don't suppose that tracepoints could themselves be marked as sleepable
(e.g. a new 'sleepable' member of `struct tracepoint`), which could be
checked when initializing or freeing the link? Something like this,

static void bpf_link_defer_bpf_prog_put(struct rcu_head *rcu)
{
	struct bpf_prog_aux *aux = container_of(rcu, struct bpf_prog_aux, rcu);
	bpf_prog_put(aux->prog);
}

 /* bpf_link_free is guaranteed to be called from process context */
 static void bpf_link_free(struct bpf_link *link)
 {
 	const struct bpf_link_ops *ops = link->ops;
 	bool sleepable = false;
 
+	if (ops->attachment_is_sleepable)
+		sleepable = ops->attachment_is_sleepable(link);
+
 	bpf_link_free_id(link->id);
 	if (link->prog) {
-		sleepable = link->prog->sleepable;
+		sleepable = sleepable || link->prog->sleepable;
 		/* detach BPF program, clean up used resources */
 		ops->release(link);
-		bpf_prog_put(link->prog);
+		if (sleepable)
+			call_rcu_tasks_trace(&link->prog->aux->rcu,
+ 					     bpf_link_defer_bpf_prog_put);
+		else
+			bpf_prog_put(link->prog);
 	}
 	if (ops->dealloc_deferred) {
 		/* schedule BPF link deallocation; if underlying BPF program
	...
 }
 
static bool bpf_raw_tp_link_attachment_is_sleepable(struct bpf_link *link)
{
	struct bpf_raw_tp_link *raw_tp =
		container_of(link, struct bpf_raw_tp_link, link);

	return raw_tp->btp->tp->sleepable;
}

where, if the attachment point of the link is sleepable, as with BPF raw
syscall tracepoints, we wait for the RCU Tasks Trace grace period
to elapse before freeing the program and link.

-Jordan
Andrii Nakryiko Oct. 25, 2024, 8:19 p.m. UTC | #20
On Fri, Oct 25, 2024 at 8:01 AM Jordan Rife <jrife@google.com> wrote:
>
> > One solution might be to teach BPF raw tracepoint link to recognize
> > sleepable tracepoints, and then go through call_rcu_tasks_trace ->
> > call_rcu chain instead of normal call_rcu. Similarly, for such cases
> > we'd need to do the same chain for underlying BPF program, even if BPF
> > program itself is not sleepable.
>
> I don't suppose that tracepoints could themselves be marked as sleepable
> (e.g. a new 'sleepable' member of `struct tracepoint`), which could be
> checked when initializing or freeing the link? Something like this,
>
> static void bpf_link_defer_bpf_prog_put(struct rcu_head *rcu)
> {
>         struct bpf_prog_aux *aux = container_of(rcu, struct bpf_prog_aux, rcu);
>         bpf_prog_put(aux->prog);
> }
>
>  /* bpf_link_free is guaranteed to be called from process context */
>  static void bpf_link_free(struct bpf_link *link)
>  {
>         const struct bpf_link_ops *ops = link->ops;
>         bool sleepable = false;
>
> +       if (ops->attachment_is_sleepable)
> +               sleepable = ops->attachment_is_sleepable(link);
> +
>         bpf_link_free_id(link->id);
>         if (link->prog) {
> -               sleepable = link->prog->sleepable;
> +               sleepable = sleepable || link->prog->sleepable;
>                 /* detach BPF program, clean up used resources */
>                 ops->release(link);
> -               bpf_prog_put(link->prog);
> +               if (sleepable)
> +                       call_rcu_tasks_trace(&link->prog->aux->rcu,
> +                                            bpf_link_defer_bpf_prog_put);
> +               else
> +                       bpf_prog_put(link->prog);
>         }
>         if (ops->dealloc_deferred) {
>                 /* schedule BPF link deallocation; if underlying BPF program
>         ...
>  }
>
> static bool bpf_raw_tp_link_attachment_is_sleepable(struct bpf_link *link)
> {
>         struct bpf_raw_tp_link *raw_tp =
>                 container_of(link, struct bpf_raw_tp_link, link);
>
>         return raw_tp->btp->tp->sleepable;
> }
>
> where, if the attachment point of the link is sleepable, as with BPF raw
> syscall tracepoints, we wait for the RCU Tasks Trace grace period
> to elapse before freeing the program and link.

Yes, that's the direction I'm leaning towards (though the implementation
details would be different; I don't think we need an
attachment_is_sleepable callback). I replied to Mathieu's patch; I'll
try to do fixes for the BPF side next week, hope that works.

>
> -Jordan

Patch

diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 6474e2cf22c9..33f6fa94d383 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -101,11 +101,16 @@  static inline void *allocate_probes(int count)
 	return p == NULL ? NULL : p->probes;
 }
 
-static void rcu_free_old_probes(struct rcu_head *head)
+static void rcu_tasks_trace_free_old_probes(struct rcu_head *head)
 {
 	kfree(container_of(head, struct tp_probes, rcu));
 }
 
+static void rcu_free_old_probes(struct rcu_head *head)
+{
+	call_rcu_tasks_trace(head, rcu_tasks_trace_free_old_probes);
+}
+
 static inline void release_probes(struct tracepoint_func *old)
 {
 	if (old) {