| Field | Value |
|---|---|
| Message ID | 20230614083430.oENawF8f@linutronix.de (mailing list archive) |
| State | Accepted |
| Commit | ab5d47bd41b1db82c295b0e751e2b822b43a4b5a |
| Delegated to: | BPF |
| Series | [v4] bpf: Remove in_atomic() from bpf_link_put(). |
On Wed, Jun 14, 2023 at 10:34:30AM +0200, Sebastian Andrzej Siewior wrote:
> bpf_free_inode() is invoked as a RCU callback. Usually RCU callbacks are
> invoked within softirq context. By setting rcutree.use_softirq=0 boot
> option the RCU callbacks will be invoked in a per-CPU kthread with
> bottom halves disabled which implies a RCU read section.
> 
> On PREEMPT_RT the context remains fully preemptible. The RCU read
> section however does not allow schedule() invocation. The latter happens
> in mutex_lock() performed by bpf_trampoline_unlink_prog() originated
> from bpf_link_put().

Just to make sure that I understand, you are proposing that the RCU
callbacks continue to run with BH disabled, but that BH-disabled regions
are preemptible in kernels built with CONFIG_PREEMPT_RT=y?

Or did I miss a turn in there somewhere?

							Thanx, Paul

> It was pointed out that the bpf_link_put() invocation should not be
> delayed if originated from close(). It was also pointed out that other
> invocations from within a syscall should also avoid the workqueue.
> Everyone else should use workqueue by default to remain safe in the
> future (while auditing the code, every caller was preemptible except for
> the RCU case).
> 
> Let bpf_link_put() use the worker unconditionally. Add
> bpf_link_put_direct() which will directly free the resources and is used
> by close() and from within __sys_bpf().
> 
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> v3…v4:
>   - Revert back to bpf_link_put_direct() to the direct free and let
>     bpf_link_put() use the worker. Let close() and all invocations from
>     within the syscall use bpf_link_put_direct() which are all instances
>     within syscall.c here.
> 
> v2…v3:
>   - Drop bpf_link_put_direct(). Let bpf_link_put() do the direct free
>     and add bpf_link_put_from_atomic() to do the delayed free via the
>     worker.
> 
> v1…v2:
>   - Add bpf_link_put_direct() to be used from bpf_link_release() as
>     suggested.
> 
>  kernel/bpf/syscall.c | 29 ++++++++++++++++-------------
>  1 file changed, 16 insertions(+), 13 deletions(-)
> 
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 14f39c1e573ee..8f09aef5949d4 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2777,28 +2777,31 @@ static void bpf_link_put_deferred(struct work_struct *work)
>  	bpf_link_free(link);
>  }
>  
> -/* bpf_link_put can be called from atomic context, but ensures that resources
> - * are freed from process context
> +/* bpf_link_put might be called from atomic context. It needs to be called
> + * from sleepable context in order to acquire sleeping locks during the process.
>   */
>  void bpf_link_put(struct bpf_link *link)
>  {
>  	if (!atomic64_dec_and_test(&link->refcnt))
>  		return;
>  
> -	if (in_atomic()) {
> -		INIT_WORK(&link->work, bpf_link_put_deferred);
> -		schedule_work(&link->work);
> -	} else {
> -		bpf_link_free(link);
> -	}
> +	INIT_WORK(&link->work, bpf_link_put_deferred);
> +	schedule_work(&link->work);
>  }
>  EXPORT_SYMBOL(bpf_link_put);
>  
> +static void bpf_link_put_direct(struct bpf_link *link)
> +{
> +	if (!atomic64_dec_and_test(&link->refcnt))
> +		return;
> +	bpf_link_free(link);
> +}
> +
>  static int bpf_link_release(struct inode *inode, struct file *filp)
>  {
>  	struct bpf_link *link = filp->private_data;
>  
> -	bpf_link_put(link);
> +	bpf_link_put_direct(link);
>  	return 0;
>  }
>  
> @@ -4764,7 +4767,7 @@ static int link_update(union bpf_attr *attr)
>  	if (ret)
>  		bpf_prog_put(new_prog);
>  out_put_link:
> -	bpf_link_put(link);
> +	bpf_link_put_direct(link);
>  	return ret;
>  }
>  
> @@ -4787,7 +4790,7 @@ static int link_detach(union bpf_attr *attr)
>  	else
>  		ret = -EOPNOTSUPP;
>  
> -	bpf_link_put(link);
> +	bpf_link_put_direct(link);
>  	return ret;
>  }
>  
> @@ -4857,7 +4860,7 @@ static int bpf_link_get_fd_by_id(const union bpf_attr *attr)
>  
>  	fd = bpf_link_new_fd(link);
>  	if (fd < 0)
> -		bpf_link_put(link);
> +		bpf_link_put_direct(link);
>  
>  	return fd;
>  }
> @@ -4934,7 +4937,7 @@ static int bpf_iter_create(union bpf_attr *attr)
>  		return PTR_ERR(link);
>  
>  	err = bpf_iter_new_fd(link);
> -	bpf_link_put(link);
> +	bpf_link_put_direct(link);
>  
>  	return err;
>  }
> -- 
> 2.40.1
> 
On 2023-06-15 09:43:11 [-0700], Paul E. McKenney wrote:
> On Wed, Jun 14, 2023 at 10:34:30AM +0200, Sebastian Andrzej Siewior wrote:
> > bpf_free_inode() is invoked as a RCU callback. Usually RCU callbacks are
> > invoked within softirq context. By setting rcutree.use_softirq=0 boot
> > option the RCU callbacks will be invoked in a per-CPU kthread with
> > bottom halves disabled which implies a RCU read section.
> > 
> > On PREEMPT_RT the context remains fully preemptible. The RCU read
> > section however does not allow schedule() invocation. The latter happens
> > in mutex_lock() performed by bpf_trampoline_unlink_prog() originated
> > from bpf_link_put().
> 
> Just to make sure that I understand, you are proposing that the RCU
> callbacks continue to run with BH disabled, but that BH-disabled regions
> are preemptible in kernels built with CONFIG_PREEMPT_RT=y?
> 
> Or did I miss a turn in there somewhere?

I'm not proposing anything, just stating what we have. On PREEMPT_RT you
are preemptible within the RCU callback but must not invoke schedule().
Similar to the RCU read section on CONFIG_PREEMPT where you are
preemptible but must not invoke schedule().

> 							Thanx, Paul

Sebastian
On Thu, Jun 15, 2023 at 09:13:41PM +0200, Sebastian Andrzej Siewior wrote:
> On 2023-06-15 09:43:11 [-0700], Paul E. McKenney wrote:
> > On Wed, Jun 14, 2023 at 10:34:30AM +0200, Sebastian Andrzej Siewior wrote:
> > > bpf_free_inode() is invoked as a RCU callback. Usually RCU callbacks are
> > > invoked within softirq context. By setting rcutree.use_softirq=0 boot
> > > option the RCU callbacks will be invoked in a per-CPU kthread with
> > > bottom halves disabled which implies a RCU read section.
> > > 
> > > On PREEMPT_RT the context remains fully preemptible. The RCU read
> > > section however does not allow schedule() invocation. The latter happens
> > > in mutex_lock() performed by bpf_trampoline_unlink_prog() originated
> > > from bpf_link_put().
> > 
> > Just to make sure that I understand, you are proposing that the RCU
> > callbacks continue to run with BH disabled, but that BH-disabled regions
> > are preemptible in kernels built with CONFIG_PREEMPT_RT=y?
> > 
> > Or did I miss a turn in there somewhere?
> 
> I'm not proposing anything, just stating what we have. On PREEMPT_RT
> you are preemptible within the RCU callback but must not invoke
> schedule(). Similar to the RCU read section on CONFIG_PREEMPT where you
> are preemptible but must not invoke schedule().

Thank you for the clarification! The main risk of preemptible RCU
callbacks is callback flooding, but RCU priority boosting should take
care of that.

							Thanx, Paul
Hello:

This patch was applied to bpf/bpf-next.git (master)
by Andrii Nakryiko <andrii@kernel.org>:

On Wed, 14 Jun 2023 10:34:30 +0200 you wrote:
> bpf_free_inode() is invoked as a RCU callback. Usually RCU callbacks are
> invoked within softirq context. By setting rcutree.use_softirq=0 boot
> option the RCU callbacks will be invoked in a per-CPU kthread with
> bottom halves disabled which implies a RCU read section.
> 
> On PREEMPT_RT the context remains fully preemptible. The RCU read
> section however does not allow schedule() invocation. The latter happens
> in mutex_lock() performed by bpf_trampoline_unlink_prog() originated
> from bpf_link_put().
> 
> [...]

Here is the summary with links:
  - [v4] bpf: Remove in_atomic() from bpf_link_put().
    https://git.kernel.org/bpf/bpf-next/c/ab5d47bd41b1

You are awesome, thank you!
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 14f39c1e573ee..8f09aef5949d4 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2777,28 +2777,31 @@ static void bpf_link_put_deferred(struct work_struct *work)
 	bpf_link_free(link);
 }
 
-/* bpf_link_put can be called from atomic context, but ensures that resources
- * are freed from process context
+/* bpf_link_put might be called from atomic context. It needs to be called
+ * from sleepable context in order to acquire sleeping locks during the process.
  */
 void bpf_link_put(struct bpf_link *link)
 {
 	if (!atomic64_dec_and_test(&link->refcnt))
 		return;
 
-	if (in_atomic()) {
-		INIT_WORK(&link->work, bpf_link_put_deferred);
-		schedule_work(&link->work);
-	} else {
-		bpf_link_free(link);
-	}
+	INIT_WORK(&link->work, bpf_link_put_deferred);
+	schedule_work(&link->work);
 }
 EXPORT_SYMBOL(bpf_link_put);
 
+static void bpf_link_put_direct(struct bpf_link *link)
+{
+	if (!atomic64_dec_and_test(&link->refcnt))
+		return;
+	bpf_link_free(link);
+}
+
 static int bpf_link_release(struct inode *inode, struct file *filp)
 {
 	struct bpf_link *link = filp->private_data;
 
-	bpf_link_put(link);
+	bpf_link_put_direct(link);
 	return 0;
 }
 
@@ -4764,7 +4767,7 @@ static int link_update(union bpf_attr *attr)
 	if (ret)
 		bpf_prog_put(new_prog);
 out_put_link:
-	bpf_link_put(link);
+	bpf_link_put_direct(link);
 	return ret;
 }
 
@@ -4787,7 +4790,7 @@ static int link_detach(union bpf_attr *attr)
 	else
 		ret = -EOPNOTSUPP;
 
-	bpf_link_put(link);
+	bpf_link_put_direct(link);
 	return ret;
 }
 
@@ -4857,7 +4860,7 @@ static int bpf_link_get_fd_by_id(const union bpf_attr *attr)
 
 	fd = bpf_link_new_fd(link);
 	if (fd < 0)
-		bpf_link_put(link);
+		bpf_link_put_direct(link);
 
 	return fd;
 }
@@ -4934,7 +4937,7 @@ static int bpf_iter_create(union bpf_attr *attr)
 		return PTR_ERR(link);
 
 	err = bpf_iter_new_fd(link);
-	bpf_link_put(link);
+	bpf_link_put_direct(link);
 
 	return err;
 }
bpf_free_inode() is invoked as a RCU callback. Usually RCU callbacks are
invoked within softirq context. By setting rcutree.use_softirq=0 boot
option the RCU callbacks will be invoked in a per-CPU kthread with
bottom halves disabled which implies a RCU read section.

On PREEMPT_RT the context remains fully preemptible. The RCU read
section however does not allow schedule() invocation. The latter happens
in mutex_lock() performed by bpf_trampoline_unlink_prog() originated
from bpf_link_put().

It was pointed out that the bpf_link_put() invocation should not be
delayed if originated from close(). It was also pointed out that other
invocations from within a syscall should also avoid the workqueue.
Everyone else should use workqueue by default to remain safe in the
future (while auditing the code, every caller was preemptible except for
the RCU case).

Let bpf_link_put() use the worker unconditionally. Add
bpf_link_put_direct() which will directly free the resources and is used
by close() and from within __sys_bpf().

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
v3…v4:
  - Revert back to bpf_link_put_direct() to the direct free and let
    bpf_link_put() use the worker. Let close() and all invocations from
    within the syscall use bpf_link_put_direct() which are all instances
    within syscall.c here.

v2…v3:
  - Drop bpf_link_put_direct(). Let bpf_link_put() do the direct free
    and add bpf_link_put_from_atomic() to do the delayed free via the
    worker.

v1…v2:
  - Add bpf_link_put_direct() to be used from bpf_link_release() as
    suggested.

 kernel/bpf/syscall.c | 29 ++++++++++++++++-------------
 1 file changed, 16 insertions(+), 13 deletions(-)