diff mbox series

[bpf-next,11/15] bpf: Add bpf_sys_close() helper.

Message ID 20210417033224.8063-12-alexei.starovoitov@gmail.com (mailing list archive)
State Superseded
Delegated to: BPF
Headers show
Series bpf: syscall program, FD array, loader program, light skeleton. | expand

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present success Link
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for bpf-next
netdev/subject_prefix success Link
netdev/cc_maintainers warning 8 maintainers not CCed: joe@cilium.io yhs@fb.com kpsingh@kernel.org kafai@fb.com ast@kernel.org john.fastabend@gmail.com songliubraving@fb.com quentin@isovalent.com
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit fail Errors and warnings before: 12006 this patch: 12007
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch warning WARNING: please, no space before tabs
netdev/build_allmodconfig_warn fail Errors and warnings before: 12493 this patch: 12494
netdev/header_inline success Link

Commit Message

Alexei Starovoitov April 17, 2021, 3:32 a.m. UTC
From: Alexei Starovoitov <ast@kernel.org>

Add bpf_sys_close() helper to be used by the syscall/loader program to close
intermediate FDs and other cleanup.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/uapi/linux/bpf.h       |  7 +++++++
 kernel/bpf/syscall.c           | 14 ++++++++++++++
 tools/include/uapi/linux/bpf.h |  7 +++++++
 3 files changed, 28 insertions(+)

Comments

Al Viro April 17, 2021, 3:42 a.m. UTC | #1
On Fri, Apr 16, 2021 at 08:32:20PM -0700, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Add bpf_sys_close() helper to be used by the syscall/loader program to close
> intermediate FDs and other cleanup.

Conditional NAK.  In a lot of contexts close_fd() is very much unsafe.
In particular, anything that might call it between fdget() and fdput()
is Right Fucking Out(tm).

In which contexts can that thing be executed?
Alexei Starovoitov April 17, 2021, 3:46 a.m. UTC | #2
On Fri, Apr 16, 2021 at 8:42 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Fri, Apr 16, 2021 at 08:32:20PM -0700, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Add bpf_sys_close() helper to be used by the syscall/loader program to close
> > intermediate FDs and other cleanup.
>
> Conditional NAK.  In a lot of contexts close_fd() is very much unsafe.
> In particular, anything that might call it between fdget() and fdput()
> is Right Fucking Out(tm).
> In which contexts can that thing be executed?

user context only.
It's not for all of bpf _obviously_.
Al Viro April 17, 2021, 4:04 a.m. UTC | #3
On Fri, Apr 16, 2021 at 08:46:05PM -0700, Alexei Starovoitov wrote:
> On Fri, Apr 16, 2021 at 8:42 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > On Fri, Apr 16, 2021 at 08:32:20PM -0700, Alexei Starovoitov wrote:
> > > From: Alexei Starovoitov <ast@kernel.org>
> > >
> > > Add bpf_sys_close() helper to be used by the syscall/loader program to close
> > > intermediate FDs and other cleanup.
> >
> > Conditional NAK.  In a lot of contexts close_fd() is very much unsafe.
> > In particular, anything that might call it between fdget() and fdput()
> > is Right Fucking Out(tm).
> > In which contexts can that thing be executed?
> 
> user context only.
> It's not for all of bpf _obviously_.

Let me restate the question: what call chains could lead to bpf_sys_close()?
Alexei Starovoitov April 17, 2021, 5:01 a.m. UTC | #4
On Fri, Apr 16, 2021 at 9:04 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Fri, Apr 16, 2021 at 08:46:05PM -0700, Alexei Starovoitov wrote:
> > On Fri, Apr 16, 2021 at 8:42 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > On Fri, Apr 16, 2021 at 08:32:20PM -0700, Alexei Starovoitov wrote:
> > > > From: Alexei Starovoitov <ast@kernel.org>
> > > >
> > > > Add bpf_sys_close() helper to be used by the syscall/loader program to close
> > > > intermediate FDs and other cleanup.
> > >
> > > Conditional NAK.  In a lot of contexts close_fd() is very much unsafe.
> > > In particular, anything that might call it between fdget() and fdput()
> > > is Right Fucking Out(tm).
> > > In which contexts can that thing be executed?
> >
> > user context only.
> > It's not for all of bpf _obviously_.
>
> Let me restate the question: what call chains could lead to bpf_sys_close()?

Already answered. User context only. It's all safe.
Alexei Starovoitov April 17, 2021, 2:36 p.m. UTC | #5
On Fri, Apr 16, 2021 at 10:01:43PM -0700, Alexei Starovoitov wrote:
> On Fri, Apr 16, 2021 at 9:04 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > On Fri, Apr 16, 2021 at 08:46:05PM -0700, Alexei Starovoitov wrote:
> > > On Fri, Apr 16, 2021 at 8:42 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > >
> > > > On Fri, Apr 16, 2021 at 08:32:20PM -0700, Alexei Starovoitov wrote:
> > > > > From: Alexei Starovoitov <ast@kernel.org>
> > > > >
> > > > > Add bpf_sys_close() helper to be used by the syscall/loader program to close
> > > > > intermediate FDs and other cleanup.
> > > >
> > > > Conditional NAK.  In a lot of contexts close_fd() is very much unsafe.
> > > > In particular, anything that might call it between fdget() and fdput()
> > > > is Right Fucking Out(tm).
> > > > In which contexts can that thing be executed?
> > >
> > > user context only.
> > > It's not for all of bpf _obviously_.
> >
> > Let me restate the question: what call chains could lead to bpf_sys_close()?
> 
> Already answered. User context only. It's all safe.

Not only sys_close is safe to call. Literally all syscalls are safe to call.
The current allowlist contains two syscalls. It may get extended as use cases come up.

The following two codes are equivalent:
1.
bpf_prog.c:
  SEC("syscall")
  int bpf_prog(struct args *ctx)
  {
    bpf_sys_close(1);
    bpf_sys_close(2);
    bpf_sys_close(3);
    return 0;
  }
main.c:
  int main(int ac, char **av)
  {
    bpf_prog_load_and_run("bpf_prog.o");
  }

2.
main.c:
  int main(int ac, char **av)
  {
    close(1);
    close(2);
    close(3);
  }

The kernel will perform the same work with FDs. The same locks are held
and the same execution conditions are in both cases. The LSM hooks,
fsnotify, etc will be called the same way.
It's no different if new syscall was introduced "sys_foo(int num)" that
would do { return close_fd(num); }.
It would opearate in the same user context.
Al Viro April 17, 2021, 4:48 p.m. UTC | #6
On Sat, Apr 17, 2021 at 07:36:39AM -0700, Alexei Starovoitov wrote:

> The kernel will perform the same work with FDs. The same locks are held
> and the same execution conditions are in both cases. The LSM hooks,
> fsnotify, etc will be called the same way.
> It's no different if new syscall was introduced "sys_foo(int num)" that
> would do { return close_fd(num); }.
> It would opearate in the same user context.

Hmm...  unless I'm misreading the code, one of the call chains would seem to
be sys_bpf() -> bpf_prog_test_run() -> ->test_run() -> ... -> bpf_sys_close().
OK, as long as you make sure bpf_prog_get() does fdput() (i.e. that we
don't have it restructured so that fdget()/fdput() pair would be lifted into
bpf_prog_test_run(), with fdput() moved in place of bpf_prog_put()).

Note that we *really* can not allow close_fd() on anything to be bracketed
by fdget()/fdput() pair; we had bugs of that sort and, as the matter of fact,
still have one in autofs_dev_ioctl().

The trouble happens if you have file F with 2 references, held by descriptor
tables of different processes.  Say, process A has descriptor 6 refering to
it, while B has descriptor 42 doing the same.  Descriptor tables of A and B
are not shared with anyone.

A: fdget(6) 	-> returns a reference to F, refcount _not_ touched
A: close_fd(6)	-> rips the reference to F from descriptor table, does fput(F)
		   refcount drops to 1.
B: close(42)	-> rips the reference to F from B's descriptor table, does fput(F)
		   This time refcount does reach 0 and we use task_work_add() to
		   make sure the destructor (__fput()) runs before B returns to
		   userland.  sys_close() returns and B goes off to userland.
		   On the way out __fput() is run, and among other things,
		   ->release() of F is executed, doing whatever it wants to do.
		   F is freed.
And at that point A, which presumably is using the guts of F, gets screwed.

In case of autofs_dev_ioctl(), it's possible for a thread to end up blocked
inside copy_to_user(), with autofs functions in call chains *AND* module
refcount of autofs not pinned by anything.  The effects of returning into a
function that used to reside in now unmapped page are obviously not pretty...

Basically, the rule is
	* never remove from descriptor table if you currently have an outstadning
fdget() (i.e. without having done the matching fdput() yet).

	That, obviously, covers all ioctls - there we have fdget() done
by sys_ioctl() on the issuing descriptor.  In your case you seem to be
safe, but it's a bloody dangerous minefield - you really need a big warning
in all call sites.  The worst part is that it won't happen with intended
use, so it doesn't show up in routine regression testing.  In particular,
for autofs the normal case is AUTOFS_DEV_IOCTL_CLOSEMOUNT getting passed
a file descriptor of something mounted and *not* the descriptor of
/dev/autofs we are holding fdget() on.  However, there's no way to prevent
a malicious call when we pass exactly that.

	So please, mark all call sites with "make very sure you never get
here with unpaired fdget()".

	BTW, if my reading (re ->test_run()) is correct, what limits the recursion
via bpf_sys_bpf()?
Alexei Starovoitov April 17, 2021, 5:09 p.m. UTC | #7
On Sat, Apr 17, 2021 at 04:48:53PM +0000, Al Viro wrote:
> On Sat, Apr 17, 2021 at 07:36:39AM -0700, Alexei Starovoitov wrote:
> 
> > The kernel will perform the same work with FDs. The same locks are held
> > and the same execution conditions are in both cases. The LSM hooks,
> > fsnotify, etc will be called the same way.
> > It's no different if new syscall was introduced "sys_foo(int num)" that
> > would do { return close_fd(num); }.
> > It would opearate in the same user context.
> 
> Hmm...  unless I'm misreading the code, one of the call chains would seem to
> be sys_bpf() -> bpf_prog_test_run() -> ->test_run() -> ... -> bpf_sys_close().
> OK, as long as you make sure bpf_prog_get() does fdput() (i.e. that we
> don't have it restructured so that fdget()/fdput() pair would be lifted into
> bpf_prog_test_run(), with fdput() moved in place of bpf_prog_put()).

Got it. There is no fdget/put bracketing in the code.
On the way to test_run we do __bpf_prog_get() which does fdget and immediately
fdput after incrementing refcnt of the prog.
I believe this pattern is consistent everywhere in kernel/bpf/*

> Note that we *really* can not allow close_fd() on anything to be bracketed
> by fdget()/fdput() pair; we had bugs of that sort and, as the matter of fact,
> still have one in autofs_dev_ioctl().
> 
> The trouble happens if you have file F with 2 references, held by descriptor
> tables of different processes.  Say, process A has descriptor 6 refering to
> it, while B has descriptor 42 doing the same.  Descriptor tables of A and B
> are not shared with anyone.
> 
> A: fdget(6) 	-> returns a reference to F, refcount _not_ touched
> A: close_fd(6)	-> rips the reference to F from descriptor table, does fput(F)
> 		   refcount drops to 1.
> B: close(42)	-> rips the reference to F from B's descriptor table, does fput(F)
> 		   This time refcount does reach 0 and we use task_work_add() to
> 		   make sure the destructor (__fput()) runs before B returns to
> 		   userland.  sys_close() returns and B goes off to userland.
> 		   On the way out __fput() is run, and among other things,
> 		   ->release() of F is executed, doing whatever it wants to do.
> 		   F is freed.
> And at that point A, which presumably is using the guts of F, gets screwed.

Thanks for these details. That's really helpful.

> 	So please, mark all call sites with "make very sure you never get
> here with unpaired fdget()".

Good point. Will add this comment.

> 	BTW, if my reading (re ->test_run()) is correct, what limits the recursion
> via bpf_sys_bpf()?

Glad you asked! This kind of code review questions are much appreciated.

It's an allowlist of possible commands in bpf_sys_bpf().
'case BPF_PROG_TEST_RUN:' is not there for this exact reason.
I'll add a comment to make it more obvious.
diff mbox series

Patch

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 47c4b21a51b6..2251e7894799 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4728,6 +4728,12 @@  union bpf_attr {
  * 		If btf_fd is zero look for the name in vmlinux BTF and kernel module BTFs.
  * 	Return
  * 		Return btf_id and store module's BTF FD into attach_btf_obj_fd.
+ *
+ * long bpf_sys_close(u32 fd)
+ * 	Description
+ * 		Execute close syscall for given FD.
+ * 	Return
+ * 		A syscall result.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -4897,6 +4903,7 @@  union bpf_attr {
 	FN(for_each_map_elem),		\
 	FN(sys_bpf),			\
 	FN(btf_find_by_name_kind),	\
+	FN(sys_close),			\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 6d4d9925c0ec..822b00908c58 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -4557,6 +4557,18 @@  const struct bpf_func_proto bpf_sys_bpf_proto = {
 	.arg3_type	= ARG_CONST_SIZE,
 };
 
+BPF_CALL_1(bpf_sys_close, u32, fd)
+{
+	return close_fd(fd);
+}
+
+const struct bpf_func_proto bpf_sys_close_proto = {
+	.func		= bpf_sys_close,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_ANYTHING,
+};
+
 static const struct bpf_func_proto *
 syscall_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
@@ -4565,6 +4577,8 @@  syscall_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_sys_bpf_proto;
 	case BPF_FUNC_btf_find_by_name_kind:
 		return &bpf_btf_find_by_name_kind_proto;
+	case BPF_FUNC_sys_close:
+		return &bpf_sys_close_proto;
 	default:
 		return bpf_base_func_proto(func_id);
 	}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 47c4b21a51b6..2251e7894799 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4728,6 +4728,12 @@  union bpf_attr {
  * 		If btf_fd is zero look for the name in vmlinux BTF and kernel module BTFs.
  * 	Return
  * 		Return btf_id and store module's BTF FD into attach_btf_obj_fd.
+ *
+ * long bpf_sys_close(u32 fd)
+ * 	Description
+ * 		Execute close syscall for given FD.
+ * 	Return
+ * 		A syscall result.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -4897,6 +4903,7 @@  union bpf_attr {
 	FN(for_each_map_elem),		\
 	FN(sys_bpf),			\
 	FN(btf_find_by_name_kind),	\
+	FN(sys_close),			\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper