diff mbox series

[v5,8/9] rust: file: add `DeferredFdCloser`

Message ID 20240209-alice-file-v5-8-a37886783025@google.com (mailing list archive)
State New, archived
Headers show
Series File abstractions needed by Rust Binder | expand

Commit Message

Alice Ryhl Feb. 9, 2024, 11:18 a.m. UTC
To close an fd from kernel space, we could call `ksys_close`. However,
if we do this to an fd that is held using `fdget`, then we may trigger a
use-after-free. Introduce a helper that can be used to close an fd even
if the fd is currently held with `fdget`. This is done by grabbing an
extra refcount to the file and dropping it in a task work once we return
to userspace.

This is necessary for Rust Binder because otherwise the user might try
to have Binder close its fd for /dev/binder, which would cause problems
as this happens inside an ioctl on /dev/binder, and ioctls hold the fd
using `fdget`.

Additional motivation can be found in commit 80cd795630d6 ("binder: fix
use-after-free due to ksys_close() during fdget()") and in the comments
on `binder_do_fd_close`.

If there is some way to detect whether an fd is currently held with
`fdget`, then this could be optimized to skip the allocation and task
work when this is not the case. Another possible optimization would be
to combine several fds into a single task work, since this is used with
fd arrays that might hold several fds.

That said, it might not be necessary to optimize it, because Rust Binder
has two ways to send fds: BINDER_TYPE_FD and BINDER_TYPE_FDA. With
BINDER_TYPE_FD, it is userspace's responsibility to close the fd, so
this mechanism is used only by BINDER_TYPE_FDA, but fd arrays are used
rarely these days.

Reviewed-by: Benno Lossin <benno.lossin@proton.me>
Reviewed-by: Martin Rodriguez Reboredo <yakoyoku@gmail.com>
Signed-off-by: Alice Ryhl <aliceryhl@google.com>
---
 rust/bindings/bindings_helper.h |   2 +
 rust/helpers.c                  |   8 ++
 rust/kernel/file.rs             | 184 +++++++++++++++++++++++++++++++++++++++-
 rust/kernel/task.rs             |  14 +++
 4 files changed, 207 insertions(+), 1 deletion(-)

Comments

Trevor Gross Feb. 10, 2024, 7:47 a.m. UTC | #1
On Fri, Feb 9, 2024 at 5:22 AM Alice Ryhl <aliceryhl@google.com> wrote:
>
> To close an fd from kernel space, we could call `ksys_close`. However,
> if we do this to an fd that is held using `fdget`, then we may trigger a
> use-after-free. Introduce a helper that can be used to close an fd even
> if the fd is currently held with `fdget`. This is done by grabbing an
> extra refcount to the file and dropping it in a task work once we return
> to userspace.
>
> This is necessary for Rust Binder because otherwise the user might try
> to have Binder close its fd for /dev/binder, which would cause problems
> as this happens inside an ioctl on /dev/binder, and ioctls hold the fd
> using `fdget`.
>
> Additional motivation can be found in commit 80cd795630d6 ("binder: fix
> use-after-free due to ksys_close() during fdget()") and in the comments
> on `binder_do_fd_close`.
>
> If there is some way to detect whether an fd is currently held with
> `fdget`, then this could be optimized to skip the allocation and task
> work when this is not the case. Another possible optimization would be
> to combine several fds into a single task work, since this is used with
> fd arrays that might hold several fds.
>
> That said, it might not be necessary to optimize it, because Rust Binder
> has two ways to send fds: BINDER_TYPE_FD and BINDER_TYPE_FDA. With
> BINDER_TYPE_FD, it is userspace's responsibility to close the fd, so
> this mechanism is used only by BINDER_TYPE_FDA, but fd arrays are used
> rarely these days.
>
> Reviewed-by: Benno Lossin <benno.lossin@proton.me>
> Reviewed-by: Martin Rodriguez Reboredo <yakoyoku@gmail.com>
> Signed-off-by: Alice Ryhl <aliceryhl@google.com>
> ---

Reviewed-by: Trevor Gross <tmgross@umich.edu>

> +    /// Schedule a task work that closes the file descriptor when this task returns to userspace.
> +    ///
> +    /// Fails if this is called from a context where we cannot run work when returning to
> +    /// userspace. (E.g., from a kthread.)
> +    pub fn close_fd(self, fd: u32) -> Result<(), DeferredFdCloseError> {
> +        use bindings::task_work_notify_mode_TWA_RESUME as TWA_RESUME;
> +
> +        // In this method, we schedule the task work before closing the file. This is because
> +        // scheduling a task work is fallible, and we need to know whether it will fail before we
> +        // attempt to close the file.
> +
> +        // Task works are not available on kthreads.
> +        let current = crate::current!();
> +        if current.is_kthread() {
> +            return Err(DeferredFdCloseError::TaskWorkUnavailable);
> +        }
> +
> +        // Transfer ownership of the box's allocation to a raw pointer. This disables the
> +        // destructor, so we must manually convert it back to a Box to drop it.
> +        //
> +        // Until we convert it back to a `Box`, there are no aliasing requirements on this
> +        // pointer.
> +        let inner = Box::into_raw(self.inner);
> +
> +        // The `callback_head` field is first in the struct, so this cast correctly gives us a
> +        // pointer to the field.
> +        let callback_head = inner.cast::<bindings::callback_head>();
> +        // SAFETY: This pointer offset operation does not go out-of-bounds.
> +        let file_field = unsafe { core::ptr::addr_of_mut!((*inner).file) };
> +
> +        let current = current.as_raw();
> +
> +        // SAFETY: This function currently has exclusive access to the `DeferredFdCloserInner`, so
> +        // it is okay for us to perform unsynchronized writes to its `callback_head` field.
> +        unsafe { bindings::init_task_work(callback_head, Some(Self::do_close_fd)) };
> +
> +        // SAFETY: This inserts the `DeferredFdCloserInner` into the task workqueue for the current
> +        // task. If this operation is successful, then this transfers exclusive ownership of the
> +        // `callback_head` field to the C side until it calls `do_close_fd`, and we don't touch or
> +        // invalidate the field during that time.
> +        //
> +        // When the C side calls `do_close_fd`, the safety requirements of that method are
> +        // satisfied because when a task work is executed, the callback is given ownership of the
> +        // pointer.
> +        //
> +        // The file pointer is currently null. If it is changed to be non-null before `do_close_fd`
> +        // is called, then that change happens due to the write at the end of this function, and
> +        // that write has a safety comment that explains why the refcount can be dropped when
> +        // `do_close_fd` runs.
> +        let res = unsafe { bindings::task_work_add(current, callback_head, TWA_RESUME) };
> +
> +        if res != 0 {
> +            // SAFETY: Scheduling the task work failed, so we still have ownership of the box, so
> +            // we may destroy it.
> +            unsafe { drop(Box::from_raw(inner)) };
> +
> +            return Err(DeferredFdCloseError::TaskWorkUnavailable);
> +        }
> +
> +        // This removes the fd from the fd table in `current`. The file is not fully closed until
> +        // `filp_close` is called. We are given ownership of one refcount to the file.
> +        //
> +        // SAFETY: This is safe no matter what `fd` is. If the `fd` is valid (that is, if the
> +        // pointer is non-null), then we call `filp_close` on the returned pointer as required by
> +        // `file_close_fd`.
> +        let file = unsafe { bindings::file_close_fd(fd) };
> +        if file.is_null() {
> +            // We don't clean up the task work since that might be expensive if the task work queue
> +            // is long. Just let it execute and let it clean up for itself.
> +            return Err(DeferredFdCloseError::BadFd);
> +        }
> +
> +        // Acquire a second refcount to the file.
> +        //
> +        // SAFETY: The `file` pointer points at a file with a non-zero refcount.
> +        unsafe { bindings::get_file(file) };
> +
> +        // This method closes the fd, consuming one of our two refcounts. There could be active
> +        // light refcounts created from that fd, so we must ensure that the file has a positive
> +        // refcount for the duration of those active light refcounts. We do that by holding on to
> +        // the second refcount until the current task returns to userspace.
> +        //
> +        // SAFETY: The `file` pointer is valid. Passing `current->files` as the file table to close
> +        // it in is correct, since we just got the `fd` from `file_close_fd` which also uses
> +        // `current->files`.
> +        //
> +        // Note: fl_owner_t is currently a void pointer.
> +        unsafe { bindings::filp_close(file, (*current).files as bindings::fl_owner_t) };
> +
> +        // We update the file pointer that the task work is supposed to fput. This transfers
> +        // ownership of our last refcount.
> +        //
> +        // INVARIANT: This changes the `file` field of a `DeferredFdCloserInner` from null to
> +        // non-null. This doesn't break the type invariant for `DeferredFdCloserInner` because we
> +        // still own a refcount to the file, so we can pass ownership of that refcount to the
> +        // `DeferredFdCloserInner`.
> +        //
> +        // When `do_close_fd` runs, it must be safe for it to `fput` the refcount. However, this is
> +        // the case because all light refcounts that are associated with the fd we closed
> +        // previously must be dropped when `do_close_fd`, since light refcounts must be dropped
> +        // before returning to userspace.
> +        //
> +        // SAFETY: Task works are executed on the current thread right before we return to
> +        // userspace, so this write is guaranteed to happen before `do_close_fd` is called, which
> +        // means that a race is not possible here.
> +        unsafe { *file_field = file };
> +
> +        Ok(())
> +    }

This documentation is quite a work of art :)
Christian Brauner March 20, 2024, 2:21 p.m. UTC | #2
On Fri, Feb 09, 2024 at 11:18:21AM +0000, Alice Ryhl wrote:
> To close an fd from kernel space, we could call `ksys_close`. However,
> if we do this to an fd that is held using `fdget`, then we may trigger a
> use-after-free. Introduce a helper that can be used to close an fd even
> if the fd is currently held with `fdget`. This is done by grabbing an
> extra refcount to the file and dropping it in a task work once we return
> to userspace.
> 
> This is necessary for Rust Binder because otherwise the user might try
> to have Binder close its fd for /dev/binder, which would cause problems
> as this happens inside an ioctl on /dev/binder, and ioctls hold the fd
> using `fdget`.
> 
> Additional motivation can be found in commit 80cd795630d6 ("binder: fix
> use-after-free due to ksys_close() during fdget()") and in the comments
> on `binder_do_fd_close`.
> 
> If there is some way to detect whether an fd is currently held with
> `fdget`, then this could be optimized to skip the allocation and task
> work when this is not the case. Another possible optimization would be
> to combine several fds into a single task work, since this is used with
> fd arrays that might hold several fds.
> 
> That said, it might not be necessary to optimize it, because Rust Binder
> has two ways to send fds: BINDER_TYPE_FD and BINDER_TYPE_FDA. With
> BINDER_TYPE_FD, it is userspace's responsibility to close the fd, so
> this mechanism is used only by BINDER_TYPE_FDA, but fd arrays are used
> rarely these days.
> 
> Reviewed-by: Benno Lossin <benno.lossin@proton.me>
> Reviewed-by: Martin Rodriguez Reboredo <yakoyoku@gmail.com>
> Signed-off-by: Alice Ryhl <aliceryhl@google.com>
> ---
>  rust/bindings/bindings_helper.h |   2 +
>  rust/helpers.c                  |   8 ++
>  rust/kernel/file.rs             | 184 +++++++++++++++++++++++++++++++++++++++-
>  rust/kernel/task.rs             |  14 +++
>  4 files changed, 207 insertions(+), 1 deletion(-)
> 
> diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
> index 4194b057ef6b..f4d9d04333c0 100644
> --- a/rust/bindings/bindings_helper.h
> +++ b/rust/bindings/bindings_helper.h
> @@ -10,6 +10,7 @@
>  #include <linux/cred.h>
>  #include <linux/errname.h>
>  #include <linux/ethtool.h>
> +#include <linux/fdtable.h>
>  #include <linux/file.h>
>  #include <linux/fs.h>
>  #include <linux/jiffies.h>
> @@ -21,6 +22,7 @@
>  #include <linux/refcount.h>
>  #include <linux/wait.h>
>  #include <linux/sched.h>
> +#include <linux/task_work.h>
>  #include <linux/workqueue.h>
>  
>  /* `bindgen` gets confused at certain things. */
> diff --git a/rust/helpers.c b/rust/helpers.c
> index 58e3a9dff349..d146bbf25aec 100644
> --- a/rust/helpers.c
> +++ b/rust/helpers.c
> @@ -32,6 +32,7 @@
>  #include <linux/sched/signal.h>
>  #include <linux/security.h>
>  #include <linux/spinlock.h>
> +#include <linux/task_work.h>
>  #include <linux/wait.h>
>  #include <linux/workqueue.h>
>  
> @@ -243,6 +244,13 @@ void rust_helper_security_release_secctx(char *secdata, u32 seclen)
>  EXPORT_SYMBOL_GPL(rust_helper_security_release_secctx);
>  #endif
>  
> +void rust_helper_init_task_work(struct callback_head *twork,
> +				task_work_func_t func)
> +{
> +	init_task_work(twork, func);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_init_task_work);
> +
>  /*
>   * `bindgen` binds the C `size_t` type as the Rust `usize` type, so we can
>   * use it in contexts where Rust expects a `usize` like slice (array) indices.
> diff --git a/rust/kernel/file.rs b/rust/kernel/file.rs
> index fb903b7f23fe..8902f490ccc8 100644
> --- a/rust/kernel/file.rs
> +++ b/rust/kernel/file.rs
> @@ -11,7 +11,8 @@
>      error::{code::*, Error, Result},
>      types::{ARef, AlwaysRefCounted, NotThreadSafe, Opaque},
>  };
> -use core::ptr;
> +use alloc::boxed::Box;
> +use core::{alloc::AllocError, mem, ptr};
>  
>  /// Flags associated with a [`File`].
>  pub mod flags {
> @@ -318,6 +319,187 @@ fn drop(&mut self) {
>      }
>  }
>  
> +/// Helper used for closing file descriptors in a way that is safe even if the file is currently
> +/// held using `fdget`.
> +///
> +/// Additional motivation can be found in commit 80cd795630d6 ("binder: fix use-after-free due to
> +/// ksys_close() during fdget()") and in the comments on `binder_do_fd_close`.
> +pub struct DeferredFdCloser {
> +    inner: Box<DeferredFdCloserInner>,
> +}
> +
> +/// SAFETY: This just holds an allocation with no real content, so there's no safety issue with
> +/// moving it across threads.
> +unsafe impl Send for DeferredFdCloser {}
> +unsafe impl Sync for DeferredFdCloser {}
> +
> +/// # Invariants
> +///
> +/// If the `file` pointer is non-null, then it points at a `struct file` and owns a refcount to
> +/// that file.
> +#[repr(C)]
> +struct DeferredFdCloserInner {
> +    twork: mem::MaybeUninit<bindings::callback_head>,
> +    file: *mut bindings::file,
> +}
> +
> +impl DeferredFdCloser {

So the explicitly deferred close is due to how binder works so it's not
much of a general purpose interface as I don't recall having other
codepaths with similar problems. So this should live in the binder
specific rust code imo.

> +    /// Create a new [`DeferredFdCloser`].
> +    pub fn new() -> Result<Self, AllocError> {
> +        Ok(Self {
> +            // INVARIANT: The `file` pointer is null, so the type invariant does not apply.
> +            inner: Box::try_new(DeferredFdCloserInner {
> +                twork: mem::MaybeUninit::uninit(),
> +                file: core::ptr::null_mut(),
> +            })?,
> +        })
> +    }
> +
> +    /// Schedule a task work that closes the file descriptor when this task returns to userspace.
> +    ///
> +    /// Fails if this is called from a context where we cannot run work when returning to
> +    /// userspace. (E.g., from a kthread.)
> +    pub fn close_fd(self, fd: u32) -> Result<(), DeferredFdCloseError> {
> +        use bindings::task_work_notify_mode_TWA_RESUME as TWA_RESUME;
> +
> +        // In this method, we schedule the task work before closing the file. This is because
> +        // scheduling a task work is fallible, and we need to know whether it will fail before we
> +        // attempt to close the file.
> +
> +        // Task works are not available on kthreads.
> +        let current = crate::current!();
> +        if current.is_kthread() {
> +            return Err(DeferredFdCloseError::TaskWorkUnavailable);
> +        }
> +
> +        // Transfer ownership of the box's allocation to a raw pointer. This disables the
> +        // destructor, so we must manually convert it back to a Box to drop it.
> +        //
> +        // Until we convert it back to a `Box`, there are no aliasing requirements on this
> +        // pointer.
> +        let inner = Box::into_raw(self.inner);
> +
> +        // The `callback_head` field is first in the struct, so this cast correctly gives us a
> +        // pointer to the field.
> +        let callback_head = inner.cast::<bindings::callback_head>();
> +        // SAFETY: This pointer offset operation does not go out-of-bounds.
> +        let file_field = unsafe { core::ptr::addr_of_mut!((*inner).file) };
> +
> +        let current = current.as_raw();
> +
> +        // SAFETY: This function currently has exclusive access to the `DeferredFdCloserInner`, so
> +        // it is okay for us to perform unsynchronized writes to its `callback_head` field.
> +        unsafe { bindings::init_task_work(callback_head, Some(Self::do_close_fd)) };
> +
> +        // SAFETY: This inserts the `DeferredFdCloserInner` into the task workqueue for the current
> +        // task. If this operation is successful, then this transfers exclusive ownership of the
> +        // `callback_head` field to the C side until it calls `do_close_fd`, and we don't touch or
> +        // invalidate the field during that time.
> +        //
> +        // When the C side calls `do_close_fd`, the safety requirements of that method are
> +        // satisfied because when a task work is executed, the callback is given ownership of the
> +        // pointer.
> +        //
> +        // The file pointer is currently null. If it is changed to be non-null before `do_close_fd`
> +        // is called, then that change happens due to the write at the end of this function, and
> +        // that write has a safety comment that explains why the refcount can be dropped when
> +        // `do_close_fd` runs.
> +        let res = unsafe { bindings::task_work_add(current, callback_head, TWA_RESUME) };
> +
> +        if res != 0 {
> +            // SAFETY: Scheduling the task work failed, so we still have ownership of the box, so
> +            // we may destroy it.
> +            unsafe { drop(Box::from_raw(inner)) };
> +
> +            return Err(DeferredFdCloseError::TaskWorkUnavailable);
> +        }
> +
> +        // This removes the fd from the fd table in `current`. The file is not fully closed until
> +        // `filp_close` is called. We are given ownership of one refcount to the file.
> +        //
> +        // SAFETY: This is safe no matter what `fd` is. If the `fd` is valid (that is, if the
> +        // pointer is non-null), then we call `filp_close` on the returned pointer as required by
> +        // `file_close_fd`.
> +        let file = unsafe { bindings::file_close_fd(fd) };
> +        if file.is_null() {
> +            // We don't clean up the task work since that might be expensive if the task work queue
> +            // is long. Just let it execute and let it clean up for itself.
> +            return Err(DeferredFdCloseError::BadFd);
> +        }
> +
> +        // Acquire a second refcount to the file.
> +        //
> +        // SAFETY: The `file` pointer points at a file with a non-zero refcount.
> +        unsafe { bindings::get_file(file) };
> +
> +        // This method closes the fd, consuming one of our two refcounts. There could be active
> +        // light refcounts created from that fd, so we must ensure that the file has a positive
> +        // refcount for the duration of those active light refcounts. We do that by holding on to
> +        // the second refcount until the current task returns to userspace.
> +        //
> +        // SAFETY: The `file` pointer is valid. Passing `current->files` as the file table to close
> +        // it in is correct, since we just got the `fd` from `file_close_fd` which also uses
> +        // `current->files`.
> +        //
> +        // Note: fl_owner_t is currently a void pointer.
> +        unsafe { bindings::filp_close(file, (*current).files as bindings::fl_owner_t) };
> +
> +        // We update the file pointer that the task work is supposed to fput. This transfers
> +        // ownership of our last refcount.
> +        //
> +        // INVARIANT: This changes the `file` field of a `DeferredFdCloserInner` from null to
> +        // non-null. This doesn't break the type invariant for `DeferredFdCloserInner` because we
> +        // still own a refcount to the file, so we can pass ownership of that refcount to the
> +        // `DeferredFdCloserInner`.
> +        //
> +        // When `do_close_fd` runs, it must be safe for it to `fput` the refcount. However, this is
> +        // the case because all light refcounts that are associated with the fd we closed
> +        // previously must be dropped when `do_close_fd`, since light refcounts must be dropped
> +        // before returning to userspace.
> +        //
> +        // SAFETY: Task works are executed on the current thread right before we return to
> +        // userspace, so this write is guaranteed to happen before `do_close_fd` is called, which
> +        // means that a race is not possible here.
> +        unsafe { *file_field = file };
> +
> +        Ok(())
> +    }
> +
> +    /// # Safety
> +    ///
> +    /// The provided pointer must point at the `twork` field of a `DeferredFdCloserInner` stored in
> +    /// a `Box`, and the caller must pass exclusive ownership of that `Box`. Furthermore, if the
> +    /// file pointer is non-null, then it must be okay to release the refcount by calling `fput`.
> +    unsafe extern "C" fn do_close_fd(inner: *mut bindings::callback_head) {
> +        // SAFETY: The caller just passed us ownership of this box.
> +        let inner = unsafe { Box::from_raw(inner.cast::<DeferredFdCloserInner>()) };
> +        if !inner.file.is_null() {
> +            // SAFETY: By the type invariants, we own a refcount to this file, and the caller
> +            // guarantees that dropping the refcount now is okay.
> +            unsafe { bindings::fput(inner.file) };
> +        }
> +        // The allocation is freed when `inner` goes out of scope.
> +    }
> +}
> +
> +/// Represents a failure to close an fd in a deferred manner.
> +#[derive(Copy, Clone, Debug, Eq, PartialEq)]
> +pub enum DeferredFdCloseError {
> +    /// Closing the fd failed because we were unable to schedule a task work.
> +    TaskWorkUnavailable,
> +    /// Closing the fd failed because the fd does not exist.
> +    BadFd,
> +}
> +
> +impl From<DeferredFdCloseError> for Error {
> +    fn from(err: DeferredFdCloseError) -> Error {
> +        match err {
> +            DeferredFdCloseError::TaskWorkUnavailable => ESRCH,
> +            DeferredFdCloseError::BadFd => EBADF,
> +        }
> +    }
> +}
> +
>  /// Represents the `EBADF` error code.
>  ///
>  /// Used for methods that can only fail with `EBADF`.
> diff --git a/rust/kernel/task.rs b/rust/kernel/task.rs
> index f46ea3ba9e8e..6adcd8ffcfde 100644
> --- a/rust/kernel/task.rs
> +++ b/rust/kernel/task.rs
> @@ -145,6 +145,12 @@ fn deref(&self) -> &Self::Target {
>          }
>      }
>  
> +    /// Returns a raw pointer to the task.
> +    #[inline]
> +    pub fn as_raw(&self) -> *mut bindings::task_struct {
> +        self.0.get()
> +    }
> +
>      /// Returns the group leader of the given task.
>      pub fn group_leader(&self) -> &Task {
>          // SAFETY: By the type invariant, we know that `self.0` is a valid task. Valid tasks always
> @@ -189,6 +195,14 @@ pub fn pid_in_current_ns(&self) -> Pid {
>          unsafe { bindings::task_tgid_nr_ns(self.0.get(), ptr::null_mut()) }
>      }
>  
> +    /// Returns whether this task corresponds to a kernel thread.
> +    pub fn is_kthread(&self) -> bool {
> +        // SAFETY: By the type invariant, we know that `self.0.get()` is non-null and valid. There
> +        // are no further requirements to read the task's flags.
> +        let flags = unsafe { (*self.0.get()).flags };
> +        (flags & bindings::PF_KTHREAD) != 0
> +    }
> +
>      /// Wakes up the task.
>      pub fn wake_up(&self) {
>          // SAFETY: By the type invariant, we know that `self.0.get()` is non-null and valid.
> 
> -- 
> 2.43.0.687.g38aa6559b0-goog
>
Alice Ryhl March 21, 2024, 1:28 p.m. UTC | #3
On Wed, Mar 20, 2024 at 3:22 PM Christian Brauner <brauner@kernel.org> wrote:
>
> On Fri, Feb 09, 2024 at 11:18:21AM +0000, Alice Ryhl wrote:
> >
> > +/// Helper used for closing file descriptors in a way that is safe even if the file is currently
> > +/// held using `fdget`.
> > +///
> > +/// Additional motivation can be found in commit 80cd795630d6 ("binder: fix use-after-free due to
> > +/// ksys_close() during fdget()") and in the comments on `binder_do_fd_close`.
> > +pub struct DeferredFdCloser {
> > +    inner: Box<DeferredFdCloserInner>,
> > +}
> > +
> > +/// SAFETY: This just holds an allocation with no real content, so there's no safety issue with
> > +/// moving it across threads.
> > +unsafe impl Send for DeferredFdCloser {}
> > +unsafe impl Sync for DeferredFdCloser {}
> > +
> > +/// # Invariants
> > +///
> > +/// If the `file` pointer is non-null, then it points at a `struct file` and owns a refcount to
> > +/// that file.
> > +#[repr(C)]
> > +struct DeferredFdCloserInner {
> > +    twork: mem::MaybeUninit<bindings::callback_head>,
> > +    file: *mut bindings::file,
> > +}
> > +
> > +impl DeferredFdCloser {
>
> So the explicitly deferred close is due to how binder works so it's not
> much of a general purpose interface as I don't recall having other
> codepaths with similar problems. So this should live in the binder
> specific rust code imo.

Hmm. Are there really no other ioctls that call ksys_close on a
user-provided fd?

As far as I can tell, this kind of deferred API is the only way for us
to provide a fully safe Rust api for closing an fd. Directly calling
ksys_close must unsafely assert that the fd does not have an active
fdget call. So it makes sense to me as an API that others might want
to use.

Still I can move it elsewhere if necessary.

Alice
Christian Brauner March 31, 2024, 10:26 a.m. UTC | #4
On Thu, Mar 21, 2024 at 02:28:15PM +0100, Alice Ryhl wrote:
> On Wed, Mar 20, 2024 at 3:22 PM Christian Brauner <brauner@kernel.org> wrote:
> >
> > On Fri, Feb 09, 2024 at 11:18:21AM +0000, Alice Ryhl wrote:
> > >
> > > +/// Helper used for closing file descriptors in a way that is safe even if the file is currently
> > > +/// held using `fdget`.
> > > +///
> > > +/// Additional motivation can be found in commit 80cd795630d6 ("binder: fix use-after-free due to
> > > +/// ksys_close() during fdget()") and in the comments on `binder_do_fd_close`.
> > > +pub struct DeferredFdCloser {
> > > +    inner: Box<DeferredFdCloserInner>,
> > > +}
> > > +
> > > +/// SAFETY: This just holds an allocation with no real content, so there's no safety issue with
> > > +/// moving it across threads.
> > > +unsafe impl Send for DeferredFdCloser {}
> > > +unsafe impl Sync for DeferredFdCloser {}
> > > +
> > > +/// # Invariants
> > > +///
> > > +/// If the `file` pointer is non-null, then it points at a `struct file` and owns a refcount to
> > > +/// that file.
> > > +#[repr(C)]
> > > +struct DeferredFdCloserInner {
> > > +    twork: mem::MaybeUninit<bindings::callback_head>,
> > > +    file: *mut bindings::file,
> > > +}
> > > +
> > > +impl DeferredFdCloser {
> >
> > So the explicitly deferred close is due to how binder works so it's not
> > much of a general purpose interface as I don't recall having other
> > codepaths with similar problems. So this should live in the binder
> > specific rust code imo.
> 
> Hmm. Are there really no other ioctls that call ksys_close on a
> user-provided fd?

No, I don't think there are otherwise they would have the same bug that
binder had. io_uring comes closest but they have their own task work
and deferred close implementation.

> 
> As far as I can tell, this kind of deferred API is the only way for us
> to provide a fully safe Rust api for closing an fd. Directly calling
> ksys_close must unsafely assert that the fd does not have an active
> fdget call. So it makes sense to me as an API that others might want
> to use.

I'm sorry, I don't quite understand you here.

What binder is doing iirc is that it's performing an ioctl() using a fd
to /dev/binderfs/$device-node or similar. The fatal flaw is that binder
now allows that fd to be closed during that ioctl - and by accident at
that. It's effectively closing a file it's relying on to not go away.
That's the original nonsense/creativity that necessitates this whole
manual task work shenanigans binder is doing. Unless I misremember.

But that's really frowned upon generally and absolutely not encouraged
by providing a generic interface for this stuff.

Sure, we have some users in the kernel that do stuff like call
close_fd() on a file descriptor they just installed into their file
descriptor table. That's usually bad design with maybe a few valid
exceptions. One example where that's still done and should be fixed is
e.g., cachefiles in fs/cachefiles/ondemand.c. The point is that they at
least close a file descriptor that they just installed and that they
don't rely on being valid.

So really, I need more details why without this interface it would
prevent Rust from implementing safe file descriptor closing because
right now all I see is a design flaw in binder being promoted to a
generic interface. And unless there's detailed reasons for it's
existence we're not going to do it.
diff mbox series

Patch

diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
index 4194b057ef6b..f4d9d04333c0 100644
--- a/rust/bindings/bindings_helper.h
+++ b/rust/bindings/bindings_helper.h
@@ -10,6 +10,7 @@ 
 #include <linux/cred.h>
 #include <linux/errname.h>
 #include <linux/ethtool.h>
+#include <linux/fdtable.h>
 #include <linux/file.h>
 #include <linux/fs.h>
 #include <linux/jiffies.h>
@@ -21,6 +22,7 @@ 
 #include <linux/refcount.h>
 #include <linux/wait.h>
 #include <linux/sched.h>
+#include <linux/task_work.h>
 #include <linux/workqueue.h>
 
 /* `bindgen` gets confused at certain things. */
diff --git a/rust/helpers.c b/rust/helpers.c
index 58e3a9dff349..d146bbf25aec 100644
--- a/rust/helpers.c
+++ b/rust/helpers.c
@@ -32,6 +32,7 @@ 
 #include <linux/sched/signal.h>
 #include <linux/security.h>
 #include <linux/spinlock.h>
+#include <linux/task_work.h>
 #include <linux/wait.h>
 #include <linux/workqueue.h>
 
@@ -243,6 +244,13 @@  void rust_helper_security_release_secctx(char *secdata, u32 seclen)
 EXPORT_SYMBOL_GPL(rust_helper_security_release_secctx);
 #endif
 
+void rust_helper_init_task_work(struct callback_head *twork,
+				task_work_func_t func)
+{
+	init_task_work(twork, func);
+}
+EXPORT_SYMBOL_GPL(rust_helper_init_task_work);
+
 /*
  * `bindgen` binds the C `size_t` type as the Rust `usize` type, so we can
  * use it in contexts where Rust expects a `usize` like slice (array) indices.
diff --git a/rust/kernel/file.rs b/rust/kernel/file.rs
index fb903b7f23fe..8902f490ccc8 100644
--- a/rust/kernel/file.rs
+++ b/rust/kernel/file.rs
@@ -11,7 +11,8 @@ 
     error::{code::*, Error, Result},
     types::{ARef, AlwaysRefCounted, NotThreadSafe, Opaque},
 };
-use core::ptr;
+use alloc::boxed::Box;
+use core::{alloc::AllocError, mem, ptr};
 
 /// Flags associated with a [`File`].
 pub mod flags {
@@ -318,6 +319,187 @@  fn drop(&mut self) {
     }
 }
 
+/// Helper used for closing file descriptors in a way that is safe even if the file is currently
+/// held using `fdget`.
+///
+/// Additional motivation can be found in commit 80cd795630d6 ("binder: fix use-after-free due to
+/// ksys_close() during fdget()") and in the comments on `binder_do_fd_close`.
+pub struct DeferredFdCloser {
+    inner: Box<DeferredFdCloserInner>,
+}
+
+/// SAFETY: This just holds an allocation with no real content, so there's no safety issue with
+/// moving it across threads.
+unsafe impl Send for DeferredFdCloser {}
+unsafe impl Sync for DeferredFdCloser {}
+
+/// # Invariants
+///
+/// If the `file` pointer is non-null, then it points at a `struct file` and owns a refcount to
+/// that file.
+#[repr(C)]
+struct DeferredFdCloserInner {
+    twork: mem::MaybeUninit<bindings::callback_head>,
+    file: *mut bindings::file,
+}
+
+impl DeferredFdCloser {
+    /// Create a new [`DeferredFdCloser`].
+    pub fn new() -> Result<Self, AllocError> {
+        Ok(Self {
+            // INVARIANT: The `file` pointer is null, so the type invariant does not apply.
+            inner: Box::try_new(DeferredFdCloserInner {
+                twork: mem::MaybeUninit::uninit(),
+                file: core::ptr::null_mut(),
+            })?,
+        })
+    }
+
+    /// Schedule a task work that closes the file descriptor when this task returns to userspace.
+    ///
+    /// Fails if this is called from a context where we cannot run work when returning to
+    /// userspace. (E.g., from a kthread.)
+    pub fn close_fd(self, fd: u32) -> Result<(), DeferredFdCloseError> {
+        use bindings::task_work_notify_mode_TWA_RESUME as TWA_RESUME;
+
+        // In this method, we schedule the task work before closing the file. This is because
+        // scheduling a task work is fallible, and we need to know whether it will fail before we
+        // attempt to close the file.
+
+        // Task works are not available on kthreads.
+        let current = crate::current!();
+        if current.is_kthread() {
+            return Err(DeferredFdCloseError::TaskWorkUnavailable);
+        }
+
+        // Transfer ownership of the box's allocation to a raw pointer. This disables the
+        // destructor, so we must manually convert it back to a Box to drop it.
+        //
+        // Until we convert it back to a `Box`, there are no aliasing requirements on this
+        // pointer.
+        let inner = Box::into_raw(self.inner);
+
+        // The `callback_head` field is first in the struct, so this cast correctly gives us a
+        // pointer to the field.
+        let callback_head = inner.cast::<bindings::callback_head>();
+        // SAFETY: This pointer offset operation does not go out-of-bounds.
+        let file_field = unsafe { core::ptr::addr_of_mut!((*inner).file) };
+
+        let current = current.as_raw();
+
+        // SAFETY: This function currently has exclusive access to the `DeferredFdCloserInner`, so
+        // it is okay for us to perform unsynchronized writes to its `callback_head` field.
+        unsafe { bindings::init_task_work(callback_head, Some(Self::do_close_fd)) };
+
+        // SAFETY: This inserts the `DeferredFdCloserInner` into the task workqueue for the current
+        // task. If this operation is successful, then this transfers exclusive ownership of the
+        // `callback_head` field to the C side until it calls `do_close_fd`, and we don't touch or
+        // invalidate the field during that time.
+        //
+        // When the C side calls `do_close_fd`, the safety requirements of that method are
+        // satisfied because when a task work is executed, the callback is given ownership of the
+        // pointer.
+        //
+        // The file pointer is currently null. If it is changed to be non-null before `do_close_fd`
+        // is called, then that change happens due to the write at the end of this function, and
+        // that write has a safety comment that explains why the refcount can be dropped when
+        // `do_close_fd` runs.
+        let res = unsafe { bindings::task_work_add(current, callback_head, TWA_RESUME) };
+
+        if res != 0 {
+            // SAFETY: Scheduling the task work failed, so we still have ownership of the box, so
+            // we may destroy it.
+            unsafe { drop(Box::from_raw(inner)) };
+
+            return Err(DeferredFdCloseError::TaskWorkUnavailable);
+        }
+
+        // This removes the fd from the fd table in `current`. The file is not fully closed until
+        // `filp_close` is called. We are given ownership of one refcount to the file.
+        //
+        // SAFETY: This is safe no matter what `fd` is. If the `fd` is valid (that is, if the
+        // pointer is non-null), then we call `filp_close` on the returned pointer as required by
+        // `file_close_fd`.
+        let file = unsafe { bindings::file_close_fd(fd) };
+        if file.is_null() {
+            // We don't clean up the task work since that might be expensive if the task work queue
+            // is long. Just let it execute and let it clean up for itself.
+            return Err(DeferredFdCloseError::BadFd);
+        }
+
+        // Acquire a second refcount to the file.
+        //
+        // SAFETY: The `file` pointer points at a file with a non-zero refcount.
+        unsafe { bindings::get_file(file) };
+
+        // This method closes the fd, consuming one of our two refcounts. There could be active
+        // light refcounts created from that fd, so we must ensure that the file has a positive
+        // refcount for the duration of those active light refcounts. We do that by holding on to
+        // the second refcount until the current task returns to userspace.
+        //
+        // SAFETY: The `file` pointer is valid. Passing `current->files` as the file table to close
+        // it in is correct, since we just got the `fd` from `file_close_fd` which also uses
+        // `current->files`.
+        //
+        // Note: fl_owner_t is currently a void pointer.
+        unsafe { bindings::filp_close(file, (*current).files as bindings::fl_owner_t) };
+
+        // We update the file pointer that the task work is supposed to fput. This transfers
+        // ownership of our last refcount.
+        //
+        // INVARIANT: This changes the `file` field of a `DeferredFdCloserInner` from null to
+        // non-null. This doesn't break the type invariant for `DeferredFdCloserInner` because we
+        // still own a refcount to the file, so we can pass ownership of that refcount to the
+        // `DeferredFdCloserInner`.
+        //
+        // When `do_close_fd` runs, it must be safe for it to `fput` the refcount. However, this is
+        // the case because all light refcounts that are associated with the fd we closed
+        // previously must be dropped when `do_close_fd`, since light refcounts must be dropped
+        // before returning to userspace.
+        //
+        // SAFETY: Task works are executed on the current thread right before we return to
+        // userspace, so this write is guaranteed to happen before `do_close_fd` is called, which
+        // means that a race is not possible here.
+        unsafe { *file_field = file };
+
+        Ok(())
+    }
+
+    /// # Safety
+    ///
+    /// The provided pointer must point at the `twork` field of a `DeferredFdCloserInner` stored in
+    /// a `Box`, and the caller must pass exclusive ownership of that `Box`. Furthermore, if the
+    /// file pointer is non-null, then it must be okay to release the refcount by calling `fput`.
+    unsafe extern "C" fn do_close_fd(inner: *mut bindings::callback_head) {
+        // SAFETY: The caller just passed us ownership of this box.
+        let inner = unsafe { Box::from_raw(inner.cast::<DeferredFdCloserInner>()) };
+        if !inner.file.is_null() {
+            // SAFETY: By the type invariants, we own a refcount to this file, and the caller
+            // guarantees that dropping the refcount now is okay.
+            unsafe { bindings::fput(inner.file) };
+        }
+        // The allocation is freed when `inner` goes out of scope.
+    }
+}
+
+/// Represents a failure to close an fd in a deferred manner.
+#[derive(Copy, Clone, Debug, Eq, PartialEq)]
+pub enum DeferredFdCloseError {
+    /// Closing the fd failed because we were unable to schedule a task work.
+    TaskWorkUnavailable,
+    /// Closing the fd failed because the fd does not exist.
+    BadFd,
+}
+
+impl From<DeferredFdCloseError> for Error {
+    fn from(err: DeferredFdCloseError) -> Error {
+        match err {
+            DeferredFdCloseError::TaskWorkUnavailable => ESRCH,
+            DeferredFdCloseError::BadFd => EBADF,
+        }
+    }
+}
+
 /// Represents the `EBADF` error code.
 ///
 /// Used for methods that can only fail with `EBADF`.
diff --git a/rust/kernel/task.rs b/rust/kernel/task.rs
index f46ea3ba9e8e..6adcd8ffcfde 100644
--- a/rust/kernel/task.rs
+++ b/rust/kernel/task.rs
@@ -145,6 +145,12 @@  fn deref(&self) -> &Self::Target {
         }
     }
 
+    /// Returns a raw pointer to the task.
+    #[inline]
+    pub fn as_raw(&self) -> *mut bindings::task_struct {
+        self.0.get()
+    }
+
     /// Returns the group leader of the given task.
     pub fn group_leader(&self) -> &Task {
         // SAFETY: By the type invariant, we know that `self.0` is a valid task. Valid tasks always
@@ -189,6 +195,14 @@  pub fn pid_in_current_ns(&self) -> Pid {
         unsafe { bindings::task_tgid_nr_ns(self.0.get(), ptr::null_mut()) }
     }
 
+    /// Returns whether this task corresponds to a kernel thread.
+    pub fn is_kthread(&self) -> bool {
+        // SAFETY: By the type invariant, we know that `self.0.get()` is non-null and valid. There
+        // are no further requirements to read the task's flags.
+        let flags = unsafe { (*self.0.get()).flags };
+        (flags & bindings::PF_KTHREAD) != 0
+    }
+
     /// Wakes up the task.
     pub fn wake_up(&self) {
         // SAFETY: By the type invariant, we know that `self.0.get()` is non-null and valid.