diff mbox series

[v2,4/6] userfaultfd: update documentation to describe /dev/userfaultfd

Message ID 20220422212945.2227722-5-axelrasmussen@google.com (mailing list archive)
State New, archived
Series userfaultfd: add /dev/userfaultfd for fine grained access control

Commit Message

Axel Rasmussen April 22, 2022, 9:29 p.m. UTC
Explain the different ways to create a new userfaultfd, and how access
control works for each way.

Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 Documentation/admin-guide/mm/userfaultfd.rst | 38 ++++++++++++++++++--
 Documentation/admin-guide/sysctl/vm.rst      |  3 ++
 2 files changed, 39 insertions(+), 2 deletions(-)

Comments

Shuah Khan April 26, 2022, 4:46 p.m. UTC | #1
On 4/22/22 3:29 PM, Axel Rasmussen wrote:
> Explain the different ways to create a new userfaultfd, and how access
> control works for each way.
> 
> Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
> ---
>   Documentation/admin-guide/mm/userfaultfd.rst | 38 ++++++++++++++++++--
>   Documentation/admin-guide/sysctl/vm.rst      |  3 ++
>   2 files changed, 39 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
> index 6528036093e1..4c079b5377d4 100644
> --- a/Documentation/admin-guide/mm/userfaultfd.rst
> +++ b/Documentation/admin-guide/mm/userfaultfd.rst
> @@ -17,7 +17,10 @@ of the ``PROT_NONE+SIGSEGV`` trick.
>   Design
>   ======
>   
> -Userfaults are delivered and resolved through the ``userfaultfd`` syscall.

Please keep this sentence in there and rephrase it to indicate how it was
done in the past.

Also explain here why this new approach is better than the syscall approach
before getting into the below details.

> +Userspace creates a new userfaultfd, initializes it, and registers one or more
> +regions of virtual memory with it. Then, any page faults which occur within the
> +region(s) result in a message being delivered to the userfaultfd, notifying
> +userspace of the fault.
>   
>   The ``userfaultfd`` (aside from registering and unregistering virtual
>   memory ranges) provides two primary functionalities:
> @@ -39,7 +42,7 @@ Vmas are not suitable for page- (or hugepage) granular fault tracking
>   when dealing with virtual address spaces that could span
>   Terabytes. Too many vmas would be needed for that.
>   
> -The ``userfaultfd`` once opened by invoking the syscall, can also be
> +The ``userfaultfd``, once created, can also be

This sentence is too short and would look odd. Combine the sentences
so it renders well in the generated doc.

>   passed using unix domain sockets to a manager process, so the same
>   manager process could handle the userfaults of a multitude of
>   different processes without them being aware about what is going on
> @@ -50,6 +53,37 @@ is a corner case that would currently return ``-EBUSY``).
>   API
>   ===
>   
> +Creating a userfaultfd
> +----------------------
> +
> +There are two mechanisms to create a userfaultfd. There are various ways to
> +restrict this too, since userfaultfds which handle kernel page faults have
> +historically been a useful tool for exploiting the kernel.
> +
> +The first is the userfaultfd(2) syscall. Access to this is controlled in several
> +ways:
> +
> +- By default, the userfaultfd will be able to handle kernel page faults. This
> +  can be disabled by passing in UFFD_USER_MODE_ONLY.
> +
> +- If vm.unprivileged_userfaultfd is 0, then the caller must *either* have
> +  CAP_SYS_PTRACE, or pass in UFFD_USER_MODE_ONLY.
> +
> +- If vm.unprivileged_userfaultfd is 1, then no particular privilege is needed to
> +  use this syscall, even if UFFD_USER_MODE_ONLY is *not* set.
> +
> +Alternatively, userfaultfds can be created by opening /dev/userfaultfd, and
> +issuing a USERFAULTFD_IOC_NEW ioctl to this device. Access to this device is

New ioctl? I thought we are moving away from using ioctls?

> +controlled via normal filesystem permissions (user/group/mode for example) - no
> +additional permission (capability/sysctl) is needed to be able to handle kernel
> +faults this way. This is useful because it allows e.g. a specific user or group
> +to be able to create kernel-fault-handling userfaultfds, without allowing it
> +more broadly, or granting more privileges in addition to that particular ability
> +(CAP_SYS_PTRACE). In other words, it allows permissions to be minimized.
> +
> +Initializing up a userfaultfd
> +------------------------
> +

This will very likely generate a doc warning - extend the dashes to the
entire length of the subtitle.

>   When first opened the ``userfaultfd`` must be enabled invoking the
>   ``UFFDIO_API`` ioctl specifying a ``uffdio_api.api`` value set to ``UFFD_API`` (or
>   a later API version) which will specify the ``read/POLLIN`` protocol
> diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> index f4804ce37c58..8682d5fbc8ea 100644
> --- a/Documentation/admin-guide/sysctl/vm.rst
> +++ b/Documentation/admin-guide/sysctl/vm.rst
> @@ -880,6 +880,9 @@ calls without any restrictions.
>   
>   The default value is 0.
>   
> +An alternative to this sysctl / the userfaultfd(2) syscall is to create
> +userfaultfds via /dev/userfaultfd. See
> +Documentation/admin-guide/mm/userfaultfd.rst.
>   
>   user_reserve_kbytes
>   ===================
> 

thanks,
-- Shuah
Axel Rasmussen May 19, 2022, 6:58 p.m. UTC | #2
On Tue, Apr 26, 2022 at 9:46 AM Shuah Khan <skhan@linuxfoundation.org> wrote:
>
> On 4/22/22 3:29 PM, Axel Rasmussen wrote:
> > Explain the different ways to create a new userfaultfd, and how access
> > control works for each way.
> >
> > Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
> > ---
> >   Documentation/admin-guide/mm/userfaultfd.rst | 38 ++++++++++++++++++--
> >   Documentation/admin-guide/sysctl/vm.rst      |  3 ++
> >   2 files changed, 39 insertions(+), 2 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
> > index 6528036093e1..4c079b5377d4 100644
> > --- a/Documentation/admin-guide/mm/userfaultfd.rst
> > +++ b/Documentation/admin-guide/mm/userfaultfd.rst
> > @@ -17,7 +17,10 @@ of the ``PROT_NONE+SIGSEGV`` trick.
> >   Design
> >   ======
> >
> > -Userfaults are delivered and resolved through the ``userfaultfd`` syscall.
>
> Please keep this sentence in there and rephrase it to indicate how it was
> done in the past.
>
> Also explain here why this new approach is better than the syscall approach
> before getting into the below details.

Hmm, so the old sentence I think was incorrect already. Notifications
of *the faults* aren't delivered and resolved through the syscall.
Rather, the syscall just gives you a file descriptor, and then
notification / resolution of faults happens through the file
descriptor, not through the syscall. So I think it needs to be
reworded in any case.

I think the overall structure of the doc as-is makes the most sense as
well - first explain how this will be used at a very high level, and
then go into the details (first how to create a userfaultfd, then how
to use it).

So, in the end I reworded the "Creating a userfaultfd" section, to
cover the two things you mentioned:

- Which is the "older" way and which is the "newer" way
- What the benefit of the newer way is

Hopefully this addresses the comment? I can tweak it more if needed.
In any case, thanks for taking a look at this series!

>
> > +Userspace creates a new userfaultfd, initializes it, and registers one or more
> > +regions of virtual memory with it. Then, any page faults which occur within the
> > +region(s) result in a message being delivered to the userfaultfd, notifying
> > +userspace of the fault.
> >
> >   The ``userfaultfd`` (aside from registering and unregistering virtual
> >   memory ranges) provides two primary functionalities:
> > @@ -39,7 +42,7 @@ Vmas are not suitable for page- (or hugepage) granular fault tracking
> >   when dealing with virtual address spaces that could span
> >   Terabytes. Too many vmas would be needed for that.
> >
> > -The ``userfaultfd`` once opened by invoking the syscall, can also be
> > +The ``userfaultfd``, once created, can also be
>
> This sentence is too short and would look odd. Combine the sentences
> so it renders well in the generated doc.

Not 100% sure I understood the concern, but I do think it makes sense
to move "Vmas are not suitable ..." up into the same paragraph with
the other sentence about scalability. I'll do this in v3 as it looks a
bit nicer. This leaves the "The userfaultfd, once created, ..." part
alone, though. I think s/once opened by invoking the syscall/once
created/ is correct, since there are now various ways to create it. I
also think that second comma technically should have been there even
in the previous version.

>
> >   passed using unix domain sockets to a manager process, so the same
> >   manager process could handle the userfaults of a multitude of
> >   different processes without them being aware about what is going on
> > @@ -50,6 +53,37 @@ is a corner case that would currently return ``-EBUSY``).
> >   API
> >   ===
> >
> > +Creating a userfaultfd
> > +----------------------
> > +
> > +There are two mechanisms to create a userfaultfd. There are various ways to
> > +restrict this too, since userfaultfds which handle kernel page faults have
> > +historically been a useful tool for exploiting the kernel.
> > +
> > +The first is the userfaultfd(2) syscall. Access to this is controlled in several
> > +ways:
> > +
> > +- By default, the userfaultfd will be able to handle kernel page faults. This
> > +  can be disabled by passing in UFFD_USER_MODE_ONLY.
> > +
> > +- If vm.unprivileged_userfaultfd is 0, then the caller must *either* have
> > +  CAP_SYS_PTRACE, or pass in UFFD_USER_MODE_ONLY.
> > +
> > +- If vm.unprivileged_userfaultfd is 1, then no particular privilege is needed to
> > +  use this syscall, even if UFFD_USER_MODE_ONLY is *not* set.
> > +
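
The three bullets above amount to a small decision rule. A minimal sketch of
that rule as a pure C function follows; it only models the documented
behaviour and is not kernel code. The helper name uffd_syscall_allowed and the
SKETCH_UFFD_USER_MODE_ONLY macro are hypothetical, introduced here for
illustration (the macro mirrors the value of UFFD_USER_MODE_ONLY so the sketch
compiles without kernel headers).

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: same value as UFFD_USER_MODE_ONLY in <linux/userfaultfd.h>,
 * redefined only so this sketch builds without kernel headers. */
#define SKETCH_UFFD_USER_MODE_ONLY 1

/*
 * Hypothetical helper (not a kernel function): returns true if a
 * userfaultfd(2) call with the given flags would pass the access checks
 * described in the bullets quoted above.
 */
bool uffd_syscall_allowed(int unprivileged_userfaultfd,
                          bool has_cap_sys_ptrace,
                          int flags)
{
    /* The sysctl is documented as taking the values 0 or 1. */
    assert(unprivileged_userfaultfd == 0 || unprivileged_userfaultfd == 1);

    /* sysctl == 1: no particular privilege is needed, whatever the flags. */
    if (unprivileged_userfaultfd == 1)
        return true;

    /* sysctl == 0: the caller needs CAP_SYS_PTRACE *or* must restrict the
     * userfaultfd to user-mode faults with UFFD_USER_MODE_ONLY. */
    return has_cap_sys_ptrace ||
           (flags & SKETCH_UFFD_USER_MODE_ONLY) != 0;
}
```

For example, with the sysctl at 0 and no CAP_SYS_PTRACE, a call without
UFFD_USER_MODE_ONLY is the only combination this model rejects.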
> > +Alternatively, userfaultfds can be created by opening /dev/userfaultfd, and
> > +issuing a USERFAULTFD_IOC_NEW ioctl to this device. Access to this device is
>
> New ioctl? I thought we are moving away from using ioctls?

Hmm, looking at the alternatives [1], I am not sure I see a viable one:

We could have defined a new "userfaultfdfs" filesystem, but it seems
to me to be overkill for this feature.

We could have used a syscall instead and supported fine-grained access
control with a new capability, but this approach was rejected [2]
generally because we prefer to avoid adding capabilities, and this new
capability's scope (just userfaultfd) was considered too narrow.

So, I'm not sure there is a better way to do this. I suppose one could
argue that the dislike of ioctls outweighs the usefulness of this
feature, but to me at least the tradeoff seems worth it. :)

[1]: https://www.kernel.org/doc/html/latest/driver-api/ioctl.html#alternatives-to-ioctl
[2]: https://lkml.org/lkml/2022/2/24/1012
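
For reference, a minimal sketch of what creating a userfaultfd through the
device node might look like from userspace. It rests on assumptions not
spelled out in this patch: the fallback ioctl number is taken to match the
uapi header as eventually merged, the flags are taken to be passed by value as
the ioctl argument, and the helper name uffd_create_from_dev is hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Fallback for older kernel headers. Assumption: these values match the
 * definitions in <linux/userfaultfd.h> on kernels that ship the device. */
#ifndef USERFAULTFD_IOC_NEW
#define USERFAULTFD_IOC 0xAA
#define USERFAULTFD_IOC_NEW _IO(USERFAULTFD_IOC, 0x00)
#endif

/*
 * Hypothetical helper: open /dev/userfaultfd and ask it for a new
 * userfaultfd. Returns the new descriptor, or -1 with errno set.
 * Access is decided by the file permissions on the device node.
 */
int uffd_create_from_dev(int flags)
{
    int dev_fd = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC);
    if (dev_fd < 0)
        return -1; /* device absent, or permissions deny access */

    /* Assumption: flags (e.g. O_CLOEXEC | O_NONBLOCK) are passed by value. */
    int uffd = ioctl(dev_fd, USERFAULTFD_IOC_NEW, (unsigned long)flags);

    /* The device fd is only needed for the ioctl; keep errno intact. */
    int saved_errno = errno;
    close(dev_fd);
    errno = saved_errno;

    return uffd;
}
```

A caller would then continue with the usual UFFDIO_API handshake on the
returned descriptor, exactly as with one obtained from userfaultfd(2).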

>
> > +controlled via normal filesystem permissions (user/group/mode for example) - no
> > +additional permission (capability/sysctl) is needed to be able to handle kernel
> > +faults this way. This is useful because it allows e.g. a specific user or group
> > +to be able to create kernel-fault-handling userfaultfds, without allowing it
> > +more broadly, or granting more privileges in addition to that particular ability
> > +(CAP_SYS_PTRACE). In other words, it allows permissions to be minimized.
> > +
> > +Initializing up a userfaultfd
> > +------------------------
> > +
>
> This will very likely generate a doc warning - extend the dashes to the
> entire length of the subtitle.

I'll fix this in v3.

>
> >   When first opened the ``userfaultfd`` must be enabled invoking the
> >   ``UFFDIO_API`` ioctl specifying a ``uffdio_api.api`` value set to ``UFFD_API`` (or
> >   a later API version) which will specify the ``read/POLLIN`` protocol
> > diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> > index f4804ce37c58..8682d5fbc8ea 100644
> > --- a/Documentation/admin-guide/sysctl/vm.rst
> > +++ b/Documentation/admin-guide/sysctl/vm.rst
> > @@ -880,6 +880,9 @@ calls without any restrictions.
> >
> >   The default value is 0.
> >
> > +An alternative to this sysctl / the userfaultfd(2) syscall is to create
> > +userfaultfds via /dev/userfaultfd. See
> > +Documentation/admin-guide/mm/userfaultfd.rst.
> >
> >   user_reserve_kbytes
> >   ===================
> >
>
> thanks,
> -- Shuah
Patch

diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
index 6528036093e1..4c079b5377d4 100644
--- a/Documentation/admin-guide/mm/userfaultfd.rst
+++ b/Documentation/admin-guide/mm/userfaultfd.rst
@@ -17,7 +17,10 @@  of the ``PROT_NONE+SIGSEGV`` trick.
 Design
 ======
 
-Userfaults are delivered and resolved through the ``userfaultfd`` syscall.
+Userspace creates a new userfaultfd, initializes it, and registers one or more
+regions of virtual memory with it. Then, any page faults which occur within the
+region(s) result in a message being delivered to the userfaultfd, notifying
+userspace of the fault.
 
 The ``userfaultfd`` (aside from registering and unregistering virtual
 memory ranges) provides two primary functionalities:
@@ -39,7 +42,7 @@  Vmas are not suitable for page- (or hugepage) granular fault tracking
 when dealing with virtual address spaces that could span
 Terabytes. Too many vmas would be needed for that.
 
-The ``userfaultfd`` once opened by invoking the syscall, can also be
+The ``userfaultfd``, once created, can also be
 passed using unix domain sockets to a manager process, so the same
 manager process could handle the userfaults of a multitude of
 different processes without them being aware about what is going on
@@ -50,6 +53,37 @@  is a corner case that would currently return ``-EBUSY``).
 API
 ===
 
+Creating a userfaultfd
+----------------------
+
+There are two mechanisms to create a userfaultfd. There are various ways to
+restrict this too, since userfaultfds which handle kernel page faults have
+historically been a useful tool for exploiting the kernel.
+
+The first is the userfaultfd(2) syscall. Access to this is controlled in several
+ways:
+
+- By default, the userfaultfd will be able to handle kernel page faults. This
+  can be disabled by passing in UFFD_USER_MODE_ONLY.
+
+- If vm.unprivileged_userfaultfd is 0, then the caller must *either* have
+  CAP_SYS_PTRACE, or pass in UFFD_USER_MODE_ONLY.
+
+- If vm.unprivileged_userfaultfd is 1, then no particular privilege is needed to
+  use this syscall, even if UFFD_USER_MODE_ONLY is *not* set.
+
+Alternatively, userfaultfds can be created by opening /dev/userfaultfd, and
+issuing a USERFAULTFD_IOC_NEW ioctl to this device. Access to this device is
+controlled via normal filesystem permissions (user/group/mode for example) - no
+additional permission (capability/sysctl) is needed to be able to handle kernel
+faults this way. This is useful because it allows e.g. a specific user or group
+to be able to create kernel-fault-handling userfaultfds, without allowing it
+more broadly, or granting more privileges in addition to that particular ability
+(CAP_SYS_PTRACE). In other words, it allows permissions to be minimized.
+
+Initializing up a userfaultfd
+------------------------
+
 When first opened the ``userfaultfd`` must be enabled invoking the
 ``UFFDIO_API`` ioctl specifying a ``uffdio_api.api`` value set to ``UFFD_API`` (or
 a later API version) which will specify the ``read/POLLIN`` protocol
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index f4804ce37c58..8682d5fbc8ea 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -880,6 +880,9 @@  calls without any restrictions.
 
 The default value is 0.
 
+An alternative to this sysctl / the userfaultfd(2) syscall is to create
+userfaultfds via /dev/userfaultfd. See
+Documentation/admin-guide/mm/userfaultfd.rst.
 
 user_reserve_kbytes
 ===================