[v2] ioctl_userfaultfd.2, userfaultfd.2: add minor fault mode

Message ID 20210604195622.1249588-1-axelrasmussen@google.com (mailing list archive)
State New, archived

Commit Message

Axel Rasmussen June 4, 2021, 7:56 p.m. UTC
Userfaultfd minor fault mode is supported starting from Linux 5.13.

This commit adds a description of the new mode, as well as the new ioctl
used to resolve such faults. The two go hand-in-hand: one can't resolve
a minor fault without continue, and continue can't be used to resolve
any other kind of fault.

This patch covers just the hugetlbfs implementation (in 5.13). Support
for shmem is forthcoming, but as it has not yet made it into a kernel
release candidate, it will be added in a future commit.

Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 man2/ioctl_userfaultfd.2 | 125 ++++++++++++++++++++++++++++++++++++---
 man2/userfaultfd.2       |  79 ++++++++++++++++++++-----
 2 files changed, 182 insertions(+), 22 deletions(-)
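
The pairing described in the commit message (a minor fault is resolved
only with continue, and continue resolves nothing else) can be sketched
in C as a small dispatch on the page-fault flags. This is an
illustrative sketch, not code from the patch: enum fault_kind,
classify_fault() and resolving_ioctl() are hypothetical names, and the
fallback flag values are assumed to mirror the kernel UAPI header for
builds against headers older than 5.13.

```c
#include <assert.h>
#include <stdint.h>
#ifdef __linux__
#include <linux/userfaultfd.h>
#endif

/* Fallbacks for older headers; values assumed to match the kernel UAPI. */
#ifndef UFFD_PAGEFAULT_FLAG_WRITE
#define UFFD_PAGEFAULT_FLAG_WRITE (1 << 0)
#endif
#ifndef UFFD_PAGEFAULT_FLAG_WP
#define UFFD_PAGEFAULT_FLAG_WP (1 << 1)
#endif
#ifndef UFFD_PAGEFAULT_FLAG_MINOR
#define UFFD_PAGEFAULT_FLAG_MINOR (1 << 2)
#endif

/* Hypothetical classification used only for this sketch. */
enum fault_kind { FAULT_MISSING, FAULT_WP, FAULT_MINOR };

/* Neither WP nor MINOR set means the fault was a missing fault. */
enum fault_kind classify_fault(uint64_t flags)
{
    if (flags & UFFD_PAGEFAULT_FLAG_MINOR)
        return FAULT_MINOR;
    if (flags & UFFD_PAGEFAULT_FLAG_WP)
        return FAULT_WP;
    return FAULT_MISSING;
}

/* Name of the ioctl that resolves each kind of fault. */
const char *resolving_ioctl(enum fault_kind kind)
{
    switch (kind) {
    case FAULT_MINOR:
        return "UFFDIO_CONTINUE";
    case FAULT_WP:
        return "UFFDIO_WRITEPROTECT";
    case FAULT_MISSING:
        return "UFFDIO_COPY or UFFDIO_ZEROPAGE";
    }
    assert(!"unreachable: unknown fault kind");
    return "";
}
```

A handler would call classify_fault() on the flags field of each
page-fault message it reads, then issue the matching ioctl.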

Comments

Axel Rasmussen July 27, 2021, 4:32 p.m. UTC | #1
Any remaining issues with this patch? I just realized today it was
never merged. 5.13 (which contains this new feature) was released some
weeks ago.

On Fri, Jun 4, 2021 at 12:56 PM Axel Rasmussen <axelrasmussen@google.com> wrote:
>
> Userfaultfd minor fault mode is supported starting from Linux 5.13.
>
> This commit adds a description of the new mode, as well as the new ioctl
> used to resolve such faults. The two go hand-in-hand: one can't resolve
> a minor fault without continue, and continue can't be used to resolve
> any other kind of fault.
>
> This patch covers just the hugetlbfs implementation (in 5.13). Support
> for shmem is forthcoming, but as it has not yet made it into a kernel
> release candidate, it will be added in a future commit.
>
> Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
> ---
>  man2/ioctl_userfaultfd.2 | 125 ++++++++++++++++++++++++++++++++++++---
>  man2/userfaultfd.2       |  79 ++++++++++++++++++++-----
>  2 files changed, 182 insertions(+), 22 deletions(-)
>
> diff --git a/man2/ioctl_userfaultfd.2 b/man2/ioctl_userfaultfd.2
> index 504f61d4b..7b990c24a 100644
> --- a/man2/ioctl_userfaultfd.2
> +++ b/man2/ioctl_userfaultfd.2
> @@ -214,6 +214,10 @@ memory accesses to the regions registered with userfaultfd.
>  If this feature bit is set,
>  .I uffd_msg.pagefault.feat.ptid
>  will be set to the faulted thread ID for each page-fault message.
> +.TP
> +.BR UFFD_FEATURE_MINOR_HUGETLBFS " (since Linux 5.13)"
> +If this feature bit is set, the kernel supports registering userfaultfd ranges
> +in minor mode on hugetlbfs-backed memory areas.
>  .PP
>  The returned
>  .I ioctls
> @@ -240,6 +244,11 @@ operation is supported.
>  The
>  .B UFFDIO_WRITEPROTECT
>  operation is supported.
> +.TP
> +.B 1 << _UFFDIO_CONTINUE
> +The
> +.B UFFDIO_CONTINUE
> +operation is supported.
>  .PP
>  This
>  .BR ioctl (2)
> @@ -278,14 +287,8 @@ by the current kernel version.
>  (Since Linux 4.3.)
>  Register a memory address range with the userfaultfd object.
>  The pages in the range must be "compatible".
> -.PP
> -Up to Linux kernel 4.11,
> -only private anonymous ranges are compatible for registering with
> -.BR UFFDIO_REGISTER .
> -.PP
> -Since Linux 4.11,
> -hugetlbfs and shared memory ranges are also compatible with
> -.BR UFFDIO_REGISTER .
> +Please refer to the list of register modes below for the compatible memory
> +backends for each mode.
>  .PP
>  The
>  .I argp
> @@ -324,9 +327,16 @@ the specified range:
>  .TP
>  .B UFFDIO_REGISTER_MODE_MISSING
>  Track page faults on missing pages.
> +Since Linux 4.3, only private anonymous ranges are compatible.
> +Since Linux 4.11, hugetlbfs and shared memory ranges are also compatible.
>  .TP
>  .B UFFDIO_REGISTER_MODE_WP
>  Track page faults on write-protected pages.
> +Since Linux 5.7, only private anonymous ranges are compatible.
> +.TP
> +.B UFFDIO_REGISTER_MODE_MINOR
> +Track minor page faults.
> +Since Linux 5.13, only hugetlbfs ranges are compatible.
>  .PP
>  If the operation is successful, the kernel modifies the
>  .I ioctls
> @@ -735,6 +745,105 @@ or not registered with userfaultfd write-protect mode.
>  .TP
>  .B EFAULT
>  Encountered a generic fault during processing.
> +.\"
> +.SS UFFDIO_CONTINUE
> +(Since Linux 5.13.)
> +Resolve a minor page fault by installing page table entries for existing pages
> +in the page cache.
> +.PP
> +The
> +.I argp
> +argument is a pointer to a
> +.I uffdio_continue
> +structure as shown below:
> +.PP
> +.in +4n
> +.EX
> +struct uffdio_continue {
> +    struct uffdio_range range; /* Range to install PTEs for and continue */
> +    __u64 mode;                /* Flags controlling the behavior of continue */
> +    __s64 mapped;              /* Number of bytes mapped, or negated error */
> +};
> +.EE
> +.in
> +.PP
> +The following value may be bitwise ORed in
> +.IR mode
> +to change the behavior of the
> +.B UFFDIO_CONTINUE
> +operation:
> +.TP
> +.B UFFDIO_CONTINUE_MODE_DONTWAKE
> +Do not wake up the thread that waits for page-fault resolution.
> +.PP
> +The
> +.I mapped
> +field is used by the kernel to return the number of bytes
> +that were actually mapped, or an error in the same manner as
> +.BR UFFDIO_COPY .
> +If the value returned in the
> +.I mapped
> +field doesn't match the value that was specified in
> +.IR range.len ,
> +the operation fails with the error
> +.BR EAGAIN .
> +The
> +.I mapped
> +field is output-only;
> +it is not read by the
> +.B UFFDIO_CONTINUE
> +operation.
> +.PP
> +This
> +.BR ioctl (2)
> +operation returns 0 on success.
> +In this case, the entire area was mapped.
> +On error, \-1 is returned and
> +.I errno
> +is set to indicate the error.
> +Possible errors include:
> +.TP
> +.B EAGAIN
> +The number of bytes mapped (i.e., the value returned in the
> +.I mapped
> +field) does not equal the value that was specified in the
> +.I range.len
> +field.
> +.TP
> +.B EINVAL
> +Either
> +.I range.start
> +or
> +.I range.len
> +was not a multiple of the system page size; or
> +.I range.len
> +was zero; or the range specified was invalid.
> +.TP
> +.B EINVAL
> +An invalid bit was specified in the
> +.IR mode
> +field.
> +.TP
> +.B EEXIST
> +One or more pages were already mapped in the given range.
> +.TP
> +.B ENOENT
> +The faulting process has changed its virtual memory layout simultaneously with
> +an outstanding
> +.B UFFDIO_CONTINUE
> +operation.
> +.TP
> +.B ENOMEM
> +Allocating memory needed to set up the page table mappings failed.
> +.TP
> +.B EFAULT
> +No existing page could be found in the page cache for the given range.
> +.TP
> +.BR ESRCH
> +The faulting process has exited at the time of a
> +.B UFFDIO_CONTINUE
> +operation.
> +.\"
>  .SH RETURN VALUE
>  See descriptions of the individual operations, above.
>  .SH ERRORS
> diff --git a/man2/userfaultfd.2 b/man2/userfaultfd.2
> index 593c189d8..07f53c6ff 100644
> --- a/man2/userfaultfd.2
> +++ b/man2/userfaultfd.2
> @@ -78,7 +78,7 @@ all memory ranges that were registered with the object are unregistered
>  and unread events are flushed.
>  .\"
>  .PP
> -Userfaultfd supports two modes of registration:
> +Userfaultfd supports three modes of registration:
>  .TP
>  .BR UFFDIO_REGISTER_MODE_MISSING " (since 4.10)"
>  When registered with
> @@ -92,6 +92,18 @@ or an
>  .B UFFDIO_ZEROPAGE
>  ioctl.
>  .TP
> +.BR UFFDIO_REGISTER_MODE_MINOR " (since 5.13)"
> +When registered with
> +.B UFFDIO_REGISTER_MODE_MINOR
> +mode, user-space will receive a page-fault notification
> +when a minor page fault occurs.
> +That is, when a backing page is in the page cache, but
> +page table entries don't yet exist.
> +The faulted thread will be stopped from execution until the page fault is
> +resolved from user-space by an
> +.B UFFDIO_CONTINUE
> +ioctl.
> +.TP
>  .BR UFFDIO_REGISTER_MODE_WP " (since 5.7)"
>  When registered with
>  .B UFFDIO_REGISTER_MODE_WP
> @@ -212,9 +224,10 @@ a page fault occurring in the requested memory range, and satisfying
>  the mode defined at the registration time, will be forwarded by the kernel to
>  the user-space application.
>  The application can then use the
> -.B UFFDIO_COPY
> +.BR UFFDIO_COPY ,
> +.BR UFFDIO_ZEROPAGE ,
>  or
> -.B UFFDIO_ZEROPAGE
> +.B UFFDIO_CONTINUE
>  .BR ioctl (2)
>  operations to resolve the page fault.
>  .PP
> @@ -318,6 +331,43 @@ should have the flag
>  cleared upon the faulted page or range.
>  .PP
>  Write-protect mode supports only private anonymous memory.
> +.\"
> +.SS Userfaultfd minor fault mode (since 5.13)
> +Since Linux 5.13, userfaultfd supports minor fault mode.
> +In this mode, fault messages are produced not for major faults (where the
> +page was missing), but rather for minor faults, where a page exists in the page
> +cache, but the page table entries are not yet present.
> +The user needs to first check availability of this feature using
> +.B UFFDIO_API
> +ioctl against the feature bit
> +.B UFFD_FEATURE_MINOR_HUGETLBFS
> +before using this feature.
> +.PP
> +To register with userfaultfd minor fault mode, the user needs to initiate the
> +.B UFFDIO_REGISTER
> +ioctl with mode
> +.B UFFDIO_REGISTER_MODE_MINOR
> +set.
> +.PP
> +When a minor fault occurs, user-space will receive a page-fault notification
> +whose
> +.I uffd_msg.pagefault.flags
> +will have the
> +.B UFFD_PAGEFAULT_FLAG_MINOR
> +flag set.
> +.PP
> +To resolve a minor page fault, the handler should decide whether or not the
> +existing page contents need to be modified first.
> +If so, this should be done in-place via a second, non-userfaultfd-registered
> +mapping to the same backing page (e.g., by mapping the hugetlbfs file twice).
> +Once the page is considered "up to date", the fault can be resolved by
> +initiating an
> +.B UFFDIO_CONTINUE
> +ioctl, which installs the page table entries and (by default) wakes up the
> +faulting thread(s).
> +.PP
> +Minor fault mode supports only hugetlbfs-backed memory.
> +.\"
>  .SS Reading from the userfaultfd structure
>  Each
>  .BR read (2)
> @@ -456,19 +506,20 @@ For
>  the following flag may appear:
>  .RS
>  .TP
> -.B UFFD_PAGEFAULT_FLAG_WRITE
> -If the address is in a range that was registered with the
> -.B UFFDIO_REGISTER_MODE_MISSING
> -flag (see
> -.BR ioctl_userfaultfd (2))
> -and this flag is set, this a write fault;
> -otherwise it is a read fault.
> +.B UFFD_PAGEFAULT_FLAG_WP
> +If this flag is set, then the fault was a write-protect fault.
>  .TP
> +.B UFFD_PAGEFAULT_FLAG_MINOR
> +If this flag is set, then the fault was a minor fault.
> +.TP
> +.B UFFD_PAGEFAULT_FLAG_WRITE
> +If this flag is set, then the fault was a write fault.
> +.HP
> +If neither
>  .B UFFD_PAGEFAULT_FLAG_WP
> -If the address is in a range that was registered with the
> -.B UFFDIO_REGISTER_MODE_WP
> -flag, when this bit is set, it means it is a write-protect fault.
> -Otherwise it is a page-missing fault.
> +nor
> +.B UFFD_PAGEFAULT_FLAG_MINOR
> +are set, then the fault was a missing fault.
>  .RE
>  .TP
>  .I pagefault.feat.pid
> --
> 2.32.0.rc1.229.g3e70b5a671-goog
>
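
The UFFDIO_CONTINUE description in the patch above can be sketched as a
thin C wrapper around the ioctl. This is a minimal sketch under stated
assumptions, not the author's code: uffd_continue() is a hypothetical
helper name, and the build is assumed to use kernel headers from Linux
5.13 or later so that struct uffdio_continue is available.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/*
 * Hypothetical helper: resolve a minor fault on [start, start + len).
 * Returns 0 on success, or a negated errno value on failure.
 */
int uffd_continue(int uffd, uint64_t start, uint64_t len, bool wake)
{
    struct uffdio_continue req;

    memset(&req, 0, sizeof(req));
    req.range.start = start;
    req.range.len = len;
    /* By default the kernel wakes the faulting thread(s). */
    req.mode = wake ? 0 : UFFDIO_CONTINUE_MODE_DONTWAKE;

    if (ioctl(uffd, UFFDIO_CONTINUE, &req) == -1)
        return -errno;

    /* On success the kernel reports the whole range as mapped. */
    assert(req.mapped == (int64_t)len);
    return 0;
}
```

For hugetlbfs, the range is expected to be aligned to the huge page
size, so a handler would typically round the faulting address down to
that size before calling the helper.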
Peter Xu July 27, 2021, 4:37 p.m. UTC | #2
On Fri, Jun 04, 2021 at 12:56:22PM -0700, Axel Rasmussen wrote:
> Userfaultfd minor fault mode is supported starting from Linux 5.13.
> 
> This commit adds a description of the new mode, as well as the new ioctl
> used to resolve such faults. The two go hand-in-hand: one can't resolve
> a minor fault without continue, and continue can't be used to resolve
> any other kind of fault.
> 
> This patch covers just the hugetlbfs implementation (in 5.13). Support
> for shmem is forthcoming, but as it has not yet made it into a kernel
> release candidate, it will be added in a future commit.
> 
> Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>

FWIW:

Reviewed-by: Peter Xu <peterx@redhat.com>
Mike Rapoport Aug. 2, 2021, 11:21 a.m. UTC | #3
(added man-pages maintainers)

On Tue, Jul 27, 2021 at 09:32:34AM -0700, Axel Rasmussen wrote:
> Any remaining issues with this patch? I just realized today it was
> never merged. 5.13 (which contains this new feature) was released some
> weeks ago.
Alejandro Colomar Aug. 2, 2021, 12:21 p.m. UTC | #4
Hi Mike, Axel,

On 8/2/21 1:21 PM, Mike Rapoport wrote:
> (added man-pages maintainers)

Thanks!  If I'm not CCed, I may not notice the email, depending on the 
traffic of the lists, and the amount of time I have ;)

> 
> On Tue, Jul 27, 2021 at 09:32:34AM -0700, Axel Rasmussen wrote:
>> Any remaining issues with this patch? I just realized today it was
>> never merged. 5.13 (which contains this new feature) was released some
>> weeks ago.

Please see some minor formatting issues I commented on below.
Other than that, it looks good to me.

Thanks,

Alex

>>
>> On Fri, Jun 4, 2021 at 12:56 PM Axel Rasmussen <axelrasmussen@google.com> wrote:
>>>
>>> Userfaultfd minor fault mode is supported starting from Linux 5.13.
>>>
>>> This commit adds a description of the new mode, as well as the new ioctl
>>> used to resolve such faults. The two go hand-in-hand: one can't resolve
>>> a minor fault without continue, and continue can't be used to resolve
>>> any other kind of fault.
>>>
>>> This patch covers just the hugetlbfs implementation (in 5.13). Support
>>> for shmem is forthcoming, but as it has not yet made it into a kernel
>>> release candidate, it will be added in a future commit.
>>>
>>> Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
>>> ---
>>>   man2/ioctl_userfaultfd.2 | 125 ++++++++++++++++++++++++++++++++++++---
>>>   man2/userfaultfd.2       |  79 ++++++++++++++++++++-----
>>>   2 files changed, 182 insertions(+), 22 deletions(-)
>>>
>>> diff --git a/man2/ioctl_userfaultfd.2 b/man2/ioctl_userfaultfd.2
>>> index 504f61d4b..7b990c24a 100644
>>> --- a/man2/ioctl_userfaultfd.2
>>> +++ b/man2/ioctl_userfaultfd.2
>>> @@ -214,6 +214,10 @@ memory accesses to the regions registered with userfaultfd.
>>>   If this feature bit is set,
>>>   .I uffd_msg.pagefault.feat.ptid
>>>   will be set to the faulted thread ID for each page-fault message.
>>> +.TP
>>> +.BR UFFD_FEATURE_MINOR_HUGETLBFS " (since Linux 5.13)"
>>> +If this feature bit is set, the kernel supports registering userfaultfd ranges
>>> +in minor mode on hugetlbfs-backed memory areas.

See the following extract from man-pages(7):

    Use semantic newlines
        In the source of a manual page,  new  sentences  should  be
        started  on  new  lines, and long sentences should be split
        into lines at clause breaks  (commas,  semicolons,  colons,
        and  so on).  This convention, sometimes known as "semantic
        newlines", makes it easier to see the  effect  of  patches,
        which often operate at the level of individual sentences or
        sentence clauses.

A quick way to check whether some text follows this convention:
the following regex should rarely match:
[,;:.] \+\w

Multi-sentence parenthetical expressions should also go on separate 
lines normally.

For example, I'd break the above text into:

[
If this feature bit is set,
the kernel supports registering userfaultfd ranges
in minor mode on hugetlbfs-backed memory areas.
]

Note the break after the comma, and another break at a sensible point 
(you already did that one correctly in this example, but some below don't).
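
The regex check above can also be scripted. Here is a minimal sketch in
C using POSIX regcomp(3). It assumes the ERE pattern in the code is an
acceptable spelling of the grep-style pattern quoted above;
breaks_semantic_newlines() is a hypothetical name.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if a source line has punctuation followed by spaces and
 * more text, which usually means a sentence or clause should have been
 * broken onto a new line.
 */
bool breaks_semantic_newlines(const char *line)
{
    regex_t re;
    bool hit;
    int rc;

    /* ERE spelling of "[,;:.] \+\w". */
    rc = regcomp(&re, "[,;:.] +[[:alnum:]_]", REG_EXTENDED | REG_NOSUB);
    assert(rc == 0);
    (void)rc;

    hit = regexec(&re, line, 0, NULL, 0) == 0;
    regfree(&re);
    return hit;
}
```

Running each added line of a patch through this function flags lines
such as "If this feature bit is set, the kernel supports ...", while
lines that end at the comma pass.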


>>>   .PP
>>>   The returned
>>>   .I ioctls
>>> @@ -240,6 +244,11 @@ operation is supported.
>>>   The
>>>   .B UFFDIO_WRITEPROTECT
>>>   operation is supported.
>>> +.TP
>>> +.B 1 << _UFFDIO_CONTINUE
>>> +The
>>> +.B UFFDIO_CONTINUE
>>> +operation is supported.
>>>   .PP
>>>   This
>>>   .BR ioctl (2)
>>> @@ -278,14 +287,8 @@ by the current kernel version.
>>>   (Since Linux 4.3.)
>>>   Register a memory address range with the userfaultfd object.
>>>   The pages in the range must be "compatible".
>>> -.PP
>>> -Up to Linux kernel 4.11,
>>> -only private anonymous ranges are compatible for registering with
>>> -.BR UFFDIO_REGISTER .
>>> -.PP
>>> -Since Linux 4.11,
>>> -hugetlbfs and shared memory ranges are also compatible with
>>> -.BR UFFDIO_REGISTER .
>>> +Please refer to the list of register modes below for the compatible memory
>>> +backends for each mode.

Regarding semantic newlines mentioned above:

Here for example, a more sensible point to break the line would be just 
after (or maybe before, up to you) the first "for".

>>>   .PP
>>>   The
>>>   .I argp
>>> @@ -324,9 +327,16 @@ the specified range:
>>>   .TP
>>>   .B UFFDIO_REGISTER_MODE_MISSING
>>>   Track page faults on missing pages.
>>> +Since Linux 4.3, only private anonymous ranges are compatible.
>>> +Since Linux 4.11, hugetlbfs and shared memory ranges are also compatible.
>>>   .TP
>>>   .B UFFDIO_REGISTER_MODE_WP
>>>   Track page faults on write-protected pages.
>>> +Since Linux 5.7, only private anonymous ranges are compatible.
>>> +.TP
>>> +.B UFFDIO_REGISTER_MODE_MINOR
>>> +Track minor page faults.
>>> +Since Linux 5.13, only hugetlbfs ranges are compatible.
>>>   .PP
>>>   If the operation is successful, the kernel modifies the
>>>   .I ioctls
>>> @@ -735,6 +745,105 @@ or not registered with userfaultfd write-protect mode.
>>>   .TP
>>>   .B EFAULT
>>>   Encountered a generic fault during processing.
>>> +.\"
>>> +.SS UFFDIO_CONTINUE
>>> +(Since Linux 5.13.)
>>> +Resolve a minor page fault by installing page table entries for existing pages
>>> +in the page cache.
>>> +.PP
>>> +The
>>> +.I argp
>>> +argument is a pointer to a
>>> +.I uffdio_continue
>>> +structure as shown below:
>>> +.PP
>>> +.in +4n
>>> +.EX
>>> +struct uffdio_continue {
>>> +    struct uffdio_range range; /* Range to install PTEs for and continue */
>>> +    __u64 mode;                /* Flags controlling the behavior of continue */
>>> +    __s64 mapped;              /* Number of bytes mapped, or negated error */
>>> +};
>>> +.EE
>>> +.in
>>> +.PP
>>> +The following value may be bitwise ORed in
>>> +.IR mode
>>> +to change the behavior of the
>>> +.B UFFDIO_CONTINUE
>>> +operation:
>>> +.TP
>>> +.B UFFDIO_CONTINUE_MODE_DONTWAKE
>>> +Do not wake up the thread that waits for page-fault resolution.
>>> +.PP
>>> +The
>>> +.I mapped
>>> +field is used by the kernel to return the number of bytes
>>> +that were actually mapped, or an error in the same manner as
>>> +.BR UFFDIO_COPY .
>>> +If the value returned in the
>>> +.I mapped
>>> +field doesn't match the value that was specified in
>>> +.IR range.len ,
>>> +the operation fails with the error
>>> +.BR EAGAIN .
>>> +The
>>> +.I mapped
>>> +field is output-only;
>>> +it is not read by the
>>> +.B UFFDIO_CONTINUE
>>> +operation.
>>> +.PP
>>> +This
>>> +.BR ioctl (2)
>>> +operation returns 0 on success.
>>> +In this case, the entire area was mapped.
>>> +On error, \-1 is returned and
>>> +.I errno
>>> +is set to indicate the error.
>>> +Possible errors include:
>>> +.TP
>>> +.B EAGAIN
>>> +The number of bytes mapped (i.e., the value returned in the
>>> +.I mapped
>>> +field) does not equal the value that was specified in the
>>> +.I range.len
>>> +field.
>>> +.TP
>>> +.B EINVAL
>>> +Either
>>> +.I range.start
>>> +or
>>> +.I range.len
>>> +was not a multiple of the system page size; or
>>> +.I range.len
>>> +was zero; or the range specified was invalid.
>>> +.TP
>>> +.B EINVAL
>>> +An invalid bit was specified in the
>>> +.IR mode
>>> +field.
>>> +.TP
>>> +.B EEXIST
>>> +One or more pages were already mapped in the given range.
>>> +.TP
>>> +.B ENOENT
>>> +The faulting process has changed its virtual memory layout simultaneously with
>>> +an outstanding
>>> +.B UFFDIO_CONTINUE
>>> +operation.
>>> +.TP
>>> +.B ENOMEM
>>> +Allocating memory needed to set up the page table mappings failed.
>>> +.TP
>>> +.B EFAULT
>>> +No existing page could be found in the page cache for the given range.
>>> +.TP
>>> +.BR ESRCH
>>> +The faulting process has exited at the time of a
>>> +.B UFFDIO_CONTINUE
>>> +operation.
>>> +.\"
>>>   .SH RETURN VALUE
>>>   See descriptions of the individual operations, above.
>>>   .SH ERRORS
>>> diff --git a/man2/userfaultfd.2 b/man2/userfaultfd.2
>>> index 593c189d8..07f53c6ff 100644
>>> --- a/man2/userfaultfd.2
>>> +++ b/man2/userfaultfd.2
>>> @@ -78,7 +78,7 @@ all memory ranges that were registered with the object are unregistered
>>>   and unread events are flushed.
>>>   .\"
>>>   .PP
>>> -Userfaultfd supports two modes of registration:
>>> +Userfaultfd supports three modes of registration:
>>>   .TP
>>>   .BR UFFDIO_REGISTER_MODE_MISSING " (since 4.10)"
>>>   When registered with
>>> @@ -92,6 +92,18 @@ or an
>>>   .B UFFDIO_ZEROPAGE
>>>   ioctl.
>>>   .TP
>>> +.BR UFFDIO_REGISTER_MODE_MINOR " (since 5.13)"
>>> +When registered with
>>> +.B UFFDIO_REGISTER_MODE_MINOR
>>> +mode, user-space will receive a page-fault notification

s/user-space/user space/

See the following extract from man-pages(7):

    Preferred terms
        The  following  table  lists some preferred terms to use in
        man pages, mainly to ensure consistency across pages.

        Term                 Avoid using              Notes
        ─────────────────────────────────────────────────────────────
        [...]
        user space           userspace

However, when user space is used as an adjective, per the usual English 
rules, we write "user-space".  Example: "a user-space program".

>>> +when a minor page fault occurs.
>>> +That is, when a backing page is in the page cache, but
>>> +page table entries don't yet exist.
>>> +The faulted thread will be stopped from execution until the page fault is
>>> +resolved from user-space by an
>>> +.B UFFDIO_CONTINUE
>>> +ioctl.
>>> +.TP
>>>   .BR UFFDIO_REGISTER_MODE_WP " (since 5.7)"
>>>   When registered with
>>>   .B UFFDIO_REGISTER_MODE_WP
>>> @@ -212,9 +224,10 @@ a page fault occurring in the requested memory range, and satisfying
>>>   the mode defined at the registration time, will be forwarded by the kernel to
>>>   the user-space application.
>>>   The application can then use the
>>> -.B UFFDIO_COPY
>>> +.BR UFFDIO_COPY ,
>>> +.BR UFFDIO_ZEROPAGE ,
>>>   or
>>> -.B UFFDIO_ZEROPAGE
>>> +.B UFFDIO_CONTINUE
>>>   .BR ioctl (2)
>>>   operations to resolve the page fault.
>>>   .PP
>>> @@ -318,6 +331,43 @@ should have the flag
>>>   cleared upon the faulted page or range.
>>>   .PP
>>>   Write-protect mode supports only private anonymous memory.
>>> +.\"
>>> +.SS Userfaultfd minor fault mode (since 5.13)
>>> +Since Linux 5.13, userfaultfd supports minor fault mode.
>>> +In this mode, fault messages are produced not for major faults (where the
>>> +page was missing), but rather for minor faults, where a page exists in the page
>>> +cache, but the page table entries are not yet present.
>>> +The user needs to first check availability of this feature using
>>> +.B UFFDIO_API
>>> +ioctl against the feature bit
>>> +.B UFFD_FEATURE_MINOR_HUGETLBFS
>>> +before using this feature.
>>> +.PP
>>> +To register with userfaultfd minor fault mode, the user needs to initiate the
>>> +.B UFFDIO_REGISTER
>>> +ioctl with mode
>>> +.B UFFDIO_REGISTER_MODE_MINOR
>>> +set.
>>> +.PP
>>> +When a minor fault occurs, user-space will receive a page-fault notification
>>> +whose
>>> +.I uffd_msg.pagefault.flags
>>> +will have the
>>> +.B UFFD_PAGEFAULT_FLAG_MINOR
>>> +flag set.
>>> +.PP
>>> +To resolve a minor page fault, the handler should decide whether or not the
>>> +existing page contents need to be modified first.
>>> +If so, this should be done in-place via a second, non-userfaultfd-registered
>>> +mapping to the same backing page (e.g., by mapping the hugetlbfs file twice).
>>> +Once the page is considered "up to date", the fault can be resolved by
>>> +initiating an
>>> +.B UFFDIO_CONTINUE
>>> +ioctl, which installs the page table entries and (by default) wakes up the
>>> +faulting thread(s).
>>> +.PP
>>> +Minor fault mode supports only hugetlbfs-backed memory.
>>> +.\"
>>>   .SS Reading from the userfaultfd structure
>>>   Each
>>>   .BR read (2)
>>> @@ -456,19 +506,20 @@ For
>>>   the following flag may appear:
>>>   .RS
>>>   .TP
>>> -.B UFFD_PAGEFAULT_FLAG_WRITE
>>> -If the address is in a range that was registered with the
>>> -.B UFFDIO_REGISTER_MODE_MISSING
>>> -flag (see
>>> -.BR ioctl_userfaultfd (2))
>>> -and this flag is set, this a write fault;
>>> -otherwise it is a read fault.
>>> +.B UFFD_PAGEFAULT_FLAG_WP
>>> +If this flag is set, then the fault was a write-protect fault.
>>>   .TP
>>> +.B UFFD_PAGEFAULT_FLAG_MINOR
>>> +If this flag is set, then the fault was a minor fault.
>>> +.TP
>>> +.B UFFD_PAGEFAULT_FLAG_WRITE
>>> +If this flag is set, then the fault was a write fault.
>>> +.HP

See the following extract from groff_man(7):

    Deprecated features
        Use of the following is discouraged.

        [...]

        .HP [indent]
               Set up a paragraph with a hanging left  indentation.
               The  indent argument, if present, is handled as with
               .TP.

               Use of this presentation‐level macro is  deprecated.
               While it is universally portable to legacy Unix sys‐
               tems, a hanging indentation cannot be expressed nat‐
               urally  under HTML, and many HTML‐based manual view‐
               ers simply interpret it as a starter  for  a  normal
               paragraph.  Thus, any information or distinction you
               tried to express with the indentation may be lost.

I'd just use .PP here, I think.


>>> +If neither
>>>   .B UFFD_PAGEFAULT_FLAG_WP
>>> -If the address is in a range that was registered with the
>>> -.B UFFDIO_REGISTER_MODE_WP
>>> -flag, when this bit is set, it means it is a write-protect fault.
>>> -Otherwise it is a page-missing fault.
>>> +nor
>>> +.B UFFD_PAGEFAULT_FLAG_MINOR
>>> +are set, then the fault was a missing fault.
>>>   .RE
>>>   .TP
>>>   .I pagefault.feat.pid
>>> --
>>> 2.32.0.rc1.229.g3e70b5a671-goog
>>>
>>
>
Axel Rasmussen March 22, 2022, 4:31 p.m. UTC | #5
On Mon, Aug 2, 2021 at 5:21 AM Alejandro Colomar (man-pages)
<alx.manpages@gmail.com> wrote:
>
> Hi Mike, Axel,
>
> On 8/2/21 1:21 PM, Mike Rapoport wrote:
> > (added man-pages maintainers)
>
> Thanks!  If I'm not CCed, I may not notice the email, depending on the
> traffic of the lists, and the amount of time I have ;)
>
> >
> > On Tue, Jul 27, 2021 at 09:32:34AM -0700, Axel Rasmussen wrote:
> >> Any remaining issues with this patch? I just realized today it was
> >> never merged. 5.13 (which contains this new feature) was released some
> >> weeks ago.
>
> Please see some minor formatting issues I commented below.
> Other than that, it looks good to me.
>
> Thanks,
>
> Alex
>
> >>
> >> On Fri, Jun 4, 2021 at 12:56 PM Axel Rasmussen <axelrasmussen@google.com> wrote:
> >>>
> >>> Userfaultfd minor fault mode is supported starting from Linux 5.13.
> >>>
> >>> This commit adds a description of the new mode, as well as the new ioctl
> >>> used to resolve such faults. The two go hand-in-hand: one can't resolve
> >>> a minor fault without continue, and continue can't be used to resolve
> >>> any other kind of fault.
> >>>
> >>> This patch covers just the hugetlbfs implementation (in 5.13). Support
> >>> for shmem is forthcoming, but as it has not yet made it into a kernel
> >>> release candidate, it will be added in a future commit.
> >>>
> >>> Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
> >>> ---
> >>>   man2/ioctl_userfaultfd.2 | 125 ++++++++++++++++++++++++++++++++++++---
> >>>   man2/userfaultfd.2       |  79 ++++++++++++++++++++-----
> >>>   2 files changed, 182 insertions(+), 22 deletions(-)
> >>>
> >>> diff --git a/man2/ioctl_userfaultfd.2 b/man2/ioctl_userfaultfd.2
> >>> index 504f61d4b..7b990c24a 100644
> >>> --- a/man2/ioctl_userfaultfd.2
> >>> +++ b/man2/ioctl_userfaultfd.2
> >>> @@ -214,6 +214,10 @@ memory accesses to the regions registered with userfaultfd.
> >>>   If this feature bit is set,
> >>>   .I uffd_msg.pagefault.feat.ptid
> >>>   will be set to the faulted thread ID for each page-fault message.
> >>> +.TP
> >>> +.BR UFFD_FEATURE_MINOR_HUGETLBFS " (since Linux 5.13)"
> >>> +If this feature bit is set, the kernel supports registering userfaultfd ranges
> >>> +in minor mode on hugetlbfs-backed memory areas.
>
> See the following extract from man-pages(7):
>
>     Use semantic newlines
>         In the source of a manual page,  new  sentences  should  be
>         started  on  new  lines, and long sentences should be split
>         into lines at clause breaks  (commas,  semicolons,  colons,
>         and  so on).  This convention, sometimes known as "semantic
>         newlines", makes it easier to see the  effect  of  patches,
>         which often operate at the level of individual sentences or
>         sentence clauses.
>
> A trick to check if some text is correct at first glance, is that the
> following regex should rarely match:
> [,;:.] \+\w
>
> Multi-sentence parenthetical expressions should also go on separate
> lines normally.
>
> I'd for example break the above text into:
>
> [
> If this feature bit is set,
> the kernel supports registering userfaultfd ranges
> in minor mode on hugetlbfs-backed memory areas
> ]
>
> Note the break after the comma, and another break at a sensible point
> (you already did that one correctly in this example, but some below don't).
>
>
> >>>   .PP
> >>>   The returned
> >>>   .I ioctls
> >>> @@ -240,6 +244,11 @@ operation is supported.
> >>>   The
> >>>   .B UFFDIO_WRITEPROTECT
> >>>   operation is supported.
> >>> +.TP
> >>> +.B 1 << _UFFDIO_CONTINUE
> >>> +The
> >>> +.B UFFDIO_CONTINUE
> >>> +operation is supported.
> >>>   .PP
> >>>   This
> >>>   .BR ioctl (2)
> >>> @@ -278,14 +287,8 @@ by the current kernel version.
> >>>   (Since Linux 4.3.)
> >>>   Register a memory address range with the userfaultfd object.
> >>>   The pages in the range must be "compatible".
> >>> -.PP
> >>> -Up to Linux kernel 4.11,
> >>> -only private anonymous ranges are compatible for registering with
> >>> -.BR UFFDIO_REGISTER .
> >>> -.PP
> >>> -Since Linux 4.11,
> >>> -hugetlbfs and shared memory ranges are also compatible with
> >>> -.BR UFFDIO_REGISTER .
> >>> +Please refer to the list of register modes below for the compatible memory
> >>> +backends for each mode.
>
> Regarding semantic newlines mentioned above:
>
> Here for example, a more sensible point to break the line would be just
> after (or maybe before, up to you) the first "for".
>
> >>>   .PP
> >>>   The
> >>>   .I argp
> >>> @@ -324,9 +327,16 @@ the specified range:
> >>>   .TP
> >>>   .B UFFDIO_REGISTER_MODE_MISSING
> >>>   Track page faults on missing pages.
> >>> +Since Linux 4.3, only private anonymous ranges are compatible.
> >>> +Since Linux 4.11, hugetlbfs and shared memory ranges are also compatible.
> >>>   .TP
> >>>   .B UFFDIO_REGISTER_MODE_WP
> >>>   Track page faults on write-protected pages.
> >>> +Since Linux 5.7, only private anonymous ranges are compatible.
> >>> +.TP
> >>> +.B UFFDIO_REGISTER_MODE_MINOR
> >>> +Track minor page faults.
> >>> +Since Linux 5.13, only hugetlbfs ranges are compatible.
> >>>   .PP
> >>>   If the operation is successful, the kernel modifies the
> >>>   .I ioctls
> >>> @@ -735,6 +745,105 @@ or not registered with userfaultfd write-protect mode.
> >>>   .TP
> >>>   .B EFAULT
> >>>   Encountered a generic fault during processing.
> >>> +.\"
> >>> +.SS UFFDIO_CONTINUE
> >>> +(Since Linux 5.13.)
> >>> +Resolve a minor page fault by installing page table entries for existing pages
> >>> +in the page cache.
> >>> +.PP
> >>> +The
> >>> +.I argp
> >>> +argument is a pointer to a
> >>> +.I uffdio_continue
> >>> +structure as shown below:
> >>> +.PP
> >>> +.in +4n
> >>> +.EX
> >>> +struct uffdio_continue {
> >>> +    struct uffdio_range range; /* Range to install PTEs for and continue */
> >>> +    __u64 mode;                /* Flags controlling the behavior of continue */
> >>> +    __s64 mapped;              /* Number of bytes mapped, or negated error */
> >>> +};
> >>> +.EE
> >>> +.in
> >>> +.PP
> >>> +The following value may be bitwise ORed in
> >>> +.IR mode
> >>> +to change the behavior of the
> >>> +.B UFFDIO_CONTINUE
> >>> +operation:
> >>> +.TP
> >>> +.B UFFDIO_CONTINUE_MODE_DONTWAKE
> >>> +Do not wake up the thread that waits for page-fault resolution.
> >>> +.PP
> >>> +The
> >>> +.I mapped
> >>> +field is used by the kernel to return the number of bytes
> >>> +that were actually mapped, or an error in the same manner as
> >>> +.BR UFFDIO_COPY .
> >>> +If the value returned in the
> >>> +.I mapped
> >>> +field doesn't match the value that was specified in
> >>> +.IR range.len ,
> >>> +the operation fails with the error
> >>> +.BR EAGAIN .
> >>> +The
> >>> +.I mapped
> >>> +field is output-only;
> >>> +it is not read by the
> >>> +.B UFFDIO_CONTINUE
> >>> +operation.
> >>> +.PP
> >>> +This
> >>> +.BR ioctl (2)
> >>> +operation returns 0 on success.
> >>> +In this case, the entire area was mapped.
> >>> +On error, \-1 is returned and
> >>> +.I errno
> >>> +is set to indicate the error.
> >>> +Possible errors include:
> >>> +.TP
> >>> +.B EAGAIN
> >>> +The number of bytes mapped (i.e., the value returned in the
> >>> +.I mapped
> >>> +field) does not equal the value that was specified in the
> >>> +.I range.len
> >>> +field.
> >>> +.TP
> >>> +.B EINVAL
> >>> +Either
> >>> +.I range.start
> >>> +or
> >>> +.I range.len
> >>> +was not a multiple of the system page size; or
> >>> +.I range.len
> >>> +was zero; or the range specified was invalid.
> >>> +.TP
> >>> +.B EINVAL
> >>> +An invalid bit was specified in the
> >>> +.IR mode
> >>> +field.
> >>> +.TP
> >>> +.B EEXIST
> >>> +One or more pages were already mapped in the given range.
> >>> +.TP
> >>> +.B ENOENT
> >>> +The faulting process has changed its virtual memory layout simultaneously with
> >>> +an outstanding
> >>> +.B UFFDIO_CONTINUE
> >>> +operation.
> >>> +.TP
> >>> +.B ENOMEM
> >>> +Allocating memory needed to set up the page table mappings failed.
> >>> +.TP
> >>> +.B EFAULT
> >>> +No existing page could be found in the page cache for the given range.
> >>> +.TP
> >>> +.BR ESRCH
> >>> +The faulting process has exited at the time of a
> >>> +.B UFFDIO_CONTINUE
> >>> +operation.
> >>> +.\"
> >>>   .SH RETURN VALUE
> >>>   See descriptions of the individual operations, above.
> >>>   .SH ERRORS
> >>> diff --git a/man2/userfaultfd.2 b/man2/userfaultfd.2
> >>> index 593c189d8..07f53c6ff 100644
> >>> --- a/man2/userfaultfd.2
> >>> +++ b/man2/userfaultfd.2
> >>> @@ -78,7 +78,7 @@ all memory ranges that were registered with the object are unregistered
> >>>   and unread events are flushed.
> >>>   .\"
> >>>   .PP
> >>> -Userfaultfd supports two modes of registration:
> >>> +Userfaultfd supports three modes of registration:
> >>>   .TP
> >>>   .BR UFFDIO_REGISTER_MODE_MISSING " (since 4.10)"
> >>>   When registered with
> >>> @@ -92,6 +92,18 @@ or an
> >>>   .B UFFDIO_ZEROPAGE
> >>>   ioctl.
> >>>   .TP
> >>> +.BR UFFDIO_REGISTER_MODE_MINOR " (since 5.13)"
> >>> +When registered with
> >>> +.B UFFDIO_REGISTER_MODE_MINOR
> >>> +mode, user-space will receive a page-fault notification
>
> s/user-space/user space/
>
> See the following extract from man-pages(7):
>
>     Preferred terms
>         The  following  table  lists some preferred terms to use in
>         man pages, mainly to ensure consistency across pages.
>
>         Term                 Avoid using              Notes
>         ─────────────────────────────────────────────────────────────
>         [...]
>         user space           userspace
>
> However, when user space is used as an adjective, per the usual English
> rules, we write "user-space".  Example: "a user-space program".

100% agreed that "user space" is more correct, but this man page
already has many instances of "user-space" in it. I'd suggest we
either fix all of them, or just follow the existing convention within
this page.

How about, leaving this as-is for this patch, to keep the diff tidy,
and I can send a follow-up patch to fix all the instances of this in
this page?

>
> >>> +when a minor page fault occurs.
> >>> +That is, when a backing page is in the page cache, but
> >>> +page table entries don't yet exist.
> >>> +The faulted thread will be stopped from execution until the page fault is
> >>> +resolved from user-space by an
> >>> +.B UFFDIO_CONTINUE
> >>> +ioctl.
> >>> +.TP
> >>>   .BR UFFDIO_REGISTER_MODE_WP " (since 5.7)"
> >>>   When registered with
> >>>   .B UFFDIO_REGISTER_MODE_WP
> >>> @@ -212,9 +224,10 @@ a page fault occurring in the requested memory range, and satisfying
> >>>   the mode defined at the registration time, will be forwarded by the kernel to
> >>>   the user-space application.
> >>>   The application can then use the
> >>> -.B UFFDIO_COPY
> >>> +.BR UFFDIO_COPY ,
> >>> +.BR UFFDIO_ZEROPAGE ,
> >>>   or
> >>> -.B UFFDIO_ZEROPAGE
> >>> +.B UFFDIO_CONTINUE
> >>>   .BR ioctl (2)
> >>>   operations to resolve the page fault.
> >>>   .PP
> >>> @@ -318,6 +331,43 @@ should have the flag
> >>>   cleared upon the faulted page or range.
> >>>   .PP
> >>>   Write-protect mode supports only private anonymous memory.
> >>> +.\"
> >>> +.SS Userfaultfd minor fault mode (since 5.13)
> >>> +Since Linux 5.13, userfaultfd supports minor fault mode.
> >>> +In this mode, fault messages are produced not for major faults (where the
> >>> +page was missing), but rather for minor faults, where a page exists in the page
> >>> +cache, but the page table entries are not yet present.
> >>> +The user needs to first check availability of this feature using
> >>> +.B UFFDIO_API
> >>> +ioctl against the feature bit
> >>> +.B UFFD_FEATURE_MINOR_HUGETLBFS
> >>> +before using this feature.
> >>> +.PP
> >>> +To register with userfaultfd minor fault mode, the user needs to initiate the
> >>> +.B UFFDIO_REGISTER
> >>> +ioctl with mode
> >>> +.B UFFDIO_REGISTER_MODE_MINOR
> >>> +set.
> >>> +.PP
> >>> +When a minor fault occurs, user-space will receive a page-fault notification
> >>> +whose
> >>> +.I uffd_msg.pagefault.flags
> >>> +will have the
> >>> +.B UFFD_PAGEFAULT_FLAG_MINOR
> >>> +flag set.
> >>> +.PP
> >>> +To resolve a minor page fault, the handler should decide whether or not the
> >>> +existing page contents need to be modified first.
> >>> +If so, this should be done in-place via a second, non-userfaultfd-registered
> >>> +mapping to the same backing page (e.g., by mapping the hugetlbfs file twice).
> >>> +Once the page is considered "up to date", the fault can be resolved by
> >>> +initiating an
> >>> +.B UFFDIO_CONTINUE
> >>> +ioctl, which installs the page table entries and (by default) wakes up the
> >>> +faulting thread(s).
> >>> +.PP
> >>> +Minor fault mode supports only hugetlbfs-backed memory.
> >>> +.\"
> >>>   .SS Reading from the userfaultfd structure
> >>>   Each
> >>>   .BR read (2)
> >>> @@ -456,19 +506,20 @@ For
> >>>   the following flag may appear:
> >>>   .RS
> >>>   .TP
> >>> -.B UFFD_PAGEFAULT_FLAG_WRITE
> >>> -If the address is in a range that was registered with the
> >>> -.B UFFDIO_REGISTER_MODE_MISSING
> >>> -flag (see
> >>> -.BR ioctl_userfaultfd (2))
> >>> -and this flag is set, this a write fault;
> >>> -otherwise it is a read fault.
> >>> +.B UFFD_PAGEFAULT_FLAG_WP
> >>> +If this flag is set, then the fault was a write-protect fault.
> >>>   .TP
> >>> +.B UFFD_PAGEFAULT_FLAG_MINOR
> >>> +If this flag is set, then the fault was a minor fault.
> >>> +.TP
> >>> +.B UFFD_PAGEFAULT_FLAG_WRITE
> >>> +If this flag is set, then the fault was a write fault.
> >>> +.HP
>
> See the following extract from groff_man(7):
>
>     Deprecated features
>         Use of the following is discouraged.
>
>         [...]
>
>         .HP [indent]
>                Set up a paragraph with a hanging left  indentation.
>                The  indent argument, if present, is handled as with
>                .TP.
>
>                Use of this presentation‐level macro is  deprecated.
>                While it is universally portable to legacy Unix sys‐
>                tems, a hanging indentation cannot be expressed nat‐
>                urally  under HTML, and many HTML‐based manual view‐
>                ers simply interpret it as a starter  for  a  normal
>                paragraph.  Thus, any information or distinction you
>                tried to express with the indentation may be lost.
>
> I'd just use .PP here, I think.
>
>
> >>> +If neither
> >>>   .B UFFD_PAGEFAULT_FLAG_WP
> >>> -If the address is in a range that was registered with the
> >>> -.B UFFDIO_REGISTER_MODE_WP
> >>> -flag, when this bit is set, it means it is a write-protect fault.
> >>> -Otherwise it is a page-missing fault.
> >>> +nor
> >>> +.B UFFD_PAGEFAULT_FLAG_MINOR
> >>> +are set, then the fault was a missing fault.
> >>>   .RE
> >>>   .TP
> >>>   .I pagefault.feat.pid
> >>> --
> >>> 2.32.0.rc1.229.g3e70b5a671-goog
> >>>
> >>
> >
>
>
> --
> Alejandro Colomar
> Linux man-pages comaintainer; https://www.kernel.org/doc/man-pages/
> http://www.alejandro-colomar.es/
Alejandro Colomar April 2, 2022, 9:48 p.m. UTC | #6
Hi Axel,

On 3/22/22 17:31, Axel Rasmussen wrote:
> On Mon, Aug 2, 2021 at 5:21 AM Alejandro Colomar (man-pages)
> <alx.manpages@gmail.com> wrote:
>>>>> +mode, user-space will receive a page-fault notification
>>
>> s/user-space/user space/
>>
>> See the following extract from man-pages(7):
>>
>>     Preferred terms
>>         The  following  table  lists some preferred terms to use in
>>         man pages, mainly to ensure consistency across pages.
>>
>>         Term                 Avoid using              Notes
>>         ─────────────────────────────────────────────────────────────
>>         [...]
>>         user space           userspace
>>
>> However, when user space is used as an adjective, per the usual English
>> rules, we write "user-space".  Example: "a user-space program".
> 
> 100% agreed that "user space" is more correct, but this man page
> already has many instances of "user-space" in it. I'd suggest we
> either fix all of them, or just follow the existing convention within
> this page.
> 
> How about, leaving this as-is for this patch, to keep the diff tidy,
> and I can send a follow-up patch to fix all the instances of this in
> this page?
> 

Sure.  Sorry for the delay.

Thanks,

Alex
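
For readers following the thread, the handling of the mapped field described in the UFFDIO_CONTINUE section of the patch can be sketched roughly as below. This is only an illustration, not code from the patch or the kernel: struct sketch_range stands in for struct uffdio_range, every sketch_* name is hypothetical, and retrying the unmapped remainder after EAGAIN is an assumption about what a handler might want to do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct uffdio_range. */
struct sketch_range {
    uint64_t start;
    uint64_t len;
};

enum sketch_continue_status {
    SKETCH_CONTINUE_DONE,   /* the whole range was mapped */
    SKETCH_CONTINUE_RETRY,  /* partly mapped; *range now holds the rest */
    SKETCH_CONTINUE_ERROR,  /* mapped carried a negated error code */
};

/*
 * Interpret the value the kernel left in uffdio_continue.mapped.
 * On a partial mapping, advance the range past the bytes that were
 * mapped so that the caller can (by assumption) retry the remainder.
 */
static inline enum sketch_continue_status
sketch_after_continue(struct sketch_range *range, int64_t mapped)
{
    assert(range != NULL);

    if (mapped < 0)
        return SKETCH_CONTINUE_ERROR;
    if ((uint64_t)mapped >= range->len)
        return SKETCH_CONTINUE_DONE;

    range->start += (uint64_t)mapped;
    range->len -= (uint64_t)mapped;
    return SKETCH_CONTINUE_RETRY;
}
```

A real handler would call ioctl(2) with UFFDIO_CONTINUE, check errno, and only consult mapped when the call fails with EAGAIN; the helper above covers just the arithmetic on the range.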

Patch

diff --git a/man2/ioctl_userfaultfd.2 b/man2/ioctl_userfaultfd.2
index 504f61d4b..7b990c24a 100644
--- a/man2/ioctl_userfaultfd.2
+++ b/man2/ioctl_userfaultfd.2
@@ -214,6 +214,10 @@  memory accesses to the regions registered with userfaultfd.
 If this feature bit is set,
 .I uffd_msg.pagefault.feat.ptid
 will be set to the faulted thread ID for each page-fault message.
+.TP
+.BR UFFD_FEATURE_MINOR_HUGETLBFS " (since Linux 5.13)"
+If this feature bit is set, the kernel supports registering userfaultfd ranges
+in minor mode on hugetlbfs-backed memory areas.
 .PP
 The returned
 .I ioctls
@@ -240,6 +244,11 @@  operation is supported.
 The
 .B UFFDIO_WRITEPROTECT
 operation is supported.
+.TP
+.B 1 << _UFFDIO_CONTINUE
+The
+.B UFFDIO_CONTINUE
+operation is supported.
 .PP
 This
 .BR ioctl (2)
@@ -278,14 +287,8 @@  by the current kernel version.
 (Since Linux 4.3.)
 Register a memory address range with the userfaultfd object.
 The pages in the range must be "compatible".
-.PP
-Up to Linux kernel 4.11,
-only private anonymous ranges are compatible for registering with
-.BR UFFDIO_REGISTER .
-.PP
-Since Linux 4.11,
-hugetlbfs and shared memory ranges are also compatible with
-.BR UFFDIO_REGISTER .
+Please refer to the list of register modes below for the compatible memory
+backends for each mode.
 .PP
 The
 .I argp
@@ -324,9 +327,16 @@  the specified range:
 .TP
 .B UFFDIO_REGISTER_MODE_MISSING
 Track page faults on missing pages.
+Since Linux 4.3, only private anonymous ranges are compatible.
+Since Linux 4.11, hugetlbfs and shared memory ranges are also compatible.
 .TP
 .B UFFDIO_REGISTER_MODE_WP
 Track page faults on write-protected pages.
+Since Linux 5.7, only private anonymous ranges are compatible.
+.TP
+.B UFFDIO_REGISTER_MODE_MINOR
+Track minor page faults.
+Since Linux 5.13, only hugetlbfs ranges are compatible.
 .PP
 If the operation is successful, the kernel modifies the
 .I ioctls
@@ -735,6 +745,105 @@  or not registered with userfaultfd write-protect mode.
 .TP
 .B EFAULT
 Encountered a generic fault during processing.
+.\"
+.SS UFFDIO_CONTINUE
+(Since Linux 5.13.)
+Resolve a minor page fault by installing page table entries for existing pages
+in the page cache.
+.PP
+The
+.I argp
+argument is a pointer to a
+.I uffdio_continue
+structure as shown below:
+.PP
+.in +4n
+.EX
+struct uffdio_continue {
+    struct uffdio_range range; /* Range to install PTEs for and continue */
+    __u64 mode;                /* Flags controlling the behavior of continue */
+    __s64 mapped;              /* Number of bytes mapped, or negated error */
+};
+.EE
+.in
+.PP
+The following value may be bitwise ORed in
+.IR mode
+to change the behavior of the
+.B UFFDIO_CONTINUE
+operation:
+.TP
+.B UFFDIO_CONTINUE_MODE_DONTWAKE
+Do not wake up the thread that waits for page-fault resolution.
+.PP
+The
+.I mapped
+field is used by the kernel to return the number of bytes
+that were actually mapped, or an error in the same manner as
+.BR UFFDIO_COPY .
+If the value returned in the
+.I mapped
+field doesn't match the value that was specified in
+.IR range.len ,
+the operation fails with the error
+.BR EAGAIN .
+The
+.I mapped
+field is output-only;
+it is not read by the
+.B UFFDIO_CONTINUE
+operation.
+.PP
+This
+.BR ioctl (2)
+operation returns 0 on success.
+In this case, the entire area was mapped.
+On error, \-1 is returned and
+.I errno
+is set to indicate the error.
+Possible errors include:
+.TP
+.B EAGAIN
+The number of bytes mapped (i.e., the value returned in the
+.I mapped
+field) does not equal the value that was specified in the
+.I range.len
+field.
+.TP
+.B EINVAL
+Either
+.I range.start
+or
+.I range.len
+was not a multiple of the system page size; or
+.I range.len
+was zero; or the range specified was invalid.
+.TP
+.B EINVAL
+An invalid bit was specified in the
+.IR mode
+field.
+.TP
+.B EEXIST
+One or more pages were already mapped in the given range.
+.TP
+.B ENOENT
+The faulting process has changed its virtual memory layout simultaneously with
+an outstanding
+.B UFFDIO_CONTINUE
+operation.
+.TP
+.B ENOMEM
+Allocating memory needed to set up the page table mappings failed.
+.TP
+.B EFAULT
+No existing page could be found in the page cache for the given range.
+.TP
+.BR ESRCH
+The faulting process has exited at the time of a
+.B UFFDIO_CONTINUE
+operation.
+.\"
 .SH RETURN VALUE
 See descriptions of the individual operations, above.
 .SH ERRORS
diff --git a/man2/userfaultfd.2 b/man2/userfaultfd.2
index 593c189d8..07f53c6ff 100644
--- a/man2/userfaultfd.2
+++ b/man2/userfaultfd.2
@@ -78,7 +78,7 @@  all memory ranges that were registered with the object are unregistered
 and unread events are flushed.
 .\"
 .PP
-Userfaultfd supports two modes of registration:
+Userfaultfd supports three modes of registration:
 .TP
 .BR UFFDIO_REGISTER_MODE_MISSING " (since 4.10)"
 When registered with
@@ -92,6 +92,18 @@  or an
 .B UFFDIO_ZEROPAGE
 ioctl.
 .TP
+.BR UFFDIO_REGISTER_MODE_MINOR " (since 5.13)"
+When registered with
+.B UFFDIO_REGISTER_MODE_MINOR
+mode, user-space will receive a page-fault notification
+when a minor page fault occurs.
+That is, when a backing page is in the page cache, but
+page table entries don't yet exist.
+The faulted thread will be stopped from execution until the page fault is
+resolved from user-space by an
+.B UFFDIO_CONTINUE
+ioctl.
+.TP
 .BR UFFDIO_REGISTER_MODE_WP " (since 5.7)"
 When registered with
 .B UFFDIO_REGISTER_MODE_WP
@@ -212,9 +224,10 @@  a page fault occurring in the requested memory range, and satisfying
 the mode defined at the registration time, will be forwarded by the kernel to
 the user-space application.
 The application can then use the
-.B UFFDIO_COPY
+.BR UFFDIO_COPY ,
+.BR UFFDIO_ZEROPAGE ,
 or
-.B UFFDIO_ZEROPAGE
+.B UFFDIO_CONTINUE
 .BR ioctl (2)
 operations to resolve the page fault.
 .PP
@@ -318,6 +331,43 @@  should have the flag
 cleared upon the faulted page or range.
 .PP
 Write-protect mode supports only private anonymous memory.
+.\"
+.SS Userfaultfd minor fault mode (since 5.13)
+Since Linux 5.13, userfaultfd supports minor fault mode.
+In this mode, fault messages are produced not for major faults (where the
+page was missing), but rather for minor faults, where a page exists in the page
+cache, but the page table entries are not yet present.
+The user needs to first check availability of this feature using
+.B UFFDIO_API
+ioctl against the feature bit
+.B UFFD_FEATURE_MINOR_HUGETLBFS
+before using this feature.
+.PP
+To register with userfaultfd minor fault mode, the user needs to initiate the
+.B UFFDIO_REGISTER
+ioctl with mode
+.B UFFDIO_REGISTER_MODE_MINOR
+set.
+.PP
+When a minor fault occurs, user-space will receive a page-fault notification
+whose
+.I uffd_msg.pagefault.flags
+will have the
+.B UFFD_PAGEFAULT_FLAG_MINOR
+flag set.
+.PP
+To resolve a minor page fault, the handler should decide whether or not the
+existing page contents need to be modified first.
+If so, this should be done in-place via a second, non-userfaultfd-registered
+mapping to the same backing page (e.g., by mapping the hugetlbfs file twice).
+Once the page is considered "up to date", the fault can be resolved by
+initiating an
+.B UFFDIO_CONTINUE
+ioctl, which installs the page table entries and (by default) wakes up the
+faulting thread(s).
+.PP
+Minor fault mode supports only hugetlbfs-backed memory.
+.\"
 .SS Reading from the userfaultfd structure
 Each
 .BR read (2)
@@ -456,19 +506,20 @@  For
 the following flag may appear:
 .RS
 .TP
-.B UFFD_PAGEFAULT_FLAG_WRITE
-If the address is in a range that was registered with the
-.B UFFDIO_REGISTER_MODE_MISSING
-flag (see
-.BR ioctl_userfaultfd (2))
-and this flag is set, this a write fault;
-otherwise it is a read fault.
+.B UFFD_PAGEFAULT_FLAG_WP
+If this flag is set, then the fault was a write-protect fault.
 .TP
+.B UFFD_PAGEFAULT_FLAG_MINOR
+If this flag is set, then the fault was a minor fault.
+.TP
+.B UFFD_PAGEFAULT_FLAG_WRITE
+If this flag is set, then the fault was a write fault.
+.HP
+If neither
 .B UFFD_PAGEFAULT_FLAG_WP
-If the address is in a range that was registered with the
-.B UFFDIO_REGISTER_MODE_WP
-flag, when this bit is set, it means it is a write-protect fault.
-Otherwise it is a page-missing fault.
+nor
+.B UFFD_PAGEFAULT_FLAG_MINOR
+are set, then the fault was a missing fault.
 .RE
 .TP
 .I pagefault.feat.pid
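
As a rough illustration of the flag semantics in the last hunk, a handler might classify a fault from pagefault.flags along the following lines. This is a sketch under stated assumptions, not part of the patch: the SKETCH_* constants are local copies of the UFFD_PAGEFAULT_FLAG_* bits so that the example compiles without kernel headers (real code should take them from <linux/userfaultfd.h>), and all sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the UFFD_PAGEFAULT_FLAG_* bits (assumption: real code
 * includes <linux/userfaultfd.h> instead of redefining them). */
#define SKETCH_PAGEFAULT_FLAG_WRITE (UINT64_C(1) << 0)
#define SKETCH_PAGEFAULT_FLAG_WP    (UINT64_C(1) << 1)
#define SKETCH_PAGEFAULT_FLAG_MINOR (UINT64_C(1) << 2)

enum sketch_fault_kind {
    SKETCH_FAULT_MISSING,  /* neither WP nor MINOR is set */
    SKETCH_FAULT_WP,       /* write-protect fault */
    SKETCH_FAULT_MINOR,    /* page is in the page cache, PTEs are absent */
};

/* Map pagefault.flags to the kind of fault, as the hunk describes. */
static inline enum sketch_fault_kind sketch_classify_fault(uint64_t flags)
{
    if (flags & SKETCH_PAGEFAULT_FLAG_WP)
        return SKETCH_FAULT_WP;
    if (flags & SKETCH_PAGEFAULT_FLAG_MINOR)
        return SKETCH_FAULT_MINOR;
    return SKETCH_FAULT_MISSING;
}

/* The WRITE bit is independent of the fault kind. */
static inline int sketch_fault_is_write(uint64_t flags)
{
    return (flags & SKETCH_PAGEFAULT_FLAG_WRITE) != 0;
}

/* Usage example: a write access that raised a minor fault. */
static inline void sketch_fault_example(void)
{
    uint64_t flags = SKETCH_PAGEFAULT_FLAG_MINOR | SKETCH_PAGEFAULT_FLAG_WRITE;

    assert(sketch_classify_fault(flags) == SKETCH_FAULT_MINOR);
    assert(sketch_fault_is_write(flags));
}
```

In this sketch only a SKETCH_FAULT_MINOR result would be resolved with UFFDIO_CONTINUE; a missing fault would still go through UFFDIO_COPY or UFFDIO_ZEROPAGE.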