diff mbox series

[RFC,v2] fhandle: expose u64 mount id to name_to_handle_at(2)

Message ID 20240523-exportfs-u64-mount-id-v2-1-f9f959f17eb1@cyphar.com (mailing list archive)
State New, archived
Headers show
Series [RFC,v2] fhandle: expose u64 mount id to name_to_handle_at(2) | expand

Commit Message

Aleksa Sarai May 23, 2024, 8:57 p.m. UTC
Now that we provide a unique 64-bit mount ID interface in statx, we can
now provide a race-free way for name_to_handle_at(2) to provide a file
handle and corresponding mount without needing to worry about racing
with /proc/mountinfo parsing.

While this is not necessary if you are using AT_EMPTY_PATH and don't
care about an extra statx(2) call, users that pass full paths into
name_to_handle_at(2) need to know which mount the file handle comes from
(to make sure they don't try to open_by_handle_at a file handle from a
different filesystem) and switching to AT_EMPTY_PATH would require
allocating a file for every name_to_handle_at(2) call, turning

  err = name_to_handle_at(-EBADF, "/foo/bar/baz", &handle, &mntid,
                          AT_HANDLE_MNT_ID_UNIQUE);

into

  int fd = openat(-EBADF, "/foo/bar/baz", O_PATH | O_CLOEXEC);
  err1 = name_to_handle_at(fd, "", &handle, &unused_mntid, AT_EMPTY_PATH);
  err2 = statx(fd, "", AT_EMPTY_PATH, STATX_MNT_ID_UNIQUE, &statxbuf);
  mntid = statxbuf.stx_mnt_id;
  close(fd);

Unlike AT_HANDLE_FID, as per Amir's suggestion, AT_HANDLE_MNT_ID_UNIQUE
uses a new AT_* bit from the historically-unused 0xFF range (which we
now define as being the "per-syscall" range for AT_* bits).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
Changes in v2:
- Fixed a few minor compiler warnings and a buggy copy_to_user() check.
- Rename AT_HANDLE_UNIQUE_MOUNT_ID -> AT_HANDLE_MNT_ID_UNIQUE to match statx.
- Switched to using an AT_* bit from 0xFF and defining that range as
  being "per-syscall" for future usage.
- Sync tools/ copy of <linux/fcntl.h> to include changes.
- v1: <https://lore.kernel.org/r/20240520-exportfs-u64-mount-id-v1-1-f55fd9215b8e@cyphar.com>
---
 fs/fhandle.c                     | 29 ++++++++++++++++++++++-------
 include/linux/syscalls.h         |  2 +-
 include/uapi/linux/fcntl.h       | 28 +++++++++++++++++++++-------
 tools/include/uapi/linux/fcntl.h | 28 +++++++++++++++++++++-------
 4 files changed, 65 insertions(+), 22 deletions(-)


---
base-commit: 584bbf439d0fa83d728ec49f3a38c581bdc828b4
change-id: 20240515-exportfs-u64-mount-id-9ebb5c58b53c

Best regards,

Comments

Amir Goldstein May 24, 2024, 4:58 a.m. UTC | #1
On Thu, May 23, 2024 at 11:57 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> Now that we provide a unique 64-bit mount ID interface in statx, we can
> now provide a race-free way for name_to_handle_at(2) to provide a file
> handle and corresponding mount without needing to worry about racing
> with /proc/mountinfo parsing.
>
> While this is not necessary if you are using AT_EMPTY_PATH and don't
> care about an extra statx(2) call, users that pass full paths into
> name_to_handle_at(2) need to know which mount the file handle comes from
> (to make sure they don't try to open_by_handle_at a file handle from a
> different filesystem) and switching to AT_EMPTY_PATH would require
> allocating a file for every name_to_handle_at(2) call, turning
>
>   err = name_to_handle_at(-EBADF, "/foo/bar/baz", &handle, &mntid,
>                           AT_HANDLE_MNT_ID_UNIQUE);
>
> into
>
>   int fd = openat(-EBADF, "/foo/bar/baz", O_PATH | O_CLOEXEC);
>   err1 = name_to_handle_at(fd, "", &handle, &unused_mntid, AT_EMPTY_PATH);
>   err2 = statx(fd, "", AT_EMPTY_PATH, STATX_MNT_ID_UNIQUE, &statxbuf);
>   mntid = statxbuf.stx_mnt_id;
>   close(fd);
>
> Unlike AT_HANDLE_FID, as per Amir's suggestion, AT_HANDLE_MNT_ID_UNIQUE
> uses a new AT_* bit from the historically-unused 0xFF range (which we
> now define as being the "per-syscall" range for AT_* bits).
>

Sorry for nit picking, but I think that "Unlike AT_HANDLE_FID,..." is confusing
in this statement.
AT_HANDLE_FID is using a bit that was already effectively allocated for a
"per-syscall" range.
I don't think that mentioning AT_HANDLE_FID adds any clarity to the statement
so better drop it?

> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> ---
> Changes in v2:
> - Fixed a few minor compiler warnings and a buggy copy_to_user() check.
> - Rename AT_HANDLE_UNIQUE_MOUNT_ID -> AT_HANDLE_MNT_ID_UNIQUE to match statx.
> - Switched to using an AT_* bit from 0xFF and defining that range as
>   being "per-syscall" for future usage.
> - Sync tools/ copy of <linux/fcntl.h> to include changes.
> - v1: <https://lore.kernel.org/r/20240520-exportfs-u64-mount-id-v1-1-f55fd9215b8e@cyphar.com>
> ---
>  fs/fhandle.c                     | 29 ++++++++++++++++++++++-------
>  include/linux/syscalls.h         |  2 +-
>  include/uapi/linux/fcntl.h       | 28 +++++++++++++++++++++-------
>  tools/include/uapi/linux/fcntl.h | 28 +++++++++++++++++++++-------
>  4 files changed, 65 insertions(+), 22 deletions(-)
>
[...]

> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> index c0bcc185fa48..9ed9d65842c1 100644
> --- a/include/uapi/linux/fcntl.h
> +++ b/include/uapi/linux/fcntl.h
> @@ -87,22 +87,24 @@
>  #define DN_ATTRIB      0x00000020      /* File changed attibutes */
>  #define DN_MULTISHOT   0x80000000      /* Don't remove notifier */
>
> +#define AT_FDCWD               -100    /* Special value used to indicate
> +                                           openat should use the current
> +                                           working directory. */

(more nit picking)
If you are changing this line, please at least add a new line,
this is a different namespace :-/
and perhaps change it to "Special value of dirfd argument..."

Also, better leave a comment here to discourage allocation from this range:

+ /* Reserved for per-syscall flags              0xff   */

> +#define AT_SYMLINK_NOFOLLOW    0x100   /* Do not follow symbolic links.  */
> +
>  /*
> - * The constants AT_REMOVEDIR and AT_EACCESS have the same value.  AT_EACCESS is
> - * meaningful only to faccessat, while AT_REMOVEDIR is meaningful only to
> + * The constants AT_REMOVEDIR and AT_EACCESS have the same value.  AT_EACCESS
> + * is meaningful only to faccessat, while AT_REMOVEDIR is meaningful only to
>   * unlinkat.  The two functions do completely different things and therefore,
>   * the flags can be allowed to overlap.  For example, passing AT_REMOVEDIR to
>   * faccessat would be undefined behavior and thus treating it equivalent to
>   * AT_EACCESS is valid undefined behavior.
>   */

If you are going to add this churn in this patch, please do it otherwise.
It does not make sense to have this long explanation about pre-syscall
AT_* flags in a different location from the comment you added about
"All new purely-syscall-specific AT_* flags.."
if this explanation is needed at all, it should be after the new comment
as an example.


> -#define AT_FDCWD               -100    /* Special value used to indicate
> -                                           openat should use the current
> -                                           working directory. */
> -#define AT_SYMLINK_NOFOLLOW    0x100   /* Do not follow symbolic links.  */
>  #define AT_EACCESS             0x200   /* Test access permitted for
>                                             effective IDs, not real IDs.  */
>  #define AT_REMOVEDIR           0x200   /* Remove directory instead of
>                                             unlinking file.  */

I really prefer to move those to the per-syscall section
right next to AT_HANDLE_FID and leave a comment here:

/* Reserved for per-syscall flags           0x200   */

> +
>  #define AT_SYMLINK_FOLLOW      0x400   /* Follow symbolic links.  */
>  #define AT_NO_AUTOMOUNT                0x800   /* Suppress terminal automount traversal */
>  #define AT_EMPTY_PATH          0x1000  /* Allow empty relative pathname */
> @@ -114,10 +116,22 @@
>
>  #define AT_RECURSIVE           0x8000  /* Apply to the entire subtree */
>
> -/* Flags for name_to_handle_at(2). We reuse AT_ flag space to save bits... */
> +/*
> + * All new purely-syscall-specific AT_* flags should consider using bits from
> + * 0xFF, but the bits used by RENAME_* (0x7) should be avoided in case users
> + * decide to pass AT_* flags to renameat2() by accident.

Sorry, but I find the use of my renameat2() example a bit confusing
in this sentence.
If you mention it at all, please use "For example, the bits used by..."
but I think it is more important to say "...should consider re-using bits
already used by other per-syscalls flags".

> These flag bits are
> + * free for re-use by other syscall's syscall-specific flags without worry.
> + */
> +
> +/*
> + * Flags for name_to_handle_at(2). To save AT_ flag space we re-use the
> + * AT_EACCESS/AT_REMOVEDIR bit for AT_HANDLE_FID.
> + */

AT_EACCESS/AT_REMOVEDIR/AT_HANDLE_FID have exact same status,
so instead of this asymmetric comment:

+/* Flags for faccessat(2) */
+#define AT_EACCESS             0x200   /* Test access permitted for
+                                           effective IDs, not real IDs.  */
+/* Flags for unlinkat(2) */
+#define AT_REMOVEDIR           0x200   /* Remove directory instead of
+                                           unlinking file.  */
+/* Flags for name_to_handle_at(2) */
+#define AT_HANDLE_FID          0x200   /* file handle is needed to
                                        compare object identity and may not
                                        be usable to open_by_handle_at(2) */

> +#define AT_HANDLE_MNT_ID_UNIQUE        0x80    /* return the u64 unique mount id */

IDGI, I think we may have been miscommunicating :-/
If 0x7 range is to be avoided for generic AT_ flags, then it *should* be used
for new per-syscall flags such as this one.

The reservation of 0xff is not a strong guarantee.
As long as people re-use new per-syscalls flags efficiently, we could
decide to reclaim some of this space for generic AT_ flags in the future
if it is needed.

I know most of the mess was here before your patch, but I think
it got to a point where we must put a little order before introducing
the new per-syscall flag.

Thanks,
Amir.
Christoph Hellwig May 26, 2024, 9:25 a.m. UTC | #2
On Thu, May 23, 2024 at 01:57:32PM -0700, Aleksa Sarai wrote:
> Now that we provide a unique 64-bit mount ID interface in statx, we can
> now provide a race-free way for name_to_handle_at(2) to provide a file
> handle and corresponding mount without needing to worry about racing
> with /proc/mountinfo parsing.

file handles are not tied to mounts, they are tied to super_blocks,
and they can survive reboots or (less relevant) remounts.  This thus
seems like a very confusing if not wrong interfaces.
Aleksa Sarai May 26, 2024, 7:01 p.m. UTC | #3
On 2024-05-26, Christoph Hellwig <hch@infradead.org> wrote:
> On Thu, May 23, 2024 at 01:57:32PM -0700, Aleksa Sarai wrote:
> > Now that we provide a unique 64-bit mount ID interface in statx, we can
> > now provide a race-free way for name_to_handle_at(2) to provide a file
> > handle and corresponding mount without needing to worry about racing
> > with /proc/mountinfo parsing.
> 
> file handles are not tied to mounts, they are tied to super_blocks,
> and they can survive reboots or (less relevant) remounts.  This thus
> seems like a very confusing if not wrong interfaces.

The existing interface already provides a mount ID which is not even
safe without rebooting.

The purpose of the mount ID (in the existing API) is to allow userspace
to make sure they know which filesystem mount the file handle came from
so they can store the relevant information (fsid, etc) to make sure they
know which superblock the mount came from when it comes to doing
open_by_handle_at(2). However, there is a race between getting the mount
ID from name_to_handle_at() and parsing /proc/self/mountinfo, which
means that userspace cannot ever be sure that they know the correct
mount the file handle came from (the man page for open_by_handle_at
mentions this issue explicitly).

Getting the ability to get a 64-bit mount ID would allow userspace to
use statmount(2) directly, avoiding this race. You're quite right that
obviously you wouldn't want to just store the mount ID.

An alternative would be to return something unique to the filesystem
superblock, but as far as I can tell there is no guarantee that every
Linux filesystem's fsid is sufficiently unique to act as a globally
unique identifier. At least with a 64-bit mount ID and statmount(2),
userspace can decide what information is needed to get sufficiently
unique information about the source filesystem.
Trond Myklebust May 26, 2024, 10:32 p.m. UTC | #4
On Sun, 2024-05-26 at 02:25 -0700, Christoph Hellwig wrote:
> On Thu, May 23, 2024 at 01:57:32PM -0700, Aleksa Sarai wrote:
> > Now that we provide a unique 64-bit mount ID interface in statx, we
> > can
> > now provide a race-free way for name_to_handle_at(2) to provide a
> > file
> > handle and corresponding mount without needing to worry about
> > racing
> > with /proc/mountinfo parsing.
> 
> file handles are not tied to mounts, they are tied to super_blocks,
> and they can survive reboots or (less relevant) remounts.  This thus
> seems like a very confusing if not wrong interfaces.

I assume the reason is to give the caller a race free way to figure out
which submount the path resolves to. The problem is that nothing stops
another process from calling umount() before you're done parsing
/proc/mountinfo and have resolved the mount id.

If we're looking to change the API, then perhaps returning a file
descriptor might be a better alternative?
Most userland NFS servers are in any case going to follow up obtaining
the filehandle with a stat() or even a full blown open() in order to
get file attributes, set up file state, etc. By returning an open file
descriptor to the resolved file (even if it is only an O_PATH
descriptor) we could accelerate those operations in addition to solving
the umount() race.

Alternatively, just remove the path argument altogether, and require
the descriptor argument to be an O_PATH or regular open file descriptor
that resolves to the file we want to get a filehandle for. However this
would require a userland NFS server to generally do a
open_by_handle_at() to resolve the parent directory handle, then do an
openat(O_PATH) to get the file to look up, before being able to call
the name_to_handle_at() replacement.
i.e. there would be 1 extra syscall.
Christoph Hellwig May 27, 2024, 11:47 a.m. UTC | #5
On Sun, May 26, 2024 at 12:01:08PM -0700, Aleksa Sarai wrote:
> The existing interface already provides a mount ID which is not even
> safe without rebooting.

And that seems to be a big part of the problem where the Linux by handle
syscall API deviated from all know precedence for no good reason.  NFS
file handles which were the start of this do (and have to) encode a
persistent file system identifier.  As do the xfs handles (although they
do the decoding in the userspace library on Linux for historic reasons),
as do the FreeBSD equivalents to these syscalls.

> An alternative would be to return something unique to the filesystem
> superblock, but as far as I can tell there is no guarantee that every
> Linux filesystem's fsid is sufficiently unique to act as a globally
> unique identifier. At least with a 64-bit mount ID and statmount(2),
> userspace can decide what information is needed to get sufficiently
> unique information about the source filesystem.

Well, every file system that supports export ops already needs a
globally unique ID for NFS to work properly.  We might not have good
enough interfaces for that, but that shouldn't be too hard.
Christoph Hellwig May 27, 2024, 11:49 a.m. UTC | #6
On Sun, May 26, 2024 at 10:32:39PM +0000, Trond Myklebust wrote:
> I assume the reason is to give the caller a race free way to figure out
> which submount the path resolves to.

But the handle op are global to the file systems (aka super_block).  It
does not matter what mount you use to access it.

Snip random things about userland NFS servers I couldn't care less
about..
Christian Brauner May 27, 2024, 12:22 p.m. UTC | #7
On Sun, May 26, 2024 at 02:25:34AM -0700, Christoph Hellwig wrote:
> On Thu, May 23, 2024 at 01:57:32PM -0700, Aleksa Sarai wrote:
> > Now that we provide a unique 64-bit mount ID interface in statx, we can
> > now provide a race-free way for name_to_handle_at(2) to provide a file
> > handle and corresponding mount without needing to worry about racing
> > with /proc/mountinfo parsing.
> 
> file handles are not tied to mounts, they are tied to super_blocks,
> and they can survive reboots or (less relevant) remounts.  This thus
> seems like a very confusing if not wrong interfaces.

I'm not fond of the interface as I've said on an earlier version, but
name_to_handle_at() has always exposed the mount id. IOW, this isn't
adding a new concept to the system call.
Christian Brauner May 27, 2024, 12:29 p.m. UTC | #8
On Mon, May 27, 2024 at 04:47:56AM -0700, Christoph Hellwig wrote:
> On Sun, May 26, 2024 at 12:01:08PM -0700, Aleksa Sarai wrote:
> > The existing interface already provides a mount ID which is not even
> > safe without rebooting.
> 
> And that seems to be a big part of the problem where the Linux by handle
> syscall API deviated from all know precedence for no good reason.  NFS
> file handles which were the start of this do (and have to) encode a
> persistent file system identifier.  As do the xfs handles (although they
> do the decoding in the userspace library on Linux for historic reasons),
> as do the FreeBSD equivalents to these syscalls.
> 
> > An alternative would be to return something unique to the filesystem
> > superblock, but as far as I can tell there is no guarantee that every
> > Linux filesystem's fsid is sufficiently unique to act as a globally
> > unique identifier. At least with a 64-bit mount ID and statmount(2),
> > userspace can decide what information is needed to get sufficiently
> > unique information about the source filesystem.
> 
> Well, every file system that supports export ops already needs a
> globally unique ID for NFS to work properly.  We might not have good
> enough interfaces for that, but that shouldn't be too hard.

I see not inherent problem with exposing the 64 bit mount id through
name_to_handle_at() as we already to expose the old one anyway.

But I agree that it is useful if we had the guarantee that file handles
are unique in the way you describe. As it currently stands that doesn't
seem to be the case and userspace doesn't seem to have a way of figuring
out if the handle provided by name_to_handle_at() is indeed unique as
you describe and can be reliably passed to open_by_handle_at().

Yes, we should fix it but that's really orthogonal to the mount id. It
is separately useful and we already do expose it anyway.
Christian Brauner May 27, 2024, 1:17 p.m. UTC | #9
On Mon, May 27, 2024 at 02:29:02PM +0200, Christian Brauner wrote:
> On Mon, May 27, 2024 at 04:47:56AM -0700, Christoph Hellwig wrote:
> > On Sun, May 26, 2024 at 12:01:08PM -0700, Aleksa Sarai wrote:
> > > The existing interface already provides a mount ID which is not even
> > > safe without rebooting.
> > 
> > And that seems to be a big part of the problem where the Linux by handle
> > syscall API deviated from all know precedence for no good reason.  NFS
> > file handles which were the start of this do (and have to) encode a
> > persistent file system identifier.  As do the xfs handles (although they
> > do the decoding in the userspace library on Linux for historic reasons),
> > as do the FreeBSD equivalents to these syscalls.
> > 
> > > An alternative would be to return something unique to the filesystem
> > > superblock, but as far as I can tell there is no guarantee that every
> > > Linux filesystem's fsid is sufficiently unique to act as a globally
> > > unique identifier. At least with a 64-bit mount ID and statmount(2),
> > > userspace can decide what information is needed to get sufficiently
> > > unique information about the source filesystem.
> > 
> > Well, every file system that supports export ops already needs a
> > globally unique ID for NFS to work properly.  We might not have good
> > enough interfaces for that, but that shouldn't be too hard.
> 
> I see not inherent problem with exposing the 64 bit mount id through
> name_to_handle_at() as we already to expose the old one anyway.
> 
> But I agree that it is useful if we had the guarantee that file handles
> are unique in the way you describe. As it currently stands that doesn't
> seem to be the case and userspace doesn't seem to have a way of figuring
> out if the handle provided by name_to_handle_at() is indeed unique as
> you describe and can be reliably passed to open_by_handle_at().
> 
> Yes, we should fix it but that's really orthogonal to the mount id. It
> is separately useful and we already do expose it anyway.

Put another way, name_to_handle_at(2) currently states:

   Obtaining a persistent filesystem ID
       The mount IDs in /proc/self/mountinfo can be reused as
       filesystems are unmounted and mounted.  Therefore, the mount ID
       returned by name_to_handle_at()  (in  *mount_id)  should  not  be
       treated  as  a persistent identifier for the corresponding
       mounted filesystem.  However, an application can use the
       information in the mountinfo record that corresponds to the mount
       ID to derive a persistent identifier.

       For example, one can use the device name in the fifth field of
       the mountinfo record to search for the corresponding device UUID
       via the symbolic links in /dev/disks/by-uuid.   (A  more
       comfortable  way  of obtaining the UUID is to use the libblkid(3)
       library.)  That process can then be reversed, using the UUID to
       look up the device name, and then obtaining the corre‐ sponding
       mount point, in order to produce the mount_fd argument used by
       open_by_handle_at().

Returning the 64bit mount id makes this race-free because we now have
statmount():

u64 mnt_id = 0;
name_to_handle_at(AT_FDCWD, "/path/to/file", &handle, &mnt_id, 0);
statmount(mnt_id);

Which gets you the device number which one can use to figure out the
uuid without ever having to open a single file (We could even expose the
UUID of the filesystem through statmount() if we wanted to.).
Jan Kara May 27, 2024, 1:34 p.m. UTC | #10
On Mon 27-05-24 04:47:56, Christoph Hellwig wrote:
> On Sun, May 26, 2024 at 12:01:08PM -0700, Aleksa Sarai wrote:
> > The existing interface already provides a mount ID which is not even
> > safe without rebooting.
> 
> And that seems to be a big part of the problem where the Linux by handle
> syscall API deviated from all know precedence for no good reason.  NFS
> file handles which were the start of this do (and have to) encode a
> persistent file system identifier.  As do the xfs handles (although they
> do the decoding in the userspace library on Linux for historic reasons),
> as do the FreeBSD equivalents to these syscalls.

So I was wondering how this is actually working in practice. Checking the
code, NFS server is (based on configuration in /etc/exports) either using
device number as the filesystem identifier or fsid / uuid as specified in
/etc/exports.

> > An alternative would be to return something unique to the filesystem
> > superblock, but as far as I can tell there is no guarantee that every
> > Linux filesystem's fsid is sufficiently unique to act as a globally
> > unique identifier. At least with a 64-bit mount ID and statmount(2),
> > userspace can decide what information is needed to get sufficiently
> > unique information about the source filesystem.
> 
> Well, every file system that supports export ops already needs a
> globally unique ID for NFS to work properly.  We might not have good
> enough interfaces for that, but that shouldn't be too hard.

So as my research above shows, this ID is either manually configured in
/etc/exports or NFS server uses device number which is not guaranteed to be
persistent. Filesystems, at least currently, have no obligation to provide
anything (and some of them indeed don't provide any uuid or persistent
fsid). I guess that's the reason why mount ID is reported with
name_to_handle_at().

Don't get me wrong, I agree with your reservations towards mount ID (per
mount instead of per-sb, non-persistent) and I agree the properties you
describe are the golden standard we should strive for but I mainly wanted
to point out this is not reality today and in particular providing the
"persistency" guarantee of the filesystem ID may require on disk format
changes for some filesystems.

So returning the 64-bit mount ID from name_to_handle_at() weasels out of
these "how to identify arbitrary superblock" problems by giving userspace a
reasonably reliable way to generate this superblock identifier itself. I'm
fully open to less errorprone API for this but at this point I don't see it
so changing the mount ID returned from name_to_handle_at() to 64-bit unique
one seems like a sane practical choice to me...

								Honza
Trond Myklebust May 27, 2024, 3:38 p.m. UTC | #11
On Mon, 2024-05-27 at 04:49 -0700, hch@infradead.org wrote:
> On Sun, May 26, 2024 at 10:32:39PM +0000, Trond Myklebust wrote:
> > I assume the reason is to give the caller a race free way to figure
> > out
> > which submount the path resolves to.
> 
> But the handle op are global to the file systems (aka super_block). 
> It
> does not matter what mount you use to access it.

Sure. However if you are providing a path argument, then presumably you
need to know which file system (aka super_block) it eventually resolves
to.

> 
> Snip random things about userland NFS servers I couldn't care less
> about..
> 

My point was that at least for that case, you are better off using a
file descriptor and not having to care about the mount id.

If your use case isn't NFS servers, then what use case are you
targeting, and how do you expect those applications to use this API?
Trond Myklebust May 27, 2024, 3:47 p.m. UTC | #12
On Mon, 2024-05-27 at 15:17 +0200, Christian Brauner wrote:
> 
> Returning the 64bit mount id makes this race-free because we now have
> statmount():
> 
> u64 mnt_id = 0;
> name_to_handle_at(AT_FDCWD, "/path/to/file", &handle, &mnt_id, 0);
> statmount(mnt_id);
> 
> Which gets you the device number which one can use to figure out the
> uuid without ever having to open a single file (We could even expose
> the
> UUID of the filesystem through statmount() if we wanted to.).
> 

It is not race free. statmount() depends on the filesystem still being
mounted somewhere in your namespace, which is not guaranteed above.
Christoph Hellwig May 27, 2024, 4:18 p.m. UTC | #13
On Mon, May 27, 2024 at 02:29:02PM +0200, Christian Brauner wrote:
> I see not inherent problem with exposing the 64 bit mount id through
> name_to_handle_at() as we already to expose the old one anyway.

That's one way to frame it.  The other is that exposing the different
mount ID also doesn't substantially fix the problem, so it's not
worth the churn.
Christoph Hellwig May 27, 2024, 4:24 p.m. UTC | #14
On Mon, May 27, 2024 at 03:34:30PM +0200, Jan Kara wrote:
> So I was wondering how this is actually working in practice. Checking the
> code, NFS server is (based on configuration in /etc/exports) either using
> device number as the filesystem identifier or fsid / uuid as specified in
> /etc/exports.

Yes, it's a rather suboptimal implementation.

> So returning the 64-bit mount ID from name_to_handle_at() weasels out of
> these "how to identify arbitrary superblock" problems by giving userspace a
> reasonably reliable way to generate this superblock identifier itself. I'm
> fully open to less errorprone API for this but at this point I don't see it
> so changing the mount ID returned from name_to_handle_at() to 64-bit unique
> one seems like a sane practical choice to me...

Well, how about we fix the thing for real:

 - allow file systems to provide a uniqueu identifier of at least
   uuid size (16 bytes) in the superblock or through an export operation
 - for non-persistent file systems allow to generate one at boot time
   using the normal uuid generation helpers
 - add a new flag to name_to_handle_at/open_by_handle_at to include it
   in the file handle, and thus make the file handle work more like
   the normal file handle
 - add code to nfsd to directly make use of this

This would solve all the problems in this proposal as well as all the
obvious ones it doesn't solve.
Christoph Hellwig May 27, 2024, 4:29 p.m. UTC | #15
On Mon, May 27, 2024 at 03:38:40PM +0000, Trond Myklebust wrote:
> > It
> > does not matter what mount you use to access it.
> 
> Sure. However if you are providing a path argument, then presumably you
> need to know which file system (aka super_block) it eventually resolves
> to.

Except that you can't, at least not without running into potential
races.  The only way to fix a race vs unmount/remount is to include
the fsid part in the kernel generated file handle.

> 
> If your use case isn't NFS servers, then what use case are you
> targeting, and how do you expect those applications to use this API?

The main user of the open by handle syscalls seems to be fanotify
magic.
Christian Brauner May 28, 2024, 7:05 a.m. UTC | #16
On Mon, May 27, 2024 at 03:47:33PM +0000, Trond Myklebust wrote:
> On Mon, 2024-05-27 at 15:17 +0200, Christian Brauner wrote:
> > 
> > Returning the 64bit mount id makes this race-free because we now have
> > statmount():
> > 
> > u64 mnt_id = 0;
> > name_to_handle_at(AT_FDCWD, "/path/to/file", &handle, &mnt_id, 0);
> > statmount(mnt_id);
> > 
> > Which gets you the device number which one can use to figure out the
> > uuid without ever having to open a single file (We could even expose
> > the
> > UUID of the filesystem through statmount() if we wanted to.).
> > 
> 
> It is not race free. statmount() depends on the filesystem still being
> mounted somewhere in your namespace, which is not guaranteed above.

The unsigned 64bit mount is not recyclable. It is a unique identifier
for a mount for the lifetime of the system. Even if bumped on every
cycle it will still take hundreds of years to overflow.
Christian Brauner May 28, 2024, 7:12 a.m. UTC | #17
On Mon, May 27, 2024 at 09:29:48AM -0700, hch@infradead.org wrote:
> On Mon, May 27, 2024 at 03:38:40PM +0000, Trond Myklebust wrote:
> > > It
> > > does not matter what mount you use to access it.
> > 
> > Sure. However if you are providing a path argument, then presumably you
> > need to know which file system (aka super_block) it eventually resolves
> > to.
> 
> Except that you can't, at least not without running into potential
> races.  The only way to fix a race vs unmount/remount is to include
> the fsid part in the kernel generated file handle.
> 
> > 
> > If your use case isn't NFS servers, then what use case are you
> > targeting, and how do you expect those applications to use this API?
> 
> The main user of the open by handle syscalls seems to be fanotify
> magic.

It's also used by userspace for uniquely identifying cgroups via handles
as cgroups and - even without open_by_handle_at() - to check whether a
file is still valid.

And again a 64bit mount is is a simple way to race-free go to whatever
superblock uuid you want. They cannot be recycled and are unique for the
lifetime of the system.
Christoph Hellwig May 28, 2024, 7:15 a.m. UTC | #18
On Tue, May 28, 2024 at 09:12:33AM +0200, Christian Brauner wrote:
> It's also used by userspace for uniquely identifying cgroups via handles
> as cgroups and - even without open_by_handle_at() - to check whether a
> file is still valid.
> 
> And again a 64bit mount is is a simple way to race-free go to whatever
> superblock uuid you want. They cannot be recycled and are unique for the
> lifetime of the system.

And then break when you reboot.  Which you might not care about for
cgroups, but which is really bad for the concept of a file handle.

See one of my other replies for a proposed interface that is just as
easy to use for userspace, a little more complex in the kernel but
safe for it.  I'd much prefer that over using ay kind of "mount ID"
which doesn't fit into the file handle concept at all.
Christian Brauner May 28, 2024, 8:20 a.m. UTC | #19
> Well, how about we fix the thing for real:
> 
>  - allow file systems to provide a uniqueu identifier of at least
>    uuid size (16 bytes) in the superblock or through an export operation
>  - for non-persistent file systems allow to generate one at boot time
>    using the normal uuid generation helpers
>  - add a new flag to name_to_handle_at/open_by_handle_at to include it
>    in the file handle, and thus make the file handle work more like
>    the normal file handle
>  - add code to nfsd to directly make use of this
> 
> This would solve all the problems in this proposal as well as all the
> obvious ones it doesn't solve.

As I've said multiple times on the thread I agree that this is what we
should do next and I would be happy to take patches for this. But
exposing the 64bit mount id doesn't impede or complicate that work. It's
a simple and useful addition that can be done now and doesn't prevent us
from doing the proposed work.

Hell, if you twist my arm I'll even write the patches for this.
Christoph Hellwig May 28, 2024, 8:28 a.m. UTC | #20
On Tue, May 28, 2024 at 10:20:21AM +0200, Christian Brauner wrote:
> > This would solve all the problems in this proposal as well as all the
> > obvious ones it doesn't solve.
> 
> As I've said multiple times on the thread I agree that this is what we
> should do next and I would be happy to take patches for this. But
> exposing the 64bit mount id doesn't impede or complicate that work. It's
> a simple and useful addition that can be done now and doesn't prevent us
> from doing the proposed work.

I really confuses the user story as we now have not only one but two
broken modes for the open by handle ops.  We just further go down the
rabbit hole of mixing a non-persistent identifiers with a persistent
one.

> Hell, if you twist my arm I'll even write the patches for this.

I'm also happy to help with that despite very limited time, but I'd
rather avoid doing the misguided mount ID thing.
Christian Brauner May 28, 2024, 9:17 a.m. UTC | #21
> > Hell, if you twist my arm I'll even write the patches for this.
> 
> I'm also happy to help with that despite very limited time, but I'd
> rather avoid doing the misguided mount ID thing.

As I've said earlier, independent of the new handle type returning the
new mount id is useful and needed because it allows the caller to
reliably generate a mount fd for use with open_by_handle_at() via
statmount(). That won't be solved by a new handle type and is racy with
the old mount id. So I intend to accept a version of this patch.
Jan Kara May 28, 2024, 10:11 a.m. UTC | #22
On Mon 27-05-24 09:29:48, hch@infradead.org wrote:
> On Mon, May 27, 2024 at 03:38:40PM +0000, Trond Myklebust wrote:
> > If your use case isn't NFS servers, then what use case are you
> > targeting, and how do you expect those applications to use this API?
> 
> The main user of the open by handle syscalls seems to be fanotify
> magic.

So some fanotify users may use open_by_handle_at() and name_to_handle_at()
but we specifically designed fanotify to not depend on this mount id
feature of the API (because it wasn't really usable couple of years ago
when we were designing this with Amir). fanotify returns fsid + fhandle in
its events and userspace is expected to build a mapping of fsid ->
"whatever it needs to identify a filesystem" when placing fanotify marks.
If it wants to open file / directory where events happened, then this
usually means keeping fsid -> "some open fd on fs" mapping so that it can
then use open_by_handle_at() for opening.

								Honza
Christoph Hellwig May 28, 2024, 10:55 a.m. UTC | #23
On Tue, May 28, 2024 at 11:17:58AM +0200, Christian Brauner wrote:
> As I've said earlier, independent of the new handle type returning the
> new mount id is useful and needed because it allows the caller to
> reliably generate a mount fd for use with open_by_handle_at() via
> statmount(). That won't be solved by a new handle type and is racy with
> the old mount id. So I intend to accept a version of this patch.

The whole point is that with the fsid in the handle we do not even need
a mount fd for open_by_handle_at.

If you insist on making this interface even more crappy than it is and
create confusion please record my explicit:

Nacked-by: Christoph Hellwig <hch@lst.de>

in the commit :(
Christoph Hellwig May 28, 2024, 10:56 a.m. UTC | #24
On Tue, May 28, 2024 at 12:11:52PM +0200, Jan Kara wrote:
> So some fanotify users may use open_by_handle_at() and name_to_handle_at()
> but we specifically designed fanotify to not depend on this mount id
> feature of the API (because it wasn't really usable couple of years ago
> when we were designing this with Amir). fanotify returns fsid + fhandle in
> its events and userspace is expected to build a mapping of fsid ->
> "whatever it needs to identify a filesystem" when placing fanotify marks.
> If it wants to open file / directory where events happened, then this
> usually means keeping fsid -> "some open fd on fs" mapping so that it can
> then use open_by_handle_at() for opening.

Which seems like another argument for my version of the handles to
include the fsid.  Although IIRC the fanotify fsid is only 64 bits which
isn't really good entropy, so we might have to rev that as well.
Christian Brauner May 28, 2024, 12:04 p.m. UTC | #25
On Tue, May 28, 2024 at 03:55:29AM -0700, Christoph Hellwig wrote:
> On Tue, May 28, 2024 at 11:17:58AM +0200, Christian Brauner wrote:
> > As I've said earlier, independent of the new handle type returning the
> > new mount id is useful and needed because it allows the caller to
> > reliably generate a mount fd for use with open_by_handle_at() via
> > statmount(). That won't be solved by a new handle type and is racy with
> > the old mount id. So I intend to accept a version of this patch.
> 
> The whole point is that with the fsid in the handle we do not even need
> a mount fd for open_by_handle_at.

Can you please explain how opening an fd based on a handle returned from
name_to_handle_at() and not using a mount file descriptor for
open_by_handle_at() would work?
Christoph Hellwig May 28, 2024, 1:22 p.m. UTC | #26
On Tue, May 28, 2024 at 02:04:16PM +0200, Christian Brauner wrote:
> Can you please explain how opening an fd based on a handle returned from
> name_to_handle_at() and not using a mount file descriptor for
> open_by_handle_at() would work?

Same as NFS file handles:

name_to_handle_at returns a handle that includes a file system
identifier.

open_by_handle_at looks up the superblock based on that identifier.

For the identifier I could imagin three choices:

 1) use the fsid as returned in statfs and returned by fsnotify.
    The downside is that it is "only" 64-bit.  The upside is that
    we have a lot of plumbing for it
 2) fixed 128-bit identifier to provide more entropy
 3) a variable length identifier, which is more similar to NFS,
    but also a lot more complicated

We'd need a global lookup structure to find the sb by id.  The simplest
one would be a simple linear loop over super_blocks which isn't terribly
efficient, but probably better than whatever userspace is doing to
find a mount fd right now.

Let me cook up a simple prototype for 1) as it shouldn't be more than
a few hundred lines of code.
Miklos Szeredi May 28, 2024, 1:28 p.m. UTC | #27
On Tue, 28 May 2024 at 15:24, Christoph Hellwig <hch@infradead.org> wrote:
>
> On Tue, May 28, 2024 at 02:04:16PM +0200, Christian Brauner wrote:
> > Can you please explain how opening an fd based on a handle returned from
> > name_to_handle_at() and not using a mount file descriptor for
> > open_by_handle_at() would work?
>
> Same as NFS file handles:
>
> name_to_handle_at returns a handle that includes a file system
> identifier.
>
> open_by_handle_at looks up the superblock based on that identifier.

The open file needs a specific mount, holding the superblock is not sufficient.

Thanks,
Miklos
Dave Chinner May 28, 2024, 11:25 p.m. UTC | #28
On Tue, May 28, 2024 at 03:56:43AM -0700, hch@infradead.org wrote:
> On Tue, May 28, 2024 at 12:11:52PM +0200, Jan Kara wrote:
> > So some fanotify users may use open_by_handle_at() and name_to_handle_at()
> > but we specifically designed fanotify to not depend on this mount id
> > feature of the API (because it wasn't really usable couple of years ago
> > when we were designing this with Amir). fanotify returns fsid + fhandle in
> > its events and userspace is expected to build a mapping of fsid ->
> > "whatever it needs to identify a filesystem" when placing fanotify marks.
> > If it wants to open file / directory where events happened, then this
> > usually means keeping fsid -> "some open fd on fs" mapping so that it can
> > then use open_by_handle_at() for opening.
> 
> Which seems like another argument for my version of the handles to
> include the fsid.  Although IIRC the fanotify fsid is only 64 bits which
> isn't really good entropy, so we might have to rev that as well.

I'm in agreement with Christoph that the filehandle needs to contain
the restricted scope information internally. I said that in response
to an earlier version of the patch here:

https://lore.kernel.org/linux-fsdevel/ZlPOd0p7AUn7JqLu@dread.disaster.area/

	"If filehandles are going to be restricted to a specific container
	(e.g. a single mount) so that less permissions are needed to open
	the filehandle, then the filehandle itself needs to encode those
	restrictions. Decoding the filehandle needs to ensure that the
	opening context has permission to access whatever the filehandle
	points to in a restricted environment. This will prevent existing
	persistent, global filehandles from being used as restricted
	filehandles and vice versa."

But no-one has bothered to reply or acknowledge my comments so I'll
point them out again and repeat: Filehandles generated by
the kernel for unprivileged use *must* be self describing and self
validating as the kernel must be able to detect and prevent
unprivelged users from generating custom filehandles that can be
used to access files outside the restricted scope of their
container.

-Dave.
Christoph Hellwig May 29, 2024, 6:24 a.m. UTC | #29
On Wed, May 29, 2024 at 09:25:49AM +1000, Dave Chinner wrote:
> But no-one has bothered to reply or acknowledge my comments so I'll
> point them out again and repeat: Filehandles generated by
> the kernel for unprivileged use *must* be self describing and self
> validating as the kernel must be able to detect and prevent
> unprivelged users from generating custom filehandles that can be
> used to access files outside the restricted scope of their
> container.

We must not generate file handle for unprivileged use at all, as they
bypass all the path based access controls.
Christoph Hellwig May 29, 2024, 6:34 a.m. UTC | #30
On Tue, May 28, 2024 at 03:28:18PM +0200, Miklos Szeredi wrote:
> > open_by_handle_at looks up the superblock based on that identifier.
> 
> The open file needs a specific mount, holding the superblock is not
> sufficient.

A strut file needs a vfsmount, yes.  And it better be reachable by
the calling process.  And maybe an optional restriction to a specific
mount by the caller might be useful, but I can't see how it is
required.
Amir Goldstein May 29, 2024, 7:23 a.m. UTC | #31
On Wed, May 29, 2024 at 9:24 AM hch@infradead.org <hch@infradead.org> wrote:
>
> On Wed, May 29, 2024 at 09:25:49AM +1000, Dave Chinner wrote:
> > But no-one has bothered to reply or acknowledge my comments so I'll
> > point them out again and repeat: Filehandles generated by
> > the kernel for unprivileged use *must* be self describing and self
> > validating

Very nice requirement for a very strong feature,
which is far out of scope for the proposed patch IMO.

What is "generated by the kernel for unprivileged use"?
name_to_handle_at() is unprivileged and for example, fanotify unprivileged
users can use it to compare a marked (i.e. watched) object with an
object that is the subject of an event.
This does not break any security model.

> > as the kernel must be able to detect and prevent
> > unprivelged users from generating custom filehandles that can be
> > used to access files outside the restricted scope of their
> > container.

Emphasis on "that can be used to access files".
The patch in this thread to get a unique 64bit mntid does not make any
difference wrt to the concern above, so I am having a hard time
understand what the fuss is about.

>
> We must not generate file handle for unprivileged use at all, as they
> bypass all the path based access controls.
>

Again, how is "generate file handle for unprivileged use" related to
the patch in question.

The simple solution to the race that Aleksa is trying to prevent,
as was mentioned several times in this thread, is to allow
name_to_handle_at() on an empty path, e.g.:
fd = openat(.., O_PATH); fstatat(fd,..); name_to_handle_at(fd,..);

Aleksa prefers the unique mntid solution to save 2 syscalls.
Would any of the objections here been called for letting
name_to_handle_at() take an empty path?

I would like to offer a different solution (also may have been
mentioned before).

I always disliked the fact that mount_id and mount_fd arguments of
name_to_handle_at()/open_by_handle_at() are not symmetric, when
at first glance they appear to be symmetric as the handle argument.

What if we can make them symmetric and save the openat() syscall?

int path_fd;
int ret = name_to_handle_at(dirfd, path, handle, &path_fd,
                                              AT_HADNLE_PATH_FD);

and then kernel returns an O_PATH fd to the path object
in addition to the file handle (requires same privilege as
encoding the fh).

Userspace can use path_fd to query unique mntid, fsid, uuid
or whatever it needs to know in order to make sure that a
later call in the future to open_by_handle_at() will use
a mount_fd from the same filesystem/mount whatever,
whether in the same boot or after reboot.

If we are doing this, it also makes sense to allow mount_fd argument
to open_by_handle_at() to accept O_PATH fd, but that makes
sense anyway, because mount_fd is essentially a reference to
a path.

Thanks,
Amir.
Christian Brauner May 29, 2024, 7:40 a.m. UTC | #32
On Tue, May 28, 2024 at 06:22:23AM -0700, Christoph Hellwig wrote:
> On Tue, May 28, 2024 at 02:04:16PM +0200, Christian Brauner wrote:
> > Can you please explain how opening an fd based on a handle returned from
> > name_to_handle_at() and not using a mount file descriptor for
> > open_by_handle_at() would work?
> 
> Same as NFS file handles:
> 
> name_to_handle_at returns a handle that includes a file system
> identifier.
> 
> open_by_handle_at looks up the superblock based on that identifier.
> 
> For the identifier I could imagin three choices:
> 
>  1) use the fsid as returned in statfs and returned by fsnotify.
>     The downside is that it is "only" 64-bit.  The upside is that
>     we have a lot of plumbing for it
>  2) fixed 128-bit identifier to provide more entropy
>  3) a variable length identifier, which is more similar to NFS,
>     but also a lot more complicated
> 
> We'd need a global lookup structure to find the sb by id.  The simplest
> one would be a simple linear loop over super_blocks which isn't terribly
> efficient, but probably better than whatever userspace is doing to
> find a mount fd right now.
> 
> Let me cook up a simple prototype for 1) as it shouldn't be more than
> a few hundred lines of code.

Yeah, that's exactly what I figured and no that's not something we
should do.

Not just can have a really large number of superblocks if you have mount
namespaces and large container workloads that interface also needs to be
highly privileged.

Plus, you do have filesystems like btrfs that can be mounted multiple
times with the same uuid.

And in general users will still need to be able to legitimately use a
mount fd and not care about the handle type used with it.
Christoph Hellwig May 31, 2024, 8:14 a.m. UTC | #33
On Wed, May 29, 2024 at 09:40:01AM +0200, Christian Brauner wrote:
> Yeah, that's exactly what I figured and no that's not something we
> should do.
> 
> Not just can have a really large number of superblocks if you have mount
> namespaces and large container workloads that interface also needs to be
> highly privileged.

Again, that would be the most trivial POC.  We can easily do hash.

> Plus, you do have filesystems like btrfs that can be mounted multiple
> times with the same uuid.

Which doesn't matter.  Just like for NFS file handles the fs identifier
identifier plus the file part of the file handle need to be unique.

> And in general users will still need to be able to legitimately use a
> mount fd and not care about the handle type used with it.

I don't understand what you mean.  If we hand out file handles with
fsid that of course needs to be keyed off a new flag for both
name_to_handle and open_by_hnalde that makes them not interchangable
to handles generated without that flag.
Christian Brauner May 31, 2024, 10:28 a.m. UTC | #34
> I don't understand what you mean.  If we hand out file handles with

I don't think adding another highly privileged interface in opposition
to exposing the mount id is warranted. The new mount id exposure patch
is a performance improvement for existing use-cases to let userspace
avoid additional system calls. Because as Aleksa showed in the commit
message there's ways to make this race-free right now. The patch hasn't
gained any Acks from Amir or Jeff so it's obviously not going to get
merged. So really, I don't want another interface that goes from
arbitrary superblock to file hidden behind yet another global capability
check.
Aleksa Sarai June 1, 2024, 8:12 a.m. UTC | #35
On 2024-05-28, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Tue, 28 May 2024 at 15:24, Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Tue, May 28, 2024 at 02:04:16PM +0200, Christian Brauner wrote:
> > > Can you please explain how opening an fd based on a handle returned from
> > > name_to_handle_at() and not using a mount file descriptor for
> > > open_by_handle_at() would work?
> >
> > Same as NFS file handles:
> >
> > name_to_handle_at returns a handle that includes a file system
> > identifier.
> >
> > open_by_handle_at looks up the superblock based on that identifier.
> 
> The open file needs a specific mount, holding the superblock is not sufficient.

Not to mention that providing a mount fd is what allows for extensions
like Christian's proposed method of allowing restricted forms of
open_by_handle_at() to be used by unprivileged users.

If file handles really are going to end up being the "correct" mechanism
of referencing inodes by userspace, then future API designs really need
to stop assuming that the user is capable(CAP_DAC_READ_SEARCH). Being
able to open any file in any superblock the kernel knows about
(presumably using a kernel-internal mount if we are getting rid of the
mount fd) is also capable(CAP_SYS_ADMIN) territory.

Would the idea be to sign or MAC every file handle to avoid userspace
being able to brute-force the file handle of anything the system sees?
What happens if the key has to change? Then the handles aren't globally
unique anymore...
Jan Kara June 3, 2024, 10:30 a.m. UTC | #36
On Sat 01-06-24 01:12:31, Aleksa Sarai wrote:
> On 2024-05-28, Miklos Szeredi <miklos@szeredi.hu> wrote:
> > On Tue, 28 May 2024 at 15:24, Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > On Tue, May 28, 2024 at 02:04:16PM +0200, Christian Brauner wrote:
> > > > Can you please explain how opening an fd based on a handle returned from
> > > > name_to_handle_at() and not using a mount file descriptor for
> > > > open_by_handle_at() would work?
> > >
> > > Same as NFS file handles:
> > >
> > > name_to_handle_at returns a handle that includes a file system
> > > identifier.
> > >
> > > open_by_handle_at looks up the superblock based on that identifier.
> > 
> > The open file needs a specific mount, holding the superblock is not sufficient.
> 
> Not to mention that providing a mount fd is what allows for extensions
> like Christian's proposed method of allowing restricted forms of
> open_by_handle_at() to be used by unprivileged users.
> 
> If file handles really are going to end up being the "correct" mechanism
> of referencing inodes by userspace, then future API designs really need
> to stop assuming that the user is capable(CAP_DAC_READ_SEARCH). Being
> able to open any file in any superblock the kernel knows about
> (presumably using a kernel-internal mount if we are getting rid of the
> mount fd) is also capable(CAP_SYS_ADMIN) territory.

Well, but this is already handled - name_to_handle_at() with AT_HANDLE_FID
is completely unpriviledged operation. Unpriviledged userspace can use
fhandle for comparisons with other file handles but that's all it is good
for (similarly as inode number you get from statx(2) but does not have the
problem with inode number uniqueness on btrfs, bcachefs, etc.). I don't
expect unpriviledged userspace to be able to more with the fhandle it got.

								Honza
Christoph Hellwig June 4, 2024, 5:22 a.m. UTC | #37
On Sat, Jun 01, 2024 at 01:12:31AM -0700, Aleksa Sarai wrote:
> Not to mention that providing a mount fd is what allows for extensions
> like Christian's proposed method of allowing restricted forms of
> open_by_handle_at() to be used by unprivileged users.

As mentioned there I find the concept of an unprivileged
open_by_handle_at extremely questionable as it trivially gives access to
any inode on the file systems.

> If file handles really are going to end up being the "correct" mechanism
> of referencing inodes by userspace,

They aren't.

> then future API designs really need
> to stop assuming that the user is capable(CAP_DAC_READ_SEARCH).

There is no way to support open by handle for unprivileged users.  The
concept of an inode number based file handle simply does not work for
that at all.
diff mbox series

Patch

diff --git a/fs/fhandle.c b/fs/fhandle.c
index 8a7f86c2139a..c6e90161b900 100644
--- a/fs/fhandle.c
+++ b/fs/fhandle.c
@@ -16,7 +16,8 @@ 
 
 static long do_sys_name_to_handle(const struct path *path,
 				  struct file_handle __user *ufh,
-				  int __user *mnt_id, int fh_flags)
+				  void __user *mnt_id, bool unique_mntid,
+				  int fh_flags)
 {
 	long retval;
 	struct file_handle f_handle;
@@ -69,9 +70,19 @@  static long do_sys_name_to_handle(const struct path *path,
 	} else
 		retval = 0;
 	/* copy the mount id */
-	if (put_user(real_mount(path->mnt)->mnt_id, mnt_id) ||
-	    copy_to_user(ufh, handle,
-			 struct_size(handle, f_handle, handle_bytes)))
+	if (unique_mntid) {
+		if (put_user(real_mount(path->mnt)->mnt_id_unique,
+			     (u64 __user *) mnt_id))
+			retval = -EFAULT;
+	} else {
+		if (put_user(real_mount(path->mnt)->mnt_id,
+			     (int __user *) mnt_id))
+			retval = -EFAULT;
+	}
+	/* copy the handle */
+	if (retval != -EFAULT &&
+		copy_to_user(ufh, handle,
+			     struct_size(handle, f_handle, handle_bytes)))
 		retval = -EFAULT;
 	kfree(handle);
 	return retval;
@@ -83,6 +94,7 @@  static long do_sys_name_to_handle(const struct path *path,
  * @name: name that should be converted to handle.
  * @handle: resulting file handle
  * @mnt_id: mount id of the file system containing the file
+ *          (u64 if AT_HANDLE_MNT_ID_UNIQUE, otherwise int)
  * @flag: flag value to indicate whether to follow symlink or not
  *        and whether a decodable file handle is required.
  *
@@ -92,7 +104,7 @@  static long do_sys_name_to_handle(const struct path *path,
  * value required.
  */
 SYSCALL_DEFINE5(name_to_handle_at, int, dfd, const char __user *, name,
-		struct file_handle __user *, handle, int __user *, mnt_id,
+		struct file_handle __user *, handle, void __user *, mnt_id,
 		int, flag)
 {
 	struct path path;
@@ -100,7 +112,8 @@  SYSCALL_DEFINE5(name_to_handle_at, int, dfd, const char __user *, name,
 	int fh_flags;
 	int err;
 
-	if (flag & ~(AT_SYMLINK_FOLLOW | AT_EMPTY_PATH | AT_HANDLE_FID))
+	if (flag & ~(AT_SYMLINK_FOLLOW | AT_EMPTY_PATH | AT_HANDLE_FID |
+		     AT_HANDLE_MNT_ID_UNIQUE))
 		return -EINVAL;
 
 	lookup_flags = (flag & AT_SYMLINK_FOLLOW) ? LOOKUP_FOLLOW : 0;
@@ -109,7 +122,9 @@  SYSCALL_DEFINE5(name_to_handle_at, int, dfd, const char __user *, name,
 		lookup_flags |= LOOKUP_EMPTY;
 	err = user_path_at(dfd, name, lookup_flags, &path);
 	if (!err) {
-		err = do_sys_name_to_handle(&path, handle, mnt_id, fh_flags);
+		err = do_sys_name_to_handle(&path, handle, mnt_id,
+					    flag & AT_HANDLE_MNT_ID_UNIQUE,
+					    fh_flags);
 		path_put(&path);
 	}
 	return err;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e619ac10cd23..99c5a783a01e 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -863,7 +863,7 @@  asmlinkage long sys_fanotify_mark(int fanotify_fd, unsigned int flags,
 				  const char  __user *pathname);
 asmlinkage long sys_name_to_handle_at(int dfd, const char __user *name,
 				      struct file_handle __user *handle,
-				      int __user *mnt_id, int flag);
+				      void __user *mnt_id, int flag);
 asmlinkage long sys_open_by_handle_at(int mountdirfd,
 				      struct file_handle __user *handle,
 				      int flags);
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index c0bcc185fa48..9ed9d65842c1 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -87,22 +87,24 @@ 
 #define DN_ATTRIB	0x00000020	/* File changed attibutes */
 #define DN_MULTISHOT	0x80000000	/* Don't remove notifier */
 
+#define AT_FDCWD		-100    /* Special value used to indicate
+                                           openat should use the current
+                                           working directory. */
+#define AT_SYMLINK_NOFOLLOW	0x100   /* Do not follow symbolic links.  */
+
 /*
- * The constants AT_REMOVEDIR and AT_EACCESS have the same value.  AT_EACCESS is
- * meaningful only to faccessat, while AT_REMOVEDIR is meaningful only to
+ * The constants AT_REMOVEDIR and AT_EACCESS have the same value.  AT_EACCESS
+ * is meaningful only to faccessat, while AT_REMOVEDIR is meaningful only to
  * unlinkat.  The two functions do completely different things and therefore,
  * the flags can be allowed to overlap.  For example, passing AT_REMOVEDIR to
  * faccessat would be undefined behavior and thus treating it equivalent to
  * AT_EACCESS is valid undefined behavior.
  */
-#define AT_FDCWD		-100    /* Special value used to indicate
-                                           openat should use the current
-                                           working directory. */
-#define AT_SYMLINK_NOFOLLOW	0x100   /* Do not follow symbolic links.  */
 #define AT_EACCESS		0x200	/* Test access permitted for
                                            effective IDs, not real IDs.  */
 #define AT_REMOVEDIR		0x200   /* Remove directory instead of
                                            unlinking file.  */
+
 #define AT_SYMLINK_FOLLOW	0x400   /* Follow symbolic links.  */
 #define AT_NO_AUTOMOUNT		0x800	/* Suppress terminal automount traversal */
 #define AT_EMPTY_PATH		0x1000	/* Allow empty relative pathname */
@@ -114,10 +116,22 @@ 
 
 #define AT_RECURSIVE		0x8000	/* Apply to the entire subtree */
 
-/* Flags for name_to_handle_at(2). We reuse AT_ flag space to save bits... */
+/*
+ * All new purely-syscall-specific AT_* flags should consider using bits from
+ * 0xFF, but the bits used by RENAME_* (0x7) should be avoided in case users
+ * decide to pass AT_* flags to renameat2() by accident. These flag bits are
+ * free for re-use by other syscall's syscall-specific flags without worry.
+ */
+
+/*
+ * Flags for name_to_handle_at(2). To save AT_ flag space we re-use the
+ * AT_EACCESS/AT_REMOVEDIR bit for AT_HANDLE_FID.
+ */
 #define AT_HANDLE_FID		AT_REMOVEDIR	/* file handle is needed to
 					compare object identity and may not
 					be usable to open_by_handle_at(2) */
+#define AT_HANDLE_MNT_ID_UNIQUE	0x80	/* return the u64 unique mount id */
+
 #if defined(__KERNEL__)
 #define AT_GETATTR_NOSEC	0x80000000
 #endif
diff --git a/tools/include/uapi/linux/fcntl.h b/tools/include/uapi/linux/fcntl.h
index 282e90aeb163..f671312ca481 100644
--- a/tools/include/uapi/linux/fcntl.h
+++ b/tools/include/uapi/linux/fcntl.h
@@ -85,22 +85,24 @@ 
 #define DN_ATTRIB	0x00000020	/* File changed attibutes */
 #define DN_MULTISHOT	0x80000000	/* Don't remove notifier */
 
+#define AT_FDCWD		-100    /* Special value used to indicate
+                                           openat should use the current
+                                           working directory. */
+#define AT_SYMLINK_NOFOLLOW	0x100   /* Do not follow symbolic links.  */
+
 /*
- * The constants AT_REMOVEDIR and AT_EACCESS have the same value.  AT_EACCESS is
- * meaningful only to faccessat, while AT_REMOVEDIR is meaningful only to
+ * The constants AT_REMOVEDIR and AT_EACCESS have the same value.  AT_EACCESS
+ * is meaningful only to faccessat, while AT_REMOVEDIR is meaningful only to
  * unlinkat.  The two functions do completely different things and therefore,
  * the flags can be allowed to overlap.  For example, passing AT_REMOVEDIR to
  * faccessat would be undefined behavior and thus treating it equivalent to
  * AT_EACCESS is valid undefined behavior.
  */
-#define AT_FDCWD		-100    /* Special value used to indicate
-                                           openat should use the current
-                                           working directory. */
-#define AT_SYMLINK_NOFOLLOW	0x100   /* Do not follow symbolic links.  */
 #define AT_EACCESS		0x200	/* Test access permitted for
                                            effective IDs, not real IDs.  */
 #define AT_REMOVEDIR		0x200   /* Remove directory instead of
                                            unlinking file.  */
+
 #define AT_SYMLINK_FOLLOW	0x400   /* Follow symbolic links.  */
 #define AT_NO_AUTOMOUNT		0x800	/* Suppress terminal automount traversal */
 #define AT_EMPTY_PATH		0x1000	/* Allow empty relative pathname */
@@ -112,10 +114,22 @@ 
 
 #define AT_RECURSIVE		0x8000	/* Apply to the entire subtree */
 
-/* Flags for name_to_handle_at(2). We reuse AT_ flag space to save bits... */
+/*
+ * All new purely-syscall-specific AT_* flags should consider using bits from
+ * 0xFF, but the bits used by RENAME_* (0x7) should be avoided in case users
+ * decide to pass AT_* flags to renameat2() by accident. These flag bits are
+ * free for re-use by other syscall's syscall-specific flags without worry.
+ */
+
+/*
+ * Flags for name_to_handle_at(2). To save AT_ flag space we re-use the
+ * AT_EACCESS/AT_REMOVEDIR bit for AT_HANDLE_FID.
+ */
 #define AT_HANDLE_FID		AT_REMOVEDIR	/* file handle is needed to
 					compare object identity and may not
 					be usable to open_by_handle_at(2) */
+#define AT_HANDLE_MNT_ID_UNIQUE	0x80 /* returned mount id is the u64 unique mount id */
+
 #if defined(__KERNEL__)
 #define AT_GETATTR_NOSEC	0x80000000
 #endif