Message ID | 20240523-exportfs-u64-mount-id-v2-1-f9f959f17eb1@cyphar.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [RFC,v2] fhandle: expose u64 mount id to name_to_handle_at(2) | expand |
On Thu, May 23, 2024 at 11:57 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > > Now that we provide a unique 64-bit mount ID interface in statx, we can > now provide a race-free way for name_to_handle_at(2) to provide a file > handle and corresponding mount without needing to worry about racing > with /proc/mountinfo parsing. > > While this is not necessary if you are using AT_EMPTY_PATH and don't > care about an extra statx(2) call, users that pass full paths into > name_to_handle_at(2) need to know which mount the file handle comes from > (to make sure they don't try to open_by_handle_at a file handle from a > different filesystem) and switching to AT_EMPTY_PATH would require > allocating a file for every name_to_handle_at(2) call, turning > > err = name_to_handle_at(-EBADF, "/foo/bar/baz", &handle, &mntid, > AT_HANDLE_MNT_ID_UNIQUE); > > into > > int fd = openat(-EBADF, "/foo/bar/baz", O_PATH | O_CLOEXEC); > err1 = name_to_handle_at(fd, "", &handle, &unused_mntid, AT_EMPTY_PATH); > err2 = statx(fd, "", AT_EMPTY_PATH, STATX_MNT_ID_UNIQUE, &statxbuf); > mntid = statxbuf.stx_mnt_id; > close(fd); > > Unlike AT_HANDLE_FID, as per Amir's suggestion, AT_HANDLE_MNT_ID_UNIQUE > uses a new AT_* bit from the historically-unused 0xFF range (which we > now define as being the "per-syscall" range for AT_* bits). > Sorry for nit picking, but I think that "Unlike AT_HANDLE_FID,..." is confusing in this statement. AT_HANDLE_FID is using a bit that was already effectively allocated for a "per-syscall" range. I don't think that mentioning AT_HANDLE_FID adds any clarity to the statement so better drop it? > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> > --- > Changes in v2: > - Fixed a few minor compiler warnings and a buggy copy_to_user() check. > - Rename AT_HANDLE_UNIQUE_MOUNT_ID -> AT_HANDLE_MNT_ID_UNIQUE to match statx. > - Switched to using an AT_* bit from 0xFF and defining that range as > being "per-syscall" for future usage. > - Sync tools/ copy of <linux/fcntl.h> to include changes. > - v1: <https://lore.kernel.org/r/20240520-exportfs-u64-mount-id-v1-1-f55fd9215b8e@cyphar.com> > --- > fs/fhandle.c | 29 ++++++++++++++++++++++------- > include/linux/syscalls.h | 2 +- > include/uapi/linux/fcntl.h | 28 +++++++++++++++++++++------- > tools/include/uapi/linux/fcntl.h | 28 +++++++++++++++++++++------- > 4 files changed, 65 insertions(+), 22 deletions(-) > [...] > diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h > index c0bcc185fa48..9ed9d65842c1 100644 > --- a/include/uapi/linux/fcntl.h > +++ b/include/uapi/linux/fcntl.h > @@ -87,22 +87,24 @@ > #define DN_ATTRIB 0x00000020 /* File changed attibutes */ > #define DN_MULTISHOT 0x80000000 /* Don't remove notifier */ > > +#define AT_FDCWD -100 /* Special value used to indicate > + openat should use the current > + working directory. */ (more nit picking) If you are changing this line, please at least add a new line, this is a different namespace :-/ and perhaps change it to "Special value of dirfd argument..." Also, better leave a comment here to discourage allocation from this range: + /* Reserved for per-syscall flags 0xff */ > +#define AT_SYMLINK_NOFOLLOW 0x100 /* Do not follow symbolic links. */ > + > /* > - * The constants AT_REMOVEDIR and AT_EACCESS have the same value. AT_EACCESS is > - * meaningful only to faccessat, while AT_REMOVEDIR is meaningful only to > + * The constants AT_REMOVEDIR and AT_EACCESS have the same value. AT_EACCESS > + * is meaningful only to faccessat, while AT_REMOVEDIR is meaningful only to > * unlinkat. The two functions do completely different things and therefore, > * the flags can be allowed to overlap. For example, passing AT_REMOVEDIR to > * faccessat would be undefined behavior and thus treating it equivalent to > * AT_EACCESS is valid undefined behavior. > */ If you are going to add this churn in this patch, please do it otherwise. It does not make sense to have this long explanation about pre-syscall AT_* flags in a different location from the comment you added about "All new purely-syscall-specific AT_* flags.." if this explanation is needed at all, it should be after the new comment as an example. > -#define AT_FDCWD -100 /* Special value used to indicate > - openat should use the current > - working directory. */ > -#define AT_SYMLINK_NOFOLLOW 0x100 /* Do not follow symbolic links. */ > #define AT_EACCESS 0x200 /* Test access permitted for > effective IDs, not real IDs. */ > #define AT_REMOVEDIR 0x200 /* Remove directory instead of > unlinking file. */ I really prefer to move those to the per-syscall section right next to AT_HANDLE_FID and leave a comment here: /* Reserved for per-syscall flags 0x200 */ > + > #define AT_SYMLINK_FOLLOW 0x400 /* Follow symbolic links. */ > #define AT_NO_AUTOMOUNT 0x800 /* Suppress terminal automount traversal */ > #define AT_EMPTY_PATH 0x1000 /* Allow empty relative pathname */ > @@ -114,10 +116,22 @@ > > #define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */ > > -/* Flags for name_to_handle_at(2). We reuse AT_ flag space to save bits... */ > +/* > + * All new purely-syscall-specific AT_* flags should consider using bits from > + * 0xFF, but the bits used by RENAME_* (0x7) should be avoided in case users > + * decide to pass AT_* flags to renameat2() by accident. Sorry, but I find the use of my renameat2() example a bit confusing in this sentence. If you mention it at all, please use "For example, the bits used by..." but I think it is more important to say "...should consider re-using bits already used by other per-syscalls flags". > These flag bits are > + * free for re-use by other syscall's syscall-specific flags without worry. > + */ > + > +/* > + * Flags for name_to_handle_at(2). To save AT_ flag space we re-use the > + * AT_EACCESS/AT_REMOVEDIR bit for AT_HANDLE_FID. > + */ AT_EACCESS/AT_REMOVEDIR/AT_HANDLE_FID have exact same status, so instead of this asymmetric comment: +/* Flags for faccessat(2) */ +#define AT_EACCESS 0x200 /* Test access permitted for + effective IDs, not real IDs. */ +/* Flags for unlinkat(2) */ +#define AT_REMOVEDIR 0x200 /* Remove directory instead of + unlinking file. */ +/* Flags for name_to_handle_at(2) */ +#define AT_HANDLE_FID 0x200 /* file handle is needed to compare object identity and may not be usable to open_by_handle_at(2) */ > +#define AT_HANDLE_MNT_ID_UNIQUE 0x80 /* return the u64 unique mount id */ IDGI, I think we may have been miscommunicating :-/ If 0x7 range is to be avoided for generic AT_ flags, then it *should* be used for new per-syscall flags such as this one. The reservation of 0xff is not a strong guarantee. As long as people re-use new per-syscalls flags efficiently, we could decide to reclaim some of this space for generic AT_ flags in the future if it is needed. I know most of the mess was here before your patch, but I think it got to a point where we must put a little order before introducing the new per-syscall flag. Thanks, Amir.
On Thu, May 23, 2024 at 01:57:32PM -0700, Aleksa Sarai wrote: > Now that we provide a unique 64-bit mount ID interface in statx, we can > now provide a race-free way for name_to_handle_at(2) to provide a file > handle and corresponding mount without needing to worry about racing > with /proc/mountinfo parsing. file handles are not tied to mounts, they are tied to super_blocks, and they can survive reboots or (less relevant) remounts. This thus seems like a very confusing if not wrong interfaces.
On 2024-05-26, Christoph Hellwig <hch@infradead.org> wrote: > On Thu, May 23, 2024 at 01:57:32PM -0700, Aleksa Sarai wrote: > > Now that we provide a unique 64-bit mount ID interface in statx, we can > > now provide a race-free way for name_to_handle_at(2) to provide a file > > handle and corresponding mount without needing to worry about racing > > with /proc/mountinfo parsing. > > file handles are not tied to mounts, they are tied to super_blocks, > and they can survive reboots or (less relevant) remounts. This thus > seems like a very confusing if not wrong interfaces. The existing interface already provides a mount ID which is not even safe without rebooting. The purpose of the mount ID (in the existing API) is to allow userspace to make sure they know which filesystem mount the file handle came from so they can store the relevant information (fsid, etc) to make sure they know which superblock the mount came from when it comes to doing open_by_handle_at(2). However, there is a race between getting the mount ID from name_to_handle_at() and parsing /proc/self/mountinfo, which means that userspace cannot ever be sure that they know the correct mount the file handle came from (the man page for open_by_handle_at mentions this issue explicitly). Getting the ability to get a 64-bit mount ID would allow userspace to use statmount(2) directly, avoiding this race. You're quite right that obviously you wouldn't want to just store the mount ID. An alternative would be to return something unique to the filesystem superblock, but as far as I can tell there is no guarantee that every Linux filesystem's fsid is sufficiently unique to act as a globally unique identifier. At least with a 64-bit mount ID and statmount(2), userspace can decide what information is needed to get sufficiently unique information about the source filesystem.
On Sun, 2024-05-26 at 02:25 -0700, Christoph Hellwig wrote: > On Thu, May 23, 2024 at 01:57:32PM -0700, Aleksa Sarai wrote: > > Now that we provide a unique 64-bit mount ID interface in statx, we > > can > > now provide a race-free way for name_to_handle_at(2) to provide a > > file > > handle and corresponding mount without needing to worry about > > racing > > with /proc/mountinfo parsing. > > file handles are not tied to mounts, they are tied to super_blocks, > and they can survive reboots or (less relevant) remounts. This thus > seems like a very confusing if not wrong interfaces. I assume the reason is to give the caller a race free way to figure out which submount the path resolves to. The problem is that nothing stops another process from calling umount() before you're done parsing /proc/mountinfo and have resolved the mount id. If we're looking to change the API, then perhaps returning a file descriptor might be a better alternative? Most userland NFS servers are in any case going to follow up obtaining the filehandle with a stat() or even a full blown open() in order to get file attributes, set up file state, etc. By returning an open file descriptor to the resolved file (even if it is only an O_PATH descriptor) we could accelerate those operations in addition to solving the umount() race. Alternatively, just remove the path argument altogether, and require the descriptor argument to be an O_PATH or regular open file descriptor that resolves to the file we want to get a filehandle for. However this would require a userland NFS server to generally do a open_by_handle_at() to resolve the parent directory handle, then do an openat(O_PATH) to get the file to look up, before being able to call the name_to_handle_at() replacement. i.e. there would be 1 extra syscall.
On Sun, May 26, 2024 at 12:01:08PM -0700, Aleksa Sarai wrote: > The existing interface already provides a mount ID which is not even > safe without rebooting. And that seems to be a big part of the problem where the Linux by handle syscall API deviated from all know precedence for no good reason. NFS file handles which were the start of this do (and have to) encode a persistent file system identifier. As do the xfs handles (although they do the decoding in the userspace library on Linux for historic reasons), as do the FreeBSD equivalents to these syscalls. > An alternative would be to return something unique to the filesystem > superblock, but as far as I can tell there is no guarantee that every > Linux filesystem's fsid is sufficiently unique to act as a globally > unique identifier. At least with a 64-bit mount ID and statmount(2), > userspace can decide what information is needed to get sufficiently > unique information about the source filesystem. Well, every file system that supports export ops already needs a globally unique ID for NFS to work properly. We might not have good enough interfaces for that, but that shouldn't be too hard.
On Sun, May 26, 2024 at 10:32:39PM +0000, Trond Myklebust wrote: > I assume the reason is to give the caller a race free way to figure out > which submount the path resolves to. But the handle op are global to the file systems (aka super_block). It does not matter what mount you use to access it. Snip random things about userland NFS servers I couldn't care less about..
On Sun, May 26, 2024 at 02:25:34AM -0700, Christoph Hellwig wrote: > On Thu, May 23, 2024 at 01:57:32PM -0700, Aleksa Sarai wrote: > > Now that we provide a unique 64-bit mount ID interface in statx, we can > > now provide a race-free way for name_to_handle_at(2) to provide a file > > handle and corresponding mount without needing to worry about racing > > with /proc/mountinfo parsing. > > file handles are not tied to mounts, they are tied to super_blocks, > and they can survive reboots or (less relevant) remounts. This thus > seems like a very confusing if not wrong interfaces. I'm not fond of the interface as I've said on an earlier version, but name_to_handle_at() has always exposed the mount id. IOW, this isn't adding a new concept to the system call.
On Mon, May 27, 2024 at 04:47:56AM -0700, Christoph Hellwig wrote: > On Sun, May 26, 2024 at 12:01:08PM -0700, Aleksa Sarai wrote: > > The existing interface already provides a mount ID which is not even > > safe without rebooting. > > And that seems to be a big part of the problem where the Linux by handle > syscall API deviated from all know precedence for no good reason. NFS > file handles which were the start of this do (and have to) encode a > persistent file system identifier. As do the xfs handles (although they > do the decoding in the userspace library on Linux for historic reasons), > as do the FreeBSD equivalents to these syscalls. > > > An alternative would be to return something unique to the filesystem > > superblock, but as far as I can tell there is no guarantee that every > > Linux filesystem's fsid is sufficiently unique to act as a globally > > unique identifier. At least with a 64-bit mount ID and statmount(2), > > userspace can decide what information is needed to get sufficiently > > unique information about the source filesystem. > > Well, every file system that supports export ops already needs a > globally unique ID for NFS to work properly. We might not have good > enough interfaces for that, but that shouldn't be too hard. I see not inherent problem with exposing the 64 bit mount id through name_to_handle_at() as we already to expose the old one anyway. But I agree that it is useful if we had the guarantee that file handles are unique in the way you describe. As it currently stands that doesn't seem to be the case and userspace doesn't seem to have a way of figuring out if the handle provided by name_to_handle_at() is indeed unique as you describe and can be reliably passed to open_by_handle_at(). Yes, we should fix it but that's really orthogonal to the mount id. It is separately useful and we already do expose it anyway.
On Mon, May 27, 2024 at 02:29:02PM +0200, Christian Brauner wrote: > On Mon, May 27, 2024 at 04:47:56AM -0700, Christoph Hellwig wrote: > > On Sun, May 26, 2024 at 12:01:08PM -0700, Aleksa Sarai wrote: > > > The existing interface already provides a mount ID which is not even > > > safe without rebooting. > > > > And that seems to be a big part of the problem where the Linux by handle > > syscall API deviated from all know precedence for no good reason. NFS > > file handles which were the start of this do (and have to) encode a > > persistent file system identifier. As do the xfs handles (although they > > do the decoding in the userspace library on Linux for historic reasons), > > as do the FreeBSD equivalents to these syscalls. > > > > > An alternative would be to return something unique to the filesystem > > > superblock, but as far as I can tell there is no guarantee that every > > > Linux filesystem's fsid is sufficiently unique to act as a globally > > > unique identifier. At least with a 64-bit mount ID and statmount(2), > > > userspace can decide what information is needed to get sufficiently > > > unique information about the source filesystem. > > > > Well, every file system that supports export ops already needs a > > globally unique ID for NFS to work properly. We might not have good > > enough interfaces for that, but that shouldn't be too hard. > > I see not inherent problem with exposing the 64 bit mount id through > name_to_handle_at() as we already to expose the old one anyway. > > But I agree that it is useful if we had the guarantee that file handles > are unique in the way you describe. As it currently stands that doesn't > seem to be the case and userspace doesn't seem to have a way of figuring > out if the handle provided by name_to_handle_at() is indeed unique as > you describe and can be reliably passed to open_by_handle_at(). > > Yes, we should fix it but that's really orthogonal to the mount id. It > is separately useful and we already do expose it anyway. Put another way, name_to_handle_at(2) currently states: Obtaining a persistent filesystem ID The mount IDs in /proc/self/mountinfo can be reused as filesystems are unmounted and mounted. Therefore, the mount ID returned by name_to_handle_at() (in *mount_id) should not be treated as a persistent identifier for the corresponding mounted filesystem. However, an application can use the information in the mountinfo record that corresponds to the mount ID to derive a persistent identifier. For example, one can use the device name in the fifth field of the mountinfo record to search for the corresponding device UUID via the symbolic links in /dev/disks/by-uuid. (A more comfortable way of obtaining the UUID is to use the libblkid(3) library.) That process can then be reversed, using the UUID to look up the device name, and then obtaining the corre‐ sponding mount point, in order to produce the mount_fd argument used by open_by_handle_at(). Returning the 64bit mount id makes this race-free because we now have statmount(): u64 mnt_id = 0; name_to_handle_at(AT_FDCWD, "/path/to/file", &handle, &mnt_id, 0); statmount(mnt_id); Which gets you the device number which one can use to figure out the uuid without ever having to open a single file (We could even expose the UUID of the filesystem through statmount() if we wanted to.).
On Mon 27-05-24 04:47:56, Christoph Hellwig wrote: > On Sun, May 26, 2024 at 12:01:08PM -0700, Aleksa Sarai wrote: > > The existing interface already provides a mount ID which is not even > > safe without rebooting. > > And that seems to be a big part of the problem where the Linux by handle > syscall API deviated from all know precedence for no good reason. NFS > file handles which were the start of this do (and have to) encode a > persistent file system identifier. As do the xfs handles (although they > do the decoding in the userspace library on Linux for historic reasons), > as do the FreeBSD equivalents to these syscalls. So I was wondering how this is actually working in practice. Checking the code, NFS server is (based on configuration in /etc/exports) either using device number as the filesystem identifier or fsid / uuid as specified in /etc/exports. > > An alternative would be to return something unique to the filesystem > > superblock, but as far as I can tell there is no guarantee that every > > Linux filesystem's fsid is sufficiently unique to act as a globally > > unique identifier. At least with a 64-bit mount ID and statmount(2), > > userspace can decide what information is needed to get sufficiently > > unique information about the source filesystem. > > Well, every file system that supports export ops already needs a > globally unique ID for NFS to work properly. We might not have good > enough interfaces for that, but that shouldn't be too hard. So as my research above shows, this ID is either manually configured in /etc/exports or NFS server uses device number which is not guaranteed to be persistent. Filesystems, at least currently, have no obligation to provide anything (and some of them indeed don't provide any uuid or persistent fsid). I guess that's the reason why mount ID is reported with name_to_handle_at(). Don't get me wrong, I agree with your reservations towards mount ID (per mount instead of per-sb, non-persistent) and I agree the properties you describe are the golden standard we should strive for but I mainly wanted to point out this is not reality today and in particular providing the "persistency" guarantee of the filesystem ID may require on disk format changes for some filesystems. So returning the 64-bit mount ID from name_to_handle_at() weasels out of these "how to identify arbitrary superblock" problems by giving userspace a reasonably reliable way to generate this superblock identifier itself. I'm fully open to less errorprone API for this but at this point I don't see it so changing the mount ID returned from name_to_handle_at() to 64-bit unique one seems like a sane practical choice to me... Honza
On Mon, 2024-05-27 at 04:49 -0700, hch@infradead.org wrote: > On Sun, May 26, 2024 at 10:32:39PM +0000, Trond Myklebust wrote: > > I assume the reason is to give the caller a race free way to figure > > out > > which submount the path resolves to. > > But the handle op are global to the file systems (aka super_block). > It > does not matter what mount you use to access it. Sure. However if you are providing a path argument, then presumably you need to know which file system (aka super_block) it eventually resolves to. > > Snip random things about userland NFS servers I couldn't care less > about.. > My point was that at least for that case, you are better off using a file descriptor and not having to care about the mount id. If your use case isn't NFS servers, then what use case are you targeting, and how do you expect those applications to use this API?
On Mon, 2024-05-27 at 15:17 +0200, Christian Brauner wrote: > > Returning the 64bit mount id makes this race-free because we now have > statmount(): > > u64 mnt_id = 0; > name_to_handle_at(AT_FDCWD, "/path/to/file", &handle, &mnt_id, 0); > statmount(mnt_id); > > Which gets you the device number which one can use to figure out the > uuid without ever having to open a single file (We could even expose > the > UUID of the filesystem through statmount() if we wanted to.). > It is not race free. statmount() depends on the filesystem still being mounted somewhere in your namespace, which is not guaranteed above.
On Mon, May 27, 2024 at 02:29:02PM +0200, Christian Brauner wrote: > I see not inherent problem with exposing the 64 bit mount id through > name_to_handle_at() as we already to expose the old one anyway. That's one way to frame it. The other is that exposing the different mount ID also doesn't substantially fix the problem, so it's not worth the churn.
On Mon, May 27, 2024 at 03:34:30PM +0200, Jan Kara wrote: > So I was wondering how this is actually working in practice. Checking the > code, NFS server is (based on configuration in /etc/exports) either using > device number as the filesystem identifier or fsid / uuid as specified in > /etc/exports. Yes, it's a rather suboptimal implementation. > So returning the 64-bit mount ID from name_to_handle_at() weasels out of > these "how to identify arbitrary superblock" problems by giving userspace a > reasonably reliable way to generate this superblock identifier itself. I'm > fully open to less errorprone API for this but at this point I don't see it > so changing the mount ID returned from name_to_handle_at() to 64-bit unique > one seems like a sane practical choice to me... Well, how about we fix the thing for real: - allow file systems to provide a uniqueu identifier of at least uuid size (16 bytes) in the superblock or through an export operation - for non-persistent file systems allow to generate one at boot time using the normal uuid generation helpers - add a new flag to name_to_handle_at/open_by_handle_at to include it in the file handle, and thus make the file handle work more like the normal file handle - add code to nfsd to directly make use of this This would solve all the problems in this proposal as well as all the obvious ones it doesn't solve.
On Mon, May 27, 2024 at 03:38:40PM +0000, Trond Myklebust wrote: > > It > > does not matter what mount you use to access it. > > Sure. However if you are providing a path argument, then presumably you > need to know which file system (aka super_block) it eventually resolves > to. Except that you can't, at least not without running into potential races. The only way to fix a race vs unmount/remount is to include the fsid part in the kernel generated file handle. > > If your use case isn't NFS servers, then what use case are you > targeting, and how do you expect those applications to use this API? The main user of the open by handle syscalls seems to be fanotify magic.
On Mon, May 27, 2024 at 03:47:33PM +0000, Trond Myklebust wrote: > On Mon, 2024-05-27 at 15:17 +0200, Christian Brauner wrote: > > > > Returning the 64bit mount id makes this race-free because we now have > > statmount(): > > > > u64 mnt_id = 0; > > name_to_handle_at(AT_FDCWD, "/path/to/file", &handle, &mnt_id, 0); > > statmount(mnt_id); > > > > Which gets you the device number which one can use to figure out the > > uuid without ever having to open a single file (We could even expose > > the > > UUID of the filesystem through statmount() if we wanted to.). > > > > It is not race free. statmount() depends on the filesystem still being > mounted somewhere in your namespace, which is not guaranteed above. The unsigned 64bit mount is not recyclable. It is a unique identifier for a mount for the lifetime of the system. Even if bumped on every cycle it will still take hundreds of years to overflow.
On Mon, May 27, 2024 at 09:29:48AM -0700, hch@infradead.org wrote: > On Mon, May 27, 2024 at 03:38:40PM +0000, Trond Myklebust wrote: > > > It > > > does not matter what mount you use to access it. > > > > Sure. However if you are providing a path argument, then presumably you > > need to know which file system (aka super_block) it eventually resolves > > to. > > Except that you can't, at least not without running into potential > races. The only way to fix a race vs unmount/remount is to include > the fsid part in the kernel generated file handle. > > > > > If your use case isn't NFS servers, then what use case are you > > targeting, and how do you expect those applications to use this API? > > The main user of the open by handle syscalls seems to be fanotify > magic. It's also used by userspace for uniquely identifying cgroups via handles as cgroups and - even without open_by_handle_at() - to check whether a file is still valid. And again a 64bit mount is is a simple way to race-free go to whatever superblock uuid you want. They cannot be recycled and are unique for the lifetime of the system.
On Tue, May 28, 2024 at 09:12:33AM +0200, Christian Brauner wrote: > It's also used by userspace for uniquely identifying cgroups via handles > as cgroups and - even without open_by_handle_at() - to check whether a > file is still valid. > > And again a 64bit mount is is a simple way to race-free go to whatever > superblock uuid you want. They cannot be recycled and are unique for the > lifetime of the system. And then break when you reboot. Which you might not care about for cgroups, but which is really bad for the concept of a file handle. See one of my other replies for a proposed interface that is just as easy to use for userspace, a little more complex in the kernel but safe for it. I'd much prefer that over using ay kind of "mount ID" which doesn't fit into the file handle concept at all.
> Well, how about we fix the thing for real: > > - allow file systems to provide a uniqueu identifier of at least > uuid size (16 bytes) in the superblock or through an export operation > - for non-persistent file systems allow to generate one at boot time > using the normal uuid generation helpers > - add a new flag to name_to_handle_at/open_by_handle_at to include it > in the file handle, and thus make the file handle work more like > the normal file handle > - add code to nfsd to directly make use of this > > This would solve all the problems in this proposal as well as all the > obvious ones it doesn't solve. As I've said multiple times on the thread I agree that this is what we should do next and I would be happy to take patches for this. But exposing the 64bit mount id doesn't impede or complicate that work. It's a simple and useful addition that can be done now and doesn't prevent us from doing the proposed work. Hell, if you twist my arm I'll even write the patches for this.
On Tue, May 28, 2024 at 10:20:21AM +0200, Christian Brauner wrote: > > This would solve all the problems in this proposal as well as all the > > obvious ones it doesn't solve. > > As I've said multiple times on the thread I agree that this is what we > should do next and I would be happy to take patches for this. But > exposing the 64bit mount id doesn't impede or complicate that work. It's > a simple and useful addition that can be done now and doesn't prevent us > from doing the proposed work. I really confuses the user story as we now have not only one but two broken modes for the open by handle ops. We just further go down the rabbit hole of mixing a non-persistent identifiers with a persistent one. > Hell, if you twist my arm I'll even write the patches for this. I'm also happy to help with that despite very limited time, but I'd rather avoid doing the misguided mount ID thing.
> > Hell, if you twist my arm I'll even write the patches for this. > > I'm also happy to help with that despite very limited time, but I'd > rather avoid doing the misguided mount ID thing. As I've said earlier, independent of the new handle type returning the new mount id is useful and needed because it allows the caller to reliably generate a mount fd for use with open_by_handle_at() via statmount(). That won't be solved by a new handle type and is racy with the old mount id. So I intend to accept a version of this patch.
On Mon 27-05-24 09:29:48, hch@infradead.org wrote: > On Mon, May 27, 2024 at 03:38:40PM +0000, Trond Myklebust wrote: > > If your use case isn't NFS servers, then what use case are you > > targeting, and how do you expect those applications to use this API? > > The main user of the open by handle syscalls seems to be fanotify > magic. So some fanotify users may use open_by_handle_at() and name_to_handle_at() but we specifically designed fanotify to not depend on this mount id feature of the API (because it wasn't really usable couple of years ago when we were designing this with Amir). fanotify returns fsid + fhandle in its events and userspace is expected to build a mapping of fsid -> "whatever it needs to identify a filesystem" when placing fanotify marks. If it wants to open file / directory where events happened, then this usually means keeping fsid -> "some open fd on fs" mapping so that it can then use open_by_handle_at() for opening. Honza
On Tue, May 28, 2024 at 11:17:58AM +0200, Christian Brauner wrote: > As I've said earlier, independent of the new handle type returning the > new mount id is useful and needed because it allows the caller to > reliably generate a mount fd for use with open_by_handle_at() via > statmount(). That won't be solved by a new handle type and is racy with > the old mount id. So I intend to accept a version of this patch. The whole point is that with the fsid in the handle we do not even need a mount fd for open_by_handle_at. If you insist on making this interface even more crappy than it is and create confusion please record my explicit: Nacked-by: Christoph Hellwig <hch@lst.de> in the commit :(
On Tue, May 28, 2024 at 12:11:52PM +0200, Jan Kara wrote: > So some fanotify users may use open_by_handle_at() and name_to_handle_at() > but we specifically designed fanotify to not depend on this mount id > feature of the API (because it wasn't really usable couple of years ago > when we were designing this with Amir). fanotify returns fsid + fhandle in > its events and userspace is expected to build a mapping of fsid -> > "whatever it needs to identify a filesystem" when placing fanotify marks. > If it wants to open file / directory where events happened, then this > usually means keeping fsid -> "some open fd on fs" mapping so that it can > then use open_by_handle_at() for opening. Which seems like another argument for my version of the handles to include the fsid. Although IIRC the fanotify fsid is only 64 bits which isn't really good entropy, so we might have to rev that as well.
On Tue, May 28, 2024 at 03:55:29AM -0700, Christoph Hellwig wrote: > On Tue, May 28, 2024 at 11:17:58AM +0200, Christian Brauner wrote: > > As I've said earlier, independent of the new handle type returning the > > new mount id is useful and needed because it allows the caller to > > reliably generate a mount fd for use with open_by_handle_at() via > > statmount(). That won't be solved by a new handle type and is racy with > > the old mount id. So I intend to accept a version of this patch. > > The whole point is that with the fsid in the handle we do not even need > a mount fd for open_by_handle_at. Can you please explain how opening an fd based on a handle returned from name_to_handle_at() and not using a mount file descriptor for open_by_handle_at() would work?
On Tue, May 28, 2024 at 02:04:16PM +0200, Christian Brauner wrote: > Can you please explain how opening an fd based on a handle returned from > name_to_handle_at() and not using a mount file descriptor for > open_by_handle_at() would work? Same as NFS file handles: name_to_handle_at returns a handle that includes a file system identifier. open_by_handle_at looks up the superblock based on that identifier. For the identifier I could imagin three choices: 1) use the fsid as returned in statfs and returned by fsnotify. The downside is that it is "only" 64-bit. The upside is that we have a lot of plumbing for it 2) fixed 128-bit identifier to provide more entropy 3) a variable length identifier, which is more similar to NFS, but also a lot more complicated We'd need a global lookup structure to find the sb by id. The simplest one would be a simple linear loop over super_blocks which isn't terribly efficient, but probably better than whatever userspace is doing to find a mount fd right now. Let me cook up a simple prototype for 1) as it shouldn't be more than a few hundred lines of code.
On Tue, 28 May 2024 at 15:24, Christoph Hellwig <hch@infradead.org> wrote: > > On Tue, May 28, 2024 at 02:04:16PM +0200, Christian Brauner wrote: > > Can you please explain how opening an fd based on a handle returned from > > name_to_handle_at() and not using a mount file descriptor for > > open_by_handle_at() would work? > > Same as NFS file handles: > > name_to_handle_at returns a handle that includes a file system > identifier. > > open_by_handle_at looks up the superblock based on that identifier. The open file needs a specific mount, holding the superblock is not sufficient. Thanks, Miklos
On Tue, May 28, 2024 at 03:56:43AM -0700, hch@infradead.org wrote: > On Tue, May 28, 2024 at 12:11:52PM +0200, Jan Kara wrote: > > So some fanotify users may use open_by_handle_at() and name_to_handle_at() > > but we specifically designed fanotify to not depend on this mount id > > feature of the API (because it wasn't really usable couple of years ago > > when we were designing this with Amir). fanotify returns fsid + fhandle in > > its events and userspace is expected to build a mapping of fsid -> > > "whatever it needs to identify a filesystem" when placing fanotify marks. > > If it wants to open file / directory where events happened, then this > > usually means keeping fsid -> "some open fd on fs" mapping so that it can > > then use open_by_handle_at() for opening. > > Which seems like another argument for my version of the handles to > include the fsid. Although IIRC the fanotify fsid is only 64 bits which > isn't really good entropy, so we might have to rev that as well. I'm in agreement with Christoph that the filehandle needs to contain the restricted scope information internally. I said that in response to an earlier version of the patch here: https://lore.kernel.org/linux-fsdevel/ZlPOd0p7AUn7JqLu@dread.disaster.area/ "If filehandles are going to be restricted to a specific container (e.g. a single mount) so that less permissions are needed to open the filehandle, then the filehandle itself needs to encode those restrictions. Decoding the filehandle needs to ensure that the opening context has permission to access whatever the filehandle points to in a restricted environment. This will prevent existing persistent, global filehandles from being used as restricted filehandles and vice versa." But no-one has bothered to reply or acknowledge my comments so I'll point them out again and repeat: Filehandles generated by the kernel for unprivileged use *must* be self describing and self validating as the kernel must be able to detect and prevent unprivelged users from generating custom filehandles that can be used to access files outside the restricted scope of their container. -Dave.
On Wed, May 29, 2024 at 09:25:49AM +1000, Dave Chinner wrote: > But no-one has bothered to reply or acknowledge my comments so I'll > point them out again and repeat: Filehandles generated by > the kernel for unprivileged use *must* be self describing and self > validating as the kernel must be able to detect and prevent > unprivelged users from generating custom filehandles that can be > used to access files outside the restricted scope of their > container. We must not generate file handle for unprivileged use at all, as they bypass all the path based access controls.
On Tue, May 28, 2024 at 03:28:18PM +0200, Miklos Szeredi wrote: > > open_by_handle_at looks up the superblock based on that identifier. > > The open file needs a specific mount, holding the superblock is not > sufficient. A strut file needs a vfsmount, yes. And it better be reachable by the calling process. And maybe an optional restriction to a specific mount by the caller might be useful, but I can't see how it is required.
On Wed, May 29, 2024 at 9:24 AM hch@infradead.org <hch@infradead.org> wrote: > > On Wed, May 29, 2024 at 09:25:49AM +1000, Dave Chinner wrote: > > But no-one has bothered to reply or acknowledge my comments so I'll > > point them out again and repeat: Filehandles generated by > > the kernel for unprivileged use *must* be self describing and self > > validating Very nice requirement for a very strong feature, which is far out of scope for the proposed patch IMO. What is "generated by the kernel for unprivileged use"? name_to_handle_at() is unprivileged and for example, fanotify unprivileged users can use it to compare a marked (i.e. watched) object with an object that is the subject of an event. This does not break any security model. > > as the kernel must be able to detect and prevent > > unprivelged users from generating custom filehandles that can be > > used to access files outside the restricted scope of their > > container. Emphasis on "that can be used to access files". The patch in this thread to get a unique 64bit mntid does not make any difference wrt to the concern above, so I am having a hard time understand what the fuss is about. > > We must not generate file handle for unprivileged use at all, as they > bypass all the path based access controls. > Again, how is "generate file handle for unprivileged use" related to the patch in question. The simple solution to the race that Aleksa is trying to prevent, as was mentioned several times in this thread, is to allow name_to_handle_at() on an empty path, e.g.: fd = openat(.., O_PATH); fstatat(fd,..); name_to_handle_at(fd,..); Aleksa prefers the unique mntid solution to save 2 syscalls. Would any of the objections here been called for letting name_to_handle_at() take an empty path? I would like to offer a different solution (also may have been mentioned before). I always disliked the fact that mount_id and mount_fd arguments of name_to_handle_at()/open_by_handle_at() are not symmetric, when at first glance they appear to be symmetric as the handle argument. What if we can make them symmetric and save the openat() syscall? int path_fd; int ret = name_to_handle_at(dirfd, path, handle, &path_fd, AT_HADNLE_PATH_FD); and then kernel returns an O_PATH fd to the path object in addition to the file handle (requires same privilege as encoding the fh). Userspace can use path_fd to query unique mntid, fsid, uuid or whatever it needs to know in order to make sure that a later call in the future to open_by_handle_at() will use a mount_fd from the same filesystem/mount whatever, whether in the same boot or after reboot. If we are doing this, it also makes sense to allow mount_fd argument to open_by_handle_at() to accept O_PATH fd, but that makes sense anyway, because mount_fd is essentially a reference to a path. Thanks, Amir.
On Tue, May 28, 2024 at 06:22:23AM -0700, Christoph Hellwig wrote: > On Tue, May 28, 2024 at 02:04:16PM +0200, Christian Brauner wrote: > > Can you please explain how opening an fd based on a handle returned from > > name_to_handle_at() and not using a mount file descriptor for > > open_by_handle_at() would work? > > Same as NFS file handles: > > name_to_handle_at returns a handle that includes a file system > identifier. > > open_by_handle_at looks up the superblock based on that identifier. > > For the identifier I could imagin three choices: > > 1) use the fsid as returned in statfs and returned by fsnotify. > The downside is that it is "only" 64-bit. The upside is that > we have a lot of plumbing for it > 2) fixed 128-bit identifier to provide more entropy > 3) a variable length identifier, which is more similar to NFS, > but also a lot more complicated > > We'd need a global lookup structure to find the sb by id. The simplest > one would be a simple linear loop over super_blocks which isn't terribly > efficient, but probably better than whatever userspace is doing to > find a mount fd right now. > > Let me cook up a simple prototype for 1) as it shouldn't be more than > a few hundred lines of code. Yeah, that's exactly what I figured and no that's not something we should do. Not just can have a really large number of superblocks if you have mount namespaces and large container workloads that interface also needs to be highly privileged. Plus, you do have filesystems like btrfs that can be mounted multiple times with the same uuid. And in general users will still need to be able to legitimately use a mount fd and not care about the handle type used with it.
On Wed, May 29, 2024 at 09:40:01AM +0200, Christian Brauner wrote: > Yeah, that's exactly what I figured and no that's not something we > should do. > > Not just can have a really large number of superblocks if you have mount > namespaces and large container workloads that interface also needs to be > highly privileged. Again, that would be the most trivial POC. We can easily do hash. > Plus, you do have filesystems like btrfs that can be mounted multiple > times with the same uuid. Which doesn't matter. Just like for NFS file handles the fs identifier identifier plus the file part of the file handle need to be unique. > And in general users will still need to be able to legitimately use a > mount fd and not care about the handle type used with it. I don't understand what you mean. If we hand out file handles with fsid that of course needs to be keyed off a new flag for both name_to_handle and open_by_hnalde that makes them not interchangable to handles generated without that flag.
> I don't understand what you mean. If we hand out file handles with
I don't think adding another highly privileged interface in opposition
to exposing the mount id is warranted. The new mount id exposure patch
is a performance improvement for existing use-cases to let userspace
avoid additional system calls. Because as Aleksa showed in the commit
message there's ways to make this race-free right now. The patch hasn't
gained any Acks from Amir or Jeff so it's obviously not going to get
merged. So really, I don't want another interface that goes from
arbitrary superblock to file hidden behind yet another global capability
check.
On 2024-05-28, Miklos Szeredi <miklos@szeredi.hu> wrote: > On Tue, 28 May 2024 at 15:24, Christoph Hellwig <hch@infradead.org> wrote: > > > > On Tue, May 28, 2024 at 02:04:16PM +0200, Christian Brauner wrote: > > > Can you please explain how opening an fd based on a handle returned from > > > name_to_handle_at() and not using a mount file descriptor for > > > open_by_handle_at() would work? > > > > Same as NFS file handles: > > > > name_to_handle_at returns a handle that includes a file system > > identifier. > > > > open_by_handle_at looks up the superblock based on that identifier. > > The open file needs a specific mount, holding the superblock is not sufficient. Not to mention that providing a mount fd is what allows for extensions like Christian's proposed method of allowing restricted forms of open_by_handle_at() to be used by unprivileged users. If file handles really are going to end up being the "correct" mechanism of referencing inodes by userspace, then future API designs really need to stop assuming that the user is capable(CAP_DAC_READ_SEARCH). Being able to open any file in any superblock the kernel knows about (presumably using a kernel-internal mount if we are getting rid of the mount fd) is also capable(CAP_SYS_ADMIN) territory. Would the idea be to sign or MAC every file handle to avoid userspace being able to brute-force the file handle of anything the system sees? What happens if the key has to change? Then the handles aren't globally unique anymore...
On Sat 01-06-24 01:12:31, Aleksa Sarai wrote: > On 2024-05-28, Miklos Szeredi <miklos@szeredi.hu> wrote: > > On Tue, 28 May 2024 at 15:24, Christoph Hellwig <hch@infradead.org> wrote: > > > > > > On Tue, May 28, 2024 at 02:04:16PM +0200, Christian Brauner wrote: > > > > Can you please explain how opening an fd based on a handle returned from > > > > name_to_handle_at() and not using a mount file descriptor for > > > > open_by_handle_at() would work? > > > > > > Same as NFS file handles: > > > > > > name_to_handle_at returns a handle that includes a file system > > > identifier. > > > > > > open_by_handle_at looks up the superblock based on that identifier. > > > > The open file needs a specific mount, holding the superblock is not sufficient. > > Not to mention that providing a mount fd is what allows for extensions > like Christian's proposed method of allowing restricted forms of > open_by_handle_at() to be used by unprivileged users. > > If file handles really are going to end up being the "correct" mechanism > of referencing inodes by userspace, then future API designs really need > to stop assuming that the user is capable(CAP_DAC_READ_SEARCH). Being > able to open any file in any superblock the kernel knows about > (presumably using a kernel-internal mount if we are getting rid of the > mount fd) is also capable(CAP_SYS_ADMIN) territory. Well, but this is already handled - name_to_handle_at() with AT_HANDLE_FID is completely unpriviledged operation. Unpriviledged userspace can use fhandle for comparisons with other file handles but that's all it is good for (similarly as inode number you get from statx(2) but does not have the problem with inode number uniqueness on btrfs, bcachefs, etc.). I don't expect unpriviledged userspace to be able to more with the fhandle it got. Honza
On Sat, Jun 01, 2024 at 01:12:31AM -0700, Aleksa Sarai wrote: > Not to mention that providing a mount fd is what allows for extensions > like Christian's proposed method of allowing restricted forms of > open_by_handle_at() to be used by unprivileged users. As mentioned there I find the concept of an unprivileged open_by_handle_at extremely questionable as it trivially gives access to any inode on the file systems. > If file handles really are going to end up being the "correct" mechanism > of referencing inodes by userspace, They aren't. > then future API designs really need > to stop assuming that the user is capable(CAP_DAC_READ_SEARCH). There is no way to support open by handle for unprivileged users. The concept of an inode number based file handle simply does not work for that at all.
diff --git a/fs/fhandle.c b/fs/fhandle.c index 8a7f86c2139a..c6e90161b900 100644 --- a/fs/fhandle.c +++ b/fs/fhandle.c @@ -16,7 +16,8 @@ static long do_sys_name_to_handle(const struct path *path, struct file_handle __user *ufh, - int __user *mnt_id, int fh_flags) + void __user *mnt_id, bool unique_mntid, + int fh_flags) { long retval; struct file_handle f_handle; @@ -69,9 +70,19 @@ static long do_sys_name_to_handle(const struct path *path, } else retval = 0; /* copy the mount id */ - if (put_user(real_mount(path->mnt)->mnt_id, mnt_id) || - copy_to_user(ufh, handle, - struct_size(handle, f_handle, handle_bytes))) + if (unique_mntid) { + if (put_user(real_mount(path->mnt)->mnt_id_unique, + (u64 __user *) mnt_id)) + retval = -EFAULT; + } else { + if (put_user(real_mount(path->mnt)->mnt_id, + (int __user *) mnt_id)) + retval = -EFAULT; + } + /* copy the handle */ + if (retval != -EFAULT && + copy_to_user(ufh, handle, + struct_size(handle, f_handle, handle_bytes))) retval = -EFAULT; kfree(handle); return retval; @@ -83,6 +94,7 @@ static long do_sys_name_to_handle(const struct path *path, * @name: name that should be converted to handle. * @handle: resulting file handle * @mnt_id: mount id of the file system containing the file + * (u64 if AT_HANDLE_MNT_ID_UNIQUE, otherwise int) * @flag: flag value to indicate whether to follow symlink or not * and whether a decodable file handle is required. * @@ -92,7 +104,7 @@ static long do_sys_name_to_handle(const struct path *path, * value required. */ SYSCALL_DEFINE5(name_to_handle_at, int, dfd, const char __user *, name, - struct file_handle __user *, handle, int __user *, mnt_id, + struct file_handle __user *, handle, void __user *, mnt_id, int, flag) { struct path path; @@ -100,7 +112,8 @@ SYSCALL_DEFINE5(name_to_handle_at, int, dfd, const char __user *, name, int fh_flags; int err; - if (flag & ~(AT_SYMLINK_FOLLOW | AT_EMPTY_PATH | AT_HANDLE_FID)) + if (flag & ~(AT_SYMLINK_FOLLOW | AT_EMPTY_PATH | AT_HANDLE_FID | + AT_HANDLE_MNT_ID_UNIQUE)) return -EINVAL; lookup_flags = (flag & AT_SYMLINK_FOLLOW) ? LOOKUP_FOLLOW : 0; @@ -109,7 +122,9 @@ SYSCALL_DEFINE5(name_to_handle_at, int, dfd, const char __user *, name, lookup_flags |= LOOKUP_EMPTY; err = user_path_at(dfd, name, lookup_flags, &path); if (!err) { - err = do_sys_name_to_handle(&path, handle, mnt_id, fh_flags); + err = do_sys_name_to_handle(&path, handle, mnt_id, + flag & AT_HANDLE_MNT_ID_UNIQUE, + fh_flags); path_put(&path); } return err; diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index e619ac10cd23..99c5a783a01e 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -863,7 +863,7 @@ asmlinkage long sys_fanotify_mark(int fanotify_fd, unsigned int flags, const char __user *pathname); asmlinkage long sys_name_to_handle_at(int dfd, const char __user *name, struct file_handle __user *handle, - int __user *mnt_id, int flag); + void __user *mnt_id, int flag); asmlinkage long sys_open_by_handle_at(int mountdirfd, struct file_handle __user *handle, int flags); diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index c0bcc185fa48..9ed9d65842c1 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -87,22 +87,24 @@ #define DN_ATTRIB 0x00000020 /* File changed attibutes */ #define DN_MULTISHOT 0x80000000 /* Don't remove notifier */ +#define AT_FDCWD -100 /* Special value used to indicate + openat should use the current + working directory. */ +#define AT_SYMLINK_NOFOLLOW 0x100 /* Do not follow symbolic links. */ + /* - * The constants AT_REMOVEDIR and AT_EACCESS have the same value. AT_EACCESS is - * meaningful only to faccessat, while AT_REMOVEDIR is meaningful only to + * The constants AT_REMOVEDIR and AT_EACCESS have the same value. AT_EACCESS + * is meaningful only to faccessat, while AT_REMOVEDIR is meaningful only to * unlinkat. The two functions do completely different things and therefore, * the flags can be allowed to overlap. For example, passing AT_REMOVEDIR to * faccessat would be undefined behavior and thus treating it equivalent to * AT_EACCESS is valid undefined behavior. */ -#define AT_FDCWD -100 /* Special value used to indicate - openat should use the current - working directory. */ -#define AT_SYMLINK_NOFOLLOW 0x100 /* Do not follow symbolic links. */ #define AT_EACCESS 0x200 /* Test access permitted for effective IDs, not real IDs. */ #define AT_REMOVEDIR 0x200 /* Remove directory instead of unlinking file. */ + #define AT_SYMLINK_FOLLOW 0x400 /* Follow symbolic links. */ #define AT_NO_AUTOMOUNT 0x800 /* Suppress terminal automount traversal */ #define AT_EMPTY_PATH 0x1000 /* Allow empty relative pathname */ @@ -114,10 +116,22 @@ #define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */ -/* Flags for name_to_handle_at(2). We reuse AT_ flag space to save bits... */ +/* + * All new purely-syscall-specific AT_* flags should consider using bits from + * 0xFF, but the bits used by RENAME_* (0x7) should be avoided in case users + * decide to pass AT_* flags to renameat2() by accident. These flag bits are + * free for re-use by other syscall's syscall-specific flags without worry. + */ + +/* + * Flags for name_to_handle_at(2). To save AT_ flag space we re-use the + * AT_EACCESS/AT_REMOVEDIR bit for AT_HANDLE_FID. + */ #define AT_HANDLE_FID AT_REMOVEDIR /* file handle is needed to compare object identity and may not be usable to open_by_handle_at(2) */ +#define AT_HANDLE_MNT_ID_UNIQUE 0x80 /* return the u64 unique mount id */ + #if defined(__KERNEL__) #define AT_GETATTR_NOSEC 0x80000000 #endif diff --git a/tools/include/uapi/linux/fcntl.h b/tools/include/uapi/linux/fcntl.h index 282e90aeb163..f671312ca481 100644 --- a/tools/include/uapi/linux/fcntl.h +++ b/tools/include/uapi/linux/fcntl.h @@ -85,22 +85,24 @@ #define DN_ATTRIB 0x00000020 /* File changed attibutes */ #define DN_MULTISHOT 0x80000000 /* Don't remove notifier */ +#define AT_FDCWD -100 /* Special value used to indicate + openat should use the current + working directory. */ +#define AT_SYMLINK_NOFOLLOW 0x100 /* Do not follow symbolic links. */ + /* - * The constants AT_REMOVEDIR and AT_EACCESS have the same value. AT_EACCESS is - * meaningful only to faccessat, while AT_REMOVEDIR is meaningful only to + * The constants AT_REMOVEDIR and AT_EACCESS have the same value. AT_EACCESS + * is meaningful only to faccessat, while AT_REMOVEDIR is meaningful only to * unlinkat. The two functions do completely different things and therefore, * the flags can be allowed to overlap. For example, passing AT_REMOVEDIR to * faccessat would be undefined behavior and thus treating it equivalent to * AT_EACCESS is valid undefined behavior. */ -#define AT_FDCWD -100 /* Special value used to indicate - openat should use the current - working directory. */ -#define AT_SYMLINK_NOFOLLOW 0x100 /* Do not follow symbolic links. */ #define AT_EACCESS 0x200 /* Test access permitted for effective IDs, not real IDs. */ #define AT_REMOVEDIR 0x200 /* Remove directory instead of unlinking file. */ + #define AT_SYMLINK_FOLLOW 0x400 /* Follow symbolic links. */ #define AT_NO_AUTOMOUNT 0x800 /* Suppress terminal automount traversal */ #define AT_EMPTY_PATH 0x1000 /* Allow empty relative pathname */ @@ -112,10 +114,22 @@ #define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */ -/* Flags for name_to_handle_at(2). We reuse AT_ flag space to save bits... */ +/* + * All new purely-syscall-specific AT_* flags should consider using bits from + * 0xFF, but the bits used by RENAME_* (0x7) should be avoided in case users + * decide to pass AT_* flags to renameat2() by accident. These flag bits are + * free for re-use by other syscall's syscall-specific flags without worry. + */ + +/* + * Flags for name_to_handle_at(2). To save AT_ flag space we re-use the + * AT_EACCESS/AT_REMOVEDIR bit for AT_HANDLE_FID. + */ #define AT_HANDLE_FID AT_REMOVEDIR /* file handle is needed to compare object identity and may not be usable to open_by_handle_at(2) */ +#define AT_HANDLE_MNT_ID_UNIQUE 0x80 /* returned mount id is the u64 unique mount id */ + #if defined(__KERNEL__) #define AT_GETATTR_NOSEC 0x80000000 #endif
Now that we provide a unique 64-bit mount ID interface in statx, we can now provide a race-free way for name_to_handle_at(2) to provide a file handle and corresponding mount without needing to worry about racing with /proc/mountinfo parsing. While this is not necessary if you are using AT_EMPTY_PATH and don't care about an extra statx(2) call, users that pass full paths into name_to_handle_at(2) need to know which mount the file handle comes from (to make sure they don't try to open_by_handle_at a file handle from a different filesystem) and switching to AT_EMPTY_PATH would require allocating a file for every name_to_handle_at(2) call, turning err = name_to_handle_at(-EBADF, "/foo/bar/baz", &handle, &mntid, AT_HANDLE_MNT_ID_UNIQUE); into int fd = openat(-EBADF, "/foo/bar/baz", O_PATH | O_CLOEXEC); err1 = name_to_handle_at(fd, "", &handle, &unused_mntid, AT_EMPTY_PATH); err2 = statx(fd, "", AT_EMPTY_PATH, STATX_MNT_ID_UNIQUE, &statxbuf); mntid = statxbuf.stx_mnt_id; close(fd); Unlike AT_HANDLE_FID, as per Amir's suggestion, AT_HANDLE_MNT_ID_UNIQUE uses a new AT_* bit from the historically-unused 0xFF range (which we now define as being the "per-syscall" range for AT_* bits). Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> --- Changes in v2: - Fixed a few minor compiler warnings and a buggy copy_to_user() check. - Rename AT_HANDLE_UNIQUE_MOUNT_ID -> AT_HANDLE_MNT_ID_UNIQUE to match statx. - Switched to using an AT_* bit from 0xFF and defining that range as being "per-syscall" for future usage. - Sync tools/ copy of <linux/fcntl.h> to include changes. - v1: <https://lore.kernel.org/r/20240520-exportfs-u64-mount-id-v1-1-f55fd9215b8e@cyphar.com> --- fs/fhandle.c | 29 ++++++++++++++++++++++------- include/linux/syscalls.h | 2 +- include/uapi/linux/fcntl.h | 28 +++++++++++++++++++++------- tools/include/uapi/linux/fcntl.h | 28 +++++++++++++++++++++------- 4 files changed, 65 insertions(+), 22 deletions(-) --- base-commit: 584bbf439d0fa83d728ec49f3a38c581bdc828b4 change-id: 20240515-exportfs-u64-mount-id-9ebb5c58b53c Best regards,