[GIT,PULL] Filesystem Information

Message ID	1845353.1596469795@warthog.procyon.org.uk (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=Dl7R=BN=vger.kernel.org=linux-security-module-owner@kernel.org> Organization: Red Hat UK Ltd. Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor, Berkshire, SI4 1TE, United Kingdom. Registered in England and Wales under Company Registration No. 3798903 From: David Howells <dhowells@redhat.com> References: <1842689.1596468469@warthog.procyon.org.uk> To: Linus Torvalds <torvalds@linux-foundation.org> Cc: dhowells@redhat.com, viro@zeniv.linux.org.uk, raven@themaw.net, kzak@redhat.com, jlayton@redhat.com, mszeredi@redhat.com, nicolas.dichtel@6wind.com, christian@brauner.io, lennart@poettering.net, linux-api@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-security-module@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [GIT PULL] Filesystem Information MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Date: Mon, 03 Aug 2020 16:49:55 +0100 Message-ID: <1845353.1596469795@warthog.procyon.org.uk> Sender: owner-linux-security-module@vger.kernel.org Precedence: bulk
Series	[GIT,PULL] Filesystem Information \| expand [GIT,PULL] Filesystem Information

David Howells Aug. 3, 2020, 3:49 p.m. UTC

Hi Linus,

Here's a set of patches that adds a system call, fsinfo(), that allows
information about the VFS, mount topology, superblock and files to be
retrieved.

The patchset is based on top of the mount notifications patchset so that
the mount notification mechanism can be hooked to provide event counters
that can be retrieved with fsinfo(), thereby making it a lot faster to work
out which mounts have changed.

Note that there was a last minute change requested by Miklós: the event
counter bits got moved from the mount notification patchset to this one.
The counters got made atomic_long_t inside the kernel and __u64 in the
UAPI.  The aggregate changes can be assessed by comparing pre-change tag,
fsinfo-core-20200724 to the requested pull tag.

Karel Zak has created preliminary patches that add support to libmount[*]
and Ian Kent has started working on making systemd use these and mount
notifications[**].

[*] https://github.com/karelzak/util-linux/commits/topic/fsinfo

[**] https://lists.freedesktop.org/archives/systemd-devel/2020-July/044914.html


=======
THE WHY
=======

Why do we want this?

Using /proc/mounts (or similar) has problems:

 (1) Reading from it holds a global lock (namespace_sem) that prevents
     mounting and unmounting.  Lots of data is encoded and mangled into
     text whilst the lock is held, including superblock option strings and
     mount point paths.  This causes performance problems when there are a
     lot of mount objects in a system.

 (2) Even though namespace_sem is held during a read, reading the whole
     file isn't necessarily atomic with respect to mount-type operations.
     If a read isn't satisfied in one go, then it may return to userspace
     briefly and then continue reading some way into the file.  But changes
     can occur in the interval that may then go unseen.

 (3) Determining what has changed means parsing and comparing consecutive
     outputs of /proc/mounts.

 (4) Querying a specific mount or superblock means searching through
     /proc/mounts and searching by path or mount ID - but we might have an
     fd we want to query.

 (5) Whilst you can poll() it for events, it only tells you that something
     changed in the namespace, not what or whether you can even see the
     change.

To fix the notification issues, the preceding notifications patchset added
mount watch notifications whereby you can watch for notifications in a
specific mount subtree.  The notification messages include the ID(s) of the
affected mounts.

To support notifications, however, we need to be able to handle overruns in
the notification queue.  I added a number of event counters to struct
super_block and struct mount to allow you to pin down the changes, but
there needs to be a way to retrieve them.  Exposing them through /proc
would require adding yet another /proc/mounts-type file.  We could add
per-mount directories full of attributes in sysfs, but that has issues also
(see below).

Adding an extensible system call interface for retrieving filesystem
information also allows other things to be exposed:

 (1) Jeff Layton's error handling changes need a way to allow error event
     information to be retrieved.

 (2) Bits in masks returned by things like statx() and FS_IOC_GETFLAGS are
     actually 3-state { Set, Unset, Not supported }.  It could be useful to
     provide a way to expose information like this[*].

 (3) Limits of the numerical metadata values in a filesystem[*].

 (4) Filesystem capability information[*].  Filesystems don't all have the
     same capabilities, and even different instances may have different
     capabilities, particularly with network filesystems where the set of
     may be server-dependent.  Capabilities might even vary at file
     granularity - though possibly such information should be conveyed
     through statx() instead.

 (5) ID mapping/shifting tables in use for a superblock.

 (6) Filesystem-specific information.  I need something for AFS so that I
     can do pioctl()-emulation, thereby allowing me to implement certain of
     the AFS command line utilities that query state of a particular file.
     This could also have application for other filesystems, such as NFS,
     CIFS and ext4.

 [*] In a lot of cases these are probably invariant and can be memcpy'd
     from static data.

There's a further consideration: I want to make it possible to have
fsconfig(fd, FSCONFIG_CMD_CREATE) be intercepted by a container manager
such that the manager can supervise a mount attempted inside the container.
The manager would be given an fd pointing to the fs_context struct and
would then need some way to query it (fsinfo()) and modify it (fsconfig()).
This could also be used to arbitrate user-requested mounts when containers
are not in play.


================
DESIGN DECISIONS
================

 (1) Information is partitioned into sets of attributes.

 (2) Attribute IDs are integers as they're fast to compare.

 (3) Attribute values are typed (struct, list of structs, string, opaque
     blob).  They type is fixed for a particular attribute.

 (4) For structure types, the length is also a version.  New fields can be
     tacked onto the end.

 (5) When copying a versioned struct to userspace, the core handles a
     version mismatch by truncating or zero-padding the data as necessary.
     This is transparent to the filesystem.

 (6) The core handles all the buffering and buffer resizing.

 (7) The filesystem never gets any access to the userspace parameter buffer
     or result buffer.

 (8) "Meta" attributes can describe other attributes.


========
OVERVIEW
========

fsinfo() is a system call that allows information about the filesystem at a
particular path point to be queried as a set of attributes.

Attribute values are of four basic types:

 (1) Structure with version-dependent length (the length is the version).

 (2) Variable-length string.

 (3) List of structures (all the same length).

 (4) Opaque blob.

Attributes can have multiple values either as a sequence of values or a
sequence-of-sequences of values and all the values of a particular
attribute must be of the same type.  Values can be up to INT_MAX size,
subject to memory availability.

Note that the values of an attribute *are* allowed to vary between dentries
within a single superblock, depending on the specific dentry that you're
looking at, but the values still have to be of the type for that attribute.

I've tried to make the interface as light as possible, so integer attribute
IDs rather than string and the core does all the buffer allocation and
expansion and all the extensibility support work rather than leaving that
to the filesystems.  This also means that userspace pointers are not
exposed to the filesystem.


fsinfo() allows a variety of information to be retrieved about a filesystem
and the mount topology:

 (1) General superblock attributes:

     - Filesystem identifiers (UUID, volume label, device numbers, ...)
     - The limits on a filesystem's capabilities
     - Information on supported statx fields and attributes and IOC flags.
     - A variety single-bit flags indicating supported capabilities.
     - Timestamp resolution and range.
     - The amount of space/free space in a filesystem (as statfs()).
     - Superblock notification counter.

 (2) Filesystem-specific superblock attributes:

     - Superblock-level timestamps.
     - Cell name, workgroup or other netfs grouping concept.
     - Server names and addresses.

 (3) VFS information:

     - Mount topology information.
     - Mount attributes.
     - Mount notification counter.
     - Mount point path.

 (4) Information about what the fsinfo() syscall itself supports, including
     the type and struct size of attributes.

The system is extensible:

 (1) New attributes can be added.  There is no requirement that a
     filesystem implement every attribute.  A helper function is provided
     to scan a list of attributes and a filesystem can have multiple such
     lists.

 (2) Version length-dependent structure attributes can be made larger and
     have additional information tacked on the end, provided it keeps the
     layout of the existing fields.  If an older process asks for a shorter
     structure, it will only be given the bits it asks for.  If a newer
     process asks for a longer structure on an older kernel, the extra
     space will be set to 0.  In all cases, the size of the data actually
     available is returned.

     In essence, the size of a structure is that structure's version: a
     smaller size is an earlier version and a later version includes
     everything that the earlier version did.

 (3) New single-bit capability flags can be added.  This is a structure-typed
     attribute and, as such, (2) applies.  Any bits you wanted but the kernel
     doesn't support are automatically set to 0.

fsinfo() may be called like the following, for example:

	struct fsinfo_params params = {
		.at_flags	= AT_SYMLINK_NOFOLLOW,
		.flags		= FSINFO_FLAGS_QUERY_PATH,
		.request	= FSINFO_ATTR_AFS_SERVER_ADDRESSES,
		.Nth		= 2,
	};
	struct fsinfo_server_address address;
	len = fsinfo(AT_FDCWD, "/afs/grand.central.org/doc",
		     &params, sizeof(params),
		     &address, sizeof(address));

The above example would query an AFS filesystem to retrieve the address
list for the 3rd server, and:

	struct fsinfo_params params = {
		.at_flags	= AT_SYMLINK_NOFOLLOW,
		.flags		= FSINFO_FLAGS_QUERY_PATH,
		.request	= FSINFO_ATTR_NFS_SERVER_NAME;
	};
	char server_name[256];
	len = fsinfo(AT_FDCWD, "/home/dhowells/",
		     &params, sizeof(params),
		     &server_name, sizeof(server_name));

would retrieve the name of the NFS server as a string.

In future, I want to make fsinfo() capable of querying a context created by
fsopen() or fspick(), e.g.:

	fd = fsopen("ext4", 0);
	struct fsinfo_params params = {
		.flags		= FSINFO_FLAGS_QUERY_FSCONTEXT,
		.request	= FSINFO_ATTR_CONFIGURATION;
	};
	char buffer[65536];
	fsinfo(fd, NULL, &params, sizeof(params), &buffer, sizeof(buffer));

even if that context doesn't currently have a superblock attached.


David
---
The following changes since commit 841a0dfa511364fa9a8d67512e0643669f1f03e3:

  watch_queue: sample: Display mount tree change notifications (2020-08-03 12:15:38 +0100)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git tags/fsinfo-core-20200803

for you to fetch changes up to 19d687f5902b65a158c5bec904d65d0525ea4efe:

  samples: add error state information to test-fsinfo.c (2020-08-03 13:43:46 +0100)

----------------------------------------------------------------
Filesystem information

----------------------------------------------------------------
David Howells (15):
      fsinfo: Introduce a non-repeating system-unique superblock ID
      fsinfo: Add fsinfo() syscall to query filesystem information
      fsinfo: Provide a bitmap of the features a filesystem supports
      fsinfo: Allow retrieval of superblock devname, options and stats
      fsinfo: Allow fsinfo() to look up a mount object by ID
      fsinfo: Add a uniquifier ID to struct mount
      fsinfo: Allow mount information to be queried
      fsinfo: Allow mount topology and propagation info to be retrieved
      watch_queue: Mount event counters
      fsinfo: Provide notification overrun handling support
      fsinfo: sample: Mount listing program
      fsinfo: Add API documentation
      fsinfo: Add support for AFS
      fsinfo: Add support to ext4
      fsinfo: Add an attribute that lists all the visible mounts in a namespace

Jeff Layton (3):
      errseq: add a new errseq_scrape function
      vfs: allow fsinfo to fetch the current state of s_wb_err
      samples: add error state information to test-fsinfo.c

 Documentation/filesystems/fsinfo.rst        | 574 ++++++++++++++++++
 arch/alpha/kernel/syscalls/syscall.tbl      |   1 +
 arch/arm/tools/syscall.tbl                  |   1 +
 arch/arm64/include/asm/unistd.h             |   2 +-
 arch/arm64/include/asm/unistd32.h           |   2 +
 arch/ia64/kernel/syscalls/syscall.tbl       |   1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |   1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |   1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |   1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |   1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |   1 +
 arch/s390/kernel/syscalls/syscall.tbl       |   1 +
 arch/sh/kernel/syscalls/syscall.tbl         |   1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |   1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |   1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |   1 +
 fs/Kconfig                                  |   7 +
 fs/Makefile                                 |   1 +
 fs/afs/internal.h                           |   1 +
 fs/afs/super.c                              | 216 ++++++-
 fs/d_path.c                                 |   2 +-
 fs/ext4/Makefile                            |   1 +
 fs/ext4/ext4.h                              |   6 +
 fs/ext4/fsinfo.c                            |  97 +++
 fs/ext4/super.c                             |   3 +
 fs/fsinfo.c                                 | 748 +++++++++++++++++++++++
 fs/internal.h                               |  16 +
 fs/mount.h                                  |   6 +
 fs/mount_notify.c                           |  10 +-
 fs/namespace.c                              | 427 +++++++++++++-
 fs/super.c                                  |   2 +
 include/linux/errseq.h                      |   1 +
 include/linux/fs.h                          |   7 +
 include/linux/fsinfo.h                      | 112 ++++
 include/linux/syscalls.h                    |   4 +
 include/uapi/asm-generic/unistd.h           |   4 +-
 include/uapi/linux/fsinfo.h                 | 344 +++++++++++
 include/uapi/linux/mount.h                  |  13 +-
 kernel/sys_ni.c                             |   1 +
 lib/errseq.c                                |  33 +-
 samples/vfs/Makefile                        |   6 +-
 samples/vfs/test-fsinfo.c                   | 883 ++++++++++++++++++++++++++++
 samples/vfs/test-mntinfo.c                  | 277 +++++++++
 46 files changed, 3808 insertions(+), 14 deletions(-)
 create mode 100644 Documentation/filesystems/fsinfo.rst
 create mode 100644 fs/ext4/fsinfo.c
 create mode 100644 fs/fsinfo.c
 create mode 100644 include/linux/fsinfo.h
 create mode 100644 include/uapi/linux/fsinfo.h
 create mode 100644 samples/vfs/test-fsinfo.c
 create mode 100644 samples/vfs/test-mntinfo.c

Miklos Szeredi Aug. 3, 2020, 4:42 p.m. UTC | #1

On Mon, Aug 3, 2020 at 5:50 PM David Howells <dhowells@redhat.com> wrote:
>
>
> Hi Linus,
>
> Here's a set of patches that adds a system call, fsinfo(), that allows
> information about the VFS, mount topology, superblock and files to be
> retrieved.
>
> The patchset is based on top of the mount notifications patchset so that
> the mount notification mechanism can be hooked to provide event counters
> that can be retrieved with fsinfo(), thereby making it a lot faster to work
> out which mounts have changed.
>
> Note that there was a last minute change requested by Miklós: the event
> counter bits got moved from the mount notification patchset to this one.
> The counters got made atomic_long_t inside the kernel and __u64 in the
> UAPI.  The aggregate changes can be assessed by comparing pre-change tag,
> fsinfo-core-20200724 to the requested pull tag.
>
> Karel Zak has created preliminary patches that add support to libmount[*]
> and Ian Kent has started working on making systemd use these and mount
> notifications[**].

So why are you asking to pull at this stage?

Has anyone done a review of the patchset?

I think it's obvious that this API needs more work.  The integration
work done by Ian is a good direction, but it's not quite the full
validation and review that a complex new API needs.

At least that's my opinion.

Thanks,
Miklos

Ian Kent Aug. 4, 2020, 2:15 a.m. UTC | #2

On Mon, 2020-08-03 at 18:42 +0200, Miklos Szeredi wrote:
> On Mon, Aug 3, 2020 at 5:50 PM David Howells <dhowells@redhat.com>
> wrote:
> > 
> > Hi Linus,
> > 
> > Here's a set of patches that adds a system call, fsinfo(), that
> > allows
> > information about the VFS, mount topology, superblock and files to
> > be
> > retrieved.
> > 
> > The patchset is based on top of the mount notifications patchset so
> > that
> > the mount notification mechanism can be hooked to provide event
> > counters
> > that can be retrieved with fsinfo(), thereby making it a lot faster
> > to work
> > out which mounts have changed.
> > 
> > Note that there was a last minute change requested by Miklós: the
> > event
> > counter bits got moved from the mount notification patchset to this
> > one.
> > The counters got made atomic_long_t inside the kernel and __u64 in
> > the
> > UAPI.  The aggregate changes can be assessed by comparing pre-
> > change tag,
> > fsinfo-core-20200724 to the requested pull tag.
> > 
> > Karel Zak has created preliminary patches that add support to
> > libmount[*]
> > and Ian Kent has started working on making systemd use these and
> > mount
> > notifications[**].
> 
> So why are you asking to pull at this stage?
> 
> Has anyone done a review of the patchset?

I have been working with the patch set as it has evolved for quite a
while now.

I've been reading the kernel code quite a bit and forwarded questions
and minor changes to David as they arose.

As for a review, not specifically, but while the series implements a
rather large change it's surprisingly straight forward to read.

In the time I have been working with it I haven't noticed any problems
except for those few minor things that I reported to David early on (in
some cases accompanied by simple patches).

And more recently (obviously) I've been working with the mount
notifications changes and, from a readability POV, I find it's the
same as the fsinfo() code.

> 
> I think it's obvious that this API needs more work.  The integration
> work done by Ian is a good direction, but it's not quite the full
> validation and review that a complex new API needs.

Maybe but the system call is fundamental to making notifications useful
and, as I say, after working with it for quite a while I don't fell
there's missing features (that David hasn't added along the way) and
have found it provides what's needed for what I'm doing (for mount
notifications at least).

I'll be posting a github PR for systemd for discussion soon while I
get on with completing the systemd change. Like overflow handling and
meson build system changes to allow building with and without the
util-linux libmount changes.

So, ideally, I'd like to see the series merged, we've been working on
it for quite a considerable time now.

Ian

Miklos Szeredi Aug. 4, 2020, 2:36 p.m. UTC | #3

On Tue, Aug 4, 2020 at 4:15 AM Ian Kent <raven@themaw.net> wrote:
>
> On Mon, 2020-08-03 at 18:42 +0200, Miklos Szeredi wrote:
> > On Mon, Aug 3, 2020 at 5:50 PM David Howells <dhowells@redhat.com>
> > wrote:
> > >
> > > Hi Linus,
> > >
> > > Here's a set of patches that adds a system call, fsinfo(), that
> > > allows
> > > information about the VFS, mount topology, superblock and files to
> > > be
> > > retrieved.
> > >
> > > The patchset is based on top of the mount notifications patchset so
> > > that
> > > the mount notification mechanism can be hooked to provide event
> > > counters
> > > that can be retrieved with fsinfo(), thereby making it a lot faster
> > > to work
> > > out which mounts have changed.
> > >
> > > Note that there was a last minute change requested by Miklós: the
> > > event
> > > counter bits got moved from the mount notification patchset to this
> > > one.
> > > The counters got made atomic_long_t inside the kernel and __u64 in
> > > the
> > > UAPI.  The aggregate changes can be assessed by comparing pre-
> > > change tag,
> > > fsinfo-core-20200724 to the requested pull tag.
> > >
> > > Karel Zak has created preliminary patches that add support to
> > > libmount[*]
> > > and Ian Kent has started working on making systemd use these and
> > > mount
> > > notifications[**].
> >
> > So why are you asking to pull at this stage?
> >
> > Has anyone done a review of the patchset?
>
> I have been working with the patch set as it has evolved for quite a
> while now.
>
> I've been reading the kernel code quite a bit and forwarded questions
> and minor changes to David as they arose.
>
> As for a review, not specifically, but while the series implements a
> rather large change it's surprisingly straight forward to read.
>
> In the time I have been working with it I haven't noticed any problems
> except for those few minor things that I reported to David early on (in
> some cases accompanied by simple patches).
>
> And more recently (obviously) I've been working with the mount
> notifications changes and, from a readability POV, I find it's the
> same as the fsinfo() code.
>
> >
> > I think it's obvious that this API needs more work.  The integration
> > work done by Ian is a good direction, but it's not quite the full
> > validation and review that a complex new API needs.
>
> Maybe but the system call is fundamental to making notifications useful
> and, as I say, after working with it for quite a while I don't fell
> there's missing features (that David hasn't added along the way) and
> have found it provides what's needed for what I'm doing (for mount
> notifications at least).

Apart from the various issues related to the various mount ID's and
their sizes, my general comment is (and was always): why are we adding
a multiplexer for retrieval of mostly unrelated binary structures?

<linux/fsinfo.h> is 345 lines.  This is not a simple and clean API.

A simple and clean replacement API would be:

int get_mount_attribute(int dfd, const char *path, const char
*attr_name, char *value_buf, size_t buf_size, int flags);

No header file needed with dubiously sized binary values.

The only argument was performance, but apart from purely synthetic
microbenchmarks that hasn't been proven to be an issue.

And notice how similar the above interface is to getxattr(), or the
proposed readfile().  Where has the "everything is  a file" philosophy
gone?

I think we already lost that with the xattr API, that should have been
done in a way that fits this philosophy.  But given that we  have "/"
as the only special purpose char in filenames, and even repetitions
are allowed, it's hard to think of a good way to do that.  Pity.

Still I think it would be nice to have a general purpose attribute
retrieval API instead of the multiplicity of binary ioctls, xattrs,
etc.

Is that totally crazy?  Nobody missing the beauty in recently introduced APIs?

Thanks,
Miklos

Ian Kent Aug. 5, 2020, 1:33 a.m. UTC | #4

On Tue, 2020-08-04 at 16:36 +0200, Miklos Szeredi wrote:
> On Tue, Aug 4, 2020 at 4:15 AM Ian Kent <raven@themaw.net> wrote:
> > On Mon, 2020-08-03 at 18:42 +0200, Miklos Szeredi wrote:
> > > On Mon, Aug 3, 2020 at 5:50 PM David Howells <dhowells@redhat.com
> > > >
> > > wrote:
> > > > Hi Linus,
> > > > 
> > > > Here's a set of patches that adds a system call, fsinfo(), that
> > > > allows
> > > > information about the VFS, mount topology, superblock and files
> > > > to
> > > > be
> > > > retrieved.
> > > > 
> > > > The patchset is based on top of the mount notifications
> > > > patchset so
> > > > that
> > > > the mount notification mechanism can be hooked to provide event
> > > > counters
> > > > that can be retrieved with fsinfo(), thereby making it a lot
> > > > faster
> > > > to work
> > > > out which mounts have changed.
> > > > 
> > > > Note that there was a last minute change requested by Miklós:
> > > > the
> > > > event
> > > > counter bits got moved from the mount notification patchset to
> > > > this
> > > > one.
> > > > The counters got made atomic_long_t inside the kernel and __u64
> > > > in
> > > > the
> > > > UAPI.  The aggregate changes can be assessed by comparing pre-
> > > > change tag,
> > > > fsinfo-core-20200724 to the requested pull tag.
> > > > 
> > > > Karel Zak has created preliminary patches that add support to
> > > > libmount[*]
> > > > and Ian Kent has started working on making systemd use these
> > > > and
> > > > mount
> > > > notifications[**].
> > > 
> > > So why are you asking to pull at this stage?
> > > 
> > > Has anyone done a review of the patchset?
> > 
> > I have been working with the patch set as it has evolved for quite
> > a
> > while now.
> > 
> > I've been reading the kernel code quite a bit and forwarded
> > questions
> > and minor changes to David as they arose.
> > 
> > As for a review, not specifically, but while the series implements
> > a
> > rather large change it's surprisingly straight forward to read.
> > 
> > In the time I have been working with it I haven't noticed any
> > problems
> > except for those few minor things that I reported to David early on
> > (in
> > some cases accompanied by simple patches).
> > 
> > And more recently (obviously) I've been working with the mount
> > notifications changes and, from a readability POV, I find it's the
> > same as the fsinfo() code.
> > 
> > > I think it's obvious that this API needs more work.  The
> > > integration
> > > work done by Ian is a good direction, but it's not quite the full
> > > validation and review that a complex new API needs.
> > 
> > Maybe but the system call is fundamental to making notifications
> > useful
> > and, as I say, after working with it for quite a while I don't fell
> > there's missing features (that David hasn't added along the way)
> > and
> > have found it provides what's needed for what I'm doing (for mount
> > notifications at least).
> 
> Apart from the various issues related to the various mount ID's and
> their sizes, my general comment is (and was always): why are we
> adding
> a multiplexer for retrieval of mostly unrelated binary structures?
> 
> <linux/fsinfo.h> is 345 lines.  This is not a simple and clean API.
> 
> A simple and clean replacement API would be:
> 
> int get_mount_attribute(int dfd, const char *path, const char
> *attr_name, char *value_buf, size_t buf_size, int flags);
> 
> No header file needed with dubiously sized binary values.
> 
> The only argument was performance, but apart from purely synthetic
> microbenchmarks that hasn't been proven to be an issue.
> 
> And notice how similar the above interface is to getxattr(), or the
> proposed readfile().  Where has the "everything is  a file"
> philosophy
> gone?

Maybe, but that philosophy (in a roundabout way) is what's resulted
in some of the problems we now have. Granted it's blind application
of that philosophy rather than the philosophy itself but that is
what happens.

I get that your comments are driven by the way that philosophy should
be applied which is more of a "if it works best doing it that way then
do it that way, and that's usually a file".

In this case there is a logical division of various types of file
system information and the underlying suggestion is maybe it's time
to move away from the "everything is a file" hard and fast rule,
and get rid of some of the problems that have resulted from it.

The notifications is an example, yes, the delivery mechanism is
a "file" but the design of the queueing mechanism makes a lot of
sense for the throughput that's going to be needed as time marches
on. Then there's different sub-systems each with unique information
that needs to be deliverable some other way because delivering "all"
the information via the notification would be just plain wrong so
a multi-faceted information delivery mechanism makes the most
sense to allow specific targeted retrieval of individual items of
information.

But that also supposes your at least open to the idea that "maybe
not everything should be a file".

> 
> I think we already lost that with the xattr API, that should have
> been
> done in a way that fits this philosophy.  But given that we  have "/"
> as the only special purpose char in filenames, and even repetitions
> are allowed, it's hard to think of a good way to do that.  Pity.
> 
> Still I think it would be nice to have a general purpose attribute
> retrieval API instead of the multiplicity of binary ioctls, xattrs,
> etc.
> 
> Is that totally crazy?  Nobody missing the beauty in recently
> introduced APIs?
> 
> Thanks,
> Miklos

Miklos Szeredi Aug. 5, 2020, 8 a.m. UTC | #5

On Wed, Aug 5, 2020 at 3:33 AM Ian Kent <raven@themaw.net> wrote:
>

> On Tue, 2020-08-04 at 16:36 +0200, Miklos Szeredi wrote:
> > And notice how similar the above interface is to getxattr(), or the
> > proposed readfile().  Where has the "everything is  a file"
> > philosophy
> > gone?
>
> Maybe, but that philosophy (in a roundabout way) is what's resulted
> in some of the problems we now have. Granted it's blind application
> of that philosophy rather than the philosophy itself but that is
> what happens.

Agree.   What people don't seem to realize, even though there are
blindingly obvious examples, that binary interfaces like the proposed
fsinfo(2) syscall can also result in a multitude of problems at the
same time as solving some others.

There's no magic solution in API design,  it's not balck and white.
We just need to strive for a good enough solution.  The problem seems
to be that trying to discuss the merits of other approaches seems to
hit a brick wall.  We just see repeated pull requests from David,
without any real discussion of the proposed alternatives.

>
> I get that your comments are driven by the way that philosophy should
> be applied which is more of a "if it works best doing it that way then
> do it that way, and that's usually a file".
>
> In this case there is a logical division of various types of file
> system information and the underlying suggestion is maybe it's time
> to move away from the "everything is a file" hard and fast rule,
> and get rid of some of the problems that have resulted from it.
>
> The notifications is an example, yes, the delivery mechanism is
> a "file" but the design of the queueing mechanism makes a lot of
> sense for the throughput that's going to be needed as time marches
> on. Then there's different sub-systems each with unique information
> that needs to be deliverable some other way because delivering "all"
> the information via the notification would be just plain wrong so
> a multi-faceted information delivery mechanism makes the most
> sense to allow specific targeted retrieval of individual items of
> information.
>
> But that also supposes your at least open to the idea that "maybe
> not everything should be a file".

Sure.  I've learned pragmatism, although idealist at heart.  And I'm
not saying all API's from David are shit.  statx(2) is doing fine.
It's a simple binary interface that does its job well.   Compare the
header files for statx and fsinfo, though, and maybe you'll see what
I'm getting at...

Thanks,
Miklos

Miklos Szeredi Aug. 5, 2020, 8:24 a.m. UTC | #6

On Tue, Aug 4, 2020 at 4:36 PM Miklos Szeredi <miklos@szeredi.hu> wrote:

> I think we already lost that with the xattr API, that should have been
> done in a way that fits this philosophy.  But given that we  have "/"
> as the only special purpose char in filenames, and even repetitions
> are allowed, it's hard to think of a good way to do that.  Pity.

One way this could be solved is to allow opting into an alternative
path resolution mode.

E.g.
  openat(AT_FDCWD, "foo/bar//mnt/info", O_RDONLY | O_ALT);

Yes, the implementation might be somewhat tricky, but that's another
question.  Also I'm pretty sure that we should be reducing the
POSIX-ness of anything below "//" to the bare minimum.  No seeking,
etc....

I think this would open up some nice possibilities beyond the fsinfo thing.

Thanks,
Miklos

Ian Kent Aug. 5, 2020, 11:13 a.m. UTC | #7

On Wed, 2020-08-05 at 10:00 +0200, Miklos Szeredi wrote:
> On Wed, Aug 5, 2020 at 3:33 AM Ian Kent <raven@themaw.net> wrote:
> > On Tue, 2020-08-04 at 16:36 +0200, Miklos Szeredi wrote:
> > > And notice how similar the above interface is to getxattr(), or
> > > the
> > > proposed readfile().  Where has the "everything is  a file"
> > > philosophy
> > > gone?
> > 
> > Maybe, but that philosophy (in a roundabout way) is what's resulted
> > in some of the problems we now have. Granted it's blind application
> > of that philosophy rather than the philosophy itself but that is
> > what happens.
> 
> Agree.   What people don't seem to realize, even though there are
> blindingly obvious examples, that binary interfaces like the proposed
> fsinfo(2) syscall can also result in a multitude of problems at the
> same time as solving some others.
> 
> There's no magic solution in API design,  it's not balck and white.
> We just need to strive for a good enough solution.  The problem seems
> to be that trying to discuss the merits of other approaches seems to
> hit a brick wall.  We just see repeated pull requests from David,
> without any real discussion of the proposed alternatives.
> 
> > I get that your comments are driven by the way that philosophy
> > should
> > be applied which is more of a "if it works best doing it that way
> > then
> > do it that way, and that's usually a file".
> > 
> > In this case there is a logical division of various types of file
> > system information and the underlying suggestion is maybe it's time
> > to move away from the "everything is a file" hard and fast rule,
> > and get rid of some of the problems that have resulted from it.
> > 
> > The notifications is an example, yes, the delivery mechanism is
> > a "file" but the design of the queueing mechanism makes a lot of
> > sense for the throughput that's going to be needed as time marches
> > on. Then there's different sub-systems each with unique information
> > that needs to be deliverable some other way because delivering
> > "all"
> > the information via the notification would be just plain wrong so
> > a multi-faceted information delivery mechanism makes the most
> > sense to allow specific targeted retrieval of individual items of
> > information.
> > 
> > But that also supposes your at least open to the idea that "maybe
> > not everything should be a file".
> 
> Sure.  I've learned pragmatism, although idealist at heart.  And I'm
> not saying all API's from David are shit.  statx(2) is doing fine.
> It's a simple binary interface that does its job well.   Compare the
> header files for statx and fsinfo, though, and maybe you'll see what
> I'm getting at...

Yeah, but I'm biased so not much joy there ... ;)

Ian

Miklos Szeredi Aug. 11, 2020, 1:54 p.m. UTC | #8

On Wed, Aug 05, 2020 at 10:24:23AM +0200, Miklos Szeredi wrote:
> On Tue, Aug 4, 2020 at 4:36 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> 
> > I think we already lost that with the xattr API, that should have been
> > done in a way that fits this philosophy.  But given that we  have "/"
> > as the only special purpose char in filenames, and even repetitions
> > are allowed, it's hard to think of a good way to do that.  Pity.
> 
> One way this could be solved is to allow opting into an alternative
> path resolution mode.
> 
> E.g.
>   openat(AT_FDCWD, "foo/bar//mnt/info", O_RDONLY | O_ALT);

Proof of concept patch and test program below.

Opted for triple slash in the hope that just maybe we could add a global
/proc/sys/fs/resolve_alt knob to optionally turn on alternative (non-POSIX) path
resolution without breaking too many things.  Will try that later...

Comments?

Thanks,
Miklos

cat_alt.c:
-------- >8 --------
#define _GNU_SOURCE
#include <err.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <stdlib.h>
#include <linux/unistd.h>
#include <linux/openat2.h>

#define RESOLVE_ALT		0x20 /* Alternative path walk mode where
					multiple slashes have special meaning */

int main(int argc, char *argv[])
{
	struct open_how how = {
		.flags = O_RDONLY,
		.resolve = RESOLVE_ALT,
	};
	int fd, res, i;
	char buf[65536], *end;
	const char *path = argv[1];
	int dfd = AT_FDCWD;

	if (argc < 2 || argc > 4)
		errx(1, "usage: %s path [dirfd] [--nofollow]", argv[0]);


	for (i = 2; i < argc; i++) {
		if (strcmp(argv[i], "--nofollow") == 0) {
			how.flags |= O_NOFOLLOW;
		} else {
			dfd = strtoul(argv[i], &end, 0);
			if (end == argv[i] || *end)
				errx(1, "invalid dirfd: %s", argv[i]);
		}
	}

	fd = syscall(__NR_openat2, dfd, path, &how, sizeof(how));
	if (fd == -1)
		err(1, "failed to open %s", argv[1]);

	while (1) {
		res = read(fd, buf, sizeof(buf));
		if (res == -1)
			err(1, "failed to read file");
		if (res == 0)
			break;

		write(1, buf, res);
	}
	close(fd);
	return 0;
}
-------- >8 --------

---
 fs/Makefile                  |    2 
 fs/file_table.c              |   70 ++++++++++++++--------
 fs/fsmeta.c                  |  135 +++++++++++++++++++++++++++++++++++++++++++
 fs/internal.h                |    9 ++
 fs/mount.h                   |    4 +
 fs/namei.c                   |   77 +++++++++++++++++++++---
 fs/namespace.c               |   12 +++
 fs/open.c                    |    2 
 fs/proc_namespace.c          |    2 
 include/linux/fcntl.h        |    2 
 include/linux/namei.h        |    3 
 include/uapi/linux/magic.h   |    1 
 include/uapi/linux/openat2.h |    2 
 13 files changed, 282 insertions(+), 39 deletions(-)

--- a/fs/Makefile
+++ b/fs/Makefile
@@ -13,7 +13,7 @@ obj-y :=	open.o read_write.o file_table.
 		seq_file.o xattr.o libfs.o fs-writeback.o \
 		pnode.o splice.o sync.o utimes.o d_path.o \
 		stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
-		fs_types.o fs_context.o fs_parser.o fsopen.o
+		fs_types.o fs_context.o fs_parser.o fsopen.o fsmeta.o \
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=	buffer.o block_dev.o direct-io.o mpage.o
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -178,22 +178,9 @@ struct file *alloc_empty_file_noaccount(
 	return f;
 }
 
-/**
- * alloc_file - allocate and initialize a 'struct file'
- *
- * @path: the (dentry, vfsmount) pair for the new file
- * @flags: O_... flags with which the new file will be opened
- * @fop: the 'struct file_operations' for the new file
- */
-static struct file *alloc_file(const struct path *path, int flags,
-		const struct file_operations *fop)
+static void init_file(struct file *file, const struct path *path, int flags,
+		      const struct file_operations *fop)
 {
-	struct file *file;
-
-	file = alloc_empty_file(flags, current_cred());
-	if (IS_ERR(file))
-		return file;
-
 	file->f_path = *path;
 	file->f_inode = path->dentry->d_inode;
 	file->f_mapping = path->dentry->d_inode->i_mapping;
@@ -209,31 +196,66 @@ static struct file *alloc_file(const str
 	file->f_op = fop;
 	if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
 		i_readcount_inc(path->dentry->d_inode);
+}
+
+/**
+ * alloc_file - allocate and initialize a 'struct file'
+ *
+ * @path: the (dentry, vfsmount) pair for the new file
+ * @flags: O_... flags with which the new file will be opened
+ * @fop: the 'struct file_operations' for the new file
+ */
+static struct file *alloc_file(const struct path *path, int flags,
+		const struct file_operations *fop)
+{
+	struct file *file;
+
+	file = alloc_empty_file(flags, current_cred());
+	if (IS_ERR(file))
+		return file;
+
+	init_file(file, path, flags, fop);
+
 	return file;
 }
 
-struct file *alloc_file_pseudo(struct inode *inode, struct vfsmount *mnt,
-				const char *name, int flags,
-				const struct file_operations *fops)
+int init_file_pseudo(struct file *file, struct inode *inode,
+		     struct vfsmount *mnt, const char *name, int flags,
+		     const struct file_operations *fops)
 {
 	static const struct dentry_operations anon_ops = {
 		.d_dname = simple_dname
 	};
 	struct qstr this = QSTR_INIT(name, strlen(name));
 	struct path path;
-	struct file *file;
 
 	path.dentry = d_alloc_pseudo(mnt->mnt_sb, &this);
 	if (!path.dentry)
-		return ERR_PTR(-ENOMEM);
+		return -ENOMEM;
 	if (!mnt->mnt_sb->s_d_op)
 		d_set_d_op(path.dentry, &anon_ops);
 	path.mnt = mntget(mnt);
 	d_instantiate(path.dentry, inode);
-	file = alloc_file(&path, flags, fops);
-	if (IS_ERR(file)) {
-		ihold(inode);
-		path_put(&path);
+	init_file(file, &path, flags, fops);
+
+	return 0;
+}
+
+struct file *alloc_file_pseudo(struct inode *inode, struct vfsmount *mnt,
+				const char *name, int flags,
+				const struct file_operations *fops)
+{
+	struct file *file;
+	int err;
+
+	file = alloc_empty_file(flags, current_cred());
+	if (IS_ERR(file))
+		return file;
+
+	err = init_file_pseudo(file, inode, mnt, name, flags, fops);
+	if (err) {
+		fput(file);
+		file = ERR_PTR(err);
 	}
 	return file;
 }
--- /dev/null
+++ b/fs/fsmeta.c
@@ -0,0 +1,135 @@
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/magic.h>
+#include <linux/seq_file.h>
+#include <linux/fs_struct.h>
+#include <linux/pseudo_fs.h>
+
+#include "mount.h"
+#include "internal.h"
+
+static struct vfsmount *fsmeta_mnt;
+static struct inode *fsmeta_inode;
+
+
+static struct vfsmount *fsmeta_mnt_info_get_mnt(struct seq_file *seq)
+{
+	struct proc_mounts *p = seq->private;
+
+	return &list_entry(p->cursor.mnt_list.next, struct mount, mnt_list)->mnt;
+}
+
+static void *fsmeta_mnt_info_start(struct seq_file *seq, loff_t *pos)
+{
+	mnt_namespace_lock_read();
+	return *pos == 0 ? fsmeta_mnt_info_get_mnt(seq) : NULL;
+}
+
+static void *fsmeta_mnt_info_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	++*pos;
+	return NULL;
+}
+
+static void fsmeta_mnt_info_stop(struct seq_file *seq, void *v)
+{
+	mnt_namespace_unlock_read();
+}
+
+static int fsmeta_mnt_info_show(struct seq_file *seq, void *v)
+{
+	return show_mountinfo(seq, v);
+}
+
+static const struct seq_operations fsmeta_mnt_info_sops = {
+	.start = fsmeta_mnt_info_start,
+	.next = fsmeta_mnt_info_next,
+	.stop = fsmeta_mnt_info_stop,
+	.show = fsmeta_mnt_info_show,
+};
+
+static int fsmeta_mnt_info_release(struct inode *inode, struct file *file)
+{
+	if (file->private_data) {
+		struct seq_file *seq = file->private_data;
+		struct proc_mounts *p = seq->private;
+
+		mntput(fsmeta_mnt_info_get_mnt(seq));
+		path_put(&p->root);
+
+		return seq_release_private(inode, file);
+	}
+	return 0;
+}
+
+static const struct file_operations fsmeta_mnt_info_fops = {
+	.release = fsmeta_mnt_info_release,
+	.read = seq_read,
+	.llseek = no_llseek,
+};
+
+static int fsmeta_mnt_info_open(struct file *file, const struct path *path,
+				const struct open_flags *op)
+{
+	struct proc_mounts *p;
+	int err;
+
+	err = init_file_pseudo(file, fsmeta_inode, fsmeta_mnt, "[mnt.info]",
+			       op->open_flag, &fsmeta_mnt_info_fops);
+	if (err)
+		return err;
+	/*
+	 * This reference is now sunk in file->f_path.dentry->d_inode and will
+	 * be released by fput()
+	 */
+	ihold(fsmeta_inode);
+
+	err = seq_open_private(file, &fsmeta_mnt_info_sops, sizeof(*p));
+	if (err)
+		return err;
+
+	p = ((struct seq_file *)file->private_data)->private;
+	get_fs_root(current->fs, &p->root);
+	p->cursor.mnt_list.next = &real_mount(mntget(path->mnt))->mnt_list;
+
+	return 0;
+}
+
+int fsmeta_open(const char *meta_name, const struct path *path,
+	      struct file *file, const struct open_flags *op)
+{
+	if (op->open_flag & ~(O_LARGEFILE | O_CLOEXEC | O_NOFOLLOW))
+		return -EINVAL;
+
+	if (strcmp(meta_name, "mnt/info") == 0)
+		return fsmeta_mnt_info_open(file, path, op);
+
+	pr_info("invalid fsmeta file <%s> on %pd4\n", meta_name, path->dentry);
+	return -EINVAL;
+}
+
+static int fsmeta_init_fs_context(struct fs_context *fc)
+{
+	return init_pseudo(fc, FSMETA_MAGIC) ? 0 : -ENOMEM;
+}
+
+static struct file_system_type fsmeta_fs_type = {
+	.name		= "fsmeta",
+	.init_fs_context = fsmeta_init_fs_context,
+	.kill_sb	= kill_anon_super,
+};
+
+static int __init fsmeta_init(void)
+{
+	fsmeta_mnt = kern_mount(&fsmeta_fs_type);
+	if (IS_ERR(fsmeta_mnt))
+		panic("fsmeta_init() kernel mount failed (%ld)\n", PTR_ERR(fsmeta_mnt));
+
+	fsmeta_inode = alloc_anon_inode(fsmeta_mnt->mnt_sb);
+	if (IS_ERR(fsmeta_inode))
+		panic("fsmeta_init() inode allocation failed (%ld)\n", PTR_ERR(fsmeta_inode));
+
+	return 0;
+}
+fs_initcall(fsmeta_init);
+
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -99,6 +99,9 @@ extern void chroot_fs_refs(const struct
  */
 extern struct file *alloc_empty_file(int, const struct cred *);
 extern struct file *alloc_empty_file_noaccount(int, const struct cred *);
+extern int init_file_pseudo(struct file *file, struct inode *inode,
+			    struct vfsmount *mnt, const char *name, int flags,
+			    const struct file_operations *fops);
 
 /*
  * super.c
@@ -185,3 +188,9 @@ int sb_init_dio_done_wq(struct super_blo
  */
 int do_statx(int dfd, const char __user *filename, unsigned flags,
 	     unsigned int mask, struct statx __user *buffer);
+
+/*
+ * fs/fsmeta.c
+ */
+int fsmeta_open(const char *meta_name, const struct path *path,
+		struct file *file, const struct open_flags *op);
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -159,3 +159,7 @@ static inline bool is_anon_ns(struct mnt
 }
 
 extern void mnt_cursor_del(struct mnt_namespace *ns, struct mount *cursor);
+
+void mnt_namespace_lock_read(void);
+void mnt_namespace_unlock_read(void);
+int show_mountinfo(struct seq_file *m, struct vfsmount *mnt);
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2094,6 +2094,30 @@ static inline u64 hash_name(const void *
 
 #endif
 
+static int lookup_alt(const char *name, struct nameidata *nd)
+{
+	if ((nd->flags & LOOKUP_RCU) && unlazy_walk(nd) != 0)
+		return -ECHILD;
+
+	nd->last.name = name + 3;
+	nd->last_type = LAST_META;
+
+	return 0;
+}
+
+static bool is_alt(const char *name, struct nameidata *nd, int depth)
+{
+	if (!(nd->flags & LOOKUP_ALT))
+		return false;
+
+	/* no alternative lookup inside symlinks */
+	if (depth)
+		return false;
+
+	/* name[0] has already been verified to be a slash */
+	return name[1] == '/' && name[2] == '/' && name[3] != '/';
+}
+
 /*
  * Name resolution.
  * This is the basic name resolution function, turning a pathname into
@@ -2111,8 +2135,13 @@ static int link_path_walk(const char *na
 	nd->flags |= LOOKUP_PARENT;
 	if (IS_ERR(name))
 		return PTR_ERR(name);
-	while (*name=='/')
-		name++;
+	if (*name == '/') {
+		if (!is_alt(name, nd, depth)) {
+			do {
+				name++;
+			} while (*name == '/');
+		}
+	}
 	if (!*name)
 		return 0;
 
@@ -2122,6 +2151,9 @@ static int link_path_walk(const char *na
 		u64 hash_len;
 		int type;
 
+		if (*name == '/')
+			return lookup_alt(name, nd);
+
 		err = may_lookup(nd);
 		if (err)
 			return err;
@@ -2163,6 +2195,13 @@ static int link_path_walk(const char *na
 		 * If it wasn't NUL, we know it was '/'. Skip that
 		 * slash, and continue until no more slashes.
 		 */
+		if (is_alt(name, nd, depth)) {
+			link = walk_component(nd, WALK_TRAILING);
+			if (unlikely(link))
+				goto LINK;
+
+			return lookup_alt(name, nd);
+		}
 		do {
 			name++;
 		} while (unlikely(*name == '/'));
@@ -2183,6 +2222,7 @@ static int link_path_walk(const char *na
 			link = walk_component(nd, WALK_MORE);
 		}
 		if (unlikely(link)) {
+LINK:
 			if (IS_ERR(link))
 				return PTR_ERR(link);
 			/* a symlink to follow */
@@ -2239,11 +2279,11 @@ static const char *path_init(struct name
 	nd->path.dentry = NULL;
 
 	/* Absolute pathname -- fetch the root (LOOKUP_IN_ROOT uses nd->dfd). */
-	if (*s == '/' && !(flags & LOOKUP_IN_ROOT)) {
+	if (*s == '/' && !is_alt(s, nd, 0) && !(flags & LOOKUP_IN_ROOT)) {
 		error = nd_jump_root(nd);
 		if (unlikely(error))
 			return ERR_PTR(error);
-		return s;
+		return s + 1;
 	}
 
 	/* Relative pathname -- get the starting-point it is relative to. */
@@ -2272,7 +2312,8 @@ static const char *path_init(struct name
 
 		dentry = f.file->f_path.dentry;
 
-		if (*s && unlikely(!d_can_lookup(dentry))) {
+		if (*s && unlikely(!d_can_lookup(dentry)) &&
+		    !is_alt(s, nd, 0)) {
 			fdput(f);
 			return ERR_PTR(-ENOTDIR);
 		}
@@ -2303,6 +2344,9 @@ static const char *path_init(struct name
 
 static inline const char *lookup_last(struct nameidata *nd)
 {
+	if (nd->last_type == LAST_META)
+		return ERR_PTR(-EINVAL);
+
 	if (nd->last_type == LAST_NORM && nd->last.name[nd->last.len])
 		nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
 
@@ -2331,7 +2375,7 @@ static int path_lookupat(struct nameidat
 
 	while (!(err = link_path_walk(s, nd)) &&
 	       (s = lookup_last(nd)) != NULL)
-		;
+		nd->flags &= ~LOOKUP_ALT;
 	if (!err)
 		err = complete_walk(nd);
 
@@ -2410,9 +2454,15 @@ static struct filename *filename_parenta
 	if (unlikely(retval == -ESTALE))
 		retval = path_parentat(&nd, flags | LOOKUP_REVAL, parent);
 	if (likely(!retval)) {
-		*last = nd.last;
-		*type = nd.last_type;
-		audit_inode(name, parent->dentry, AUDIT_INODE_PARENT);
+		if (nd.last_type == LAST_META) {
+			path_put(parent);
+			putname(name);
+			name = ERR_PTR(-EINVAL);
+		} else {
+			*last = nd.last;
+			*type = nd.last_type;
+			audit_inode(name, parent->dentry, AUDIT_INODE_PARENT);
+		}
 	} else {
 		putname(name);
 		name = ERR_PTR(retval);
@@ -3123,6 +3173,10 @@ static const char *open_last_lookups(str
 	nd->flags |= op->intent;
 
 	if (nd->last_type != LAST_NORM) {
+		if (nd->last_type == LAST_META) {
+			return ERR_PTR(fsmeta_open(nd->last.name, &nd->path,
+						   file, op));
+		}
 		if (nd->depth)
 			put_link(nd);
 		return handle_dots(nd, nd->last_type);
@@ -3206,6 +3260,9 @@ static int do_open(struct nameidata *nd,
 	int acc_mode;
 	int error;
 
+	if (nd->last_type == LAST_META)
+		return 0;
+
 	if (!(file->f_mode & (FMODE_OPENED | FMODE_CREATED))) {
 		error = complete_walk(nd);
 		if (error)
@@ -3355,7 +3412,7 @@ static struct file *path_openat(struct n
 		const char *s = path_init(nd, flags);
 		while (!(error = link_path_walk(s, nd)) &&
 		       (s = open_last_lookups(nd, file, op)) != NULL)
-			;
+			nd->flags &= ~LOOKUP_ALT;
 		if (!error)
 			error = do_open(nd, file, op);
 		terminate_walk(nd);
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -69,7 +69,7 @@ static DEFINE_IDA(mnt_group_ida);
 static struct hlist_head *mount_hashtable __read_mostly;
 static struct hlist_head *mountpoint_hashtable __read_mostly;
 static struct kmem_cache *mnt_cache __read_mostly;
-static DECLARE_RWSEM(namespace_sem);
+DECLARE_RWSEM(namespace_sem);
 static HLIST_HEAD(unmounted);	/* protected by namespace_sem */
 static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */
 
@@ -1435,6 +1435,16 @@ static inline void namespace_lock(void)
 	down_write(&namespace_sem);
 }
 
+void mnt_namespace_lock_read(void)
+{
+	down_read(&namespace_sem);
+}
+
+void mnt_namespace_unlock_read(void)
+{
+	up_read(&namespace_sem);
+}
+
 enum umount_tree_flags {
 	UMOUNT_SYNC = 1,
 	UMOUNT_PROPAGATE = 2,
--- a/fs/open.c
+++ b/fs/open.c
@@ -1098,6 +1098,8 @@ inline int build_open_flags(const struct
 		lookup_flags |= LOOKUP_BENEATH;
 	if (how->resolve & RESOLVE_IN_ROOT)
 		lookup_flags |= LOOKUP_IN_ROOT;
+	if (how->resolve & RESOLVE_ALT)
+		lookup_flags |= LOOKUP_ALT;
 
 	op->lookup_flags = lookup_flags;
 	return 0;
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -128,7 +128,7 @@ static int show_vfsmnt(struct seq_file *
 	return err;
 }
 
-static int show_mountinfo(struct seq_file *m, struct vfsmount *mnt)
+int show_mountinfo(struct seq_file *m, struct vfsmount *mnt)
 {
 	struct proc_mounts *p = m->private;
 	struct mount *r = real_mount(mnt);
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -19,7 +19,7 @@
 /* List of all valid flags for the how->resolve argument: */
 #define VALID_RESOLVE_FLAGS \
 	(RESOLVE_NO_XDEV | RESOLVE_NO_MAGICLINKS | RESOLVE_NO_SYMLINKS | \
-	 RESOLVE_BENEATH | RESOLVE_IN_ROOT)
+	 RESOLVE_BENEATH | RESOLVE_IN_ROOT | RESOLVE_ALT)
 
 /* List of all open_how "versions". */
 #define OPEN_HOW_SIZE_VER0	24 /* sizeof first published struct */
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -15,7 +15,7 @@ enum { MAX_NESTED_LINKS = 8 };
 /*
  * Type of the last component on LOOKUP_PARENT
  */
-enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT};
+enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_META};
 
 /* pathwalk mode */
 #define LOOKUP_FOLLOW		0x0001	/* follow links at the end */
@@ -27,6 +27,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LA
 
 #define LOOKUP_REVAL		0x0020	/* tell ->d_revalidate() to trust no cache */
 #define LOOKUP_RCU		0x0040	/* RCU pathwalk mode; semi-internal */
+#define LOOKUP_ALT		0x200000 /* Alternative path walk mode */
 
 /* These tell filesystem methods that we are dealing with the final component... */
 #define LOOKUP_OPEN		0x0100	/* ... in open */
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -88,6 +88,7 @@
 #define BPF_FS_MAGIC		0xcafe4a11
 #define AAFS_MAGIC		0x5a3c69f0
 #define ZONEFS_MAGIC		0x5a4f4653
+#define FSMETA_MAGIC		0x9f8ea387
 
 /* Since UDF 2.01 is ISO 13346 based... */
 #define UDF_SUPER_MAGIC		0x15013346
--- a/include/uapi/linux/openat2.h
+++ b/include/uapi/linux/openat2.h
@@ -35,5 +35,7 @@ struct open_how {
 #define RESOLVE_IN_ROOT		0x10 /* Make all jumps to "/" and ".."
 					be scoped inside the dirfd
 					(similar to chroot(2)). */
+#define RESOLVE_ALT		0x20 /* Alternative path walk mode where
+					multiple slashes have special meaning */
 
 #endif /* _UAPI_LINUX_OPENAT2_H */

Al Viro Aug. 11, 2020, 2:08 p.m. UTC | #9

On Tue, Aug 11, 2020 at 03:54:19PM +0200, Miklos Szeredi wrote:
> On Wed, Aug 05, 2020 at 10:24:23AM +0200, Miklos Szeredi wrote:
> > On Tue, Aug 4, 2020 at 4:36 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > 
> > > I think we already lost that with the xattr API, that should have been
> > > done in a way that fits this philosophy.  But given that we  have "/"
> > > as the only special purpose char in filenames, and even repetitions
> > > are allowed, it's hard to think of a good way to do that.  Pity.
> > 
> > One way this could be solved is to allow opting into an alternative
> > path resolution mode.
> > 
> > E.g.
> >   openat(AT_FDCWD, "foo/bar//mnt/info", O_RDONLY | O_ALT);
> 
> Proof of concept patch and test program below.
> 
> Opted for triple slash in the hope that just maybe we could add a global
> /proc/sys/fs/resolve_alt knob to optionally turn on alternative (non-POSIX) path
> resolution without breaking too many things.  Will try that later...
> 
> Comments?

Hell, NO.  This is unspeakably tasteless.  And full of lovely corner cases wrt
symlink bodies, etc.

Consider that one NAKed.  I'm seriously unhappy with the entire fsinfo thing
in general, but this one is really over the top.

Miklos Szeredi Aug. 11, 2020, 2:22 p.m. UTC | #10

On Tue, Aug 11, 2020 at 4:08 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Tue, Aug 11, 2020 at 03:54:19PM +0200, Miklos Szeredi wrote:
> > On Wed, Aug 05, 2020 at 10:24:23AM +0200, Miklos Szeredi wrote:
> > > On Tue, Aug 4, 2020 at 4:36 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > >
> > > > I think we already lost that with the xattr API, that should have been
> > > > done in a way that fits this philosophy.  But given that we  have "/"
> > > > as the only special purpose char in filenames, and even repetitions
> > > > are allowed, it's hard to think of a good way to do that.  Pity.
> > >
> > > One way this could be solved is to allow opting into an alternative
> > > path resolution mode.
> > >
> > > E.g.
> > >   openat(AT_FDCWD, "foo/bar//mnt/info", O_RDONLY | O_ALT);
> >
> > Proof of concept patch and test program below.
> >
> > Opted for triple slash in the hope that just maybe we could add a global
> > /proc/sys/fs/resolve_alt knob to optionally turn on alternative (non-POSIX) path
> > resolution without breaking too many things.  Will try that later...
> >
> > Comments?
>
> Hell, NO.  This is unspeakably tasteless.  And full of lovely corner cases wrt
> symlink bodies, etc.

It's disabled inside symlink body resolution.

Rules are simple:

 - strip off trailing part after first instance of ///
 - perform path lookup as normal
 - resolve meta path after /// on result of normal lookup

Thanks,
Miklos

Al Viro Aug. 11, 2020, 2:31 p.m. UTC | #11

On Tue, Aug 11, 2020 at 04:22:19PM +0200, Miklos Szeredi wrote:
> On Tue, Aug 11, 2020 at 4:08 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > On Tue, Aug 11, 2020 at 03:54:19PM +0200, Miklos Szeredi wrote:
> > > On Wed, Aug 05, 2020 at 10:24:23AM +0200, Miklos Szeredi wrote:
> > > > On Tue, Aug 4, 2020 at 4:36 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > >
> > > > > I think we already lost that with the xattr API, that should have been
> > > > > done in a way that fits this philosophy.  But given that we  have "/"
> > > > > as the only special purpose char in filenames, and even repetitions
> > > > > are allowed, it's hard to think of a good way to do that.  Pity.
> > > >
> > > > One way this could be solved is to allow opting into an alternative
> > > > path resolution mode.
> > > >
> > > > E.g.
> > > >   openat(AT_FDCWD, "foo/bar//mnt/info", O_RDONLY | O_ALT);
> > >
> > > Proof of concept patch and test program below.
> > >
> > > Opted for triple slash in the hope that just maybe we could add a global
> > > /proc/sys/fs/resolve_alt knob to optionally turn on alternative (non-POSIX) path
> > > resolution without breaking too many things.  Will try that later...
> > >
> > > Comments?
> >
> > Hell, NO.  This is unspeakably tasteless.  And full of lovely corner cases wrt
> > symlink bodies, etc.
> 
> It's disabled inside symlink body resolution.
> 
> Rules are simple:
> 
>  - strip off trailing part after first instance of ///
>  - perform path lookup as normal
>  - resolve meta path after /// on result of normal lookup

... and interpolation of relative symlink body into the pathname does change
behaviour now, *including* the cases when said symlink body does not contain
that triple-X^Hslash garbage.  Wonderful...

Al Viro Aug. 11, 2020, 2:36 p.m. UTC | #12

On Tue, Aug 11, 2020 at 10:33:59AM -0400, Tang Jiye wrote:
> anyone knows how to post a question?

Generally the way you just have, except that you generally
put it *after* the relevant parts of the quoted text (and
removes the irrelevant ones).

Miklos Szeredi Aug. 11, 2020, 2:36 p.m. UTC | #13

On Tue, Aug 11, 2020 at 4:31 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Tue, Aug 11, 2020 at 04:22:19PM +0200, Miklos Szeredi wrote:
> > On Tue, Aug 11, 2020 at 4:08 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > On Tue, Aug 11, 2020 at 03:54:19PM +0200, Miklos Szeredi wrote:
> > > > On Wed, Aug 05, 2020 at 10:24:23AM +0200, Miklos Szeredi wrote:
> > > > > On Tue, Aug 4, 2020 at 4:36 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > > >
> > > > > > I think we already lost that with the xattr API, that should have been
> > > > > > done in a way that fits this philosophy.  But given that we  have "/"
> > > > > > as the only special purpose char in filenames, and even repetitions
> > > > > > are allowed, it's hard to think of a good way to do that.  Pity.
> > > > >
> > > > > One way this could be solved is to allow opting into an alternative
> > > > > path resolution mode.
> > > > >
> > > > > E.g.
> > > > >   openat(AT_FDCWD, "foo/bar//mnt/info", O_RDONLY | O_ALT);
> > > >
> > > > Proof of concept patch and test program below.
> > > >
> > > > Opted for triple slash in the hope that just maybe we could add a global
> > > > /proc/sys/fs/resolve_alt knob to optionally turn on alternative (non-POSIX) path
> > > > resolution without breaking too many things.  Will try that later...
> > > >
> > > > Comments?
> > >
> > > Hell, NO.  This is unspeakably tasteless.  And full of lovely corner cases wrt
> > > symlink bodies, etc.
> >
> > It's disabled inside symlink body resolution.
> >
> > Rules are simple:
> >
> >  - strip off trailing part after first instance of ///
> >  - perform path lookup as normal
> >  - resolve meta path after /// on result of normal lookup
>
> ... and interpolation of relative symlink body into the pathname does change
> behaviour now, *including* the cases when said symlink body does not contain
> that triple-X^Hslash garbage.  Wonderful...

Can you please explain?

Thanks,
Miklos

Al Viro Aug. 11, 2020, 2:42 p.m. UTC | #14

On Tue, Aug 11, 2020 at 04:36:32PM +0200, Miklos Szeredi wrote:

> > >  - strip off trailing part after first instance of ///
> > >  - perform path lookup as normal
> > >  - resolve meta path after /// on result of normal lookup
> >
> > ... and interpolation of relative symlink body into the pathname does change
> > behaviour now, *including* the cases when said symlink body does not contain
> > that triple-X^Hslash garbage.  Wonderful...
> 
> Can you please explain?

Currently substituting the body of a relative symlink in place of its name
results in equivalent pathname.  With your patch that is not just no longer
true, it's no longer true even when the symlink body does not contain that
/// kludge - it can come in part from the symlink body and in part from the
rest of pathname.  I.e. you can't even tell if substitution is an equivalent
replacement by looking at the symlink body alone.

Miklos Szeredi Aug. 11, 2020, 2:47 p.m. UTC | #15

On Tue, Aug 11, 2020 at 4:42 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Tue, Aug 11, 2020 at 04:36:32PM +0200, Miklos Szeredi wrote:
>
> > > >  - strip off trailing part after first instance of ///
> > > >  - perform path lookup as normal
> > > >  - resolve meta path after /// on result of normal lookup
> > >
> > > ... and interpolation of relative symlink body into the pathname does change
> > > behaviour now, *including* the cases when said symlink body does not contain
> > > that triple-X^Hslash garbage.  Wonderful...
> >
> > Can you please explain?
>
> Currently substituting the body of a relative symlink in place of its name
> results in equivalent pathname.

Except proc symlinks, that is.

>  With your patch that is not just no longer
> true, it's no longer true even when the symlink body does not contain that
> /// kludge - it can come in part from the symlink body and in part from the
> rest of pathname.  I.e. you can't even tell if substitution is an equivalent
> replacement by looking at the symlink body alone.

Yes, that's true not just for symlink bodies but any concatenation of
two path segments.

That's why it's enabled with RESOLVE_ALT.  I've said that I plan to
experiment with turning this on globally, but that doesn't mean it's
necessarily a good idea.  The posted patch contains nothing of that
sort.

Thanks,
Miklos

Linus Torvalds Aug. 11, 2020, 3:20 p.m. UTC | #16

[ I missed the beginning of this discussion, so maybe this was already
suggested ]

On Tue, Aug 11, 2020 at 6:54 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> >
> > E.g.
> >   openat(AT_FDCWD, "foo/bar//mnt/info", O_RDONLY | O_ALT);
>
> Proof of concept patch and test program below.

I don't think this works for the reasons Al says, but a slight
modification might.

IOW, if you do something more along the lines of

       fd = open(""foo/bar", O_PATH);
       metadatafd = openat(fd, "metadataname", O_ALT);

it might be workable.

So you couldn't do it with _one_ pathname, because that is always
fundamentally going to hit pathname lookup rules.

But if you start a new path lookup with new rules, that's fine.

This is what I think xattrs should always have done, because they are
broken garbage.

In fact, if we do it right, I think we could have "getxattr()" be 100%
equivalent to (modulo all the error handling that this doesn't do, of
course):

  ssize_t getxattr(const char *path, const char *name,
                        void *value, size_t size)
  {
     int fd, attrfd;

     fd = open(path, O_PATH);
     attrfd = openat(fd, name, O_ALT);
     close(fd);
     read(attrfd, value, size);
     close(attrfd);
  }

and you'd still use getxattr() and friends as a shorthand (and for
POSIX compatibility), but internally in the kernel we'd have a
interface around that "xattrs are just file handles" model.

               Linus

Miklos Szeredi Aug. 11, 2020, 3:30 p.m. UTC | #17

On Tue, Aug 11, 2020 at 5:20 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> [ I missed the beginning of this discussion, so maybe this was already
> suggested ]
>
> On Tue, Aug 11, 2020 at 6:54 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > >
> > > E.g.
> > >   openat(AT_FDCWD, "foo/bar//mnt/info", O_RDONLY | O_ALT);
> >
> > Proof of concept patch and test program below.
>
> I don't think this works for the reasons Al says, but a slight
> modification might.
>
> IOW, if you do something more along the lines of
>
>        fd = open(""foo/bar", O_PATH);
>        metadatafd = openat(fd, "metadataname", O_ALT);
>
> it might be workable.

That would have been my backup suggestion, in case the unified
namespace doesn't work out.

I wouldn't think the normal lookup rules really get in the way if we
explicitly enable alternative path lookup with a flag.  The rules just
need to be documented.

What's the disadvantage of doing it with a single lookup WITH an enabling flag?

It's definitely not going to break anything, so no backward
compatibility issues whatsoever.

Thanks,
Miklos

Andy Lutomirski Aug. 11, 2020, 3:39 p.m. UTC | #18

> On Aug 11, 2020, at 8:20 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> [ I missed the beginning of this discussion, so maybe this was already
> suggested ]
> 
>> On Tue, Aug 11, 2020 at 6:54 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>> 
>>> 
>>> E.g.
>>>  openat(AT_FDCWD, "foo/bar//mnt/info", O_RDONLY | O_ALT);
>> 
>> Proof of concept patch and test program below.
> 
> I don't think this works for the reasons Al says, but a slight
> modification might.
> 
> IOW, if you do something more along the lines of
> 
>       fd = open(""foo/bar", O_PATH);
>       metadatafd = openat(fd, "metadataname", O_ALT);
> 
> it might be workable.
> 
> So you couldn't do it with _one_ pathname, because that is always
> fundamentally going to hit pathname lookup rules.
> 
> But if you start a new path lookup with new rules, that's fine.
> 
> This is what I think xattrs should always have done, because they are
> broken garbage.
> 
> In fact, if we do it right, I think we could have "getxattr()" be 100%
> equivalent to (modulo all the error handling that this doesn't do, of
> course):
> 
>  ssize_t getxattr(const char *path, const char *name,
>                        void *value, size_t size)
>  {
>     int fd, attrfd;
> 
>     fd = open(path, O_PATH);
>     attrfd = openat(fd, name, O_ALT);
>     close(fd);
>     read(attrfd, value, size);
>     close(attrfd);
>  }
> 
> and you'd still use getxattr() and friends as a shorthand (and for
> POSIX compatibility), but internally in the kernel we'd have a
> interface around that "xattrs are just file handles" model.
> 
> 

This is a lot like a less nutty version of NTFS streams, whereas the /// idea is kind of like an extra-nutty version of NTFS streams.

I am personally not a fan of the in-band signaling implications of overloading /.  For example, there is plenty of code out there that thinks that (a + “/“ + b) concatenates paths. With /// overloaded, this stops being true.

Linus Torvalds Aug. 11, 2020, 4:05 p.m. UTC | #19

On Tue, Aug 11, 2020 at 8:30 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> What's the disadvantage of doing it with a single lookup WITH an enabling flag?
>
> It's definitely not going to break anything, so no backward
> compatibility issues whatsoever.

No backwards compatibility issues for existing programs, no.

But your suggestion is fundamentally ambiguous, and you most
definitely *can* hit that if people start using this in new programs.

Where does that "unified" pathname come from? It will be generated
from "base filename + metadata name" in user space, and

 (a) the base filename might have double or triple slashes in it for
whatever reasons.

This is not some "made-up gotcha" thing - I see double slashes *all*
the time when we have things like Makefiles doing

    srctree=../../src/

and then people do "$(srctree)/". If you haven't seen that kind of
pattern where the pathname has two (or sometimes more!) slashes in the
middle, you've led a very sheltered life.

 (b) even if the new user space were to think about that, and remove
those (hah! when have you ever seen user space do that?), as Al
mentioned, the user *filesystem* might have pathnames with double
slashes as part of symlinks.

So now we'd have to make sure that when we traverse symlinks, that
O_ALT gets cleared. Which means that it's not a unified namespace
after all, because you can't make symlinks point to metadata.

Or we'd retroactively change the semantics of a symlink, and that _is_
a backwards compatibility issue. Not with old software, no, but it
changes the meaning of old symlinks!

So no, I don't think a unified namespace ends up working.

And I say that as somebody who actually loves the concept. Ask Al: I
have a few times pushed for "let's allow directory behavior on regular
files", so that you could do things like a tar-filesystem, and access
the contents of a tar-file by just doing

    cat my-file.tar/inside/the/archive.c

or similar.

Al has convinced me it's a horrible idea (and there you have a
non-ambiguous marker: the slash at the end of a pathname that
otherwise looks and acts as a non-directory)

               Linus

Al Viro Aug. 11, 2020, 4:05 p.m. UTC | #20

On Tue, Aug 11, 2020 at 08:20:24AM -0700, Linus Torvalds wrote:

> I don't think this works for the reasons Al says, but a slight
> modification might.
> 
> IOW, if you do something more along the lines of
> 
>        fd = open(""foo/bar", O_PATH);
>        metadatafd = openat(fd, "metadataname", O_ALT);
> 
> it might be workable.
> 
> So you couldn't do it with _one_ pathname, because that is always
> fundamentally going to hit pathname lookup rules.
> 
> But if you start a new path lookup with new rules, that's fine.

Except that you suddenly see non-directory dentries get children.
And a lot of dcache-related logics needs to be changed if that
becomes possible.

I agree that xattrs are garbage, but this approach won't be
a straightforward solution.  Can those suckers be passed to
...at() as starting points?  Can they be bound in namespace?
Can something be bound *on* them?  What do they have for inodes
and what maintains their inumbers (and st_dev, while we are at
it)?  Can _they_ have secondaries like that (sensu Swift)?
Is that a flat space, or can they be directories?

Only a part of the problems is implementation-related (and those are
not trivial at all); most the fun comes from semantics of those things.
And answers to the implementation questions are seriously dependent upon
that...

Linus Torvalds Aug. 11, 2020, 4:09 p.m. UTC | #21

On Tue, Aug 11, 2020 at 9:05 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Except that you suddenly see non-directory dentries get children.
> And a lot of dcache-related logics needs to be changed if that
> becomes possible.

Yeah, I think you'd basically need to associate a (dynamic)
mount-point to that path when you start doing O_ALT. Or something.

And it might not be reasonably implementable. I just think that as
_interface_ it's unambiguous and fairly clean, and if Miklos can
implement something like that, I think it would be maintainable.

No?

                Linus

Casey Schaufler Aug. 11, 2020, 4:17 p.m. UTC | #22

On 8/11/2020 8:39 AM, Andy Lutomirski wrote:
>
>> On Aug 11, 2020, at 8:20 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>
>> [ I missed the beginning of this discussion, so maybe this was already
>> suggested ]
>>
>>> On Tue, Aug 11, 2020 at 6:54 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>>>
>>>> E.g.
>>>>  openat(AT_FDCWD, "foo/bar//mnt/info", O_RDONLY | O_ALT);
>>> Proof of concept patch and test program below.
>> I don't think this works for the reasons Al says, but a slight
>> modification might.
>>
>> IOW, if you do something more along the lines of
>>
>>       fd = open(""foo/bar", O_PATH);
>>       metadatafd = openat(fd, "metadataname", O_ALT);
>>
>> it might be workable.
>>
>> So you couldn't do it with _one_ pathname, because that is always
>> fundamentally going to hit pathname lookup rules.
>>
>> But if you start a new path lookup with new rules, that's fine.
>>
>> This is what I think xattrs should always have done, because they are
>> broken garbage.
>>
>> In fact, if we do it right, I think we could have "getxattr()" be 100%
>> equivalent to (modulo all the error handling that this doesn't do, of
>> course):
>>
>>  ssize_t getxattr(const char *path, const char *name,
>>                        void *value, size_t size)
>>  {known
>>     int fd, attrfd;
>>
>>     fd = open(path, O_PATH);
>>     attrfd = openat(fd, name, O_ALT);
>>     close(fd);
>>     read(attrfd, value, size);
>>     close(attrfd);
>>  }
>>
>> and you'd still use getxattr() and friends as a shorthand (and for
>> POSIX compatibility), but internally in the kernel we'd have a
>> interface around that "xattrs are just file handles" model.

This doesn't work so well for setxattr(), which we want to be atomic.

> This is a lot like a less nutty version of NTFS streams, whereas the /// idea is kind of like an extra-nutty version of NTFS streams.
>
> I am personally not a fan of the in-band signaling implications of overloading /.  For example, there is plenty of code out there that thinks that (a + “/“ + b) concatenates paths. With /// overloaded, this stops being true.

Since a////////b has known meaning, and lots of applications
play loose with '/', its really dangerous to treat the string as
special. We only get away with '.' and '..' because their behavior
was defined before many of y'all were born.

Linus Torvalds Aug. 11, 2020, 4:30 p.m. UTC | #23

On Tue, Aug 11, 2020 at 9:17 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>
> This doesn't work so well for setxattr(), which we want to be atomic.

Well, it's not like the old interfaces could go away. But yes, doing

        metadatafd = openat(fd, "metadataname", O_ALT | O_CREAT | O_EXCL)

to create a new xattr (and then write to it) would not act like
setxattr(). Even if you do it as one atomic write, a reader would see
that zero-sized xattr between the O_CREAT and the write.

Of course, we could just hide zero-sized xattrs from the legacy
interfaces and avoid things like that, but another option is to say
that only the legacy interfaces give that particular atomicity
guarantee.

> Since a////////b has known meaning, and lots of applications
> play loose with '/', its really dangerous to treat the string as
> special. We only get away with '.' and '..' because their behavior
> was defined before many of y'all were born.

Yeah, I really don't think it's a good idea to play with "//".

POSIX does allow special semantics for a pathname with "//" at the
*beginning*, but even that has been very questionable (and Linux has
never supported it).

                   Linus

Al Viro Aug. 11, 2020, 4:39 p.m. UTC | #24

On Tue, Aug 11, 2020 at 09:09:36AM -0700, Linus Torvalds wrote:
> On Tue, Aug 11, 2020 at 9:05 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Except that you suddenly see non-directory dentries get children.
> > And a lot of dcache-related logics needs to be changed if that
> > becomes possible.
> 
> Yeah, I think you'd basically need to associate a (dynamic)
> mount-point to that path when you start doing O_ALT. Or something.

Whee...  That's going to be non-workable for xattrs - fgetxattr()
needs to work after unlink().  And you'd obviously need to prevent
crossing into that sucker on normal lookups, which would add quite
a few interesting twists around the automount points.

I'm not saying it's not doable, but it won't be anywhere near
straightforward.  And API semantics questions are still there...

Miklos Szeredi Aug. 11, 2020, 6:49 p.m. UTC | #25

On Tue, Aug 11, 2020 at 6:05 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:

> and then people do "$(srctree)/". If you haven't seen that kind of
> pattern where the pathname has two (or sometimes more!) slashes in the
> middle, you've led a very sheltered life.

Oh, I have.   That's why I opted for triple slashes, since that should
work most of the time even in those concatenated cases.  And yes, I
know, most is not always, and this might just be hiding bugs, etc...
I think the pragmatic approach would be to try this and see how many
triple slash hits a normal workload gets and if it's reasonably low,
then hopefully that together with warnings for O_ALT would be enough.

>  (b) even if the new user space were to think about that, and remove
> those (hah! when have you ever seen user space do that?), as Al
> mentioned, the user *filesystem* might have pathnames with double
> slashes as part of symlinks.
>
> So now we'd have to make sure that when we traverse symlinks, that
> O_ALT gets cleared.

That's exactly what I implemented in the proof of concept patch.

> Which means that it's not a unified namespace
> after all, because you can't make symlinks point to metadata.

I don't think that's a great deal.  Also I think other limitations
would make sense:

 - no mounts allowed under ///
 - no ./.. resolution after ///
 - no hardlinks
 - no special files, just regular and directory
 - no seeking (regular or dir)

>     cat my-file.tar/inside/the/archive.c
>
> or similar.
>
> Al has convinced me it's a horrible idea (and there you have a
> non-ambiguous marker: the slash at the end of a pathname that
> otherwise looks and acts as a non-directory)

Umm, can you remind me what's so horrible about that?  Yeah, hard
linked directories are a no-no.  But it doesn't have to be implemented
in a way to actually be a problem with hard links.

Thanks,
Miklos

Lennart Poettering Aug. 11, 2020, 7:31 p.m. UTC | #26

On Di, 11.08.20 20:49, Miklos Szeredi (miklos@szeredi.hu) wrote:

> On Tue, Aug 11, 2020 at 6:05 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>
> > and then people do "$(srctree)/". If you haven't seen that kind of
> > pattern where the pathname has two (or sometimes more!) slashes in the
> > middle, you've led a very sheltered life.
>
> Oh, I have.   That's why I opted for triple slashes, since that should
> work most of the time even in those concatenated cases.  And yes, I
> know, most is not always, and this might just be hiding bugs, etc...
> I think the pragmatic approach would be to try this and see how many
> triple slash hits a normal workload gets and if it's reasonably low,
> then hopefully that together with warnings for O_ALT would be enough.

There's no point. Userspace relies on the current meaning of triple
slashes. It really does.

I know many places in systemd where we might end up with a triple
slash. Here's a real-life example: some code wants to access the
cgroup attribute 'cgroup.controllers' of the root cgroup. It thus
generates the right path in the fs for it, which is the concatenation of
"/sys/fs/cgroup/" (because that's where cgroupfs is mounted), of "/"
(i.e. for the root cgroup) and of "/cgroup.controllers" (as that's the
file the attribute is exposed under).

And there you go:

   "/sys/fs/cgroup/" + "/" + "/cgroup.controllers" → "/sys/fs/cgroup///cgroup.controllers"

This is a real-life thing. Don't break this please.

Lennart

--
Lennart Poettering, Berlin

Christian Brauner Aug. 11, 2020, 7:39 p.m. UTC | #27

On Tue, Aug 11, 2020 at 09:05:22AM -0700, Linus Torvalds wrote:
> On Tue, Aug 11, 2020 at 8:30 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > What's the disadvantage of doing it with a single lookup WITH an enabling flag?
> >
> > It's definitely not going to break anything, so no backward
> > compatibility issues whatsoever.
> 
> No backwards compatibility issues for existing programs, no.
> 
> But your suggestion is fundamentally ambiguous, and you most
> definitely *can* hit that if people start using this in new programs.
> 
> Where does that "unified" pathname come from? It will be generated
> from "base filename + metadata name" in user space, and
> 
>  (a) the base filename might have double or triple slashes in it for
> whatever reasons.
> 
> This is not some "made-up gotcha" thing - I see double slashes *all*
> the time when we have things like Makefiles doing
> 
>     srctree=../../src/
> 
> and then people do "$(srctree)/". If you haven't seen that kind of
> pattern where the pathname has two (or sometimes more!) slashes in the
> middle, you've led a very sheltered life.
> 
>  (b) even if the new user space were to think about that, and remove
> those (hah! when have you ever seen user space do that?), as Al
> mentioned, the user *filesystem* might have pathnames with double
> slashes as part of symlinks.
> 
> So now we'd have to make sure that when we traverse symlinks, that
> O_ALT gets cleared. Which means that it's not a unified namespace
> after all, because you can't make symlinks point to metadata.
> 
> Or we'd retroactively change the semantics of a symlink, and that _is_
> a backwards compatibility issue. Not with old software, no, but it
> changes the meaning of old symlinks!
> 
> So no, I don't think a unified namespace ends up working.
> 
> And I say that as somebody who actually loves the concept. Ask Al: I
> have a few times pushed for "let's allow directory behavior on regular
> files", so that you could do things like a tar-filesystem, and access
> the contents of a tar-file by just doing
> 
>     cat my-file.tar/inside/the/archive.c
> 
> or similar.
> 
> Al has convinced me it's a horrible idea (and there you have a
> non-ambiguous marker: the slash at the end of a pathname that
> otherwise looks and acts as a non-directory)
> 

Putting my kernel hat down, putting my userspace hat on.

I'm looking at this from a potential user of this interface.
I'm not a huge fan of the metadata fd approach I'd much rather have a
dedicated system call rather than opening a side-channel metadata fd
that I can read binary data from. Maybe I'm alone in this but I was
under the impression that other users including Ian, Lennart, and Karel
have said on-list in some form that they would prefer this approach.
There are even patches for systemd and libmount, I thought?

But if we want to go down a completely different route then I'd prefer
if this metadata fd with "special semantics" did not in any way alter
the meaning of regular paths. This has the potential to cause a lot of
churn for userspace. I think having to play concatenation games in
shared libraries for mount information is a bad plan in addition to all
the issues you raised here.

Christian

Christian Brauner Aug. 11, 2020, 7:50 p.m. UTC | #28

On Tue, Aug 11, 2020 at 09:31:05PM +0200, Lennart Poettering wrote:
> On Di, 11.08.20 20:49, Miklos Szeredi (miklos@szeredi.hu) wrote:
> 
> > On Tue, Aug 11, 2020 at 6:05 PM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> >
> > > and then people do "$(srctree)/". If you haven't seen that kind of
> > > pattern where the pathname has two (or sometimes more!) slashes in the
> > > middle, you've led a very sheltered life.
> >
> > Oh, I have.   That's why I opted for triple slashes, since that should
> > work most of the time even in those concatenated cases.  And yes, I
> > know, most is not always, and this might just be hiding bugs, etc...
> > I think the pragmatic approach would be to try this and see how many
> > triple slash hits a normal workload gets and if it's reasonably low,
> > then hopefully that together with warnings for O_ALT would be enough.
> 
> There's no point. Userspace relies on the current meaning of triple
> slashes. It really does.
> 
> I know many places in systemd where we might end up with a triple
> slash. Here's a real-life example: some code wants to access the
> cgroup attribute 'cgroup.controllers' of the root cgroup. It thus
> generates the right path in the fs for it, which is the concatenation of
> "/sys/fs/cgroup/" (because that's where cgroupfs is mounted), of "/"
> (i.e. for the root cgroup) and of "/cgroup.controllers" (as that's the
> file the attribute is exposed under).
> 
> And there you go:
> 
>    "/sys/fs/cgroup/" + "/" + "/cgroup.controllers" → "/sys/fs/cgroup///cgroup.controllers"
> 
> This is a real-life thing. Don't break this please.

Taken from a log from a container:

lxc f4 20200810105815.742 TRACE    cgfsng - cgroups/cgfsng.c:cg_legacy_handle_cpuset_hierarchy:552 - "cgroup.clone_children" was already set to "1"
lxc f4 20200810105815.742 WARN     cgfsng - cgroups/cgfsng.c:mkdir_eexist_on_last:1152 - File exists - Failed to create directory "/sys/fs/cgroup/cpuset///lxc.monitor.f4"
lxc f4 20200810105815.743 INFO     cgfsng - cgroups/cgfsng.c:cgfsng_monitor_create:1366 - The monitor process uses "lxc.monitor.f4" as cgroup
lxc f4 20200810105815.743 DEBUG    storage - storage/storage.c:get_storage_by_name:211 - Detected rootfs type "dir"
lxc f4 20200810105815.743 TRACE    cgfsng - cgroups/cgfsng.c:cg_legacy_handle_cpuset_hierarchy:552 - "cgroup.clone_children" was already set to "1"
lxc f4 20200810105815.743 WARN     cgfsng - cgroups/cgfsng.c:mkdir_eexist_on_last:1152 - File exists - Failed to create directory "/sys/fs/cgroup/cpuset///lxc.payload.f4"
lxc f4 20200810105815.743 INFO     cgfsng - cgroups/cgfsng.c:cgfsng_payload_create:1469 - The container process uses "lxc.payload.f4" as cgroup
lxc f4 20200810105815.744 TRACE    start - start.c:lxc_spawn:1731 - Spawned container directly into target cgroup via cgroup2 fd 17 

Christian

Miklos Szeredi Aug. 11, 2020, 8:28 p.m. UTC | #29

On Tue, Aug 11, 2020 at 6:17 PM Casey Schaufler <casey@schaufler-ca.com> wrote:

> Since a////////b has known meaning, and lots of applications
> play loose with '/', its really dangerous to treat the string as
> special. We only get away with '.' and '..' because their behavior
> was defined before many of y'all were born.

So the founding fathers have set things in stone and now we can't
change it.   Right?

Well that's how it looks... but let's think a little; we have '/' and
'\0' that can't be used in filenames.  Also '.' and '..' are
prohibited names. It's not a trivial limitation, so applications are
probably not used to dumping binary data into file names.  And that
means it's probably possible to find a fairly short combination that
is never used in practice (probably containing the "/." sequence).
Why couldn't we reserve such a combination now?

I have no idea how to find such it, but other than that, I see no
theoretical problem with extending the list of reserved filenames.

Thanks,
Miklos

Jann Horn Aug. 11, 2020, 8:36 p.m. UTC | #30

On Tue, Aug 11, 2020 at 10:29 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Tue, Aug 11, 2020 at 6:17 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
> > Since a////////b has known meaning, and lots of applications
> > play loose with '/', its really dangerous to treat the string as
> > special. We only get away with '.' and '..' because their behavior
> > was defined before many of y'all were born.
>
> So the founding fathers have set things in stone and now we can't
> change it.   Right?
>
> Well that's how it looks... but let's think a little; we have '/' and
> '\0' that can't be used in filenames.  Also '.' and '..' are
> prohibited names. It's not a trivial limitation, so applications are
> probably not used to dumping binary data into file names.  And that
> means it's probably possible to find a fairly short combination that
> is never used in practice (probably containing the "/." sequence).
> Why couldn't we reserve such a combination now?

This isn't just about finding a string that "is never used in
practice". There is userspace software that performs security checks
based on the precise semantics that paths have nowadays, and those
security checks will sometimes happily let you use arbitrary binary
garbage in path components as long as there's no '\0' or '/' in there
and the name isn't "." or "..", because that's just how paths work on
Linux.

If you change the semantics of path strings, you'd have to be
confident that the new semantics fit nicely with all the path
validation routines that exist scattered across userspace, and don't
expose new interfaces through file server software and setuid binaries
and so on.

I really don't like this idea.

Miklos Szeredi Aug. 11, 2020, 8:56 p.m. UTC | #31

On Tue, Aug 11, 2020 at 10:37 PM Jann Horn <jannh@google.com> wrote:
> If you change the semantics of path strings, you'd have to be
> confident that the new semantics fit nicely with all the path
> validation routines that exist scattered across userspace, and don't
> expose new interfaces through file server software and setuid binaries
> and so on.

So that's where O_ALT comes in.   If the application is consenting,
then that should prevent exploits.   Or?

Thanks,
Miklos

Andy Lutomirski Aug. 11, 2020, 9:17 p.m. UTC | #32

On Tue, Aug 11, 2020 at 1:56 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Tue, Aug 11, 2020 at 10:37 PM Jann Horn <jannh@google.com> wrote:
> > If you change the semantics of path strings, you'd have to be
> > confident that the new semantics fit nicely with all the path
> > validation routines that exist scattered across userspace, and don't
> > expose new interfaces through file server software and setuid binaries
> > and so on.
>
> So that's where O_ALT comes in.   If the application is consenting,
> then that should prevent exploits.   Or?

We're going to be at risk from libraries that want to use the new
O_ALT mechanism but are invoked by old code that passes traditional
Linux paths.  Each library will have to sanitize paths, and some will
screw it up.

I much prefer Linus' variant where the final part of the extended path
is passed as a separate parameter.

Linus Torvalds Aug. 11, 2020, 9:18 p.m. UTC | #33

On Tue, Aug 11, 2020 at 1:56 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> So that's where O_ALT comes in.   If the application is consenting,
> then that should prevent exploits.   Or?

If the application is consenting AND GETS IT RIGHT it should prevent exploits.

But that's a big deal.

Why not just do it the way I suggested? Then you don't have any of these issues.

              Linus

Al Viro Aug. 11, 2020, 9:20 p.m. UTC | #34

On Tue, Aug 11, 2020 at 10:28:31PM +0200, Miklos Szeredi wrote:
> On Tue, Aug 11, 2020 at 6:17 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
> 
> > Since a////////b has known meaning, and lots of applications
> > play loose with '/', its really dangerous to treat the string as
> > special. We only get away with '.' and '..' because their behavior
> > was defined before many of y'all were born.
> 
> So the founding fathers have set things in stone and now we can't
> change it.   Right?

Right.

> Well that's how it looks... but let's think a little; we have '/' and
> '\0' that can't be used in filenames.  Also '.' and '..' are
> prohibited names. It's not a trivial limitation, so applications are
> probably not used to dumping binary data into file names.  And that
> means it's probably possible to find a fairly short combination that
> is never used in practice (probably containing the "/." sequence).

No, it is not.  Miklos, get real - you will end up with obscure
pathname produced once in a while by a script fragment from hell
spewed out by crusty piece of awk buried in a piece of shit
makefile from hell (and you are lucky if it won't be an automake
output, while we are at it).  Exercised only when some shipped
turd needs to be regenerated.  Have you _ever_ tried to debug e.g.
gcc build problems?  I have, and it's extremely unpleasant.  Failures
tend to be obscure as hell, backtracking them through the makefiles
is a massive PITA and figuring out why said piece of awk produces
what it does...

I know what I would've done if the likely 5 hours of cursing everything
would have ended up with discovery that some luser had assumed that
surely, no sane software would ever generate this sequence of characters
in anything used as a pathname, and that for this reason I'm looking
forward to several more hours of playing with utterly revolting crap
to convince it to stay away from that sequence...

> Why couldn't we reserve such a combination now?
> 
> I have no idea how to find such it, but other than that, I see no
> theoretical problem with extending the list of reserved filenames.

"not breaking userland", for one.

Casey Schaufler Aug. 11, 2020, 9:35 p.m. UTC | #35

On 8/11/2020 1:28 PM, Miklos Szeredi wrote:
> On Tue, Aug 11, 2020 at 6:17 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
>
>> Since a////////b has known meaning, and lots of applications
>> play loose with '/', its really dangerous to treat the string as
>> special. We only get away with '.' and '..' because their behavior
>> was defined before many of y'all were born.
> So the founding fathers have set things in stone and now we can't
> change it.   Right?

The founders did lots of things that, in retrospect, weren't
such great ideas, but that we have to live with.

> Well that's how it looks... but let's think a little; we have '/' and
> '\0' that can't be used in filenames.  Also '.' and '..' are
> prohibited names. It's not a trivial limitation, so applications are
> probably not used to dumping binary data into file names.

Hee Hee. Back in the early days of UNIX (the 1970s) there was command
dsw(1) "delete from switches" because files with untypeible names where
unfortunately common. I would question the assertion that "applications
are not used to dumping binary data into file names", based on how
often I've wished we still had dsw(1).

>   And that
> means it's probably possible to find a fairly short combination that
> is never used in practice (probably containing the "/." sequence).

You'd think, but you'd be wrong. In the UNIX days we tried everything
from "..." to ".NO_HID." and there always arose a problem or two. Not
the least of which is that a "magic" pathname generated on an old system,
then mounted on a new system will never give you the results you want.

> Why couldn't we reserve such a combination now?
>
> I have no idea how to find such it, but other than that, I see no
> theoretical problem with extending the list of reserved filenames.

You need a sequence that is never used in any language, and
that has never been used as a magic shell sequence. If you want
a fun story to tell over beers, look up how using the "@" as the
erase character on a TTY33 lead to it being used in email addresses.

> Thanks,
> Miklos

David Howells Aug. 12, 2020, 12:05 a.m. UTC | #36

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> [ I missed the beginning of this discussion, so maybe this was already
> suggested ]

Well, the start of it was my proposal of an fsinfo() system call.  That at its
simplest takes an object reference (eg. a path) and an integer attribute ID (it
could use a string instead, I suppose, but it would mean a bunch of strcmps
instead of integer comparisons) and returns the value of the attribute.  But I
allow you to do slightly more interesting things than that too.

Miklós seems dead-set against adding a system call specifically for this -
though he's proposed extending open in various ways and also proposed an
additional syscall, readfile(), that does the open+read+close all in one step.

I think also at some point, he (or maybe James?) proposed adding a new magic
filesystem mounted somewhere on proc (reflecting an open fd) that then had a
bunch of symlinks to somewhere in sysfs (reflecting a mount).  The idea being
that you did something like:

	fd = open("/path/to/object", O_PATH);
	sprintf(name, "/proc/self/fds/%u/attr1", fd);
	attrfd = open(name, O_RDONLY);
	read(attrfd, buf1, sizeof(buf1));
	close(attrfd);
	sprintf(name, "/proc/self/fds/%u/attr2", fd);
	attrfd = open(name, O_RDONLY);
	read(attrfd, buf2, sizeof(buf2));
	close(attrfd);

or:

	sprintf(name, "/proc/self/fds/%u/attr1", fd);
	readfile(name, buf1, sizeof(buf1));
	sprintf(name, "/proc/self/fds/%u/attr2", fd);
	readfile(name, buf2, sizeof(buf2));

and then "/proc/self/fds/12/attr2" might then be a symlink to, say,
"/sys/mounts/615/mount_attr".

Miklós's justification for this was that it could then be operated from a shell
script without the need for a utility - except that bash, at least, can't do
O_PATH opens.

James has proposed making fsconfig() able to retrieve attributes (though I'd
prefer to give it a sibling syscall that does the retrieval rather than making
fsconfig() do that too).

>   {
>      int fd, attrfd;
>
>      fd = open(path, O_PATH);
>      attrfd = openat(fd, name, O_ALT);
>      close(fd);
>      read(attrfd, value, size);
>      close(attrfd);
>   }

Please don't go down this path.  You're proposing five syscalls - including
creating two file descriptors - to do what fsinfo() does in one.

Do you have a particular objection to adding a syscall specifically for
retrieving filesystem/VFS information?

-~-

Anyway, in case you're interested in what I want to get out of this - which is
the reason for it being posted in the first place:

 (*) The ability to retrieve various attributes of a filesystem/superblock,
     including information on:

	- Filesystem features: Does it support things like hard links, user
          quotas, direct I/O.

	- Filesystem limits: What's the maximum size of a file, an xattr, a
          directory; how many files can it support.

	- Supported API features: What FS_IOC_GETFLAGS does it support?  Which
          can be set?  Does it have Windows file attributes available?  What
          statx attributes are supported?  What do the timestamps support?
          What sort of case handling is done on filenames?

     Note that for a lot of cases, this stuff is fixed and can just be memcpy'd
     from rodata.  Some of this is variable, however, in things like ext4 and
     xfs, depending on, say, mkfs configuration.  The situation is even more
     complex with network filesystems as this may depend on the server they're
     talking to.

     But note also that some of this stuff might change file-to-file, even
     within a superblock.

 (*) The ability to retrieve attributes of a mount point, including information
     on the flags, propagation settings and child lists.

 (*) The ability to quickly retrieve a list of accessible mount point IDs,
     with change event counters to permit userspace (eg. systemd) to quickly
     determine if anything changed in the even of an overrun.

 (*) The ability to find mounts/superblocks by mount ID.  Paths are not unique
     identifiers for mountpoints.  You can stack multiple mounts on the same
     directory, but a path only sees the top one.

 (*) The ability to look inside a different mount namespace - one to which you
     have a reference fd.  This would allow a container manager to look inside
     the container it is managing.

 (*) The ability to expose filesystem-specific attributes.  Network filesystems
     can expose lists of servers and server addresses, for instance.

 (*) The ability to use the object referenced to determine the namespace
     (particularly the network namespace) to look in.  The problem with looking
     in, say, /proc/net/... is that it looks at current's net namespace -
     whether or not the object of interest is in the same one.

 (*) The ability to query the context attached to the fd obtained from
     fsopen().  Such a context may not have a superblock attached to it yet or
     may not be mounted yet.

     The aim is to allow a container manager to supervise a mount being made in
     a container.  It kind of pairs with fsconfig() in that respect.

 (*) The ability to query mount and superblock event counters to help a
     watching process handle overrun in the notifications queue.

What I've done with fsinfo() is:

 (*) Provided a number of ways to refer to the object to be queried (path,
     dirfd+path, fd, mount ID - with others planned).

 (*) Made it so that attibutes are referenced by a numeric ID to keep search
     time minimal.  Numeric IDs must be declared in uapi/linux/fsinfo.h.

 (*) Made it so that the core does most of the work.  Filesystems are given an
     in-kernel buffer to copy into and don't get to see any userspace pointers.

 (*) Made it so that values are not, by and large, encoded as text if it can be
     avoided.  Backward and forward compatibility on binary structs is handled
     by the core.  The filesystem just fills in the values in the UAPI struct
     in the buffer.  The core will zero-pad or truncate the data to match what
     userspace asked for.

     The UAPI struct must be declared in uapi/linux/fsinfo.h.

 (*) Made it so that, for some attributes, the core will fill in the data as
     best it can from what's available in the superblock, mount struct or mount
     namespace.  The filesystem can then amend this if it wants to.

 (*) Made it so that attributes are typed.  The types are few: string, struct,
     list of struct, opaque.  Structs are extensible: the length is the
     version, a new version is required to be a superset of the old version and
     excess requestage is simply cleared by the kernel.

     Information about the type of an attribute can be queried by fsinfo().

What I want to avoid:

 (*) Adding another magic filesystem.

 (*) Adding symlinks from proc to sysfs.

 (*) Having to use open to get an attribute.

 (*) Having to use multiple opens to get an attribute.

 (*) Having to pathwalk to get to the attribute from the object being queried.

 (*) Allocating another O_ open flag for this.

 (*) Avoidable text encoding and decoding.

 (*) Letting the filesystem access the userspace buffer.

Note that I'm not against splitting fsinfo() into a set of sibling syscalls if
that makes it more palatable, or even against using strings for the attribute
IDs, though I'd prefer to avoid the strcmps.

David

Ian Kent Aug. 12, 2020, 12:53 a.m. UTC | #37

On Tue, 2020-08-11 at 21:39 +0200, Christian Brauner wrote:
> On Tue, Aug 11, 2020 at 09:05:22AM -0700, Linus Torvalds wrote:
> > On Tue, Aug 11, 2020 at 8:30 AM Miklos Szeredi <miklos@szeredi.hu>
> > wrote:
> > > What's the disadvantage of doing it with a single lookup WITH an
> > > enabling flag?
> > > 
> > > It's definitely not going to break anything, so no backward
> > > compatibility issues whatsoever.
> > 
> > No backwards compatibility issues for existing programs, no.
> > 
> > But your suggestion is fundamentally ambiguous, and you most
> > definitely *can* hit that if people start using this in new
> > programs.
> > 
> > Where does that "unified" pathname come from? It will be generated
> > from "base filename + metadata name" in user space, and
> > 
> >  (a) the base filename might have double or triple slashes in it
> > for
> > whatever reasons.
> > 
> > This is not some "made-up gotcha" thing - I see double slashes
> > *all*
> > the time when we have things like Makefiles doing
> > 
> >     srctree=../../src/
> > 
> > and then people do "$(srctree)/". If you haven't seen that kind of
> > pattern where the pathname has two (or sometimes more!) slashes in
> > the
> > middle, you've led a very sheltered life.
> > 
> >  (b) even if the new user space were to think about that, and
> > remove
> > those (hah! when have you ever seen user space do that?), as Al
> > mentioned, the user *filesystem* might have pathnames with double
> > slashes as part of symlinks.
> > 
> > So now we'd have to make sure that when we traverse symlinks, that
> > O_ALT gets cleared. Which means that it's not a unified namespace
> > after all, because you can't make symlinks point to metadata.
> > 
> > Or we'd retroactively change the semantics of a symlink, and that
> > _is_
> > a backwards compatibility issue. Not with old software, no, but it
> > changes the meaning of old symlinks!
> > 
> > So no, I don't think a unified namespace ends up working.
> > 
> > And I say that as somebody who actually loves the concept. Ask Al:
> > I
> > have a few times pushed for "let's allow directory behavior on
> > regular
> > files", so that you could do things like a tar-filesystem, and
> > access
> > the contents of a tar-file by just doing
> > 
> >     cat my-file.tar/inside/the/archive.c
> > 
> > or similar.
> > 
> > Al has convinced me it's a horrible idea (and there you have a
> > non-ambiguous marker: the slash at the end of a pathname that
> > otherwise looks and acts as a non-directory)
> > 
> 
> Putting my kernel hat down, putting my userspace hat on.
> 
> I'm looking at this from a potential user of this interface.
> I'm not a huge fan of the metadata fd approach I'd much rather have a
> dedicated system call rather than opening a side-channel metadata fd
> that I can read binary data from. Maybe I'm alone in this but I was
> under the impression that other users including Ian, Lennart, and
> Karel
> have said on-list in some form that they would prefer this approach.
> There are even patches for systemd and libmount, I thought?

Not quite sure what you mean here.

Karel (with some contributions by me) has implemented the interfaces
for David's mount notifications and fsinfo() call in libmount. We
still have a little more to do on that.

I also have a systemd implementation that uses these libmount features
for mount table handling that works quite well, with a couple more
things to do to complete it, that Lennart has done an initial review
for.

It's no secret that I don't like the proc file system in general
but it is really useful for many things, that's just the way it
is.

Ian

Miklos Szeredi Aug. 12, 2020, 7:23 a.m. UTC | #38

On Tue, Aug 11, 2020 at 11:19 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, Aug 11, 2020 at 1:56 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > So that's where O_ALT comes in.   If the application is consenting,
> > then that should prevent exploits.   Or?
>
> If the application is consenting AND GETS IT RIGHT it should prevent exploits.
>
> But that's a big deal.
>
> Why not just do it the way I suggested? Then you don't have any of these issues.

Will do.

I just want to understand the reasons why a unified namespace is
completely out of the question.   And I won't accept "it's just fugly"
or "it's the way it's always been done, so don't change it".  Those
are not good reasons.

Oh, I'm used to these "fights", had them all along.  In hindsight I
should have accepted others' advice in some of the cases, but in
others that big argument turned out to be a complete non-issue.   One
such being inode and dentry duplication in the overlayfs case vs.
in-built stacking in the union-mount case.  There were a lot of issues
with overlayfs, that's true, but dcache/icache size has NEVER actually
been reported as a problem.

While Al has a lot of experience, it's hard to accept all that
anecdotal evidence just because he says it.   Your worries are also
just those: worries.  They may turn out to be an issue or they may
not.

Anyway, starting with just introducing the alt namespace without
unification seems to be a good first step. If that turns out to be
workable, we can revisit unification later.

Thanks,
Miklos

Miklos Szeredi Aug. 12, 2020, 7:55 a.m. UTC | #39

On Wed, Aug 12, 2020 at 2:05 AM David Howells <dhowells@redhat.com> wrote:

> >   {
> >      int fd, attrfd;
> >
> >      fd = open(path, O_PATH);
> >      attrfd = openat(fd, name, O_ALT);
> >      close(fd);
> >      read(attrfd, value, size);
> >      close(attrfd);
> >   }
>
> Please don't go down this path.  You're proposing five syscalls - including
> creating two file descriptors - to do what fsinfo() does in one.

So what?  People argued against readfile() for exactly the opposite of
reasons, even though that's a lot less specialized than fsinfo().

Worried about performance?  Io-uring will allow you to do all those
five syscalls (or many more) with just one I/O submission.

Thanks,
Miklos

David Howells Aug. 12, 2020, 8:29 a.m. UTC | #40

Miklos Szeredi <miklos@szeredi.hu> wrote:

> Worried about performance?  Io-uring will allow you to do all those
> five syscalls (or many more) with just one I/O submission.

io_uring isn't going to help here.  We're talking about synchronous reads.
AIUI, you're adding a couple more syscalls to the list and running stuff in a
side thread to save the effort of going in and out of the kernel five times.
But you still have to pay the set up/tear down costs on the fds and do the
pathwalks.  io_uring doesn't magically make that cost disappear.

io_uring also requires resources such as a kernel accessible ring buffer to
make it work.

You're proposing making everything else more messy just to avoid a dedicated
syscall.  Could you please set out your reasoning for that?

David

Miklos Szeredi Aug. 12, 2020, 8:37 a.m. UTC | #41

On Wed, Aug 12, 2020 at 10:29 AM David Howells <dhowells@redhat.com> wrote:
>
> Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> > Worried about performance?  Io-uring will allow you to do all those
> > five syscalls (or many more) with just one I/O submission.
>
> io_uring isn't going to help here.  We're talking about synchronous reads.
> AIUI, you're adding a couple more syscalls to the list and running stuff in a
> side thread to save the effort of going in and out of the kernel five times.
> But you still have to pay the set up/tear down costs on the fds and do the
> pathwalks.  io_uring doesn't magically make that cost disappear.
>
> io_uring also requires resources such as a kernel accessible ring buffer to
> make it work.
>
> You're proposing making everything else more messy just to avoid a dedicated
> syscall.  Could you please set out your reasoning for that?

a) A dedicated syscall with a complex binary API is a non-trivial
maintenance burden.

b) The awarded performance boost is not warranted for the use cases it
is designed for.

Thanks,
Miklos

Steven Whitehouse Aug. 12, 2020, 9:43 a.m. UTC | #42

Hi,

On 12/08/2020 09:37, Miklos Szeredi wrote:
[snip]
>
> b) The awarded performance boost is not warranted for the use cases it
> is designed for.
>
> Thanks,
> Miklos
>

This is a key point. One of the main drivers for this work is the 
efficiency improvement for large numbers of mounts. Ian and Karel have 
already provided performance measurements showing a significant benefit 
compared with what we have today. If you want to propose this 
alternative interface then you need to show that it can sustain similar 
levels of performance, otherwise it doesn't solve the problem. So 
performance numbers here would be helpful.

Also - I may have missed this earlier in the discussion, what are the 
atomicity guarantees with this proposal? This is the other key point for 
the API, so it would be good to see that clearly stated (i.e. how does 
one use it in combination with the notifications to provide an up to 
date, consistent view of the kernel's mounts)

Steve.

Miklos Szeredi Aug. 12, 2020, 10:04 a.m. UTC | #43

On Wed, Aug 12, 2020 at 11:43 AM Steven Whitehouse <swhiteho@redhat.com> wrote:
>
> Hi,
>
> On 12/08/2020 09:37, Miklos Szeredi wrote:
> [snip]
> >
> > b) The awarded performance boost is not warranted for the use cases it
> > is designed for.

>
> This is a key point. One of the main drivers for this work is the
> efficiency improvement for large numbers of mounts. Ian and Karel have
> already provided performance measurements showing a significant benefit
> compared with what we have today. If you want to propose this
> alternative interface then you need to show that it can sustain similar
> levels of performance, otherwise it doesn't solve the problem. So
> performance numbers here would be helpful.

Definitely.   Will measure performance with the interface which Linus proposed.

I'm not worried, though; the problem with the previous interface was
that it resulted in the complete mount table being re-parsed on each
individual event resulting in quadratic behavior.  This doesn't affect
any interface that can query individual mount/superblock objects.

> Also - I may have missed this earlier in the discussion, what are the
> atomicity guarantees with this proposal? This is the other key point for
> the API, so it would be good to see that clearly stated (i.e. how does
> one use it in combination with the notifications to provide an up to
> date, consistent view of the kernel's mounts)

fsinfo(2) provides version counters on mount and superblock objects to
verify consistency of returned data, since not all data is returned in
a single call.  Same method could be used with the open/read based
interface to verify consistency in case multiple attributes/attribute
groups need to be queried.

Thanks,
Miklos

Karel Zak Aug. 12, 2020, 10:14 a.m. UTC | #44

On Tue, Aug 11, 2020 at 08:20:24AM -0700, Linus Torvalds wrote:
> IOW, if you do something more along the lines of
> 
>        fd = open(""foo/bar", O_PATH);
>        metadatafd = openat(fd, "metadataname", O_ALT);
> 
> it might be workable.

I have thought we want to replace mountinfo to reduce overhead. If I
understand your idea than we will need openat()+read()+close() for
each attribute? Sounds like a pretty expensive interface.

The question is also how consistent results you get if you will read
information about the same mountpoint by multiple openat()+read()+close()
calls. 

For example,  by fsinfo(FSINFO_ATTR_MOUNT_TOPOLOGY) you get all
mountpoint propagation setting and relations by one syscall, with
your idea you will read parent, slave and flags by multiple read() and
without any lock. Sounds like you can get a mess if someone moves or
reconfigure the mountpoint or so.

openat(O_ALT) seems elegant at first glance, but it will be necessary to 
provide a richer (complex) answers by read() to reduce overhead and 
to make it more consistent for userspace. 

It would be also nice to avoid some strings formatting and separators
like we use in the current mountinfo.

I can imagine multiple values separated by binary header (like we already 
have for watch_notification, inotify, etc):

  fd = openat(fd, "mountinfo", O_ALT);

  sz = read(fd, buf, BUFSZ);
  p = buf;

  while (sz) {
     struct alt_metadata *alt = (struct alt_metadata *) p;

     char *varname = alt->name;
     char *data = alt->data;
     int len  = alt->len;

     sz -= len;
     p += len;
  }

  Karel

Karel Zak Aug. 12, 2020, 11:28 a.m. UTC | #45

On Wed, Aug 12, 2020 at 12:04:14PM +0200, Miklos Szeredi wrote:
> On Wed, Aug 12, 2020 at 11:43 AM Steven Whitehouse <swhiteho@redhat.com> wrote:
> >
> > Hi,
> >
> > On 12/08/2020 09:37, Miklos Szeredi wrote:
> > [snip]
> > >
> > > b) The awarded performance boost is not warranted for the use cases it
> > > is designed for.
> 
> >
> > This is a key point. One of the main drivers for this work is the
> > efficiency improvement for large numbers of mounts. Ian and Karel have
> > already provided performance measurements showing a significant benefit
> > compared with what we have today. If you want to propose this
> > alternative interface then you need to show that it can sustain similar
> > levels of performance, otherwise it doesn't solve the problem. So
> > performance numbers here would be helpful.
> 
> Definitely.   Will measure performance with the interface which Linus proposed.

The proposal is based on paths and open(), how do you plan to deal
with mount IDs? David's fsinfo() allows to ask for mount info by mount
ID and it works well with mount notification where you get the ID. The
collaboration with notification interface is critical for our use-cases.

    Karel

Miklos Szeredi Aug. 12, 2020, 12:43 p.m. UTC | #46

On Wed, Aug 12, 2020 at 1:28 PM Karel Zak <kzak@redhat.com> wrote:

> The proposal is based on paths and open(), how do you plan to deal
> with mount IDs? David's fsinfo() allows to ask for mount info by mount
> ID and it works well with mount notification where you get the ID. The
> collaboration with notification interface is critical for our use-cases.

One would use the notification to keep an up to date set of attributes
for each watched mount, right?

That presumably means the mount ID <-> mount path mapping already
exists, which means it's just possible to use the open(mount_path,
O_PATH) to obtain the base fd.

If that assumption is not true, we could add a new interface for
opening the root of the mount by ID.  Fsinfo uses the dfd as a root
for checking connectivity and the filename as the mount ID + a special
flag indicating that it's not "dfd + path" we are dealing with but
"rootfd + mntid".  That sort of semantic multiplexing is quite ugly
and I wouldn't suggest doing that with openat(2).

A new syscall that returns an fd pointing to the root of the mount
might be the best solution:

   int open_mount(int root_fd, u64 mntid, int flags);

Yeah, yeah this is adding just another syscall interface, but notice how:

 a) it does one simple thing, no multiplexing at all

 b) is general purpose, and could be used for example in conjunction
with open_by_handle_at(2), that also requires an fd pointing to a
mount.

Thanks,
Miklos

David Howells Aug. 12, 2020, 1:06 p.m. UTC | #47

Miklos Szeredi <miklos@szeredi.hu> wrote:

> That presumably means the mount ID <-> mount path mapping already
> exists, which means it's just possible to use the open(mount_path,
> O_PATH) to obtain the base fd.

No, you can't.  A path more correspond to multiple mounts stacked on top of
each other, e.g.:

	mount -t tmpfs none /mnt
	mount -t tmpfs none /mnt
	mount -t tmpfs none /mnt

Now you have three co-located mounts and you can't use the path to
differentiate them.  I think this might be an issue in autofs, but Ian would
need to comment on that.

David

Miklos Szeredi Aug. 12, 2020, 1:09 p.m. UTC | #48

On Wed, Aug 12, 2020 at 12:14 PM Karel Zak <kzak@redhat.com> wrote:

> For example,  by fsinfo(FSINFO_ATTR_MOUNT_TOPOLOGY) you get all
> mountpoint propagation setting and relations by one syscall,

That's just an arbitrary grouping of attributes.

You said yourself, that what's really needed is e.g. consistent
snapshot of a complete mount tree topology.  And to get the complete
topology FSINFO_ATTR_MOUNT_TOPOLOGY and FSINFO_ATTR_MOUNT_CHILDREN are
needed for *each* individual mount.  The topology can obviously change
between those calls.

So there's no fundamental difference between getting individual
attributes or getting attribute groups in this respect.

> It would be also nice to avoid some strings formatting and separators
> like we use in the current mountinfo.

I think quoting non-printable is okay.

> I can imagine multiple values separated by binary header (like we already
> have for watch_notification, inotify, etc):

Adding a few generic binary interfaces is okay.   Adding many
specialized binary interfaces is a PITA.

Thanks,
Miklos

David Howells Aug. 12, 2020, 1:33 p.m. UTC | #49

Miklos Szeredi <miklos@szeredi.hu> wrote:

> You said yourself, that what's really needed is e.g. consistent
> snapshot of a complete mount tree topology.  And to get the complete
> topology FSINFO_ATTR_MOUNT_TOPOLOGY and FSINFO_ATTR_MOUNT_CHILDREN are
> needed for *each* individual mount.

That's not entirely true.

FSINFO_ATTR_MOUNT_ALL can be used instead of FSINFO_ATTR_MOUNT_CHILDREN if you
want to scan an entire subtree in one go.  It returns the same record type.

The result from ALL/CHILDREN includes sufficient information to build the
tree.  That only requires the parent ID.  All the rest of the information
TOPOLOGY exposes is to do with propagation.

Now, granted, I didn't include all of the topology info in the records
returned by ALL/CHILDREN because I don't expect it to change very often.  But
you can check the event counter supplied with each record to see if it might
have changed - and then call TOPOLOGY on the ones that changed.

If it simplifies life, I could add the propagation info into ALL/CHILDREN so
that you only need to call ALL to scan everything.  It requires larger
buffers, however.

> Adding a few generic binary interfaces is okay.   Adding many
> specialized binary interfaces is a PITA.

Text interfaces are also a PITA, especially when you may get multiple pieces
of information returned in one buffer and especially when you throw in
character escaping.  Of course, we can do it - and we do do it all over - but
that doesn't make it efficient.

David

David Howells Aug. 12, 2020, 1:54 p.m. UTC | #50

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> IOW, if you do something more along the lines of
> 
>        fd = open(""foo/bar", O_PATH);
>        metadatafd = openat(fd, "metadataname", O_ALT);
> 
> it might be workable.

What is it going to walk through?  You need to end up with an inode and dentry
from somewhere.

It sounds like this would have to open up a procfs-like magic filesystem, and
walk into it.  But how would that actually work?  Would you create a new
superblock each time you do this, labelled with the starting object (say the
dentry for "foo/bar" in this case), and then walk from the root?

An alternative, maybe, could be to make a new dentry type, say, and include it
in the superblock of the object being queried - and let the filesystems deal
with it.  That would mean that non-dir dentries would then have virtual
children.  You could then even use this to implement resource forks...

Another alternative would be to note O_ALT and then skip pathwalk entirely,
but just use the name as a key to the attribute, creating an anonfd to read
it.  But then why use openat() at all?  You could instead do:

	metadatafd = openmeta(fd, "metadataname");

and save the page flag.  You could even merge the two opens and do:

	metadatafd = openmeta("foo/bar", "metadataname");

Why not even combine this with Miklos's readfile() idea:

	readmeta(AT_FDCWD, "foo/bar", "metadataname", buf, sizeof(buf));

and we're now down to one syscall and no fds and you don't even need a magic
filesystem to make it work.

There's another consideration too: Paths are not unique handles to mounts.
It's entirely possible to have colocated mounts.  We need to be able to query
all the mounts on a mountpoint.

David

Miklos Szeredi Aug. 12, 2020, 1:54 p.m. UTC | #51

On Wed, Aug 12, 2020 at 3:33 PM David Howells <dhowells@redhat.com> wrote:
>
> Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> > You said yourself, that what's really needed is e.g. consistent
> > snapshot of a complete mount tree topology.  And to get the complete
> > topology FSINFO_ATTR_MOUNT_TOPOLOGY and FSINFO_ATTR_MOUNT_CHILDREN are
> > needed for *each* individual mount.
>
> That's not entirely true.
>
> FSINFO_ATTR_MOUNT_ALL can be used instead of FSINFO_ATTR_MOUNT_CHILDREN if you
> want to scan an entire subtree in one go.  It returns the same record type.
>
> The result from ALL/CHILDREN includes sufficient information to build the
> tree.  That only requires the parent ID.  All the rest of the information
> TOPOLOGY exposes is to do with propagation.
>
> Now, granted, I didn't include all of the topology info in the records
> returned by ALL/CHILDREN because I don't expect it to change very often.  But
> you can check the event counter supplied with each record to see if it might
> have changed - and then call TOPOLOGY on the ones that changed.

IDGI, you have all these interfaces but how will they be used?

E.g. one wants to build a consistent topology together with
propagation and attributes.   That would start with
FSINFO_ATTR_MOUNT_ALL, then iterate the given mounts calling
FSINFO_ATTR_MOUNT_INFO and FSINFO_ATTR_MOUNT_TOPOLOGY for each.  Then
when done, check the subtree notification counter with
FSINFO_ATTR_MOUNT_INFO on the top one  to see if anything has changed
in the meantime.  If it has, the whole process needs to be restarted
to see which has been changed (unless notification is also enabled).
How does the atomicity of FSINFO_ATTR_MOUNT_ALL help with that?  The
same could be done with just FSINFO_ATTR_MOUNT_CHILDREN.

And more importantly does level of consistency matter at all?  There's
no such thing for directory trees, why are mount trees different in
this respect?

> Text interfaces are also a PITA, especially when you may get multiple pieces
> of information returned in one buffer and especially when you throw in
> character escaping.  Of course, we can do it - and we do do it all over - but
> that doesn't make it efficient.

Agreed.  The format of text interfaces matters very much.

Thanks,
Miklos

Miklos Szeredi Aug. 12, 2020, 2:10 p.m. UTC | #52

On Wed, Aug 12, 2020 at 3:54 PM David Howells <dhowells@redhat.com> wrote:
>
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > IOW, if you do something more along the lines of
> >
> >        fd = open(""foo/bar", O_PATH);
> >        metadatafd = openat(fd, "metadataname", O_ALT);
> >
> > it might be workable.
>
> What is it going to walk through?  You need to end up with an inode and dentry
> from somewhere.
>
> It sounds like this would have to open up a procfs-like magic filesystem, and
> walk into it.  But how would that actually work?  Would you create a new
> superblock each time you do this, labelled with the starting object (say the
> dentry for "foo/bar" in this case), and then walk from the root?
>
> An alternative, maybe, could be to make a new dentry type, say, and include it
> in the superblock of the object being queried - and let the filesystems deal
> with it.  That would mean that non-dir dentries would then have virtual
> children.  You could then even use this to implement resource forks...
>
> Another alternative would be to note O_ALT and then skip pathwalk entirely,
> but just use the name as a key to the attribute, creating an anonfd to read
> it.  But then why use openat() at all?  You could instead do:
>
>         metadatafd = openmeta(fd, "metadataname");
>
> and save the page flag.  You could even merge the two opens and do:
>
>         metadatafd = openmeta("foo/bar", "metadataname");
>
> Why not even combine this with Miklos's readfile() idea:
>
>         readmeta(AT_FDCWD, "foo/bar", "metadataname", buf, sizeof(buf));

And writemeta() and createmeta() and readdirmeta() and ...

The point is that generic operations already exist and no need to add
new, specialized ones to access metadata.

Thanks,
Miklos

David Howells Aug. 12, 2020, 2:23 p.m. UTC | #53

Miklos Szeredi <miklos@szeredi.hu> wrote:

> The point is that generic operations already exist and no need to add
> new, specialized ones to access metadata.

open and read already exist, yes, but the metadata isn't currently in
convenient inodes and dentries that you can just walk through.  So you're
going to end up with a specialised filesystem instead, I suspect.  Basically,
it's the same as your do-everything-through-/proc/self/fds/ approach.

And it's going to be heavier.  I don't know if you're planning on creating a
superblock each time you do an O_ALT open, but you will end up creating some
inodes, dentries and a file - even before you get to the reading bit.

David

Al Viro Aug. 12, 2020, 2:39 p.m. UTC | #54

On Wed, Aug 12, 2020 at 09:23:23AM +0200, Miklos Szeredi wrote:

> Anyway, starting with just introducing the alt namespace without
> unification seems to be a good first step. If that turns out to be
> workable, we can revisit unification later.

Start with coming up with answers to the questions on semantics
upthread.  To spare you the joy of digging through the branches
of that thread, how's that for starters?

"Can those suckers be passed to
...at() as starting points?  Can they be bound in namespace?
Can something be bound *on* them?  What do they have for inodes
and what maintains their inumbers (and st_dev, while we are at
it)?  Can _they_ have secondaries like that (sensu Swift)?
Is that a flat space, or can they be directories?"

Miklos Szeredi Aug. 12, 2020, 2:46 p.m. UTC | #55

On Wed, Aug 12, 2020 at 4:40 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Wed, Aug 12, 2020 at 09:23:23AM +0200, Miklos Szeredi wrote:
>
> > Anyway, starting with just introducing the alt namespace without
> > unification seems to be a good first step. If that turns out to be
> > workable, we can revisit unification later.
>
> Start with coming up with answers to the questions on semantics
> upthread.  To spare you the joy of digging through the branches
> of that thread, how's that for starters?
>
> "Can those suckers be passed to
> ...at() as starting points?

No.

>  Can they be bound in namespace?

No.

> Can something be bound *on* them?

No.

>  What do they have for inodes
> and what maintains their inumbers (and st_dev, while we are at
> it)?

Irrelevant.  Can be some anon dev + shared inode.

The only attribute of an attribute that I can think of that makes
sense would be st_size, but even that is probably unimportant.

>  Can _they_ have secondaries like that (sensu Swift)?

Reference?

> Is that a flat space, or can they be directories?"

Yes it has a directory tree.   But you can't mkdir, rename, link,
symlink, etc on anything in there.

Thanks,
Miklos

Al Viro Aug. 12, 2020, 3:08 p.m. UTC | #56

On Wed, Aug 12, 2020 at 04:46:20PM +0200, Miklos Szeredi wrote:

> > "Can those suckers be passed to
> > ...at() as starting points?
> 
> No.

Lovely.  And what of fchdir() to those?  Are they all non-directories?
Because the starting point of ...at() can be simulated that way...

> >  Can they be bound in namespace?
> 
> No.
> 
> > Can something be bound *on* them?
> 
> No.
> 
> >  What do they have for inodes
> > and what maintains their inumbers (and st_dev, while we are at
> > it)?
> 
> Irrelevant.  Can be some anon dev + shared inode.
> 
> The only attribute of an attribute that I can think of that makes
> sense would be st_size, but even that is probably unimportant.
> 
> >  Can _they_ have secondaries like that (sensu Swift)?
> 
> Reference?

http://www.online-literature.com/swift/3515/
	So, naturalists observe, a flea
	Has smaller fleas that on him prey;
	And these have smaller still to bite 'em,
	And so proceed ad infinitum.
of course ;-)
IOW, can the things in those trees have secondary trees on them, etc.?
Not "will they have it in your originally intended use?" - "do we need
the architecture of the entire thing to be capable to deal with that?"

> > Is that a flat space, or can they be directories?"
> 
> Yes it has a directory tree.   But you can't mkdir, rename, link,
> symlink, etc on anything in there.

That kills the "shared inode" part - you'll get deadlocks from
hell that way.  "Can't mkdir" doesn't save you from that.  BTW,
what of unlink()?  If the tree shape is not a hardwired constant,
you get to decide how it's initially populated...

Next: what will that tree be attached to?  As in, "what's the parent
of its root"?  And while we are at it, what will be the struct mount
used with those - same as the original file, something different
attached to it, something created on the fly for each pathwalk and
lazy-umounted?  And see above re fchdir() - if they can be directories,
it's very much in the game.

Miklos Szeredi Aug. 12, 2020, 3:13 p.m. UTC | #57

On Wed, Aug 12, 2020 at 5:08 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Wed, Aug 12, 2020 at 04:46:20PM +0200, Miklos Szeredi wrote:
>
> > > "Can those suckers be passed to
> > > ...at() as starting points?
> >
> > No.
>
> Lovely.  And what of fchdir() to those?

Not allowed.

> Are they all non-directories?
> Because the starting point of ...at() can be simulated that way...
>
> > >  Can they be bound in namespace?
> >
> > No.
> >
> > > Can something be bound *on* them?
> >
> > No.
> >
> > >  What do they have for inodes
> > > and what maintains their inumbers (and st_dev, while we are at
> > > it)?
> >
> > Irrelevant.  Can be some anon dev + shared inode.
> >
> > The only attribute of an attribute that I can think of that makes
> > sense would be st_size, but even that is probably unimportant.
> >
> > >  Can _they_ have secondaries like that (sensu Swift)?
> >
> > Reference?
>
> http://www.online-literature.com/swift/3515/
>         So, naturalists observe, a flea
>         Has smaller fleas that on him prey;
>         And these have smaller still to bite 'em,
>         And so proceed ad infinitum.
> of course ;-)
> IOW, can the things in those trees have secondary trees on them, etc.?
> Not "will they have it in your originally intended use?" - "do we need
> the architecture of the entire thing to be capable to deal with that?"

No.

>
> > > Is that a flat space, or can they be directories?"
> >
> > Yes it has a directory tree.   But you can't mkdir, rename, link,
> > symlink, etc on anything in there.
>
> That kills the "shared inode" part - you'll get deadlocks from
> hell that way.

No.  The shared inode is not for lookup, just for the open file.

>  "Can't mkdir" doesn't save you from that.  BTW,
> what of unlink()?  If the tree shape is not a hardwired constant,
> you get to decide how it's initially populated...
>
> Next: what will that tree be attached to?  As in, "what's the parent
> of its root"?  And while we are at it, what will be the struct mount
> used with those - same as the original file, something different
> attached to it, something created on the fly for each pathwalk and
> lazy-umounted?  And see above re fchdir() - if they can be directories,
> it's very much in the game.

Why does it have to have a struct mount?  It does not have to use
dentry/mount based path lookup.

Thanks,
Miklos

David Howells Aug. 12, 2020, 3:22 p.m. UTC | #58

Miklos Szeredi <miklos@szeredi.hu> wrote:

> Why does it have to have a struct mount?  It does not have to use
> dentry/mount based path lookup.

file->f_path.mnt

David

Al Viro Aug. 12, 2020, 4:33 p.m. UTC | #59

On Wed, Aug 12, 2020 at 05:13:14PM +0200, Miklos Szeredi wrote:

> > Lovely.  And what of fchdir() to those?
> 
> Not allowed.

Not allowed _how_?  Existing check is "is it a directory"; what do you
propose?  IIRC, you've mentioned using readdir() in that context, so
it's not that you only allow to open the leaves there.

> > > > Is that a flat space, or can they be directories?"
> > >
> > > Yes it has a directory tree.   But you can't mkdir, rename, link,
> > > symlink, etc on anything in there.
> >
> > That kills the "shared inode" part - you'll get deadlocks from
> > hell that way.
> 
> No.  The shared inode is not for lookup, just for the open file.

Bloody hell...  So what inodes are you using for lookups?  And that
thing you would be passing to readdir() - what inode will _that_ have?

> > Next: what will that tree be attached to?  As in, "what's the parent
> > of its root"?  And while we are at it, what will be the struct mount
> > used with those - same as the original file, something different
> > attached to it, something created on the fly for each pathwalk and
> > lazy-umounted?  And see above re fchdir() - if they can be directories,
> > it's very much in the game.
> 
> Why does it have to have a struct mount?  It does not have to use
> dentry/mount based path lookup.

What the fuck?  So we suddenly get an additional class of objects
serving as kinda-sorta analogues of dentries *AND* now struct file
might refer to that instead of a dentry/mount pair - all on the VFS
level?  And so do all the syscalls you want to allow for such "pathnames"?

Sure, that avoids all questions about dcache interactions - by growing
a replacement layer and making just about everything in fs/namei.c,
fs/open.c, etc. special-case the handling of that crap.

But yes, the syscall-level interface will be simple.  Wonderful.

I really hope that's not what you have in mind, though.

Miklos Szeredi Aug. 12, 2020, 5:16 p.m. UTC | #60

On Wed, Aug 12, 2020 at 6:33 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Wed, Aug 12, 2020 at 05:13:14PM +0200, Miklos Szeredi wrote:

> > Why does it have to have a struct mount?  It does not have to use
> > dentry/mount based path lookup.
>
> What the fuck?  So we suddenly get an additional class of objects
> serving as kinda-sorta analogues of dentries *AND* now struct file
> might refer to that instead of a dentry/mount pair - all on the VFS
> level?  And so do all the syscalls you want to allow for such "pathnames"?

The only syscall I'd want to allow is open, everything else would be
on the open files themselves.

file->f_path can refer to an anon mount/inode, the real object is
referred to by file->private_data.

The change to namei.c would be on the order of ~10 lines.  No other
parts of the VFS would be affected.   Maybe I'm optimistic; we'll
see...

Now off to something completely different.  Back on Tuesday.

Thanks,
Miklos

Al Viro Aug. 12, 2020, 5:39 p.m. UTC | #61

On Wed, Aug 12, 2020 at 07:16:37PM +0200, Miklos Szeredi wrote:
> On Wed, Aug 12, 2020 at 6:33 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > On Wed, Aug 12, 2020 at 05:13:14PM +0200, Miklos Szeredi wrote:
> 
> > > Why does it have to have a struct mount?  It does not have to use
> > > dentry/mount based path lookup.
> >
> > What the fuck?  So we suddenly get an additional class of objects
> > serving as kinda-sorta analogues of dentries *AND* now struct file
> > might refer to that instead of a dentry/mount pair - all on the VFS
> > level?  And so do all the syscalls you want to allow for such "pathnames"?
> 
> The only syscall I'd want to allow is open, everything else would be
> on the open files themselves.
> 
> file->f_path can refer to an anon mount/inode, the real object is
> referred to by file->private_data.
> 
> The change to namei.c would be on the order of ~10 lines.  No other
> parts of the VFS would be affected.

If some of the things you open are directories (and you *have* said that
directories will be among those just upthread, and used references to
readdir() as argument in favour of your approach elsewhere in the thread),
you will have to do something about fchdir().  And that's the least of
the issues.

>   Maybe I'm optimistic; we'll
> see...


> Now off to something completely different.  Back on Tuesday.

... after the window closes.  You know, it's really starting to look
like rather nasty tactical games...

Linus Torvalds Aug. 12, 2020, 6:18 p.m. UTC | #62

On Tue, Aug 11, 2020 at 5:05 PM David Howells <dhowells@redhat.com> wrote:
>
> Well, the start of it was my proposal of an fsinfo() system call.

Ugh. Ok, it's that thing.

This all seems *WAY* over-designed - both your fsinfo and Miklos' version.

What's wrong with fstatfs()? All the extra magic metadata seems to not
really be anything people really care about.

What people are actually asking for seems to be some unique mount ID,
and we have 16 bytes of spare information in 'struct statfs64'.

All the other fancy fsinfo stuff seems to be "just because", and like
complete overdesign.

Let's not add system calls just because we can.

             Linus

Al Viro Aug. 12, 2020, 6:33 p.m. UTC | #63

On Wed, Aug 12, 2020 at 06:39:11PM +0100, Al Viro wrote:
> On Wed, Aug 12, 2020 at 07:16:37PM +0200, Miklos Szeredi wrote:
> > On Wed, Aug 12, 2020 at 6:33 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > On Wed, Aug 12, 2020 at 05:13:14PM +0200, Miklos Szeredi wrote:
> > 
> > > > Why does it have to have a struct mount?  It does not have to use
> > > > dentry/mount based path lookup.
> > >
> > > What the fuck?  So we suddenly get an additional class of objects
> > > serving as kinda-sorta analogues of dentries *AND* now struct file
> > > might refer to that instead of a dentry/mount pair - all on the VFS
> > > level?  And so do all the syscalls you want to allow for such "pathnames"?
> > 
> > The only syscall I'd want to allow is open, everything else would be
> > on the open files themselves.
> > 
> > file->f_path can refer to an anon mount/inode, the real object is
> > referred to by file->private_data.
> > 
> > The change to namei.c would be on the order of ~10 lines.  No other
> > parts of the VFS would be affected.
> 
> If some of the things you open are directories (and you *have* said that
> directories will be among those just upthread, and used references to
> readdir() as argument in favour of your approach elsewhere in the thread),
> you will have to do something about fchdir().  And that's the least of
> the issues.

BTW, what would such opened files look like from /proc/*/fd/* POV?  And
what would happen if you walk _through_ that symlink, with e.g. ".."
following it?  Or with names of those attributes, for that matter...
What about a normal open() of such a sucker?  It won't know where to
look for your ->private_data...

FWIW, you keep refering to regularity of this stuff from the syscall
POV, but it looks like you have no real idea of what subset of the
things available for normal descriptors will be available for those.

Steven Whitehouse Aug. 12, 2020, 7:34 p.m. UTC | #64

Hi,

On 12/08/2020 19:18, Linus Torvalds wrote:
> On Tue, Aug 11, 2020 at 5:05 PM David Howells <dhowells@redhat.com> wrote:
>> Well, the start of it was my proposal of an fsinfo() system call.
> Ugh. Ok, it's that thing.
>
> This all seems *WAY* over-designed - both your fsinfo and Miklos' version.
>
> What's wrong with fstatfs()? All the extra magic metadata seems to not
> really be anything people really care about.
>
> What people are actually asking for seems to be some unique mount ID,
> and we have 16 bytes of spare information in 'struct statfs64'.
>
> All the other fancy fsinfo stuff seems to be "just because", and like
> complete overdesign.
>
> Let's not add system calls just because we can.
>
>               Linus
>

The point of this is to give us the ability to monitor mounts from 
userspace. The original inspiration was rtnetlink, in that we need a 
"dump" operation to give us a snapshot of the current mount state, plus 
then a stream of events which allow us to keep that state updated. The 
tricky question is what happens in case of overflow of the events queue, 
and just like netlink, that needs a resync of the current state to fix 
that, since we can't block mounts, of course.

The fsinfo syscall was designed to be the "dump" operation in this 
system. David's other patch set provides the stream of events. So the 
two are designed to work together. We had the discussion on using 
netlink, of whatever form a while back, and there are a number of 
reasons why that doesn't work (namespace being one).

I think fstatfs might also suffer from the issue of not being easy to 
call on things for which you have no path (e.g. over-mounted mounts) 
Plus we need to know which paths to query, which is why we need to 
enumerate the mounts in the first place - how would we get the fds for 
each mount? It might give you some sb info, but it doesn't tell you the 
options that the sb is mounted with, and it doesn't tell you where it is 
mounted either.

The overall aim is to solve some issues relating to scaling to large 
numbers of mount in systemd and autofs, and also to provide a 
generically useful interface that other tools may use to monitor mounts 
in due course too. Currently parsing /proc/mounts is the only option, 
and that tends to be slow and is certainly not atomic. Extension to 
other sb related messages is a future goal, quota being one possible 
application for the notifications.

If there is a simpler way to get to that goal, then thats all to the 
good, and we should definitely consider it,

Steve.

Linus Torvalds Aug. 12, 2020, 7:50 p.m. UTC | #65

On Wed, Aug 12, 2020 at 12:34 PM Steven Whitehouse <swhiteho@redhat.com> wrote:
>
> The point of this is to give us the ability to monitor mounts from
> userspace.

We haven't had that before, I don't see why it's suddenly such a big deal.

The notification side I understand. Polling /proc files is not the answer.

But the whole "let's design this crazy subsystem for it" seems way
overkill. I don't see anybody caring that deeply.

It really smells like "do it because we can, not because we must".

Who the hell cares about monitoring mounts at a kHz frequencies? If
this is for MIS use, you want a nice GUI and not wasting CPU time
polling.

I'm starting to ignore the pull requests from David Howells, because
by now they have had the same pattern for a couple of years now:
esoteric new interfaces that seem overdesigned for corner-cases that
I'm not seeing people clamoring for.

I need (a) proof this is actualyl something real users care about and
(b) way more open discussion and implementation from multiple parties.

Because right now it looks like a small in-cabal of a couple of people
who have wild ideas but I'm not seeing the wider use of it.

Convince me otherwise. AGAIN. This is the exact same issue I had with
the notification queues that I really wanted actual use-cases for, and
feedback from actual outside users.

I really think this is engineering for its own sake, rather than
responding to actual user concerns.

               Linus

Al Viro Aug. 12, 2020, 9:30 p.m. UTC | #66

On Wed, Aug 12, 2020 at 07:33:26PM +0100, Al Viro wrote:

> BTW, what would such opened files look like from /proc/*/fd/* POV?  And
> what would happen if you walk _through_ that symlink, with e.g. ".."
> following it?  Or with names of those attributes, for that matter...
> What about a normal open() of such a sucker?  It won't know where to
> look for your ->private_data...
> 
> FWIW, you keep refering to regularity of this stuff from the syscall
> POV, but it looks like you have no real idea of what subset of the
> things available for normal descriptors will be available for those.

Another question: what should happen with that sucker on umount of
the filesystem holding the underlying object?  Should it be counted
as pinning that fs?

Who controls what's in that tree?  If we plan to have xattrs there,
will they be in a flat tree, or should it mirror the hierarchy of
xattrs?  When is it populated?  open() time?  What happens if we
add/remove an xattr after that point?

If we open the same file several times, what should we get?  A full
copy of the tree every time, with all coherency being up to whatever's
putting attributes there?

What are the permissions needed to do lookups in that thing?

All of that is about semantics and the answers are needed before we
start looking into implementations.  "Whatever my implementation
does" is _not_ a good way to go, especially since that'll be cast
in stone as soon as API becomes exposed to userland...

Ian Kent Aug. 13, 2020, 1:01 a.m. UTC | #67

On Wed, 2020-08-12 at 14:06 +0100, David Howells wrote:
> Miklos Szeredi <miklos@szeredi.hu> wrote:
> 
> > That presumably means the mount ID <-> mount path mapping already
> > exists, which means it's just possible to use the open(mount_path,
> > O_PATH) to obtain the base fd.
> 
> No, you can't.  A path more correspond to multiple mounts stacked on
> top of
> each other, e.g.:
> 
> 	mount -t tmpfs none /mnt
> 	mount -t tmpfs none /mnt
> 	mount -t tmpfs none /mnt
> 
> Now you have three co-located mounts and you can't use the path to
> differentiate them.  I think this might be an issue in autofs, but
> Ian would
> need to comment on that.

It is a problem for autofs, direct mounts in particular, but also
for mount ordering at times when umounting a tree of mounts where
mounts are covered or at shutdown.

Ian

Ian Kent Aug. 13, 2020, 3:44 a.m. UTC | #68

On Wed, 2020-08-12 at 12:50 -0700, Linus Torvalds wrote:
> On Wed, Aug 12, 2020 at 12:34 PM Steven Whitehouse <
> swhiteho@redhat.com> wrote:
> > The point of this is to give us the ability to monitor mounts from
> > userspace.
> 
> We haven't had that before, I don't see why it's suddenly such a big
> deal.

Because there's a trend occurring in user space where there are
frequent and persistent mount changes that cause high overhead.

I've seen the number of problems building up over the last few months
that are essentially the same problem that I wanted to resolve. And
that's related to side effects of autofs using a large number of
mounts.

The problems are real.

> 
> The notification side I understand. Polling /proc files is not the
> answer.

Yep, that's one aspect, getting the information about a mount without
reading the entire mount table seems like the sensible thing to do to
allow for a more efficient notification mechanism.

> 
> But the whole "let's design this crazy subsystem for it" seems way
> overkill. I don't see anybody caring that deeply.
> 
> It really smells like "do it because we can, not because we must".
> 
> Who the hell cares about monitoring mounts at a kHz frequencies? If
> this is for MIS use, you want a nice GUI and not wasting CPU time
> polling.

That part of the problem still remains.

The kernel sending a continuous stream of wake ups under load does
also introduce a resource problem but that's probably something to
handle in user space.

> 
> I'm starting to ignore the pull requests from David Howells, because
> by now they have had the same pattern for a couple of years now:
> esoteric new interfaces that seem overdesigned for corner-cases that
> I'm not seeing people clamoring for.
> 
> I need (a) proof this is actualyl something real users care about and
> (b) way more open discussion and implementation from multiple
> parties.
> 
> Because right now it looks like a small in-cabal of a couple of
> people
> who have wild ideas but I'm not seeing the wider use of it.
> 
> Convince me otherwise. AGAIN. This is the exact same issue I had with
> the notification queues that I really wanted actual use-cases for,
> and
> feedback from actual outside users.
> 
> I really think this is engineering for its own sake, rather than
> responding to actual user concerns.
> 
>                Linus

Jeffrey E Altman Aug. 13, 2020, 3:53 a.m. UTC | #69

On 8/12/2020 2:18 PM, Linus Torvalds (torvalds@linux-foundation.org) wrote:
> What's wrong with fstatfs()? All the extra magic metadata seems to not
> really be anything people really care about.
> 
> What people are actually asking for seems to be some unique mount ID,
> and we have 16 bytes of spare information in 'struct statfs64'.
> 
> All the other fancy fsinfo stuff seems to be "just because", and like
> complete overdesign.

Hi Linus,

Is there any existing method by which userland applications can
determine the properties of the filesystem in which a directory or file
is stored in a filesystem agnostic manner?

Over the past year I've observed the opendev/openstack community
struggle with performance issues caused by rsync's inability to
determine if the source and destination object's last update time have
the same resolution and valid time range.  If the source file system
supports 100 nanosecond granularity and the destination file system
supports one second granularity, any source file with a non-zero
fractional seconds timestamp will appear to have changed compared to the
copy in the destination filesystem which discarded the fractional
seconds during the last sync.  Sure, the end user could use the
--modify-window=1 option to inform rsync to add fuzz to the comparisons,
but that introduces the possibility that a file updated a fraction of a
second after an rsync execution would not synchronize the file on the
next run when both source and target have fine grained timestamps.  If
the userland sync processes have access to the source and destination
filesystem time capabilities, they can make more intelligent decisions
without explicit user input.  At a minimum, the timestamp properties
that are important to know include the range of valid timestamps and the
resolution.  Some filesystems support unsigned 32-bit time starting with
UNIX epoch.  Others signed 32-bit time with UNIX epoch.  Still others
FAT, NTFS, etc use alternative epochs and range and resolutions.

Another case where lack of filesystem properties is problematic is "df
--local" which currently relies upon string comparisons of file system
name strings to determine if the underlying file system is local or
remote.  This requires that the gnulib maintainers have knowledge of all
file systems implementations, their published names, and which category
they belong to.  Patches have been accepted in the past year to add
"smb3", "afs", and "gpfs" to the list of remote file systems.  There are
many more remote filesystems that have yet to be added including
"cephfs", "lustre", "gluster", etc.

In many cases, the filesystem properties cannot be inferred from the
filesystem name.  For network file systems, these properties might
depend upon the remote server capabilities or even the properties
associated with a particular volume or share.  Consider the case of a
remote file server that supports 64-bit 100ns time but which for
backward compatibility exports certain volumes or shares with more
restrictive capabilities. Or the case of a network file system protocol
that has evolved over time and gained new capabilities.

For the AFS community, fsinfo offers a method of exposing some server
and volume properties that are obtained via "path ioctls" in OpenAFS and
AuriStorFS.  Some example of properties that might be exposed include
answers to questions such as:

 * what is the volume cell id? perhaps a uuid.
 * what is the volume id in the cell? unsigned 64-bit integer
 * where is a mounted volume hosted? which fileservers, named by uuid
 * what is the block size? 1K, 4K, ...
 * how many blocks are in use or available?
 * what is the quota (thin provisioning), if any?
 * what is the reserved space (fat provisioning), if any?
 * how many vnodes are present?
 * what is the vnode count limit, if any?
 * when was the volume created and last updated?
 * what is the file size limit?
 * are byte range locks supported?
 * are mandatory locks supported?
 * how many entries can be created within a directory?
 * are cross-directory hard links supported?
 * are directories just-send-8, case-sensitive, case-preserving, or
   case-insensitive?
 * if not just-send-8, what character set is used?
 * if Unicode, what normalization rules? etc.
 * are per-object acls supported?
 * what volume maximum acl is assigned, if any?
 * what volume security policy (authn, integ, priv) is assigned, if any?
 * what is the replication policy, if any?
 * what is the volume encryption policy, if any?
 * what is the volume compression policy, if any?
 * are server-to-server copies supported?
 * which of atime, ctime and mtime does the volume support?
 * what is the permitted timestamp range and resolution?
 * are xattrs supported?
 * what is the xattr maximum name length?
 * what is the xattr maximum object size?
 * is the volume currently reachable?
 * is the volume immutable?
 * etc ...

Its true that there isn't widespread use of these filesystem properties
by today's userland applications but that might be due to the lack of
standard interfaces necessary to acquire the information.  For example,
userland frameworks for parallel i/o HPC applications such as HDF5,
PnetCDF and ROMIO require each supported filesystem to provide its own
proprietary "driver" which does little more than expose the filesystem
properties necessary to optimize the layout of file stream data
structures.  With something like "fsinfo" it would be much easier to
develop these HPC frameworks in a filesystem agnostic manner.  This
would permit applications built upon these frameworks to use the best
Linux filesystem available for the workload and not simply the ones for
which proprietary "drivers" have been published.

Although I am sympathetic to the voices in the community that would
prefer to start over with a different architectural approach, David's
fsinfo has been under development for more than two years.  It has not
been developed in a vacuum but in parallel with other kernel components
that have been merged during that time frame.  From my reading of this
thread and those that preceded it, fsinfo has also been developed with
input from significant userland development communities that intend to
leverage the syscall interface as soon as it becomes available.  The
March 2020 discussion of fsinfo received positive feedback not only from
within Red Hat but from other parties as well.

Since no one stepped up to provide an alternative approach in the last
five months, how long should those that desire access to the
functionality be expected to wait for it?

What is the likelihood that an alternative robust solution will be
available in the next merge window or two?

Is the design so horrid that it is better to go without the
functionality than to live with the imperfections?

I for one would like to see this functionality be made available sooner
rather than later.  I know my end users would benefit from the
availability of fsinfo.

Thank you for listening.  Stay healthy and safe, and please wear a mask.

Jeffrey Altman

Karel Zak Aug. 13, 2020, 8:52 a.m. UTC | #70

On Wed, Aug 12, 2020 at 02:43:32PM +0200, Miklos Szeredi wrote:
> On Wed, Aug 12, 2020 at 1:28 PM Karel Zak <kzak@redhat.com> wrote:
> 
> > The proposal is based on paths and open(), how do you plan to deal
> > with mount IDs? David's fsinfo() allows to ask for mount info by mount
> > ID and it works well with mount notification where you get the ID. The
> > collaboration with notification interface is critical for our use-cases.
> 
> One would use the notification to keep an up to date set of attributes
> for each watched mount, right?
> 
> That presumably means the mount ID <-> mount path mapping already
> exists, which means it's just possible to use the open(mount_path,
> O_PATH) to obtain the base fd.

The notification also reports new mount nodes, so we have no mount ID
<-> mount path mapping in userspace yet.

The another problem is that open(path) cannot be used if you have multiple
filesystems on the same mount point -- in this case (at least in theory)
you can get ID for by-path inaccessible filesystem.

> A new syscall that returns an fd pointing to the root of the mount
> might be the best solution:
> 
>    int open_mount(int root_fd, u64 mntid, int flags);

Yes, something like this is necessary. You do not want to depend
on paths if you want to read information about mountpoints.

 Karel

Karel Zak Aug. 13, 2020, 10:36 a.m. UTC | #71

On Wed, Aug 12, 2020 at 12:50:28PM -0700, Linus Torvalds wrote:
> Convince me otherwise. AGAIN. This is the exact same issue I had with
> the notification queues that I really wanted actual use-cases for, and
> feedback from actual outside users.

I thought (in last 10 years) we all agree that /proc/self/mountinfo is
the expensive, ineffective and fragile way how to deliver information to
userspace.

We have systems with thousands of mountpoints and compose mountinfo in
kernel and again parse it in userspace takes time and it's strange if
you need info about just one mountpoint. 

Unfortunately, the same systems modify the huge mount table extremely
often, because it starts/stops large number of containers and every 
container means a mount operation(s).

In this crazy environment, we have userspace tools like systemd or udisk 
which react to VFS changes and there is no elegant way how to get
details about a modified mount node from kernel.

And of course we already have negative feedback from users who
maintain large systems -- mountinfo returns inconsistent data if you
read it by more read() calls (hopefully fixed by recent Miklos'
mountinfo cursors); system is pretty busy to compose+parse mountinfo,
etc.

> I really think this is engineering for its own sake, rather than
> responding to actual user concerns.

We're too old and too lazy for "engineering for its own sake" :-)
there is pressure from users ...

Maybe David's fsinfo() sucks, but it does not mean that
/proc/self/mountinfo is something cool. Right?

We have to dig deep grave for /proc/self/mountinfo ...

    Karel

Lennart Poettering Aug. 14, 2020, 7:58 a.m. UTC | #72

On Mi, 12.08.20 12:50, Linus Torvalds (torvalds@linux-foundation.org) wrote:

> On Wed, Aug 12, 2020 at 12:34 PM Steven Whitehouse <swhiteho@redhat.com> wrote:
> >
> > The point of this is to give us the ability to monitor mounts from
> > userspace.
>
> We haven't had that before, I don't see why it's suddenly such a big deal.
>
> The notification side I understand. Polling /proc files is not the answer.
>
> But the whole "let's design this crazy subsystem for it" seems way
> overkill. I don't see anybody caring that deeply.
>
> It really smells like "do it because we can, not because we must".

With my systemd maintainer hat on (and of other userspace stuff),
there's a couple of things I really want from the kernel because it
would fix real problems for us:

1. we want mount notifications that don't require to scan
   /proc/self/mountinfo entirely again every time things change, over
   and over again, simply because that doesn't scale. We have various
   bugs open about this performance bottleneck, I could point you to,
   but I figure it's easy to see why this currently doesn't scale...

2. We want an unpriv API to query (and maybe set) the fs UUID, like we
   have nowadays for the fs label FS_IOC_[GS]ETFSLABEL

3. We want an API to query time granularity of file systems
   timestamps. Otherwise it's so hard in userspace to reproducibly
   re-generate directory trees. We need to know for example that some
   fs only has 2s granularity (like fat).

4. Similar, we want to know if an fs is case-sensitive for file
   names. Or case-preserving. And which charset it accepts for filenames.

5. We want to know if a file system supports access modes, xattrs,
   file ownership, device nodes, symlinks, hardlinks, fifos, atimes,
   btimes, ACLs and so on. All these things currently can only be
   figured out by changing things and reading back if it worked. Which
   sucks hard of course.

6. We'd like to know the max file size on a file system.

7. Right now it's hard to figure out mount options used for the fs
   backing some file: you can now statx() the file, determine the
   mnt_id by that, and then search that in /proc/self/mountinfo, but
   it's slow, because again we need to scan the whole file until we
   find the entry we need. And that can be huge IRL.

8. Similar: we quite often want to know submounts of a mount. It would
   be great if for that kind of information (i.e. list of mnt_ids
   below some other mnt_id) we wouldn't have to scan the whole of
   /p/s/mi again. In many cases in our code we operate recursively,
   and want to know the mounts below some specific dir, but currently
   pay performance price for it if the number of file systems on the
   host is huge. This doesn't sound like a biggie, but actually is a
   biggie. In systemd we spend a lot of time scaninng /p/s/mi...

9. How are file locks implemented on this fs? Are they local only, and
   orthogonal to remote locks? Are POSIX and BSD locks possibly merged
   at the backend? Do they work at all?

I don't really care too much how an API for this looks like, but let
me just say that I am not a fan of APIs that require allocating an fd
for querying info about an fd. This 'feels' a bit too recursive: if
you expose information about some fd in some magic procfs subdir, or
even in some virtual pseudo-file below the file's path then this means
we have to allocate a new fd to figure out things or the first fd, and
if we'd know the same info for that, we'd theoretically recurse
down. Now of course, most likely IRL we wouldn't actually recurse down,
but it is still smelly. In particular if fd limits are tight. I mean,
I really don't care if you expose non-file-system stuff via the fs, if
that's what you want, but I think exposing *fs* metainfo in the *fs*,
it's just ugly.

I generally detest APIs that have no chance to ever returning multiple
bits of information atomically. Splitting up querying of multiple
attributes into multiple system calls means they couldn't possibly be
determined in a congruent way. I much prefer APIs where we provide a
struct to fill in and do a single syscall, and at least for some
fields we'd know afterwards that the fields were filled in together
and are congruent with each other.

I am a fan of the statx() system call I must say. If we had something
like this for the file system itself I'd be quite happy, it could tick
off many of the requests I list above.

Hope this is useful,

Lennart

--
Lennart Poettering, Berlin

Lennart Poettering Aug. 14, 2020, 8:06 a.m. UTC | #73

On Mi, 12.08.20 11:18, Linus Torvalds (torvalds@linux-foundation.org) wrote:

> On Tue, Aug 11, 2020 at 5:05 PM David Howells <dhowells@redhat.com> wrote:
> >
> > Well, the start of it was my proposal of an fsinfo() system call.
>
> Ugh. Ok, it's that thing.
>
> This all seems *WAY* over-designed - both your fsinfo and Miklos' version.
>
> What's wrong with fstatfs()? All the extra magic metadata seems to not
> really be anything people really care about.
>
> What people are actually asking for seems to be some unique mount ID,
> and we have 16 bytes of spare information in 'struct statfs64'.

statx() exposes a `stx_mnt_id` field nowadays. So that's easy and
quick to get nowadays. It's just so inefficient matching that up with
/proc/self/mountinfo then. And it still won't give you any of the fs
capability bits (time granularity, max file size, features, …),
because the kernel doesn't expose that at all right now.

OTOH I'd already be quite happy if struct statfs64 would expose
f_features, f_max_fsize, f_time_granularity, f_charset_case_handling
fields or so.

Lennart

--
Lennart Poettering, Berlin

Linus Torvalds Aug. 14, 2020, 5:05 p.m. UTC | #74

On Wed, Aug 12, 2020 at 8:53 PM Jeffrey E Altman <jaltman@auristor.com> wrote:
>
> For the AFS community, fsinfo offers a method of exposing some server
> and volume properties that are obtained via "path ioctls" in OpenAFS and
> AuriStorFS.  Some example of properties that might be exposed include
> answers to questions such as:

Note that several of the questions you ask aren't necessarily
mount-related at all.

Doing it by mount ends up being completely the wrong thing.

For example, at a minimum, these guys may well be per-directory (or
even possibly per-file):

>  * where is a mounted volume hosted? which fileservers, named by uuid
>  * what is the block size? 1K, 4K, ...
>  * are directories just-send-8, case-sensitive, case-preserving, or
>    case-insensitive?
>  * if not just-send-8, what character set is used?
>  * if Unicode, what normalization rules? etc.
>  * what volume security policy (authn, integ, priv) is assigned, if any?
>  * what is the replication policy, if any?
>  * what is the volume encryption policy, if any?

and trying to solve this with some kind of "mount info" is pure garbage.

Honestly, I really think you may want an extended [f]statfs(), not
some mount tracking.

                 Linus

Steven Whitehouse Aug. 17, 2020, 11:32 a.m. UTC | #75

Hi,

On 12/08/2020 20:50, Linus Torvalds wrote:
> On Wed, Aug 12, 2020 at 12:34 PM Steven Whitehouse <swhiteho@redhat.com> wrote:
>> The point of this is to give us the ability to monitor mounts from
>> userspace.
> We haven't had that before, I don't see why it's suddenly such a big deal.
>
> The notification side I understand. Polling /proc files is not the answer.
>
> But the whole "let's design this crazy subsystem for it" seems way
> overkill. I don't see anybody caring that deeply.
>
> It really smells like "do it because we can, not because we must".
>
> Who the hell cares about monitoring mounts at a kHz frequencies? If
> this is for MIS use, you want a nice GUI and not wasting CPU time
> polling.
>
> I'm starting to ignore the pull requests from David Howells, because
> by now they have had the same pattern for a couple of years now:
> esoteric new interfaces that seem overdesigned for corner-cases that
> I'm not seeing people clamoring for.
>
> I need (a) proof this is actualyl something real users care about and
> (b) way more open discussion and implementation from multiple parties.
>
> Because right now it looks like a small in-cabal of a couple of people
> who have wild ideas but I'm not seeing the wider use of it.
>
> Convince me otherwise. AGAIN. This is the exact same issue I had with
> the notification queues that I really wanted actual use-cases for, and
> feedback from actual outside users.
>
> I really think this is engineering for its own sake, rather than
> responding to actual user concerns.
>
>                 Linus
>

I've been hesitant to reply to this immediately, because I can see that 
somehow there is a significant disconnect between what you expect to 
happen, and what has actually happened in this case. Have pondered this 
for a few days, I hope that the best way forward might be to explore 
where the issues are, with the intention of avoiding a repeat in the 
future. Sometimes email is a difficult medium for these kinds of 
communication, and face to face is better, but with the lack of 
conferences/travel at the moment, that option is not open in the near 
future.

The whole plan here, leading towards the ability to get a "dump plus 
updates" view of mounts in the kernel has been evolving over time. It 
has been discussed at LSF over a number of years [1] and in fact the new 
mount API which was merged recently - I wonder if this is what you are 
referring to above as:

> I'm starting to ignore the pull requests from David Howells, because
> by now they have had the same pattern for a couple of years now

was originally proposed by Al, and also worked on by Miklos[2] in 2017 
and others. Community discussion resulted in that becoming a 
prerequisite for the later notifications/fsinfo work. This was one of 
the main reasons that David picked it up[3] to work on, but not the only 
reason. That did also appear to be logical, in that cleaning up the way 
in which arguments were handled during mount would make it much easier 
to create future generic code to handle them.

That said, the overall aim here is to solve the problem and if there are 
better solutions available then I'm sure that everyone is very open to 
those. I agree very much that monitoring at kHz frequencies is not 
useful, but at the same time, there are cases which can generate large 
amounts of mount changes in a very short time period. We want to be 
reasonably efficient, but not to over-optimise, and sometimes that is a 
fine line. We also don't want to block mounts if the notifications queue 
fills up, so some kind of resync operation would be required in the 
queue overflows. The notifications and fsinfo were designed very much as 
two sides of the same coin, but submitted separately for ease of review 
more than anything else.

You recently requested some details of real users for the notifications, 
and (I assumed) by extension fsinfo too. Ian wrote these emails [4][5] 
in direct response to your request. That is what we thought you were 
looking for, so if that isn't not quite what you meant, perhaps you 
could clarify a bit more. Again, apologies if we've misinterpreted what 
you were asking for.

You also mention "...it looks like a small in-cabal of a couple of 
people..." and I hope that it doesn't look that way, it is certainly not 
our intention. There have been a fair number of people involved, and 
we've done our best to ensure that the development is guided by the 
potential users, such as autofs, AFS and systemd. If there are others 
out there with use cases, and particularly so if the use case is a GUI 
file manager type application who'd like to get involved, then please 
do. We definitely want to see involvement from end users, since there is 
no point in spending a large effort creating something that is then 
never used. As you pointed that out above, this kind of application was 
very much part of the original motivation, but we had started with the 
other users since there were clearly defined use cases that could 
demonstrate significant performance gains in those cases.

So hopefully that helps to give a bit more background about where we are 
and how we got here. Where we go next will no doubt depend on the 
outcome of the current discussions, and any guidance you can give around 
how we should have better approached this would be very helpful at this 
stage,

Steve.

[1] https://lwn.net/Articles/718803/

[2] https://lwn.net/Articles/718638/

[3] https://lwn.net/Articles/753473/

[4] https://lkml.org/lkml/2020/6/2/1182

[5] 
https://lore.kernel.org/linux-fsdevel/8eb2e52f1cbdbb8bcf5c5205a53bdc9aaa11a071.camel@themaw.net/

Linus Torvalds Aug. 17, 2020, 5:15 p.m. UTC | #76

On Mon, Aug 17, 2020 at 4:33 AM Steven Whitehouse <swhiteho@redhat.com> wrote:
>
> That said, the overall aim here is to solve the problem and if there are
> better solutions available then I'm sure that everyone is very open to
> those. I agree very much that monitoring at kHz frequencies is not
> useful, but at the same time, there are cases which can generate large
> amounts of mount changes in a very short time period.

So the thing is, I absolutely believe in the kernel _notifying_ about
changes so that people don't need to poll. It's why I did merge the
notification queues, although I wanted to make sure that those worked.

> You recently requested some details of real users for the notifications,
> and (I assumed) by extension fsinfo too.

No, fsinfo wasn't on the table there. To me, notifications are a
completely separate issue, because you *can* get the information from
existing sources (ie things like /proc/mounts etc), and notification
seemed to be the much more fundamental issue.

If you poll for changes, parsing something like /proc/mounts is
obviously very heavy indeed. I don't find that particularly
controversial. Plus the notification queues had other uses, even if it
wasn't clear how many or who would use them.

But honestly, the actual fsinfo thing seems (a) overdesigned and (b)
broken. I've now had two different people say how they want to use it
to figure out whether a filesystem supports certain things that aren't
even per-filesystem things in the first place.

And this feature is clearly controversial, with actual discussion about it.

And I find the whole thing confusing and over-engineered. If this was
a better statfs(), that would be one thing. But it is designed to be
this monstoer thing that does many different things, and I find it
distasteful.  Yes, you can query "extended statfs" kind of data with
it and get the per-file attributes. I find it really annoying how the
vfs layer calls to the filesystems, that then call back to the vfs
layer to fill things in, but I guess we have that nasty pattern from
stat() already. I'd rather have the VFS layer just fill in all the
default values and the stuff it already knows about, and then maybe
have the filesystem callback fill in the ones the vfs *doesn't* know
about, but whatever.

But then you can *also* query odd things like mounts that aren't even
visible, and the topology, and completely random error state.

So it has this very complex "random structures of random things"
implementation. It's a huge sign of over-design and "I don't know what
the hell I want to expose, so I'll make this generic thing that can
expose anything, and then I start adding random fields".

Some things are per-file, some things are per-mount, and some things
are per-namespace and cross mount boundaries.

And honestly, the "random binary interfaces" just turns me off a lot.

A simple and straightforward struct? Sure. But this random "whatever
goes" thing? No.

                 Linus

Linus Torvalds Aug. 17, 2020, 10:44 p.m. UTC | #77

On Mon, Aug 17, 2020 at 10:15 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So it has this very complex "random structures of random things"
> implementation. It's a huge sign of over-design and "I don't know what
> the hell I want to expose, so I'll make this generic thing that can
> expose anything, and then I start adding random fields".

You can see the overdesign in other places too: that "time
granularity" is some very odd stuff. It doesn't actually even match
the kernel granularity rules, so that fsinfo interface is basically
exporting random crap that doesn't match reality.

In the kernel, we give the granularity in nsec, but for some reason
that fsinfo stuff gives it in some hand-written pseudo-floating-point
format. Why? Don't ask me.

And do we really want to have that whole odd Nth/Mth thing?
Considering that it cannot be consistent or atomic, and the complaint
against the /proc interfaces have been about that part, it really
smells completely bogus.

So please. Can we just make a simple extended statfs() and be done
with it, instead of this hugely complex thing that does five different
things with the same interface and makes it really odd as a result?

                  Linus
So honestly,  there's a

Miklos Szeredi Aug. 18, 2020, 9:30 a.m. UTC | #78

On Wed, Aug 12, 2020 at 8:33 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Wed, Aug 12, 2020 at 06:39:11PM +0100, Al Viro wrote:
> > On Wed, Aug 12, 2020 at 07:16:37PM +0200, Miklos Szeredi wrote:
> > > On Wed, Aug 12, 2020 at 6:33 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > >
> > > > On Wed, Aug 12, 2020 at 05:13:14PM +0200, Miklos Szeredi wrote:
> > >
> > > > > Why does it have to have a struct mount?  It does not have to use
> > > > > dentry/mount based path lookup.
> > > >
> > > > What the fuck?  So we suddenly get an additional class of objects
> > > > serving as kinda-sorta analogues of dentries *AND* now struct file
> > > > might refer to that instead of a dentry/mount pair - all on the VFS
> > > > level?  And so do all the syscalls you want to allow for such "pathnames"?
> > >
> > > The only syscall I'd want to allow is open, everything else would be
> > > on the open files themselves.
> > >
> > > file->f_path can refer to an anon mount/inode, the real object is
> > > referred to by file->private_data.
> > >
> > > The change to namei.c would be on the order of ~10 lines.  No other
> > > parts of the VFS would be affected.
> >
> > If some of the things you open are directories (and you *have* said that
> > directories will be among those just upthread, and used references to
> > readdir() as argument in favour of your approach elsewhere in the thread),
> > you will have to do something about fchdir().  And that's the least of
> > the issues.
>
> BTW, what would such opened files look like from /proc/*/fd/* POV?  And
> what would happen if you walk _through_ that symlink, with e.g. ".."
> following it?  Or with names of those attributes, for that matter...
> What about a normal open() of such a sucker?  It won't know where to
> look for your ->private_data...
>
> FWIW, you keep refering to regularity of this stuff from the syscall
> POV, but it looks like you have no real idea of what subset of the
> things available for normal descriptors will be available for those.

I have said that IMO using a non-seekable anon-file would be okay for
those.   All the answers fall out of that:  nothing works on those
fd's except read/write/getdents.  No fchdir(), no /proc/*/fd deref,
etc...

Starting with a very limited functionality and expanding on that if
necessary is I think a good way to not get bogged down with the
details.

Thanks,
Miklos

Miklos Szeredi Aug. 18, 2020, 9:41 a.m. UTC | #79

On Wed, Aug 12, 2020 at 11:30 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Wed, Aug 12, 2020 at 07:33:26PM +0100, Al Viro wrote:
>
> > BTW, what would such opened files look like from /proc/*/fd/* POV?  And
> > what would happen if you walk _through_ that symlink, with e.g. ".."
> > following it?  Or with names of those attributes, for that matter...
> > What about a normal open() of such a sucker?  It won't know where to
> > look for your ->private_data...
> >
> > FWIW, you keep refering to regularity of this stuff from the syscall
> > POV, but it looks like you have no real idea of what subset of the
> > things available for normal descriptors will be available for those.
>
> Another question: what should happen with that sucker on umount of
> the filesystem holding the underlying object?  Should it be counted
> as pinning that fs?

Obviously yes.

> Who controls what's in that tree?

It could be several entities:

 - global (like mount info)
 - per inode (like xattr)
 - per fs (fs specific inode attributes)
 - etc..

>  If we plan to have xattrs there,
> will they be in a flat tree, or should it mirror the hierarchy of
> xattrs?  When is it populated?  open() time?  What happens if we
> add/remove an xattr after that point?

From the interface perspective it would be dynamic (i.e. would get
updated on open or read).  From an implementation POV it could have
caching, but that's not how I'd start out.

> If we open the same file several times, what should we get?  A full
> copy of the tree every time, with all coherency being up to whatever's
> putting attributes there?
>
> What are the permissions needed to do lookups in that thing?

That would depend on what would need to be looked up.  Top level would
be world readable, otherwise it would be up to the attribute/group.

>
> All of that is about semantics and the answers are needed before we
> start looking into implementations.  "Whatever my implementation
> does" is _not_ a good way to go, especially since that'll be cast
> in stone as soon as API becomes exposed to userland...

Fine.

Thanks,
Miklos

Miklos Szeredi Aug. 18, 2020, 12:50 p.m. UTC | #80

On Tue, Aug 18, 2020 at 12:44 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:

> So please. Can we just make a simple extended statfs() and be done
> with it, instead of this hugely complex thing that does five different
> things with the same interface and makes it really odd as a result?

How do you propose handling variable size attributes, like the list of
fs options?

Otherwise I'm fine with a statx-like interface for querying fs/mount attributes.

Thanks,
Miklos

Jeffrey E Altman Aug. 18, 2020, 3:01 p.m. UTC | #81

On 8/14/2020 1:05 PM, Linus Torvalds (torvalds@linux-foundation.org) wrote:
> Honestly, I really think you may want an extended [f]statfs(), not
> some mount tracking.
> 
>                  Linus

Linus,

Thank you for the reply.  Perhaps some of the communication disconnect
is due to which thread this discussion is taking place on.  My
understanding is that there were two separate pull requests.  One for
mount notifications and the other for filesystem information.  This
thread is derived from the pull request entitled "Filesystem
Information" and my response was a request for use cases.  The
assumption being that the request was related to the subject.

I apologize for creating unnecessary noise due to my misinterpretation
of your intended question.  The use cases I described and the types of
filesystem information required to satisfy them do not require mount
tracking.

Jeffrey Altman

Linus Torvalds Aug. 18, 2020, 6:51 p.m. UTC | #82

On Tue, Aug 18, 2020 at 5:50 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> How do you propose handling variable size attributes, like the list of
> fs options?

I really REALLY think those things should just be ASCII data.

I think marshalling binary data is actively evil and wrong. It's great
for well-specified wire protocols. It's great for internal
communication in user space. It's *NOT* great for a kernel system call
interface.

One single simple binary structure? Sure. That's how system calls
work. I'm not claiming that things like "stat()" are wrong because
they take binary data.

But marshalling random binary structures into some buffer? Let's avoid
that. Particularly for these kinds of fairly free-form things like fs
options.

Those things *are* strings, most of them. Exactly because it needs a
level of flexibility that binary data just doesn't have.

So I'd suggest something that is very much like "statfsat()", which
gets a buffer and a length, and returns an extended "struct statfs"
*AND* just a string description at the end.

And if you don't pass a sufficiently big buffer, it will not do some
random "continuations". No state between system calls. It gets
truncated, and you need to pass a bigger buffer, kind of like
"snprintf()".

I think people who have problems parsing plain ASCII text are just
wrong. It's not that expensive. The thing that makes /proc/mounts
expensive is not the individual lines - it's that there are a lot of
them.

                     Linus

Miklos Szeredi Aug. 18, 2020, 8:18 p.m. UTC | #83

On Tue, Aug 18, 2020 at 8:51 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:

> I think people who have problems parsing plain ASCII text are just
> wrong. It's not that expensive. The thing that makes /proc/mounts
> expensive is not the individual lines - it's that there are a lot of
> them.

I agree completely with the above.

So why mix a binary structure into it?  Would it not make more sense
to make it text only?

I.e. NAME=VALUE pairs separated by newlines and quoting non-printable chars.

Thanks,
Miklos

Linus Torvalds Aug. 18, 2020, 8:53 p.m. UTC | #84

On Tue, Aug 18, 2020 at 1:18 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> So why mix a binary structure into it?  Would it not make more sense
> to make it text only?

.. because for basic and standard stuff, the binary structure just
makes sense and is easier for everybody.

When I want to get the size of a file, I do "stat()" on it, and get
the size from st.st_size. That's convenient, and there's no reason
_not_ to do it. Returning the size as an ASCII string would be
completely pointless and annoying as hell.

So binary formats have their places. But those places are for standard
and well-understood fields that are commonly accessed and do not have
any free-form or wild components to them that needs to be marshalled
into some binary format.

Whenever you have free-form data, just use ASCII.

It's what "mount" already uses, for chrissake. We pass in mount
options as ASCII for a good reason.

Basically, I think a rough rule of thumb can and should be:

 - stuff that the VFS knows about natively and fully is clearly pretty
mount-agnostic and generic, and can be represented in whatever
extended "struct statfs_x" directly.

 - anything that is variable-format and per-fs should be expressed in
the ASCII buffer

Look at our fancy new fs_context - that's pretty much what it does
even inside the kernel. Sure, we have "binary" fields there for core
basic information ("struct dentry *root", but also things like flags
with MNT_NOSUID), but the configuration stuff is ASCII that the
filesystem can parse itself.

Exactly because some things are very much specific to some
filesystems, not generic things.

So we fundamentally already have a mix of "standard FS data" and
"filesystem-specific options", and it's already basically split that
way: binary flag fields for the generic stuff, and ASCII text for the
odd options.

Again: binary data isn't wrong when it's a fixed structure that didn't
need some odd massaging or marshalling or parsing. Just a simple fixed
structure. That is _the_ most common kernel interface, used for almost
everything.  Sometimes we have arrays of them, but most of the time
it's a single struct pointer.

But binary data very much is wrong the moment you think you need to
have a parser to read it, or a marshaller to write it. Just use ASCII.

I really would prefer for the free-form data to have a lot of
commonalities with the /proc/mounts line. Not because that's a
wonderful format, but because there are very very few truly wonderful
formats out there, and in the absense of "wonderful", I'd much prefer
"familiar" and "able to use common helpers" (hopefully both on the
kernel side and the user side)..

               Linus

Al Viro Aug. 19, 2020, 2:29 a.m. UTC | #85

On Tue, Aug 18, 2020 at 11:51:25AM -0700, Linus Torvalds wrote:

> I think people who have problems parsing plain ASCII text are just
> wrong. It's not that expensive. The thing that makes /proc/mounts
> expensive is not the individual lines - it's that there are a lot of
> them.

	It is expensive - if you use strdup() all over the place,
do asprintf() equivalents for concatenation, etc.  IOW, you can write
BASIC (or javascript) in any language...

	systemd used to be that bad - exactly in parsing /proc/mounts;
I hadn't checked that code lately, so it's possible that it had gotten
better, but about 4 years ago it had been awful.  OTOH, at that time
I'd been looking at the atrocities kernel-side (in fs/pnode.c), where
on realistic setups we had O(N^2) allocations done, with all but O(N)
of them ending up freed before anyone could see them.  So it's not as
if they had a monopoly on bloody awful code...

Miklos Szeredi Aug. 21, 2020, 1:17 p.m. UTC | #86

On Tue, Aug 18, 2020 at 10:53 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:

> Basically, I think a rough rule of thumb can and should be:
>
>  - stuff that the VFS knows about natively and fully is clearly pretty
> mount-agnostic and generic, and can be represented in whatever
> extended "struct statfs_x" directly.
>
>  - anything that is variable-format and per-fs should be expressed in
> the ASCII buffer
>
> Look at our fancy new fs_context - that's pretty much what it does
> even inside the kernel. Sure, we have "binary" fields there for core
> basic information ("struct dentry *root", but also things like flags
> with MNT_NOSUID), but the configuration stuff is ASCII that the
> filesystem can parse itself.
>
> Exactly because some things are very much specific to some
> filesystems, not generic things.
>
> So we fundamentally already have a mix of "standard FS data" and
> "filesystem-specific options", and it's already basically split that
> way: binary flag fields for the generic stuff, and ASCII text for the
> odd options.

Okay.

Something else:  do we want a separate statmount(2) or is it okay to
mix per-mount and per-sb attributes in the same syscall?

/proc/mounts concatenates mount and sb options (since it copies the
/etc/mtab format)

/proc/self/mountinfo separates per-mount and per-sb data into
different fields at least, but the fields themselves are mixed

If we are introducing completely new interfaces, I think it would make
sense to separate per-mount and per-sb attributes somehow.  Atomicity
arguments don't apply since they have separate locking.  And we
already have separate interfaces for configuring them...

Thanks,
Miklos

[GIT,PULL] Filesystem Information

Pull-request

Message

Comments