mbox series

[00/33] VFS: Introduce filesystem context [ver #11]

Message ID 153313703562.13253.5766498657900728120.stgit@warthog.procyon.org.uk (mailing list archive)
Headers show
Series VFS: Introduce filesystem context [ver #11] | expand

Message

David Howells Aug. 1, 2018, 3:23 p.m. UTC
Hi Al,

Here are a set of patches to create a filesystem context prior to setting
up a new mount, populating it with the parsed options/binary data, creating
the superblock and then effecting the mount.  This is also used for remount
since much of the parsing stuff is common in many filesystems.

This allows namespaces and other information to be conveyed through the
mount procedure.

This also allows Miklós Szeredi's idea of doing:

	fd = fsopen("nfs");
	fsconfig(fd, FSCONFIG_SET_STRING, "option", "val", 0);
	fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
	mfd = fsmount(fd, MS_NODEV);
	move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);

that he presented at LSF-2017 to be implemented (see the relevant patches
in the series).

I didn't use netlink as that would make the core kernel depend on
CONFIG_NET and CONFIG_NETLINK and would introduce network namespacing
issues.

I've implemented filesystem context handling for procfs, nfs, mqueue,
cpuset, kernfs, sysfs, cgroup and afs filesystems.

Unconverted filesystems are handled by a legacy filesystem wrapper.


====================
WHY DO WE WANT THIS?
====================

Firstly, there's a bunch of problems with the mount(2) syscall:

 (1) It's actually six or seven different interfaces rolled into one and
     weird combinations of flags make it do different things beyond the
     original specification of the syscall.

 (2) It produces a particularly large and diverse set of errors, which have
     to be mapped back to a small error code.  Yes, there's dmesg - if you
     have it configured - but you can't necessarily see that if you're
     doing a mount inside of a container.

 (3) It copies a PAGE_SIZE block of data for each of the type, device name
     and options.

 (4) The size of the buffers is PAGE_SIZE - and this is arch dependent.

 (5) You can't mount into another mount namespace.  I could, for example,
     build a container without having to be in that container's namespace
     if I can do it from outside.

 (6) It's not really geared for the specification of multiple sources, but
     some filesystems really want that - overlayfs, for example.

and some problems in the internal kernel api:

 (1) There's no defined way to supply namespace configuration for the
     superblock - so, for instance, I can't say that I want to create a
     superblock in a particular network namespace (on automount, say).

     NFS hacks around this by creating multiple shadow file_system_types
     with different ->mount() ops.

 (2) When calling mount internally, unless you have NFS-like hacks, you
     have to generate or otherwise provide text config data which then gets
     parsed, when some of the time you could bypass the parsing stage
     entirely.

 (3) The amount of data in the data buffer is not known, but the data
     buffer might be on a kernel stack somewhere, leading to the
     possibility of tripping the stack underrun guard.

and other issues too:

 (1) Superblock remount in some filesystems applies options on an as-parsed
     basis, so if there's a parse failure, a partial alteration with no
     rollback is effected.

 (2) Under some circumstances, the mount data may get copied multiple times
     so that it can have multiple parsers applied to it or because it has
     to be parsed multiple times - for instance, once to get the
     preliminary info required to access the on-disk superblock and then
     again to update the superblock record in the kernel.

I want to be able to add support for a bunch of things:

 (1) UID, GID and Project ID mapping/translation.  I want to be able to
     install a translation table of some sort on the superblock to
     translate source identifiers (which may be foreign numeric UIDs/GIDs,
     text names, GUIDs) into system identifiers.  This needs to be done
     before the superblock is published[*].

     Note that this may, for example, involve using the context and the
     superblock held therein to issue an RPC to a server to look up
     translations.

     [*] By "published" I mean made available through mount so that other
     	 userspace processes can access it by path.

     Maybe specifying a translation range element with something like:

	fsconfig(fd, fsconfig_translate_uid, "<srcuid> <nsuid> <count>", 0, 0);

     The translation information also needs to propagate over an automount
     in some circumstances.

 (2) Namespace configuration.  I want to be able to tell the superblock
     creation process what namespaces should be applied when it created (in
     particular the userns and netns) for containerisation purposes, e.g.:

	fsconfig(fd, FSCONFIG_SET_NAMESPACE, "user", 0, userns_fd);
	fsconfig(fd, FSCONFIG_SET_NAMESPACE, "net", 0, netns_fd);

 (3) Namespace propagation.  I want to have a properly defined mechanism
     for propagating namespace configuration over automounts within the
     kernel.  This will be particularly useful for network filesystems.

 (4) Pre-mount attribute query.  A chunk of the changes is actually the
     fsinfo() syscall to query attributes of the filesystem beyond what's
     available in statx() and statfs().  This will allow a created
     superblock to be queried before it is published.

 (5) Upcall for configuration.  I would like to be able to query
     configuration that's stored in userspace when an automount is made.
     For instance, to look up network parameters for NFS or to find a cache
     selector for fscache.

     The internal fs_context could be passed to the upcall process or the
     kernel could read a config file directly if named appropriately for the
     superblock, perhaps:

	[/etc/fscontext.d/afs/example.com/cell.cfg]
	realm = EXAMPLE.COM
	translation = uid,3000,4000,100
	fscache = tag=fred

 (6) Event notifications.  I want to be able to install a watch on a
     superblock before it is published to catch things like quota events
     and EIO.

 (7) Large and binary parameters.  There might be at some point a need to
     pass large/binary objects like Microsoft PACs around.  If I understand
     PACs correctly, you can obtain these from the Kerberos server and then
     pass them to the file server when you connect.

     Having it possible to pass large or binary objects as individual
     fsconfig calls make parsing these trivial.  OTOH, some or all of this
     can potentially be handled with the use of the keyrings interface - as
     the afs filesystem does for passing kerberos tokens around; it's just
     that that seems overkill for a parameter you may only need once.


===================
SIGNIFICANT CHANGES
===================

 ver #11:

 (*) Fixed AppArmor.

 (*) Capitalised all the UAPI constants.

 (*) Explicitly numbered the FSCONFIG_* UAPI constants.

 (*) Removed all the places ANON_INODES is selected.

 (*) Fixed a bug whereby the context gets freed twice (which broke mounts of
     procfs).

 (*) Split fsinfo() off into its own patch series.

 ver #10:

 (*) Renamed "option" to "parameter" in a number of places.

 (*) Replaced the use of write() to drive the configuration with an fsconfig()
     syscall.  This also allows at-style paths and fds to be presented as typed
     object.

 (*) Routed the key=value parameter concept all the way through from the
     fsconfig() system call to the LSM and filesystem.

 (*) Added a parameter-description concept and helper functions to help
     interpret a parameter and possibly convert the value.

 (*) Made it possible to query the parameter description using the fsinfo()
     syscall.  Added a test-fs-query sample to dump the parameters used by a
     filesystem.

 ver #9:

 (*) Dropped the fd cookie stuff and the FMODE_*/O_* split stuff.

 (*) Al added an open_tree() system call to allow a mount tree to be picked
     referenced or cloned into an O_PATH-style fd.  This can then be used
     with sys_move_mount().  Dropped the O_CLONE_MOUNT and O_NON_RECURSIVE
     open() flags.

 (*) Brought error logging back in, though only in the fs_context and not
     in the task_struct.

 (*) Separated MS_REMOUNT|MS_BIND handling from MS_REMOUNT handling.

 (*) Used anon_inodes for the fd returned by fsopen() and fspick().  This
     requires making it unconditional.

 (*) Fixed lots of bugs.  Especial thanks to Al and Eric Biggers for
     finding them and providing patches.

 (*) Wrote manual pages, which I'll post separately.

 ver #8:

 (*) Changed the way fsmount() mounts into the namespace according to some
     of Al's ideas.

 (*) Put better typing on the fd cookie obtained from __fdget() & co..

 (*) Stored the fd cookie in struct nameidata rather than the dfd number.

 (*) Changed sys_fsmount() to return an O_PATH-style fd rather than
     actually mounting into the mount namespace.

 (*) Separated internal FMODE_* handling from O_* handling to free up
     certain O_* flag numbers.

 (*) Added two new open flags (O_CLONE_MOUNT and O_NON_RECURSIVE) for use
     with open(O_PATH) to copy a mount or mount-subtree to an O_PATH fd.

 (*) Added a new syscall, sys_move_mount(), to move a mount from an
     dfd+path source to a dfd+path destination.

 (*) Added a file->f_mode flag (FMODE_NEED_UNMOUNT) that indicates that the
     vfsmount attached to file->f_path needs 'unmounting' if set.

 (*) Made sys_move_mount() clear FMODE_NEED_UNMOUNT if successful.

	[!] This doesn't work quite right.

 (*) Added a new syscall, fsinfo(), to query information about a
     filesystem.  The idea being that this will, in future, work with the
     fd from fsopen() too and permit querying of the parameters and
     metadata before fsmount() is called.

 ver #7:

 (*) Undo an incorrect MS_* -> SB_* conversion.

 (*) Pass the mount data buffer size to all the mount-related functions that
     take the data pointer.  This fixes a problem where someone (say SELinux)
     tries to copy the mount data, assuming it to be a page in size, and
     overruns the buffer - thereby incurring an oops by hitting a guard page.

 (*) Made the AFS filesystem use them as an example.  This is a much easier to
     deal with than with NFS or Ext4 as there are very few mount options.

 ver #6:

 (*) Dropped the supplementary error string facility for the moment.

 (*) Dropped the NFS patches for the moment.

 (*) Dropped the reserved file descriptor argument from fsopen() and
     replaced it with three reserved pointers that must be NULL.

 ver #5:

 (*) Renamed sb_config -> fs_context and adjusted variable names.

 (*) Differentiated the flags in sb->s_flags (now named SB_*) from those
     passed to mount(2) (named MS_*).

 (*) Renamed __vfs_new_fs_context() to vfs_new_fs_context() and made the
     caller always provide a struct file_system_type pointer and the
     parameters required.

 (*) Got rid of vfs_submount_fc() in favour of passing
     FS_CONTEXT_FOR_SUBMOUNT to vfs_new_fs_context().  The purpose is now
     used more.

 (*) Call ->validate() on the remount path.

 (*) Got rid of the inode locking in sys_fsmount().

 (*) Call security_sb_mountpoint() in the mount(2) path.

 ver #4:

 (*) Split the sb_config patch up somewhat.

 (*) Made the supplementary error string facility something attached to the
     task_struct rather than the sb_config so that error messages can be
     obtained from NFS doing a mount-root-and-pathwalk inside the
     nfs_get_tree() operation.

     Further, made this managed and read by prctl rather than through the
     mount fd so that it's more generally available.

 ver #3:

 (*) Rebased on 4.12-rc1.

 (*) Split the NFS patch up somewhat.

 ver #2:

 (*) Removed the ->fill_super() from sb_config_operations and passed it in
     directly to functions that want to call it.  NFS now calls
     nfs_fill_super() directly rather than jumping through a pointer to it
     since there's only the one option at the moment.

 (*) Removed ->mnt_ns and ->sb from sb_config and moved ->pid_ns into
     proc_sb_config.

 (*) Renamed create_super -> get_tree.

 (*) Renamed struct mount_context to struct sb_config and amended various
     variable names.

 (*) sys_fsmount() acquired AT_* flags and MS_* flags (for MNT_* flags)
     arguments.

 ver #1:

 (*) Split the sb_config stuff out into its own header.

 (*) Support non-context aware filesystems through a special set of
     sb_config operations.

 (*) Stored the created superblock and root dentry into the sb_config after
     creation rather than directly into a vfsmount.  This allows some
     arguments to be removed to various NFS functions.

 (*) Added an explicit superblock-creation step.  This allows a created
     superblock to then be mounted multiple times.

 (*) Added a flag to say that the sb_config is degraded and cannot have
     another go at having a superblock creation whilst getting rid of the
     one that says it's already mounted.

Possible further developments:

 (*) Implement sb reconfiguration (for now it returns ENOANO).

 (*) Implement mount context support in more filesystems, ext4 being next
     on my list.

 (*) Move the walk-from-root stuff that nfs has to generic code so that you
     can do something akin to:

	mount /dev/sda1:/foo/bar /mnt

     See nfs_follow_remote_path() and mount_subtree().  This is slightly
     tricky in NFS as we have to prevent referral loops.

 (*) Work out how to get at the error message incurred by submounts
     encountered during nfs_follow_remote_path().

     Should the error message be moved to task_struct and made more
     general, perhaps retrieved with a prctl() function?

 (*) Clean up/consolidate the security functions.  Possibly add a
     validation hook to be called at the same time as the mount context
     validate op.

The patches can be found here also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git

tagged as:

	mount-api-20180801

on branch:

	mount-api

David
---
Al Viro (2):
      vfs: syscall: Add open_tree(2) to reference or clone a mount
      teach move_mount(2) to work with OPEN_TREE_CLONE

David Howells (31):
      vfs: syscall: Add move_mount(2) to move mounts around
      vfs: Suppress MS_* flag defs within the kernel unless explicitly enabled
      vfs: Introduce the basic header for the new mount API's filesystem context
      vfs: Introduce logging functions
      vfs: Add configuration parser helpers
      vfs: Add LSM hooks for the new mount API
      selinux: Implement the new mount API LSM hooks
      smack: Implement filesystem context security hooks
      apparmor: Implement security hooks for the new mount API
      tomoyo: Implement security hooks for the new mount API
      vfs: Separate changing mount flags full remount
      vfs: Implement a filesystem superblock creation/configuration context
      vfs: Remove unused code after filesystem context changes
      procfs: Move proc_fill_super() to fs/proc/root.c
      proc: Add fs_context support to procfs
      ipc: Convert mqueue fs to fs_context
      cpuset: Use fs_context
      kernfs, sysfs, cgroup, intel_rdt: Support fs_context
      hugetlbfs: Convert to fs_context
      vfs: Remove kern_mount_data()
      vfs: Provide documentation for new mount API
      Make anon_inodes unconditional
      vfs: syscall: Add fsopen() to prepare for superblock creation
      vfs: Implement logging through fs_context
      vfs: Add some logging to the core users of the fs_context log
      vfs: syscall: Add fsconfig() for configuring and managing a context
      vfs: syscall: Add fsmount() to create a mount for a superblock
      vfs: syscall: Add fspick() to select a superblock for reconfiguration
      afs: Add fs_context support
      afs: Use fs_context to pass parameters over automount
      vfs: Add a sample program for the new mount API


 Documentation/filesystems/mount_api.txt  |  706 ++++++++++++++++++++++++
 arch/arc/kernel/setup.c                  |    1 
 arch/arm/kernel/atags_parse.c            |    1 
 arch/arm/kvm/Kconfig                     |    1 
 arch/arm64/kvm/Kconfig                   |    1 
 arch/mips/kvm/Kconfig                    |    1 
 arch/powerpc/kvm/Kconfig                 |    1 
 arch/s390/kvm/Kconfig                    |    1 
 arch/sh/kernel/setup.c                   |    1 
 arch/sparc/kernel/setup_32.c             |    1 
 arch/sparc/kernel/setup_64.c             |    1 
 arch/x86/Kconfig                         |    1 
 arch/x86/entry/syscalls/syscall_32.tbl   |    6 
 arch/x86/entry/syscalls/syscall_64.tbl   |    6 
 arch/x86/kernel/cpu/intel_rdt.h          |   15 +
 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c |  184 ++++--
 arch/x86/kernel/setup.c                  |    1 
 arch/x86/kvm/Kconfig                     |    1 
 drivers/base/Kconfig                     |    1 
 drivers/base/devtmpfs.c                  |    1 
 drivers/char/tpm/Kconfig                 |    1 
 drivers/dma-buf/Kconfig                  |    1 
 drivers/gpio/Kconfig                     |    1 
 drivers/iio/Kconfig                      |    1 
 drivers/infiniband/Kconfig               |    1 
 drivers/vfio/Kconfig                     |    1 
 fs/Kconfig                               |    7 
 fs/Makefile                              |    5 
 fs/afs/internal.h                        |    9 
 fs/afs/mntpt.c                           |  148 +++--
 fs/afs/super.c                           |  470 +++++++++-------
 fs/afs/volume.c                          |    4 
 fs/f2fs/super.c                          |    2 
 fs/file_table.c                          |    9 
 fs/filesystems.c                         |    4 
 fs/fs_context.c                          |  779 +++++++++++++++++++++++++++
 fs/fs_parser.c                           |  476 ++++++++++++++++
 fs/fsopen.c                              |  491 +++++++++++++++++
 fs/hugetlbfs/inode.c                     |  392 ++++++++------
 fs/internal.h                            |   13 
 fs/kernfs/mount.c                        |   87 +--
 fs/libfs.c                               |   19 +
 fs/namei.c                               |    4 
 fs/namespace.c                           |  867 +++++++++++++++++++++++-------
 fs/notify/fanotify/Kconfig               |    1 
 fs/notify/inotify/Kconfig                |    1 
 fs/pnode.c                               |    1 
 fs/proc/inode.c                          |   51 --
 fs/proc/internal.h                       |    6 
 fs/proc/root.c                           |  245 ++++++--
 fs/super.c                               |  368 ++++++++++---
 fs/sysfs/mount.c                         |   67 ++
 include/linux/cgroup.h                   |    3 
 include/linux/fs.h                       |   21 +
 include/linux/fs_context.h               |  208 +++++++
 include/linux/fs_parser.h                |  116 ++++
 include/linux/kernfs.h                   |   39 +
 include/linux/lsm_hooks.h                |   70 ++
 include/linux/module.h                   |    6 
 include/linux/mount.h                    |    5 
 include/linux/security.h                 |   61 ++
 include/linux/syscalls.h                 |    9 
 include/uapi/linux/fcntl.h               |    2 
 include/uapi/linux/fs.h                  |   82 +--
 include/uapi/linux/mount.h               |   75 +++
 init/Kconfig                             |   10 
 init/do_mounts.c                         |    1 
 init/do_mounts_initrd.c                  |    1 
 ipc/mqueue.c                             |  121 +++-
 kernel/cgroup/cgroup-internal.h          |   50 +-
 kernel/cgroup/cgroup-v1.c                |  347 +++++++-----
 kernel/cgroup/cgroup.c                   |  256 ++++++---
 kernel/cgroup/cpuset.c                   |   68 ++
 samples/Kconfig                          |    6 
 samples/Makefile                         |    2 
 samples/mount_api/Makefile               |    7 
 samples/mount_api/test-fsmount.c         |  118 ++++
 security/apparmor/include/mount.h        |   11 
 security/apparmor/lsm.c                  |  108 ++++
 security/apparmor/mount.c                |   47 ++
 security/security.c                      |   51 ++
 security/selinux/hooks.c                 |  311 ++++++++++-
 security/smack/smack.h                   |   11 
 security/smack/smack_lsm.c               |  370 ++++++++++++-
 security/tomoyo/common.h                 |    3 
 security/tomoyo/mount.c                  |   46 ++
 security/tomoyo/tomoyo.c                 |   15 +
 87 files changed, 6637 insertions(+), 1484 deletions(-)
 create mode 100644 Documentation/filesystems/mount_api.txt
 create mode 100644 fs/fs_context.c
 create mode 100644 fs/fs_parser.c
 create mode 100644 fs/fsopen.c
 create mode 100644 include/linux/fs_context.h
 create mode 100644 include/linux/fs_parser.h
 create mode 100644 include/uapi/linux/mount.h
 create mode 100644 samples/mount_api/Makefile
 create mode 100644 samples/mount_api/test-fsmount.c

Comments

Eric W. Biederman Aug. 10, 2018, 2:05 p.m. UTC | #1
There is a serious problem with mount options today that fsopen does not
address.  The problem is that mount options are ignored for block based
filesystems, and any other type of filesystem that follows the same
pattern.

The script below demonstrates this bug.  Showing this bug can cause the
ext4 "acl" "quota" and "user_xattr" options to be silently ignored.

fsopen has my nack until it addresses this issue.

I don't know if we can fix this in the context of sys_mount.  But we if
we are redoing the option parsing of how we mount filesystems this needs
to be fixed before we start worrying about bug compatibility.

Hopefully this report is simple and clear enough that we can at least
agree on the problem.

Eric

# cat ~/bin/bdev-loop0.sh
#!/bin/sh
set -x
set -e

LOOP=loop0

dd if=/dev/zero bs=1024 count=1048576 of=$LOOP-file
losetup /dev/$LOOP $LOOP-file
mkfs.ext4 /dev/$LOOP

mkdir $LOOP-noacl-noquota-nouser_xattr
mount -t ext4 /dev/$LOOP -o "noacl,noquota,nouser_xattr" $LOOP-noacl-noquota-nouser_xattr

mkdir $LOOP-acl-quota-user_xattr
mount -t ext4 /dev/$LOOP  -o "acl,quota,user_xattr" $LOOP-acl-quota-user_xattr

cat /proc/mounts | grep loop0


root@finagle:~# ~/bin/bdev-loop0.sh
+ set -e
+ LOOP=loop0
+ dd if=/dev/zero bs=1024 count=1048576 of=loop0-file
1048576+0 records in
1048576+0 records out
1073741824 bytes (1.1 GB) copied, 4.37645 s, 245 MB/s
+ losetup /dev/loop0 loop0-file
+ mkfs.ext4 /dev/loop0
mke2fs 1.41.12 (17-May-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
65536 inodes, 262144 blocks
13107 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=268435456
8 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks: 
        32768, 98304, 163840, 229376

Writing inode tables: done                            
Creating journal (8192 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 29 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
+ mkdir loop0-noacl-noquota-nouser_xattr
+ mount -t ext4 /dev/loop0 -o noacl,noquota,nouser_xattr loop0-noacl-noquota-nouser_xattr
+ mkdir loop0-acl-quota-user_xattr
+ mount -t ext4 /dev/loop0 -o acl,quota,user_xattr loop0-acl-quota-user_xattr
+ + grep loop0
cat /proc/mounts
/dev/loop0 /root/loop0-noacl-noquota-nouser_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0
/dev/loop0 /root/loop0-acl-quota-user_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0
Andy Lutomirski Aug. 10, 2018, 2:36 p.m. UTC | #2
> On Aug 10, 2018, at 7:05 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> 
> 
> There is a serious problem with mount options today that fsopen does not
> address.  The problem is that mount options are ignored for block based
> filesystems, and any other type of filesystem that follows the same
> pattern.
> 

> /dev/loop0 /root/loop0-noacl-noquota-nouser_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0
> /dev/loop0 /root/loop0-acl-quota-user_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0

To make sure I understand correctly: the problem is that the second mount ignored the options because the device was already mounted, right?

For the new API, I think the only remotely sane approach is to refuse to mount or init or whatever you call it an already mounted bdev. If user code genuinely needs to bind-mount an existing mount that is known only by its bdev, we can add a specific API just for that.
David Howells Aug. 10, 2018, 3:11 p.m. UTC | #3
Eric W. Biederman <ebiederm@xmission.com> wrote:

> There is a serious problem with mount options today that fsopen does not
> address.  The problem is that mount options are ignored for block based
> filesystems, and any other type of filesystem that follows the same
> pattern.

Yes.  Since you *absolutely* *insist* on this being fixed *right* *now* *or*
*else*, I'm working up a set of additional patches to give userspace the
option of whether they want no sharing; sharing, but only with exactly the
same parameters; or to ignore the parameter differences and just accept
sharing of what's already already mounted (ie. the current behaviour).

The second option, however, is not trivial as it needs to compare the fs
contexts, including the LSM parameters.  To make that work, I really need to
remove the old security_mnt_opts stuff - which means I need to port btrfs to
the new context stuff.

We discussed this yesterday, and I proposed a solution, and I'm working on it.

Yes, I agree it would be nice to have, but it *doesn't* really need supporting
right this minute, since what I have now oughtn't to break the current
behaviour.

David
Tetsuo Handa Aug. 10, 2018, 3:11 p.m. UTC | #4
On 2018/08/10 23:05, Eric W. Biederman wrote:
> 
> There is a serious problem with mount options today that fsopen does not
> address.  The problem is that mount options are ignored for block based
> filesystems, and any other type of filesystem that follows the same
> pattern.
> 
> The script below demonstrates this bug.  Showing this bug can cause the
> ext4 "acl" "quota" and "user_xattr" options to be silently ignored.
> 
> fsopen has my nack until it addresses this issue.
> 
> I don't know if we can fix this in the context of sys_mount.  But we if
> we are redoing the option parsing of how we mount filesystems this needs
> to be fixed before we start worrying about bug compatibility.
> 
> Hopefully this report is simple and clear enough that we can at least
> agree on the problem.
> 
> Eric

This might be related to a problem that syzbot is failing to reproduce a problem.

  https://groups.google.com/forum/#!msg/syzkaller-bugs/R03vI7RCVco/0PijCTrcCgAJ

  syzbot found a reproducer, and the reproducer was working until next-20180803.
  But the reproducer is failing to reproduce this problem in next-20180806 despite
  there is no mm related change between next-20180803 and next-20180806.

  Therefore, I suspect that the reproducer is no longer working as intended. And
  there was parser change (David Howells' patch) between next-20180803 and next-20180806.

I'm waiting for response from David Howells...
David Howells Aug. 10, 2018, 3:13 p.m. UTC | #5
Andy Lutomirski <luto@amacapital.net> wrote:

> > /dev/loop0 /root/loop0-noacl-noquota-nouser_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0
> > /dev/loop0 /root/loop0-acl-quota-user_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0
> 
> To make sure I understand correctly: the problem is that the second mount
> ignored the options because the device was already mounted, right?
> 
> For the new API, I think the only remotely sane approach is to refuse to
> mount or init or whatever you call it an already mounted bdev. If user code
> genuinely needs to bind-mount an existing mount that is known only by its
> bdev, we can add a specific API just for that.

I'm adding some flags to fsopen() to allow userspace to say whether it wants
no sharing, same parameters-only sharing or anything-goes sharing (as now).

I'm also adding a flag whereby userspace can forbid anyone else from sharing a
new superblock it has just set up.

David
Al Viro Aug. 10, 2018, 3:16 p.m. UTC | #6
On Fri, Aug 10, 2018 at 09:05:22AM -0500, Eric W. Biederman wrote:
> 
> There is a serious problem with mount options today that fsopen does not
> address.  The problem is that mount options are ignored for block based
> filesystems, and any other type of filesystem that follows the same
> pattern.
> 
> The script below demonstrates this bug.  Showing this bug can cause the
> ext4 "acl" "quota" and "user_xattr" options to be silently ignored.
> 
> fsopen has my nack until it addresses this issue.
> 
> I don't know if we can fix this in the context of sys_mount.  But we if
> we are redoing the option parsing of how we mount filesystems this needs
> to be fixed before we start worrying about bug compatibility.
> 
> Hopefully this report is simple and clear enough that we can at least
> agree on the problem.

Sure, it is simple.  So's the solution: MNT_USERNS_SPECIAL_SEMANTICS that
would get passed to filesystems, so that Eric would be able to implement
his mount(2)-incompatible behaviour at leisure, without worrying about
compatibility issues.

Does that address your complaint?  Because one thing we are not going
to do is changing mount(2) behaviour.  Reason: userland-visible
behaviour of hell knows how many local scripts.  Another thing that
is flat-out not feasible is some kind of blanket "compare options"
stuff; it *can* be done as helpers to be used by filesystem when
it sees that new flag, but it's simply not going to work at the
fs-independent level.  Trivial example with the same ext4:
mount /dev/sda1 /mnt/a -o bsddf vs. mount /dev/sda1 /mnt/b
ext4 can tell that these are the same.  syscall itself has no
clue.  What's more, it's not just explicitly spelled default
options - it's the stuff that has more than one form.  And while
we are at it, the things like two NFS mounts of different trees
from the same server; they might or might not get the same superblock.
Depending upon the options.

Convenience helper that would allow ext4 to compare options and reject
the incompatible mount?  Not sure how much ext4-specific knowledge
would have to go in it, but if you can come up with one - more power
to you.  But the decision to use it *must* be ext4-specific.  Because
for e.g. NFS such thing as -o fsid=..., while certainly a part of
options, has a very different meaning - it's "use a separate fs instance"
(and let the server deal with coherency issues on its end).

Decision to use sget() (and the way it's used) is up to filesystem.
We *can't* lift that into syscall.  Not without breaking the fuck out
of existing behaviour.

Having something like a second callback for mount_bdev() that would
be called when we'd found an existing instance for the same block
device?  Sure, no problem.  Having a helper for doing such comparison
that would work in enough cases to bother, so that different fs
could avoid boilerplate in that callback?  Again, more power to you.

But I don't see what the hell does that have to the syscall interface.
Eric W. Biederman Aug. 10, 2018, 3:17 p.m. UTC | #7
Andy Lutomirski <luto@amacapital.net> writes:

>> On Aug 10, 2018, at 7:05 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>> 
>> 
>> There is a serious problem with mount options today that fsopen does not
>> address.  The problem is that mount options are ignored for block based
>> filesystems, and any other type of filesystem that follows the same
>> pattern.
>> 
>
>> /dev/loop0 /root/loop0-noacl-noquota-nouser_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0
>> /dev/loop0 /root/loop0-acl-quota-user_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0
>
> To make sure I understand correctly: the problem is that the second
> mount ignored the options because the device was already mounted,
> right?

Yes.

> For the new API, I think the only remotely sane approach is to refuse
> to mount or init or whatever you call it an already mounted bdev. If
> user code genuinely needs to bind-mount an existing mount that is
> known only by its bdev, we can add a specific API just for that.

Eric
Al Viro Aug. 10, 2018, 3:24 p.m. UTC | #8
On Fri, Aug 10, 2018 at 07:36:17AM -0700, Andy Lutomirski wrote:
> 
> 
> > On Aug 10, 2018, at 7:05 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> > 
> > 
> > There is a serious problem with mount options today that fsopen does not
> > address.  The problem is that mount options are ignored for block based
> > filesystems, and any other type of filesystem that follows the same
> > pattern.
> > 
> 
> > /dev/loop0 /root/loop0-noacl-noquota-nouser_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0
> > /dev/loop0 /root/loop0-acl-quota-user_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0
> 
> To make sure I understand correctly: the problem is that the second mount ignored the options because the device was already mounted, right?
> 
> For the new API, I think the only remotely sane approach is to refuse to mount or init or whatever you call it an already mounted bdev. If user code genuinely needs to bind-mount an existing mount that is known only by its bdev, we can add a specific API just for that.

First of all, that does NOT belong anywhere other than fs itself.
Example: NFS.  Not every attempt to mount something leads to creation
of new fs instance; moreover, whether it will or not can't be predicted
in general.

PS: for pity sake, fix your MUA; 270-character lines are way over the
top.
Theodore Ts'o Aug. 10, 2018, 3:39 p.m. UTC | #9
On Fri, Aug 10, 2018 at 04:11:31PM +0100, David Howells wrote:
> 
> Yes.  Since you *absolutely* *insist* on this being fixed *right* *now* *or*
> *else*, I'm working up a set of additional patches to give userspace the
> option of whether they want no sharing; sharing, but only with exactly the
> same parameters; or to ignore the parameter differences and just accept
> sharing of what's already already mounted (ie. the current behaviour).

But there's no way to support "no sharing", at least not in the
general case.  A file system can only be mounted once, and without
file system support, there's no way for a file system to be mounted
with the bsddf or minixdf mount simultaneously.

Even *with* file system support, there's no way today for the VFS to
keep track of whether a pathname resolution came through one
mountpoint or another, so I can't do something like this:

	mount /dev/sdXX -o casefold /android-data
	mount /dev/sdXX -o nocasefold /android-data-2

Which is a pity, since if we could we could much more easily get rid
of the horror which is Android's wrapfs...

So if the file system has been mounted with one set of mount options,
and you want to try to mount it with a conflicting set of mount
options and you don't want it to silently ignore the mount options,
the *only* thing we can today is to refuse the mount and return an
error.  

I'm not sure Eric would really consider that an improvement for the
container use case....

						- Ted

P.S.  And as Al has pointed out, this would require special, per-file
system support to determine whether the mount options are conflicting
or not....
David Howells Aug. 10, 2018, 3:53 p.m. UTC | #10
Theodore Y. Ts'o <tytso@mit.edu> wrote:

> Even *with* file system support, there's no way today for the VFS to
> keep track of whether a pathname resolution came through one
> mountpoint or another, so I can't do something like this:

Ummm...  Isn't that encoded in the vfsmount pointer in struct path?

However, the case folding stuff - is that a superblockism of a mountpointism?

> So if the file system has been mounted with one set of mount options,
> and you want to try to mount it with a conflicting set of mount
> options and you don't want it to silently ignore the mount options,
> the *only* thing we can today is to refuse the mount and return an
> error.  

With fsopen() there is the option to have the filesystem and the LSM attempt
to compare the non-key[*] mount options and reject the attempt to share if
they differ in any way.

David


[*] sget lookup keys, that is.
Casey Schaufler Aug. 10, 2018, 3:55 p.m. UTC | #11
On 8/10/2018 8:39 AM, Theodore Y. Ts'o wrote:
> On Fri, Aug 10, 2018 at 04:11:31PM +0100, David Howells wrote:
>> Yes.  Since you *absolutely* *insist* on this being fixed *right* *now* *or*
>> *else*, I'm working up a set of additional patches to give userspace the
>> option of whether they want no sharing; sharing, but only with exactly the
>> same parameters; or to ignore the parameter differences and just accept
>> sharing of what's already already mounted (ie. the current behaviour).
> But there's no way to support "no sharing", at least not in the
> general case.  A file system can only be mounted once, and without
> file system support, there's no way for a file system to be mounted
> with the bsddf or minixdf mount simultaneously.
>
> Even *with* file system support, there's no way today for the VFS to
> keep track of whether a pathname resolution came through one
> mountpoint or another, so I can't do something like this:
>
> 	mount /dev/sdXX -o casefold /android-data
> 	mount /dev/sdXX -o nocasefold /android-data-2
>
> Which is a pity, since if we could we could much more easily get rid
> of the horror which is Android's wrapfs...
>
> So if the file system has been mounted with one set of mount options,
> and you want to try to mount it with a conflicting set of mount
> options and you don't want it to silently ignore the mount options,
> the *only* thing we can today is to refuse the mount and return an
> error.  
>
> I'm not sure Eric would really consider that an improvement for the
> container use case....
>
> 						- Ted
>
> P.S.  And as Al has pointed out, this would require special, per-file
> system support to determine whether the mount options are conflicting
> or not....

This extends to LSMs that support mount options (SELinux and Smack)
as well.
David Howells Aug. 10, 2018, 4:11 p.m. UTC | #12
Casey Schaufler <casey@schaufler-ca.com> wrote:

> > P.S.  And as Al has pointed out, this would require special, per-file
> > system support to determine whether the mount options are conflicting
> > or not....
> 
> This extends to LSMs that support mount options (SELinux and Smack)
> as well. 

Yes.  I'm doing that.

David
Theodore Ts'o Aug. 10, 2018, 4:14 p.m. UTC | #13
On Fri, Aug 10, 2018 at 04:53:58PM +0100, David Howells wrote:
> Theodore Y. Ts'o <tytso@mit.edu> wrote:
> 
> > Even *with* file system support, there's no way today for the VFS to
> > keep track of whether a pathname resolution came through one
> > mountpoint or another, so I can't do something like this:
> 
> Ummm...  Isn't that encoded in the vfsmount pointer in struct path?

Well, yes, and we do use this as a hack to make read-only bind mounts
work.  But that's done as a special case, and it's for permissions
checking only.

The big problem is that there is single dentry cache object regardless
of which mount point was used to access it.  So that makes it
impossible to support case folding as a mount-pointism.

> 
> However, the case folding stuff - is that a superblockism of a mountpointism?

It's a superblock-ism.  As far as I know the *only* thing that we can
support as a mount-pointism is the ro flag, and that's handled as a
special case, and only if the original superblock was mounted
read/write.  ey That was my point; aside from the ro flag, we can't
support any other mount options as a per-mount point thing, so the
only thing we can do is to fail the mount if there are conflicting
mount options.  And I'm not really sure it helps the container use
case, since the whole point is they want their "guest" to be able to
blithely run "mount /dev/sda1 -o noxattr /mnt" and not worry about the
fact that in some other container, someone had run "mount /dev/sda1 -o
xattr /mnt".  But having the second mount fail because of conflicting
mount option breaks the illusion that containers are functionally as
rich as VM's.

So before you put in lots of work to support rejecting the attmpted
mount if the mount options conflict, are we sure people will actually
find this to be useful?  Because it's not only fsopen() work for you,
but each file system is going to have to implement new functions to
answer the question "are these mount options conflicting or not?".
Are we sure it's worth the effort?

					- Ted
Eric W. Biederman Aug. 10, 2018, 6 p.m. UTC | #14
"Theodore Y. Ts'o" <tytso@mit.edu> writes:

> On Fri, Aug 10, 2018 at 04:11:31PM +0100, David Howells wrote:
>> 
>> Yes.  Since you *absolutely* *insist* on this being fixed *right* *now* *or*
>> *else*, I'm working up a set of additional patches to give userspace the
>> option of whether they want no sharing; sharing, but only with exactly the
>> same parameters; or to ignore the parameter differences and just accept
>> sharing of what's already already mounted (ie. the current behaviour).
>
> But there's no way to support "no sharing", at least not in the
> general case.  A file system can only be mounted once, and without
> file system support, there's no way for a file system to be mounted
> with the bsddf or minixdf mount simultaneously.
>
> Even *with* file system support, there's no way today for the VFS to
> keep track of whether a pathname resolution came through one
> mountpoint or another, so I can't do something like this:
>
> 	mount /dev/sdXX -o casefold /android-data
> 	mount /dev/sdXX -o nocasefold /android-data-2
>
> Which is a pity, since if we could we could much more easily get rid
> of the horror which is Android's wrapfs...
>
> So if the file system has been mounted with one set of mount options,
> and you want to try to mount it with a conflicting set of mount
> options and you don't want it to silently ignore the mount options,
> the *only* thing we can today is to refuse the mount and return an
> error.  
>
> I'm not sure Eric would really consider that an improvement for the
> container use case....

I think I would consider it an improvement.  I keep running into cases
where the mount options differed and something was done silently and
that causes problems.

Eric
Andy Lutomirski Aug. 10, 2018, 8:06 p.m. UTC | #15
On Fri, Aug 10, 2018 at 9:14 AM, Theodore Y. Ts'o <tytso@mit.edu> wrote:
> And I'm not really sure it helps the container use
> case, since the whole point is they want their "guest" to be able to
> blithely run "mount /dev/sda1 -o noxattr /mnt" and not worry about the
> fact that in some other container, someone had run "mount /dev/sda1 -o
> xattr /mnt".  But having the second mount fail because of conflicting
> mount option breaks the illusion that containers are functionally as
> rich as VM's.

If the same block device is visible, with rw access, in two different
containers, I don't see any anything good can happen.  Sure, with the
current somewhat erratic semantics of mount(2), something kind of sort
of reasonable happens if they both mount it.  But if one or both of
them try to use, say, tune2fs or fsck, it's not going to go well.  And
a situation where they mount with different options and the result
depends on the order of the mounts is just plain bad.

I see four sane ways to deal with this:

1. Don't put the block device in the container at all.  The container
manager mounts it.

2. Use seccomp or a similar mechanism to intercept and emulate the
mount request.

3. Teach the filesystem driver to do something sensible.  This will
inherently be per-fs, and probably involves some serious magic or
allowing filesystem-specific vfsmount options.

4. Introduce a concept of a special kind of fake block device that
refers to an existing superblock, doesn't allow direct read or write,
and does the right thing when mounted.  Not obviously worth the
effort.

It seems to me that the current approach mostly involves crossing our fingers.
Theodore Ts'o Aug. 10, 2018, 8:46 p.m. UTC | #16
On Fri, Aug 10, 2018 at 01:06:54PM -0700, Andy Lutomirski wrote:
> If the same block device is visible, with rw access, in two different
> containers, I don't see any anything good can happen.

It's worse than that.  I've fixed a lot of bugs which cause the kernel
to crash, and a few that might be levered into a privilege escalationh
attack, when you mount a maliciously corrupted file system using ext4.
I'm told told the security researcher filed similar reports with the
XFS community, and he was told, "that's what metadata checksums are
for; go away".  Given how much time it takes to work with these
security researchers, I don't blame them.

But in light of that, I'd make a somewhat stronger statement.  If you
let an untrusted container mount arbitrary block devices where they
have rw acccess to the underlying block device, nothing good can
happen.  Period.  :-)

Which is why I don't think the lack of being able to reject
"conflicting mount options" is really all that important.  It
certainly shouldn't block the fsopen patch series.  #1, it's a problem
we have today, and #2, I'm really not all sure supporting bind mounts
via specifying block device was ever a good idea to begin with.  And
#3, while I've been fixing ext4 against security issues caused by
maliciously corrupted file system images, I'm still sure that allowing
untrusted containers access to mount *any* file system via a block
device for which they have r/w access is a Really Bad Idea.

> It seems to me that the current approach mostly involves crossing our fingers.

Agreed!

						- Ted
Darrick J. Wong Aug. 10, 2018, 10:12 p.m. UTC | #17
On Fri, Aug 10, 2018 at 04:46:39PM -0400, Theodore Y. Ts'o wrote:
> On Fri, Aug 10, 2018 at 01:06:54PM -0700, Andy Lutomirski wrote:
> > If the same block device is visible, with rw access, in two different
> > containers, I don't see any anything good can happen.
> 
> It's worse than that.  I've fixed a lot of bugs which cause the kernel
> to crash, and a few that might be levered into a privilege escalationh
> attack, when you mount a maliciously corrupted file system using ext4.
> I'm told told the security researcher filed similar reports with the
> XFS community, and he was told, "that's what metadata checksums are
> for; go away".

Hey now, there was a little more nuance to it than that[1][2].  The
complaint in the first instance had much more to do with breaking
existing V4 filesystems by adding format requirements that mkfs didn't
know about when the filesystem was created.  Yes, you can create V4
filesystems that will hang the system if the log was totally unformatted
and metadata updates are made, but OTOH it's fairly obvious when that
happens, you have to be root to mount a disk filesystem, and we try to
avoid breaking existing users.

XFS developers have been and will continue to examine security problems
when they are brought to our attention and strengthen validation as
needed to minimize the risk of incorrect behaviors, but filesystems are
complex machines, complex machinery is risky, and we arbitrate some of
that risk by requiring administrators to elect to mount an XFS.

> Given how much time it takes to work with these security researchers,
> I don't blame them.
> 
> But in light of that, I'd make a somewhat stronger statement.  If you
> let an untrusted container mount arbitrary block devices where they
> have rw acccess to the underlying block device, nothing good can
> happen.  Period.  :-)
> 
> Which is why I don't think the lack of being able to reject
> "conflicting mount options" is really all that important.  It
> certainly shouldn't block the fsopen patch series.  #1, it's a problem
> we have today, and #2, I'm really not all sure supporting bind mounts
> via specifying block device was ever a good idea to begin with.  And
> #3, while I've been fixing ext4 against security issues caused by
> maliciously corrupted file system images, I'm still sure that allowing
> untrusted containers access to mount *any* file system via a block
> device for which they have r/w access is a Really Bad Idea.
> 
> > It seems to me that the current approach mostly involves crossing our fingers.
> 
> Agreed!

Crossing our fingers and demanding administrator intentionality when
mounting filesystems off some piece of storage.

--D

[1] https://lkml.org/lkml/2018/5/21/649
[2] https://lkml.org/lkml/2018/4/2/572
Theodore Ts'o Aug. 10, 2018, 11:54 p.m. UTC | #18
On Fri, Aug 10, 2018 at 03:12:34PM -0700, Darrick J. Wong wrote:
> Hey now, there was a little more nuance to it than that[1][2].  The
> complaint in the first instance had much more to do with breaking
> existing V4 filesystems by adding format requirements that mkfs didn't
> know about when the filesystem was created.  Yes, you can create V4
> filesystems that will hang the system if the log was totally unformatted
> and metadata updates are made, but OTOH it's fairly obvious when that
> happens, you have to be root to mount a disk filesystem, and we try to
> avoid breaking existing users.

I wasn't thinking about syzbot reports; I've largely written them off
as far as file system testing is concerned, but rather Wen Xu at
Georgia Tech, who is much more reasonable than Dmitry, and has helpeyd
me out a lot; and has complained that the XFS folks haven't been
engaging with him.

In either case, both security researchers are fuzzing file system
images, and then fixing the checksums, and discovering that this can
lead to kernel crashes, and in a few cases, buffer overruns that can
lead to potential privilege escalations.  Wen can generate reports
faster than syzbot, but at least he gives me file system images (as
opposed to having to dig them out of syzbot repro C files) and he
actually does some analysis and explains what he thinks is going on.

I don't think anyone was claiming that format requirements should be
added to ext4 or xfs file systems.  But rather, that kernel code
should be made more robust against maliciously corrupted file system
images that have valid checksums.  I've been more willing to work with
Wen; Dave has expressed the opinion that these are not realistic bug
reports, and since only root can mount file systems, it's not high
priority.

The reason why I bring this up here is that in container land, there
are those who believe that "container root" should be able to mount
file systems, and if the "container root" isn't trusted, the fact that
the "container root" can crash the host kernel, or worse, corrupt the
host kernel and break out of the container as a result, that would be
sad.

I was pretty sure most file system developers are on the same page
that allowing untrusted "container roots" the ability to mount
arbitrary block device file systems is insanity.  Whether or not we
try to fix these sorts of bugs submitted by security researchers.  :-)

	  	       	    	       		  - Ted
Eric W. Biederman Aug. 11, 2018, 12:28 a.m. UTC | #19
"Theodore Y. Ts'o" <tytso@mit.edu> writes:

> On Fri, Aug 10, 2018 at 04:53:58PM +0100, David Howells wrote:
>> Theodore Y. Ts'o <tytso@mit.edu> wrote:
>> 
>> > Even *with* file system support, there's no way today for the VFS to
>> > keep track of whether a pathname resolution came through one
>> > mountpoint or another, so I can't do something like this:
>> 

>> However, the case folding stuff - is that a superblockism of a mountpointism?
>
> It's a superblock-ism.  As far as I know the *only* thing that we can
> support as a mount-pointism is the ro flag, and that's handled as a
> special case, and only if the original superblock was mounted
> read/write.  ey That was my point; aside from the ro flag, we can't
> support any other mount options as a per-mount point thing, so the
> only thing we can do is to fail the mount if there are conflicting
> mount options.  And I'm not really sure it helps the container use
> case, since the whole point is they want their "guest" to be able to
> blithely run "mount /dev/sda1 -o noxattr /mnt" and not worry about the
> fact that in some other container, someone had run "mount /dev/sda1 -o
> xattr /mnt".  But having the second mount fail because of conflicting
> mount option breaks the illusion that containers are functionally as
> rich as VM's.

Ted this isn't about some container case.

It about the fact that practically every filesystem in the kernel has
the behavior I have described and it means that if root is not super
careful root will shoot himself in the foot with the shotgun we have
pointed there.

It really is about loosing acls or some other filesystem option.


Eric
Darrick J. Wong Aug. 11, 2018, 12:38 a.m. UTC | #20
On Fri, Aug 10, 2018 at 07:54:47PM -0400, Theodore Y. Ts'o wrote:
> On Fri, Aug 10, 2018 at 03:12:34PM -0700, Darrick J. Wong wrote:
> > Hey now, there was a little more nuance to it than that[1][2].  The
> > complaint in the first instance had much more to do with breaking
> > existing V4 filesystems by adding format requirements that mkfs didn't
> > know about when the filesystem was created.  Yes, you can create V4
> > filesystems that will hang the system if the log was totally unformatted
> > and metadata updates are made, but OTOH it's fairly obvious when that
> > happens, you have to be root to mount a disk filesystem, and we try to
> > avoid breaking existing users.
> 
> I wasn't thinking about syzbot reports; I've largely written them off
> as far as file system testing is concerned, but rather Wen Xu at
> Georgia Tech, who is much more reasonable than Dmitry, and has helpeyd
> me out a lot; and has complained that the XFS folks haven't been
> engaging with him.

Ahh, ok.  Yes, Wen has been easier to work with, and gives out
filesystem images.  Hm, I'll go comb the bugzilla again...

> In either case, both security researchers are fuzzing file system
> images, and then fixing the checksums, and discovering that this can
> lead to kernel crashes, and in a few cases, buffer overruns that can
> lead to potential privilege escalations.  Wen can generate reports
> faster than syzbot, but at least he gives me file system images (as
> opposed to having to dig them out of syzbot repro C files) and he
> actually does some analysis and explains what he thinks is going on.

(FWIW I tried to figure out how to add fs image dumping to syzbot and
whoah that was horrifying.

> I don't think anyone was claiming that format requirements should be
> added to ext4 or xfs file systems.  But rather, that kernel code
> should be made more robust against maliciously corrupted file system
> images that have valid checksums.  I've been more willing to work with
> Wen; Dave has expressed the opinion that these are not realistic bug
> reports, and since only root can mount file systems, it's not high
> priority.

I don't think they're high priority either, but they're at least worth
/some/ attention.

> The reason why I bring this up here is that in container land, there
> are those who believe that "container root" should be able to mount
> file systems, and if the "container root" isn't trusted, the fact that
> the "container root" can crash the host kernel, or worse, corrupt the
> host kernel and break out of the container as a result, that would be
> sad.
> 
> I was pretty sure most file system developers are on the same page
> that allowing untrusted "container roots" the ability to mount
> arbitrary block device file systems is insanity.

Agreed.

> Whether or not we try to fix these sorts of bugs submitted by security
> researchers.  :-)

and agreed. :)

--D

> 	  	       	    	       		  - Ted
Eric W. Biederman Aug. 11, 2018, 1:05 a.m. UTC | #21
Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Fri, Aug 10, 2018 at 09:05:22AM -0500, Eric W. Biederman wrote:
>> 
>> There is a serious problem with mount options today that fsopen does not
>> address.  The problem is that mount options are ignored for block based
>> filesystems, and any other type of filesystem that follows the same
>> pattern.
>> 
>> The script below demonstrates this bug.  Showing this bug can cause the
>> ext4 "acl" "quota" and "user_xattr" options to be silently ignored.
>> 
>> fsopen has my nack until it addresses this issue.
>> 
>> I don't know if we can fix this in the context of sys_mount.  But we if
>> we are redoing the option parsing of how we mount filesystems this needs
>> to be fixed before we start worrying about bug compatibility.
>> 
>> Hopefully this report is simple and clear enough that we can at least
>> agree on the problem.
>
> Sure, it is simple.  So's the solution: MNT_USERNS_SPECIAL_SEMANTICS that
> would get passed to filesystems, so that Eric would be able to implement
> his mount(2)-incompatible behaviour at leisure, without worrying about
> compatibility issues.
>
> Does that address your complaint?

Absolutely not.

My complaint is that the current implemented behavior of practically
every filesystem in the kernel, is that it will ignore mount options
when mounted a second time.  

It is not some weird special case.

It is not some container thing.

It is that the behavior of mount(2) with practically every filesystem
type when that filesystem is already mounted somewhere else behaves
in ways no one would expect.

With the new fsopen api the easy thing to do is simply have CMD_CREATE
CMD_BIND_INTERNAL and be done with it.  CMD_CREATE guarantee that a new
superblock is created.  CMD_BIND_INTERNAL would only work with an
existing superblock.  Then root would at least know that he is
connecting to an already mounted filesystem and could look at the
options etc and fail if he didn't like what he saw.  No surprises, no
muss, no fuss simple.


But I have been told the simple solution above is somehow unacceptable.
And an option to compare the mount options and see if they are the same
was offered.  That would will work to.

I just care that we define the semantics in such a way that it is not
easy for root to get confused and do something stupid that will bite
later, and that we build the infrastructure so that all filesystems
can implement it easily.

So yes this is 100% a question about how filesystems should behave with
respect to their option when mounted for a second time.  That is what
Dave Howells patchset is addressing.

> Because one thing we are not going to do is changing mount(2)
> behaviour.

I have not asked for that.  I have asked that we get it right for
fsopen.

> Reason: userland-visible behaviour of hell knows how many local scripts.



> Another thing that
> is flat-out not feasible is some kind of blanket "compare options"
> stuff; it *can* be done as helpers to be used by filesystem when
> it sees that new flag, but it's simply not going to work at the
> fs-independent level.
>
> Trivial example with the same ext4:
> mount /dev/sda1 /mnt/a -o bsddf vs. mount /dev/sda1 /mnt/b
> ext4 can tell that these are the same.  syscall itself has no
> clue.  What's more, it's not just explicitly spelled default
> options - it's the stuff that has more than one form.  And while
> we are at it, the things like two NFS mounts of different trees
> from the same server; they might or might not get the same superblock.
> Depending upon the options.
>
> Convenience helper that would allow ext4 to compare options and reject
> the incompatible mount?  Not sure how much ext4-specific knowledge
> would have to go in it, but if you can come up with one - more power
> to you.  But the decision to use it *must* be ext4-specific.  Because
> for e.g. NFS such thing as -o fsid=..., while certainly a part of
> options, has a very different meaning - it's "use a separate fs instance"
> (and let the server deal with coherency issues on its end).
>
> Decision to use sget() (and the way it's used) is up to filesystem.
> We *can't* lift that into syscall.  Not without breaking the fuck out
> of existing behaviour.

I have never proposed that.  See above.  I may have talked in terms
of what sget does and muddied the waters.  If so I apologize.

All I proposed was that we distinguish between a first mount and an
additional mount so that userspace knows the options will be ignored.

Then the code to replicate the current behavior can look like:

	fd = fsopen(...);
	fsconfig(fd, ...);
	fsconfig(fd, ...);
	fsconfig(fd, ...);
	fsconfig(fd, ...);
	fsconfig(fd, ...);
	fsconfig(fd, ...);
	fsconfig(fd, ...);
	
	if (fsconfig(fd, CMD_CREATE) == -EBUSY) {
        	fsconfig(fd, CMD_BIND_INTERNAL);
        }

But userspace would then be free to issue a warning or do something
else if CMD_CREATE returns -EBUSY.

I don't know how the above wound up being construed as asking that the
code call sget directly but that is what has happened.

> Having something like a second callback for mount_bdev() that would
> be called when we'd found an existing instance for the same block
> device?  Sure, no problem.  Having a helper for doing such comparison
> that would work in enough cases to bother, so that different fs
> could avoid boilerplate in that callback?  Again, more power to you.

Normal forms etc.  If we want to do that it just requires a wee bit of
discipline.  And if all of the option parsing is being rewritten and
retested anyway I don't see why we can't do something like that as well.
So it does not sound unreasonable to me.

It does sound like more work than what I was proposing.

Eric
Eric W. Biederman Aug. 11, 2018, 1:19 a.m. UTC | #22
David Howells <dhowells@redhat.com> writes:

> Eric W. Biederman <ebiederm@xmission.com> wrote:
>
>> There is a serious problem with mount options today that fsopen does not
>> address.  The problem is that mount options are ignored for block based
>> filesystems, and any other type of filesystem that follows the same
>> pattern.
>
> Yes.  Since you *absolutely* *insist* on this being fixed *right* *now* *or*
> *else*, I'm working up a set of additional patches to give userspace the
> option of whether they want no sharing; sharing, but only with exactly the
> same parameters; or to ignore the parameter differences and just accept
> sharing of what's already already mounted (ie. the current behaviour).
>
> The second option, however, is not trivial as it needs to compare the fs
> contexts, including the LSM parameters.  To make that work, I really need to
> remove the old security_mnt_opts stuff - which means I need to port btrfs to
> the new context stuff.
>
> We discussed this yesterday, and I proposed a solution, and I'm working on it.

I repeated this because after some comments from Al on IRC yesterday
and Miklos's email replay. It appeared clear that I had not specified
why my issue was clearly enough for people reading the thread to
understand the problem that I see.

> Yes, I agree it would be nice to have, but it *doesn't* really need supporting
> right this minute, since what I have now oughtn't to break the current
> behaviour.

I am really reluctant to endorse anything that propagates the issues of
the current interface in the new mount interface.

Eric
Eric W. Biederman Aug. 11, 2018, 1:32 a.m. UTC | #23
"Darrick J. Wong" <darrick.wong@oracle.com> writes:

> On Fri, Aug 10, 2018 at 07:54:47PM -0400, Theodore Y. Ts'o wrote:

>> The reason why I bring this up here is that in container land, there
>> are those who believe that "container root" should be able to mount
>> file systems, and if the "container root" isn't trusted, the fact that
>> the "container root" can crash the host kernel, or worse, corrupt the
>> host kernel and break out of the container as a result, that would be
>> sad.
>> 
>> I was pretty sure most file system developers are on the same page
>> that allowing untrusted "container roots" the ability to mount
>> arbitrary block device file systems is insanity.
>
> Agreed.

For me I am happy with fuse.  That is sufficient to cover any container
use cases people have.   If anyone comes bugging you for more I will be
happy to push back.

The only thing that containers have to do with this is I wind up
touching a lot of the kernel/user boundary so I get to see a lot of it
and sometimes see weird things.

Eric
Theodore Ts'o Aug. 11, 2018, 1:46 a.m. UTC | #24
On Fri, Aug 10, 2018 at 08:05:44PM -0500, Eric W. Biederman wrote:
> 
> My complaint is that the current implemented behavior of practically
> every filesystem in the kernel, is that it will ignore mount options
> when mounted a second time.

The file system is ***not*** mounted a second time.

The design bug is that we allow bind mounts to be specified via a
block device.  A bind mount is not "a second mount" of the file
system.  Bind mounts != mounts.

I had assumed we had allowed bind mounts to be specified via the block
device because of container use cases.  If the container folks don't
want it, I would be pushing to simply not allow bind mounts to be
specified via block device at all.

The only reason why we should support it is because we don't want to
break scripts; and if the goal is not to break scripts, then we have
to keep to the current semantics, however broken you think it is.

					- Ted
Al Viro Aug. 11, 2018, 1:58 a.m. UTC | #25
On Fri, Aug 10, 2018 at 08:05:44PM -0500, Eric W. Biederman wrote:

> All I proposed was that we distinguish between a first mount and an
> additional mount so that userspace knows the options will be ignored.

For pity sake, just what does it take to explain to you that your
notions of "first mount" and "additional mount" ARE HEAVILY FS-DEPENDENT
and may depend upon the pieces of state userland (especially in container)
simply does not have?

One more time, slowly:

mount -t nfs4 wank.example.org:/foo/bar /mnt/a
mount -t nfs4 wank.example.org:/baz/barf /mnt/b

yield the same superblock.  Is anyone who mounts something over NFS
required to know if anybody else has mounted something from the same
server, and if so how the hell are they supposed to find that out,
so that they could decide whether they are creating the "first" or
"additional" mount, whatever that might mean in this situation?

And how, kernel-side, is that supposed to be handled by generic code
of any description?  

While we are at it,
mount -t nfs4 wank.example.org:/foo/bar -o wsize=16384 /mnt/c
is *NOT* the same superblock as the previous two.

> I don't know how the above wound up being construed as asking that the
> code call sget directly but that is what has happened.

Not by me.  What I'm saying is that the entire superblock-creating
machinery - all of it - is nothing but library helpers.  With the
decision of when/how/if they are to be used being down to filesystem
driver.  Your "first mount"/"additional mount" simply do not map
to anything universally applicable.

> > Having something like a second callback for mount_bdev() that would
> > be called when we'd found an existing instance for the same block
> > device?  Sure, no problem.  Having a helper for doing such comparison
> > that would work in enough cases to bother, so that different fs
> > could avoid boilerplate in that callback?  Again, more power to you.
> 
> Normal forms etc.  If we want to do that it just requires a wee bit of
> discipline.  And if all of the option parsing is being rewritten and
> retested anyway I don't see why we can't do something like that as well.
> So it does not sound unreasonable to me.

See above.
Al Viro Aug. 11, 2018, 2:17 a.m. UTC | #26
On Sat, Aug 11, 2018 at 02:58:15AM +0100, Al Viro wrote:
> On Fri, Aug 10, 2018 at 08:05:44PM -0500, Eric W. Biederman wrote:
> 
> > All I proposed was that we distinguish between a first mount and an
> > additional mount so that userspace knows the options will be ignored.
> 
> For pity sake, just what does it take to explain to you that your
> notions of "first mount" and "additional mount" ARE HEAVILY FS-DEPENDENT
> and may depend upon the pieces of state userland (especially in container)
> simply does not have?
> 
> One more time, slowly:
> 
> mount -t nfs4 wank.example.org:/foo/bar /mnt/a
> mount -t nfs4 wank.example.org:/baz/barf /mnt/b
> 
> yield the same superblock.  Is anyone who mounts something over NFS
> required to know if anybody else has mounted something from the same
> server, and if so how the hell are they supposed to find that out,
> so that they could decide whether they are creating the "first" or
> "additional" mount, whatever that might mean in this situation?
> 
> And how, kernel-side, is that supposed to be handled by generic code
> of any description?  
> 
> While we are at it,
> mount -t nfs4 wank.example.org:/foo/bar -o wsize=16384 /mnt/c
> is *NOT* the same superblock as the previous two.

s/as the previous two/as in the previous two cases/, that is - the first two
examples yield one superblock, this one - another.
Eric W. Biederman Aug. 11, 2018, 4:43 a.m. UTC | #27
Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Sat, Aug 11, 2018 at 02:58:15AM +0100, Al Viro wrote:
>> On Fri, Aug 10, 2018 at 08:05:44PM -0500, Eric W. Biederman wrote:
>> 
>> > All I proposed was that we distinguish between a first mount and an
>> > additional mount so that userspace knows the options will be ignored.
>> 
>> For pity sake, just what does it take to explain to you that your
>> notions of "first mount" and "additional mount" ARE HEAVILY FS-DEPENDENT
>> and may depend upon the pieces of state userland (especially in container)
>> simply does not have?
>> 
>> One more time, slowly:
>> 
>> mount -t nfs4 wank.example.org:/foo/bar /mnt/a
>> mount -t nfs4 wank.example.org:/baz/barf /mnt/b
>> 
>> yield the same superblock.  Is anyone who mounts something over NFS
>> required to know if anybody else has mounted something from the same
>> server, and if so how the hell are they supposed to find that out,
>> so that they could decide whether they are creating the "first" or
>> "additional" mount, whatever that might mean in this situation?
>> 
>> And how, kernel-side, is that supposed to be handled by generic code
>> of any description?  
>> 
>> While we are at it,
>> mount -t nfs4 wank.example.org:/foo/bar -o wsize=16384 /mnt/c
>> is *NOT* the same superblock as the previous two.
>
> s/as the previous two/as in the previous two cases/, that is - the first two
> examples yield one superblock, this one - another.

Exactly because the mount options differ.

I don't have a problem if we have something sophisticated like nfs that
handles all of the hairy details and does not reuse a superblock unless the
mount options match.

What I have a problem with is the helper for ordinary filesystems that
are not as sophisticated as nfs that don't handle all of the option
magic and give userspace something different from what userspace asked
for.

It may take a little generalization of the definitions I proposed but it
still remains simple and straight forward.

CMD_THESE_MOUNT_OPTIONS_NO_SURPRISES
CMD_WHATEVER_ALREADY_EXISTS

Or we can make the filesystems more sophisticated when we move
them to the new API and perform the comparisons there.  I think
that is what David Howells is working on.

Eric
Eric W. Biederman Aug. 11, 2018, 4:48 a.m. UTC | #28
"Theodore Y. Ts'o" <tytso@mit.edu> writes:

> On Fri, Aug 10, 2018 at 08:05:44PM -0500, Eric W. Biederman wrote:
>> 
>> My complaint is that the current implemented behavior of practically
>> every filesystem in the kernel, is that it will ignore mount options
>> when mounted a second time.
>
> The file system is ***not*** mounted a second time.
>
> The design bug is that we allow bind mounts to be specified via a
> block device.  A bind mount is not "a second mount" of the file
> system.  Bind mounts != mounts.
>
> I had assumed we had allowed bind mounts to be specified via the block
> device because of container use cases.  If the container folks don't
> want it, I would be pushing to simply not allow bind mounts to be
> specified via block device at all.

No it is not a container thing.

> The only reason why we should support it is because we don't want to
> break scripts; and if the goal is not to break scripts, then we have
> to keep to the current semantics, however broken you think it is.

But we don't have to support returning filesystems with mismatched mount
options in the new fsopen api.   That is my concern.  Confusing
userspace this way has been shown to be harmful let's not keep doing it.

Eric
David Howells Aug. 11, 2018, 7:29 a.m. UTC | #29
Eric W. Biederman <ebiederm@xmission.com> wrote:

> > Yes, I agree it would be nice to have, but it *doesn't* really need
> > supporting right this minute, since what I have now oughtn't to break the
> > current behaviour.
> 
> I am really reluctant to endorse anything that propagates the issues of
> the current interface in the new mount interface.

Do realise that your problem cannot be solved through fsopen() until every
filesystem is converted to the new fs_context-based sget() since the flag has
to make it from the VFS through the filesystem to sget().

I'm reluctant to add this flag till that point until that time unless we error
out if the flag is set against a legacy filesystem.

David
Andy Lutomirski Aug. 11, 2018, 4:31 p.m. UTC | #30
> On Aug 11, 2018, at 12:29 AM, David Howells <dhowells@redhat.com> wrote:
> 
> Eric W. Biederman <ebiederm@xmission.com> wrote:
> 
>>> Yes, I agree it would be nice to have, but it *doesn't* really need
>>> supporting right this minute, since what I have now oughtn't to break the
>>> current behaviour.
>> 
>> I am really reluctant to endorse anything that propagates the issues of
>> the current interface in the new mount interface.
> 
> Do realise that your problem cannot be solved through fsopen() until every
> filesystem is converted to the new fs_context-based sget() since the flag has
> to make it from the VFS through the filesystem to sget().
> 
> I'm reluctant to add this flag till that point until that time unless we error
> out if the flag is set against a legacy filesystem.
> 
> 

I don’t see why we need all this fancy “do the options match” stuff.  For the handful of filesystems (like NFS) that do something intelligent when multiple non-bind mount requests against the same underlying storage happen,  we can keep that behavior in the new API. For other filesystems that don’t have this feature, we should simply fail the request.

IOW I see so compelling reason to call sget() at all from the new API.  The only sort-of-legit use case I can think of is mounting more than one btrfs subvolume. But even that should probably not be done by asking the kernel to separately instantiate the filesystem.

As another way of looking at it: for a network filesystem, mounting the same target ip and path from two different Linux machines works, so mounting it twice from the same machine should also work.  But mounting the same underlying ext4 block device from two different Linux machines (using nbd, iscsi, etc) would be a catastrophe, so I see no reason that it needs to be supported if it’s two mounts from one machine.

The case folding example is interesting, and I think it should probably have a slightly different API. A program could open_tree a nocasefold mount and then make a request to create what is functionally a bind mount but with different options.

mount(8) will presumably just keep using mount(2).
Al Viro Aug. 11, 2018, 4:51 p.m. UTC | #31
On Sat, Aug 11, 2018 at 09:31:29AM -0700, Andy Lutomirski wrote:

> I don’t see why we need all this fancy “do the options match” stuff.  For the handful of filesystems (like NFS) that do something intelligent when multiple non-bind mount requests against the same underlying storage happen,  we can keep that behavior in the new API. For other filesystems that don’t have this feature, we should simply fail the request.

> IOW I see so compelling reason to call sget() at all from the new API.  The only sort-of-legit use case I can think of is mounting more than one btrfs subvolume. But even that should probably not be done by asking the kernel to separately instantiate the filesystem.


May I politely suggest the esteemed participants of that conversation
to RTFS?  Yes, I know that it's less fun that talking about your
rather vague ideas of how the things (surely) work, but it just might
avoid the feats of idiocy like the above.

Andy, I don't know how to put it more plainly: read the fucking source.
Even grep would do.  The same NFS you've granted (among the "handful"
of filesystems) an exception, *DOES* *CALL* *THE* *FUCKING* sget().

Yes, really.  And in some obscure[1] cases (including the one mentioned
upthread) it does reuse a pre-existing superblock.  For a very good
reason.

[1] such as, oh, mounting two filesystems from the same server with
default options - who would've ever thought of doing something so
perverted?
Casey Schaufler Aug. 11, 2018, 5:47 p.m. UTC | #32
On 8/10/2018 9:48 PM, Eric W. Biederman wrote:
> "Theodore Y. Ts'o" <tytso@mit.edu> writes:
>
>> On Fri, Aug 10, 2018 at 08:05:44PM -0500, Eric W. Biederman wrote:
>>> My complaint is that the current implemented behavior of practically
>>> every filesystem in the kernel, is that it will ignore mount options
>>> when mounted a second time.
>> The file system is ***not*** mounted a second time.
>>
>> The design bug is that we allow bind mounts to be specified via a
>> block device.  A bind mount is not "a second mount" of the file
>> system.  Bind mounts != mounts.
>>
>> I had assumed we had allowed bind mounts to be specified via the block
>> device because of container use cases.  If the container folks don't
>> want it, I would be pushing to simply not allow bind mounts to be
>> specified via block device at all.
> No it is not a container thing.

	Inigo: "Hello. My name is Inigo Montoya. You killed my father. Prepare to die."
	Rugen: "Stop saying that!"

	Eric:  "It is not a container thing."
	Casey: "Stop saying that!"

Yes, Virginia, it *is* a container thing. Your container manager expects all
filesystems to be server-client based. It makes bad assumptions. It is doing
things that we would fire a sysadmin for doing. Don't blame the filesystems
for behaving as documented. Export the filesystem using NFS and mount them
using the NFS mechanism, which is designed to do what you're asking for. The
problem is not in the mount mechanism, it's in the way you want to abuse it.

>> The only reason why we should support it is because we don't want to
>> break scripts; and if the goal is not to break scripts, then we have
>> to keep to the current semantics, however broken you think it is.
> But we don't have to support returning filesystems with mismatched mount
> options in the new fsopen api.   That is my concern.  Confusing
> userspace this way has been shown to be harmful let's not keep doing it.

It's not "userspace" that's confused. Developers of userspace code
implementing system behavior (e.g. systemd, container managers) need to
understand how the system works. The container manager needs to know
that it can't mount filesystems with different options. That's the kind
of thing "managers" do. If it has to go to the mount table and check
on how the device is already mounted before doing a mount, so be it.

Unless, of course, you want the concept of "container" introduced into
the kernel. There's a whole lot of feldercarb that container managers
have to deal with that would be lots easier to deal with down below.
I'm not advocating that, and I understand the arguments against it.
On the other hand, if you want a platform that is optimized for a
container environment ...

> Eric
Miklos Szeredi Aug. 13, 2018, 12:54 p.m. UTC | #33
On Sat, Aug 11, 2018 at 3:58 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:

>  What I'm saying is that the entire superblock-creating
> machinery - all of it - is nothing but library helpers.  With the
> decision of when/how/if they are to be used being down to filesystem
> driver.  Your "first mount"/"additional mount" simply do not map
> to anything universally applicable.

Why so?   (Note: using the "mount" terminology here is fundamentally
broken to start with, mounts have nothing to do with this...
Filesystem instance is better word.)

You bring up NFS as an example, but creating and/or reusing an nfs
client instance connected to a certain server is certainly a clear and
well defined concept.

The question becomes:  does it make  sense to generalize this concept
and export it to userspace with the new API?

You know the Plan 9 fs interface much better, but to me it looks like
there's a separate namespace for filesystem instances, and the mount
command just refers to such an instance.  So there's no comparing of
options or any such horror, just the need to explicitly instantiate a
new instance when necessary.  Doesn't sound very difficult to
implement in the new API.

Thanks,
Miklos
Alan Cox Aug. 13, 2018, 4:35 p.m. UTC | #34
> If the same block device is visible, with rw access, in two different
> containers, I don't see any anything good can happen.  Sure, with the

At the raw level there are lots of use cases involving high performance
data capture, media streaming and the like.

At the file system layer you can use GFS2 for example.

So there are cases where it's possible. There are even cases where it's
actually useful at the filesystem level although not many I agree.

Alan
Andy Lutomirski Aug. 13, 2018, 4:48 p.m. UTC | #35
On Mon, Aug 13, 2018 at 9:35 AM, Alan Cox <gnomes@lxorguk.ukuu.org.uk> wrote:
>> If the same block device is visible, with rw access, in two different
>> containers, I don't see any anything good can happen.  Sure, with the
>
> At the raw level there are lots of use cases involving high performance
> data capture, media streaming and the like.
>
> At the file system layer you can use GFS2 for example.

Ugh.  I even thought of this case, and I should have been a bit more precise:

I would consider the GFS2 case to be essentially equivalent to the NFS
case.  I think we can probably divide all the filesystems into three
or four types:

pseudo file systems: Multiple instantiations of the same fs driver
pointing at the same backing store give separate filesystems.  (Same
backing store includes the case where there isn't any backing store.)
tmpfs is an example.  This isn't particularly interesting.

network-like file systems: Multiple instantiations of the same fs
driver pointing at the same backing store are expected.  This includes
NFS, GFS2, AFS, CIFS, etc.  This is only really interesting to the
extent that, if the fs driver internally wants to share state between
multiple instantiations, it should be smart enough to make sure the
options are compatible or that it can otherwise handle mismatched
options correctly.  NFS does this right.

non-network-like filesystems: There are complicated ones like btrfs
and ZFS and simple ones like ext4.  In either case, multiple totally
separate instantiations of the driver sharing the backing store will
lead to corruption.  In cases like ext4, we seem to support it for
legacy reasons, because we're afraid that there are scripts that try
to mount the same block device more than once, and I think the new API
has no need to support this.  In cases like btrfs, we also seem to
support multiple user requests for "mounts" with the same underlying
block devices because we need it for full functionality.  But I think
this is because our API is wrong.

Are there cases I'm missing?  It sounds like the API could be improved
to fully model the last case, and everything will work nicely.
Al Viro Aug. 13, 2018, 5:29 p.m. UTC | #36
On Mon, Aug 13, 2018 at 09:48:53AM -0700, Andy Lutomirski wrote:

> I would consider the GFS2 case to be essentially equivalent to the NFS
> case.  I think we can probably divide all the filesystems into three
> or four types:
> 
> pseudo file systems: Multiple instantiations of the same fs driver
> pointing at the same backing store give separate filesystems.  (Same
> backing store includes the case where there isn't any backing store.)
> tmpfs is an example.  This isn't particularly interesting.
> 
> network-like file systems: Multiple instantiations of the same fs
> driver pointing at the same backing store are expected.  This includes
> NFS, GFS2, AFS, CIFS, etc.  This is only really interesting to the
> extent that, if the fs driver internally wants to share state between
> multiple instantiations, it should be smart enough to make sure the
> options are compatible or that it can otherwise handle mismatched
> options correctly.  NFS does this right.
> 
> non-network-like filesystems: There are complicated ones like btrfs
> and ZFS and simple ones like ext4.  In either case, multiple totally
> separate instantiations of the driver sharing the backing store will
> lead to corruption.  In cases like ext4, we seem to support it for
> legacy reasons, because we're afraid that there are scripts that try
> to mount the same block device more than once, and I think the new API
> has no need to support this.  In cases like btrfs, we also seem to
> support multiple user requests for "mounts" with the same underlying
> block devices because we need it for full functionality.  But I think
> this is because our API is wrong.
> 
> Are there cases I'm missing?  It sounds like the API could be improved
> to fully model the last case, and everything will work nicely.

	You know, that's starting to remind of this little gem of Borges:
http://www.alamut.com/subj/artiface/language/johnWilkins.html
Especially the delightful (fake) quote contained in there:
[...] it is written that the animals are divided into:
	(a) belonging to the emperor,
	(b) embalmed,
	(c) tame,
	(d) sucking pigs,
	(e) sirens,
	(f) fabulous,
	(g) stray dogs,
	(h) included in the present classification,
	(i) frenzied,
	(j) innumerable,
	(k) drawn with a very fine camelhair brush,
	(l) et cetera,
	(m) having just broken the water pitcher,
	(n) that from a long way off look like flies.
James Morris Aug. 13, 2018, 7 p.m. UTC | #37
On Mon, 13 Aug 2018, Al Viro wrote:

> On Mon, Aug 13, 2018 at 09:48:53AM -0700, Andy Lutomirski wrote:

> > Are there cases I'm missing?  It sounds like the API could be improved
> > to fully model the last case, and everything will work nicely.
> 
> 	You know, that's starting to remind of this little gem of Borges:
> http://www.alamut.com/subj/artiface/language/johnWilkins.html
> Especially the delightful (fake) quote contained in there:
> [...] it is written that the animals are divided into:
> 	(a) belonging to the emperor,
> 	(b) embalmed,
> 	(c) tame,
> 	(d) sucking pigs,
> 	(e) sirens,
> 	(f) fabulous,
> 	(g) stray dogs,
> 	(h) included in the present classification,
> 	(i) frenzied,
> 	(j) innumerable,
> 	(k) drawn with a very fine camelhair brush,
> 	(l) et cetera,
> 	(m) having just broken the water pitcher,
> 	(n) that from a long way off look like flies.


Coincidentally, this was also the model for Linux capabilities.
Casey Schaufler Aug. 13, 2018, 7:20 p.m. UTC | #38
On 8/13/2018 12:00 PM, James Morris wrote:
> On Mon, 13 Aug 2018, Al Viro wrote:
>
>> On Mon, Aug 13, 2018 at 09:48:53AM -0700, Andy Lutomirski wrote:
>>> Are there cases I'm missing?  It sounds like the API could be improved
>>> to fully model the last case, and everything will work nicely.
>> 	You know, that's starting to remind of this little gem of Borges:
>> http://www.alamut.com/subj/artiface/language/johnWilkins.html
>> Especially the delightful (fake) quote contained in there:
>> [...] it is written that the animals are divided into:
>> 	(a) belonging to the emperor,
>> 	(b) embalmed,
>> 	(c) tame,
>> 	(d) sucking pigs,
>> 	(e) sirens,
>> 	(f) fabulous,
>> 	(g) stray dogs,
>> 	(h) included in the present classification,
>> 	(i) frenzied,
>> 	(j) innumerable,
>> 	(k) drawn with a very fine camelhair brush,
>> 	(l) et cetera,
>> 	(m) having just broken the water pitcher,
>> 	(n) that from a long way off look like flies.
>
> Coincidentally, this was also the model for Linux capabilities.

Linux capabilities are POSIX capabilities which are modeled closely
to accommodate the historical behavior manifest in the P1003.1 specification.
So except for (c), (f) and (k) you can use this characterization. 

On a slightly more serious note, there's a lot of Linux, mount semantics
included, that have grow organically and that aren't quite up to the
usage models they are being applied to. I applaud David's work in part
because it may make it possible to accommodate more of those cases going
forward.
Eric W. Biederman Aug. 15, 2018, 4:03 a.m. UTC | #39
Casey Schaufler <casey@schaufler-ca.com> writes:

> Don't blame the filesystems for behaving as documented.

No.  This behavior is not documented.  At least I certainly don't see a
word about this in any of the man pages.  Where does it say mounting a
filesystem will not honor it's mount options?

It is also rare enough in practice it is something it is reasonable to
expect people to be surprised by.

> The problem is not in the mount mechanism, it's in the way you want to
> abuse it.

I am not asking for this behavior.  I am pointing out this behavior
exists.  I am pointing out this behavior is harmful.  I am asking we
stop doing this harmful thing in the new API where we don't have a
chance of breaking anything.

The place where this has bitten the hardest is someone wrote a script to
do something for Xen in a chroot.  That script involved a chroot that
mounted devpts and in doing so happend to change the options of the main
/dev/pts.  Which resulted in ptys created with /dev/ptmx outside the
chroot with the wrong permissions.  That in turn caused several distros
to retain the ancient suid pt_chown binary from libc that the devpts
filesystem was built to make obsolete.  As the world turned that
pt_chown binary could be confused into chowning the wrong pty if a pty
from a container was used.

The fix was to mount a new instance of devpts every time mount of devpts
is called.  That simplified the code, and allowed pt_chown to be removed
permanently.  The tricky bit was figuring out how keep /dev/ptmx
working.  I wound up testing on every distribution I could think of to
ensure no one would notice the slightly changed behavior of the devpts
filesystem.

The behavior in other filesystems of ignoring the options instead of
changing them on the filesystem isn't quite as bad.  But it still has
the potential for a lot of mischief.

Eric
Serge E. Hallyn Aug. 15, 2018, 11:29 p.m. UTC | #40
Quoting James Morris (jmorris@namei.org):
> On Mon, 13 Aug 2018, Al Viro wrote:
> 
> > On Mon, Aug 13, 2018 at 09:48:53AM -0700, Andy Lutomirski wrote:
> 
> > > Are there cases I'm missing?  It sounds like the API could be improved
> > > to fully model the last case, and everything will work nicely.
> > 
> > 	You know, that's starting to remind of this little gem of Borges:
> > http://www.alamut.com/subj/artiface/language/johnWilkins.html
> > Especially the delightful (fake) quote contained in there:
> > [...] it is written that the animals are divided into:
> > 	(a) belonging to the emperor,
> > 	(b) embalmed,
> > 	(c) tame,
> > 	(d) sucking pigs,
> > 	(e) sirens,
> > 	(f) fabulous,
> > 	(g) stray dogs,
> > 	(h) included in the present classification,
> > 	(i) frenzied,
> > 	(j) innumerable,
> > 	(k) drawn with a very fine camelhair brush,
> > 	(l) et cetera,
> > 	(m) having just broken the water pitcher,
> > 	(n) that from a long way off look like flies.
> 
> 
> Coincidentally, this was also the model for Linux capabilities.

But maybe we want to split the stray dogs up by breed.