mbox series

[0/4] fs: add mount_setattr()

Message ID 20200714161415.3886463-1-christian.brauner@ubuntu.com (mailing list archive)
Headers show
Series fs: add mount_setattr() | expand

Message

Christian Brauner July 14, 2020, 4:14 p.m. UTC
Hey everyone,

This series can be found at:
https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=mount_setattr
https://gitlab.com/brauner/linux/-/commits/mount_setattr
https://github.com/brauner/linux/tree/mount_setattr

This implements the mount_setattr() syscall which has come up a few
times already back in 2018 and again now in recent discussions and as
promised (see [1] and [2]) here is the first version. Sorry for the
delay but there's never enough time to get things done on time.

While the new mount api allows to change the properties of a superblock
there is currently no way to change the mount properties of a mount or
mount tree using mount file descriptors which the new mount api is based
on. In addition the old mount api has the big restriction that mount
options cannot be applied recursively. This hasn't changed since
changing mount options on a per-mount basis was implemented and has been
a frequent request and a source of bugs. The inability to change the
read-only mount option recursively alone is invaluable to userspace. Not
too long ago there was another significant security issue that was
caused by mount properties not being applied recurisvely (cf. [3]).
Additionally, various userspace projects have switched over to the new
mount api (including the ones I maintain) but we still have terrible
hacks and loops around changing mount properties for existing mount
trees. With mount_setattr() all these codepaths will be demoted to
fallback codepaths, hopefully. Michael Kerrisk recently pointed out the
missing support for this syscall as well.

The new mount_setattr() syscall allows to recursively clear and set
mount options in one shot. Multiple calls to change mount options
requesting the same changes are idempotent:

int mount_setattr(int dfd, const char *path, unsigned flags,
                  struct mount_attr *uattr, size_t usize);

Flags to modify path resolution behavior are specified in the @flags
argument. Currently, AT_EMPTY_PATH, AT_RECURSIVE, AT_SYMLINK_NOFOLLOW,
and AT_NO_AUTOMOUNT are supported. If useful, additional lookup flags to
restrict path resolution as introduced with openat2() might be supported
in the future.

mount_setattr() can be expected to grow over time and is designed with
extensibility in mind. It follows the extensible syscall pattern we have
used with other syscalls such as openat2(), clone3(),
sched_{set,get}attr(), and others.
The set of mount options is passed in the uapi struct mount_attr which
currently has the following layout:

struct mount_attr {
	__u64 attr_set;
	__u64 attr_clr;
	__u32 propagation;
	__u32 atime;
};

The @attr_set and @attr_clr members are used to clear and set mount
options. This way a user can e.g. request that a set of flags is to be
raised such as turning mounts readonly by raising MOUNT_ATTR_RDONLY in
@attr_set while at the same time requesting that another set of flags is
to be lowered such as removing noexec from a mount tree by specifying
MOUNT_ATTR_NOEXEC in @attr_clr.

The @propagation field lets callers specify the propagation type of a
mount tree. Propagation is a single property that has four different
settings and as such is not really a flag argument but an enum.
Specifically, it would be unclear what setting and clearing propagation
settings in combination would amount to. The legacy mount() syscall thus
forbids the combination of multiple propagation settings too. The goal
is to keep the semantics of mount propagation somewhat simple as they
are overly complex as it is.

Finally, struct mount_attr contains an @atime field which can be used to
set the atime behavior of a mount tree. Currently, access times are
already treated and defined like an enum in the new mount api so there's
no reason to treat them equivalent to a flag argument. A new atime enum
is introduced. The reason for not reusing the atime flags useable with
fsmount() and defined in the new mount api is that the
MOUNT_ATTR_RELATIME enum is defined as 0. This means, a user wanting to
transition to relative atime cannot simply specify MOUNT_ATTR_RELATIME
in @atime or @attr_set as this would mean not specifying any atime
settings is equivalent to specifying relative atime. This would cause
confusion for userspace as not specifying atime settings would switch
them to relatime. The new set of enums rectifies this by starting the
definition at 1 and letting 0 mean that atime settings are supposed to
be left unchanged.

Changing mount option has quite a few moving parts and the locking is
quite intricate so it is not unlikely that I got subtleties (very) wrong.

I've also merged the syscall addition into the individual arches into
the main patch itself since Linus once expressed that he prefers this
wher it makes sense. Manpage and selftests included.

[1]: https://lore.kernel.org/lkml/20200518144212.xpfjlajgwzwhlq7r@wittgenstein/
[2]: https://lore.kernel.org/lkml/CAKgNAkioH1z-pVimHziWP=ZtyBgCOwoC7ekWGFwzaZ1FPYg-tA@mail.gmail.com/
[3]: https://github.com/moby/moby/issues/37838

Thanks!
Christian

Christian Brauner (4):
  namespace: take lock_mount_hash() directly when changing flags
  namespace: only take read lock in do_reconfigure_mnt()
  fs: add mount_setattr()
  tests: add mount_setattr() selftests

 arch/alpha/kernel/syscalls/syscall.tbl        |   1 +
 arch/arm/tools/syscall.tbl                    |   1 +
 arch/arm64/include/asm/unistd32.h             |   2 +
 arch/ia64/kernel/syscalls/syscall.tbl         |   1 +
 arch/m68k/kernel/syscalls/syscall.tbl         |   1 +
 arch/microblaze/kernel/syscalls/syscall.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl     |   1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl     |   1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl     |   1 +
 arch/parisc/kernel/syscalls/syscall.tbl       |   1 +
 arch/powerpc/kernel/syscalls/syscall.tbl      |   1 +
 arch/s390/kernel/syscalls/syscall.tbl         |   1 +
 arch/sh/kernel/syscalls/syscall.tbl           |   1 +
 arch/sparc/kernel/syscalls/syscall.tbl        |   1 +
 arch/x86/entry/syscalls/syscall_32.tbl        |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
 arch/xtensa/kernel/syscalls/syscall.tbl       |   1 +
 fs/internal.h                                 |   7 +
 fs/namespace.c                                | 300 ++++++-
 include/linux/syscalls.h                      |   3 +
 include/uapi/asm-generic/unistd.h             |   4 +-
 include/uapi/linux/mount.h                    |  31 +
 tools/testing/selftests/Makefile              |   1 +
 .../selftests/mount_setattr/.gitignore        |   1 +
 .../testing/selftests/mount_setattr/Makefile  |   7 +
 tools/testing/selftests/mount_setattr/config  |   1 +
 .../mount_setattr/mount_setattr_test.c        | 802 ++++++++++++++++++
 27 files changed, 1145 insertions(+), 30 deletions(-)
 create mode 100644 tools/testing/selftests/mount_setattr/.gitignore
 create mode 100644 tools/testing/selftests/mount_setattr/Makefile
 create mode 100644 tools/testing/selftests/mount_setattr/config
 create mode 100644 tools/testing/selftests/mount_setattr/mount_setattr_test.c


base-commit: dcb7fd82c75ee2d6e6f9d8cc71c52519ed52e258

Comments

Al Viro July 19, 2020, 5:10 p.m. UTC | #1
On Tue, Jul 14, 2020 at 06:14:11PM +0200, Christian Brauner wrote:

> mount_setattr() can be expected to grow over time and is designed with
> extensibility in mind. It follows the extensible syscall pattern we have
> used with other syscalls such as openat2(), clone3(),
> sched_{set,get}attr(), and others.

I.e. it's a likely crap insertion vector; any patches around that thing
will require the same level of review as addition of a brand new syscall.
And they will be harder to spot - consider the likely subjects for such
patches and compare to open addition of a new syscall...
Christian Brauner July 19, 2020, 5:55 p.m. UTC | #2
On Sun, Jul 19, 2020 at 06:10:54PM +0100, Al Viro wrote:
> On Tue, Jul 14, 2020 at 06:14:11PM +0200, Christian Brauner wrote:
> 
> > mount_setattr() can be expected to grow over time and is designed with
> > extensibility in mind. It follows the extensible syscall pattern we have
> > used with other syscalls such as openat2(), clone3(),
> > sched_{set,get}attr(), and others.
> 
> I.e. it's a likely crap insertion vector; any patches around that thing
> will require the same level of review as addition of a brand new syscall.

Which is just how we should and hopefully treat any meaningful
extension, yes. Otherwise let's just never add a flag argument to any
syscall and only have dup()- and accept()-like syscalls.

> And they will be harder to spot - consider the likely subjects for such
> patches and compare to open addition of a new syscall...

In the new revision I have dropped the atime argument because David
already plumbed setting atime into fsmount() via flags and making
userspace jump through more hoops by adding more constants seems
pointless. So the additional arguments can be moved up because we're
below the 6 syscall args limit.

Though I really want to stress that while I see the worry it is less a
technial argument but one for our review process where we should treat
extensions to syscalls as strict as syscall additions. Which yes, very
much so.

Thanks!
Christian
Pavel Tikhomirov Oct. 1, 2020, 2:17 p.m. UTC | #3
> 
> mount_setattr() can be expected to grow over time and is designed with
> extensibility in mind. It follows the extensible syscall pattern we have
> used with other syscalls such as openat2(), clone3(),
> sched_{set,get}attr(), and others.
> The set of mount options is passed in the uapi struct mount_attr which
> currently has the following layout:
> 
> struct mount_attr {
> 	__u64 attr_set;
> 	__u64 attr_clr;
> 	__u32 propagation;
> 	__u32 atime;
> };
> 

We probably can rework "mnt: allow to add a mount into an existing 
group" (MS_SET_GROUP https://lkml.org/lkml/2017/1/23/712) to an 
extension mount_setattr. Do anyone have any objections?

We need it in CRIU because it is a big problem to restore complex mount 
trees with complex propagation flags (see my LPC talk for details 
https://linuxplumbersconf.org/event/7/contributions/640/) of 
system-containers.

If we allow set(copy) sharing options we can separate mount tree restore 
and propagation restore and everything becomes much simpler. (And we 
already have CRIU implementation based on it, which helps with variety 
of bugs with mounts we previously had. 
https://src.openvz.org/projects/OVZ/repos/criu/browse/criu/mount-v2.c#880)

I've also tried to consider another approach 
https://github.com/Snorch/linux/commit/84886f588527b062993ec3e9760c879163852518 
to disable actual propagation while restoring the mount tree. But with 
this approach it looks like it would be still hard to restore in CRIU 
because: With "MS_SET_GROUP" we don't care about roots. Imagine we have 
mount A (shared_id=1, root="/some/sub/path") and mount B (master_id=1, 
root="/some/other/sub/path", with MS_SET_GROUP we can copy sharing from 
mount A to B and make B MS_SLAVE to restore them. In case we want to do 
the same with only inheritance we would have to have some helper mount 
in this share C (shared_id=1, root="/") so that B can inherit this share...