Message ID | 149382750838.30481.8003919639826341255.stgit@warthog.procyon.org.uk (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Wed, 2017-05-03 at 17:05 +0100, David Howells wrote: > Provide an fsopen() system call that starts the process of preparing to > mount, using an fd as a context handle. fsopen() is given the name of the > filesystem that will be used: > > int mfd = fsopen(const char *fsname, int reserved, > int open_flags); > > where reserved should be -1 for the moment (it will be used to pass the > namespace information in future) and open_flags can be 0 or O_CLOEXEC. > > For example: > > mfd = fsopen("ext4", -1, O_CLOEXEC); > write(mfd, "d /dev/sdb1"); // note I'm ignoring write's length arg > write(mfd, "o noatime"); > write(mfd, "o acl"); > write(mfd, "o user_attr"); > write(mfd, "o iversion"); > write(mfd, "o "); > write(mfd, "r /my/container"); // root inside the fs > fsmount(mfd, container_fd, "/mnt", AT_NO_FOLLOW); > > mfd = fsopen("afs", -1); > write(mfd, "d %grand.central.org:root.cell"); > write(mfd, "o cell=grand.central.org"); > write(mfd, "r /"); > fsmount(mfd, AT_FDCWD, "/mnt", 0); > I think one of the neat things here is that we can now error out as soon as a bogus mount option is passed in. That takes out some of the guesswork on which option is the problem when you get back something like -EINVAL. On something like CIFS (which has a lot of crazy mount options) that is much more useful. > If an error is reported at any step, an error message may be available to be > read() back (ENODATA will be reported if there isn't an error available) in > the form: > > "e <subsys>:<problem>" > "e SELinux:Mount on mountpoint not permitted" > I like this too. No need to printk stuff when there is a problem mounting. The error goes straight to the caller who initiated the mount. That's a very nice thing in heavily containerized environments, for instance. If you're doing mounts in a container and they fail, then trying to scrape dmesg and figure out which messages refer to _your_ mount attempt seems like it could be rather nasty. This solves that problem in a much more sane way, IMO. > Once fsmount() has been called, further write() calls will incur EBUSY, > even if the fsmount() fails. read() is still possible to retrieve error > information. > What's the rationale for the above behavior? A failed attempt to graft it into the tree doesn't seem like it would have any real effect on the mount_context. While I can't think of a use case for being able to try fsmount() again, I don't quite understand why we'd prohibit someone from doing it. > The fsopen() syscall creates a mount context and hangs it of the fd that it > returns. > > Netlink is not used because it is optional. > > Signed-off-by: David Howells <dhowells@redhat.com> > --- > > arch/x86/entry/syscalls/syscall_32.tbl | 1 > arch/x86/entry/syscalls/syscall_64.tbl | 1 > fs/Makefile | 2 > fs/fsopen.c | 295 ++++++++++++++++++++++++++++++++ > include/linux/syscalls.h | 1 > include/uapi/linux/magic.h | 1 > kernel/sys_ni.c | 3 > 7 files changed, 303 insertions(+), 1 deletion(-) > create mode 100644 fs/fsopen.c > > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl > index 448ac2161112..9bf8d4c62f85 100644 > --- a/arch/x86/entry/syscalls/syscall_32.tbl > +++ b/arch/x86/entry/syscalls/syscall_32.tbl > @@ -391,3 +391,4 @@ > 382 i386 pkey_free sys_pkey_free > 383 i386 statx sys_statx > 384 i386 arch_prctl sys_arch_prctl compat_sys_arch_prctl > +385 i386 fsopen sys_fsopen > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl > index 5aef183e2f85..9b198c5fc412 100644 > --- a/arch/x86/entry/syscalls/syscall_64.tbl > +++ b/arch/x86/entry/syscalls/syscall_64.tbl > @@ -339,6 +339,7 @@ > 330 common pkey_alloc sys_pkey_alloc > 331 common pkey_free sys_pkey_free > 332 common statx sys_statx > +333 common fsopen sys_fsopen > > # > # x32-specific system call numbers start at 512 to avoid cache impact > diff --git a/fs/Makefile b/fs/Makefile > index 308a104a9a07..b79024dbb37c 100644 > --- a/fs/Makefile > +++ b/fs/Makefile > @@ -12,7 +12,7 @@ obj-y := open.o read_write.o file_table.o super.o \ > seq_file.o xattr.o libfs.o fs-writeback.o \ > pnode.o splice.o sync.o utimes.o \ > stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \ > - mount_context.o > + mount_context.o fsopen.o > > ifeq ($(CONFIG_BLOCK),y) > obj-y += buffer.o block_dev.o direct-io.o mpage.o > diff --git a/fs/fsopen.c b/fs/fsopen.c > new file mode 100644 > index 000000000000..f02ea7d265db > --- /dev/null > +++ b/fs/fsopen.c > @@ -0,0 +1,295 @@ > +/* fsopen.c: description > + * > + * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved. > + * Written by David Howells (dhowells@redhat.com) > + * > + * This program is free software; you can redistribute it and/or > + * modify it under the terms of the GNU General Public Licence > + * as published by the Free Software Foundation; either version > + * 2 of the Licence, or (at your option) any later version. > + */ > + > +#include <linux/mount.h> > +#include <linux/slab.h> > +#include <linux/uaccess.h> > +#include <linux/file.h> > +#include <linux/magic.h> > +#include <linux/syscalls.h> > + > +static struct vfsmount *fs_fs_mnt __read_mostly; > +static struct qstr empty_name = { .name = "" }; > + > +static int fs_fs_release(struct inode *inode, struct file *file) > +{ > + struct mount_context *mc = file->private_data; > + > + file->private_data = NULL; > + > + put_mount_context(mc); > + return 0; > +} > + > +/* > + * Read any error message back from the fd. Will be prefixed by "e ". > + */ > +static ssize_t fs_fs_read(struct file *file, char __user *_buf, size_t len, loff_t *pos) > +{ > + struct mount_context *mc = file->private_data; > + const char *msg; > + size_t mlen; > + > + msg = mc->error; > + if (!msg) > + return -ENODATA; > + > + mlen = strlen(msg); > + if (mlen + 2 > len) > + return -ETOOSMALL; > + if (copy_to_user(_buf, "e ", 2) != 0 || > + copy_to_user(_buf + 2, msg, mlen) != 0) > + return -EFAULT; > + return mlen + 2; > +} > + > +/* > + * Userspace writes configuration data to the fd and we parse it here. For the > + * moment, we assume a single option per write. Each line written is of the form > + * > + * <option_type><space><stuff...> > + * > + * d /dev/sda1 -- Device name > + * o noatime -- Option without value > + * o cell=grand.central.org -- Option with value > + * r / -- Dir within device to mount > + */ > +static ssize_t fs_fs_write(struct file *file, > + const char __user *_buf, size_t len, loff_t *pos) > +{ > + struct mount_context *mc = file->private_data; > + struct inode *inode = file_inode(file); > + char opt[2], *data; > + ssize_t ret; > + > + if (len < 3 || len > 4095) > + return -EINVAL; > + > + if (copy_from_user(opt, _buf, 2) != 0) > + return -EFAULT; > + switch (opt[0]) { > + case 'd': > + case 'o': > + case 'r': > + break; > + default: > + return -EINVAL; > + } > + if (opt[1] != ' ') > + return -EINVAL; > + > + data = kmalloc(len - 2 + 1, GFP_KERNEL); > + if (!data) > + return -ENOMEM; > + > + ret = -EFAULT; > + if (copy_from_user(data, _buf + 2, len - 2) != 0) > + goto err_free; > + data[len - 2] = 0; > + > + /* From this point onwards we need to lock the fd against someone > + * trying to mount it. > + */ > + ret = inode_lock_killable(inode); > + if (ret < 0) > + return ret; > + > + ret = -EBUSY; > + if (mc->mounted) > + goto err_unlock; > + > + ret = -EINVAL; > + switch (opt[0]) { > + case 'd': > + if (mc->device) > + goto err_unlock; > + mc->device = data; > + data = NULL; > + break; > + > + case 'o': > + ret = vfs_mount_option(mc, data); > + if (ret < 0) > + goto err_unlock; > + break; > + > + case 'r': > + if (mc->root_path) > + goto err_unlock; > + mc->root_path = data; > + data = NULL; > + break; > + > + default: > + goto err_unlock; > + } > + > + ret = len; > +err_unlock: > + inode_unlock(inode); > +err_free: > + kfree(data); > + return ret; > +} > + > +const struct file_operations fs_fs_fops = { > + .read = fs_fs_read, > + .write = fs_fs_write, > + .release = fs_fs_release, > + .llseek = no_llseek, > +}; > + > +/* > + * Indicate the name we want to display the filesystem file as. > + */ > +static char *fs_fs_dname(struct dentry *dentry, char *buffer, int buflen) > +{ > + return dynamic_dname(dentry, buffer, buflen, "fs:[%lu]", > + d_inode(dentry)->i_ino); > +} > + > +static const struct dentry_operations fs_fs_dentry_operations = { > + .d_dname = fs_fs_dname, > +}; > + > +/* > + * Create a file that can be used to configure a new mount. > + */ > +static struct file *create_fs_file(struct mount_context *mc) > +{ > + struct inode *inode; > + struct file *f; > + struct path path; > + int ret; > + > + inode = alloc_anon_inode(fs_fs_mnt->mnt_sb); > + if (!inode) > + return ERR_PTR(-ENFILE); > + inode->i_fop = &fs_fs_fops; > + > + ret = -ENOMEM; > + path.dentry = d_alloc_pseudo(fs_fs_mnt->mnt_sb, &empty_name); > + if (!path.dentry) > + goto err_inode; > + path.mnt = mntget(fs_fs_mnt); > + > + d_instantiate(path.dentry, inode); > + > + f = alloc_file(&path, FMODE_READ | FMODE_WRITE, &fs_fs_fops); > + if (IS_ERR(f)) { > + ret = PTR_ERR(f); > + goto err_file; > + } > + > + f->private_data = mc; > + return f; > + > +err_file: > + path_put(&path); > + return ERR_PTR(ret); > + > +err_inode: > + iput(inode); > + return ERR_PTR(ret); > +} > + > +static const struct super_operations fs_fs_ops = { > + .drop_inode = generic_delete_inode, > + .destroy_inode = free_inode_nonrcu, > + .statfs = simple_statfs, > +}; > + > +static struct dentry *fs_fs_mount(struct file_system_type *fs_type, > + int flags, const char *dev_name, > + void *data) > +{ > + return mount_pseudo(fs_type, "fs_fs:", &fs_fs_ops, > + &fs_fs_dentry_operations, FS_FS_MAGIC); > +} > + > +static struct file_system_type fs_fs_type = { > + .name = "fs_fs", > + .mount = fs_fs_mount, > + .kill_sb = kill_anon_super, > +}; > + > +static int __init init_fs_fs(void) > +{ > + int ret; > + > + ret = register_filesystem(&fs_fs_type); > + if (ret < 0) > + panic("Cannot register fs_fs\n"); > + > + fs_fs_mnt = kern_mount(&fs_fs_type); > + if (IS_ERR(fs_fs_mnt)) > + panic("Cannot mount fs_fs: %ld\n", PTR_ERR(fs_fs_mnt)); > + return 0; > +} > + > +fs_initcall(init_fs_fs); > + > +/* > + * Open a filesystem by name so that it can be configured for mounting. > + * > + * We are allowed to specify a container in which the filesystem will be > + * opened, thereby indicating which namespaces will be used (notably, which > + * network namespace will be used for network filesystems). > + */ > +SYSCALL_DEFINE3(fsopen, const char __user *, _fs_name, int, reserved, > + unsigned int, flags) > +{ > + struct mount_context *mc; > + struct file *file; > + const char *fs_name; > + int fd, ret; > + > + if (flags & ~O_CLOEXEC || reserved != -1) > + return -EINVAL; > + > + fs_name = strndup_user(_fs_name, PAGE_SIZE); > + if (IS_ERR(fs_name)) > + return PTR_ERR(fs_name); > + > + mc = vfs_fsopen(fs_name); > + if (IS_ERR(mc)) { > + ret = PTR_ERR(mc); > + goto err_fs_name; > + } > + > + ret = -ENOTSUPP; > + if (!mc->ops) > + goto err_mc; > + > + file = create_fs_file(mc); > + if (IS_ERR(file)) { > + ret = PTR_ERR(file); > + goto err_mc; > + } > + > + ret = get_unused_fd_flags(flags & O_CLOEXEC); > + if (ret < 0) > + goto err_file; > + > + fd = ret; > + fd_install(fd, file); > + return fd; > + > +err_file: > + fput(file); > + return ret; > + > +err_mc: > + put_mount_context(mc); > +err_fs_name: > + kfree(fs_name); > + return ret; > +} > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h > index 980c3c9b06f8..91ec8802ad5d 100644 > --- a/include/linux/syscalls.h > +++ b/include/linux/syscalls.h > @@ -905,5 +905,6 @@ asmlinkage long sys_pkey_alloc(unsigned long flags, unsigned long init_val); > asmlinkage long sys_pkey_free(int pkey); > asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags, > unsigned mask, struct statx __user *buffer); > +asmlinkage long sys_fsopen(const char *fs_name, int containerfd, unsigned int flags); > > #endif > diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h > index e230af2e6855..88ae83492f7c 100644 > --- a/include/uapi/linux/magic.h > +++ b/include/uapi/linux/magic.h > @@ -84,5 +84,6 @@ > #define UDF_SUPER_MAGIC 0x15013346 > #define BALLOON_KVM_MAGIC 0x13661366 > #define ZSMALLOC_MAGIC 0x58295829 > +#define FS_FS_MAGIC 0x66736673 > > #endif /* __LINUX_MAGIC_H__ */ > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c > index 8acef8576ce9..de1dc63e7e47 100644 > --- a/kernel/sys_ni.c > +++ b/kernel/sys_ni.c > @@ -258,3 +258,6 @@ cond_syscall(sys_membarrier); > cond_syscall(sys_pkey_mprotect); > cond_syscall(sys_pkey_alloc); > cond_syscall(sys_pkey_free); > + > +/* fd-based mount */ > +cond_syscall(sys_fsopen); >
Jeff Layton <jlayton@poochiereds.net> wrote: > I think one of the neat things here is that we can now error out as > soon as a bogus mount option is passed in. It means that the 'sloppy' option can also now be implemented in userspace. > > Once fsmount() has been called, further write() calls will incur EBUSY, > > even if the fsmount() fails. read() is still possible to retrieve error > > information. > > > > What's the rationale for the above behavior? > > A failed attempt to graft it into the tree doesn't seem like it would > have any real effect on the mount_context. While I can't think of a use > case for being able to try fsmount() again, I don't quite understand > why we'd prohibit someone from doing it. The mount procedure is allowed to preallocate resources and attach them to the mount context and ->mount() is allowed to use them up, say by transferring them to the superblock. The mount context is then in a degraded state and cannot necessarily be reused. David -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, May 03 2017, David Howells <dhowells@redhat.com> wrote: > --- /dev/null > +++ b/fs/fsopen.c > @@ -0,0 +1,295 @@ > +/* fsopen.c: description > + * leftover from some template? > + * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved. > + * Written by David Howells (dhowells@redhat.com) > + * > + * This program is free software; you can redistribute it and/or > + * modify it under the terms of the GNU General Public Licence > + * as published by the Free Software Foundation; either version > + * 2 of the Licence, or (at your option) any later version. > + */ > + > +#include <linux/mount.h> > +#include <linux/slab.h> > +#include <linux/uaccess.h> > +#include <linux/file.h> > +#include <linux/magic.h> > +#include <linux/syscalls.h> > + > +static struct vfsmount *fs_fs_mnt __read_mostly; > +static struct qstr empty_name = { .name = "" }; > + > +static int fs_fs_release(struct inode *inode, struct file *file) > +{ > + struct mount_context *mc = file->private_data; > + > + file->private_data = NULL; > + > + put_mount_context(mc); > + return 0; > +} > + > +/* > + * Read any error message back from the fd. Will be prefixed by "e ". > + */ > +static ssize_t fs_fs_read(struct file *file, char __user *_buf, size_t len, loff_t *pos) > +{ > + struct mount_context *mc = file->private_data; > + const char *msg; > + size_t mlen; > + > + msg = mc->error; > + if (!msg) > + return -ENODATA; > + > + mlen = strlen(msg); > + if (mlen + 2 > len) > + return -ETOOSMALL; > + > + if (copy_to_user(_buf, "e ", 2) != 0 || > + copy_to_user(_buf + 2, msg, mlen) != 0) > + return -EFAULT; > + return mlen + 2; > +} OK, mc->error must be static data, so no lifetime problems. But is it possible for the compiler to mess this up and reload msg from mc->error when it's about to do the user copy, so that if some other thread has managed to change mc->error (or is the error state sticky and no further operations allowed?) we'd copy from a string with a different length? > +/* > + * Userspace writes configuration data to the fd and we parse it here. For the > + * moment, we assume a single option per write. Each line written is of the form > + * > + * <option_type><space><stuff...> > + * > + * d /dev/sda1 -- Device name > + * o noatime -- Option without value > + * o cell=grand.central.org -- Option with value > + * r / -- Dir within device to mount > + */ > +static ssize_t fs_fs_write(struct file *file, > + const char __user *_buf, size_t len, loff_t *pos) > +{ > + struct mount_context *mc = file->private_data; > + struct inode *inode = file_inode(file); > + char opt[2], *data; > + ssize_t ret; > + > + if (len < 3 || len > 4095) > + return -EINVAL; > + > + if (copy_from_user(opt, _buf, 2) != 0) > + return -EFAULT; > + switch (opt[0]) { > + case 'd': > + case 'o': > + case 'r': > + break; > + default: > + return -EINVAL; > + } > + if (opt[1] != ' ') > + return -EINVAL; > + > + data = kmalloc(len - 2 + 1, GFP_KERNEL); > + if (!data) > + return -ENOMEM; > + > + ret = -EFAULT; > + if (copy_from_user(data, _buf + 2, len - 2) != 0) > + goto err_free; > + data[len - 2] = 0; > + This hunk seems to be equivalent to data = memdup_user_nul(_buf + 2, len - 2); if (IS_ERR(data)) return PTR_ERR(data); plus the err_free: label gets killed... > > + /* From this point onwards we need to lock the fd against someone > + * trying to mount it. > + */ > + ret = inode_lock_killable(inode); > + if (ret < 0) > + return ret; ...except that we need to jump to it here to avoid leaking data. > + ret = -EBUSY; > + if (mc->mounted) > + goto err_unlock; > + > + ret = -EINVAL; > + switch (opt[0]) { > + case 'd': > + if (mc->device) > + goto err_unlock; > + mc->device = data; > + data = NULL; > + break; > + > + case 'o': > + ret = vfs_mount_option(mc, data); > + if (ret < 0) > + goto err_unlock; > + break; > + > + case 'r': > + if (mc->root_path) > + goto err_unlock; > + mc->root_path = data; > + data = NULL; > + break; > + > + default: > + goto err_unlock; > + } > + > + ret = len; > +err_unlock: > + inode_unlock(inode); > +err_free: > + kfree(data); > + return ret; > +} > + > +const struct file_operations fs_fs_fops = { > + .read = fs_fs_read, > + .write = fs_fs_write, > + .release = fs_fs_release, > + .llseek = no_llseek, > +}; > + static const struct ? > +/* > + * Open a filesystem by name so that it can be configured for mounting. > + * > + * We are allowed to specify a container in which the filesystem will be > + * opened, thereby indicating which namespaces will be used (notably, which > + * network namespace will be used for network filesystems). > + */ > +SYSCALL_DEFINE3(fsopen, const char __user *, _fs_name, int, reserved, > + unsigned int, flags) > +{ > + struct mount_context *mc; > + struct file *file; > + const char *fs_name; > + int fd, ret; > + > + if (flags & ~O_CLOEXEC || reserved != -1) > + return -EINVAL; > + > + fs_name = strndup_user(_fs_name, PAGE_SIZE); > + if (IS_ERR(fs_name)) > + return PTR_ERR(fs_name); > + > + mc = vfs_fsopen(fs_name); > + if (IS_ERR(mc)) { > + ret = PTR_ERR(mc); > + goto err_fs_name; > + } > + Where does fs_name now get freed? vfs_fsopen doesn't seem to do it on success? (If it did, the fallthrough from err_mc: to err_fs_name: would be wrong.) > + ret = -ENOTSUPP; > + if (!mc->ops) > + goto err_mc; > + > + file = create_fs_file(mc); > + if (IS_ERR(file)) { > + ret = PTR_ERR(file); > + goto err_mc; > + } > + > + ret = get_unused_fd_flags(flags & O_CLOEXEC); > + if (ret < 0) > + goto err_file; > + > + fd = ret; > + fd_install(fd, file); > + return fd; > + > +err_file: > + fput(file); > + return ret; > + > +err_mc: > + put_mount_context(mc); > +err_fs_name: > + kfree(fs_name); > + return ret; > +} -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, May 03, 2017 at 05:05:08PM +0100, David Howells wrote: > mfd = fsopen("ext4", -1, O_CLOEXEC); > write(mfd, "d /dev/sdb1"); // note I'm ignoring write's length arg Not sure about 'd', in many cases it is not device, for mount(2) syscall we call it "source". > write(mfd, "o noatime"); > write(mfd, "o acl"); > write(mfd, "o user_attr"); > write(mfd, "o iversion"); > write(mfd, "o "); > write(mfd, "r /my/container"); // root inside the fs > fsmount(mfd, container_fd, "/mnt", AT_NO_FOLLOW); > > mfd = fsopen("afs", -1); > write(mfd, "d %grand.central.org:root.cell"); > write(mfd, "o cell=grand.central.org"); > write(mfd, "r /"); > fsmount(mfd, AT_FDCWD, "/mnt", 0); > > If an error is reported at any step, an error message may be available to be > read() back (ENODATA will be reported if there isn't an error available) in > the form: > > "e <subsys>:<problem>" > "e SELinux:Mount on mountpoint not permitted" > > Once fsmount() has been called, further write() calls will incur EBUSY, > even if the fsmount() fails. read() is still possible to retrieve error > information. The very basic mount(2) problem is that you have to parse /proc/self/mountinfo to get information about the mounted filesystem. It seems that your read() is also one way communication. What we really need is to have a way how to specify *what* you want to read. The error message is not enough, I want to know the finally used mount options, mount ID, etc. It would be nice to have something like fsmount(mfd, AT_FDCWD, "/mnt", 0); write(mfd, "o"); read(mfd, ....); // read mount options write(mdf, "i"); read(mfd, ....); // read mount ID but it seems ugly. Maybe introduce another function like fsinfo(mdf, "o", buf, bufsz) to get mount options (etc.) and to avoid separate write & read. > Netlink is not used because it is optional. +1 Karel
Rasmus Villemoes <linux@rasmusvillemoes.dk> wrote: > OK, mc->error must be static data, so no lifetime problems. Technically, mc->error is only good as long as we hold a ref on its module. It might be better to copy the string, especially as we can then do printf-style formatting. > But is it possible for the compiler to mess this up and reload msg from > mc->error when it's about to do the user copy, so that if some other thread > has managed to change mc->error (or is the error state sticky and no further > operations allowed?) we'd copy from a string with a different length? Yeah - I need to put a READ_ONCE() in there. > Where does fs_name now get freed? vfs_fsopen doesn't seem to do it on > success? (If it did, the fallthrough from err_mc: to err_fs_name: would > be wrong.) I'll add a free right after the vfs_fsopen() call before checking the error - then I can get rid of err_fs_name. David -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Rasmus Villemoes <linux@rasmusvillemoes.dk> wrote: > > +const struct file_operations fs_fs_fops = { > > + .read = fs_fs_read, > > + .write = fs_fs_write, > > + .release = fs_fs_release, > > + .llseek = no_llseek, > > +}; > > + > > static const struct ? No. It's used in the next patch to validate the fd passed to sys_fsmount(): if (f.file->f_op != &fs_fs_fops) goto err_fsfd; David -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Karel Zak <kzak@redhat.com> wrote: > > write(mfd, "d /dev/sdb1"); // note I'm ignoring write's length arg > > Not sure about 'd', in many cases it is not device, for mount(2) > syscall we call it "source". sys_mount() calls it devname. But whatever - I'm not particularly attached to the letter 'd' for this. > The very basic mount(2) problem is that you have to parse > /proc/self/mountinfo to get information about the mounted filesystem. > It seems that your read() is also one way communication. > > What we really need is to have a way how to specify *what* you want to > read. The error message is not enough, I want to know the finally used > mount options, mount ID, etc. It would be nice to have something like > > > fsmount(mfd, AT_FDCWD, "/mnt", 0); > > write(mfd, "o"); > read(mfd, ....); // read mount options > > write(mdf, "i"); > read(mfd, ....); // read mount ID > > > but it seems ugly. Maybe introduce another function like > > fsinfo(mdf, "o", buf, bufsz) > > to get mount options (etc.) and to avoid separate write & read. What is it you're trying to do? Just read back the state of the new mount? Or read back the state of a specified extant mount? David -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, May 04, 2017 at 02:06:51PM +0100, David Howells wrote: > Karel Zak <kzak@redhat.com> wrote: > > The very basic mount(2) problem is that you have to parse > > /proc/self/mountinfo to get information about the mounted filesystem. > > It seems that your read() is also one way communication. > > > > What we really need is to have a way how to specify *what* you want to > > read. The error message is not enough, I want to know the finally used > > mount options, mount ID, etc. It would be nice to have something like > > > > > > fsmount(mfd, AT_FDCWD, "/mnt", 0); > > > > write(mfd, "o"); > > read(mfd, ....); // read mount options > > > > write(mdf, "i"); > > read(mfd, ....); // read mount ID > > > > > > but it seems ugly. Maybe introduce another function like > > > > fsinfo(mdf, "o", buf, bufsz) > > > > to get mount options (etc.) and to avoid separate write & read. > > What is it you're trying to do? Just read back the state of the new mount? ...read back the state of the new mount, because for example mount options can be modified by FS driver. It would be also nice to have API to get state of arbitrary mount without parsing mountinfo (the file is huge on some systems). Karel
On Wed, May 3, 2017 at 6:05 PM, David Howells <dhowells@redhat.com> wrote: > Provide an fsopen() system call that starts the process of preparing to > mount, using an fd as a context handle. fsopen() is given the name of the > filesystem that will be used: > > int mfd = fsopen(const char *fsname, int reserved, > int open_flags); > > where reserved should be -1 for the moment (it will be used to pass the > namespace information in future) and open_flags can be 0 or O_CLOEXEC. Someone also suggested using /dev/fs/${FSTYPE} to open the fsfd. I realize that does not have the namespace info that you also want to add, but wondering if that really has to come from open and cannot be set later? Alternatives are /proc/fs/${FSTYPE}/dev or /sys/fs/${FSTYPE}/dev. Obviously neither can be used for bootstraping but there's still old mount(2) for that. I haven't convinced myself whether using plain open(2) or a new fsopen(2) syscall is better, just mentioning that this is a possibility as well. Thanks, Miklos -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Miklos Szeredi <mszeredi@redhat.com> wrote: > Someone also suggested using /dev/fs/${FSTYPE} to open the fsfd. The downside of using open() for this is that you then have a chicken-and-egg problem with respect to booting as you point out. > I realize that does not have the namespace info that you also want to add, > but wondering if that really has to come from open and cannot be set later? When do you do the security checks? Those are going to be affected by the namespaces. Other things are as well, such as setting hostnames, IP addresses, device file paths and default UIDs/GIDs, but these are probably more okay with being deferred to the parameter validation step. > ALTERNATIVES are /proc/fs/${FSTYPE}/dev or /sys/fs/${FSTYPE}/dev. > > Obviously neither can be used for bootstraping but there's still old > mount(2) for that. It should also be possible to build-time disable mount(2) in future. Obviously, this would mean providing other vectors for the other functions of mount(2). David -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2017-05-04 at 15:34 +0200, Karel Zak wrote: > On Thu, May 04, 2017 at 02:06:51PM +0100, David Howells wrote: > > Karel Zak <kzak@redhat.com> wrote: > > > The very basic mount(2) problem is that you have to parse > > > /proc/self/mountinfo to get information about the mounted filesystem. > > > It seems that your read() is also one way communication. > > > > > > What we really need is to have a way how to specify *what* you want to > > > read. The error message is not enough, I want to know the finally used > > > mount options, mount ID, etc. It would be nice to have something like > > > > > > > > > fsmount(mfd, AT_FDCWD, "/mnt", 0); > > > > > > write(mfd, "o"); > > > read(mfd, ....); // read mount options > > > > > > write(mdf, "i"); > > > read(mfd, ....); // read mount ID > > > > > > > > > but it seems ugly. Maybe introduce another function like > > > > > > fsinfo(mdf, "o", buf, bufsz) > > > Not sure that it's any prettier, but you could use an ioctl to switch between different read requests: ioctl(mfd, FSOPEN_READ_OPTIONS); read(mfd, ....); ioctl(mfd, FSOPEN_READ_MOUNT_ID); read(mfd, ....); > > > to get mount options (etc.) and to avoid separate write & read. > > > > What is it you're trying to do? Just read back the state of the new mount? > > ...read back the state of the new mount, because for example mount > options can be modified by FS driver. It would be also nice to have > API to get state of arbitrary mount without parsing mountinfo (the > file is huge on some systems). >
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 448ac2161112..9bf8d4c62f85 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -391,3 +391,4 @@ 382 i386 pkey_free sys_pkey_free 383 i386 statx sys_statx 384 i386 arch_prctl sys_arch_prctl compat_sys_arch_prctl +385 i386 fsopen sys_fsopen diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index 5aef183e2f85..9b198c5fc412 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -339,6 +339,7 @@ 330 common pkey_alloc sys_pkey_alloc 331 common pkey_free sys_pkey_free 332 common statx sys_statx +333 common fsopen sys_fsopen # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/fs/Makefile b/fs/Makefile index 308a104a9a07..b79024dbb37c 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -12,7 +12,7 @@ obj-y := open.o read_write.o file_table.o super.o \ seq_file.o xattr.o libfs.o fs-writeback.o \ pnode.o splice.o sync.o utimes.o \ stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \ - mount_context.o + mount_context.o fsopen.o ifeq ($(CONFIG_BLOCK),y) obj-y += buffer.o block_dev.o direct-io.o mpage.o diff --git a/fs/fsopen.c b/fs/fsopen.c new file mode 100644 index 000000000000..f02ea7d265db --- /dev/null +++ b/fs/fsopen.c @@ -0,0 +1,295 @@ +/* fsopen.c: description + * + * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved. + * Written by David Howells (dhowells@redhat.com) + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public Licence + * as published by the Free Software Foundation; either version + * 2 of the Licence, or (at your option) any later version. + */ + +#include <linux/mount.h> +#include <linux/slab.h> +#include <linux/uaccess.h> +#include <linux/file.h> +#include <linux/magic.h> +#include <linux/syscalls.h> + +static struct vfsmount *fs_fs_mnt __read_mostly; +static struct qstr empty_name = { .name = "" }; + +static int fs_fs_release(struct inode *inode, struct file *file) +{ + struct mount_context *mc = file->private_data; + + file->private_data = NULL; + + put_mount_context(mc); + return 0; +} + +/* + * Read any error message back from the fd. Will be prefixed by "e ". + */ +static ssize_t fs_fs_read(struct file *file, char __user *_buf, size_t len, loff_t *pos) +{ + struct mount_context *mc = file->private_data; + const char *msg; + size_t mlen; + + msg = mc->error; + if (!msg) + return -ENODATA; + + mlen = strlen(msg); + if (mlen + 2 > len) + return -ETOOSMALL; + if (copy_to_user(_buf, "e ", 2) != 0 || + copy_to_user(_buf + 2, msg, mlen) != 0) + return -EFAULT; + return mlen + 2; +} + +/* + * Userspace writes configuration data to the fd and we parse it here. For the + * moment, we assume a single option per write. Each line written is of the form + * + * <option_type><space><stuff...> + * + * d /dev/sda1 -- Device name + * o noatime -- Option without value + * o cell=grand.central.org -- Option with value + * r / -- Dir within device to mount + */ +static ssize_t fs_fs_write(struct file *file, + const char __user *_buf, size_t len, loff_t *pos) +{ + struct mount_context *mc = file->private_data; + struct inode *inode = file_inode(file); + char opt[2], *data; + ssize_t ret; + + if (len < 3 || len > 4095) + return -EINVAL; + + if (copy_from_user(opt, _buf, 2) != 0) + return -EFAULT; + switch (opt[0]) { + case 'd': + case 'o': + case 'r': + break; + default: + return -EINVAL; + } + if (opt[1] != ' ') + return -EINVAL; + + data = kmalloc(len - 2 + 1, GFP_KERNEL); + if (!data) + return -ENOMEM; + + ret = -EFAULT; + if (copy_from_user(data, _buf + 2, len - 2) != 0) + goto err_free; + data[len - 2] = 0; + + /* From this point onwards we need to lock the fd against someone + * trying to mount it. + */ + ret = inode_lock_killable(inode); + if (ret < 0) + return ret; + + ret = -EBUSY; + if (mc->mounted) + goto err_unlock; + + ret = -EINVAL; + switch (opt[0]) { + case 'd': + if (mc->device) + goto err_unlock; + mc->device = data; + data = NULL; + break; + + case 'o': + ret = vfs_mount_option(mc, data); + if (ret < 0) + goto err_unlock; + break; + + case 'r': + if (mc->root_path) + goto err_unlock; + mc->root_path = data; + data = NULL; + break; + + default: + goto err_unlock; + } + + ret = len; +err_unlock: + inode_unlock(inode); +err_free: + kfree(data); + return ret; +} + +const struct file_operations fs_fs_fops = { + .read = fs_fs_read, + .write = fs_fs_write, + .release = fs_fs_release, + .llseek = no_llseek, +}; + +/* + * Indicate the name we want to display the filesystem file as. + */ +static char *fs_fs_dname(struct dentry *dentry, char *buffer, int buflen) +{ + return dynamic_dname(dentry, buffer, buflen, "fs:[%lu]", + d_inode(dentry)->i_ino); +} + +static const struct dentry_operations fs_fs_dentry_operations = { + .d_dname = fs_fs_dname, +}; + +/* + * Create a file that can be used to configure a new mount. + */ +static struct file *create_fs_file(struct mount_context *mc) +{ + struct inode *inode; + struct file *f; + struct path path; + int ret; + + inode = alloc_anon_inode(fs_fs_mnt->mnt_sb); + if (!inode) + return ERR_PTR(-ENFILE); + inode->i_fop = &fs_fs_fops; + + ret = -ENOMEM; + path.dentry = d_alloc_pseudo(fs_fs_mnt->mnt_sb, &empty_name); + if (!path.dentry) + goto err_inode; + path.mnt = mntget(fs_fs_mnt); + + d_instantiate(path.dentry, inode); + + f = alloc_file(&path, FMODE_READ | FMODE_WRITE, &fs_fs_fops); + if (IS_ERR(f)) { + ret = PTR_ERR(f); + goto err_file; + } + + f->private_data = mc; + return f; + +err_file: + path_put(&path); + return ERR_PTR(ret); + +err_inode: + iput(inode); + return ERR_PTR(ret); +} + +static const struct super_operations fs_fs_ops = { + .drop_inode = generic_delete_inode, + .destroy_inode = free_inode_nonrcu, + .statfs = simple_statfs, +}; + +static struct dentry *fs_fs_mount(struct file_system_type *fs_type, + int flags, const char *dev_name, + void *data) +{ + return mount_pseudo(fs_type, "fs_fs:", &fs_fs_ops, + &fs_fs_dentry_operations, FS_FS_MAGIC); +} + +static struct file_system_type fs_fs_type = { + .name = "fs_fs", + .mount = fs_fs_mount, + .kill_sb = kill_anon_super, +}; + +static int __init init_fs_fs(void) +{ + int ret; + + ret = register_filesystem(&fs_fs_type); + if (ret < 0) + panic("Cannot register fs_fs\n"); + + fs_fs_mnt = kern_mount(&fs_fs_type); + if (IS_ERR(fs_fs_mnt)) + panic("Cannot mount fs_fs: %ld\n", PTR_ERR(fs_fs_mnt)); + return 0; +} + +fs_initcall(init_fs_fs); + +/* + * Open a filesystem by name so that it can be configured for mounting. + * + * We are allowed to specify a container in which the filesystem will be + * opened, thereby indicating which namespaces will be used (notably, which + * network namespace will be used for network filesystems). + */ +SYSCALL_DEFINE3(fsopen, const char __user *, _fs_name, int, reserved, + unsigned int, flags) +{ + struct mount_context *mc; + struct file *file; + const char *fs_name; + int fd, ret; + + if (flags & ~O_CLOEXEC || reserved != -1) + return -EINVAL; + + fs_name = strndup_user(_fs_name, PAGE_SIZE); + if (IS_ERR(fs_name)) + return PTR_ERR(fs_name); + + mc = vfs_fsopen(fs_name); + if (IS_ERR(mc)) { + ret = PTR_ERR(mc); + goto err_fs_name; + } + + ret = -ENOTSUPP; + if (!mc->ops) + goto err_mc; + + file = create_fs_file(mc); + if (IS_ERR(file)) { + ret = PTR_ERR(file); + goto err_mc; + } + + ret = get_unused_fd_flags(flags & O_CLOEXEC); + if (ret < 0) + goto err_file; + + fd = ret; + fd_install(fd, file); + return fd; + +err_file: + fput(file); + return ret; + +err_mc: + put_mount_context(mc); +err_fs_name: + kfree(fs_name); + return ret; +} diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 980c3c9b06f8..91ec8802ad5d 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -905,5 +905,6 @@ asmlinkage long sys_pkey_alloc(unsigned long flags, unsigned long init_val); asmlinkage long sys_pkey_free(int pkey); asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags, unsigned mask, struct statx __user *buffer); +asmlinkage long sys_fsopen(const char *fs_name, int containerfd, unsigned int flags); #endif diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h index e230af2e6855..88ae83492f7c 100644 --- a/include/uapi/linux/magic.h +++ b/include/uapi/linux/magic.h @@ -84,5 +84,6 @@ #define UDF_SUPER_MAGIC 0x15013346 #define BALLOON_KVM_MAGIC 0x13661366 #define ZSMALLOC_MAGIC 0x58295829 +#define FS_FS_MAGIC 0x66736673 #endif /* __LINUX_MAGIC_H__ */ diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 8acef8576ce9..de1dc63e7e47 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -258,3 +258,6 @@ cond_syscall(sys_membarrier); cond_syscall(sys_pkey_mprotect); cond_syscall(sys_pkey_alloc); cond_syscall(sys_pkey_free); + +/* fd-based mount */ +cond_syscall(sys_fsopen);
Provide an fsopen() system call that starts the process of preparing to mount, using an fd as a context handle. fsopen() is given the name of the filesystem that will be used: int mfd = fsopen(const char *fsname, int reserved, int open_flags); where reserved should be -1 for the moment (it will be used to pass the namespace information in future) and open_flags can be 0 or O_CLOEXEC. For example: mfd = fsopen("ext4", -1, O_CLOEXEC); write(mfd, "d /dev/sdb1"); // note I'm ignoring write's length arg write(mfd, "o noatime"); write(mfd, "o acl"); write(mfd, "o user_attr"); write(mfd, "o iversion"); write(mfd, "o "); write(mfd, "r /my/container"); // root inside the fs fsmount(mfd, container_fd, "/mnt", AT_NO_FOLLOW); mfd = fsopen("afs", -1); write(mfd, "d %grand.central.org:root.cell"); write(mfd, "o cell=grand.central.org"); write(mfd, "r /"); fsmount(mfd, AT_FDCWD, "/mnt", 0); If an error is reported at any step, an error message may be available to be read() back (ENODATA will be reported if there isn't an error available) in the form: "e <subsys>:<problem>" "e SELinux:Mount on mountpoint not permitted" Once fsmount() has been called, further write() calls will incur EBUSY, even if the fsmount() fails. read() is still possible to retrieve error information. The fsopen() syscall creates a mount context and hangs it of the fd that it returns. Netlink is not used because it is optional. Signed-off-by: David Howells <dhowells@redhat.com> --- arch/x86/entry/syscalls/syscall_32.tbl | 1 arch/x86/entry/syscalls/syscall_64.tbl | 1 fs/Makefile | 2 fs/fsopen.c | 295 ++++++++++++++++++++++++++++++++ include/linux/syscalls.h | 1 include/uapi/linux/magic.h | 1 kernel/sys_ni.c | 3 7 files changed, 303 insertions(+), 1 deletion(-) create mode 100644 fs/fsopen.c -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html