Message ID | 20220322192712.709170-1-mszeredi@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [RFC] getvalues(2) prototype | expand |
On Tue, Mar 22, 2022 at 8:27 PM Miklos Szeredi <mszeredi@redhat.com> wrote:
>
> Add a new userspace API that allows getting multiple short values in a
> single syscall.

Attaching a test program that allows arbitrary queries.

Thanks,
Miklos
On 3/22/2022 12:27 PM, Miklos Szeredi wrote:
> Add a new userspace API that allows getting multiple short values in a
> single syscall.
>
> This would be useful for the following reasons:
>
> - Calling open/read/close for many small files is inefficient. E.g. on my
>   desktop invoking lsof(1) results in ~60k open + read + close calls under
>   /proc and 90% of those are 128 bytes or less.

You don't need the generality below to address this issue.

	int openandread(const char *path, char *buffer, size_t size);

would address this case swimmingly.

> - Interfaces for getting various attributes and statistics are fragmented.
>   For files we have basic stat, statx, extended attributes, file attributes
>   (for which there are two overlapping ioctl interfaces). For mounts and
>   superblocks we have stat*fs as well as /proc/$PID/{mountinfo,mountstats}.
>   The latter also has the problem of not allowing queries on a specific
>   mount.
>
> - Some attributes are cheap to generate, some are expensive. Allowing
>   userspace to select which ones it needs should allow optimizing queries.
>
> - Adding an ascii namespace should allow easy extension and self
>   description.
>
> - The values can be text or binary, whichever fits best.
>
> The interface definition is:
>
> struct name_val {
> 	const char *name;	/* in */
> 	struct iovec value_in;	/* in */
> 	struct iovec value_out;	/* out */
> 	uint32_t error;		/* out */
> 	uint32_t reserved;
> };
>
> int getvalues(int dfd, const char *path, struct name_val *vec, size_t num,
>               unsigned int flags);
>
> @dfd and @path are used to lookup object $ORIGIN.

To be conventional you should have

	int getvalues(const char *path, struct name_val *vec, size_t num,
	              unsigned int flags);

and

	int fgetvalues(int dfd, struct name_val *vec, size_t num,
	               unsigned int flags);

> @vec contains @num name/value descriptors. @flags contains lookup flags
> for @path.
>
> The syscall returns the number of values filled or an error.
>
> A single name/value descriptor has the following fields:
>
> @name describes the object whose value is to be returned. E.g.
>
> mnt                    - list of mount parameters
> mnt:mountpoint         - the mountpoint of the mount of $ORIGIN
> mntns                  - list of mount ID's reachable from the current root
> mntns:21:parentid      - parent ID of the mount with ID of 21
> xattr:security.selinux - the security.selinux extended attribute
> data:foo/bar           - the data contained in file $ORIGIN/foo/bar
>
> If the name starts with the separator, then it is taken to have the same
> prefix as the previous name/value descriptor. E.g. in the following
> sequence of names the second one is equivalent to mnt:parentid:
>
> mnt:mountpoint
> :parentid

I would consider this a clever optimization that is likely to
cause confusion and result in lots of bugs. Yes, you'll save some
parsing time, but the debugging headaches it would introduce would
more than make up for it.

> @value_in supplies the buffer to store value(s) in. If a subsequent
> name/value descriptor has a NULL value_in.iov_base, then the buffer
> from the previous name/value descriptor will be used. This way it's
> possible to use a shared buffer for multiple values.

I would not trust very many application developers to use the NULL
value_in.iov_base correctly. In fact, I can't think of a way it could
be used sensibly. Sure, you could put two things of known size into
one buffer and use the known offset, but again that's a clever
optimization that will result in more bugs than it is worth.

> The starting address and length of the actual value will be stored in
> @value_out, unless an error has occurred in which case @error will be
> set to the positive errno value.

You only need to return the address if you do the multi-value in a buffer.
Which I've already expressed distaste for.

If the application asks for 6 attributes and is denied access to the
3rd (by an LSM let's say) what is returned? Are the 1st two buffers
filled?
How can I tell which attribute was unavailable? > Multiple names starting with the same prefix (including the shorthand form) > may also be batched together under the same lock, so the order of the names > can determine atomicity. I will believe you, but it's hardly obvious why this is true. > > Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> > --- > arch/x86/entry/syscalls/syscall_64.tbl | 1 + > fs/Makefile | 2 +- > fs/mount.h | 8 + > fs/namespace.c | 42 ++ > fs/proc_namespace.c | 2 +- > fs/values.c | 524 +++++++++++++++++++++++++ > 6 files changed, 577 insertions(+), 2 deletions(-) > create mode 100644 fs/values.c > > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl > index c84d12608cd2..c72668001b39 100644 > --- a/arch/x86/entry/syscalls/syscall_64.tbl > +++ b/arch/x86/entry/syscalls/syscall_64.tbl > @@ -372,6 +372,7 @@ > 448 common process_mrelease sys_process_mrelease > 449 common futex_waitv sys_futex_waitv > 450 common set_mempolicy_home_node sys_set_mempolicy_home_node > +451 common getvalues sys_getvalues > > # > # Due to a historical design error, certain syscalls are numbered differently > diff --git a/fs/Makefile b/fs/Makefile > index 208a74e0b00e..f00d6bcd1178 100644 > --- a/fs/Makefile > +++ b/fs/Makefile > @@ -16,7 +16,7 @@ obj-y := open.o read_write.o file_table.o super.o \ > pnode.o splice.o sync.o utimes.o d_path.o \ > stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \ > fs_types.o fs_context.o fs_parser.o fsopen.o init.o \ > - kernel_read_file.o remap_range.o > + kernel_read_file.o remap_range.o values.o > > ifeq ($(CONFIG_BLOCK),y) > obj-y += buffer.o direct-io.o mpage.o > diff --git a/fs/mount.h b/fs/mount.h > index 0b6e08cf8afb..a3ca5233e481 100644 > --- a/fs/mount.h > +++ b/fs/mount.h > @@ -148,3 +148,11 @@ static inline bool is_anon_ns(struct mnt_namespace *ns) > } > > extern void mnt_cursor_del(struct mnt_namespace *ns, struct mount *cursor); > + > +extern void namespace_lock_read(void); > +extern 
void namespace_unlock_read(void); > +extern void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt); > +extern void seq_mnt_list(struct seq_file *seq, struct mnt_namespace *ns, > + struct path *root); > +extern struct vfsmount *mnt_lookup_by_id(struct mnt_namespace *ns, > + struct path *root, int id); > diff --git a/fs/namespace.c b/fs/namespace.c > index de6fae84f1a1..52b15c17251f 100644 > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -1405,6 +1405,38 @@ void mnt_cursor_del(struct mnt_namespace *ns, struct mount *cursor) > } > #endif /* CONFIG_PROC_FS */ > > +void seq_mnt_list(struct seq_file *seq, struct mnt_namespace *ns, > + struct path *root) > +{ > + struct mount *m; > + > + down_read(&namespace_sem); > + for (m = mnt_list_next(ns, &ns->list); m; m = mnt_list_next(ns, &m->mnt_list)) { > + if (is_path_reachable(m, m->mnt.mnt_root, root)) { > + seq_printf(seq, "%i", m->mnt_id); > + seq_putc(seq, '\0'); > + } > + } > + up_read(&namespace_sem); > +} > + > +/* called with namespace_sem held for read */ > +struct vfsmount *mnt_lookup_by_id(struct mnt_namespace *ns, struct path *root, > + int id) > +{ > + struct mount *m; > + > + for (m = mnt_list_next(ns, &ns->list); m; m = mnt_list_next(ns, &m->mnt_list)) { > + if (m->mnt_id == id) { > + if (is_path_reachable(m, m->mnt.mnt_root, root)) > + return mntget(&m->mnt); > + else > + return NULL; > + } > + } > + return NULL; > +} > + > /** > * may_umount_tree - check if a mount tree is busy > * @m: root of mount tree > @@ -1494,6 +1526,16 @@ static inline void namespace_lock(void) > down_write(&namespace_sem); > } > > +void namespace_lock_read(void) > +{ > + down_read(&namespace_sem); > +} > + > +void namespace_unlock_read(void) > +{ > + up_read(&namespace_sem); > +} > + > enum umount_tree_flags { > UMOUNT_SYNC = 1, > UMOUNT_PROPAGATE = 2, > diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c > index 49650e54d2f8..fa6dc2c20578 100644 > --- a/fs/proc_namespace.c > +++ b/fs/proc_namespace.c > @@ -61,7 +61,7 @@ 
static int show_sb_opts(struct seq_file *m, struct super_block *sb) > return security_sb_show_options(m, sb); > } > > -static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt) > +void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt) > { > static const struct proc_fs_opts mnt_opts[] = { > { MNT_NOSUID, ",nosuid" }, > diff --git a/fs/values.c b/fs/values.c > new file mode 100644 > index 000000000000..618fa9bf48a1 > --- /dev/null > +++ b/fs/values.c > @@ -0,0 +1,524 @@ > +#include <linux/syscalls.h> > +#include <linux/printk.h> > +#include <linux/namei.h> > +#include <linux/fs_struct.h> > +#include <linux/posix_acl_xattr.h> > +#include <linux/xattr.h> > +#include "pnode.h" > +#include "internal.h" > + > +#define VAL_GRSEP ':' > + > +struct name_val { > + const char __user *name; /* in */ > + struct iovec value_in; /* in */ > + struct iovec value_out; /* out */ > + __u32 error; /* out */ > + __u32 reserved; > +}; > + > +struct val_iter { > + struct name_val __user *curr; > + size_t num; > + struct iovec vec; > + char name[256]; > + size_t bufsize; > + struct seq_file seq; > + const char *prefix; > + const char *sub; > +}; > + > +struct val_desc { > + const char *name; > + union { > + int idx; > + int (*get)(struct val_iter *vi, const struct path *path); > + }; > +}; > + > +static int val_get(struct val_iter *vi) > +{ > + struct name_val nameval; > + long err; > + > + if (copy_from_user(&nameval, vi->curr, sizeof(nameval))) > + return -EFAULT; > + > + err = strncpy_from_user(vi->name, nameval.name, sizeof(vi->name)); > + if (err < 0) > + return err; > + if (err == sizeof(vi->name)) > + return -ERANGE; > + > + if (nameval.value_in.iov_base) > + vi->vec = nameval.value_in; > + > + vi->seq.size = min(vi->vec.iov_len, vi->bufsize); > + vi->seq.count = 0; > + > + return 0; > +} > + > +static int val_next(struct val_iter *vi) > +{ > + vi->curr++; > + vi->num--; > + > + return vi->num ? 
val_get(vi) : 0; > +} > + > +static int val_end(struct val_iter *vi, size_t count) > +{ > + struct iovec iov = { > + .iov_base = vi->vec.iov_base, > + .iov_len = count, > + }; > + > + if (copy_to_user(&vi->curr->value_out, &iov, sizeof(iov))) > + return -EFAULT; > + > + vi->vec.iov_base += count; > + vi->vec.iov_len -= count; > + > + return val_next(vi); > +} > + > +static int val_err(struct val_iter *vi, int err) > +{ > + if (put_user(-err, &vi->curr->error)) > + return -EFAULT; > + > + return val_next(vi); > +} > + > +static int val_end_seq(struct val_iter *vi, int err) > +{ > + size_t count = vi->seq.count; > + > + if (err) > + return val_err(vi, err); > + > + if (count == vi->seq.size) > + return -EOVERFLOW; > + > + if (copy_to_user(vi->vec.iov_base, vi->seq.buf, count)) > + return -EFAULT; > + > + return val_end(vi, count); > +} > + > +static struct val_desc *val_lookup(struct val_iter *vi, struct val_desc *vd) > +{ > + const char *name = vi->name; > + const char *prefix = vi->prefix; > + size_t prefixlen = strlen(prefix); > + > + if (prefixlen) { > + /* > + * Name beginning with a group separator is a shorthand for > + * the previous prefix. 
> + */ > + if (name[0] == VAL_GRSEP) { > + name++; > + } else { > + if (strncmp(name, prefix, prefixlen) != 0) > + return NULL; > + name += prefixlen; > + } > + } > + > + vi->sub = NULL; > + for (; vd->name; vd++) { > + if (strcmp(name, vd->name) == 0) > + break; > + else { > + size_t grlen = strlen(vd->name); > + > + if (strncmp(vd->name, name, grlen) == 0 && > + name[grlen] == VAL_GRSEP) { > + vi->sub = name + grlen + 1; > + break; > + } > + } > + } > + return vd; > +} > + > +static int val_get_group(struct val_iter *vi, struct val_desc *vd) > +{ > + for (; vd->name; vd++) > + seq_write(&vi->seq, vd->name, strlen(vd->name) + 1); > + > + return val_end_seq(vi, 0); > +} > + > +static bool val_push_prefix(struct val_iter *vi, const char **oldprefix) > +{ > + char *newprefix; > + > + newprefix = kmemdup_nul(vi->name, vi->sub - vi->name, GFP_KERNEL); > + if (newprefix) { > + *oldprefix = vi->prefix; > + vi->prefix = newprefix; > + } > + > + return newprefix; > +} > + > +static void val_pop_prefix(struct val_iter *vi, const char *oldprefix) > +{ > + kfree(vi->prefix); > + vi->prefix = oldprefix; > +} > + > +enum { > + VAL_MNT_ID, > + VAL_MNT_PARENTID, > + VAL_MNT_ROOT, > + VAL_MNT_MOUNTPOINT, > + VAL_MNT_OPTIONS, > + VAL_MNT_SHARED, > + VAL_MNT_MASTER, > + VAL_MNT_PROPAGATE_FROM, > + VAL_MNT_UNBINDABLE, > + VAL_MNT_NOTFOUND, > +}; > + > +static struct val_desc val_mnt_group[] = { > + { .name = "id", .idx = VAL_MNT_ID }, > + { .name = "parentid", .idx = VAL_MNT_PARENTID, }, > + { .name = "root", .idx = VAL_MNT_ROOT, }, > + { .name = "mountpoint", .idx = VAL_MNT_MOUNTPOINT, }, > + { .name = "options", .idx = VAL_MNT_OPTIONS, }, > + { .name = "shared", .idx = VAL_MNT_SHARED, }, > + { .name = "master", .idx = VAL_MNT_MASTER, }, > + { .name = "propagate_from", .idx = VAL_MNT_PROPAGATE_FROM, }, > + { .name = "unbindable", .idx = VAL_MNT_UNBINDABLE, }, > + { .name = NULL, .idx = VAL_MNT_NOTFOUND }, > +}; > + > +static int seq_mnt_root(struct seq_file *seq, struct vfsmount 
*mnt) > +{ > + struct super_block *sb = mnt->mnt_sb; > + int err = 0; > + > + if (sb->s_op->show_path) { > + err = sb->s_op->show_path(seq, mnt->mnt_root); > + if (!err) { > + seq_putc(seq, '\0'); > + if (seq->count < seq->size) > + seq->count = string_unescape(seq->buf, seq->buf, seq->size, UNESCAPE_OCTAL); > + } > + } else { > + seq_dentry(seq, mnt->mnt_root, ""); > + } > + > + return err; > +} > + > +static int val_mnt_show(struct val_iter *vi, struct vfsmount *mnt) > +{ > + struct mount *m = real_mount(mnt); > + struct path root, mnt_path; > + struct val_desc *vd; > + const char *oldprefix; > + int err = 0; > + > + if (!val_push_prefix(vi, &oldprefix)) > + return -ENOMEM; > + > + while (!err && vi->num) { > + vd = val_lookup(vi, val_mnt_group); > + if (!vd) > + break; > + > + switch(vd->idx) { > + case VAL_MNT_ID: > + seq_printf(&vi->seq, "%i", m->mnt_id); > + break; > + case VAL_MNT_PARENTID: > + seq_printf(&vi->seq, "%i", m->mnt_parent->mnt_id); > + break; > + case VAL_MNT_ROOT: > + seq_mnt_root(&vi->seq, mnt); > + break; > + case VAL_MNT_MOUNTPOINT: > + get_fs_root(current->fs, &root); > + mnt_path.dentry = mnt->mnt_root; > + mnt_path.mnt = mnt; > + err = seq_path_root(&vi->seq, &mnt_path, &root, ""); > + path_put(&root); > + break; > + case VAL_MNT_OPTIONS: > + seq_puts(&vi->seq, mnt->mnt_flags & MNT_READONLY ? 
"ro" : "rw"); > + show_mnt_opts(&vi->seq, mnt); > + break; > + case VAL_MNT_SHARED: > + if (IS_MNT_SHARED(m)) > + seq_printf(&vi->seq, "%i,", m->mnt_group_id); > + break; > + case VAL_MNT_MASTER: > + if (IS_MNT_SLAVE(m)) > + seq_printf(&vi->seq, "%i,", > + m->mnt_master->mnt_group_id); > + break; > + case VAL_MNT_PROPAGATE_FROM: > + if (IS_MNT_SLAVE(m)) { > + int dom; > + > + get_fs_root(current->fs, &root); > + dom = get_dominating_id(m, &root); > + path_put(&root); > + if (dom) > + seq_printf(&vi->seq, "%i,", dom); > + } > + break; > + case VAL_MNT_UNBINDABLE: > + if (IS_MNT_UNBINDABLE(m)) > + seq_puts(&vi->seq, "yes"); > + break; > + default: > + err = -ENOENT; > + break; > + } > + err = val_end_seq(vi, err); > + } > + val_pop_prefix(vi, oldprefix); > + > + return err; > +} > + > +static int val_mnt_get(struct val_iter *vi, const struct path *path) > +{ > + int err; > + > + if (!vi->sub) > + return val_get_group(vi, val_mnt_group); > + > + namespace_lock_read(); > + err = val_mnt_show(vi, path->mnt); > + namespace_unlock_read(); > + > + return err; > +} > + > +static int val_mntns_get(struct val_iter *vi, const struct path *path) > +{ > + struct mnt_namespace *mnt_ns = current->nsproxy->mnt_ns; > + struct vfsmount *mnt; > + struct path root; > + char *end; > + int mnt_id; > + int err; > + > + if (!vi->sub) { > + get_fs_root(current->fs, &root); > + seq_mnt_list(&vi->seq, mnt_ns, &root); > + path_put(&root); > + return val_end_seq(vi, 0); > + } > + > + end = strchr(vi->sub, VAL_GRSEP); > + if (end) > + *end = '\0'; > + err = kstrtoint(vi->sub, 10, &mnt_id); > + if (err) > + return val_err(vi, err); > + vi->sub = NULL; > + if (end) { > + *end = VAL_GRSEP; > + vi->sub = end + 1; > + } > + > + namespace_lock_read(); > + get_fs_root(current->fs, &root); > + mnt = mnt_lookup_by_id(mnt_ns, &root, mnt_id); > + path_put(&root); > + if (!mnt) { > + namespace_unlock_read(); > + return val_err(vi, -ENOENT); > + } > + if (vi->sub) > + err = val_mnt_show(vi, mnt); > + else > 
+ err = val_get_group(vi, val_mnt_group); > + > + namespace_unlock_read(); > + mntput(mnt); > + > + return err; > +} > + > +static ssize_t val_do_read(struct val_iter *vi, struct path *path) > +{ > + ssize_t ret; > + struct file *file; > + struct open_flags op = { > + .open_flag = O_RDONLY, > + .acc_mode = MAY_READ, > + .intent = LOOKUP_OPEN, > + }; > + > + file = do_file_open_root(path, "", &op); > + if (IS_ERR(file)) > + return PTR_ERR(file); > + > + ret = vfs_read(file, vi->vec.iov_base, vi->vec.iov_len, NULL); > + fput(file); > + > + return ret; > +} > + > +static ssize_t val_do_readlink(struct val_iter *vi, struct path *path) > +{ > + int ret; > + > + ret = security_inode_readlink(path->dentry); > + if (ret) > + return ret; > + > + return vfs_readlink(path->dentry, vi->vec.iov_base, vi->vec.iov_len); > +} > + > +static inline bool dot_or_dotdot(const char *s) > +{ > + return s[0] == '.' && > + (s[1] == '/' || s[1] == '\0' || > + (s[1] == '.' && (s[2] == '/' || s[2] == '\0'))); > +} > + > +/* > + * - empty path is okay > + * - must not begin or end with slash or have a double slash anywhere > + * - must not have . or .. 
components > + */ > +static bool val_verify_path(const char *subpath) > +{ > + const char *s = subpath; > + > + if (s[0] == '\0') > + return true; > + > + if (s[0] == '/' || s[strlen(s) - 1] == '/' || strstr(s, "//")) > + return false; > + > + for (s--; s; s = strstr(s + 3, "/.")) > + if (dot_or_dotdot(s + 1)) > + return false; > + > + return true; > +} > + > +static int val_data_get(struct val_iter *vi, const struct path *path) > +{ > + struct path this; > + ssize_t ret; > + > + if (!vi->sub) > + return val_err(vi, -ENOENT); > + > + if (!val_verify_path(vi->sub)) > + return val_err(vi, -EINVAL); > + > + ret = vfs_path_lookup(path->dentry, path->mnt, vi->sub, > + LOOKUP_NO_XDEV | LOOKUP_BENEATH | > + LOOKUP_IN_ROOT, &this); > + if (ret) > + return val_err(vi, ret); > + > + ret = -ENODATA; > + if (d_is_reg(this.dentry) || d_is_symlink(this.dentry)) { > + if (d_is_reg(this.dentry)) > + ret = val_do_read(vi, &this); > + else > + ret = val_do_readlink(vi, &this); > + } > + path_put(&this); > + if (ret == -EFAULT) > + return ret; > + if (ret < 0) > + return val_err(vi, ret); > + if (ret == vi->vec.iov_len) > + return -EOVERFLOW; > + > + return val_end(vi, ret); > +} > + > +static int val_xattr_get(struct val_iter *vi, const struct path *path) > +{ > + ssize_t ret; > + struct user_namespace *mnt_userns = mnt_user_ns(path->mnt); > + void *value = vi->seq.buf + vi->seq.count; > + size_t size = min_t(size_t, vi->seq.size - vi->seq.count, > + XATTR_SIZE_MAX); > + > + if (!vi->sub) > + return val_err(vi, -ENOENT); > + > + ret = vfs_getxattr(mnt_userns, path->dentry, vi->sub, value, size); > + if (ret < 0) > + return val_err(vi, ret); > + > + if ((strcmp(vi->sub, XATTR_NAME_POSIX_ACL_ACCESS) == 0) || > + (strcmp(vi->sub, XATTR_NAME_POSIX_ACL_DEFAULT) == 0)) > + posix_acl_fix_xattr_to_user(mnt_userns, value, ret); > + > + vi->seq.count += ret; > + > + return val_end_seq(vi, 0); > +} > + > + > +static struct val_desc val_toplevel_group[] = { > + { .name = "mnt", .get = 
val_mnt_get, }, > + { .name = "mntns", .get = val_mntns_get, }, > + { .name = "xattr", .get = val_xattr_get, }, > + { .name = "data", .get = val_data_get, }, > + { .name = NULL }, > +}; > + > +SYSCALL_DEFINE5(getvalues, > + int, dfd, > + const char __user *, pathname, > + struct name_val __user *, vec, > + size_t, num, > + unsigned int, flags) > +{ > + char vals[1024]; > + struct val_iter vi = { > + .curr = vec, > + .num = num, > + .seq.buf = vals, > + .bufsize = sizeof(vals), > + .prefix = "", > + }; > + struct val_desc *vd; > + struct path path = {}; > + ssize_t err; > + > + err = user_path_at(dfd, pathname, 0, &path); > + if (err) > + return err; > + > + err = val_get(&vi); > + if (err) > + goto out; > + > + if (!strlen(vi.name)) { > + err = val_get_group(&vi, val_toplevel_group); > + goto out; > + } > + while (!err && vi.num) { > + vd = val_lookup(&vi, val_toplevel_group); > + if (!vd->name) > + err = val_err(&vi, -ENOENT); > + else > + err = vd->get(&vi, &path); > + } > +out: > + if (err == -EOVERFLOW) > + err = 0; > + > + path_put(&path); > + return err < 0 ? err : num - vi.num; > +}
On 3/22/2022 1:36 PM, Casey Schaufler wrote:
> On 3/22/2022 12:27 PM, Miklos Szeredi wrote:
>> Add a new userspace API that allows getting multiple short values in a
>> single syscall.
>>
>> This would be useful for the following reasons:
>>
>> - Calling open/read/close for many small files is inefficient. E.g. on my
>>   desktop invoking lsof(1) results in ~60k open + read + close calls under
>>   /proc and 90% of those are 128 bytes or less.
>
> You don't need the generality below to address this issue.
>
> 	int openandread(const char *path, char *buffer, size_t size);
>
> would address this case swimmingly.
>
>> - Interfaces for getting various attributes and statistics are fragmented.
>>   For files we have basic stat, statx, extended attributes, file attributes
>>   (for which there are two overlapping ioctl interfaces). For mounts and
>>   superblocks we have stat*fs as well as /proc/$PID/{mountinfo,mountstats}.
>>   The latter also has the problem of not allowing queries on a specific
>>   mount.
>>
>> - Some attributes are cheap to generate, some are expensive. Allowing
>>   userspace to select which ones it needs should allow optimizing queries.
>>
>> - Adding an ascii namespace should allow easy extension and self
>>   description.

...

I forgot to mention that without a mechanism to ask what attributes are
available on the file this is completely pointless. If the application
asks for xattr:security.selinux on a system using Smack, which would have
xattr:security.SMACK64 and might have xattr:security.SMACK64EXEC or
xattr:security.SMACK64TRANSMUTE, it won't be happy. Or on a system with
no LSM for that matter.

>>
>> - The values can be text or binary, whichever fits best.
>> >> The interface definition is: >> >> struct name_val { >> const char *name; /* in */ >> struct iovec value_in; /* in */ >> struct iovec value_out; /* out */ >> uint32_t error; /* out */ >> uint32_t reserved; >> }; >> >> int getvalues(int dfd, const char *path, struct name_val *vec, size_t num, >> unsigned int flags); >> >> @dfd and @path are used to lookup object $ORIGIN. > > To be conventional you should have > > int getvalues(const char *path, struct name_val *vec, size_t num, > unsigned int flags); > > and > > int fgetvalues(int dfd, struct name_val *vec, size_t num, > unsigned int flags); > >> @vec contains @num >> name/value descriptors. @flags contains lookup flags for @path. >> >> The syscall returns the number of values filled or an error. >> >> A single name/value descriptor has the following fields: >> >> @name describes the object whose value is to be returned. E.g. >> >> mnt - list of mount parameters >> mnt:mountpoint - the mountpoint of the mount of $ORIGIN >> mntns - list of mount ID's reachable from the current root >> mntns:21:parentid - parent ID of the mount with ID of 21 >> xattr:security.selinux - the security.selinux extended attribute >> data:foo/bar - the data contained in file $ORIGIN/foo/bar >> >> If the name starts with the separator, then it is taken to have the same >> prefix as the previous name/value descriptor. E.g. in the following >> sequence of names the second one is equivalent to mnt:parentid: >> >> mnt:mountpoint >> :parentid > > I would consider this a clever optimization that is likely to > cause confusion and result in lots of bugs. Yes, you'll save some > parsing time, but the debugging headaches it would introduce would > more than make up for it. > >> @value_in supplies the buffer to store value(s) in. If a subsequent >> name/value descriptor has NULL value of value_in.iov_base, then the buffer >> from the previous name/value descriptor will be used. 
This way it's >> possible to use a shared buffer for multiple values. > > I would not trust very many application developers to use the NULL > value_in.iov_base correctly. In fact, I can't think of a way it could > be used sensibly. Sure, you could put two things of known size into > one buffer and use the known offset, but again that's a clever > optimization that will result in more bugs than it is worth. > >> The starting address and length of the actual value will be stored in >> @value_out, unless an error has occurred in which case @error will be set to >> the positive errno value. > > You only need to return the address if you do the multi-value in a buffer. > Which I've already expressed distaste for. > > If the application asks for 6 attributes and is denied access to the > 3rd (by an LSM let's say) what is returned? Are the 1st two buffers > filled? How can I tell which attribute was unavailable? > >> Multiple names starting with the same prefix (including the shorthand form) >> may also be batched together under the same lock, so the order of the names >> can determine atomicity. > > I will believe you, but it's hardly obvious why this is true. 
> >> >> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> >> --- >> arch/x86/entry/syscalls/syscall_64.tbl | 1 + >> fs/Makefile | 2 +- >> fs/mount.h | 8 + >> fs/namespace.c | 42 ++ >> fs/proc_namespace.c | 2 +- >> fs/values.c | 524 +++++++++++++++++++++++++ >> 6 files changed, 577 insertions(+), 2 deletions(-) >> create mode 100644 fs/values.c >> >> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl >> index c84d12608cd2..c72668001b39 100644 >> --- a/arch/x86/entry/syscalls/syscall_64.tbl >> +++ b/arch/x86/entry/syscalls/syscall_64.tbl >> @@ -372,6 +372,7 @@ >> 448 common process_mrelease sys_process_mrelease >> 449 common futex_waitv sys_futex_waitv >> 450 common set_mempolicy_home_node sys_set_mempolicy_home_node >> +451 common getvalues sys_getvalues >> # >> # Due to a historical design error, certain syscalls are numbered differently >> diff --git a/fs/Makefile b/fs/Makefile >> index 208a74e0b00e..f00d6bcd1178 100644 >> --- a/fs/Makefile >> +++ b/fs/Makefile >> @@ -16,7 +16,7 @@ obj-y := open.o read_write.o file_table.o super.o \ >> pnode.o splice.o sync.o utimes.o d_path.o \ >> stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \ >> fs_types.o fs_context.o fs_parser.o fsopen.o init.o \ >> - kernel_read_file.o remap_range.o >> + kernel_read_file.o remap_range.o values.o >> ifeq ($(CONFIG_BLOCK),y) >> obj-y += buffer.o direct-io.o mpage.o >> diff --git a/fs/mount.h b/fs/mount.h >> index 0b6e08cf8afb..a3ca5233e481 100644 >> --- a/fs/mount.h >> +++ b/fs/mount.h >> @@ -148,3 +148,11 @@ static inline bool is_anon_ns(struct mnt_namespace *ns) >> } >> extern void mnt_cursor_del(struct mnt_namespace *ns, struct mount *cursor); >> + >> +extern void namespace_lock_read(void); >> +extern void namespace_unlock_read(void); >> +extern void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt); >> +extern void seq_mnt_list(struct seq_file *seq, struct mnt_namespace *ns, >> + struct path *root); >> +extern struct vfsmount 
*mnt_lookup_by_id(struct mnt_namespace *ns, >> + struct path *root, int id); >> diff --git a/fs/namespace.c b/fs/namespace.c >> index de6fae84f1a1..52b15c17251f 100644 >> --- a/fs/namespace.c >> +++ b/fs/namespace.c >> @@ -1405,6 +1405,38 @@ void mnt_cursor_del(struct mnt_namespace *ns, struct mount *cursor) >> } >> #endif /* CONFIG_PROC_FS */ >> +void seq_mnt_list(struct seq_file *seq, struct mnt_namespace *ns, >> + struct path *root) >> +{ >> + struct mount *m; >> + >> + down_read(&namespace_sem); >> + for (m = mnt_list_next(ns, &ns->list); m; m = mnt_list_next(ns, &m->mnt_list)) { >> + if (is_path_reachable(m, m->mnt.mnt_root, root)) { >> + seq_printf(seq, "%i", m->mnt_id); >> + seq_putc(seq, '\0'); >> + } >> + } >> + up_read(&namespace_sem); >> +} >> + >> +/* called with namespace_sem held for read */ >> +struct vfsmount *mnt_lookup_by_id(struct mnt_namespace *ns, struct path *root, >> + int id) >> +{ >> + struct mount *m; >> + >> + for (m = mnt_list_next(ns, &ns->list); m; m = mnt_list_next(ns, &m->mnt_list)) { >> + if (m->mnt_id == id) { >> + if (is_path_reachable(m, m->mnt.mnt_root, root)) >> + return mntget(&m->mnt); >> + else >> + return NULL; >> + } >> + } >> + return NULL; >> +} >> + >> /** >> * may_umount_tree - check if a mount tree is busy >> * @m: root of mount tree >> @@ -1494,6 +1526,16 @@ static inline void namespace_lock(void) >> down_write(&namespace_sem); >> } >> +void namespace_lock_read(void) >> +{ >> + down_read(&namespace_sem); >> +} >> + >> +void namespace_unlock_read(void) >> +{ >> + up_read(&namespace_sem); >> +} >> + >> enum umount_tree_flags { >> UMOUNT_SYNC = 1, >> UMOUNT_PROPAGATE = 2, >> diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c >> index 49650e54d2f8..fa6dc2c20578 100644 >> --- a/fs/proc_namespace.c >> +++ b/fs/proc_namespace.c >> @@ -61,7 +61,7 @@ static int show_sb_opts(struct seq_file *m, struct super_block *sb) >> return security_sb_show_options(m, sb); >> } >> -static void show_mnt_opts(struct seq_file *m, struct 
vfsmount *mnt) >> +void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt) >> { >> static const struct proc_fs_opts mnt_opts[] = { >> { MNT_NOSUID, ",nosuid" }, >> diff --git a/fs/values.c b/fs/values.c >> new file mode 100644 >> index 000000000000..618fa9bf48a1 >> --- /dev/null >> +++ b/fs/values.c >> @@ -0,0 +1,524 @@ >> +#include <linux/syscalls.h> >> +#include <linux/printk.h> >> +#include <linux/namei.h> >> +#include <linux/fs_struct.h> >> +#include <linux/posix_acl_xattr.h> >> +#include <linux/xattr.h> >> +#include "pnode.h" >> +#include "internal.h" >> + >> +#define VAL_GRSEP ':' >> + >> +struct name_val { >> + const char __user *name; /* in */ >> + struct iovec value_in; /* in */ >> + struct iovec value_out; /* out */ >> + __u32 error; /* out */ >> + __u32 reserved; >> +}; >> + >> +struct val_iter { >> + struct name_val __user *curr; >> + size_t num; >> + struct iovec vec; >> + char name[256]; >> + size_t bufsize; >> + struct seq_file seq; >> + const char *prefix; >> + const char *sub; >> +}; >> + >> +struct val_desc { >> + const char *name; >> + union { >> + int idx; >> + int (*get)(struct val_iter *vi, const struct path *path); >> + }; >> +}; >> + >> +static int val_get(struct val_iter *vi) >> +{ >> + struct name_val nameval; >> + long err; >> + >> + if (copy_from_user(&nameval, vi->curr, sizeof(nameval))) >> + return -EFAULT; >> + >> + err = strncpy_from_user(vi->name, nameval.name, sizeof(vi->name)); >> + if (err < 0) >> + return err; >> + if (err == sizeof(vi->name)) >> + return -ERANGE; >> + >> + if (nameval.value_in.iov_base) >> + vi->vec = nameval.value_in; >> + >> + vi->seq.size = min(vi->vec.iov_len, vi->bufsize); >> + vi->seq.count = 0; >> + >> + return 0; >> +} >> + >> +static int val_next(struct val_iter *vi) >> +{ >> + vi->curr++; >> + vi->num--; >> + >> + return vi->num ? 
val_get(vi) : 0; >> +} >> + >> +static int val_end(struct val_iter *vi, size_t count) >> +{ >> + struct iovec iov = { >> + .iov_base = vi->vec.iov_base, >> + .iov_len = count, >> + }; >> + >> + if (copy_to_user(&vi->curr->value_out, &iov, sizeof(iov))) >> + return -EFAULT; >> + >> + vi->vec.iov_base += count; >> + vi->vec.iov_len -= count; >> + >> + return val_next(vi); >> +} >> + >> +static int val_err(struct val_iter *vi, int err) >> +{ >> + if (put_user(-err, &vi->curr->error)) >> + return -EFAULT; >> + >> + return val_next(vi); >> +} >> + >> +static int val_end_seq(struct val_iter *vi, int err) >> +{ >> + size_t count = vi->seq.count; >> + >> + if (err) >> + return val_err(vi, err); >> + >> + if (count == vi->seq.size) >> + return -EOVERFLOW; >> + >> + if (copy_to_user(vi->vec.iov_base, vi->seq.buf, count)) >> + return -EFAULT; >> + >> + return val_end(vi, count); >> +} >> + >> +static struct val_desc *val_lookup(struct val_iter *vi, struct val_desc *vd) >> +{ >> + const char *name = vi->name; >> + const char *prefix = vi->prefix; >> + size_t prefixlen = strlen(prefix); >> + >> + if (prefixlen) { >> + /* >> + * Name beggining with a group separator is a shorthand for >> + * previously prefix. 
>> + */ >> + if (name[0] == VAL_GRSEP) { >> + name++; >> + } else { >> + if (strncmp(name, prefix, prefixlen) != 0) >> + return NULL; >> + name += prefixlen; >> + } >> + } >> + >> + vi->sub = NULL; >> + for (; vd->name; vd++) { >> + if (strcmp(name, vd->name) == 0) >> + break; >> + else { >> + size_t grlen = strlen(vd->name); >> + >> + if (strncmp(vd->name, name, grlen) == 0 && >> + name[grlen] == VAL_GRSEP) { >> + vi->sub = name + grlen + 1; >> + break; >> + } >> + } >> + } >> + return vd; >> +} >> + >> +static int val_get_group(struct val_iter *vi, struct val_desc *vd) >> +{ >> + for (; vd->name; vd++) >> + seq_write(&vi->seq, vd->name, strlen(vd->name) + 1); >> + >> + return val_end_seq(vi, 0); >> +} >> + >> +static bool val_push_prefix(struct val_iter *vi, const char **oldprefix) >> +{ >> + char *newprefix; >> + >> + newprefix = kmemdup_nul(vi->name, vi->sub - vi->name, GFP_KERNEL); >> + if (newprefix) { >> + *oldprefix = vi->prefix; >> + vi->prefix = newprefix; >> + } >> + >> + return newprefix; >> +} >> + >> +static void val_pop_prefix(struct val_iter *vi, const char *oldprefix) >> +{ >> + kfree(vi->prefix); >> + vi->prefix = oldprefix; >> +} >> + >> +enum { >> + VAL_MNT_ID, >> + VAL_MNT_PARENTID, >> + VAL_MNT_ROOT, >> + VAL_MNT_MOUNTPOINT, >> + VAL_MNT_OPTIONS, >> + VAL_MNT_SHARED, >> + VAL_MNT_MASTER, >> + VAL_MNT_PROPAGATE_FROM, >> + VAL_MNT_UNBINDABLE, >> + VAL_MNT_NOTFOUND, >> +}; >> + >> +static struct val_desc val_mnt_group[] = { >> + { .name = "id", .idx = VAL_MNT_ID }, >> + { .name = "parentid", .idx = VAL_MNT_PARENTID, }, >> + { .name = "root", .idx = VAL_MNT_ROOT, }, >> + { .name = "mountpoint", .idx = VAL_MNT_MOUNTPOINT, }, >> + { .name = "options", .idx = VAL_MNT_OPTIONS, }, >> + { .name = "shared", .idx = VAL_MNT_SHARED, }, >> + { .name = "master", .idx = VAL_MNT_MASTER, }, >> + { .name = "propagate_from", .idx = VAL_MNT_PROPAGATE_FROM, }, >> + { .name = "unbindable", .idx = VAL_MNT_UNBINDABLE, }, >> + { .name = NULL, .idx = VAL_MNT_NOTFOUND }, 
>> +}; >> + >> +static int seq_mnt_root(struct seq_file *seq, struct vfsmount *mnt) >> +{ >> + struct super_block *sb = mnt->mnt_sb; >> + int err = 0; >> + >> + if (sb->s_op->show_path) { >> + err = sb->s_op->show_path(seq, mnt->mnt_root); >> + if (!err) { >> + seq_putc(seq, '\0'); >> + if (seq->count < seq->size) >> + seq->count = string_unescape(seq->buf, seq->buf, seq->size, UNESCAPE_OCTAL); >> + } >> + } else { >> + seq_dentry(seq, mnt->mnt_root, ""); >> + } >> + >> + return err; >> +} >> + >> +static int val_mnt_show(struct val_iter *vi, struct vfsmount *mnt) >> +{ >> + struct mount *m = real_mount(mnt); >> + struct path root, mnt_path; >> + struct val_desc *vd; >> + const char *oldprefix; >> + int err = 0; >> + >> + if (!val_push_prefix(vi, &oldprefix)) >> + return -ENOMEM; >> + >> + while (!err && vi->num) { >> + vd = val_lookup(vi, val_mnt_group); >> + if (!vd) >> + break; >> + >> + switch(vd->idx) { >> + case VAL_MNT_ID: >> + seq_printf(&vi->seq, "%i", m->mnt_id); >> + break; >> + case VAL_MNT_PARENTID: >> + seq_printf(&vi->seq, "%i", m->mnt_parent->mnt_id); >> + break; >> + case VAL_MNT_ROOT: >> + seq_mnt_root(&vi->seq, mnt); >> + break; >> + case VAL_MNT_MOUNTPOINT: >> + get_fs_root(current->fs, &root); >> + mnt_path.dentry = mnt->mnt_root; >> + mnt_path.mnt = mnt; >> + err = seq_path_root(&vi->seq, &mnt_path, &root, ""); >> + path_put(&root); >> + break; >> + case VAL_MNT_OPTIONS: >> + seq_puts(&vi->seq, mnt->mnt_flags & MNT_READONLY ? 
"ro" : "rw"); >> + show_mnt_opts(&vi->seq, mnt); >> + break; >> + case VAL_MNT_SHARED: >> + if (IS_MNT_SHARED(m)) >> + seq_printf(&vi->seq, "%i,", m->mnt_group_id); >> + break; >> + case VAL_MNT_MASTER: >> + if (IS_MNT_SLAVE(m)) >> + seq_printf(&vi->seq, "%i,", >> + m->mnt_master->mnt_group_id); >> + break; >> + case VAL_MNT_PROPAGATE_FROM: >> + if (IS_MNT_SLAVE(m)) { >> + int dom; >> + >> + get_fs_root(current->fs, &root); >> + dom = get_dominating_id(m, &root); >> + path_put(&root); >> + if (dom) >> + seq_printf(&vi->seq, "%i,", dom); >> + } >> + break; >> + case VAL_MNT_UNBINDABLE: >> + if (IS_MNT_UNBINDABLE(m)) >> + seq_puts(&vi->seq, "yes"); >> + break; >> + default: >> + err = -ENOENT; >> + break; >> + } >> + err = val_end_seq(vi, err); >> + } >> + val_pop_prefix(vi, oldprefix); >> + >> + return err; >> +} >> + >> +static int val_mnt_get(struct val_iter *vi, const struct path *path) >> +{ >> + int err; >> + >> + if (!vi->sub) >> + return val_get_group(vi, val_mnt_group); >> + >> + namespace_lock_read(); >> + err = val_mnt_show(vi, path->mnt); >> + namespace_unlock_read(); >> + >> + return err; >> +} >> + >> +static int val_mntns_get(struct val_iter *vi, const struct path *path) >> +{ >> + struct mnt_namespace *mnt_ns = current->nsproxy->mnt_ns; >> + struct vfsmount *mnt; >> + struct path root; >> + char *end; >> + int mnt_id; >> + int err; >> + >> + if (!vi->sub) { >> + get_fs_root(current->fs, &root); >> + seq_mnt_list(&vi->seq, mnt_ns, &root); >> + path_put(&root); >> + return val_end_seq(vi, 0); >> + } >> + >> + end = strchr(vi->sub, VAL_GRSEP); >> + if (end) >> + *end = '\0'; >> + err = kstrtoint(vi->sub, 10, &mnt_id); >> + if (err) >> + return val_err(vi, err); >> + vi->sub = NULL; >> + if (end) { >> + *end = VAL_GRSEP; >> + vi->sub = end + 1; >> + } >> + >> + namespace_lock_read(); >> + get_fs_root(current->fs, &root); >> + mnt = mnt_lookup_by_id(mnt_ns, &root, mnt_id); >> + path_put(&root); >> + if (!mnt) { >> + namespace_unlock_read(); >> + return 
val_err(vi, -ENOENT); >> + } >> + if (vi->sub) >> + err = val_mnt_show(vi, mnt); >> + else >> + err = val_get_group(vi, val_mnt_group); >> + >> + namespace_unlock_read(); >> + mntput(mnt); >> + >> + return err; >> +} >> + >> +static ssize_t val_do_read(struct val_iter *vi, struct path *path) >> +{ >> + ssize_t ret; >> + struct file *file; >> + struct open_flags op = { >> + .open_flag = O_RDONLY, >> + .acc_mode = MAY_READ, >> + .intent = LOOKUP_OPEN, >> + }; >> + >> + file = do_file_open_root(path, "", &op); >> + if (IS_ERR(file)) >> + return PTR_ERR(file); >> + >> + ret = vfs_read(file, vi->vec.iov_base, vi->vec.iov_len, NULL); >> + fput(file); >> + >> + return ret; >> +} >> + >> +static ssize_t val_do_readlink(struct val_iter *vi, struct path *path) >> +{ >> + int ret; >> + >> + ret = security_inode_readlink(path->dentry); >> + if (ret) >> + return ret; >> + >> + return vfs_readlink(path->dentry, vi->vec.iov_base, vi->vec.iov_len); >> +} >> + >> +static inline bool dot_or_dotdot(const char *s) >> +{ >> + return s[0] == '.' && >> + (s[1] == '/' || s[1] == '\0' || >> + (s[1] == '.' && (s[2] == '/' || s[2] == '\0'))); >> +} >> + >> +/* >> + * - empty path is okay >> + * - must not begin or end with slash or have a double slash anywhere >> + * - must not have . or .. 
components >> + */ >> +static bool val_verify_path(const char *subpath) >> +{ >> + const char *s = subpath; >> + >> + if (s[0] == '\0') >> + return true; >> + >> + if (s[0] == '/' || s[strlen(s) - 1] == '/' || strstr(s, "//")) >> + return false; >> + >> + for (s--; s; s = strstr(s + 3, "/.")) >> + if (dot_or_dotdot(s + 1)) >> + return false; >> + >> + return true; >> +} >> + >> +static int val_data_get(struct val_iter *vi, const struct path *path) >> +{ >> + struct path this; >> + ssize_t ret; >> + >> + if (!vi->sub) >> + return val_err(vi, -ENOENT); >> + >> + if (!val_verify_path(vi->sub)) >> + return val_err(vi, -EINVAL); >> + >> + ret = vfs_path_lookup(path->dentry, path->mnt, vi->sub, >> + LOOKUP_NO_XDEV | LOOKUP_BENEATH | >> + LOOKUP_IN_ROOT, &this); >> + if (ret) >> + return val_err(vi, ret); >> + >> + ret = -ENODATA; >> + if (d_is_reg(this.dentry) || d_is_symlink(this.dentry)) { >> + if (d_is_reg(this.dentry)) >> + ret = val_do_read(vi, &this); >> + else >> + ret = val_do_readlink(vi, &this); >> + } >> + path_put(&this); >> + if (ret == -EFAULT) >> + return ret; >> + if (ret < 0) >> + return val_err(vi, ret); >> + if (ret == vi->vec.iov_len) >> + return -EOVERFLOW; >> + >> + return val_end(vi, ret); >> +} >> + >> +static int val_xattr_get(struct val_iter *vi, const struct path *path) >> +{ >> + ssize_t ret; >> + struct user_namespace *mnt_userns = mnt_user_ns(path->mnt); >> + void *value = vi->seq.buf + vi->seq.count; >> + size_t size = min_t(size_t, vi->seq.size - vi->seq.count, >> + XATTR_SIZE_MAX); >> + >> + if (!vi->sub) >> + return val_err(vi, -ENOENT); >> + >> + ret = vfs_getxattr(mnt_userns, path->dentry, vi->sub, value, size); >> + if (ret < 0) >> + return val_err(vi, ret); >> + >> + if ((strcmp(vi->sub, XATTR_NAME_POSIX_ACL_ACCESS) == 0) || >> + (strcmp(vi->sub, XATTR_NAME_POSIX_ACL_DEFAULT) == 0)) >> + posix_acl_fix_xattr_to_user(mnt_userns, value, ret); >> + >> + vi->seq.count += ret; >> + >> + return val_end_seq(vi, 0); >> +} >> + >> + >> +static 
struct val_desc val_toplevel_group[] = { >> + { .name = "mnt", .get = val_mnt_get, }, >> + { .name = "mntns", .get = val_mntns_get, }, >> + { .name = "xattr", .get = val_xattr_get, }, >> + { .name = "data", .get = val_data_get, }, >> + { .name = NULL }, >> +}; >> + >> +SYSCALL_DEFINE5(getvalues, >> + int, dfd, >> + const char __user *, pathname, >> + struct name_val __user *, vec, >> + size_t, num, >> + unsigned int, flags) >> +{ >> + char vals[1024]; >> + struct val_iter vi = { >> + .curr = vec, >> + .num = num, >> + .seq.buf = vals, >> + .bufsize = sizeof(vals), >> + .prefix = "", >> + }; >> + struct val_desc *vd; >> + struct path path = {}; >> + ssize_t err; >> + >> + err = user_path_at(dfd, pathname, 0, &path); >> + if (err) >> + return err; >> + >> + err = val_get(&vi); >> + if (err) >> + goto out; >> + >> + if (!strlen(vi.name)) { >> + err = val_get_group(&vi, val_toplevel_group); >> + goto out; >> + } >> + while (!err && vi.num) { >> + vd = val_lookup(&vi, val_toplevel_group); >> + if (!vd->name) >> + err = val_err(&vi, -ENOENT); >> + else >> + err = vd->get(&vi, &path); >> + } >> +out: >> + if (err == -EOVERFLOW) >> + err = 0; >> + >> + path_put(&path); >> + return err < 0 ? err : num - vi.num; >> +}
On Tue, Mar 22, 2022 at 01:36:26PM -0700, Casey Schaufler wrote: > On 3/22/2022 12:27 PM, Miklos Szeredi wrote: > > Add a new userspace API that allows getting multiple short values in a > > single syscall. > > > > This would be useful for the following reasons: > > > > - Calling open/read/close for many small files is inefficient. E.g. on my > > desktop invoking lsof(1) results in ~60k open + read + close calls under > > /proc and 90% of those are 128 bytes or less. > > You don't need the generality below to address this issue. > > int openandread(const char *path, char *buffer, size_t size); > > would address this case swimmingly. Or you can use my readfile(2) proposal: https://lore.kernel.org/r/20200704140250.423345-1-gregkh@linuxfoundation.org But you had better actually benchmark the thing. Turns out that I could not find a real-world use that shows improvements in anything. Miklos, what userspace tool will use this new syscall and how will it be faster than readfile() was? I should rebase that against 5.17 again and see if anything is different due to the new spectre-bhb slowdowns. thanks, greg k-h
On Tue, Mar 22, 2022 at 08:27:12PM +0100, Miklos Szeredi wrote: > Add a new userspace API that allows getting multiple short values in a > single syscall. > > This would be useful for the following reasons: > > - Calling open/read/close for many small files is inefficient. E.g. on my > desktop invoking lsof(1) results in ~60k open + read + close calls under > /proc and 90% of those are 128 bytes or less. As I found out in testing readfile(): https://lore.kernel.org/r/20200704140250.423345-1-gregkh@linuxfoundation.org microbenchmarks do show a tiny improvement in doing something like this, but that's not a real-world application. Do you have anything real that can use this that shows a speedup? thanks, greg k-h
On 3/23/22 08:16, Greg KH wrote:
> On Tue, Mar 22, 2022 at 08:27:12PM +0100, Miklos Szeredi wrote:
>> Add a new userspace API that allows getting multiple short values in a
>> single syscall.
>>
>> This would be useful for the following reasons:
>>
>> - Calling open/read/close for many small files is inefficient. E.g. on my
>> desktop invoking lsof(1) results in ~60k open + read + close calls under
>> /proc and 90% of those are 128 bytes or less.
>
> As I found out in testing readfile():
> https://lore.kernel.org/r/20200704140250.423345-1-gregkh@linuxfoundation.org
>
> microbenchmarks do show a tiny improvement in doing something like this,
> but that's not a real-world application.
>
> Do you have anything real that can use this that shows a speedup?

Add in network file systems. Demonstrating that this is useful locally and
with micro benchmarks - yeah, it helps a bit to make things locally faster.
But the real case is when thousands of clients are handled by a few network
servers. Even reducing wire latency for a single client would make a
difference here.

There is a bit of a chicken-and-egg problem - it is a bit of work to add
this to file systems like NFS (or others that are not in the kernel), but
that work won't happen before the syscall exists. To demonstrate it on NFS
one also needs an official protocol change first. And then applications
also need to support the new syscall.

I had a hard time explaining to a weather physicist back in 2009 that it is
not a good idea to have millions of 512B files on Lustre. With recent AI
workloads this gets even worse.

This is the same issue in fact with the fuse patches we are creating
(https://lwn.net/Articles/888877/). Miklos asked for benchmark numbers - we
can only demonstrate slight effects locally, but our goal is in fact to
reduce network latencies and server load.

- Bernd
On Tue, Mar 22, 2022 at 08:27:12PM +0100, Miklos Szeredi wrote:
> Add a new userspace API that allows getting multiple short values in a
> single syscall.
>
> This would be useful for the following reasons:
>
> - Calling open/read/close for many small files is inefficient. E.g. on my
> desktop invoking lsof(1) results in ~60k open + read + close calls under
> /proc and 90% of those are 128 bytes or less.
>
> - Interfaces for getting various attributes and statistics are fragmented.
> For files we have basic stat, statx, extended attributes, file attributes
> (for which there are two overlapping ioctl interfaces). For mounts and
> superblocks we have stat*fs as well as /proc/$PID/{mountinfo,mountstats}.
> The latter also has the problem on not allowing queries on a specific
> mount.
>
> - Some attributes are cheap to generate, some are expensive. Allowing
> userspace to select which ones it needs should allow optimizing queries.
>
> - Adding an ascii namespace should allow easy extension and self
> description.
>
> - The values can be text or binary, whichever is fits best.
>
> The interface definition is:
>
> struct name_val {
> const char *name; /* in */
> struct iovec value_in; /* in */
> struct iovec value_out; /* out */
> uint32_t error; /* out */
> uint32_t reserved;
> };

Yes, we really need a way to query for various fs information. I'm a bit
torn about the details of this interface though. I would really like it if
we had interfaces that are easy to use from userspace, comparable to statx
for example. I know having this as generic as possible was the goal, but
I'm just a bit uneasy with such interfaces; they become cumbersome to use
in userspace. I'm not sure the data: part, for example, should be in this
at all. That seems a bit out of place to me. Would it really be that bad if
we added multiple syscalls for different types of info?
For example, querying mount information could reasonably be a more
focussed, separate system call allowing userspace to retrieve detailed
mount propagation info, flags, idmappings and so on. Prior approaches that
tried to solve this in a completely generic way have not gotten us very far
either, so I'm a bit worried about this aspect too.

As for performance concerns, once those multiple system calls are in place
they will naturally be made available in io_uring, so if people really do
have performance issues they can rely on io_uring (e.g., to offload system
call cost).

>
> int getvalues(int dfd, const char *path, struct name_val *vec, size_t num,
> unsigned int flags);
>
> @dfd and @path are used to lookup object $ORIGIN. @vec contains @num
> name/value descriptors. @flags contains lookup flags for @path.
>
> The syscall returns the number of values filled or an error.
>
> A single name/value descriptor has the following fields:
>
> @name describes the object whose value is to be returned. E.g.
>
> mnt - list of mount parameters
> mnt:mountpoint - the mountpoint of the mount of $ORIGIN
> mntns - list of mount ID's reachable from the current root
> mntns:21:parentid - parent ID of the mount with ID of 21
> xattr:security.selinux - the security.selinux extended attribute
> data:foo/bar - the data contained in file $ORIGIN/foo/bar
>
> If the name starts with the separator, then it is taken to have the same
> prefix as the previous name/value descriptor. E.g. in the following
> sequence of names the second one is equivalent to mnt:parentid:
>
> mnt:mountpoint
> :parentid
>
> @value_in supplies the buffer to store value(s) in. If a subsequent
> name/value descriptor has NULL value of value_in.iov_base, then the buffer
> from the previous name/value descriptor will be used. This way it's
> possible to use a shared buffer for multiple values.
> > The starting address and length of the actual value will be stored in > @value_out, unless an error has occurred in which case @error will be set to > the positive errno value. > > Multiple names starting with the same prefix (including the shorthand form) > may also be batched together under the same lock, so the order of the names > can determine atomicity. > > Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> > --- > arch/x86/entry/syscalls/syscall_64.tbl | 1 + > fs/Makefile | 2 +- > fs/mount.h | 8 + > fs/namespace.c | 42 ++ > fs/proc_namespace.c | 2 +- > fs/values.c | 524 +++++++++++++++++++++++++ > 6 files changed, 577 insertions(+), 2 deletions(-) > create mode 100644 fs/values.c > > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl > index c84d12608cd2..c72668001b39 100644 > --- a/arch/x86/entry/syscalls/syscall_64.tbl > +++ b/arch/x86/entry/syscalls/syscall_64.tbl > @@ -372,6 +372,7 @@ > 448 common process_mrelease sys_process_mrelease > 449 common futex_waitv sys_futex_waitv > 450 common set_mempolicy_home_node sys_set_mempolicy_home_node > +451 common getvalues sys_getvalues > > # > # Due to a historical design error, certain syscalls are numbered differently > diff --git a/fs/Makefile b/fs/Makefile > index 208a74e0b00e..f00d6bcd1178 100644 > --- a/fs/Makefile > +++ b/fs/Makefile > @@ -16,7 +16,7 @@ obj-y := open.o read_write.o file_table.o super.o \ > pnode.o splice.o sync.o utimes.o d_path.o \ > stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \ > fs_types.o fs_context.o fs_parser.o fsopen.o init.o \ > - kernel_read_file.o remap_range.o > + kernel_read_file.o remap_range.o values.o > > ifeq ($(CONFIG_BLOCK),y) > obj-y += buffer.o direct-io.o mpage.o > diff --git a/fs/mount.h b/fs/mount.h > index 0b6e08cf8afb..a3ca5233e481 100644 > --- a/fs/mount.h > +++ b/fs/mount.h > @@ -148,3 +148,11 @@ static inline bool is_anon_ns(struct mnt_namespace *ns) > } > > extern void mnt_cursor_del(struct mnt_namespace *ns, 
struct mount *cursor); > + > +extern void namespace_lock_read(void); > +extern void namespace_unlock_read(void); > +extern void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt); > +extern void seq_mnt_list(struct seq_file *seq, struct mnt_namespace *ns, > + struct path *root); > +extern struct vfsmount *mnt_lookup_by_id(struct mnt_namespace *ns, > + struct path *root, int id); > diff --git a/fs/namespace.c b/fs/namespace.c > index de6fae84f1a1..52b15c17251f 100644 > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -1405,6 +1405,38 @@ void mnt_cursor_del(struct mnt_namespace *ns, struct mount *cursor) > } > #endif /* CONFIG_PROC_FS */ > > +void seq_mnt_list(struct seq_file *seq, struct mnt_namespace *ns, > + struct path *root) > +{ > + struct mount *m; > + > + down_read(&namespace_sem); > + for (m = mnt_list_next(ns, &ns->list); m; m = mnt_list_next(ns, &m->mnt_list)) { > + if (is_path_reachable(m, m->mnt.mnt_root, root)) { > + seq_printf(seq, "%i", m->mnt_id); > + seq_putc(seq, '\0'); > + } > + } > + up_read(&namespace_sem); > +} > + > +/* called with namespace_sem held for read */ > +struct vfsmount *mnt_lookup_by_id(struct mnt_namespace *ns, struct path *root, > + int id) > +{ > + struct mount *m; > + > + for (m = mnt_list_next(ns, &ns->list); m; m = mnt_list_next(ns, &m->mnt_list)) { > + if (m->mnt_id == id) { > + if (is_path_reachable(m, m->mnt.mnt_root, root)) > + return mntget(&m->mnt); > + else > + return NULL; > + } > + } > + return NULL; > +} > + > /** > * may_umount_tree - check if a mount tree is busy > * @m: root of mount tree > @@ -1494,6 +1526,16 @@ static inline void namespace_lock(void) > down_write(&namespace_sem); > } > > +void namespace_lock_read(void) > +{ > + down_read(&namespace_sem); > +} > + > +void namespace_unlock_read(void) > +{ > + up_read(&namespace_sem); > +} > + > enum umount_tree_flags { > UMOUNT_SYNC = 1, > UMOUNT_PROPAGATE = 2, > diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c > index 49650e54d2f8..fa6dc2c20578 100644 
> --- a/fs/proc_namespace.c > +++ b/fs/proc_namespace.c > @@ -61,7 +61,7 @@ static int show_sb_opts(struct seq_file *m, struct super_block *sb) > return security_sb_show_options(m, sb); > } > > -static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt) > +void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt) > { > static const struct proc_fs_opts mnt_opts[] = { > { MNT_NOSUID, ",nosuid" }, > diff --git a/fs/values.c b/fs/values.c > new file mode 100644 > index 000000000000..618fa9bf48a1 > --- /dev/null > +++ b/fs/values.c > @@ -0,0 +1,524 @@ > +#include <linux/syscalls.h> > +#include <linux/printk.h> > +#include <linux/namei.h> > +#include <linux/fs_struct.h> > +#include <linux/posix_acl_xattr.h> > +#include <linux/xattr.h> > +#include "pnode.h" > +#include "internal.h" > + > +#define VAL_GRSEP ':' > + > +struct name_val { > + const char __user *name; /* in */ > + struct iovec value_in; /* in */ > + struct iovec value_out; /* out */ > + __u32 error; /* out */ > + __u32 reserved; > +}; > + > +struct val_iter { > + struct name_val __user *curr; > + size_t num; > + struct iovec vec; > + char name[256]; > + size_t bufsize; > + struct seq_file seq; > + const char *prefix; > + const char *sub; > +}; > + > +struct val_desc { > + const char *name; > + union { > + int idx; > + int (*get)(struct val_iter *vi, const struct path *path); > + }; > +}; > + > +static int val_get(struct val_iter *vi) > +{ > + struct name_val nameval; > + long err; > + > + if (copy_from_user(&nameval, vi->curr, sizeof(nameval))) > + return -EFAULT; > + > + err = strncpy_from_user(vi->name, nameval.name, sizeof(vi->name)); > + if (err < 0) > + return err; > + if (err == sizeof(vi->name)) > + return -ERANGE; > + > + if (nameval.value_in.iov_base) > + vi->vec = nameval.value_in; > + > + vi->seq.size = min(vi->vec.iov_len, vi->bufsize); > + vi->seq.count = 0; > + > + return 0; > +} > + > +static int val_next(struct val_iter *vi) > +{ > + vi->curr++; > + vi->num--; > + > + return vi->num 
? val_get(vi) : 0; > +} > + > +static int val_end(struct val_iter *vi, size_t count) > +{ > + struct iovec iov = { > + .iov_base = vi->vec.iov_base, > + .iov_len = count, > + }; > + > + if (copy_to_user(&vi->curr->value_out, &iov, sizeof(iov))) > + return -EFAULT; > + > + vi->vec.iov_base += count; > + vi->vec.iov_len -= count; > + > + return val_next(vi); > +} > + > +static int val_err(struct val_iter *vi, int err) > +{ > + if (put_user(-err, &vi->curr->error)) > + return -EFAULT; > + > + return val_next(vi); > +} > + > +static int val_end_seq(struct val_iter *vi, int err) > +{ > + size_t count = vi->seq.count; > + > + if (err) > + return val_err(vi, err); > + > + if (count == vi->seq.size) > + return -EOVERFLOW; > + > + if (copy_to_user(vi->vec.iov_base, vi->seq.buf, count)) > + return -EFAULT; > + > + return val_end(vi, count); > +} > + > +static struct val_desc *val_lookup(struct val_iter *vi, struct val_desc *vd) > +{ > + const char *name = vi->name; > + const char *prefix = vi->prefix; > + size_t prefixlen = strlen(prefix); > + > + if (prefixlen) { > + /* > + * Name beggining with a group separator is a shorthand for > + * previously prefix. 
> + */ > + if (name[0] == VAL_GRSEP) { > + name++; > + } else { > + if (strncmp(name, prefix, prefixlen) != 0) > + return NULL; > + name += prefixlen; > + } > + } > + > + vi->sub = NULL; > + for (; vd->name; vd++) { > + if (strcmp(name, vd->name) == 0) > + break; > + else { > + size_t grlen = strlen(vd->name); > + > + if (strncmp(vd->name, name, grlen) == 0 && > + name[grlen] == VAL_GRSEP) { > + vi->sub = name + grlen + 1; > + break; > + } > + } > + } > + return vd; > +} > + > +static int val_get_group(struct val_iter *vi, struct val_desc *vd) > +{ > + for (; vd->name; vd++) > + seq_write(&vi->seq, vd->name, strlen(vd->name) + 1); > + > + return val_end_seq(vi, 0); > +} > + > +static bool val_push_prefix(struct val_iter *vi, const char **oldprefix) > +{ > + char *newprefix; > + > + newprefix = kmemdup_nul(vi->name, vi->sub - vi->name, GFP_KERNEL); > + if (newprefix) { > + *oldprefix = vi->prefix; > + vi->prefix = newprefix; > + } > + > + return newprefix; > +} > + > +static void val_pop_prefix(struct val_iter *vi, const char *oldprefix) > +{ > + kfree(vi->prefix); > + vi->prefix = oldprefix; > +} > + > +enum { > + VAL_MNT_ID, > + VAL_MNT_PARENTID, > + VAL_MNT_ROOT, > + VAL_MNT_MOUNTPOINT, > + VAL_MNT_OPTIONS, > + VAL_MNT_SHARED, > + VAL_MNT_MASTER, > + VAL_MNT_PROPAGATE_FROM, > + VAL_MNT_UNBINDABLE, > + VAL_MNT_NOTFOUND, > +}; > + > +static struct val_desc val_mnt_group[] = { > + { .name = "id", .idx = VAL_MNT_ID }, > + { .name = "parentid", .idx = VAL_MNT_PARENTID, }, > + { .name = "root", .idx = VAL_MNT_ROOT, }, > + { .name = "mountpoint", .idx = VAL_MNT_MOUNTPOINT, }, > + { .name = "options", .idx = VAL_MNT_OPTIONS, }, > + { .name = "shared", .idx = VAL_MNT_SHARED, }, > + { .name = "master", .idx = VAL_MNT_MASTER, }, > + { .name = "propagate_from", .idx = VAL_MNT_PROPAGATE_FROM, }, > + { .name = "unbindable", .idx = VAL_MNT_UNBINDABLE, }, > + { .name = NULL, .idx = VAL_MNT_NOTFOUND }, > +}; > + > +static int seq_mnt_root(struct seq_file *seq, struct vfsmount 
*mnt) > +{ > + struct super_block *sb = mnt->mnt_sb; > + int err = 0; > + > + if (sb->s_op->show_path) { > + err = sb->s_op->show_path(seq, mnt->mnt_root); > + if (!err) { > + seq_putc(seq, '\0'); > + if (seq->count < seq->size) > + seq->count = string_unescape(seq->buf, seq->buf, seq->size, UNESCAPE_OCTAL); > + } > + } else { > + seq_dentry(seq, mnt->mnt_root, ""); > + } > + > + return err; > +} > + > +static int val_mnt_show(struct val_iter *vi, struct vfsmount *mnt) > +{ > + struct mount *m = real_mount(mnt); > + struct path root, mnt_path; > + struct val_desc *vd; > + const char *oldprefix; > + int err = 0; > + > + if (!val_push_prefix(vi, &oldprefix)) > + return -ENOMEM; > + > + while (!err && vi->num) { > + vd = val_lookup(vi, val_mnt_group); > + if (!vd) > + break; > + > + switch(vd->idx) { > + case VAL_MNT_ID: > + seq_printf(&vi->seq, "%i", m->mnt_id); > + break; > + case VAL_MNT_PARENTID: > + seq_printf(&vi->seq, "%i", m->mnt_parent->mnt_id); > + break; > + case VAL_MNT_ROOT: > + seq_mnt_root(&vi->seq, mnt); > + break; > + case VAL_MNT_MOUNTPOINT: > + get_fs_root(current->fs, &root); > + mnt_path.dentry = mnt->mnt_root; > + mnt_path.mnt = mnt; > + err = seq_path_root(&vi->seq, &mnt_path, &root, ""); > + path_put(&root); > + break; > + case VAL_MNT_OPTIONS: > + seq_puts(&vi->seq, mnt->mnt_flags & MNT_READONLY ? 
"ro" : "rw"); > + show_mnt_opts(&vi->seq, mnt); > + break; > + case VAL_MNT_SHARED: > + if (IS_MNT_SHARED(m)) > + seq_printf(&vi->seq, "%i,", m->mnt_group_id); > + break; > + case VAL_MNT_MASTER: > + if (IS_MNT_SLAVE(m)) > + seq_printf(&vi->seq, "%i,", > + m->mnt_master->mnt_group_id); > + break; > + case VAL_MNT_PROPAGATE_FROM: > + if (IS_MNT_SLAVE(m)) { > + int dom; > + > + get_fs_root(current->fs, &root); > + dom = get_dominating_id(m, &root); > + path_put(&root); > + if (dom) > + seq_printf(&vi->seq, "%i,", dom); > + } > + break; > + case VAL_MNT_UNBINDABLE: > + if (IS_MNT_UNBINDABLE(m)) > + seq_puts(&vi->seq, "yes"); > + break; > + default: > + err = -ENOENT; > + break; > + } > + err = val_end_seq(vi, err); > + } > + val_pop_prefix(vi, oldprefix); > + > + return err; > +} > + > +static int val_mnt_get(struct val_iter *vi, const struct path *path) > +{ > + int err; > + > + if (!vi->sub) > + return val_get_group(vi, val_mnt_group); > + > + namespace_lock_read(); > + err = val_mnt_show(vi, path->mnt); > + namespace_unlock_read(); > + > + return err; > +} > + > +static int val_mntns_get(struct val_iter *vi, const struct path *path) > +{ > + struct mnt_namespace *mnt_ns = current->nsproxy->mnt_ns; > + struct vfsmount *mnt; > + struct path root; > + char *end; > + int mnt_id; > + int err; > + > + if (!vi->sub) { > + get_fs_root(current->fs, &root); > + seq_mnt_list(&vi->seq, mnt_ns, &root); > + path_put(&root); > + return val_end_seq(vi, 0); > + } > + > + end = strchr(vi->sub, VAL_GRSEP); > + if (end) > + *end = '\0'; > + err = kstrtoint(vi->sub, 10, &mnt_id); > + if (err) > + return val_err(vi, err); > + vi->sub = NULL; > + if (end) { > + *end = VAL_GRSEP; > + vi->sub = end + 1; > + } > + > + namespace_lock_read(); > + get_fs_root(current->fs, &root); > + mnt = mnt_lookup_by_id(mnt_ns, &root, mnt_id); > + path_put(&root); > + if (!mnt) { > + namespace_unlock_read(); > + return val_err(vi, -ENOENT); > + } > + if (vi->sub) > + err = val_mnt_show(vi, mnt); > + else > 
> +	err = val_get_group(vi, val_mnt_group);
> +
> +	namespace_unlock_read();
> +	mntput(mnt);
> +
> +	return err;
> +}
> +
> +static ssize_t val_do_read(struct val_iter *vi, struct path *path)
> +{
> +	ssize_t ret;
> +	struct file *file;
> +	struct open_flags op = {
> +		.open_flag = O_RDONLY,
> +		.acc_mode = MAY_READ,
> +		.intent = LOOKUP_OPEN,
> +	};
> +
> +	file = do_file_open_root(path, "", &op);
> +	if (IS_ERR(file))
> +		return PTR_ERR(file);
> +
> +	ret = vfs_read(file, vi->vec.iov_base, vi->vec.iov_len, NULL);
> +	fput(file);
> +
> +	return ret;
> +}
> +
> +static ssize_t val_do_readlink(struct val_iter *vi, struct path *path)
> +{
> +	int ret;
> +
> +	ret = security_inode_readlink(path->dentry);
> +	if (ret)
> +		return ret;
> +
> +	return vfs_readlink(path->dentry, vi->vec.iov_base, vi->vec.iov_len);
> +}
> +
> +static inline bool dot_or_dotdot(const char *s)
> +{
> +	return s[0] == '.' &&
> +		(s[1] == '/' || s[1] == '\0' ||
> +		 (s[1] == '.' && (s[2] == '/' || s[2] == '\0')));
> +}
> +
> +/*
> + * - empty path is okay
> + * - must not begin or end with slash or have a double slash anywhere
> + * - must not have . or .. components
> + */
> +static bool val_verify_path(const char *subpath)
> +{
> +	const char *s = subpath;
> +
> +	if (s[0] == '\0')
> +		return true;
> +
> +	if (s[0] == '/' || s[strlen(s) - 1] == '/' || strstr(s, "//"))
> +		return false;
> +
> +	for (s--; s; s = strstr(s + 3, "/."))
> +		if (dot_or_dotdot(s + 1))
> +			return false;
> +
> +	return true;
> +}
> +
> +static int val_data_get(struct val_iter *vi, const struct path *path)
> +{
> +	struct path this;
> +	ssize_t ret;
> +
> +	if (!vi->sub)
> +		return val_err(vi, -ENOENT);
> +
> +	if (!val_verify_path(vi->sub))
> +		return val_err(vi, -EINVAL);
> +
> +	ret = vfs_path_lookup(path->dentry, path->mnt, vi->sub,
> +			      LOOKUP_NO_XDEV | LOOKUP_BENEATH |
> +			      LOOKUP_IN_ROOT, &this);
> +	if (ret)
> +		return val_err(vi, ret);
> +
> +	ret = -ENODATA;
> +	if (d_is_reg(this.dentry) || d_is_symlink(this.dentry)) {
> +		if (d_is_reg(this.dentry))
> +			ret = val_do_read(vi, &this);
> +		else
> +			ret = val_do_readlink(vi, &this);
> +	}
> +	path_put(&this);
> +	if (ret == -EFAULT)
> +		return ret;
> +	if (ret < 0)
> +		return val_err(vi, ret);
> +	if (ret == vi->vec.iov_len)
> +		return -EOVERFLOW;
> +
> +	return val_end(vi, ret);
> +}
> +
> +static int val_xattr_get(struct val_iter *vi, const struct path *path)
> +{
> +	ssize_t ret;
> +	struct user_namespace *mnt_userns = mnt_user_ns(path->mnt);
> +	void *value = vi->seq.buf + vi->seq.count;
> +	size_t size = min_t(size_t, vi->seq.size - vi->seq.count,
> +			    XATTR_SIZE_MAX);
> +
> +	if (!vi->sub)
> +		return val_err(vi, -ENOENT);
> +
> +	ret = vfs_getxattr(mnt_userns, path->dentry, vi->sub, value, size);
> +	if (ret < 0)
> +		return val_err(vi, ret);
> +
> +	if ((strcmp(vi->sub, XATTR_NAME_POSIX_ACL_ACCESS) == 0) ||
> +	    (strcmp(vi->sub, XATTR_NAME_POSIX_ACL_DEFAULT) == 0))
> +		posix_acl_fix_xattr_to_user(mnt_userns, value, ret);
> +
> +	vi->seq.count += ret;
> +
> +	return val_end_seq(vi, 0);
> +}
> +
> +static struct val_desc val_toplevel_group[] = {
> +	{ .name = "mnt",	.get = val_mnt_get,	},
> +	{ .name = "mntns",	.get = val_mntns_get,	},
> +	{ .name = "xattr",	.get = val_xattr_get,	},
> +	{ .name = "data",	.get = val_data_get,	},
> +	{ .name = NULL },
> +};
> +
> +SYSCALL_DEFINE5(getvalues,
> +		int, dfd,
> +		const char __user *, pathname,
> +		struct name_val __user *, vec,
> +		size_t, num,
> +		unsigned int, flags)
> +{
> +	char vals[1024];
> +	struct val_iter vi = {
> +		.curr = vec,
> +		.num = num,
> +		.seq.buf = vals,
> +		.bufsize = sizeof(vals),
> +		.prefix = "",
> +	};
> +	struct val_desc *vd;
> +	struct path path = {};
> +	ssize_t err;
> +
> +	err = user_path_at(dfd, pathname, 0, &path);
> +	if (err)
> +		return err;
> +
> +	err = val_get(&vi);
> +	if (err)
> +		goto out;
> +
> +	if (!strlen(vi.name)) {
> +		err = val_get_group(&vi, val_toplevel_group);
> +		goto out;
> +	}
> +	while (!err && vi.num) {
> +		vd = val_lookup(&vi, val_toplevel_group);
> +		if (!vd->name)
> +			err = val_err(&vi, -ENOENT);
> +		else
> +			err = vd->get(&vi, &path);
> +	}
> +out:
> +	if (err == -EOVERFLOW)
> +		err = 0;
> +
> +	path_put(&path);
> +	return err < 0 ? err : num - vi.num;
> +}
> --
> 2.35.1
>
On Wed, Mar 23, 2022 at 11:26:11AM +0100, Bernd Schubert wrote: > On 3/23/22 08:16, Greg KH wrote: > > On Tue, Mar 22, 2022 at 08:27:12PM +0100, Miklos Szeredi wrote: > > > Add a new userspace API that allows getting multiple short values in a > > > single syscall. > > > > > > This would be useful for the following reasons: > > > > > > - Calling open/read/close for many small files is inefficient. E.g. on my > > > desktop invoking lsof(1) results in ~60k open + read + close calls under > > > /proc and 90% of those are 128 bytes or less. > > > > As I found out in testing readfile(): > > https://lore.kernel.org/r/20200704140250.423345-1-gregkh@linuxfoundation.org > > > > microbenchmarks do show a tiny improvement in doing something like this, > > but that's not a real-world application. > > > > Do you have anything real that can use this that shows a speedup? > > Add in network file systems. Demonstrating that this is useful locally and > with micro benchmarks - yeah, helps a bit to make it locally faster. But the > real case is when thousands of clients are handled by a few network servers. > Even reducing wire latency for a single client would make a difference here. I think I tried running readfile on NFS. Didn't see any improvements. But please, try it again. Also note that this proposal isn't for NFS, or any other "real" filesystem :) > There is a bit of chicken-egg problem - it is a bit of work to add to file > systems like NFS (or others that are not the kernel), but the work won't be > made there before there is no syscall for it. To demonstrate it on NFS one > also needs a an official protocol change first. And then applications also > need to support that new syscall first. > I had a hard time explaining weather physicist back in 2009 that it is not a > good idea to have millions of 512B files on Lustre. With recent AI workload > this gets even worse. Can you try using the readfile() patch to see if that helps you all out on Lustre? 
If so, that's a good reason to consider it. But again, has nothing to do with this getvalues(2) api. thanks, greg k-h
On 3/23/22 12:42, Greg KH wrote: > On Wed, Mar 23, 2022 at 11:26:11AM +0100, Bernd Schubert wrote: >> On 3/23/22 08:16, Greg KH wrote: >>> On Tue, Mar 22, 2022 at 08:27:12PM +0100, Miklos Szeredi wrote: >>>> Add a new userspace API that allows getting multiple short values in a >>>> single syscall. >>>> >>>> This would be useful for the following reasons: >>>> >>>> - Calling open/read/close for many small files is inefficient. E.g. on my >>>> desktop invoking lsof(1) results in ~60k open + read + close calls under >>>> /proc and 90% of those are 128 bytes or less. >>> >>> As I found out in testing readfile(): >>> https://lore.kernel.org/r/20200704140250.423345-1-gregkh@linuxfoundation.org >>> >>> microbenchmarks do show a tiny improvement in doing something like this, >>> but that's not a real-world application. >>> >>> Do you have anything real that can use this that shows a speedup? >> >> Add in network file systems. Demonstrating that this is useful locally and >> with micro benchmarks - yeah, helps a bit to make it locally faster. But the >> real case is when thousands of clients are handled by a few network servers. >> Even reducing wire latency for a single client would make a difference here. > > I think I tried running readfile on NFS. Didn't see any improvements. > But please, try it again. Also note that this proposal isn't for NFS, > or any other "real" filesystem :) How did you run it on NFS? To get real benefit you would need to add a READ_FILE rpc to the NFS protocol and code? Just having it locally won't avoid the expensive wire calls? > >> There is a bit of chicken-egg problem - it is a bit of work to add to file >> systems like NFS (or others that are not the kernel), but the work won't be >> made there before there is no syscall for it. To demonstrate it on NFS one >> also needs a an official protocol change first. And then applications also >> need to support that new syscall first. 
>> I had a hard time explaining weather physicist back in 2009 that it is not a
>> good idea to have millions of 512B files on Lustre. With recent AI workload
>> this gets even worse.
>
> Can you try using the readfile() patch to see if that helps you all out
> on Lustre? If so, that's a good reason to consider it. But again, has
> nothing to do with this getvalues(2) api.

I don't have a Lustre system to easily play with (I'm working on another
network file system). But unless Lustre would implement aggressive
prefetch of data on stat, I don't see how either approach would work
without a protocol addition. For Lustre it probably would be helpful only
when small data are inlined into the inode.

In the end this is exactly the chicken-and-egg problem - Lustre (or
anything else) won't implement it before the kernel supports it. But then
the new syscall won't be added before it is proven that it helps.

- Bernd
On Wed, Mar 23, 2022 at 01:06:33PM +0100, Bernd Schubert wrote: > > > On 3/23/22 12:42, Greg KH wrote: > > On Wed, Mar 23, 2022 at 11:26:11AM +0100, Bernd Schubert wrote: > > > On 3/23/22 08:16, Greg KH wrote: > > > > On Tue, Mar 22, 2022 at 08:27:12PM +0100, Miklos Szeredi wrote: > > > > > Add a new userspace API that allows getting multiple short values in a > > > > > single syscall. > > > > > > > > > > This would be useful for the following reasons: > > > > > > > > > > - Calling open/read/close for many small files is inefficient. E.g. on my > > > > > desktop invoking lsof(1) results in ~60k open + read + close calls under > > > > > /proc and 90% of those are 128 bytes or less. > > > > > > > > As I found out in testing readfile(): > > > > https://lore.kernel.org/r/20200704140250.423345-1-gregkh@linuxfoundation.org > > > > > > > > microbenchmarks do show a tiny improvement in doing something like this, > > > > but that's not a real-world application. > > > > > > > > Do you have anything real that can use this that shows a speedup? > > > > > > Add in network file systems. Demonstrating that this is useful locally and > > > with micro benchmarks - yeah, helps a bit to make it locally faster. But the > > > real case is when thousands of clients are handled by a few network servers. > > > Even reducing wire latency for a single client would make a difference here. > > > > I think I tried running readfile on NFS. Didn't see any improvements. > > But please, try it again. Also note that this proposal isn't for NFS, > > or any other "real" filesystem :) > > How did you run it on NFS? To get real benefit you would need to add a > READ_FILE rpc to the NFS protocol and code? Just having it locally won't > avoid the expensive wire calls? I did not touch anything related to NFS code, which is perhaps why I did not notice any difference :) thanks, greg k-h
On Wed, 23 Mar 2022 at 12:43, Christian Brauner <brauner@kernel.org> wrote:

> Yes, we really need a way to query for various fs information. I'm a bit
> torn about the details of this interface though. I would really like if
> we had interfaces that are really easy to use from userspace comparable
> to statx for example.

The reason I started thinking about this is that Amir wanted a per-sb
iostat interface and dumped it into /proc/PID/mountstats. And that is
definitely not the right way to go about this.

So we could add a statfsx() and start filling in new stuff, and that's
what Linus suggested. But then we might need to add stuff that is not
representable in a flat structure (like for example the stuff that
nfs_show_stats does) and that again needs new infrastructure.

Another example is task info in /proc. Utilities are doing a crazy
number of syscalls to get trivial information. Why don't we have a
procx(2) syscall? I guess because lots of that is difficult to
represent in a flat structure. Just take the lsof example: it's doing
hundreds of thousands of syscalls on a desktop computer with just a
few hundred processes.

So I'm trying to look beyond fsinfo and about how we could better
retrieve attributes, statistics, small bits and pieces within a
unified framework.

The ease of use argument does not really come into the picture here,
because (unlike stat and friends) most of this info is specialized and
will be either consumed by libraries, specialized utilities
(util-linux, procps) or with a generic utility application that can
query any information about anything that is exported through such an
interface. That applies to plain stat(2) as well: most users will
not switch to statx() simply because that's too generic. And that's
fine, for info as common as struct stat a syscall is warranted.
> I know having this generic as possible was the > goal but I'm just a bit uneasy with such interfaces. They become > cumbersome to use in userspace. I'm not sure if the data: part for > example should be in this at all. That seems a bit out of place to me. Good point, reduction of scope may help. > Would it be really that bad if we added multiple syscalls for different > types of info? For example, querying mount information could reasonably > be a more focussed separate system call allowing to retrieve detailed > mount propagation info, flags, idmappings and so on. Prior approaches to > solve this in a completely generic way have gotten us not very far too > so I'm a bit worried about this aspect too. And I fear that this will just result in more and more ad-hoc interfaces being added, because a new feature didn't quite fit the old API. You can see the history of this happening all over the place with multiple new syscall versions being added as the old one turns out to be not generic enough. I think a new interface needs to - be uniform (a single utility can be used to retrieve various attributes and statistics, contrast this with e.g. stat(1), getfattr(1), lsattr(1) not to mention various fs specific tools). - have a hierarchical namespace (the unix path lookup is an example of this that stood the test of time) - allow retrieving arbitrary text or binary data And whatever form it takes, I'm sure it will be easier to use than the mess we currently have in various interfaces like the mount or process stats. Thanks, Miklos
On Wed, Mar 23, 2022 at 02:24:40PM +0100, Miklos Szeredi wrote: > On Wed, 23 Mar 2022 at 12:43, Christian Brauner <brauner@kernel.org> wrote: > > > Yes, we really need a way to query for various fs information. I'm a bit > > torn about the details of this interface though. I would really like if > > we had interfaces that are really easy to use from userspace comparable > > to statx for example. > > The reason I stated thinking about this is that Amir wanted a per-sb > iostat interface and dumped it into /proc/PID/mountstats. And that is > definitely not the right way to go about this. > > So we could add a statfsx() and start filling in new stuff, and that's > what Linus suggested. But then we might need to add stuff that is not > representable in a flat structure (like for example the stuff that > nfs_show_stats does) and that again needs new infrastructure. > > Another example is task info in /proc. Utilities are doing a crazy > number of syscalls to get trivial information. Why don't we have a > procx(2) syscall? I guess because lots of that is difficult to > represent in a flat structure. Just take the lsof example: tt's doing > hundreds of thousands of syscalls on a desktop computer with just a > few hundred processes. > > So I'm trying to look beyond fsinfo and about how we could better > retrieve attributes, statistics, small bits and pieces within a > unified framework. > > The ease of use argument does not really come into the picture here, > because (unlike stat and friends) most of this info is specialized and > will be either consumed by libraries, specialized utilities > (util-linux, procos) or with a generic utility application that can > query any information about anything that is exported through such an > interface. That applies to plain stat(2) as well: most users will > not switch to statx() simply because that's too generic. And that's > fine, for info as common as struct stat a syscall is warranted. 
If
> the info is more specialized, then I think a truly generic interface
> is a much better choice.
>
> > I know having this generic as possible was the
> > goal but I'm just a bit uneasy with such interfaces. They become
> > cumbersome to use in userspace. I'm not sure if the data: part for
> > example should be in this at all. That seems a bit out of place to me.
>
> Good point, reduction of scope may help.
>
> > Would it be really that bad if we added multiple syscalls for different
> > types of info? For example, querying mount information could reasonably
> > be a more focussed separate system call allowing to retrieve detailed
> > mount propagation info, flags, idmappings and so on. Prior approaches to
> > solve this in a completely generic way have gotten us not very far too
> > so I'm a bit worried about this aspect too.
>
> And I fear that this will just result in more and more ad-hoc
> interfaces being added, because a new feature didn't quite fit the old
> API. You can see the history of this happening all over the place
> with multiple new syscall versions being added as the old one turns
> out to be not generic enough.
>
> I think a new interface needs to
>
> - be uniform (a single utility can be used to retrieve various
> attributes and statistics, contrast this with e.g. stat(1),
> getfattr(1), lsattr(1) not to mention various fs specific tools).
>
> - have a hierarchical namespace (the unix path lookup is an example
> of this that stood the test of time)
>
> - allow retrieving arbitrary text or binary data
>
> And whatever form it takes, I'm sure it will be easier to use than the
> mess we currently have in various interfaces like the mount or process
> stats.

This has been proposed in the past a few times. Most recently by the
KVM developers, who tried to create a "generic" api, but ended up just
making something to work for KVM as they got tired of people ignoring
their more intrusive patch sets.
See virt/kvm/binary_stats.c for what they ended up with, and perhaps you can just use that same type of interface here as well? thanks, greg k-h
On 3/23/2022 6:24 AM, Miklos Szeredi wrote: > On Wed, 23 Mar 2022 at 12:43, Christian Brauner <brauner@kernel.org> wrote: > >> Yes, we really need a way to query for various fs information. I'm a bit >> torn about the details of this interface though. I would really like if >> we had interfaces that are really easy to use from userspace comparable >> to statx for example. > The reason I stated thinking about this is that Amir wanted a per-sb > iostat interface and dumped it into /proc/PID/mountstats. And that is > definitely not the right way to go about this. > > So we could add a statfsx() and start filling in new stuff, and that's > what Linus suggested. But then we might need to add stuff that is not > representable in a flat structure (like for example the stuff that > nfs_show_stats does) and that again needs new infrastructure. > > Another example is task info in /proc. Utilities are doing a crazy > number of syscalls to get trivial information. Why don't we have a > procx(2) syscall? I guess because lots of that is difficult to > represent in a flat structure. Just take the lsof example: tt's doing > hundreds of thousands of syscalls on a desktop computer with just a > few hundred processes. > > So I'm trying to look beyond fsinfo and about how we could better > retrieve attributes, statistics, small bits and pieces within a > unified framework. > > The ease of use argument does not really come into the picture here, > because (unlike stat and friends) most of this info is specialized and > will be either consumed by libraries, specialized utilities > (util-linux, procos) or with a generic utility application that can > query any information about anything that is exported through such an > interface. That applies to plain stat(2) as well: most users will > not switch to statx() simply because that's too generic. And that's > fine, for info as common as struct stat a syscall is warranted. 
If
> the info is more specialized, then I think a truly generic interface
> is a much better choice.
>
>> I know having this generic as possible was the
>> goal but I'm just a bit uneasy with such interfaces. They become
>> cumbersome to use in userspace. I'm not sure if the data: part for
>> example should be in this at all. That seems a bit out of place to me.
> Good point, reduction of scope may help.
>
>> Would it be really that bad if we added multiple syscalls for different
>> types of info? For example, querying mount information could reasonably
>> be a more focussed separate system call allowing to retrieve detailed
>> mount propagation info, flags, idmappings and so on. Prior approaches to
>> solve this in a completely generic way have gotten us not very far too
>> so I'm a bit worried about this aspect too.
> And I fear that this will just result in more and more ad-hoc
> interfaces being added, because a new feature didn't quite fit the old
> API. You can see the history of this happening all over the place
> with multiple new syscall versions being added as the old one turns
> out to be not generic enough.
>
> I think a new interface needs to
>
> - be uniform (a single utility can be used to retrieve various
> attributes and statistics, contrast this with e.g. stat(1),
> getfattr(1), lsattr(1) not to mention various fs specific tools).
>
> - have a hierarchical namespace (the unix path lookup is an example
> of this that stood the test of time)
>
> - allow retrieving arbitrary text or binary data

You also need a way to get a list of what attributes are available
and/or a way to get all available attributes. Applications and especially
libraries shouldn't have to guess what information is relevant. If the
attributes change depending on the filesystem and/or LSM involved, and
they do, how can a general purpose library function know what data to
ask for?
> > And whatever form it takes, I'm sure it will be easier to use than the > mess we currently have in various interfaces like the mount or process > stats. > > Thanks, > Miklos
On Wed, 23 Mar 2022 at 14:51, Casey Schaufler <casey@schaufler-ca.com> wrote:

> You also need a way to get a list of what attributes are available
> and/or a way to get all available attributes. Applications and especially
> libraries shouldn't have to guess what information is relevant. If the
> attributes change depending on the filesystem and/or LSM involved, and
> they do, how can a general purpose library function know what data to
> ask for?

Oh, yes. Even the current prototype does that:

# ~/getvalues / ""
[] = "mnt" "mntns" "xattr" "data" (len=21)
# ~/getvalues / "mnt"
[mnt] = "id" "parentid" "root" "mountpoint" "options" "shared"
"master" "propagate_from" "unbindable" (len=76)
# ~/getvalues / "mntns"
[mntns] = "21" "22" "24" "25" "23" "26" "27" "28" "29" "30" "31" "32" (len=36)
# ~/getvalues / "mntns:21"
[mntns:21] = "id" "parentid" "root" "mountpoint" "options" "shared"
"master" "propagate_from" "unbindable" (len=76)

I didn't implement enumeration for "data" and "xattr" but that is
certainly possible and not even difficult to do.

Thanks,
Miklos
On Wed, 23 Mar 2022 at 14:38, Greg KH <gregkh@linuxfoundation.org> wrote: > This has been proposed in the past a few times. Most recently by the > KVM developers, which tried to create a "generic" api, but ended up just > making something to work for KVM as they got tired of people ignoring > their more intrusive patch sets. See virt/kvm/binary_stats.c for what > they ended up with, and perhaps you can just use that same type of > interface here as well? So this looks like a fixed set of statistics where each one has a descriptor (a name, size, offset, flags, ...) that tells about the piece of data to be exported. The stats are kept up to date in kernel memory and copied to userspace on read. The copy can be selective, since the read can specify the offset and size of data it would like to retrieve. The interface is self descriptive and selective, but its structure is fixed for a specific object type, there's no way this could be extended to look up things like extended attributes. Maybe that's not a problem, but the lack of a hierarchical namespace could turn out to be a major drawback. I think people underestimate the usefulness of hierarchical namespaces, even though we use them extensively in lots of well established interfaces. Thanks, Miklos
On Wed, Mar 23, 2022 at 11:26:11AM +0100, Bernd Schubert wrote: > Add in network file systems. Demonstrating that this is useful > locally and with micro benchmarks - yeah, helps a bit to make it > locally faster. But the real case is when thousands of clients are > handled by a few network servers. Even reducing wire latency for a > single client would make a difference here. > > There is a bit of chicken-egg problem - it is a bit of work to add > to file systems like NFS (or others that are not the kernel), but > the work won't be made there before there is no syscall for it. To > demonstrate it on NFS one also needs a an official protocol change > first. I wouldn't assume that. NFSv4 already supports compound rpc operations, so you can do OPEN+READ+CLOSE in a single round trip. The client's never done that, but there are pynfs tests that can testify to the fact that our server supports it. It's not something anyone's used much outside of artificial tests, so there may well turn out be issues, but the protocol's definitely sufficient to prototype this at least. I'm not volunteering, but it doesn't seem too difficult in theory if someone's interested. --b. > And then applications also need to support that new syscall > first. > I had a hard time explaining weather physicist back in 2009 that it > is not a good idea to have millions of 512B files on Lustre. With > recent AI workload this gets even worse. > > This is the same issue in fact with the fuse patches we are creating > (https://lwn.net/Articles/888877/). Miklos asked for benchmark > numbers - we can only demonstrate slight effects locally, but out > goal is in fact to reduce network latencies and server load. > > - Bernd
On Wed, Mar 23, 2022 at 02:24:40PM +0100, Miklos Szeredi wrote:
> The reason I stated thinking about this is that Amir wanted a per-sb
> iostat interface and dumped it into /proc/PID/mountstats. And that is
> definitely not the right way to go about this.
>
> So we could add a statfsx() and start filling in new stuff, and that's
> what Linus suggested. But then we might need to add stuff that is not
> representable in a flat structure (like for example the stuff that
> nfs_show_stats does) and that again needs new infrastructure.
>
> Another example is task info in /proc. Utilities are doing a crazy
> number of syscalls to get trivial information. Why don't we have a
> procx(2) syscall? I guess because lots of that is difficult to
> represent in a flat structure. Just take the lsof example: tt's doing
> hundreds of thousands of syscalls on a desktop computer with just a
> few hundred processes.

I'm still a bit puzzled about the reason for getvalues(2) beyond,
"reduce the number of system calls". Is this a performance argument?
If so, have you benchmarked lsof using this new interface?

I did a quickie run on my laptop, which currently had 444 processes.
"lsof /home/tytso > /tmp/foo" didn't take long:

% time lsof /home/tytso >& /tmp/foo
real	0m0.144s
user	0m0.039s
sys	0m0.087s

And an strace of that same lsof command indicated it had 67,889 lines.
So yeah, lots of system calls. But is this new system call really
going to speed up things by all that much?

If the argument is "make it easier to use", what's wrong with the
solution of creating userspace libraries which abstract away calls to
open/read/close a whole bunch of procfs files to make life easier for
application programmers?

In short, what problem is this new system call going to solve?
Each new system call, especially with all of the parsing that this one
is going to use, is going to be an additional attack surface, and an
additional new system call that we have to maintain --- and for the
first 7-10 years, userspace programs are going to have to use the
existing open/read/close interface since enterprise kernels stick
around for a L-O-N-G time, so any kind of ease-of-use argument isn't
really going to help application programs until RHEL 10 becomes
obsolete. (Unless you plan to backport this into RHEL 9 beta, but
still, waiting for RHEL 9 to become completely EOL is going to be... a
while.)

So whatever the benefits of this new interface are going to be, I
suggest we should be sure that it's really worth it.

Cheers,

- Ted
On 3/23/2022 7:00 AM, Miklos Szeredi wrote:
> On Wed, 23 Mar 2022 at 14:51, Casey Schaufler <casey@schaufler-ca.com> wrote:
>
>> You also need a way to get a list of what attributes are available
>> and/or a way to get all available attributes. Applications and especially
>> libraries shouldn't have to guess what information is relevant. If the
>> attributes change depending on the filesystem and/or LSM involved, and
>> they do, how can a general purpose library function know what data to
>> ask for?
> Oh, yes. Even the current prototype does that:
>
> # ~/getvalues / ""
> [] = "mnt" "mntns" "xattr" "data" (len=21)
> # ~/getvalues / "mnt"
> [mnt] = "id" "parentid" "root" "mountpoint" "options" "shared"
> "master" "propagate_from" "unbindable" (len=76)
> # ~/getvalues / "mntns"
> [mntns] = "21" "22" "24" "25" "23" "26" "27" "28" "29" "30" "31" "32" (len=36)
> ~/getvalues / "mntns:21"
> [mntns:21] = "id" "parentid" "root" "mountpoint" "options" "shared"
> "master" "propagate_from" "unbindable" (len=76)

That requires multiple calls and hierarchy tracking by the caller. Not to
mention that in this case the caller needs to understand how mount
namespaces are being used. I don't see that you've made anything cleaner.
You have discarded the type checking provided by the "classic" APIs.
Elsewhere in this thread the claims of improved performance have been
questioned, but I can't say boo about that. Is this interface targeted at
languages other than C for which the paradigm might provide (more?) value?

> I didn't implement enumeration for "data" and "xattr" but that is
> certainly possible and not even difficult to do.
>
> Thanks,
> Miklos
On Tue, Mar 22, 2022 at 08:27:12PM +0100, Miklos Szeredi wrote: > Add a new userspace API that allows getting multiple short values in a > single syscall. > > This would be useful for the following reasons: > > - Calling open/read/close for many small files is inefficient. E.g. on my > desktop invoking lsof(1) results in ~60k open + read + close calls under > /proc and 90% of those are 128 bytes or less. How does doing the open/read/close in a single syscall make this any more efficient? All it saves is the overhead of a couple of syscalls, it doesn't reduce any of the setup or teardown overhead needed to read the data itself.... > - Interfaces for getting various attributes and statistics are fragmented. > For files we have basic stat, statx, extended attributes, file attributes > (for which there are two overlapping ioctl interfaces). For mounts and > superblocks we have stat*fs as well as /proc/$PID/{mountinfo,mountstats}. > The latter also has the problem on not allowing queries on a specific > mount. https://xkcd.com/927/ > - Some attributes are cheap to generate, some are expensive. Allowing > userspace to select which ones it needs should allow optimizing queries. > > - Adding an ascii namespace should allow easy extension and self > description. > > - The values can be text or binary, whichever is fits best. > > The interface definition is: > > struct name_val { > const char *name; /* in */ > struct iovec value_in; /* in */ > struct iovec value_out; /* out */ > uint32_t error; /* out */ > uint32_t reserved; > }; Ahhh, XFS_IOC_ATTRMULTI_BY_HANDLE reborn. This is how xfsdump gets and sets attributes efficiently when dumping and restoring files - it's an interface that allows batches of xattr operations to be run on a file in a single syscall. I've said in the past when discussing things like statx() that maybe everything should be addressable via the xattr namespace and set/queried via xattr names regardless of how the filesystem stores the data. 
The VFS/filesystem simply translates the name to the storage location of the information. It might be held in xattrs, but it could just be a flag bit in an inode field. Then we just get named xattrs in batches from an open fd. > int getvalues(int dfd, const char *path, struct name_val *vec, size_t num, > unsigned int flags); > > @dfd and @path are used to lookup object $ORIGIN. @vec contains @num > name/value descriptors. @flags contains lookup flags for @path. > > The syscall returns the number of values filled or an error. > > A single name/value descriptor has the following fields: > > @name describes the object whose value is to be returned. E.g. > > mnt - list of mount parameters > mnt:mountpoint - the mountpoint of the mount of $ORIGIN > mntns - list of mount ID's reachable from the current root > mntns:21:parentid - parent ID of the mount with ID of 21 > xattr:security.selinux - the security.selinux extended attribute > data:foo/bar - the data contained in file $ORIGIN/foo/bar How are these different from just declaring new xattr namespaces for these things. e.g. open any file and list the xattrs in the xattr:mount.mnt namespace to get the list of mount parameters for that mount. Why do we need a new "xattr in everything but name" interface when we could just extend the one we've already got and formalise a new, cleaner version of xattr batch APIs that have been around for 20-odd years already? Cheers, Dave.
On 3/23/2022 3:58 PM, Dave Chinner wrote: > On Tue, Mar 22, 2022 at 08:27:12PM +0100, Miklos Szeredi wrote: >> Add a new userspace API that allows getting multiple short values in a >> single syscall. >> >> This would be useful for the following reasons: >> >> - Calling open/read/close for many small files is inefficient. E.g. on my >> desktop invoking lsof(1) results in ~60k open + read + close calls under >> /proc and 90% of those are 128 bytes or less. > How does doing the open/read/close in a single syscall make this any > more efficient? All it saves is the overhead of a couple of > syscalls, it doesn't reduce any of the setup or teardown overhead > needed to read the data itself.... > >> - Interfaces for getting various attributes and statistics are fragmented. >> For files we have basic stat, statx, extended attributes, file attributes >> (for which there are two overlapping ioctl interfaces). For mounts and >> superblocks we have stat*fs as well as /proc/$PID/{mountinfo,mountstats}. >> The latter also has the problem on not allowing queries on a specific >> mount. > https://xkcd.com/927/ > >> - Some attributes are cheap to generate, some are expensive. Allowing >> userspace to select which ones it needs should allow optimizing queries. >> >> - Adding an ascii namespace should allow easy extension and self >> description. >> >> - The values can be text or binary, whichever is fits best. >> >> The interface definition is: >> >> struct name_val { >> const char *name; /* in */ >> struct iovec value_in; /* in */ >> struct iovec value_out; /* out */ >> uint32_t error; /* out */ >> uint32_t reserved; >> }; > Ahhh, XFS_IOC_ATTRMULTI_BY_HANDLE reborn. This is how xfsdump gets > and sets attributes efficiently when dumping and restoring files - > it's an interface that allows batches of xattr operations to be run > on a file in a single syscall. 
> > I've said in the past when discussing things like statx() that maybe > everything should be addressable via the xattr namespace and > set/queried via xattr names regardless of how the filesystem stores > the data. The VFS/filesystem simply translates the name to the > storage location of the information. It might be held in xattrs, but > it could just be a flag bit in an inode field. > > Then we just get named xattrs in batches from an open fd. > >> int getvalues(int dfd, const char *path, struct name_val *vec, size_t num, >> unsigned int flags); >> >> @dfd and @path are used to lookup object $ORIGIN. @vec contains @num >> name/value descriptors. @flags contains lookup flags for @path. >> >> The syscall returns the number of values filled or an error. >> >> A single name/value descriptor has the following fields: >> >> @name describes the object whose value is to be returned. E.g. >> >> mnt - list of mount parameters >> mnt:mountpoint - the mountpoint of the mount of $ORIGIN >> mntns - list of mount ID's reachable from the current root >> mntns:21:parentid - parent ID of the mount with ID of 21 >> xattr:security.selinux - the security.selinux extended attribute >> data:foo/bar - the data contained in file $ORIGIN/foo/bar > How are these different from just declaring new xattr namespaces for > these things. e.g. open any file and list the xattrs in the > xattr:mount.mnt namespace to get the list of mount parameters for > that mount. There is a significant and vocal set of people who dislike xattrs passionately. I often hear them whinging whenever someone proposes using them. I think that your suggestion has all the advantages of the getvalues(2) interface while also addressing its shortcomings. If we could get it past the anti-xattr crowd we might have something. You could even provide getvalues() on top of it. 
> > Why do we need a new "xattr in everything but name" interface when > we could just extend the one we've already got and formalise a new, > cleaner version of xattr batch APIs that have been around for 20-odd > years already? > > Cheers, > > Dave. >
On Wed, Mar 23, 2022 at 06:19:51PM -0400, Theodore Ts'o wrote: > I'm still a bit puzzled about the reason for getvalues(2) beyond, > "reduce the number of system calls". Is this a performance argument? > If so, have you benchmarked lsof using this new interface? Yeah. Even if open + read + close is a bottleneck for fuse or network file systems I think an io_uring op for just that is a much better choice instead of this crazy multi-value operation. And even on that I need to be sold first.
On Wed, Mar 23, 2022 at 04:23:34PM +0100, Miklos Szeredi wrote: > On Wed, 23 Mar 2022 at 14:38, Greg KH <gregkh@linuxfoundation.org> wrote: > > > This has been proposed in the past a few times. Most recently by the > > KVM developers, which tried to create a "generic" api, but ended up just > > making something to work for KVM as they got tired of people ignoring > > their more intrusive patch sets. See virt/kvm/binary_stats.c for what > > they ended up with, and perhaps you can just use that same type of > > interface here as well? > > So this looks like a fixed set of statistics where each one has a > descriptor (a name, size, offset, flags, ...) that tells about the > piece of data to be exported. The stats are kept up to date in kernel > memory and copied to userspace on read. The copy can be selective, > since the read can specify the offset and size of data it would like > to retrieve. > > The interface is self descriptive and selective, but its structure is > fixed for a specific object type, there's no way this could be > extended to look up things like extended attributes. Maybe that's not > a problem, but the lack of a hierarchical namespace could turn out to > be a major drawback. > > I think people underestimate the usefulness of hierarchical > namespaces, even though we use them extensively in lots of well > established interfaces. I like the namespaces, they work well. If you want self-describing interfaces (which I think your patch does), then why not just use the varlink protocol? It's been implemented for the kernel already many years ago: https://github.com/varlink and specifically: https://github.com/varlink/linux-varlink It doesn't need a new syscall. thanks, greg k-h
On Wed, 23 Mar 2022 at 23:20, Theodore Ts'o <tytso@mit.edu> wrote: > > On Wed, Mar 23, 2022 at 02:24:40PM +0100, Miklos Szeredi wrote: > > The reason I stated thinking about this is that Amir wanted a per-sb > > iostat interface and dumped it into /proc/PID/mountstats. And that is > > definitely not the right way to go about this. > > > > So we could add a statfsx() and start filling in new stuff, and that's > > what Linus suggested. But then we might need to add stuff that is not > > representable in a flat structure (like for example the stuff that > > nfs_show_stats does) and that again needs new infrastructure. > > > > Another example is task info in /proc. Utilities are doing a crazy > > number of syscalls to get trivial information. Why don't we have a > > procx(2) syscall? I guess because lots of that is difficult to > > represent in a flat structure. Just take the lsof example: tt's doing > > hundreds of thousands of syscalls on a desktop computer with just a > > few hundred processes. > > I'm still a bit puzzled about the reason for getvalues(2) beyond, > "reduce the number of system calls". Is this a performance argument? One argument that can't be worked around without batching is atomicity. Not sure how important that is, but IIRC it was one of the requirements relating to the proposed fsinfo syscall, which this API is meant to supersede. Performance was also oft repeated regarding the fsinfo API, but I'm less bought into that. > If so, have you benchmarked lsof using this new interface? Not yet. Looked yesterday at both lsof and procps source code, and both are pretty complex and not easy to plug in a new interface. But I've not yet given up... > I did a quickie run on my laptop, which currently had 444 process. > "lsof /home/tytso > /tmp/foo" didn't take long: > > % time lsof /home/tytso >& /tmp/foo > real 0m0.144s > user 0m0.039s > sys 0m0.087s > > And an strace of that same lsof command indicated had 67,889 lines. > So yeah, lots of system calls.
> But is this new system call really > going to speed up things by all that much? $ ps uax | wc -l 335 $ time lsof > /dev/null real 0m3.011s user 0m1.257s sys 0m1.249s $ strace -o /tmp/strace lsof > /dev/null $ wc -l /tmp/strace 638523 /tmp/strace That's an order of magnitude higher than in your case; don't know what could cause this. Thanks, Miklos
On Wed, 23 Mar 2022 at 23:58, Dave Chinner <david@fromorbit.com> wrote: > > On Tue, Mar 22, 2022 at 08:27:12PM +0100, Miklos Szeredi wrote: > > - Interfaces for getting various attributes and statistics are fragmented. > > For files we have basic stat, statx, extended attributes, file attributes > > (for which there are two overlapping ioctl interfaces). For mounts and > > superblocks we have stat*fs as well as /proc/$PID/{mountinfo,mountstats}. > > The latter also has the problem on not allowing queries on a specific > > mount. > > https://xkcd.com/927/ Haha! > I've said in the past when discussing things like statx() that maybe > everything should be addressable via the xattr namespace and > set/queried via xattr names regardless of how the filesystem stores > the data. The VFS/filesystem simply translates the name to the > storage location of the information. It might be held in xattrs, but > it could just be a flag bit in an inode field. Right, that would definitely make sense for inode attributes. What about other objects' attributes, statistics? Remember this started out as a way to replace /proc/self/mountinfo with something that can query individual mount. > > mnt - list of mount parameters > > mnt:mountpoint - the mountpoint of the mount of $ORIGIN > > mntns - list of mount ID's reachable from the current root > > mntns:21:parentid - parent ID of the mount with ID of 21 > > xattr:security.selinux - the security.selinux extended attribute > > data:foo/bar - the data contained in file $ORIGIN/foo/bar > > How are these different from just declaring new xattr namespaces for > these things. e.g. open any file and list the xattrs in the > xattr:mount.mnt namespace to get the list of mount parameters for > that mount. Okay. > Why do we need a new "xattr in everything but name" interface when > we could just extend the one we've already got and formalise a new, > cleaner version of xattr batch APIs that have been around for 20-odd > years already? 
Seems to make sense. But...will listxattr list everything recursively? I guess that won't work, better just list traditional xattrs, otherwise we'll likely get regressions, and anyway the point of a hierarchical namespace is to be able to list nodes on each level. We can use getxattr() for this purpose, just like getvalues() does in the above example. Thanks, Miklos
> > I've said in the past when discussing things like statx() that maybe > > everything should be addressable via the xattr namespace and > > set/queried via xattr names regardless of how the filesystem stores > > the data. The VFS/filesystem simply translates the name to the > > storage location of the information. It might be held in xattrs, but > > it could just be a flag bit in an inode field. > > Right, that would definitely make sense for inode attributes. Why limit to inode attributes? The argument for getxattr()/fgetxattr() is exactly the same as the argument for statfs()/fstatfs(), and the latter returns the attributes of the sb and the mnt (i.e. calculate_f_flags()). I don't see a problem with querying attributes of a mount/sb the same way as long as the namespace is clear about what is the object that is being queried (e.g. getxattr(path, "fsinfo.sbiostats.rchar",...). > > What about other objects' attributes, statistics? Remember this > started out as a way to replace /proc/self/mountinfo with something > that can query individual mount. > > > > mnt - list of mount parameters > > > mnt:mountpoint - the mountpoint of the mount of $ORIGIN > > > mntns - list of mount ID's reachable from the current root > > > mntns:21:parentid - parent ID of the mount with ID of 21 > > > xattr:security.selinux - the security.selinux extended attribute > > > data:foo/bar - the data contained in file $ORIGIN/foo/bar > > > > How are these different from just declaring new xattr namespaces for > > these things. e.g. open any file and list the xattrs in the > > xattr:mount.mnt namespace to get the list of mount parameters for > > that mount. > > Okay. > > > Why do we need a new "xattr in everything but name" interface when > > we could just extend the one we've already got and formalise a new, > > cleaner version of xattr batch APIs that have been around for 20-odd > > years already? > > Seems to make sense. But...will listxattr list everyting recursively?
> I guess that won't work, better just list traditional xattrs, > otherwise we'll likely get regressions, and anyway the point of a > hierarchical namespace is to be able to list nodes on each level. We > can use getxattr() for this purpose, just like getvalues() does in the > above example. > FYI, there are already precedents for "virtual" xattrs, see the user.smb3.* family in fs/cifs/xattr.c for example. Those cifs "virtual" (or "remote") xattrs are not listed by listxattr, even though they ARE properties of the file which are very relevant for backup. Currently, they use the user.* namespace, but the values could be also exposed via a more generic fsinfo.* namespace that is dedicated to these sort of things and then, as you suggest, getxattr(path, "fsinfo",...) can list "smb3" for cifs. I like where this is going :) Thanks, Amir.
Miklos Szeredi <miklos@szeredi.hu> writes: > On Wed, 23 Mar 2022 at 23:20, Theodore Ts'o <tytso@mit.edu> wrote: >> >> On Wed, Mar 23, 2022 at 02:24:40PM +0100, Miklos Szeredi wrote: >> > The reason I stated thinking about this is that Amir wanted a per-sb >> > iostat interface and dumped it into /proc/PID/mountstats. And that is >> > definitely not the right way to go about this. >> > >> > So we could add a statfsx() and start filling in new stuff, and that's >> > what Linus suggested. But then we might need to add stuff that is not >> > representable in a flat structure (like for example the stuff that >> > nfs_show_stats does) and that again needs new infrastructure. >> > >> > Another example is task info in /proc. Utilities are doing a crazy >> > number of syscalls to get trivial information. Why don't we have a >> > procx(2) syscall? I guess because lots of that is difficult to >> > represent in a flat structure. Just take the lsof example: tt's doing >> > hundreds of thousands of syscalls on a desktop computer with just a >> > few hundred processes. >> >> I'm still a bit puzzled about the reason for getvalues(2) beyond, >> "reduce the number of system calls". Is this a performance argument? > > One argument that can't be worked around without batchingis atomicity. > Not sure how important that is, but IIRC it was one of the > requirements relating to the proposed fsinfo syscall, which this API > is meant to supersede. Performance was also oft repeated regarding > the fsinfo API, but I'm less bought into that. A silly question. Have you looked to see if you can perform this work with io_uring? I know io_uring does all of the batching already, so I think io_uring is as ready as anything is to solve the performance issues, and the general small file problem. There is also the bpf information extractor (Sorry I forget what its proper name is) that also can solve many of the small read problems.
I am very confused: you mention atomicity, but I don't see any new filesystem hooks or any way you could implement atomicity for reads, much less writes, in the patch you posted. If the real target is something like fsinfo that is returning information that is not currently available except by possibly processing /proc/self/mountinfo, perhaps a more targeted name would help. I certainly did not get the impression when skimming your introduction to this that you were trying to solve anything except reading a number of small files. Eric
On Thu, Mar 24, 2022 at 09:57:26AM +0100, Miklos Szeredi wrote: > On Wed, 23 Mar 2022 at 23:58, Dave Chinner <david@fromorbit.com> wrote: > > > > On Tue, Mar 22, 2022 at 08:27:12PM +0100, Miklos Szeredi wrote: > > > > - Interfaces for getting various attributes and statistics are fragmented. > > > For files we have basic stat, statx, extended attributes, file attributes > > > (for which there are two overlapping ioctl interfaces). For mounts and > > > superblocks we have stat*fs as well as /proc/$PID/{mountinfo,mountstats}. > > > The latter also has the problem on not allowing queries on a specific > > > mount. > > > > https://xkcd.com/927/ > > Haha! > > > I've said in the past when discussing things like statx() that maybe > > everything should be addressable via the xattr namespace and > > set/queried via xattr names regardless of how the filesystem stores > > the data. The VFS/filesystem simply translates the name to the > > storage location of the information. It might be held in xattrs, but > > it could just be a flag bit in an inode field. > > Right, that would definitely make sense for inode attributes. > > What about other objects' attributes, statistics? Remember this > started out as a way to replace /proc/self/mountinfo with something > that can query individual mount. For individual mount info, why do we even need to query something in /proc? I mean, every open file in the mount has access to the mount and the underlying superblock, so why not just make the query namespace accessible from any open fd on that mount? e.g. /proc/self/mountinfo tells you where the mounts are, then you can just open(O_PATH) the mount point you want the info from and retrieve the relevant xattrs from that fd. The information itself does not need to be in /proc, nor only accessible from /proc, nor be limited to proc infrastructure, nor be constrained by proc's arbitrary "one value per file" presentation....
Indeed, we don't have to centralise all the information in one place - all we need is to have a well defined, consistent method for indexing that information and all the shenanigans for accessing common stuff can be wrapped up in a common userspace library (similar to how iterating the mount table is generic C library functionality). > > > mnt - list of mount parameters > > > mnt:mountpoint - the mountpoint of the mount of $ORIGIN > > > mntns - list of mount ID's reachable from the current root > > > mntns:21:parentid - parent ID of the mount with ID of 21 > > > xattr:security.selinux - the security.selinux extended attribute > > > data:foo/bar - the data contained in file $ORIGIN/foo/bar > > > > How are these different from just declaring new xattr namespaces for > > these things. e.g. open any file and list the xattrs in the > > xattr:mount.mnt namespace to get the list of mount parameters for > > that mount. > > Okay. > > > Why do we need a new "xattr in everything but name" interface when > > we could just extend the one we've already got and formalise a new, > > cleaner version of xattr batch APIs that have been around for 20-odd > > years already? > > Seems to make sense. But...will listxattr list everyting recursively? > I guess that won't work, better just list traditional xattrs, > otherwise we'll likely get regressions, *nod* > and anyway the point of a > hierarchical namespace is to be able to list nodes on each level. We > can use getxattr() for this purpose, just like getvalues() does in the > above example. Yup, and like Casey suggests, you could implement a generic getvalues()-like user library on top of it so users don't even need to know where and how the values are located or retrieved. The other advantage of an xattr interface is that is also provides a symmetrical API for -changing- values. 
No need for some special configfs or configfd thingy for setting parameters - just change the value of the parameter or mount option with a simple setxattr call. That retains the simplicity of proc and sysfs attributes in that you can change them just by writing a new value to the file.... Cheers, Dave.
On Thu, Mar 24, 2022 at 09:44:38AM +0100, Miklos Szeredi wrote: > > If so, have you benchmarked lsof using this new interface? > > Not yet. Looked yesterday at both lsof and procps source code, and > both are pretty complex and not easy to plug in a new interface. But > I've not yet given up... I can imagine something like getvalues(2) in lsblk (based on /sys) or in lsfd (based on /proc; lsof replacement). The tools have a defined set of information to read from the kernel, so gathering all the requests into one syscall per process or block device makes sense and would dramatically reduce the number of open+read+close syscalls. Karel
On Fri, Mar 25, 2022 at 09:46:46AM +0100, Karel Zak wrote: > On Thu, Mar 24, 2022 at 09:44:38AM +0100, Miklos Szeredi wrote: > > > If so, have you benchmarked lsof using this new interface? > > > > Not yet. Looked yesterday at both lsof and procps source code, and > > both are pretty complex and not easy to plug in a new interface. But > > I've not yet given up... > > I can imagine something like getvalues(2) in lsblk (based on /sys) or > in lsfd (based on /proc; lsof replacement). The tools have defined set > of information to read from kernel, so gather all the requests to the > one syscall for each process or block device makes sense and it will > dramatically reduce number of open+read+close syscalls. And do those open+read+close syscalls actually show up in measurements? Again, I tried to find a real-world application where turning those 3 into 1 would matter, and I couldn't. procps had no decreased system load that I could notice. I'll mess with lsof but that's really just a stress-test, not anything that is run all the time, right? And as others have said, using io_uring() would also solve the 3 syscall issue, but no one seems to want to convert these tools to use that, which implies that it's not really an issue for anyone :) thanks, greg k-h
On Fri, Mar 25, 2022 at 07:31:16AM +1100, Dave Chinner wrote: > > What about other objects' attributes, statistics? Remember this > > started out as a way to replace /proc/self/mountinfo with something > > that can query individual mount. > > For individual mount info, why do we even need to query something in > /proc? I mean, every open file in the mount has access to the mount > and the underlying superblock, so why not just make the query > namespace accessable from any open fd on that mount? The currently most problematic situation is in systemd. We get a generic notification (poll() on mountinfo) that something has been modified in the mount table, and then we need to parse the whole file to get the details. So, the ideal solution would be a notification that points to the FS and an interface to read information (e.g. getvalues()) about that FS. Karel
On Fri, Mar 25, 2022 at 09:54:21AM +0100, Greg KH wrote: > On Fri, Mar 25, 2022 at 09:46:46AM +0100, Karel Zak wrote: > > On Thu, Mar 24, 2022 at 09:44:38AM +0100, Miklos Szeredi wrote: > > > > If so, have you benchmarked lsof using this new interface? > > > > > > Not yet. Looked yesterday at both lsof and procps source code, and > > > both are pretty complex and not easy to plug in a new interface. But > > > I've not yet given up... > > > > I can imagine something like getvalues(2) in lsblk (based on /sys) or > > in lsfd (based on /proc; lsof replacement). The tools have defined set > > of information to read from kernel, so gather all the requests to the > > one syscall for each process or block device makes sense and it will > > dramatically reduce number of open+read+close syscalls. > > And do those open+read+close syscalls actually show up in measurements? > > Again, I tried to find a real-world application that turning those 3 > into 1 would matter, and I couldn't. procps had no decreased system > load that I could notice. I'll mess with lsof but that's really just a > stress-test, not anything that is run all the time, right? Right, the speed of ps(1) or lsof(1) is not important. IMHO the current discussion about getvalues() goes in the wrong direction :-) I guess the primary motivation is not to replace open+read+close, but to provide userspace with something usable for getting information from the mount table, because the current /proc/#/mountinfo and notification by poll() are horrible. Don't forget that the previous attempt was fsinfo() from David Howells (unfortunately, it was too complex and rejected by Linus). > And as others have said, using io_uring() would also solve the 3 syscall > issue, but no one seems to want to convert these tools to use that, > which implies that it's not really an issue for anyone :) OK, I'll think about it :-) Karel
Hi! > > If so, have you benchmarked lsof using this new interface? > > Not yet. Looked yesterday at both lsof and procps source code, and > both are pretty complex and not easy to plug in a new interface. But > I've not yet given up... Looking at lsof, it seems to use fopen() and fgets() to parse various proc files. I doubt that we can make the parsing significantly faster without completely rewriting the internals. As for procps, readproc.c has a file2str() function that copies whole proc files into a buffer with open() - read() - close(). It may be reasonably easy to hook the new syscall in there, and it will probably make ps and top slightly faster.
On Fri, 2022-03-25 at 07:31 +1100, Dave Chinner wrote: > On Thu, Mar 24, 2022 at 09:57:26AM +0100, Miklos Szeredi wrote: > > On Wed, 23 Mar 2022 at 23:58, Dave Chinner <david@fromorbit.com> > > wrote: > > > > > > On Tue, Mar 22, 2022 at 08:27:12PM +0100, Miklos Szeredi wrote: > > > > > > - Interfaces for getting various attributes and statistics are > > > > fragmented. > > > > For files we have basic stat, statx, extended attributes, > > > > file attributes > > > > (for which there are two overlapping ioctl interfaces). For > > > > mounts and > > > > superblocks we have stat*fs as well as > > > > /proc/$PID/{mountinfo,mountstats}. > > > > The latter also has the problem on not allowing queries on a > > > > specific > > > > mount. > > > > > > https://xkcd.com/927/ > > > > Haha! > > > > > I've said in the past when discussing things like statx() that > > > maybe > > > everything should be addressable via the xattr namespace and > > > set/queried via xattr names regardless of how the filesystem > > > stores > > > the data. The VFS/filesystem simply translates the name to the > > > storage location of the information. It might be held in xattrs, > > > but > > > it could just be a flag bit in an inode field. > > > > Right, that would definitely make sense for inode attributes. > > > > What about other objects' attributes, statistics? Remember this > > started out as a way to replace /proc/self/mountinfo with something > > that can query individual mount. > > For individual mount info, why do we even need to query something in > /proc? I mean, every open file in the mount has access to the mount > and the underlying superblock, so why not just make the query > namespace accessable from any open fd on that mount? > > e.g. /proc/self/mountinfo tells you where the mounts are, then you > can just open(O_PATH) the mount point you want the info from and > retrieve the relevant xattrs from that fd. 
The information itself > does not need to be in /proc, nor only accessible from /proc, nor be > limited to proc infrastructure, nor be constrained by proc's > arbitrary "one value per file" presentation.... > > Indeed, we don't have to centralise all the information in one place > - all we need is to have a well defined, consistent method for > indexing that information and all the shenanigans for accessing > common stuff can be wrapped up in a common userspace library > (similar to how iterating the mount table is generic C library > functionality). > > > > > mnt - list of mount parameters > > > > mnt:mountpoint - the mountpoint of the mount of $ORIGIN > > > > mntns - list of mount ID's reachable from the > > > > current root > > > > mntns:21:parentid - parent ID of the mount with ID of 21 > > > > xattr:security.selinux - the security.selinux extended > > > > attribute > > > > data:foo/bar - the data contained in file > > > > $ORIGIN/foo/bar > > > > > > How are these different from just declaring new xattr namespaces > > > for > > > these things. e.g. open any file and list the xattrs in the > > > xattr:mount.mnt namespace to get the list of mount parameters for > > > that mount. > > > > Okay. > > > > > Why do we need a new "xattr in everything but name" interface > > > when > > > we could just extend the one we've already got and formalise a > > > new, > > > cleaner version of xattr batch APIs that have been around for 20- > > > odd > > > years already? > > > > Seems to make sense. But...will listxattr list everyting > > recursively? > > I guess that won't work, better just list traditional xattrs, > > otherwise we'll likely get regressions, > > *nod* > > > and anyway the point of a > > hierarchical namespace is to be able to list nodes on each level. > > We > > can use getxattr() for this purpose, just like getvalues() does in > > the > > above example. 
> > Yup, and like Casey suggests, you could implement a generic > getvalues()-like user library on top of it so users don't even need > to know where and how the values are located or retrieved. > > The other advantage of an xattr interface is that is also provides a > symmetrical API for -changing- values. No need for some special > configfs or configfd thingy for setting parameters - just change the > value of the parameter or mount option with a simple setxattr call. > That retains the simplicity of proc and sysfs attributes in that you > can change them just by writing a new value to the file.... The downsides are, however, that the current interface provides little in the way of atomicity if you want to read or write to multiple attributes at the same time. Something like a backup program might want to be able to atomically retrieve the ctime when it is backing up the attributes. Also, when setting attributes, I'd like to avoid multiple syscalls when I'm changing multiple related attributes. IOW: Adding a batching interface that is akin to what Miklos was proposing would be a helpful change if we want to go down this path.
On Fri, Mar 25, 2022 at 1:46 AM Karel Zak <kzak@redhat.com> wrote: > > I can imagine something like getvalues(2) in lsblk (based on /sys) or > in lsfd (based on /proc; lsof replacement). I really would be very hesitant to add new interfaces for completely specialty purposes. As others have mentioned, this has been tried for much more fundamental reasons (that whole "open-and-read" thing), and it hasn't been an obvious improvement. It *could* possibly be an improvement if it would allow us to take advantage of special server-side operations in a networked filesystem (ie like the whole "copy_file_range" kinds of interfaces that allow server-side things) where you need to transfer less data, or need fewer back-and-forth operations. And even that is clearly questionable, with some of those network file interfaces basically not having been used in real world situations in the past, so.. (Also, with "copy_file_range" we not only had others actively doing it, but the wins were "several orders of magnitude", so even if it was fairly rare, it was _so_ big that it was worth doing anyway). With the "open-and-read" thing, the wins aren't that enormous. And getvalues() isn't even that. It's literally a speciality interface for a very special thing. Yes, I'm sure it avoids several system calls. Yes, I'm sure it avoids parsing strings etc. But I really don't think this is something we want to do unless people can show enormous and real-world examples of where it makes such a huge difference that we absolutely have to do it. Linus
On Fri, Mar 25, 2022 at 10:25:53AM +0100, Karel Zak wrote: > > Right, the speed of ps(1) or lsof(1) is not important. IMHO the current > discussion about getvalues() goes in wrong direction :-) > > I guess the primary motivation is not to replace open+read+close, but > provide to userspace something usable to get information from mount > table, because the current /proc/#/mountinfo and notification by > poll() is horrible. I think that's because the getvalues(2) prototype *only* optimizes away open+read+close, and doesn't do a *thing* with respect to /proc/<pid>/mountinfo. > Don't forget that the previous attempt was fsinfo() from David Howells > (unfortunately, it was too complex and rejected by Linus). fsinfo() tried to do a lot more than solving the /proc/<pid>/mountinfo problem; perhaps that was the cause of the complexity. Ignoring the notification problem (which I suspect we could solve with an extension of fsnotify), if the goal is to find a cleaner way to fetch information about a process's mount namespace and the mounts in that namespace, why not try to export that information via sysfs? Information about devices is just as complex, after all. We could make mount namespaces to be their own first class object, so there would be an entry in /proc/<pid> which returns the mount namespace id used by a particular process. Similarly, let each mounted file system be its own first class object. Information about each mount namespace would be in /sys/mnt_ns, and information about each mounted file system would be in /sys/superblock. Then in /sys/mnt_ns there would be a directory for each (superblock, mountpoint) pair. Given how quickly programs like lsof can open tens of thousands of small files, and typically there aren't that many mounted file systems in a particular mount namespace, performance really shouldn't be a problem.
If it works well enough for other kernel objects that are accessed via
sysfs, and fsinfo() is way too complex, why don't we try a pattern
which has worked and is "native" to Linux?

					- Ted
On Fri, Mar 25, 2022 at 04:42:27PM +0000, Trond Myklebust wrote:
> On Fri, 2022-03-25 at 07:31 +1100, Dave Chinner wrote:
> > > and anyway the point of a
> > > hierarchical namespace is to be able to list nodes on each level.
> > > We
> > > can use getxattr() for this purpose, just like getvalues() does in
> > > the
> > > above example.
> >
> > Yup, and like Casey suggests, you could implement a generic
> > getvalues()-like user library on top of it so users don't even need
> > to know where and how the values are located or retrieved.
> >
> > The other advantage of an xattr interface is that it also provides a
> > symmetrical API for -changing- values. No need for some special
> > configfs or configfd thingy for setting parameters - just change the
> > value of the parameter or mount option with a simple setxattr call.
> > That retains the simplicity of proc and sysfs attributes in that you
> > can change them just by writing a new value to the file....
>
> The downsides are, however, that the current interface provides little
> in the way of atomicity if you want to read or write to multiple
> attributes at the same time. Something like a backup program might want
> to be able to atomically retrieve the ctime when it is backing up the
> attributes.

I assumed that batched updates were implied and understood after my
earlier comments about XFS_IOC_ATTRMULTI_BY_HANDLE as used by
xfsdump/restore for the past 20+ years.

> Also, when setting attributes, I'd like to avoid multiple syscalls when
> I'm changing multiple related attributes.
>
> IOW: Adding a batching interface that is akin to what Miklos was
> proposing would be a helpful change if we want to go down this path.

Yup, that's exactly what XFS_IOC_ATTRMULTI_BY_HANDLE provides and I'm
assuming that would also be provided by whatever formalised generic
syscall API we come up with here...

Cheers,

Dave.
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c84d12608cd2..c72668001b39 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -372,6 +372,7 @@
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv		sys_futex_waitv
 450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
+451	common	getvalues		sys_getvalues
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/fs/Makefile b/fs/Makefile
index 208a74e0b00e..f00d6bcd1178 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -16,7 +16,7 @@ obj-y :=	open.o read_write.o file_table.o super.o \
 		pnode.o splice.o sync.o utimes.o d_path.o \
 		stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
 		fs_types.o fs_context.o fs_parser.o fsopen.o init.o \
-		kernel_read_file.o remap_range.o
+		kernel_read_file.o remap_range.o values.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=	buffer.o direct-io.o mpage.o
diff --git a/fs/mount.h b/fs/mount.h
index 0b6e08cf8afb..a3ca5233e481 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -148,3 +148,11 @@ static inline bool is_anon_ns(struct mnt_namespace *ns)
 }
 
 extern void mnt_cursor_del(struct mnt_namespace *ns, struct mount *cursor);
+
+extern void namespace_lock_read(void);
+extern void namespace_unlock_read(void);
+extern void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt);
+extern void seq_mnt_list(struct seq_file *seq, struct mnt_namespace *ns,
+			 struct path *root);
+extern struct vfsmount *mnt_lookup_by_id(struct mnt_namespace *ns,
+					 struct path *root, int id);
diff --git a/fs/namespace.c b/fs/namespace.c
index de6fae84f1a1..52b15c17251f 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1405,6 +1405,38 @@ void mnt_cursor_del(struct mnt_namespace *ns, struct mount *cursor)
 }
 #endif /* CONFIG_PROC_FS */
 
+void seq_mnt_list(struct seq_file *seq, struct mnt_namespace *ns,
+		  struct path *root)
+{
+	struct mount *m;
+
+	down_read(&namespace_sem);
+	for (m = mnt_list_next(ns, &ns->list); m; m = mnt_list_next(ns, &m->mnt_list)) {
+		if (is_path_reachable(m, m->mnt.mnt_root, root)) {
+			seq_printf(seq, "%i", m->mnt_id);
+			seq_putc(seq, '\0');
+		}
+	}
+	up_read(&namespace_sem);
+}
+
+/* called with namespace_sem held for read */
+struct vfsmount *mnt_lookup_by_id(struct mnt_namespace *ns, struct path *root,
+				  int id)
+{
+	struct mount *m;
+
+	for (m = mnt_list_next(ns, &ns->list); m; m = mnt_list_next(ns, &m->mnt_list)) {
+		if (m->mnt_id == id) {
+			if (is_path_reachable(m, m->mnt.mnt_root, root))
+				return mntget(&m->mnt);
+			else
+				return NULL;
+		}
+	}
+	return NULL;
+}
+
 /**
  * may_umount_tree - check if a mount tree is busy
  * @m: root of mount tree
@@ -1494,6 +1526,16 @@ static inline void namespace_lock(void)
 {
 	down_write(&namespace_sem);
 }
+void namespace_lock_read(void)
+{
+	down_read(&namespace_sem);
+}
+
+void namespace_unlock_read(void)
+{
+	up_read(&namespace_sem);
+}
+
 enum umount_tree_flags {
 	UMOUNT_SYNC = 1,
 	UMOUNT_PROPAGATE = 2,
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index 49650e54d2f8..fa6dc2c20578 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -61,7 +61,7 @@ static int show_sb_opts(struct seq_file *m, struct super_block *sb)
 	return security_sb_show_options(m, sb);
 }
 
-static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
+void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
 {
 	static const struct proc_fs_opts mnt_opts[] = {
 		{ MNT_NOSUID, ",nosuid" },
diff --git a/fs/values.c b/fs/values.c
new file mode 100644
index 000000000000..618fa9bf48a1
--- /dev/null
+++ b/fs/values.c
@@ -0,0 +1,524 @@
+#include <linux/syscalls.h>
+#include <linux/printk.h>
+#include <linux/namei.h>
+#include <linux/fs_struct.h>
+#include <linux/posix_acl_xattr.h>
+#include <linux/xattr.h>
+#include "pnode.h"
+#include "internal.h"
+
+#define VAL_GRSEP ':'
+
+struct name_val {
+	const char __user *name;	/* in */
+	struct iovec value_in;		/* in */
+	struct iovec value_out;		/* out */
+	__u32 error;			/* out */
+	__u32 reserved;
+};
+
+struct val_iter {
+	struct name_val __user *curr;
+	size_t num;
+	struct iovec vec;
+	char name[256];
+	size_t bufsize;
+	struct seq_file seq;
+	const char *prefix;
+	const char *sub;
+};
+
+struct val_desc {
+	const char *name;
+	union {
+		int idx;
+		int (*get)(struct val_iter *vi, const struct path *path);
+	};
+};
+
+static int val_get(struct val_iter *vi)
+{
+	struct name_val nameval;
+	long err;
+
+	if (copy_from_user(&nameval, vi->curr, sizeof(nameval)))
+		return -EFAULT;
+
+	err = strncpy_from_user(vi->name, nameval.name, sizeof(vi->name));
+	if (err < 0)
+		return err;
+	if (err == sizeof(vi->name))
+		return -ERANGE;
+
+	if (nameval.value_in.iov_base)
+		vi->vec = nameval.value_in;
+
+	vi->seq.size = min(vi->vec.iov_len, vi->bufsize);
+	vi->seq.count = 0;
+
+	return 0;
+}
+
+static int val_next(struct val_iter *vi)
+{
+	vi->curr++;
+	vi->num--;
+
+	return vi->num ? val_get(vi) : 0;
+}
+
+static int val_end(struct val_iter *vi, size_t count)
+{
+	struct iovec iov = {
+		.iov_base = vi->vec.iov_base,
+		.iov_len = count,
+	};
+
+	if (copy_to_user(&vi->curr->value_out, &iov, sizeof(iov)))
+		return -EFAULT;
+
+	vi->vec.iov_base += count;
+	vi->vec.iov_len -= count;
+
+	return val_next(vi);
+}
+
+static int val_err(struct val_iter *vi, int err)
+{
+	if (put_user(-err, &vi->curr->error))
+		return -EFAULT;
+
+	return val_next(vi);
+}
+
+static int val_end_seq(struct val_iter *vi, int err)
+{
+	size_t count = vi->seq.count;
+
+	if (err)
+		return val_err(vi, err);
+
+	if (count == vi->seq.size)
+		return -EOVERFLOW;
+
+	if (copy_to_user(vi->vec.iov_base, vi->seq.buf, count))
+		return -EFAULT;
+
+	return val_end(vi, count);
+}
+
+static struct val_desc *val_lookup(struct val_iter *vi, struct val_desc *vd)
+{
+	const char *name = vi->name;
+	const char *prefix = vi->prefix;
+	size_t prefixlen = strlen(prefix);
+
+	if (prefixlen) {
+		/*
+		 * Name beginning with a group separator is a shorthand for
+		 * the previous prefix.
+		 */
+		if (name[0] == VAL_GRSEP) {
+			name++;
+		} else {
+			if (strncmp(name, prefix, prefixlen) != 0)
+				return NULL;
+			name += prefixlen;
+		}
+	}
+
+	vi->sub = NULL;
+	for (; vd->name; vd++) {
+		if (strcmp(name, vd->name) == 0)
+			break;
+		else {
+			size_t grlen = strlen(vd->name);
+
+			if (strncmp(vd->name, name, grlen) == 0 &&
+			    name[grlen] == VAL_GRSEP) {
+				vi->sub = name + grlen + 1;
+				break;
+			}
+		}
+	}
+	return vd;
+}
+
+static int val_get_group(struct val_iter *vi, struct val_desc *vd)
+{
+	for (; vd->name; vd++)
+		seq_write(&vi->seq, vd->name, strlen(vd->name) + 1);
+
+	return val_end_seq(vi, 0);
+}
+
+static bool val_push_prefix(struct val_iter *vi, const char **oldprefix)
+{
+	char *newprefix;
+
+	newprefix = kmemdup_nul(vi->name, vi->sub - vi->name, GFP_KERNEL);
+	if (newprefix) {
+		*oldprefix = vi->prefix;
+		vi->prefix = newprefix;
+	}
+
+	return newprefix;
+}
+
+static void val_pop_prefix(struct val_iter *vi, const char *oldprefix)
+{
+	kfree(vi->prefix);
+	vi->prefix = oldprefix;
+}
+
+enum {
+	VAL_MNT_ID,
+	VAL_MNT_PARENTID,
+	VAL_MNT_ROOT,
+	VAL_MNT_MOUNTPOINT,
+	VAL_MNT_OPTIONS,
+	VAL_MNT_SHARED,
+	VAL_MNT_MASTER,
+	VAL_MNT_PROPAGATE_FROM,
+	VAL_MNT_UNBINDABLE,
+	VAL_MNT_NOTFOUND,
+};
+
+static struct val_desc val_mnt_group[] = {
+	{ .name = "id",			.idx = VAL_MNT_ID },
+	{ .name = "parentid",		.idx = VAL_MNT_PARENTID, },
+	{ .name = "root",		.idx = VAL_MNT_ROOT, },
+	{ .name = "mountpoint",		.idx = VAL_MNT_MOUNTPOINT, },
+	{ .name = "options",		.idx = VAL_MNT_OPTIONS, },
+	{ .name = "shared",		.idx = VAL_MNT_SHARED, },
+	{ .name = "master",		.idx = VAL_MNT_MASTER, },
+	{ .name = "propagate_from",	.idx = VAL_MNT_PROPAGATE_FROM, },
+	{ .name = "unbindable",		.idx = VAL_MNT_UNBINDABLE, },
+	{ .name = NULL,			.idx = VAL_MNT_NOTFOUND },
+};
+
+static int seq_mnt_root(struct seq_file *seq, struct vfsmount *mnt)
+{
+	struct super_block *sb = mnt->mnt_sb;
+	int err = 0;
+
+	if (sb->s_op->show_path) {
+		err = sb->s_op->show_path(seq, mnt->mnt_root);
+		if (!err) {
+			seq_putc(seq, '\0');
+			if (seq->count < seq->size)
+				seq->count = string_unescape(seq->buf, seq->buf, seq->size, UNESCAPE_OCTAL);
+		}
+	} else {
+		seq_dentry(seq, mnt->mnt_root, "");
+	}
+
+	return err;
+}
+
+static int val_mnt_show(struct val_iter *vi, struct vfsmount *mnt)
+{
+	struct mount *m = real_mount(mnt);
+	struct path root, mnt_path;
+	struct val_desc *vd;
+	const char *oldprefix;
+	int err = 0;
+
+	if (!val_push_prefix(vi, &oldprefix))
+		return -ENOMEM;
+
+	while (!err && vi->num) {
+		vd = val_lookup(vi, val_mnt_group);
+		if (!vd)
+			break;
+
+		switch (vd->idx) {
+		case VAL_MNT_ID:
+			seq_printf(&vi->seq, "%i", m->mnt_id);
+			break;
+		case VAL_MNT_PARENTID:
+			seq_printf(&vi->seq, "%i", m->mnt_parent->mnt_id);
+			break;
+		case VAL_MNT_ROOT:
+			seq_mnt_root(&vi->seq, mnt);
+			break;
+		case VAL_MNT_MOUNTPOINT:
+			get_fs_root(current->fs, &root);
+			mnt_path.dentry = mnt->mnt_root;
+			mnt_path.mnt = mnt;
+			err = seq_path_root(&vi->seq, &mnt_path, &root, "");
+			path_put(&root);
+			break;
+		case VAL_MNT_OPTIONS:
+			seq_puts(&vi->seq, mnt->mnt_flags & MNT_READONLY ? "ro" : "rw");
+			show_mnt_opts(&vi->seq, mnt);
+			break;
+		case VAL_MNT_SHARED:
+			if (IS_MNT_SHARED(m))
+				seq_printf(&vi->seq, "%i,", m->mnt_group_id);
+			break;
+		case VAL_MNT_MASTER:
+			if (IS_MNT_SLAVE(m))
+				seq_printf(&vi->seq, "%i,",
+					   m->mnt_master->mnt_group_id);
+			break;
+		case VAL_MNT_PROPAGATE_FROM:
+			if (IS_MNT_SLAVE(m)) {
+				int dom;
+
+				get_fs_root(current->fs, &root);
+				dom = get_dominating_id(m, &root);
+				path_put(&root);
+				if (dom)
+					seq_printf(&vi->seq, "%i,", dom);
+			}
+			break;
+		case VAL_MNT_UNBINDABLE:
+			if (IS_MNT_UNBINDABLE(m))
+				seq_puts(&vi->seq, "yes");
+			break;
+		default:
+			err = -ENOENT;
+			break;
+		}
+		err = val_end_seq(vi, err);
+	}
+	val_pop_prefix(vi, oldprefix);
+
+	return err;
+}
+
+static int val_mnt_get(struct val_iter *vi, const struct path *path)
+{
+	int err;
+
+	if (!vi->sub)
+		return val_get_group(vi, val_mnt_group);
+
+	namespace_lock_read();
+	err = val_mnt_show(vi, path->mnt);
+	namespace_unlock_read();
+
+	return err;
+}
+
+static int val_mntns_get(struct val_iter *vi, const struct path *path)
+{
+	struct mnt_namespace *mnt_ns = current->nsproxy->mnt_ns;
+	struct vfsmount *mnt;
+	struct path root;
+	char *end;
+	int mnt_id;
+	int err;
+
+	if (!vi->sub) {
+		get_fs_root(current->fs, &root);
+		seq_mnt_list(&vi->seq, mnt_ns, &root);
+		path_put(&root);
+		return val_end_seq(vi, 0);
+	}
+
+	end = strchr(vi->sub, VAL_GRSEP);
+	if (end)
+		*end = '\0';
+	err = kstrtoint(vi->sub, 10, &mnt_id);
+	if (err)
+		return val_err(vi, err);
+	vi->sub = NULL;
+	if (end) {
+		*end = VAL_GRSEP;
+		vi->sub = end + 1;
+	}
+
+	namespace_lock_read();
+	get_fs_root(current->fs, &root);
+	mnt = mnt_lookup_by_id(mnt_ns, &root, mnt_id);
+	path_put(&root);
+	if (!mnt) {
+		namespace_unlock_read();
+		return val_err(vi, -ENOENT);
+	}
+	if (vi->sub)
+		err = val_mnt_show(vi, mnt);
+	else
+		err = val_get_group(vi, val_mnt_group);
+
+	namespace_unlock_read();
+	mntput(mnt);
+
+	return err;
+}
+
+static ssize_t val_do_read(struct val_iter *vi, struct path *path)
+{
+	ssize_t ret;
+	struct file *file;
+	struct open_flags op = {
+		.open_flag = O_RDONLY,
+		.acc_mode = MAY_READ,
+		.intent = LOOKUP_OPEN,
+	};
+
+	file = do_file_open_root(path, "", &op);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	ret = vfs_read(file, vi->vec.iov_base, vi->vec.iov_len, NULL);
+	fput(file);
+
+	return ret;
+}
+
+static ssize_t val_do_readlink(struct val_iter *vi, struct path *path)
+{
+	int ret;
+
+	ret = security_inode_readlink(path->dentry);
+	if (ret)
+		return ret;
+
+	return vfs_readlink(path->dentry, vi->vec.iov_base, vi->vec.iov_len);
+}
+
+static inline bool dot_or_dotdot(const char *s)
+{
+	return s[0] == '.' &&
+		(s[1] == '/' || s[1] == '\0' ||
+		 (s[1] == '.' && (s[2] == '/' || s[2] == '\0')));
+}
+
+/*
+ * - empty path is okay
+ * - must not begin or end with slash or have a double slash anywhere
+ * - must not have . or .. components
+ */
+static bool val_verify_path(const char *subpath)
+{
+	const char *s = subpath;
+
+	if (s[0] == '\0')
+		return true;
+
+	if (s[0] == '/' || s[strlen(s) - 1] == '/' || strstr(s, "//"))
+		return false;
+
+	for (s--; s; s = strstr(s + 3, "/."))
+		if (dot_or_dotdot(s + 1))
+			return false;
+
+	return true;
+}
+
+static int val_data_get(struct val_iter *vi, const struct path *path)
+{
+	struct path this;
+	ssize_t ret;
+
+	if (!vi->sub)
+		return val_err(vi, -ENOENT);
+
+	if (!val_verify_path(vi->sub))
+		return val_err(vi, -EINVAL);
+
+	ret = vfs_path_lookup(path->dentry, path->mnt, vi->sub,
+			      LOOKUP_NO_XDEV | LOOKUP_BENEATH |
+			      LOOKUP_IN_ROOT, &this);
+	if (ret)
+		return val_err(vi, ret);
+
+	ret = -ENODATA;
+	if (d_is_reg(this.dentry) || d_is_symlink(this.dentry)) {
+		if (d_is_reg(this.dentry))
+			ret = val_do_read(vi, &this);
+		else
+			ret = val_do_readlink(vi, &this);
+	}
+	path_put(&this);
+	if (ret == -EFAULT)
+		return ret;
+	if (ret < 0)
+		return val_err(vi, ret);
+	if (ret == vi->vec.iov_len)
+		return -EOVERFLOW;
+
+	return val_end(vi, ret);
+}
+
+static int val_xattr_get(struct val_iter *vi, const struct path *path)
+{
+	ssize_t ret;
+	struct user_namespace *mnt_userns = mnt_user_ns(path->mnt);
+	void *value = vi->seq.buf + vi->seq.count;
+	size_t size = min_t(size_t, vi->seq.size - vi->seq.count,
+			    XATTR_SIZE_MAX);
+
+	if (!vi->sub)
+		return val_err(vi, -ENOENT);
+
+	ret = vfs_getxattr(mnt_userns, path->dentry, vi->sub, value, size);
+	if (ret < 0)
+		return val_err(vi, ret);
+
+	if ((strcmp(vi->sub, XATTR_NAME_POSIX_ACL_ACCESS) == 0) ||
+	    (strcmp(vi->sub, XATTR_NAME_POSIX_ACL_DEFAULT) == 0))
+		posix_acl_fix_xattr_to_user(mnt_userns, value, ret);
+
+	vi->seq.count += ret;
+
+	return val_end_seq(vi, 0);
+}
+
+
+static struct val_desc val_toplevel_group[] = {
+	{ .name = "mnt",	.get = val_mnt_get,	},
+	{ .name = "mntns",	.get = val_mntns_get,	},
+	{ .name = "xattr",	.get = val_xattr_get,	},
+	{ .name = "data",	.get = val_data_get,	},
+	{ .name = NULL },
+};
+
+SYSCALL_DEFINE5(getvalues,
+		int, dfd,
+		const char __user *, pathname,
+		struct name_val __user *, vec,
+		size_t, num,
+		unsigned int, flags)
+{
+	char vals[1024];
+	struct val_iter vi = {
+		.curr = vec,
+		.num = num,
+		.seq.buf = vals,
+		.bufsize = sizeof(vals),
+		.prefix = "",
+	};
+	struct val_desc *vd;
+	struct path path = {};
+	ssize_t err;
+
+	err = user_path_at(dfd, pathname, 0, &path);
+	if (err)
+		return err;
+
+	err = val_get(&vi);
+	if (err)
+		goto out;
+
+	if (!strlen(vi.name)) {
+		err = val_get_group(&vi, val_toplevel_group);
+		goto out;
+	}
+	while (!err && vi.num) {
+		vd = val_lookup(&vi, val_toplevel_group);
+		if (!vd->name)
+			err = val_err(&vi, -ENOENT);
+		else
+			err = vd->get(&vi, &path);
+	}
+out:
+	if (err == -EOVERFLOW)
+		err = 0;

+	path_put(&path);
+	return err < 0 ? err : num - vi.num;
+}
Add a new userspace API that allows getting multiple short values in a
single syscall.

This would be useful for the following reasons:

- Calling open/read/close for many small files is inefficient. E.g. on my
  desktop invoking lsof(1) results in ~60k open + read + close calls under
  /proc and 90% of those are 128 bytes or less.

- Interfaces for getting various attributes and statistics are fragmented.
  For files we have basic stat, statx, extended attributes, file attributes
  (for which there are two overlapping ioctl interfaces). For mounts and
  superblocks we have stat*fs as well as /proc/$PID/{mountinfo,mountstats}.
  The latter also has the problem of not allowing queries on a specific
  mount.

- Some attributes are cheap to generate, some are expensive. Allowing
  userspace to select which ones it needs should allow optimizing queries.

- Adding an ASCII namespace should allow easy extension and self
  description.

- The values can be text or binary, whichever fits best.

The interface definition is:

struct name_val {
	const char *name;	/* in */
	struct iovec value_in;	/* in */
	struct iovec value_out;	/* out */
	uint32_t error;		/* out */
	uint32_t reserved;
};

int getvalues(int dfd, const char *path, struct name_val *vec, size_t num,
	      unsigned int flags);

@dfd and @path are used to look up object $ORIGIN. @vec contains @num
name/value descriptors. @flags contains lookup flags for @path.

The syscall returns the number of values filled or an error.

A single name/value descriptor has the following fields:

@name describes the object whose value is to be returned. E.g.

mnt                    - list of mount parameters
mnt:mountpoint         - the mountpoint of the mount of $ORIGIN
mntns                  - list of mount IDs reachable from the current root
mntns:21:parentid      - parent ID of the mount with ID of 21
xattr:security.selinux - the security.selinux extended attribute
data:foo/bar           - the data contained in file $ORIGIN/foo/bar

If the name starts with the separator, then it is taken to have the same
prefix as the previous name/value descriptor. E.g. in the following
sequence of names the second one is equivalent to mnt:parentid:

mnt:mountpoint
:parentid

@value_in supplies the buffer to store the value(s) in. If a subsequent
name/value descriptor has a NULL value of value_in.iov_base, then the
buffer from the previous name/value descriptor will be used. This way
it's possible to use a shared buffer for multiple values.

The starting address and length of the actual value will be stored in
@value_out, unless an error has occurred, in which case @error will be
set to the positive errno value.

Multiple names starting with the same prefix (including the shorthand
form) may also be batched together under the same lock, so the order of
the names can determine atomicity.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 fs/Makefile                            |   2 +-
 fs/mount.h                             |   8 +
 fs/namespace.c                         |  42 ++
 fs/proc_namespace.c                    |   2 +-
 fs/values.c                            | 524 +++++++++++++++++++++++++
 6 files changed, 577 insertions(+), 2 deletions(-)
 create mode 100644 fs/values.c