Message ID | 20241008121930.869054-1-luca.boccassi@gmail.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v9] pidfd: add ioctl to retrieve pid info | expand |
On Tue, Oct 08, 2024 at 01:18:20PM GMT, luca.boccassi@gmail.com wrote: > From: Luca Boccassi <luca.boccassi@gmail.com> > > A common pattern when using pid fds is having to get information > about the process, which currently requires /proc being mounted, > resolving the fd to a pid, and then do manual string parsing of > /proc/N/status and friends. This needs to be reimplemented over > and over in all userspace projects (e.g.: I have reimplemented > resolving in systemd, dbus, dbus-daemon, polkit so far), and > requires additional care in checking that the fd is still valid > after having parsed the data, to avoid races. > > Having a programmatic API that can be used directly removes all > these requirements, including having /proc mounted. > > As discussed at LPC24, add an ioctl with an extensible struct > so that more parameters can be added later if needed. Start with > returning pid/tgid/ppid and creds unconditionally, and cgroupid > optionally. > > Signed-off-by: Luca Boccassi <luca.boccassi@gmail.com> > --- > v9: drop result_mask and reuse request_mask instead > v8: use RAII guard for rcu, call put_cred() > v7: fix RCU issue and style issue introduced by v6 found by reviewer > v6: use rcu_read_lock() when fetching cgroupid, use task_ppid_nr_ns() to > get the ppid, return ESCHR if any of pid/tgid/ppid are 0 at the end > of the call to avoid providing incomplete data, document what the > callers should expect > v5: check again that the task hasn't exited immediately before copying > the result out to userspace, to ensure we are not returning stale data > add an ifdef around the cgroup structs usage to fix build errors when > the feature is disabled > v4: fix arg check in pidfd_ioctl() by moving it after the new call > v3: switch from pid_vnr() to task_pid_vnr() > v2: Apply comments from Christian, apart from the one about pid namespaces > as I need additional hints on how to implement it. > Drop the security_context string as it is not the appropriate > metadata to give userspace these days. > > fs/pidfs.c | 88 ++++++++++++++++++- > include/uapi/linux/pidfd.h | 30 +++++++ > .../testing/selftests/pidfd/pidfd_open_test.c | 80 ++++++++++++++++- > 3 files changed, 194 insertions(+), 4 deletions(-) > > diff --git a/fs/pidfs.c b/fs/pidfs.c > index 80675b6bf884..15cdc7fe4968 100644 > --- a/fs/pidfs.c > +++ b/fs/pidfs.c > @@ -2,6 +2,7 @@ > #include <linux/anon_inodes.h> > #include <linux/file.h> > #include <linux/fs.h> > +#include <linux/cgroup.h> > #include <linux/magic.h> > #include <linux/mount.h> > #include <linux/pid.h> > @@ -114,6 +115,83 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts) > return poll_flags; > } > > +static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long arg) > +{ > + struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg; > + size_t usize = _IOC_SIZE(cmd); > + struct pidfd_info kinfo = {}; > + struct user_namespace *user_ns; > + const struct cred *c; > + __u64 request_mask; > + > + if (!uinfo) > + return -EINVAL; > + if (usize < sizeof(struct pidfd_info)) > + return -EINVAL; /* First version, no smaller struct possible */ > + > + if (copy_from_user(&request_mask, &uinfo->request_mask, sizeof(request_mask))) > + return -EFAULT; > + > + c = get_task_cred(task); > + if (!c) > + return -ESRCH; > + > + /* Unconditionally return identifiers and credentials, the rest only on request */ > + > + user_ns = current_user_ns(); > + kinfo.ruid = from_kuid_munged(user_ns, c->uid); > + kinfo.rgid = from_kgid_munged(user_ns, c->gid); > + kinfo.euid = from_kuid_munged(user_ns, c->euid); > + kinfo.egid = from_kgid_munged(user_ns, c->egid); > + kinfo.suid = from_kuid_munged(user_ns, c->suid); > + kinfo.sgid = from_kgid_munged(user_ns, c->sgid); > + kinfo.fsuid = from_kuid_munged(user_ns, c->fsuid); > + kinfo.fsgid = from_kgid_munged(user_ns, c->fsgid); > + put_cred(c); > + > +#ifdef CONFIG_CGROUPS > + if (request_mask & PIDFD_INFO_CGROUPID) { > + struct cgroup *cgrp; > + > + guard(rcu)(); > + cgrp = task_cgroup(task, pids_cgrp_id); > + if (!cgrp) > + return -ENODEV; Afaict this means that the task has already exited. In other words, the cgroup id cannot be retrieved anymore for a task that has exited but not been reaped. Frankly, I would have expected the cgroup id to be retrievable until the task has been reaped but that's another discussion. My point is if you contrast this with the other information in here: If the task has exited but hasn't been reaped then you can still get credentials such as *uid/*gid, and pid namespace relative information such as pid/tgid/ppid. So really, I would argue that you don't want to fail this but only report 0 here. That's me working under the assumption that cgroup ids start from 1... /me checks Yes, they start from 1 so 0 is invalid. > + kinfo.cgroupid = cgroup_id(cgrp); Fwiw, it looks like getting the cgroup id is basically just dereferencing pointers without having to hold any meaningful locks. So it should be fast. So making it unconditional seems fine to me. > + > + kinfo.request_mask |= PIDFD_INFO_CGROUPID; > + } > +#endif > + > + /* > + * Copy pid/tgid last, to reduce the chances the information might be > + * stale. Note that it is not possible to ensure it will be valid as the > + * task might return as soon as the copy_to_user finishes, but that's ok > + * and userspace expects that might happen and can act accordingly, so > + * this is just best-effort. What we can do however is checking that all > + * the fields are set correctly, or return ESRCH to avoid providing > + * incomplete information. */ > + > + kinfo.ppid = task_ppid_nr_ns(task, NULL); > + kinfo.tgid = task_tgid_vnr(task); > + kinfo.pid = task_pid_vnr(task); > + > + if (kinfo.pid == 0 || kinfo.tgid == 0 || (kinfo.ppid == 0 && kinfo.pid != 1)) > + return -ESRCH; > + > + /* > + * If userspace and the kernel have the same struct size it can just > + * be copied. If userspace provides an older struct, only the bits that > + * userspace knows about will be copied. If userspace provides a new > + * struct, only the bits that the kernel knows about will be copied and > + * the size value will be set to the size the kernel knows about. > + */ > + if (copy_to_user(uinfo, &kinfo, min(usize, sizeof(kinfo)))) > + return -EFAULT; > + > + return 0; > +} > + > static long pidfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg) > { > struct task_struct *task __free(put_task) = NULL; > @@ -122,13 +200,17 @@ static long pidfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg) > struct ns_common *ns_common = NULL; > struct pid_namespace *pid_ns; > > - if (arg) > - return -EINVAL; > - > task = get_pid_task(pid, PIDTYPE_PID); > if (!task) > return -ESRCH; > > + /* Extensible IOCTL that does not open namespace FDs, take a shortcut */ > + if (_IOC_NR(cmd) == _IOC_NR(PIDFD_GET_INFO)) > + return pidfd_info(task, cmd, arg); > + > + if (arg) > + return -EINVAL; > + > scoped_guard(task_lock, task) { > nsp = task->nsproxy; > if (nsp) > diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h > index 565fc0629fff..d685eeeedc51 100644 > --- a/include/uapi/linux/pidfd.h > +++ b/include/uapi/linux/pidfd.h > @@ -16,6 +16,35 @@ > #define PIDFD_SIGNAL_THREAD_GROUP (1UL << 1) > #define PIDFD_SIGNAL_PROCESS_GROUP (1UL << 2) > > +/* Flags for pidfd_info. */ > +#define PIDFD_INFO_CGROUPID (1UL << 0) > + > +struct pidfd_info { > + /* Let userspace request expensive stuff explictly, and let the kernel > + * indicate whether it knows about it. */ > + __u64 request_mask; > + /* > + * The information contained in the following fields might be stale at the > + * time it is received, as the target process might have exited as soon as > + * the IOCTL was processed, and there is no way to avoid that. However, it > + * is guaranteed that if the call was successful, then the information was > + * correct and referred to the intended process at the time the work was > + * performed. */ > + __u64 cgroupid; > + __u32 pid; > + __u32 tgid; > + __u32 ppid; > + __u32 ruid; > + __u32 rgid; > + __u32 euid; > + __u32 egid; > + __u32 suid; > + __u32 sgid; > + __u32 fsuid; > + __u32 fsgid; > + __u32 spare0[1]; > +}; > + > #define PIDFS_IOCTL_MAGIC 0xFF > > #define PIDFD_GET_CGROUP_NAMESPACE _IO(PIDFS_IOCTL_MAGIC, 1) > @@ -28,5 +57,6 @@ > #define PIDFD_GET_TIME_FOR_CHILDREN_NAMESPACE _IO(PIDFS_IOCTL_MAGIC, 8) > #define PIDFD_GET_USER_NAMESPACE _IO(PIDFS_IOCTL_MAGIC, 9) > #define PIDFD_GET_UTS_NAMESPACE _IO(PIDFS_IOCTL_MAGIC, 10) > +#define PIDFD_GET_INFO _IOWR(PIDFS_IOCTL_MAGIC, 11, struct pidfd_info) > > #endif /* _UAPI_LINUX_PIDFD_H */ > diff --git a/tools/testing/selftests/pidfd/pidfd_open_test.c b/tools/testing/selftests/pidfd/pidfd_open_test.c > index c62564c264b1..b2a8cfb19a74 100644 > --- a/tools/testing/selftests/pidfd/pidfd_open_test.c > +++ b/tools/testing/selftests/pidfd/pidfd_open_test.c > @@ -13,6 +13,7 @@ > #include <stdlib.h> > #include <string.h> > #include <syscall.h> > +#include <sys/ioctl.h> > #include <sys/mount.h> > #include <sys/prctl.h> > #include <sys/wait.h> > @@ -21,6 +22,34 @@ > #include "pidfd.h" > #include "../kselftest.h" > > +#ifndef PIDFS_IOCTL_MAGIC > +#define PIDFS_IOCTL_MAGIC 0xFF > +#endif > + > +#ifndef PIDFD_GET_INFO > +#define PIDFD_GET_INFO _IOWR(PIDFS_IOCTL_MAGIC, 11, struct pidfd_info) > +#define PIDFD_INFO_CGROUPID (1UL << 0) > + > +struct pidfd_info { > + /* Let userspace request expensive stuff explictly, and let the kernel > + * indicate whether it knows about it. */ > + __u64 request_mask; > + __u64 cgroupid; > + __u32 pid; > + __u32 tgid; > + __u32 ppid; > + __u32 ruid; > + __u32 rgid; > + __u32 euid; > + __u32 egid; > + __u32 suid; > + __u32 sgid; > + __u32 fsuid; > + __u32 fsgid; > + __u32 spare0[1]; > +}; > +#endif > + > static int safe_int(const char *numstr, int *converted) > { > char *err = NULL; > @@ -120,10 +149,13 @@ static pid_t get_pid_from_fdinfo_file(int pidfd, const char *key, size_t keylen) > > int main(int argc, char **argv) > { > + struct pidfd_info info = { > + .request_mask = PIDFD_INFO_CGROUPID, > + }; > int pidfd = -1, ret = 1; > pid_t pid; > > - ksft_set_plan(3); > + ksft_set_plan(4); > > pidfd = sys_pidfd_open(-1, 0); > if (pidfd >= 0) { > @@ -153,6 +185,52 @@ int main(int argc, char **argv) > pid = get_pid_from_fdinfo_file(pidfd, "Pid:", sizeof("Pid:") - 1); > ksft_print_msg("pidfd %d refers to process with pid %d\n", pidfd, pid); > > + if (ioctl(pidfd, PIDFD_GET_INFO, &info) < 0) { > + ksft_print_msg("%s - failed to get info from pidfd\n", strerror(errno)); > + goto on_error; > + } > + if (info.pid != pid) { > + ksft_print_msg("pid from fdinfo file %d does not match pid from ioctl %d\n", > + pid, info.pid); > + goto on_error; > + } > + if (info.ppid != getppid()) { > + ksft_print_msg("ppid %d does not match ppid from ioctl %d\n", > + pid, info.pid); > + goto on_error; > + } > + if (info.ruid != getuid()) { > + ksft_print_msg("uid %d does not match uid from ioctl %d\n", > + getuid(), info.ruid); > + goto on_error; > + } > + if (info.rgid != getgid()) { > + ksft_print_msg("gid %d does not match gid from ioctl %d\n", > + getgid(), info.rgid); > + goto on_error; > + } > + if (info.euid != geteuid()) { > + ksft_print_msg("euid %d does not match euid from ioctl %d\n", > + geteuid(), info.euid); > + goto on_error; > + } > + if (info.egid != getegid()) { > + ksft_print_msg("egid %d does not match egid from ioctl %d\n", > + getegid(), info.egid); > + goto on_error; > + } > + if (info.suid != geteuid()) { > + ksft_print_msg("suid %d does not match suid from ioctl %d\n", > + geteuid(), info.suid); > + goto on_error; > + } > + if (info.sgid != getegid()) { > + ksft_print_msg("sgid %d does not match sgid from ioctl %d\n", > + getegid(), info.sgid); > + goto on_error; > + } > + ksft_test_result_pass("get info from pidfd test: passed\n"); > + > ret = 0; > > on_error: > > base-commit: 8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b > -- > 2.45.2 >
On Tue, 8 Oct 2024 at 14:06, Christian Brauner <brauner@kernel.org> wrote: > > On Tue, Oct 08, 2024 at 01:18:20PM GMT, luca.boccassi@gmail.com wrote: > > From: Luca Boccassi <luca.boccassi@gmail.com> > > > > A common pattern when using pid fds is having to get information > > about the process, which currently requires /proc being mounted, > > resolving the fd to a pid, and then do manual string parsing of > > /proc/N/status and friends. This needs to be reimplemented over > > and over in all userspace projects (e.g.: I have reimplemented > > resolving in systemd, dbus, dbus-daemon, polkit so far), and > > requires additional care in checking that the fd is still valid > > after having parsed the data, to avoid races. > > > > Having a programmatic API that can be used directly removes all > > these requirements, including having /proc mounted. > > > > As discussed at LPC24, add an ioctl with an extensible struct > > so that more parameters can be added later if needed. Start with > > returning pid/tgid/ppid and creds unconditionally, and cgroupid > > optionally. > > > > Signed-off-by: Luca Boccassi <luca.boccassi@gmail.com> > > --- > > v9: drop result_mask and reuse request_mask instead > > v8: use RAII guard for rcu, call put_cred() > > v7: fix RCU issue and style issue introduced by v6 found by reviewer > > v6: use rcu_read_lock() when fetching cgroupid, use task_ppid_nr_ns() to > > get the ppid, return ESCHR if any of pid/tgid/ppid are 0 at the end > > of the call to avoid providing incomplete data, document what the > > callers should expect > > v5: check again that the task hasn't exited immediately before copying > > the result out to userspace, to ensure we are not returning stale data > > add an ifdef around the cgroup structs usage to fix build errors when > > the feature is disabled > > v4: fix arg check in pidfd_ioctl() by moving it after the new call > > v3: switch from pid_vnr() to task_pid_vnr() > > v2: Apply comments from Christian, apart from the one about pid namespaces > > as I need additional hints on how to implement it. > > Drop the security_context string as it is not the appropriate > > metadata to give userspace these days. > > > > fs/pidfs.c | 88 ++++++++++++++++++- > > include/uapi/linux/pidfd.h | 30 +++++++ > > .../testing/selftests/pidfd/pidfd_open_test.c | 80 ++++++++++++++++- > > 3 files changed, 194 insertions(+), 4 deletions(-) > > > > diff --git a/fs/pidfs.c b/fs/pidfs.c > > index 80675b6bf884..15cdc7fe4968 100644 > > --- a/fs/pidfs.c > > +++ b/fs/pidfs.c > > @@ -2,6 +2,7 @@ > > #include <linux/anon_inodes.h> > > #include <linux/file.h> > > #include <linux/fs.h> > > +#include <linux/cgroup.h> > > #include <linux/magic.h> > > #include <linux/mount.h> > > #include <linux/pid.h> > > @@ -114,6 +115,83 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts) > > return poll_flags; > > } > > > > +static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long arg) > > +{ > > + struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg; > > + size_t usize = _IOC_SIZE(cmd); > > + struct pidfd_info kinfo = {}; > > + struct user_namespace *user_ns; > > + const struct cred *c; > > + __u64 request_mask; > > + > > + if (!uinfo) > > + return -EINVAL; > > + if (usize < sizeof(struct pidfd_info)) > > + return -EINVAL; /* First version, no smaller struct possible */ > > + > > + if (copy_from_user(&request_mask, &uinfo->request_mask, sizeof(request_mask))) > > + return -EFAULT; > > + > > + c = get_task_cred(task); > > + if (!c) > > + return -ESRCH; > > + > > + /* Unconditionally return identifiers and credentials, the rest only on request */ > > + > > + user_ns = current_user_ns(); > > + kinfo.ruid = from_kuid_munged(user_ns, c->uid); > > + kinfo.rgid = from_kgid_munged(user_ns, c->gid); > > + kinfo.euid = from_kuid_munged(user_ns, c->euid); > > + kinfo.egid = from_kgid_munged(user_ns, c->egid); > > + kinfo.suid = from_kuid_munged(user_ns, c->suid); > > + kinfo.sgid = from_kgid_munged(user_ns, c->sgid); > > + kinfo.fsuid = from_kuid_munged(user_ns, c->fsuid); > > + kinfo.fsgid = from_kgid_munged(user_ns, c->fsgid); > > + put_cred(c); > > + > > +#ifdef CONFIG_CGROUPS > > + if (request_mask & PIDFD_INFO_CGROUPID) { > > + struct cgroup *cgrp; > > + > > + guard(rcu)(); > > + cgrp = task_cgroup(task, pids_cgrp_id); > > + if (!cgrp) > > + return -ENODEV; > > Afaict this means that the task has already exited. In other words, the > cgroup id cannot be retrieved anymore for a task that has exited but not > been reaped. Frankly, I would have expected the cgroup id to be > retrievable until the task has been reaped but that's another > discussion. > > My point is if you contrast this with the other information in here: If > the task has exited but hasn't been reaped then you can still get > credentials such as *uid/*gid, and pid namespace relative information > such as pid/tgid/ppid. > > So really, I would argue that you don't want to fail this but only > report 0 here. That's me working under the assumption that cgroup ids > start from 1... > > /me checks > > Yes, they start from 1 so 0 is invalid. > > > + kinfo.cgroupid = cgroup_id(cgrp); > > Fwiw, it looks like getting the cgroup id is basically just > dereferencing pointers without having to hold any meaningful locks. So > it should be fast. So making it unconditional seems fine to me. There's an ifdef since it's an optional kernel feature, and there's also this thing that we might not have it, so I'd rather keep the flag, and set it only if we can get the information (instead of failing). As a user that seems clearer to me.
> > +#ifdef CONFIG_CGROUPS > > + if (request_mask & PIDFD_INFO_CGROUPID) { > > + struct cgroup *cgrp; > > + > > + guard(rcu)(); > > + cgrp = task_cgroup(task, pids_cgrp_id); > > + if (!cgrp) > > + return -ENODEV; > > Afaict this means that the task has already exited. In other words, the > cgroup id cannot be retrieved anymore for a task that has exited but not > been reaped. Frankly, I would have expected the cgroup id to be > retrievable until the task has been reaped but that's another > discussion. > > My point is if you contrast this with the other information in here: If > the task has exited but hasn't been reaped then you can still get > credentials such as *uid/*gid, and pid namespace relative information > such as pid/tgid/ppid. Related to this and I just want to dump this idea somewhere: I'm aware that it is often desirable or useful to have information about a task around even after the task has exited and been reaped. The exit status comes to mind but maybe there's other stuff that would be useful to have. Since we changed to pidfs we know that all pidfds no matter if they point to the same struct file (e.g., dup()) or to multiple struct files (e.g., multiple pidfd_open() on the same pid) all point to the same dentry and inode. Which is why we switched to stashing struct pid in inode->i_private. So we could easily do something like this: // SKETCH SKETCH SKETCH diff --git a/fs/pidfs.c b/fs/pidfs.c index 7ffdc88dfb52..eeeb907f4889 100644 --- a/fs/pidfs.c +++ b/fs/pidfs.c @@ -344,9 +344,24 @@ static const struct dentry_operations pidfs_dentry_operations = { .d_prune = stashed_dentry_prune, }; +struct pidfd_kinfo { + // We could even move this back to file->private_data to avoid the + // additional pointer deref though I doubt it matters. + struct pid *pid; + int exit_status; + // other stuff; +}; + static int pidfs_init_inode(struct inode *inode, void *data) { - inode->i_private = data; + struct pidfd_kinfo *kinfo; + + kinfo = kzalloc(sizeof(*info), GFP_KERNEL_ACCOUNT); + if (!kinfo) + return -ENOMEM; + + kinfo->pid = data; + inode->i_private = kinfo; inode->i_flags |= S_PRIVATE; inode->i_mode |= S_IRWXU; inode->i_op = &pidfs_inode_operations; and that enables us to persist information past task exit so that as long as you hold the pidfd you can e.g., query for the exit state of the task or something. I'm mostly thinking out loud but I think that could be potentially interesting.
luca.boccassi@gmail.com writes: > As discussed at LPC24, add an ioctl with an extensible struct > so that more parameters can be added later if needed. Start with > returning pid/tgid/ppid and creds unconditionally, and cgroupid > optionally. I was looking this over, and a couple of questions came to mind... > Signed-off-by: Luca Boccassi <luca.boccassi@gmail.com> > --- [...] > diff --git a/fs/pidfs.c b/fs/pidfs.c > index 80675b6bf884..15cdc7fe4968 100644 > --- a/fs/pidfs.c > +++ b/fs/pidfs.c > @@ -2,6 +2,7 @@ > #include <linux/anon_inodes.h> > #include <linux/file.h> > #include <linux/fs.h> > +#include <linux/cgroup.h> > #include <linux/magic.h> > #include <linux/mount.h> > #include <linux/pid.h> > @@ -114,6 +115,83 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts) > return poll_flags; > } > > +static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long arg) > +{ > + struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg; > + size_t usize = _IOC_SIZE(cmd); > + struct pidfd_info kinfo = {}; > + struct user_namespace *user_ns; > + const struct cred *c; > + __u64 request_mask; > + > + if (!uinfo) > + return -EINVAL; > + if (usize < sizeof(struct pidfd_info)) > + return -EINVAL; /* First version, no smaller struct possible */ > + > + if (copy_from_user(&request_mask, &uinfo->request_mask, sizeof(request_mask))) > + return -EFAULT; You don't check request_mask for unrecognized flags, so user space will not get an error if it puts random gunk there. That, in turn, can make it harder to add new options in the future. > + c = get_task_cred(task); > + if (!c) > + return -ESRCH; [...] > + > + /* > + * If userspace and the kernel have the same struct size it can just > + * be copied. If userspace provides an older struct, only the bits that > + * userspace knows about will be copied. If userspace provides a new > + * struct, only the bits that the kernel knows about will be copied and > + * the size value will be set to the size the kernel knows about. > + */ > + if (copy_to_user(uinfo, &kinfo, min(usize, sizeof(kinfo)))) > + return -EFAULT; Which "size value" are you referring to here; I can't see it. If user space has a bigger struct, should you perhaps zero-fill the part the kernel doesn't know about? > + return 0; > +} Thanks, jon
On 2024-10-08, luca.boccassi@gmail.com <luca.boccassi@gmail.com> wrote: > From: Luca Boccassi <luca.boccassi@gmail.com> > > A common pattern when using pid fds is having to get information > about the process, which currently requires /proc being mounted, > resolving the fd to a pid, and then do manual string parsing of > /proc/N/status and friends. This needs to be reimplemented over > and over in all userspace projects (e.g.: I have reimplemented > resolving in systemd, dbus, dbus-daemon, polkit so far), and > requires additional care in checking that the fd is still valid > after having parsed the data, to avoid races. > > Having a programmatic API that can be used directly removes all > these requirements, including having /proc mounted. > > As discussed at LPC24, add an ioctl with an extensible struct > so that more parameters can be added later if needed. Start with > returning pid/tgid/ppid and creds unconditionally, and cgroupid > optionally. > > Signed-off-by: Luca Boccassi <luca.boccassi@gmail.com> > --- > v9: drop result_mask and reuse request_mask instead > v8: use RAII guard for rcu, call put_cred() > v7: fix RCU issue and style issue introduced by v6 found by reviewer > v6: use rcu_read_lock() when fetching cgroupid, use task_ppid_nr_ns() to > get the ppid, return ESCHR if any of pid/tgid/ppid are 0 at the end > of the call to avoid providing incomplete data, document what the > callers should expect > v5: check again that the task hasn't exited immediately before copying > the result out to userspace, to ensure we are not returning stale data > add an ifdef around the cgroup structs usage to fix build errors when > the feature is disabled > v4: fix arg check in pidfd_ioctl() by moving it after the new call > v3: switch from pid_vnr() to task_pid_vnr() > v2: Apply comments from Christian, apart from the one about pid namespaces > as I need additional hints on how to implement it. > Drop the security_context string as it is not the appropriate > metadata to give userspace these days. > > fs/pidfs.c | 88 ++++++++++++++++++- > include/uapi/linux/pidfd.h | 30 +++++++ > .../testing/selftests/pidfd/pidfd_open_test.c | 80 ++++++++++++++++- > 3 files changed, 194 insertions(+), 4 deletions(-) > > diff --git a/fs/pidfs.c b/fs/pidfs.c > index 80675b6bf884..15cdc7fe4968 100644 > --- a/fs/pidfs.c > +++ b/fs/pidfs.c > @@ -2,6 +2,7 @@ > #include <linux/anon_inodes.h> > #include <linux/file.h> > #include <linux/fs.h> > +#include <linux/cgroup.h> > #include <linux/magic.h> > #include <linux/mount.h> > #include <linux/pid.h> > @@ -114,6 +115,83 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts) > return poll_flags; > } > > +static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long arg) > +{ > + struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg; > + size_t usize = _IOC_SIZE(cmd); > + struct pidfd_info kinfo = {}; > + struct user_namespace *user_ns; > + const struct cred *c; > + __u64 request_mask; > + > + if (!uinfo) > + return -EINVAL; > + if (usize < sizeof(struct pidfd_info)) > + return -EINVAL; /* First version, no smaller struct possible */ > + > + if (copy_from_user(&request_mask, &uinfo->request_mask, sizeof(request_mask))) > + return -EFAULT; > + > + c = get_task_cred(task); > + if (!c) > + return -ESRCH; > + > + /* Unconditionally return identifiers and credentials, the rest only on request */ > + > + user_ns = current_user_ns(); > + kinfo.ruid = from_kuid_munged(user_ns, c->uid); > + kinfo.rgid = from_kgid_munged(user_ns, c->gid); > + kinfo.euid = from_kuid_munged(user_ns, c->euid); > + kinfo.egid = from_kgid_munged(user_ns, c->egid); > + kinfo.suid = from_kuid_munged(user_ns, c->suid); > + kinfo.sgid = from_kgid_munged(user_ns, c->sgid); > + kinfo.fsuid = from_kuid_munged(user_ns, c->fsuid); > + kinfo.fsgid = from_kgid_munged(user_ns, c->fsgid); > + put_cred(c); > + > +#ifdef CONFIG_CGROUPS > + if (request_mask & PIDFD_INFO_CGROUPID) { > + struct cgroup *cgrp; > + > + guard(rcu)(); > + cgrp = task_cgroup(task, pids_cgrp_id); > + if (!cgrp) > + return -ENODEV; > + kinfo.cgroupid = cgroup_id(cgrp); > + > + kinfo.request_mask |= PIDFD_INFO_CGROUPID; > + } > +#endif > + > + /* > + * Copy pid/tgid last, to reduce the chances the information might be > + * stale. Note that it is not possible to ensure it will be valid as the > + * task might return as soon as the copy_to_user finishes, but that's ok > + * and userspace expects that might happen and can act accordingly, so > + * this is just best-effort. What we can do however is checking that all > + * the fields are set correctly, or return ESRCH to avoid providing > + * incomplete information. */ > + > + kinfo.ppid = task_ppid_nr_ns(task, NULL); > + kinfo.tgid = task_tgid_vnr(task); > + kinfo.pid = task_pid_vnr(task); > + > + if (kinfo.pid == 0 || kinfo.tgid == 0 || (kinfo.ppid == 0 && kinfo.pid != 1)) > + return -ESRCH; > + > + /* > + * If userspace and the kernel have the same struct size it can just > + * be copied. If userspace provides an older struct, only the bits that > + * userspace knows about will be copied. If userspace provides a new > + * struct, only the bits that the kernel knows about will be copied and > + * the size value will be set to the size the kernel knows about. > + */ > + if (copy_to_user(uinfo, &kinfo, min(usize, sizeof(kinfo)))) > + return -EFAULT; If usize > ksize, we also want to clear_user() the trailing bytes to avoid userspace thinking that any garbage bytes they had are valid. Also, you mention "the size value" but there is no size in pidfd_info. I don't think it's actually necessary to include such a field (especially when you have a statx-like request_mask), but it means you really should clear the trailing bytes to avoid userspace bugs. I implemented all of these semantics as copy_struct_to_user() in the CHECK_FIELDS patch I sent a few weeks ago (I just sent v3[1]). Maybe you can cherry-pick this patch and use it? The semantics when we extend this pidfd_info to accept new request_mask values with larger structures is going to get a little ugly and copy_struct_to_user() makes this a little easier to deal with. [1]: https://lore.kernel.org/all/20241010-extensible-structs-check_fields-v3-1-d2833dfe6edd@cyphar.com/ > + > + return 0; > +} > + > static long pidfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg) > { > struct task_struct *task __free(put_task) = NULL; > @@ -122,13 +200,17 @@ static long pidfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg) > struct ns_common *ns_common = NULL; > struct pid_namespace *pid_ns; > > - if (arg) > - return -EINVAL; > - > task = get_pid_task(pid, PIDTYPE_PID); > if (!task) > return -ESRCH; > > + /* Extensible IOCTL that does not open namespace FDs, take a shortcut */ > + if (_IOC_NR(cmd) == _IOC_NR(PIDFD_GET_INFO)) > + return pidfd_info(task, cmd, arg); > + > + if (arg) > + return -EINVAL; > + > scoped_guard(task_lock, task) { > nsp = task->nsproxy; > if (nsp) > diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h > index 565fc0629fff..d685eeeedc51 100644 > --- a/include/uapi/linux/pidfd.h > +++ b/include/uapi/linux/pidfd.h > @@ -16,6 +16,35 @@ > #define PIDFD_SIGNAL_THREAD_GROUP (1UL << 1) > #define PIDFD_SIGNAL_PROCESS_GROUP (1UL << 2) > > +/* Flags for pidfd_info. */ > +#define PIDFD_INFO_CGROUPID (1UL << 0) While it isn't strictly necessary, maybe we should provide some always-set bits like statx does? While they would always be set, it might incentivise programs to write code that checks if the request_mask bits are set after the ioctl(2) returns from the outset. Then again, PIDFD_INFO_CGROUPID is probably enough to justify writing code correctly from the outset. > + > +struct pidfd_info { > + /* Let userspace request expensive stuff explictly, and let the kernel > + * indicate whether it knows about it. */ I would prefer a slightly more informative comment (which mentions that this will also be used for extensibility), something like: /* * This mask is similar to the request_mask in statx(2). * * Userspace indicates what extensions or expensive-to-calculate fields * they want by setting the corresponding bits in request_mask. * * When filling the structure, the kernel will only set bits * corresponding to the fields that were actually filled by the kernel. * This also includes any future extensions that might be automatically * filled. If the structure size is too small to contain a field * (requested or not), to avoid confusion the request_mask will not * contain a bit for that field. * * As such, userspace MUST verify that request_mask contains the * corresponding flags after the ioctl(2) returns to ensure that it is * using valid data. */ The bit about request_mask not containing a bit if the structure size is too small is one of the things that copy_struct_to_user() helps with implementing (see the patch for some more docs). > + __u64 request_mask; > + /* > + * The information contained in the following fields might be stale at the > + * time it is received, as the target process might have exited as soon as > + * the IOCTL was processed, and there is no way to avoid that. However, it > + * is guaranteed that if the call was successful, then the information was > + * correct and referred to the intended process at the time the work was > + * performed. */ > + __u64 cgroupid; > + __u32 pid; > + __u32 tgid; > + __u32 ppid; > + __u32 ruid; > + __u32 rgid; > + __u32 euid; > + __u32 egid; > + __u32 suid; > + __u32 sgid; > + __u32 fsuid; > + __u32 fsgid; > + __u32 spare0[1]; > +}; > + > #define PIDFS_IOCTL_MAGIC 0xFF > > #define PIDFD_GET_CGROUP_NAMESPACE _IO(PIDFS_IOCTL_MAGIC, 1) > @@ -28,5 +57,6 @@ > #define PIDFD_GET_TIME_FOR_CHILDREN_NAMESPACE _IO(PIDFS_IOCTL_MAGIC, 8) > #define PIDFD_GET_USER_NAMESPACE _IO(PIDFS_IOCTL_MAGIC, 9) > #define PIDFD_GET_UTS_NAMESPACE _IO(PIDFS_IOCTL_MAGIC, 10) > +#define PIDFD_GET_INFO _IOWR(PIDFS_IOCTL_MAGIC, 11, struct pidfd_info) > > #endif /* _UAPI_LINUX_PIDFD_H */ > diff --git a/tools/testing/selftests/pidfd/pidfd_open_test.c b/tools/testing/selftests/pidfd/pidfd_open_test.c > index c62564c264b1..b2a8cfb19a74 100644 > --- a/tools/testing/selftests/pidfd/pidfd_open_test.c > +++ b/tools/testing/selftests/pidfd/pidfd_open_test.c > @@ -13,6 +13,7 @@ > #include <stdlib.h> > #include <string.h> > #include <syscall.h> > +#include <sys/ioctl.h> > #include <sys/mount.h> > #include <sys/prctl.h> > #include <sys/wait.h> > @@ -21,6 +22,34 @@ > #include "pidfd.h" > #include "../kselftest.h" > > +#ifndef PIDFS_IOCTL_MAGIC > +#define PIDFS_IOCTL_MAGIC 0xFF > +#endif > + > +#ifndef PIDFD_GET_INFO > +#define PIDFD_GET_INFO _IOWR(PIDFS_IOCTL_MAGIC, 11, struct pidfd_info) > +#define PIDFD_INFO_CGROUPID (1UL << 0) > + > +struct pidfd_info { > + /* Let userspace request expensive stuff explictly, and let the kernel > + * indicate whether it knows about it. */ > + __u64 request_mask; > + __u64 cgroupid; > + __u32 pid; > + __u32 tgid; > + __u32 ppid; > + __u32 ruid; > + __u32 rgid; > + __u32 euid; > + __u32 egid; > + __u32 suid; > + __u32 sgid; > + __u32 fsuid; > + __u32 fsgid; > + __u32 spare0[1]; > +}; > +#endif > + > static int safe_int(const char *numstr, int *converted) > { > char *err = NULL; > @@ -120,10 +149,13 @@ static pid_t get_pid_from_fdinfo_file(int pidfd, const char *key, size_t keylen) > > int main(int argc, char **argv) > { > + struct pidfd_info info = { > + .request_mask = PIDFD_INFO_CGROUPID, > + }; > int pidfd = -1, ret = 1; > pid_t pid; > > - ksft_set_plan(3); > + ksft_set_plan(4); > > pidfd = sys_pidfd_open(-1, 0); > if (pidfd >= 0) { > @@ -153,6 +185,52 @@ int main(int argc, char **argv) > pid = get_pid_from_fdinfo_file(pidfd, "Pid:", sizeof("Pid:") - 1); > ksft_print_msg("pidfd %d refers to process with pid %d\n", pidfd, pid); > > + if (ioctl(pidfd, PIDFD_GET_INFO, &info) < 0) { > + ksft_print_msg("%s - failed to get info from pidfd\n", strerror(errno)); > + goto on_error; > + } > + if (info.pid != pid) { > + ksft_print_msg("pid from fdinfo file %d does not match pid from ioctl %d\n", > + pid, info.pid); > + goto on_error; > + } > + if (info.ppid != getppid()) { > + ksft_print_msg("ppid %d does not match ppid from ioctl %d\n", > + pid, info.pid); > + goto on_error; > + } > + if (info.ruid != getuid()) { > + ksft_print_msg("uid %d does not match uid from ioctl %d\n", > + getuid(), info.ruid); > + goto on_error; > + } > + if (info.rgid != getgid()) { > + ksft_print_msg("gid %d does not match gid from ioctl %d\n", > + getgid(), info.rgid); > + goto on_error; > + } > + if (info.euid != geteuid()) { > + ksft_print_msg("euid %d does not match euid from ioctl %d\n", > + geteuid(), info.euid); > + goto on_error; > + } > + if (info.egid != getegid()) { > + ksft_print_msg("egid %d does not match egid from ioctl %d\n", > + getegid(), info.egid); > + goto on_error; > + } > + if (info.suid != geteuid()) { > + ksft_print_msg("suid %d does not match suid from ioctl %d\n", > + geteuid(), info.suid); > + goto on_error; > + } > + if (info.sgid != getegid()) { > + ksft_print_msg("sgid %d does not match sgid from ioctl %d\n", > + getegid(), info.sgid); > + goto on_error; > + } > + ksft_test_result_pass("get info from pidfd test: passed\n"); > + > ret = 0; > > on_error: > > base-commit: 8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b > -- > 2.45.2 > >
On 2024-10-08, Luca Boccassi <luca.boccassi@gmail.com> wrote: > On Tue, 8 Oct 2024 at 14:06, Christian Brauner <brauner@kernel.org> wrote: > > > > On Tue, Oct 08, 2024 at 01:18:20PM GMT, luca.boccassi@gmail.com wrote: > > > From: Luca Boccassi <luca.boccassi@gmail.com> > > > > > > A common pattern when using pid fds is having to get information > > > about the process, which currently requires /proc being mounted, > > > resolving the fd to a pid, and then do manual string parsing of > > > /proc/N/status and friends. This needs to be reimplemented over > > > and over in all userspace projects (e.g.: I have reimplemented > > > resolving in systemd, dbus, dbus-daemon, polkit so far), and > > > requires additional care in checking that the fd is still valid > > > after having parsed the data, to avoid races. > > > > > > Having a programmatic API that can be used directly removes all > > > these requirements, including having /proc mounted. > > > > > > As discussed at LPC24, add an ioctl with an extensible struct > > > so that more parameters can be added later if needed. Start with > > > returning pid/tgid/ppid and creds unconditionally, and cgroupid > > > optionally. > > > > > > Signed-off-by: Luca Boccassi <luca.boccassi@gmail.com> > > > --- > > > v9: drop result_mask and reuse request_mask instead > > > v8: use RAII guard for rcu, call put_cred() > > > v7: fix RCU issue and style issue introduced by v6 found by reviewer > > > v6: use rcu_read_lock() when fetching cgroupid, use task_ppid_nr_ns() to > > > get the ppid, return ESCHR if any of pid/tgid/ppid are 0 at the end > > > of the call to avoid providing incomplete data, document what the > > > callers should expect > > > v5: check again that the task hasn't exited immediately before copying > > > the result out to userspace, to ensure we are not returning stale data > > > add an ifdef around the cgroup structs usage to fix build errors when > > > the feature is disabled > > > v4: fix arg check in pidfd_ioctl() by moving it after the new call > > > v3: switch from pid_vnr() to task_pid_vnr() > > > v2: Apply comments from Christian, apart from the one about pid namespaces > > > as I need additional hints on how to implement it. > > > Drop the security_context string as it is not the appropriate > > > metadata to give userspace these days. > > > > > > fs/pidfs.c | 88 ++++++++++++++++++- > > > include/uapi/linux/pidfd.h | 30 +++++++ > > > .../testing/selftests/pidfd/pidfd_open_test.c | 80 ++++++++++++++++- > > > 3 files changed, 194 insertions(+), 4 deletions(-) > > > > > > diff --git a/fs/pidfs.c b/fs/pidfs.c > > > index 80675b6bf884..15cdc7fe4968 100644 > > > --- a/fs/pidfs.c > > > +++ b/fs/pidfs.c > > > @@ -2,6 +2,7 @@ > > > #include <linux/anon_inodes.h> > > > #include <linux/file.h> > > > #include <linux/fs.h> > > > +#include <linux/cgroup.h> > > > #include <linux/magic.h> > > > #include <linux/mount.h> > > > #include <linux/pid.h> > > > @@ -114,6 +115,83 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts) > > > return poll_flags; > > > } > > > > > > +static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long arg) > > > +{ > > > + struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg; > > > + size_t usize = _IOC_SIZE(cmd); > > > + struct pidfd_info kinfo = {}; > > > + struct user_namespace *user_ns; > > > + const struct cred *c; > > > + __u64 request_mask; > > > + > > > + if (!uinfo) > > > + return -EINVAL; > > > + if (usize < sizeof(struct pidfd_info)) > > > + return -EINVAL; /* First version, no smaller struct possible */ > > > + > > > + if (copy_from_user(&request_mask, &uinfo->request_mask, sizeof(request_mask))) > > > + return -EFAULT; > > > + > > > + c = get_task_cred(task); > > > + if (!c) > > > + return -ESRCH; > > > + > > > + /* Unconditionally return identifiers and credentials, the rest only on request */ > > > + > > > + user_ns = current_user_ns(); > > > + kinfo.ruid = from_kuid_munged(user_ns, c->uid); > > > + kinfo.rgid = from_kgid_munged(user_ns, c->gid); > > > + kinfo.euid = from_kuid_munged(user_ns, c->euid); > > > + kinfo.egid = from_kgid_munged(user_ns, c->egid); > > > + kinfo.suid = from_kuid_munged(user_ns, c->suid); > > > + kinfo.sgid = from_kgid_munged(user_ns, c->sgid); > > > + kinfo.fsuid = from_kuid_munged(user_ns, c->fsuid); > > > + kinfo.fsgid = from_kgid_munged(user_ns, c->fsgid); > > > + put_cred(c); > > > + > > > +#ifdef CONFIG_CGROUPS > > > + if (request_mask & PIDFD_INFO_CGROUPID) { > > > + struct cgroup *cgrp; > > > + > > > + guard(rcu)(); > > > + cgrp = task_cgroup(task, pids_cgrp_id); > > > + if (!cgrp) > > > + return -ENODEV; > > > > Afaict this means that the task has already exited. In other words, the > > cgroup id cannot be retrieved anymore for a task that has exited but not > > been reaped. Frankly, I would have expected the cgroup id to be > > retrievable until the task has been reaped but that's another > > discussion. > > > > My point is if you contrast this with the other information in here: If > > the task has exited but hasn't been reaped then you can still get > > credentials such as *uid/*gid, and pid namespace relative information > > such as pid/tgid/ppid. > > > > So really, I would argue that you don't want to fail this but only > > report 0 here. That's me working under the assumption that cgroup ids > > start from 1... > > > > /me checks > > > > Yes, they start from 1 so 0 is invalid. > > > > > + kinfo.cgroupid = cgroup_id(cgrp); > > > > Fwiw, it looks like getting the cgroup id is basically just > > dereferencing pointers without having to hold any meaningful locks. So > > it should be fast. So making it unconditional seems fine to me. > > There's an ifdef since it's an optional kernel feature, and there's > also this thing that we might not have it, so I'd rather keep the > flag, and set it only if we can get the information (instead of > failing). As a user that seems clearer to me. I think we should keep the request_mask flag when returning from the kernel, but it's not necessary for the user to request it explicitly because it's cheap to get. This is how STATX_MNT_ID works and I think it makes sense to do it that way. (Though there are some complications when we add extensions in the future -- see my other mail about copy_struct_to_user().)
On 2024-10-09, Jonathan Corbet <corbet@lwn.net> wrote: > luca.boccassi@gmail.com writes: > > > As discussed at LPC24, add an ioctl with an extensible struct > > so that more parameters can be added later if needed. Start with > > returning pid/tgid/ppid and creds unconditionally, and cgroupid > > optionally. > > I was looking this over, and a couple of questions came to mind... > > > Signed-off-by: Luca Boccassi <luca.boccassi@gmail.com> > > --- > > [...] > > > diff --git a/fs/pidfs.c b/fs/pidfs.c > > index 80675b6bf884..15cdc7fe4968 100644 > > --- a/fs/pidfs.c > > +++ b/fs/pidfs.c > > @@ -2,6 +2,7 @@ > > #include <linux/anon_inodes.h> > > #include <linux/file.h> > > #include <linux/fs.h> > > +#include <linux/cgroup.h> > > #include <linux/magic.h> > > #include <linux/mount.h> > > #include <linux/pid.h> > > @@ -114,6 +115,83 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts) > > return poll_flags; > > } > > > > +static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long arg) > > +{ > > + struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg; > > + size_t usize = _IOC_SIZE(cmd); > > + struct pidfd_info kinfo = {}; > > + struct user_namespace *user_ns; > > + const struct cred *c; > > + __u64 request_mask; > > + > > + if (!uinfo) > > + return -EINVAL; > > + if (usize < sizeof(struct pidfd_info)) > > + return -EINVAL; /* First version, no smaller struct possible */ > > + > > + if (copy_from_user(&request_mask, &uinfo->request_mask, sizeof(request_mask))) > > + return -EFAULT; > > You don't check request_mask for unrecognized flags, so user space will > not get an error if it puts random gunk there. That, in turn, can make > it harder to add new options in the future. In fairness, this is how statx works and statx does this to not require syscall retries to figure out what flags the current kernel supports and instead defers that to stx_mask. However, I think verifying the value is slightly less fragile -- as long as we get a cheap way for userspace to check what flags are supported (such as CHECK_FIELDS[1]). It would kind of suck if userspace would have to do 50 syscalls to figure out what request_mask values are valid. [1]: https://lore.kernel.org/all/20241010-extensible-structs-check_fields-v3-0-d2833dfe6edd@cyphar.com/ > > > + c = get_task_cred(task); > > + if (!c) > > + return -ESRCH; > > [...] > > > + > > + /* > > + * If userspace and the kernel have the same struct size it can just > > + * be copied. If userspace provides an older struct, only the bits that > > + * userspace knows about will be copied. If userspace provides a new > > + * struct, only the bits that the kernel knows about will be copied and > > + * the size value will be set to the size the kernel knows about. > > + */ > > + if (copy_to_user(uinfo, &kinfo, min(usize, sizeof(kinfo)))) > > + return -EFAULT; > > Which "size value" are you referring to here; I can't see it. > > If user space has a bigger struct, should you perhaps zero-fill the part > the kernel doesn't know about? > > > + return 0; > > +} > > Thanks, > > jon >
On 2024-10-10, Aleksa Sarai <cyphar@cyphar.com> wrote: > On 2024-10-09, Jonathan Corbet <corbet@lwn.net> wrote: > > luca.boccassi@gmail.com writes: > > > > > As discussed at LPC24, add an ioctl with an extensible struct > > > so that more parameters can be added later if needed. Start with > > > returning pid/tgid/ppid and creds unconditionally, and cgroupid > > > optionally. > > > > I was looking this over, and a couple of questions came to mind... > > > > > Signed-off-by: Luca Boccassi <luca.boccassi@gmail.com> > > > --- > > > > [...] > > > > > diff --git a/fs/pidfs.c b/fs/pidfs.c > > > index 80675b6bf884..15cdc7fe4968 100644 > > > --- a/fs/pidfs.c > > > +++ b/fs/pidfs.c > > > @@ -2,6 +2,7 @@ > > > #include <linux/anon_inodes.h> > > > #include <linux/file.h> > > > #include <linux/fs.h> > > > +#include <linux/cgroup.h> > > > #include <linux/magic.h> > > > #include <linux/mount.h> > > > #include <linux/pid.h> > > > @@ -114,6 +115,83 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts) > > > return poll_flags; > > > } > > > > > > +static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long arg) > > > +{ > > > + struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg; > > > + size_t usize = _IOC_SIZE(cmd); > > > + struct pidfd_info kinfo = {}; > > > + struct user_namespace *user_ns; > > > + const struct cred *c; > > > + __u64 request_mask; > > > + > > > + if (!uinfo) > > > + return -EINVAL; > > > + if (usize < sizeof(struct pidfd_info)) > > > + return -EINVAL; /* First version, no smaller struct possible */ > > > + > > > + if (copy_from_user(&request_mask, &uinfo->request_mask, sizeof(request_mask))) > > > + return -EFAULT; > > > > You don't check request_mask for unrecognized flags, so user space will > > not get an error if it puts random gunk there. That, in turn, can make > > it harder to add new options in the future. > > In fairness, this is how statx works and statx does this to not require > syscall retries to figure out what flags the current kernel supports and > instead defers that to stx_mask. > > However, I think verifying the value is slightly less fragile -- as long > as we get a cheap way for userspace to check what flags are supported > (such as CHECK_FIELDS[1]). It would kind of suck if userspace would have > to do 50 syscalls to figure out what request_mask values are valid. Unfortunately, we probably need to find a different way to do CHECK_FIELDS for extensible-struct ioctls because CHECK_FIELDS uses the top bit in a u64 but we can't set a size that large with ioctl numbers... Then again, _IOC_SIZEBITS is 14 so we could easily set the ioctl size to something >PAGE_SIZE fairly easily at least... > [1]: https://lore.kernel.org/all/20241010-extensible-structs-check_fields-v3-0-d2833dfe6edd@cyphar.com/ > > > > > > + c = get_task_cred(task); > > > + if (!c) > > > + return -ESRCH; > > > > [...] > > > > > + > > > + /* > > > + * If userspace and the kernel have the same struct size it can just > > > + * be copied. If userspace provides an older struct, only the bits that > > > + * userspace knows about will be copied. If userspace provides a new > > > + * struct, only the bits that the kernel knows about will be copied and > > > + * the size value will be set to the size the kernel knows about. > > > + */ > > > + if (copy_to_user(uinfo, &kinfo, min(usize, sizeof(kinfo)))) > > > + return -EFAULT; > > > > Which "size value" are you referring to here; I can't see it. > > > > If user space has a bigger struct, should you perhaps zero-fill the part > > the kernel doesn't know about? > > > > > + return 0; > > > +} > > > > Thanks, > > > > jon > > > > -- > Aleksa Sarai > Senior Software Engineer (Containers) > SUSE Linux GmbH > <https://www.cyphar.com/>
Aleksa Sarai <cyphar@cyphar.com> writes: >> In fairness, this is how statx works and statx does this to not require >> syscall retries to figure out what flags the current kernel supports and >> instead defers that to stx_mask. >> >> However, I think verifying the value is slightly less fragile -- as long >> as we get a cheap way for userspace to check what flags are supported >> (such as CHECK_FIELDS[1]). It would kind of suck if userspace would have >> to do 50 syscalls to figure out what request_mask values are valid. > > Unfortunately, we probably need to find a different way to do > CHECK_FIELDS for extensible-struct ioctls because CHECK_FIELDS uses the > top bit in a u64 but we can't set a size that large with ioctl > numbers... Add a separate PIDFD_GET_VALID_REQUEST_MASK ioctl()? But then I'm bad at designing interfaces... jon
> > + /* > > + * If userspace and the kernel have the same struct size it can just > > + * be copied. If userspace provides an older struct, only the bits that > > + * userspace knows about will be copied. If userspace provides a new > > + * struct, only the bits that the kernel knows about will be copied and > > + * the size value will be set to the size the kernel knows about. > > + */ > > + if (copy_to_user(uinfo, &kinfo, min(usize, sizeof(kinfo)))) > > + return -EFAULT; > > Which "size value" are you referring to here; I can't see it. Luca did just copy my comment from another interface which has a separate size parameter. This should indeed be fixed.
On Thu, Oct 10, 2024 at 07:56:53AM +1100, Aleksa Sarai wrote: > On 2024-10-09, Jonathan Corbet <corbet@lwn.net> wrote: > > luca.boccassi@gmail.com writes: > > > > > As discussed at LPC24, add an ioctl with an extensible struct > > > so that more parameters can be added later if needed. Start with > > > returning pid/tgid/ppid and creds unconditionally, and cgroupid > > > optionally. > > > > I was looking this over, and a couple of questions came to mind... > > > > > Signed-off-by: Luca Boccassi <luca.boccassi@gmail.com> > > > --- > > > > [...] > > > > > diff --git a/fs/pidfs.c b/fs/pidfs.c > > > index 80675b6bf884..15cdc7fe4968 100644 > > > --- a/fs/pidfs.c > > > +++ b/fs/pidfs.c > > > @@ -2,6 +2,7 @@ > > > #include <linux/anon_inodes.h> > > > #include <linux/file.h> > > > #include <linux/fs.h> > > > +#include <linux/cgroup.h> > > > #include <linux/magic.h> > > > #include <linux/mount.h> > > > #include <linux/pid.h> > > > @@ -114,6 +115,83 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts) > > > return poll_flags; > > > } > > > > > > +static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long arg) > > > +{ > > > + struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg; > > > + size_t usize = _IOC_SIZE(cmd); > > > + struct pidfd_info kinfo = {}; > > > + struct user_namespace *user_ns; > > > + const struct cred *c; > > > + __u64 request_mask; > > > + > > > + if (!uinfo) > > > + return -EINVAL; > > > + if (usize < sizeof(struct pidfd_info)) > > > + return -EINVAL; /* First version, no smaller struct possible */ > > > + > > > + if (copy_from_user(&request_mask, &uinfo->request_mask, sizeof(request_mask))) > > > + return -EFAULT; > > > > You don't check request_mask for unrecognized flags, so user space will > > not get an error if it puts random gunk there. That, in turn, can make > > it harder to add new options in the future. > > In fairness, this is how statx works and statx does this to not require > syscall retries to figure out what flags the current kernel supports and > instead defers that to stx_mask. pidfd_info overwrites the request_mask with what is supported by the kernel. I don't think userspace setting random stuff in the request_mask is a problem. It would already be a problem with statx() and we haven't seen that so far. If userspace happens to set a some random bit in the request_mask and that bit ends up being used a few kernel releases later to e.g., retrieve additional information then all that happens is that userspace would now receive information they didn't need. That's not a problem. It is of course very different to e.g. adding a random bit in the flag mask of clone3() or mount_setattr() or any system call that changes kernel state based on the passed bits. In that case ignoring unknown bits and then starting to use them is obviously a big problem. The other related problem would be flag deprecation and reuse of a flag which (CLONE_DETACHED -> CLONE_PIDFD) also is only a real problem for system calls that alter kernel state. So overally, I think ignoring uknown bits in the request mask is safe. It needs to be documented of course.
On Thu, Oct 10, 2024 at 07:52:20AM +1100, Aleksa Sarai wrote: > On 2024-10-08, Luca Boccassi <luca.boccassi@gmail.com> wrote: > > On Tue, 8 Oct 2024 at 14:06, Christian Brauner <brauner@kernel.org> wrote: > > > > > > On Tue, Oct 08, 2024 at 01:18:20PM GMT, luca.boccassi@gmail.com wrote: > > > > From: Luca Boccassi <luca.boccassi@gmail.com> > > > > > > > > A common pattern when using pid fds is having to get information > > > > about the process, which currently requires /proc being mounted, > > > > resolving the fd to a pid, and then do manual string parsing of > > > > /proc/N/status and friends. This needs to be reimplemented over > > > > and over in all userspace projects (e.g.: I have reimplemented > > > > resolving in systemd, dbus, dbus-daemon, polkit so far), and > > > > requires additional care in checking that the fd is still valid > > > > after having parsed the data, to avoid races. > > > > > > > > Having a programmatic API that can be used directly removes all > > > > these requirements, including having /proc mounted. > > > > > > > > As discussed at LPC24, add an ioctl with an extensible struct > > > > so that more parameters can be added later if needed. Start with > > > > returning pid/tgid/ppid and creds unconditionally, and cgroupid > > > > optionally. > > > > > > > > Signed-off-by: Luca Boccassi <luca.boccassi@gmail.com> > > > > --- > > > > v9: drop result_mask and reuse request_mask instead > > > > v8: use RAII guard for rcu, call put_cred() > > > > v7: fix RCU issue and style issue introduced by v6 found by reviewer > > > > v6: use rcu_read_lock() when fetching cgroupid, use task_ppid_nr_ns() to > > > > get the ppid, return ESCHR if any of pid/tgid/ppid are 0 at the end > > > > of the call to avoid providing incomplete data, document what the > > > > callers should expect > > > > v5: check again that the task hasn't exited immediately before copying > > > > the result out to userspace, to ensure we are not returning stale data > > > > add an ifdef around the cgroup structs usage to fix build errors when > > > > the feature is disabled > > > > v4: fix arg check in pidfd_ioctl() by moving it after the new call > > > > v3: switch from pid_vnr() to task_pid_vnr() > > > > v2: Apply comments from Christian, apart from the one about pid namespaces > > > > as I need additional hints on how to implement it. > > > > Drop the security_context string as it is not the appropriate > > > > metadata to give userspace these days. > > > > > > > > fs/pidfs.c | 88 ++++++++++++++++++- > > > > include/uapi/linux/pidfd.h | 30 +++++++ > > > > .../testing/selftests/pidfd/pidfd_open_test.c | 80 ++++++++++++++++- > > > > 3 files changed, 194 insertions(+), 4 deletions(-) > > > > > > > > diff --git a/fs/pidfs.c b/fs/pidfs.c > > > > index 80675b6bf884..15cdc7fe4968 100644 > > > > --- a/fs/pidfs.c > > > > +++ b/fs/pidfs.c > > > > @@ -2,6 +2,7 @@ > > > > #include <linux/anon_inodes.h> > > > > #include <linux/file.h> > > > > #include <linux/fs.h> > > > > +#include <linux/cgroup.h> > > > > #include <linux/magic.h> > > > > #include <linux/mount.h> > > > > #include <linux/pid.h> > > > > @@ -114,6 +115,83 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts) > > > > return poll_flags; > > > > } > > > > > > > > +static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long arg) > > > > +{ > > > > + struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg; > > > > + size_t usize = _IOC_SIZE(cmd); > > > > + struct pidfd_info kinfo = {}; > > > > + struct user_namespace *user_ns; > > > > + const struct cred *c; > > > > + __u64 request_mask; > > > > + > > > > + if (!uinfo) > > > > + return -EINVAL; > > > > + if (usize < sizeof(struct pidfd_info)) > > > > + return -EINVAL; /* First version, no smaller struct possible */ > > > > + > > > > + if (copy_from_user(&request_mask, &uinfo->request_mask, sizeof(request_mask))) > > > > + return -EFAULT; > > > > + > > > > + c = get_task_cred(task); > > > > + if (!c) > > > > + return -ESRCH; > > > > + > > > > + /* Unconditionally return identifiers and credentials, the rest only on request */ > > > > + > > > > + user_ns = current_user_ns(); > > > > + kinfo.ruid = from_kuid_munged(user_ns, c->uid); > > > > + kinfo.rgid = from_kgid_munged(user_ns, c->gid); > > > > + kinfo.euid = from_kuid_munged(user_ns, c->euid); > > > > + kinfo.egid = from_kgid_munged(user_ns, c->egid); > > > > + kinfo.suid = from_kuid_munged(user_ns, c->suid); > > > > + kinfo.sgid = from_kgid_munged(user_ns, c->sgid); > > > > + kinfo.fsuid = from_kuid_munged(user_ns, c->fsuid); > > > > + kinfo.fsgid = from_kgid_munged(user_ns, c->fsgid); > > > > + put_cred(c); > > > > + > > > > +#ifdef CONFIG_CGROUPS > > > > + if (request_mask & PIDFD_INFO_CGROUPID) { > > > > + struct cgroup *cgrp; > > > > + > > > > + guard(rcu)(); > > > > + cgrp = task_cgroup(task, pids_cgrp_id); > > > > + if (!cgrp) > > > > + return -ENODEV; > > > > > > Afaict this means that the task has already exited. In other words, the > > > cgroup id cannot be retrieved anymore for a task that has exited but not > > > been reaped. Frankly, I would have expected the cgroup id to be > > > retrievable until the task has been reaped but that's another > > > discussion. > > > > > > My point is if you contrast this with the other information in here: If > > > the task has exited but hasn't been reaped then you can still get > > > credentials such as *uid/*gid, and pid namespace relative information > > > such as pid/tgid/ppid. > > > > > > So really, I would argue that you don't want to fail this but only > > > report 0 here. That's me working under the assumption that cgroup ids > > > start from 1... > > > > > > /me checks > > > > > > Yes, they start from 1 so 0 is invalid. > > > > > > > + kinfo.cgroupid = cgroup_id(cgrp); > > > > > > Fwiw, it looks like getting the cgroup id is basically just > > > dereferencing pointers without having to hold any meaningful locks. So > > > it should be fast. So making it unconditional seems fine to me. > > > > There's an ifdef since it's an optional kernel feature, and there's > > also this thing that we might not have it, so I'd rather keep the > > flag, and set it only if we can get the information (instead of > > failing). As a user that seems clearer to me. > > I think we should keep the request_mask flag when returning from the > kernel, but it's not necessary for the user to request it explicitly > because it's cheap to get. This is how STATX_MNT_ID works and I think it > makes sense to do it that way. So what we should do is not require userspace to request it explicitly but always raise the flag in request_mask when it's available. I agree.
On Thu, Oct 10, 2024 at 07:50:36AM +1100, Aleksa Sarai wrote: > On 2024-10-08, luca.boccassi@gmail.com <luca.boccassi@gmail.com> wrote: > > From: Luca Boccassi <luca.boccassi@gmail.com> > > > > A common pattern when using pid fds is having to get information > > about the process, which currently requires /proc being mounted, > > resolving the fd to a pid, and then do manual string parsing of > > /proc/N/status and friends. This needs to be reimplemented over > > and over in all userspace projects (e.g.: I have reimplemented > > resolving in systemd, dbus, dbus-daemon, polkit so far), and > > requires additional care in checking that the fd is still valid > > after having parsed the data, to avoid races. > > > > Having a programmatic API that can be used directly removes all > > these requirements, including having /proc mounted. > > > > As discussed at LPC24, add an ioctl with an extensible struct > > so that more parameters can be added later if needed. Start with > > returning pid/tgid/ppid and creds unconditionally, and cgroupid > > optionally. > > > > Signed-off-by: Luca Boccassi <luca.boccassi@gmail.com> > > --- > > v9: drop result_mask and reuse request_mask instead > > v8: use RAII guard for rcu, call put_cred() > > v7: fix RCU issue and style issue introduced by v6 found by reviewer > > v6: use rcu_read_lock() when fetching cgroupid, use task_ppid_nr_ns() to > > get the ppid, return ESCHR if any of pid/tgid/ppid are 0 at the end > > of the call to avoid providing incomplete data, document what the > > callers should expect > > v5: check again that the task hasn't exited immediately before copying > > the result out to userspace, to ensure we are not returning stale data > > add an ifdef around the cgroup structs usage to fix build errors when > > the feature is disabled > > v4: fix arg check in pidfd_ioctl() by moving it after the new call > > v3: switch from pid_vnr() to task_pid_vnr() > > v2: Apply comments from Christian, apart from the one about pid namespaces > > as I need additional hints on how to implement it. > > Drop the security_context string as it is not the appropriate > > metadata to give userspace these days. > > > > fs/pidfs.c | 88 ++++++++++++++++++- > > include/uapi/linux/pidfd.h | 30 +++++++ > > .../testing/selftests/pidfd/pidfd_open_test.c | 80 ++++++++++++++++- > > 3 files changed, 194 insertions(+), 4 deletions(-) > > > > diff --git a/fs/pidfs.c b/fs/pidfs.c > > index 80675b6bf884..15cdc7fe4968 100644 > > --- a/fs/pidfs.c > > +++ b/fs/pidfs.c > > @@ -2,6 +2,7 @@ > > #include <linux/anon_inodes.h> > > #include <linux/file.h> > > #include <linux/fs.h> > > +#include <linux/cgroup.h> > > #include <linux/magic.h> > > #include <linux/mount.h> > > #include <linux/pid.h> > > @@ -114,6 +115,83 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts) > > return poll_flags; > > } > > > > +static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long arg) > > +{ > > + struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg; > > + size_t usize = _IOC_SIZE(cmd); > > + struct pidfd_info kinfo = {}; > > + struct user_namespace *user_ns; > > + const struct cred *c; > > + __u64 request_mask; > > + > > + if (!uinfo) > > + return -EINVAL; > > + if (usize < sizeof(struct pidfd_info)) > > + return -EINVAL; /* First version, no smaller struct possible */ > > + > > + if (copy_from_user(&request_mask, &uinfo->request_mask, sizeof(request_mask))) > > + return -EFAULT; > > + > > + c = get_task_cred(task); > > + if (!c) > > + return -ESRCH; > > + > > + /* Unconditionally return identifiers and credentials, the rest only on request */ > > + > > + user_ns = current_user_ns(); > > + kinfo.ruid = from_kuid_munged(user_ns, c->uid); > > + kinfo.rgid = from_kgid_munged(user_ns, c->gid); > > + kinfo.euid = from_kuid_munged(user_ns, c->euid); > > + kinfo.egid = from_kgid_munged(user_ns, c->egid); > > + kinfo.suid = from_kuid_munged(user_ns, c->suid); > > + kinfo.sgid = from_kgid_munged(user_ns, c->sgid); > > + kinfo.fsuid = from_kuid_munged(user_ns, c->fsuid); > > + kinfo.fsgid = from_kgid_munged(user_ns, c->fsgid); > > + put_cred(c); > > + > > +#ifdef CONFIG_CGROUPS > > + if (request_mask & PIDFD_INFO_CGROUPID) { > > + struct cgroup *cgrp; > > + > > + guard(rcu)(); > > + cgrp = task_cgroup(task, pids_cgrp_id); > > + if (!cgrp) > > + return -ENODEV; > > + kinfo.cgroupid = cgroup_id(cgrp); > > + > > + kinfo.request_mask |= PIDFD_INFO_CGROUPID; > > + } > > +#endif > > + > > + /* > > + * Copy pid/tgid last, to reduce the chances the information might be > > + * stale. Note that it is not possible to ensure it will be valid as the > > + * task might return as soon as the copy_to_user finishes, but that's ok > > + * and userspace expects that might happen and can act accordingly, so > > + * this is just best-effort. What we can do however is checking that all > > + * the fields are set correctly, or return ESRCH to avoid providing > > + * incomplete information. */ > > + > > + kinfo.ppid = task_ppid_nr_ns(task, NULL); > > + kinfo.tgid = task_tgid_vnr(task); > > + kinfo.pid = task_pid_vnr(task); > > + > > + if (kinfo.pid == 0 || kinfo.tgid == 0 || (kinfo.ppid == 0 && kinfo.pid != 1)) > > + return -ESRCH; > > + > > + /* > > + * If userspace and the kernel have the same struct size it can just > > + * be copied. If userspace provides an older struct, only the bits that > > + * userspace knows about will be copied. If userspace provides a new > > + * struct, only the bits that the kernel knows about will be copied and > > + * the size value will be set to the size the kernel knows about. > > + */ > > + if (copy_to_user(uinfo, &kinfo, min(usize, sizeof(kinfo)))) > > + return -EFAULT; > > If usize > ksize, we also want to clear_user() the trailing bytes to > avoid userspace thinking that any garbage bytes they had are valid. > > Also, you mention "the size value" but there is no size in pidfd_info. I > don't think it's actually necessary to include such a field (especially > when you have a statx-like request_mask), but it means you really should > clear the trailing bytes to avoid userspace bugs. > > I implemented all of these semantics as copy_struct_to_user() in the > CHECK_FIELDS patch I sent a few weeks ago (I just sent v3[1]). Maybe you > can cherry-pick this patch and use it? The semantics when we extend this > pidfd_info to accept new request_mask values with larger structures is > going to get a little ugly and copy_struct_to_user() makes this a little > easier to deal with. > > [1]: https://lore.kernel.org/all/20241010-extensible-structs-check_fields-v3-1-d2833dfe6edd@cyphar.com/ I agree. @Luca, you can either send the two patches together or I can just port the patch to it. I don't mind. > > > + > > + return 0; > > +} > > + > > static long pidfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg) > > { > > struct task_struct *task __free(put_task) = NULL; > > @@ -122,13 +200,17 @@ static long pidfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg) > > struct ns_common *ns_common = NULL; > > struct pid_namespace *pid_ns; > > > > - if (arg) > > - return -EINVAL; > > - > > task = get_pid_task(pid, PIDTYPE_PID); > > if (!task) > > return -ESRCH; > > > > + /* Extensible IOCTL that does not open namespace FDs, take a shortcut */ > > + if (_IOC_NR(cmd) == _IOC_NR(PIDFD_GET_INFO)) > > + return pidfd_info(task, cmd, arg); > > + > > + if (arg) > > + return -EINVAL; > > + > > scoped_guard(task_lock, task) { > > nsp = task->nsproxy; > > if (nsp) > > diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h > > index 565fc0629fff..d685eeeedc51 100644 > > --- a/include/uapi/linux/pidfd.h > > +++ b/include/uapi/linux/pidfd.h > > @@ -16,6 +16,35 @@ > > #define PIDFD_SIGNAL_THREAD_GROUP (1UL << 1) > > #define PIDFD_SIGNAL_PROCESS_GROUP (1UL << 2) > > > > +/* Flags for pidfd_info. */ > > +#define PIDFD_INFO_CGROUPID (1UL << 0) > > While it isn't strictly necessary, maybe we should provide some > always-set bits like statx does? While they would always be set, it Yes, good idea. > might incentivise programs to write code that checks if the request_mask > bits are set after the ioctl(2) returns from the outset. Then again, > PIDFD_INFO_CGROUPID is probably enough to justify writing code correctly > from the outset. > > > + > > +struct pidfd_info { > > + /* Let userspace request expensive stuff explictly, and let the kernel > > + * indicate whether it knows about it. */ > > I would prefer a slightly more informative comment (which mentions that > this will also be used for extensibility), something like: > > > /* > * This mask is similar to the request_mask in statx(2). > * > * Userspace indicates what extensions or expensive-to-calculate fields > * they want by setting the corresponding bits in request_mask. > * > * When filling the structure, the kernel will only set bits > * corresponding to the fields that were actually filled by the kernel. > * This also includes any future extensions that might be automatically > * filled. If the structure size is too small to contain a field > * (requested or not), to avoid confusion the request_mask will not > * contain a bit for that field. > * > * As such, userspace MUST verify that request_mask contains the > * corresponding flags after the ioctl(2) returns to ensure that it is > * using valid data. > */ I agree. This is a good comment.
Christian Brauner <brauner@kernel.org> writes: > pidfd_info overwrites the request_mask with what is supported by the > kernel. I don't think userspace setting random stuff in the request_mask > is a problem. It would already be a problem with statx() and we haven't > seen that so far. > > If userspace happens to set a some random bit in the request_mask and > that bit ends up being used a few kernel releases later to e.g., > retrieve additional information then all that happens is that userspace > would now receive information they didn't need. That's not a problem. That, of course, assumes that there will never be a request_mask bit that affects the information gathering in some other way -- say looking in the parent namespace or such (a random example that just popped into my undercaffeinated brain and is unlikely to be anything we actually do). But then, as I said, I'm bad at this :) jon
... > pidfd_info overwrites the request_mask with what is supported by the > kernel. I don't think userspace setting random stuff in the request_mask > is a problem. It would already be a problem with statx() and we haven't > seen that so far. I don't think it is wise/necessary for the kernel to set bits that weren't set in the request (for values that aren't being returned). That would let you add some mutually exclusive options that use the same part of the buffer area. > If userspace happens to set a some random bit in the request_mask and > that bit ends up being used a few kernel releases later to e.g., > retrieve additional information then all that happens is that userspace > would now receive information they didn't need. That's not a problem. Especially since the buffer is unlikely to be large enough. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Thu, 10 Oct 2024 at 10:46, Christian Brauner <brauner@kernel.org> wrote: > > On Thu, Oct 10, 2024 at 07:50:36AM +1100, Aleksa Sarai wrote: > > On 2024-10-08, luca.boccassi@gmail.com <luca.boccassi@gmail.com> wrote: > > > From: Luca Boccassi <luca.boccassi@gmail.com> > > > + /* > > > + * If userspace and the kernel have the same struct size it can just > > > + * be copied. If userspace provides an older struct, only the bits that > > > + * userspace knows about will be copied. If userspace provides a new > > > + * struct, only the bits that the kernel knows about will be copied and > > > + * the size value will be set to the size the kernel knows about. > > > + */ > > > + if (copy_to_user(uinfo, &kinfo, min(usize, sizeof(kinfo)))) > > > + return -EFAULT; > > > > If usize > ksize, we also want to clear_user() the trailing bytes to > > avoid userspace thinking that any garbage bytes they had are valid. > > > > Also, you mention "the size value" but there is no size in pidfd_info. I > > don't think it's actually necessary to include such a field (especially > > when you have a statx-like request_mask), but it means you really should > > clear the trailing bytes to avoid userspace bugs. > > > > I implemented all of these semantics as copy_struct_to_user() in the > > CHECK_FIELDS patch I sent a few weeks ago (I just sent v3[1]). Maybe you > > can cherry-pick this patch and use it? The semantics when we extend this > > pidfd_info to accept new request_mask values with larger structures is > > going to get a little ugly and copy_struct_to_user() makes this a little > > easier to deal with. > > > > [1]: https://lore.kernel.org/all/20241010-extensible-structs-check_fields-v3-1-d2833dfe6edd@cyphar.com/ > > I agree. @Luca, you can either send the two patches together or I can > just port the patch to it. I don't mind. I've updated for the latter, given that series is not merged yet, thanks.
On 2024-10-09, Jonathan Corbet <corbet@lwn.net> wrote: > Aleksa Sarai <cyphar@cyphar.com> writes: > > >> In fairness, this is how statx works and statx does this to not require > >> syscall retries to figure out what flags the current kernel supports and > >> instead defers that to stx_mask. > >> > >> However, I think verifying the value is slightly less fragile -- as long > >> as we get a cheap way for userspace to check what flags are supported > >> (such as CHECK_FIELDS[1]). It would kind of suck if userspace would have > >> to do 50 syscalls to figure out what request_mask values are valid. > > > > Unfortunately, we probably need to find a different way to do > > CHECK_FIELDS for extensible-struct ioctls because CHECK_FIELDS uses the > > top bit in a u64 but we can't set a size that large with ioctl > > numbers... > > Add a separate PIDFD_GET_VALID_REQUEST_MASK ioctl()? > > But then I'm bad at designing interfaces... This might be a good argument for making CHECK_FIELDS just (-size) instead of setting the highest bit because that would work for any bit size (though admittedly, doing (-size) for a 14-bit number would still be a little weird). > > jon
diff --git a/fs/pidfs.c b/fs/pidfs.c index 80675b6bf884..15cdc7fe4968 100644 --- a/fs/pidfs.c +++ b/fs/pidfs.c @@ -2,6 +2,7 @@ #include <linux/anon_inodes.h> #include <linux/file.h> #include <linux/fs.h> +#include <linux/cgroup.h> #include <linux/magic.h> #include <linux/mount.h> #include <linux/pid.h> @@ -114,6 +115,83 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts) return poll_flags; } +static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long arg) +{ + struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg; + size_t usize = _IOC_SIZE(cmd); + struct pidfd_info kinfo = {}; + struct user_namespace *user_ns; + const struct cred *c; + __u64 request_mask; + + if (!uinfo) + return -EINVAL; + if (usize < sizeof(struct pidfd_info)) + return -EINVAL; /* First version, no smaller struct possible */ + + if (copy_from_user(&request_mask, &uinfo->request_mask, sizeof(request_mask))) + return -EFAULT; + + c = get_task_cred(task); + if (!c) + return -ESRCH; + + /* Unconditionally return identifiers and credentials, the rest only on request */ + + user_ns = current_user_ns(); + kinfo.ruid = from_kuid_munged(user_ns, c->uid); + kinfo.rgid = from_kgid_munged(user_ns, c->gid); + kinfo.euid = from_kuid_munged(user_ns, c->euid); + kinfo.egid = from_kgid_munged(user_ns, c->egid); + kinfo.suid = from_kuid_munged(user_ns, c->suid); + kinfo.sgid = from_kgid_munged(user_ns, c->sgid); + kinfo.fsuid = from_kuid_munged(user_ns, c->fsuid); + kinfo.fsgid = from_kgid_munged(user_ns, c->fsgid); + put_cred(c); + +#ifdef CONFIG_CGROUPS + if (request_mask & PIDFD_INFO_CGROUPID) { + struct cgroup *cgrp; + + guard(rcu)(); + cgrp = task_cgroup(task, pids_cgrp_id); + if (!cgrp) + return -ENODEV; + kinfo.cgroupid = cgroup_id(cgrp); + + kinfo.request_mask |= PIDFD_INFO_CGROUPID; + } +#endif + + /* + * Copy pid/tgid last, to reduce the chances the information might be + * stale. Note that it is not possible to ensure it will be valid as the + * task might return as soon as the copy_to_user finishes, but that's ok + * and userspace expects that might happen and can act accordingly, so + * this is just best-effort. What we can do however is checking that all + * the fields are set correctly, or return ESRCH to avoid providing + * incomplete information. */ + + kinfo.ppid = task_ppid_nr_ns(task, NULL); + kinfo.tgid = task_tgid_vnr(task); + kinfo.pid = task_pid_vnr(task); + + if (kinfo.pid == 0 || kinfo.tgid == 0 || (kinfo.ppid == 0 && kinfo.pid != 1)) + return -ESRCH; + + /* + * If userspace and the kernel have the same struct size it can just + * be copied. If userspace provides an older struct, only the bits that + * userspace knows about will be copied. If userspace provides a new + * struct, only the bits that the kernel knows about will be copied and + * the size value will be set to the size the kernel knows about. + */ + if (copy_to_user(uinfo, &kinfo, min(usize, sizeof(kinfo)))) + return -EFAULT; + + return 0; +} + static long pidfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { struct task_struct *task __free(put_task) = NULL; @@ -122,13 +200,17 @@ static long pidfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg) struct ns_common *ns_common = NULL; struct pid_namespace *pid_ns; - if (arg) - return -EINVAL; - task = get_pid_task(pid, PIDTYPE_PID); if (!task) return -ESRCH; + /* Extensible IOCTL that does not open namespace FDs, take a shortcut */ + if (_IOC_NR(cmd) == _IOC_NR(PIDFD_GET_INFO)) + return pidfd_info(task, cmd, arg); + + if (arg) + return -EINVAL; + scoped_guard(task_lock, task) { nsp = task->nsproxy; if (nsp) diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h index 565fc0629fff..d685eeeedc51 100644 --- a/include/uapi/linux/pidfd.h +++ b/include/uapi/linux/pidfd.h @@ -16,6 +16,35 @@ #define PIDFD_SIGNAL_THREAD_GROUP (1UL << 1) #define PIDFD_SIGNAL_PROCESS_GROUP (1UL << 2) +/* Flags for pidfd_info. */ +#define PIDFD_INFO_CGROUPID (1UL << 0) + +struct pidfd_info { + /* Let userspace request expensive stuff explictly, and let the kernel + * indicate whether it knows about it. */ + __u64 request_mask; + /* + * The information contained in the following fields might be stale at the + * time it is received, as the target process might have exited as soon as + * the IOCTL was processed, and there is no way to avoid that. However, it + * is guaranteed that if the call was successful, then the information was + * correct and referred to the intended process at the time the work was + * performed. */ + __u64 cgroupid; + __u32 pid; + __u32 tgid; + __u32 ppid; + __u32 ruid; + __u32 rgid; + __u32 euid; + __u32 egid; + __u32 suid; + __u32 sgid; + __u32 fsuid; + __u32 fsgid; + __u32 spare0[1]; +}; + #define PIDFS_IOCTL_MAGIC 0xFF #define PIDFD_GET_CGROUP_NAMESPACE _IO(PIDFS_IOCTL_MAGIC, 1) @@ -28,5 +57,6 @@ #define PIDFD_GET_TIME_FOR_CHILDREN_NAMESPACE _IO(PIDFS_IOCTL_MAGIC, 8) #define PIDFD_GET_USER_NAMESPACE _IO(PIDFS_IOCTL_MAGIC, 9) #define PIDFD_GET_UTS_NAMESPACE _IO(PIDFS_IOCTL_MAGIC, 10) +#define PIDFD_GET_INFO _IOWR(PIDFS_IOCTL_MAGIC, 11, struct pidfd_info) #endif /* _UAPI_LINUX_PIDFD_H */ diff --git a/tools/testing/selftests/pidfd/pidfd_open_test.c b/tools/testing/selftests/pidfd/pidfd_open_test.c index c62564c264b1..b2a8cfb19a74 100644 --- a/tools/testing/selftests/pidfd/pidfd_open_test.c +++ b/tools/testing/selftests/pidfd/pidfd_open_test.c @@ -13,6 +13,7 @@ #include <stdlib.h> #include <string.h> #include <syscall.h> +#include <sys/ioctl.h> #include <sys/mount.h> #include <sys/prctl.h> #include <sys/wait.h> @@ -21,6 +22,34 @@ #include "pidfd.h" #include "../kselftest.h" +#ifndef PIDFS_IOCTL_MAGIC +#define PIDFS_IOCTL_MAGIC 0xFF +#endif + +#ifndef PIDFD_GET_INFO +#define PIDFD_GET_INFO _IOWR(PIDFS_IOCTL_MAGIC, 11, struct pidfd_info) +#define PIDFD_INFO_CGROUPID (1UL << 0) + +struct pidfd_info { + /* Let userspace request expensive stuff explictly, and let the kernel + * indicate whether it knows about it. */ + __u64 request_mask; + __u64 cgroupid; + __u32 pid; + __u32 tgid; + __u32 ppid; + __u32 ruid; + __u32 rgid; + __u32 euid; + __u32 egid; + __u32 suid; + __u32 sgid; + __u32 fsuid; + __u32 fsgid; + __u32 spare0[1]; +}; +#endif + static int safe_int(const char *numstr, int *converted) { char *err = NULL; @@ -120,10 +149,13 @@ static pid_t get_pid_from_fdinfo_file(int pidfd, const char *key, size_t keylen) int main(int argc, char **argv) { + struct pidfd_info info = { + .request_mask = PIDFD_INFO_CGROUPID, + }; int pidfd = -1, ret = 1; pid_t pid; - ksft_set_plan(3); + ksft_set_plan(4); pidfd = sys_pidfd_open(-1, 0); if (pidfd >= 0) { @@ -153,6 +185,52 @@ int main(int argc, char **argv) pid = get_pid_from_fdinfo_file(pidfd, "Pid:", sizeof("Pid:") - 1); ksft_print_msg("pidfd %d refers to process with pid %d\n", pidfd, pid); + if (ioctl(pidfd, PIDFD_GET_INFO, &info) < 0) { + ksft_print_msg("%s - failed to get info from pidfd\n", strerror(errno)); + goto on_error; + } + if (info.pid != pid) { + ksft_print_msg("pid from fdinfo file %d does not match pid from ioctl %d\n", + pid, info.pid); + goto on_error; + } + if (info.ppid != getppid()) { + ksft_print_msg("ppid %d does not match ppid from ioctl %d\n", + pid, info.pid); + goto on_error; + } + if (info.ruid != getuid()) { + ksft_print_msg("uid %d does not match uid from ioctl %d\n", + getuid(), info.ruid); + goto on_error; + } + if (info.rgid != getgid()) { + ksft_print_msg("gid %d does not match gid from ioctl %d\n", + getgid(), info.rgid); + goto on_error; + } + if (info.euid != geteuid()) { + ksft_print_msg("euid %d does not match euid from ioctl %d\n", + geteuid(), info.euid); + goto on_error; + } + if (info.egid != getegid()) { + ksft_print_msg("egid %d does not match egid from ioctl %d\n", + getegid(), info.egid); + goto on_error; + } + if (info.suid != geteuid()) { + ksft_print_msg("suid %d does not match suid from ioctl %d\n", + geteuid(), info.suid); + goto on_error; + } + if (info.sgid != getegid()) { + ksft_print_msg("sgid %d does not match sgid from ioctl %d\n", + getegid(), info.sgid); + goto on_error; + } + ksft_test_result_pass("get info from pidfd test: passed\n"); + ret = 0; on_error: