[v9] pidfd: add ioctl to retrieve pid info

Message ID	20241008121930.869054-1-luca.boccassi@gmail.com (mailing list archive)
State	New
Headers	show Received: from mail-wm1-f46.google.com (mail-wm1-f46.google.com [209.85.128.46]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id CC1091DEFC4; Tue, 8 Oct 2024 12:19:35 +0000 (UTC) From: luca.boccassi@gmail.com To: linux-fsdevel@vger.kernel.org Cc: christian@brauner.io, linux-kernel@vger.kernel.org, oleg@redhat.com Subject: [PATCH v9] pidfd: add ioctl to retrieve pid info Date: Tue, 8 Oct 2024 13:18:20 +0100 Message-ID: <20241008121930.869054-1-luca.boccassi@gmail.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	[v9] pidfd: add ioctl to retrieve pid info \| expand [v9] pidfd: add ioctl to retrieve pid info

Luca Boccassi Oct. 8, 2024, 12:18 p.m. UTC

From: Luca Boccassi <luca.boccassi@gmail.com>

A common pattern when using pid fds is having to get information
about the process, which currently requires /proc being mounted,
resolving the fd to a pid, and then do manual string parsing of
/proc/N/status and friends. This needs to be reimplemented over
and over in all userspace projects (e.g.: I have reimplemented
resolving in systemd, dbus, dbus-daemon, polkit so far), and
requires additional care in checking that the fd is still valid
after having parsed the data, to avoid races.

Having a programmatic API that can be used directly removes all
these requirements, including having /proc mounted.

As discussed at LPC24, add an ioctl with an extensible struct
so that more parameters can be added later if needed. Start with
returning pid/tgid/ppid and creds unconditionally, and cgroupid
optionally.

Signed-off-by: Luca Boccassi <luca.boccassi@gmail.com>
---
v9: drop result_mask and reuse request_mask instead
v8: use RAII guard for rcu, call put_cred()
v7: fix RCU issue and style issue introduced by v6 found by reviewer
v6: use rcu_read_lock() when fetching cgroupid, use task_ppid_nr_ns() to
    get the ppid, return ESCHR if any of pid/tgid/ppid are 0 at the end
    of the call to avoid providing incomplete data, document what the
    callers should expect
v5: check again that the task hasn't exited immediately before copying
    the result out to userspace, to ensure we are not returning stale data
    add an ifdef around the cgroup structs usage to fix build errors when
    the feature is disabled
v4: fix arg check in pidfd_ioctl() by moving it after the new call
v3: switch from pid_vnr() to task_pid_vnr()
v2: Apply comments from Christian, apart from the one about pid namespaces
    as I need additional hints on how to implement it.
    Drop the security_context string as it is not the appropriate
    metadata to give userspace these days.

 fs/pidfs.c                                    | 88 ++++++++++++++++++-
 include/uapi/linux/pidfd.h                    | 30 +++++++
 .../testing/selftests/pidfd/pidfd_open_test.c | 80 ++++++++++++++++-
 3 files changed, 194 insertions(+), 4 deletions(-)


base-commit: 8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b

Christian Brauner Oct. 8, 2024, 1:06 p.m. UTC | #1

On Tue, Oct 08, 2024 at 01:18:20PM GMT, luca.boccassi@gmail.com wrote:
> From: Luca Boccassi <luca.boccassi@gmail.com>
> 
> A common pattern when using pid fds is having to get information
> about the process, which currently requires /proc being mounted,
> resolving the fd to a pid, and then do manual string parsing of
> /proc/N/status and friends. This needs to be reimplemented over
> and over in all userspace projects (e.g.: I have reimplemented
> resolving in systemd, dbus, dbus-daemon, polkit so far), and
> requires additional care in checking that the fd is still valid
> after having parsed the data, to avoid races.
> 
> Having a programmatic API that can be used directly removes all
> these requirements, including having /proc mounted.
> 
> As discussed at LPC24, add an ioctl with an extensible struct
> so that more parameters can be added later if needed. Start with
> returning pid/tgid/ppid and creds unconditionally, and cgroupid
> optionally.
> 
> Signed-off-by: Luca Boccassi <luca.boccassi@gmail.com>
> ---
> v9: drop result_mask and reuse request_mask instead
> v8: use RAII guard for rcu, call put_cred()
> v7: fix RCU issue and style issue introduced by v6 found by reviewer
> v6: use rcu_read_lock() when fetching cgroupid, use task_ppid_nr_ns() to
>     get the ppid, return ESCHR if any of pid/tgid/ppid are 0 at the end
>     of the call to avoid providing incomplete data, document what the
>     callers should expect
> v5: check again that the task hasn't exited immediately before copying
>     the result out to userspace, to ensure we are not returning stale data
>     add an ifdef around the cgroup structs usage to fix build errors when
>     the feature is disabled
> v4: fix arg check in pidfd_ioctl() by moving it after the new call
> v3: switch from pid_vnr() to task_pid_vnr()
> v2: Apply comments from Christian, apart from the one about pid namespaces
>     as I need additional hints on how to implement it.
>     Drop the security_context string as it is not the appropriate
>     metadata to give userspace these days.
> 
>  fs/pidfs.c                                    | 88 ++++++++++++++++++-
>  include/uapi/linux/pidfd.h                    | 30 +++++++
>  .../testing/selftests/pidfd/pidfd_open_test.c | 80 ++++++++++++++++-
>  3 files changed, 194 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/pidfs.c b/fs/pidfs.c
> index 80675b6bf884..15cdc7fe4968 100644
> --- a/fs/pidfs.c
> +++ b/fs/pidfs.c
> @@ -2,6 +2,7 @@
>  #include <linux/anon_inodes.h>
>  #include <linux/file.h>
>  #include <linux/fs.h>
> +#include <linux/cgroup.h>
>  #include <linux/magic.h>
>  #include <linux/mount.h>
>  #include <linux/pid.h>
> @@ -114,6 +115,83 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
>  	return poll_flags;
>  }
>  
> +static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long arg)
> +{
> +	struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg;
> +	size_t usize = _IOC_SIZE(cmd);
> +	struct pidfd_info kinfo = {};
> +	struct user_namespace *user_ns;
> +	const struct cred *c;
> +	__u64 request_mask;
> +
> +	if (!uinfo)
> +		return -EINVAL;
> +	if (usize < sizeof(struct pidfd_info))
> +		return -EINVAL; /* First version, no smaller struct possible */
> +
> +	if (copy_from_user(&request_mask, &uinfo->request_mask, sizeof(request_mask)))
> +		return -EFAULT;
> +
> +	c = get_task_cred(task);
> +	if (!c)
> +		return -ESRCH;
> +
> +	/* Unconditionally return identifiers and credentials, the rest only on request */
> +
> +	user_ns = current_user_ns();
> +	kinfo.ruid = from_kuid_munged(user_ns, c->uid);
> +	kinfo.rgid = from_kgid_munged(user_ns, c->gid);
> +	kinfo.euid = from_kuid_munged(user_ns, c->euid);
> +	kinfo.egid = from_kgid_munged(user_ns, c->egid);
> +	kinfo.suid = from_kuid_munged(user_ns, c->suid);
> +	kinfo.sgid = from_kgid_munged(user_ns, c->sgid);
> +	kinfo.fsuid = from_kuid_munged(user_ns, c->fsuid);
> +	kinfo.fsgid = from_kgid_munged(user_ns, c->fsgid);
> +	put_cred(c);
> +
> +#ifdef CONFIG_CGROUPS
> +	if (request_mask & PIDFD_INFO_CGROUPID) {
> +		struct cgroup *cgrp;
> +
> +		guard(rcu)();
> +		cgrp = task_cgroup(task, pids_cgrp_id);
> +		if (!cgrp)
> +			return -ENODEV;

Afaict this means that the task has already exited. In other words, the
cgroup id cannot be retrieved anymore for a task that has exited but not
been reaped. Frankly, I would have expected the cgroup id to be
retrievable until the task has been reaped but that's another
discussion.

My point is if you contrast this with the other information in here: If
the task has exited but hasn't been reaped then you can still get
credentials such as *uid/*gid, and pid namespace relative information
such as pid/tgid/ppid.

So really, I would argue that you don't want to fail this but only
report 0 here. That's me working under the assumption that cgroup ids
start from 1...

/me checks

Yes, they start from 1 so 0 is invalid.

> +		kinfo.cgroupid = cgroup_id(cgrp);

Fwiw, it looks like getting the cgroup id is basically just
dereferencing pointers without having to hold any meaningful locks. So
it should be fast. So making it unconditional seems fine to me.

> +
> +		kinfo.request_mask |= PIDFD_INFO_CGROUPID;
> +	}
> +#endif
> +
> +	/*
> +	 * Copy pid/tgid last, to reduce the chances the information might be
> +	 * stale. Note that it is not possible to ensure it will be valid as the
> +	 * task might return as soon as the copy_to_user finishes, but that's ok
> +	 * and userspace expects that might happen and can act accordingly, so
> +	 * this is just best-effort. What we can do however is checking that all
> +	 * the fields are set correctly, or return ESRCH to avoid providing
> +	 * incomplete information. */
> +
> +	kinfo.ppid = task_ppid_nr_ns(task, NULL);
> +	kinfo.tgid = task_tgid_vnr(task);
> +	kinfo.pid = task_pid_vnr(task);
> +
> +	if (kinfo.pid == 0 || kinfo.tgid == 0 || (kinfo.ppid == 0 && kinfo.pid != 1))
> +		return -ESRCH;
> +
> +	/*
> +	 * If userspace and the kernel have the same struct size it can just
> +	 * be copied. If userspace provides an older struct, only the bits that
> +	 * userspace knows about will be copied. If userspace provides a new
> +	 * struct, only the bits that the kernel knows about will be copied and
> +	 * the size value will be set to the size the kernel knows about.
> +	 */
> +	if (copy_to_user(uinfo, &kinfo, min(usize, sizeof(kinfo))))
> +		return -EFAULT;
> +
> +	return 0;
> +}
> +
>  static long pidfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
>  {
>  	struct task_struct *task __free(put_task) = NULL;
> @@ -122,13 +200,17 @@ static long pidfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
>  	struct ns_common *ns_common = NULL;
>  	struct pid_namespace *pid_ns;
>  
> -	if (arg)
> -		return -EINVAL;
> -
>  	task = get_pid_task(pid, PIDTYPE_PID);
>  	if (!task)
>  		return -ESRCH;
>  
> +	/* Extensible IOCTL that does not open namespace FDs, take a shortcut */
> +	if (_IOC_NR(cmd) == _IOC_NR(PIDFD_GET_INFO))
> +		return pidfd_info(task, cmd, arg);
> +
> +	if (arg)
> +		return -EINVAL;
> +
>  	scoped_guard(task_lock, task) {
>  		nsp = task->nsproxy;
>  		if (nsp)
> diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h
> index 565fc0629fff..d685eeeedc51 100644
> --- a/include/uapi/linux/pidfd.h
> +++ b/include/uapi/linux/pidfd.h
> @@ -16,6 +16,35 @@
>  #define PIDFD_SIGNAL_THREAD_GROUP	(1UL << 1)
>  #define PIDFD_SIGNAL_PROCESS_GROUP	(1UL << 2)
>  
> +/* Flags for pidfd_info. */
> +#define PIDFD_INFO_CGROUPID		(1UL << 0)
> +
> +struct pidfd_info {
> +	/* Let userspace request expensive stuff explictly, and let the kernel
> +	 * indicate whether it knows about it. */
> +	__u64 request_mask;
> +	/*
> +	 * The information contained in the following fields might be stale at the
> +	 * time it is received, as the target process might have exited as soon as
> +	 * the IOCTL was processed, and there is no way to avoid that. However, it
> +	 * is guaranteed that if the call was successful, then the information was
> +	 * correct and referred to the intended process at the time the work was
> +	 * performed. */
> +	__u64 cgroupid;
> +	__u32 pid;
> +	__u32 tgid;
> +	__u32 ppid;
> +	__u32 ruid;
> +	__u32 rgid;
> +	__u32 euid;
> +	__u32 egid;
> +	__u32 suid;
> +	__u32 sgid;
> +	__u32 fsuid;
> +	__u32 fsgid;
> +	__u32 spare0[1];
> +};
> +
>  #define PIDFS_IOCTL_MAGIC 0xFF
>  
>  #define PIDFD_GET_CGROUP_NAMESPACE            _IO(PIDFS_IOCTL_MAGIC, 1)
> @@ -28,5 +57,6 @@
>  #define PIDFD_GET_TIME_FOR_CHILDREN_NAMESPACE _IO(PIDFS_IOCTL_MAGIC, 8)
>  #define PIDFD_GET_USER_NAMESPACE              _IO(PIDFS_IOCTL_MAGIC, 9)
>  #define PIDFD_GET_UTS_NAMESPACE               _IO(PIDFS_IOCTL_MAGIC, 10)
> +#define PIDFD_GET_INFO                        _IOWR(PIDFS_IOCTL_MAGIC, 11, struct pidfd_info)
>  
>  #endif /* _UAPI_LINUX_PIDFD_H */
> diff --git a/tools/testing/selftests/pidfd/pidfd_open_test.c b/tools/testing/selftests/pidfd/pidfd_open_test.c
> index c62564c264b1..b2a8cfb19a74 100644
> --- a/tools/testing/selftests/pidfd/pidfd_open_test.c
> +++ b/tools/testing/selftests/pidfd/pidfd_open_test.c
> @@ -13,6 +13,7 @@
>  #include <stdlib.h>
>  #include <string.h>
>  #include <syscall.h>
> +#include <sys/ioctl.h>
>  #include <sys/mount.h>
>  #include <sys/prctl.h>
>  #include <sys/wait.h>
> @@ -21,6 +22,34 @@
>  #include "pidfd.h"
>  #include "../kselftest.h"
>  
> +#ifndef PIDFS_IOCTL_MAGIC
> +#define PIDFS_IOCTL_MAGIC 0xFF
> +#endif
> +
> +#ifndef PIDFD_GET_INFO
> +#define PIDFD_GET_INFO _IOWR(PIDFS_IOCTL_MAGIC, 11, struct pidfd_info)
> +#define PIDFD_INFO_CGROUPID		(1UL << 0)
> +
> +struct pidfd_info {
> +	/* Let userspace request expensive stuff explictly, and let the kernel
> +	 * indicate whether it knows about it. */
> +	__u64 request_mask;
> +	__u64 cgroupid;
> +	__u32 pid;
> +	__u32 tgid;
> +	__u32 ppid;
> +	__u32 ruid;
> +	__u32 rgid;
> +	__u32 euid;
> +	__u32 egid;
> +	__u32 suid;
> +	__u32 sgid;
> +	__u32 fsuid;
> +	__u32 fsgid;
> +	__u32 spare0[1];
> +};
> +#endif
> +
>  static int safe_int(const char *numstr, int *converted)
>  {
>  	char *err = NULL;
> @@ -120,10 +149,13 @@ static pid_t get_pid_from_fdinfo_file(int pidfd, const char *key, size_t keylen)
>  
>  int main(int argc, char **argv)
>  {
> +	struct pidfd_info info = {
> +		.request_mask = PIDFD_INFO_CGROUPID,
> +	};
>  	int pidfd = -1, ret = 1;
>  	pid_t pid;
>  
> -	ksft_set_plan(3);
> +	ksft_set_plan(4);
>  
>  	pidfd = sys_pidfd_open(-1, 0);
>  	if (pidfd >= 0) {
> @@ -153,6 +185,52 @@ int main(int argc, char **argv)
>  	pid = get_pid_from_fdinfo_file(pidfd, "Pid:", sizeof("Pid:") - 1);
>  	ksft_print_msg("pidfd %d refers to process with pid %d\n", pidfd, pid);
>  
> +	if (ioctl(pidfd, PIDFD_GET_INFO, &info) < 0) {
> +		ksft_print_msg("%s - failed to get info from pidfd\n", strerror(errno));
> +		goto on_error;
> +	}
> +	if (info.pid != pid) {
> +		ksft_print_msg("pid from fdinfo file %d does not match pid from ioctl %d\n",
> +			       pid, info.pid);
> +		goto on_error;
> +	}
> +	if (info.ppid != getppid()) {
> +		ksft_print_msg("ppid %d does not match ppid from ioctl %d\n",
> +			       pid, info.pid);
> +		goto on_error;
> +	}
> +	if (info.ruid != getuid()) {
> +		ksft_print_msg("uid %d does not match uid from ioctl %d\n",
> +			       getuid(), info.ruid);
> +		goto on_error;
> +	}
> +	if (info.rgid != getgid()) {
> +		ksft_print_msg("gid %d does not match gid from ioctl %d\n",
> +			       getgid(), info.rgid);
> +		goto on_error;
> +	}
> +	if (info.euid != geteuid()) {
> +		ksft_print_msg("euid %d does not match euid from ioctl %d\n",
> +			       geteuid(), info.euid);
> +		goto on_error;
> +	}
> +	if (info.egid != getegid()) {
> +		ksft_print_msg("egid %d does not match egid from ioctl %d\n",
> +			       getegid(), info.egid);
> +		goto on_error;
> +	}
> +	if (info.suid != geteuid()) {
> +		ksft_print_msg("suid %d does not match suid from ioctl %d\n",
> +			       geteuid(), info.suid);
> +		goto on_error;
> +	}
> +	if (info.sgid != getegid()) {
> +		ksft_print_msg("sgid %d does not match sgid from ioctl %d\n",
> +			       getegid(), info.sgid);
> +		goto on_error;
> +	}
> +	ksft_test_result_pass("get info from pidfd test: passed\n");
> +
>  	ret = 0;
>  
>  on_error:
> 
> base-commit: 8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b
> -- 
> 2.45.2
>

Luca Boccassi Oct. 8, 2024, 1:10 p.m. UTC | #2

On Tue, 8 Oct 2024 at 14:06, Christian Brauner <brauner@kernel.org> wrote:
>
> On Tue, Oct 08, 2024 at 01:18:20PM GMT, luca.boccassi@gmail.com wrote:
> > From: Luca Boccassi <luca.boccassi@gmail.com>
> >
> > A common pattern when using pid fds is having to get information
> > about the process, which currently requires /proc being mounted,
> > resolving the fd to a pid, and then do manual string parsing of
> > /proc/N/status and friends. This needs to be reimplemented over
> > and over in all userspace projects (e.g.: I have reimplemented
> > resolving in systemd, dbus, dbus-daemon, polkit so far), and
> > requires additional care in checking that the fd is still valid
> > after having parsed the data, to avoid races.
> >
> > Having a programmatic API that can be used directly removes all
> > these requirements, including having /proc mounted.
> >
> > As discussed at LPC24, add an ioctl with an extensible struct
> > so that more parameters can be added later if needed. Start with
> > returning pid/tgid/ppid and creds unconditionally, and cgroupid
> > optionally.
> >
> > Signed-off-by: Luca Boccassi <luca.boccassi@gmail.com>
> > ---
> > v9: drop result_mask and reuse request_mask instead
> > v8: use RAII guard for rcu, call put_cred()
> > v7: fix RCU issue and style issue introduced by v6 found by reviewer
> > v6: use rcu_read_lock() when fetching cgroupid, use task_ppid_nr_ns() to
> >     get the ppid, return ESCHR if any of pid/tgid/ppid are 0 at the end
> >     of the call to avoid providing incomplete data, document what the
> >     callers should expect
> > v5: check again that the task hasn't exited immediately before copying
> >     the result out to userspace, to ensure we are not returning stale data
> >     add an ifdef around the cgroup structs usage to fix build errors when
> >     the feature is disabled
> > v4: fix arg check in pidfd_ioctl() by moving it after the new call
> > v3: switch from pid_vnr() to task_pid_vnr()
> > v2: Apply comments from Christian, apart from the one about pid namespaces
> >     as I need additional hints on how to implement it.
> >     Drop the security_context string as it is not the appropriate
> >     metadata to give userspace these days.
> >
> >  fs/pidfs.c                                    | 88 ++++++++++++++++++-
> >  include/uapi/linux/pidfd.h                    | 30 +++++++
> >  .../testing/selftests/pidfd/pidfd_open_test.c | 80 ++++++++++++++++-
> >  3 files changed, 194 insertions(+), 4 deletions(-)
> >
> > diff --git a/fs/pidfs.c b/fs/pidfs.c
> > index 80675b6bf884..15cdc7fe4968 100644
> > --- a/fs/pidfs.c
> > +++ b/fs/pidfs.c
> > @@ -2,6 +2,7 @@
> >  #include <linux/anon_inodes.h>
> >  #include <linux/file.h>
> >  #include <linux/fs.h>
> > +#include <linux/cgroup.h>
> >  #include <linux/magic.h>
> >  #include <linux/mount.h>
> >  #include <linux/pid.h>
> > @@ -114,6 +115,83 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
> >       return poll_flags;
> >  }
> >
> > +static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long arg)
> > +{
> > +     struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg;
> > +     size_t usize = _IOC_SIZE(cmd);
> > +     struct pidfd_info kinfo = {};
> > +     struct user_namespace *user_ns;
> > +     const struct cred *c;
> > +     __u64 request_mask;
> > +
> > +     if (!uinfo)
> > +             return -EINVAL;
> > +     if (usize < sizeof(struct pidfd_info))
> > +             return -EINVAL; /* First version, no smaller struct possible */
> > +
> > +     if (copy_from_user(&request_mask, &uinfo->request_mask, sizeof(request_mask)))
> > +             return -EFAULT;
> > +
> > +     c = get_task_cred(task);
> > +     if (!c)
> > +             return -ESRCH;
> > +
> > +     /* Unconditionally return identifiers and credentials, the rest only on request */
> > +
> > +     user_ns = current_user_ns();
> > +     kinfo.ruid = from_kuid_munged(user_ns, c->uid);
> > +     kinfo.rgid = from_kgid_munged(user_ns, c->gid);
> > +     kinfo.euid = from_kuid_munged(user_ns, c->euid);
> > +     kinfo.egid = from_kgid_munged(user_ns, c->egid);
> > +     kinfo.suid = from_kuid_munged(user_ns, c->suid);
> > +     kinfo.sgid = from_kgid_munged(user_ns, c->sgid);
> > +     kinfo.fsuid = from_kuid_munged(user_ns, c->fsuid);
> > +     kinfo.fsgid = from_kgid_munged(user_ns, c->fsgid);
> > +     put_cred(c);
> > +
> > +#ifdef CONFIG_CGROUPS
> > +     if (request_mask & PIDFD_INFO_CGROUPID) {
> > +             struct cgroup *cgrp;
> > +
> > +             guard(rcu)();
> > +             cgrp = task_cgroup(task, pids_cgrp_id);
> > +             if (!cgrp)
> > +                     return -ENODEV;
>
> Afaict this means that the task has already exited. In other words, the
> cgroup id cannot be retrieved anymore for a task that has exited but not
> been reaped. Frankly, I would have expected the cgroup id to be
> retrievable until the task has been reaped but that's another
> discussion.
>
> My point is if you contrast this with the other information in here: If
> the task has exited but hasn't been reaped then you can still get
> credentials such as *uid/*gid, and pid namespace relative information
> such as pid/tgid/ppid.
>
> So really, I would argue that you don't want to fail this but only
> report 0 here. That's me working under the assumption that cgroup ids
> start from 1...
>
> /me checks
>
> Yes, they start from 1 so 0 is invalid.
>
> > +             kinfo.cgroupid = cgroup_id(cgrp);
>
> Fwiw, it looks like getting the cgroup id is basically just
> dereferencing pointers without having to hold any meaningful locks. So
> it should be fast. So making it unconditional seems fine to me.

There's an ifdef since it's an optional kernel feature, and there's
also this thing that we might not have it, so I'd rather keep the
flag, and set it only if we can get the information (instead of
failing). As a user that seems clearer to me.

Christian Brauner Oct. 8, 2024, 1:23 p.m. UTC | #3

> > +#ifdef CONFIG_CGROUPS
> > +	if (request_mask & PIDFD_INFO_CGROUPID) {
> > +		struct cgroup *cgrp;
> > +
> > +		guard(rcu)();
> > +		cgrp = task_cgroup(task, pids_cgrp_id);
> > +		if (!cgrp)
> > +			return -ENODEV;
> 
> Afaict this means that the task has already exited. In other words, the
> cgroup id cannot be retrieved anymore for a task that has exited but not
> been reaped. Frankly, I would have expected the cgroup id to be
> retrievable until the task has been reaped but that's another
> discussion.
> 
> My point is if you contrast this with the other information in here: If
> the task has exited but hasn't been reaped then you can still get
> credentials such as *uid/*gid, and pid namespace relative information
> such as pid/tgid/ppid.

Related to this and I just want to dump this idea somewhere:

I'm aware that it is often desirable or useful to have information about
a task around even after the task has exited and been reaped.

The exit status comes to mind but maybe there's other stuff that would
be useful to have.

Since we changed to pidfs we know that all pidfds no matter if they
point to the same struct file (e.g., dup()) or to multiple struct files
(e.g., multiple pidfd_open() on the same pid) all point to the same
dentry and inode. Which is why we switched to stashing struct pid in
inode->i_private.

So we could easily do something like this:

// SKETCH SKETCH SKETCH
diff --git a/fs/pidfs.c b/fs/pidfs.c
index 7ffdc88dfb52..eeeb907f4889 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -344,9 +344,24 @@ static const struct dentry_operations pidfs_dentry_operations = {
        .d_prune        = stashed_dentry_prune,
 };

+struct pidfd_kinfo {
+       // We could even move this back to file->private_data to avoid the
+       // additional pointer deref though I doubt it matters.
+       struct pid *pid;
+       int exit_status;
+       // other stuff;
+};
+
 static int pidfs_init_inode(struct inode *inode, void *data)
 {
-       inode->i_private = data;
+       struct pidfd_kinfo *kinfo;
+
+       kinfo = kzalloc(sizeof(*info), GFP_KERNEL_ACCOUNT);
+       if (!kinfo)
+               return -ENOMEM;
+
+       kinfo->pid = data;
+       inode->i_private = kinfo;
        inode->i_flags |= S_PRIVATE;
        inode->i_mode |= S_IRWXU;
        inode->i_op = &pidfs_inode_operations;

and that enables us to persist information past task exit so that as
long as you hold the pidfd you can e.g., query for the exit state of the
task or something.

I'm mostly thinking out loud but I think that could be potentially
interesting.

Jonathan Corbet Oct. 9, 2024, 7:20 p.m. UTC | #4

luca.boccassi@gmail.com writes:

> As discussed at LPC24, add an ioctl with an extensible struct
> so that more parameters can be added later if needed. Start with
> returning pid/tgid/ppid and creds unconditionally, and cgroupid
> optionally.

I was looking this over, and a couple of questions came to mind...

> Signed-off-by: Luca Boccassi <luca.boccassi@gmail.com>
> ---

[...]

> diff --git a/fs/pidfs.c b/fs/pidfs.c
> index 80675b6bf884..15cdc7fe4968 100644
> --- a/fs/pidfs.c
> +++ b/fs/pidfs.c
> @@ -2,6 +2,7 @@
>  #include <linux/anon_inodes.h>
>  #include <linux/file.h>
>  #include <linux/fs.h>
> +#include <linux/cgroup.h>
>  #include <linux/magic.h>
>  #include <linux/mount.h>
>  #include <linux/pid.h>
> @@ -114,6 +115,83 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
>  	return poll_flags;
>  }
>  
> +static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long arg)
> +{
> +	struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg;
> +	size_t usize = _IOC_SIZE(cmd);
> +	struct pidfd_info kinfo = {};
> +	struct user_namespace *user_ns;
> +	const struct cred *c;
> +	__u64 request_mask;
> +
> +	if (!uinfo)
> +		return -EINVAL;
> +	if (usize < sizeof(struct pidfd_info))
> +		return -EINVAL; /* First version, no smaller struct possible */
> +
> +	if (copy_from_user(&request_mask, &uinfo->request_mask, sizeof(request_mask)))
> +		return -EFAULT;

You don't check request_mask for unrecognized flags, so user space will
not get an error if it puts random gunk there.  That, in turn, can make
it harder to add new options in the future.

> +	c = get_task_cred(task);
> +	if (!c)
> +		return -ESRCH;

[...]

> +
> +	/*
> +	 * If userspace and the kernel have the same struct size it can just
> +	 * be copied. If userspace provides an older struct, only the bits that
> +	 * userspace knows about will be copied. If userspace provides a new
> +	 * struct, only the bits that the kernel knows about will be copied and
> +	 * the size value will be set to the size the kernel knows about.
> +	 */
> +	if (copy_to_user(uinfo, &kinfo, min(usize, sizeof(kinfo))))
> +		return -EFAULT;

Which "size value" are you referring to here; I can't see it.

If user space has a bigger struct, should you perhaps zero-fill the part
the kernel doesn't know about?

> +	return 0;
> +}

Thanks,

jon

Aleksa Sarai Oct. 9, 2024, 8:50 p.m. UTC | #5

On 2024-10-08, luca.boccassi@gmail.com <luca.boccassi@gmail.com> wrote:
> From: Luca Boccassi <luca.boccassi@gmail.com>
> 
> A common pattern when using pid fds is having to get information
> about the process, which currently requires /proc being mounted,
> resolving the fd to a pid, and then do manual string parsing of
> /proc/N/status and friends. This needs to be reimplemented over
> and over in all userspace projects (e.g.: I have reimplemented
> resolving in systemd, dbus, dbus-daemon, polkit so far), and
> requires additional care in checking that the fd is still valid
> after having parsed the data, to avoid races.
> 
> Having a programmatic API that can be used directly removes all
> these requirements, including having /proc mounted.
> 
> As discussed at LPC24, add an ioctl with an extensible struct
> so that more parameters can be added later if needed. Start with
> returning pid/tgid/ppid and creds unconditionally, and cgroupid
> optionally.
> 
> Signed-off-by: Luca Boccassi <luca.boccassi@gmail.com>
> ---
> v9: drop result_mask and reuse request_mask instead
> v8: use RAII guard for rcu, call put_cred()
> v7: fix RCU issue and style issue introduced by v6 found by reviewer
> v6: use rcu_read_lock() when fetching cgroupid, use task_ppid_nr_ns() to
>     get the ppid, return ESCHR if any of pid/tgid/ppid are 0 at the end
>     of the call to avoid providing incomplete data, document what the
>     callers should expect
> v5: check again that the task hasn't exited immediately before copying
>     the result out to userspace, to ensure we are not returning stale data
>     add an ifdef around the cgroup structs usage to fix build errors when
>     the feature is disabled
> v4: fix arg check in pidfd_ioctl() by moving it after the new call
> v3: switch from pid_vnr() to task_pid_vnr()
> v2: Apply comments from Christian, apart from the one about pid namespaces
>     as I need additional hints on how to implement it.
>     Drop the security_context string as it is not the appropriate
>     metadata to give userspace these days.
> 
>  fs/pidfs.c                                    | 88 ++++++++++++++++++-
>  include/uapi/linux/pidfd.h                    | 30 +++++++
>  .../testing/selftests/pidfd/pidfd_open_test.c | 80 ++++++++++++++++-
>  3 files changed, 194 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/pidfs.c b/fs/pidfs.c
> index 80675b6bf884..15cdc7fe4968 100644
> --- a/fs/pidfs.c
> +++ b/fs/pidfs.c
> @@ -2,6 +2,7 @@
>  #include <linux/anon_inodes.h>
>  #include <linux/file.h>
>  #include <linux/fs.h>
> +#include <linux/cgroup.h>
>  #include <linux/magic.h>
>  #include <linux/mount.h>
>  #include <linux/pid.h>
> @@ -114,6 +115,83 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
>  	return poll_flags;
>  }
>  
> +static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long arg)
> +{
> +	struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg;
> +	size_t usize = _IOC_SIZE(cmd);
> +	struct pidfd_info kinfo = {};
> +	struct user_namespace *user_ns;
> +	const struct cred *c;
> +	__u64 request_mask;
> +
> +	if (!uinfo)
> +		return -EINVAL;
> +	if (usize < sizeof(struct pidfd_info))
> +		return -EINVAL; /* First version, no smaller struct possible */
> +
> +	if (copy_from_user(&request_mask, &uinfo->request_mask, sizeof(request_mask)))
> +		return -EFAULT;
> +
> +	c = get_task_cred(task);
> +	if (!c)
> +		return -ESRCH;
> +
> +	/* Unconditionally return identifiers and credentials, the rest only on request */
> +
> +	user_ns = current_user_ns();
> +	kinfo.ruid = from_kuid_munged(user_ns, c->uid);
> +	kinfo.rgid = from_kgid_munged(user_ns, c->gid);
> +	kinfo.euid = from_kuid_munged(user_ns, c->euid);
> +	kinfo.egid = from_kgid_munged(user_ns, c->egid);
> +	kinfo.suid = from_kuid_munged(user_ns, c->suid);
> +	kinfo.sgid = from_kgid_munged(user_ns, c->sgid);
> +	kinfo.fsuid = from_kuid_munged(user_ns, c->fsuid);
> +	kinfo.fsgid = from_kgid_munged(user_ns, c->fsgid);
> +	put_cred(c);
> +
> +#ifdef CONFIG_CGROUPS
> +	if (request_mask & PIDFD_INFO_CGROUPID) {
> +		struct cgroup *cgrp;
> +
> +		guard(rcu)();
> +		cgrp = task_cgroup(task, pids_cgrp_id);
> +		if (!cgrp)
> +			return -ENODEV;
> +		kinfo.cgroupid = cgroup_id(cgrp);
> +
> +		kinfo.request_mask |= PIDFD_INFO_CGROUPID;
> +	}
> +#endif
> +
> +	/*
> +	 * Copy pid/tgid last, to reduce the chances the information might be
> +	 * stale. Note that it is not possible to ensure it will be valid as the
> +	 * task might return as soon as the copy_to_user finishes, but that's ok
> +	 * and userspace expects that might happen and can act accordingly, so
> +	 * this is just best-effort. What we can do however is checking that all
> +	 * the fields are set correctly, or return ESRCH to avoid providing
> +	 * incomplete information. */
> +
> +	kinfo.ppid = task_ppid_nr_ns(task, NULL);
> +	kinfo.tgid = task_tgid_vnr(task);
> +	kinfo.pid = task_pid_vnr(task);
> +
> +	if (kinfo.pid == 0 || kinfo.tgid == 0 || (kinfo.ppid == 0 && kinfo.pid != 1))
> +		return -ESRCH;
> +
> +	/*
> +	 * If userspace and the kernel have the same struct size it can just
> +	 * be copied. If userspace provides an older struct, only the bits that
> +	 * userspace knows about will be copied. If userspace provides a new
> +	 * struct, only the bits that the kernel knows about will be copied and
> +	 * the size value will be set to the size the kernel knows about.
> +	 */
> +	if (copy_to_user(uinfo, &kinfo, min(usize, sizeof(kinfo))))
> +		return -EFAULT;

If usize > ksize, we also want to clear_user() the trailing bytes to
avoid userspace thinking that any garbage bytes they had are valid.

Also, you mention "the size value" but there is no size in pidfd_info. I
don't think it's actually necessary to include such a field (especially
when you have a statx-like request_mask), but it means you really should
clear the trailing bytes to avoid userspace bugs.

I implemented all of these semantics as copy_struct_to_user() in the
CHECK_FIELDS patch I sent a few weeks ago (I just sent v3[1]). Maybe you
can cherry-pick this patch and use it? The semantics when we extend this
pidfd_info to accept new request_mask values with larger structures is
going to get a little ugly and copy_struct_to_user() makes this a little
easier to deal with.

[1]: https://lore.kernel.org/all/20241010-extensible-structs-check_fields-v3-1-d2833dfe6edd@cyphar.com/

> +
> +	return 0;
> +}
> +
>  static long pidfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
>  {
>  	struct task_struct *task __free(put_task) = NULL;
> @@ -122,13 +200,17 @@ static long pidfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
>  	struct ns_common *ns_common = NULL;
>  	struct pid_namespace *pid_ns;
>  
> -	if (arg)
> -		return -EINVAL;
> -
>  	task = get_pid_task(pid, PIDTYPE_PID);
>  	if (!task)
>  		return -ESRCH;
>  
> +	/* Extensible IOCTL that does not open namespace FDs, take a shortcut */
> +	if (_IOC_NR(cmd) == _IOC_NR(PIDFD_GET_INFO))
> +		return pidfd_info(task, cmd, arg);
> +
> +	if (arg)
> +		return -EINVAL;
> +
>  	scoped_guard(task_lock, task) {
>  		nsp = task->nsproxy;
>  		if (nsp)
> diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h
> index 565fc0629fff..d685eeeedc51 100644
> --- a/include/uapi/linux/pidfd.h
> +++ b/include/uapi/linux/pidfd.h
> @@ -16,6 +16,35 @@
>  #define PIDFD_SIGNAL_THREAD_GROUP	(1UL << 1)
>  #define PIDFD_SIGNAL_PROCESS_GROUP	(1UL << 2)
>  
> +/* Flags for pidfd_info. */
> +#define PIDFD_INFO_CGROUPID		(1UL << 0)

While it isn't strictly necessary, maybe we should provide some
always-set bits like statx does? While they would always be set, it
might incentivise programs to write code that checks if the request_mask
bits are set after the ioctl(2) returns from the outset. Then again,
PIDFD_INFO_CGROUPID is probably enough to justify writing code correctly
from the outset.

> +
> +struct pidfd_info {
> +	/* Let userspace request expensive stuff explictly, and let the kernel
> +	 * indicate whether it knows about it. */

I would prefer a slightly more informative comment (which mentions that
this will also be used for extensibility), something like:


/*
 * This mask is similar to the request_mask in statx(2).
 *
 * Userspace indicates what extensions or expensive-to-calculate fields
 * they want by setting the corresponding bits in request_mask.
 *
 * When filling the structure, the kernel will only set bits
 * corresponding to the fields that were actually filled by the kernel.
 * This also includes any future extensions that might be automatically
 * filled. If the structure size is too small to contain a field
 * (requested or not), to avoid confusion the request_mask will not
 * contain a bit for that field.
 *
 * As such, userspace MUST verify that request_mask contains the
 * corresponding flags after the ioctl(2) returns to ensure that it is
 * using valid data.
 */


The bit about request_mask not containing a bit if the structure size is
too small is one of the things that copy_struct_to_user() helps with
implementing (see the patch for some more docs).

> +	__u64 request_mask;
> +	/*
> +	 * The information contained in the following fields might be stale at the
> +	 * time it is received, as the target process might have exited as soon as
> +	 * the IOCTL was processed, and there is no way to avoid that. However, it
> +	 * is guaranteed that if the call was successful, then the information was
> +	 * correct and referred to the intended process at the time the work was
> +	 * performed. */
> +	__u64 cgroupid;
> +	__u32 pid;
> +	__u32 tgid;
> +	__u32 ppid;
> +	__u32 ruid;
> +	__u32 rgid;
> +	__u32 euid;
> +	__u32 egid;
> +	__u32 suid;
> +	__u32 sgid;
> +	__u32 fsuid;
> +	__u32 fsgid;
> +	__u32 spare0[1];
> +};
> +
>  #define PIDFS_IOCTL_MAGIC 0xFF
>  
>  #define PIDFD_GET_CGROUP_NAMESPACE            _IO(PIDFS_IOCTL_MAGIC, 1)
> @@ -28,5 +57,6 @@
>  #define PIDFD_GET_TIME_FOR_CHILDREN_NAMESPACE _IO(PIDFS_IOCTL_MAGIC, 8)
>  #define PIDFD_GET_USER_NAMESPACE              _IO(PIDFS_IOCTL_MAGIC, 9)
>  #define PIDFD_GET_UTS_NAMESPACE               _IO(PIDFS_IOCTL_MAGIC, 10)
> +#define PIDFD_GET_INFO                        _IOWR(PIDFS_IOCTL_MAGIC, 11, struct pidfd_info)
>  
>  #endif /* _UAPI_LINUX_PIDFD_H */
> diff --git a/tools/testing/selftests/pidfd/pidfd_open_test.c b/tools/testing/selftests/pidfd/pidfd_open_test.c
> index c62564c264b1..b2a8cfb19a74 100644
> --- a/tools/testing/selftests/pidfd/pidfd_open_test.c
> +++ b/tools/testing/selftests/pidfd/pidfd_open_test.c
> @@ -13,6 +13,7 @@
>  #include <stdlib.h>
>  #include <string.h>
>  #include <syscall.h>
> +#include <sys/ioctl.h>
>  #include <sys/mount.h>
>  #include <sys/prctl.h>
>  #include <sys/wait.h>
> @@ -21,6 +22,34 @@
>  #include "pidfd.h"
>  #include "../kselftest.h"
>  
> +#ifndef PIDFS_IOCTL_MAGIC
> +#define PIDFS_IOCTL_MAGIC 0xFF
> +#endif
> +
> +#ifndef PIDFD_GET_INFO
> +#define PIDFD_GET_INFO _IOWR(PIDFS_IOCTL_MAGIC, 11, struct pidfd_info)
> +#define PIDFD_INFO_CGROUPID		(1UL << 0)
> +
> +struct pidfd_info {
> +	/* Let userspace request expensive stuff explictly, and let the kernel
> +	 * indicate whether it knows about it. */
> +	__u64 request_mask;
> +	__u64 cgroupid;
> +	__u32 pid;
> +	__u32 tgid;
> +	__u32 ppid;
> +	__u32 ruid;
> +	__u32 rgid;
> +	__u32 euid;
> +	__u32 egid;
> +	__u32 suid;
> +	__u32 sgid;
> +	__u32 fsuid;
> +	__u32 fsgid;
> +	__u32 spare0[1];
> +};
> +#endif
> +
>  static int safe_int(const char *numstr, int *converted)
>  {
>  	char *err = NULL;
> @@ -120,10 +149,13 @@ static pid_t get_pid_from_fdinfo_file(int pidfd, const char *key, size_t keylen)
>  
>  int main(int argc, char **argv)
>  {
> +	struct pidfd_info info = {
> +		.request_mask = PIDFD_INFO_CGROUPID,
> +	};
>  	int pidfd = -1, ret = 1;
>  	pid_t pid;
>  
> -	ksft_set_plan(3);
> +	ksft_set_plan(4);
>  
>  	pidfd = sys_pidfd_open(-1, 0);
>  	if (pidfd >= 0) {
> @@ -153,6 +185,52 @@ int main(int argc, char **argv)
>  	pid = get_pid_from_fdinfo_file(pidfd, "Pid:", sizeof("Pid:") - 1);
>  	ksft_print_msg("pidfd %d refers to process with pid %d\n", pidfd, pid);
>  
> +	if (ioctl(pidfd, PIDFD_GET_INFO, &info) < 0) {
> +		ksft_print_msg("%s - failed to get info from pidfd\n", strerror(errno));
> +		goto on_error;
> +	}
> +	if (info.pid != pid) {
> +		ksft_print_msg("pid from fdinfo file %d does not match pid from ioctl %d\n",
> +			       pid, info.pid);
> +		goto on_error;
> +	}
> +	if (info.ppid != getppid()) {
> +		ksft_print_msg("ppid %d does not match ppid from ioctl %d\n",
> +			       pid, info.pid);
> +		goto on_error;
> +	}
> +	if (info.ruid != getuid()) {
> +		ksft_print_msg("uid %d does not match uid from ioctl %d\n",
> +			       getuid(), info.ruid);
> +		goto on_error;
> +	}
> +	if (info.rgid != getgid()) {
> +		ksft_print_msg("gid %d does not match gid from ioctl %d\n",
> +			       getgid(), info.rgid);
> +		goto on_error;
> +	}
> +	if (info.euid != geteuid()) {
> +		ksft_print_msg("euid %d does not match euid from ioctl %d\n",
> +			       geteuid(), info.euid);
> +		goto on_error;
> +	}
> +	if (info.egid != getegid()) {
> +		ksft_print_msg("egid %d does not match egid from ioctl %d\n",
> +			       getegid(), info.egid);
> +		goto on_error;
> +	}
> +	if (info.suid != geteuid()) {
> +		ksft_print_msg("suid %d does not match suid from ioctl %d\n",
> +			       geteuid(), info.suid);
> +		goto on_error;
> +	}
> +	if (info.sgid != getegid()) {
> +		ksft_print_msg("sgid %d does not match sgid from ioctl %d\n",
> +			       getegid(), info.sgid);
> +		goto on_error;
> +	}
> +	ksft_test_result_pass("get info from pidfd test: passed\n");
> +
>  	ret = 0;
>  
>  on_error:
> 
> base-commit: 8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b
> -- 
> 2.45.2
> 
>

Aleksa Sarai Oct. 9, 2024, 8:52 p.m. UTC | #6

On 2024-10-08, Luca Boccassi <luca.boccassi@gmail.com> wrote:
> On Tue, 8 Oct 2024 at 14:06, Christian Brauner <brauner@kernel.org> wrote:
> >
> > On Tue, Oct 08, 2024 at 01:18:20PM GMT, luca.boccassi@gmail.com wrote:
> > > From: Luca Boccassi <luca.boccassi@gmail.com>
> > >
> > > A common pattern when using pid fds is having to get information
> > > about the process, which currently requires /proc being mounted,
> > > resolving the fd to a pid, and then do manual string parsing of
> > > /proc/N/status and friends. This needs to be reimplemented over
> > > and over in all userspace projects (e.g.: I have reimplemented
> > > resolving in systemd, dbus, dbus-daemon, polkit so far), and
> > > requires additional care in checking that the fd is still valid
> > > after having parsed the data, to avoid races.
> > >
> > > Having a programmatic API that can be used directly removes all
> > > these requirements, including having /proc mounted.
> > >
> > > As discussed at LPC24, add an ioctl with an extensible struct
> > > so that more parameters can be added later if needed. Start with
> > > returning pid/tgid/ppid and creds unconditionally, and cgroupid
> > > optionally.
> > >
> > > Signed-off-by: Luca Boccassi <luca.boccassi@gmail.com>
> > > ---
> > > v9: drop result_mask and reuse request_mask instead
> > > v8: use RAII guard for rcu, call put_cred()
> > > v7: fix RCU issue and style issue introduced by v6 found by reviewer
> > > v6: use rcu_read_lock() when fetching cgroupid, use task_ppid_nr_ns() to
> > >     get the ppid, return ESCHR if any of pid/tgid/ppid are 0 at the end
> > >     of the call to avoid providing incomplete data, document what the
> > >     callers should expect
> > > v5: check again that the task hasn't exited immediately before copying
> > >     the result out to userspace, to ensure we are not returning stale data
> > >     add an ifdef around the cgroup structs usage to fix build errors when
> > >     the feature is disabled
> > > v4: fix arg check in pidfd_ioctl() by moving it after the new call
> > > v3: switch from pid_vnr() to task_pid_vnr()
> > > v2: Apply comments from Christian, apart from the one about pid namespaces
> > >     as I need additional hints on how to implement it.
> > >     Drop the security_context string as it is not the appropriate
> > >     metadata to give userspace these days.
> > >
> > >  fs/pidfs.c                                    | 88 ++++++++++++++++++-
> > >  include/uapi/linux/pidfd.h                    | 30 +++++++
> > >  .../testing/selftests/pidfd/pidfd_open_test.c | 80 ++++++++++++++++-
> > >  3 files changed, 194 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/fs/pidfs.c b/fs/pidfs.c
> > > index 80675b6bf884..15cdc7fe4968 100644
> > > --- a/fs/pidfs.c
> > > +++ b/fs/pidfs.c
> > > @@ -2,6 +2,7 @@
> > >  #include <linux/anon_inodes.h>
> > >  #include <linux/file.h>
> > >  #include <linux/fs.h>
> > > +#include <linux/cgroup.h>
> > >  #include <linux/magic.h>
> > >  #include <linux/mount.h>
> > >  #include <linux/pid.h>
> > > @@ -114,6 +115,83 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
> > >       return poll_flags;
> > >  }
> > >
> > > +static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long arg)
> > > +{
> > > +     struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg;
> > > +     size_t usize = _IOC_SIZE(cmd);
> > > +     struct pidfd_info kinfo = {};
> > > +     struct user_namespace *user_ns;
> > > +     const struct cred *c;
> > > +     __u64 request_mask;
> > > +
> > > +     if (!uinfo)
> > > +             return -EINVAL;
> > > +     if (usize < sizeof(struct pidfd_info))
> > > +             return -EINVAL; /* First version, no smaller struct possible */
> > > +
> > > +     if (copy_from_user(&request_mask, &uinfo->request_mask, sizeof(request_mask)))
> > > +             return -EFAULT;
> > > +
> > > +     c = get_task_cred(task);
> > > +     if (!c)
> > > +             return -ESRCH;
> > > +
> > > +     /* Unconditionally return identifiers and credentials, the rest only on request */
> > > +
> > > +     user_ns = current_user_ns();
> > > +     kinfo.ruid = from_kuid_munged(user_ns, c->uid);
> > > +     kinfo.rgid = from_kgid_munged(user_ns, c->gid);
> > > +     kinfo.euid = from_kuid_munged(user_ns, c->euid);
> > > +     kinfo.egid = from_kgid_munged(user_ns, c->egid);
> > > +     kinfo.suid = from_kuid_munged(user_ns, c->suid);
> > > +     kinfo.sgid = from_kgid_munged(user_ns, c->sgid);
> > > +     kinfo.fsuid = from_kuid_munged(user_ns, c->fsuid);
> > > +     kinfo.fsgid = from_kgid_munged(user_ns, c->fsgid);
> > > +     put_cred(c);
> > > +
> > > +#ifdef CONFIG_CGROUPS
> > > +     if (request_mask & PIDFD_INFO_CGROUPID) {
> > > +             struct cgroup *cgrp;
> > > +
> > > +             guard(rcu)();
> > > +             cgrp = task_cgroup(task, pids_cgrp_id);
> > > +             if (!cgrp)
> > > +                     return -ENODEV;
> >
> > Afaict this means that the task has already exited. In other words, the
> > cgroup id cannot be retrieved anymore for a task that has exited but not
> > been reaped. Frankly, I would have expected the cgroup id to be
> > retrievable until the task has been reaped but that's another
> > discussion.
> >
> > My point is if you contrast this with the other information in here: If
> > the task has exited but hasn't been reaped then you can still get
> > credentials such as *uid/*gid, and pid namespace relative information
> > such as pid/tgid/ppid.
> >
> > So really, I would argue that you don't want to fail this but only
> > report 0 here. That's me working under the assumption that cgroup ids
> > start from 1...
> >
> > /me checks
> >
> > Yes, they start from 1 so 0 is invalid.
> >
> > > +             kinfo.cgroupid = cgroup_id(cgrp);
> >
> > Fwiw, it looks like getting the cgroup id is basically just
> > dereferencing pointers without having to hold any meaningful locks. So
> > it should be fast. So making it unconditional seems fine to me.
> 
> There's an ifdef since it's an optional kernel feature, and there's
> also this thing that we might not have it, so I'd rather keep the
> flag, and set it only if we can get the information (instead of
> failing). As a user that seems clearer to me.

I think we should keep the request_mask flag when returning from the
kernel, but it's not necessary for the user to request it explicitly
because it's cheap to get. This is how STATX_MNT_ID works and I think it
makes sense to do it that way.

(Though there are some complications when we add extensions in the
future -- see my other mail about copy_struct_to_user().)

Aleksa Sarai Oct. 9, 2024, 8:56 p.m. UTC | #7

On 2024-10-09, Jonathan Corbet <corbet@lwn.net> wrote:
> luca.boccassi@gmail.com writes:
> 
> > As discussed at LPC24, add an ioctl with an extensible struct
> > so that more parameters can be added later if needed. Start with
> > returning pid/tgid/ppid and creds unconditionally, and cgroupid
> > optionally.
> 
> I was looking this over, and a couple of questions came to mind...
> 
> > Signed-off-by: Luca Boccassi <luca.boccassi@gmail.com>
> > ---
> 
> [...]
> 
> > diff --git a/fs/pidfs.c b/fs/pidfs.c
> > index 80675b6bf884..15cdc7fe4968 100644
> > --- a/fs/pidfs.c
> > +++ b/fs/pidfs.c
> > @@ -2,6 +2,7 @@
> >  #include <linux/anon_inodes.h>
> >  #include <linux/file.h>
> >  #include <linux/fs.h>
> > +#include <linux/cgroup.h>
> >  #include <linux/magic.h>
> >  #include <linux/mount.h>
> >  #include <linux/pid.h>
> > @@ -114,6 +115,83 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
> >  	return poll_flags;
> >  }
> >  
> > +static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long arg)
> > +{
> > +	struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg;
> > +	size_t usize = _IOC_SIZE(cmd);
> > +	struct pidfd_info kinfo = {};
> > +	struct user_namespace *user_ns;
> > +	const struct cred *c;
> > +	__u64 request_mask;
> > +
> > +	if (!uinfo)
> > +		return -EINVAL;
> > +	if (usize < sizeof(struct pidfd_info))
> > +		return -EINVAL; /* First version, no smaller struct possible */
> > +
> > +	if (copy_from_user(&request_mask, &uinfo->request_mask, sizeof(request_mask)))
> > +		return -EFAULT;
> 
> You don't check request_mask for unrecognized flags, so user space will
> not get an error if it puts random gunk there.  That, in turn, can make
> it harder to add new options in the future.

In fairness, this is how statx works and statx does this to not require
syscall retries to figure out what flags the current kernel supports and
instead defers that to stx_mask.

However, I think verifying the value is slightly less fragile -- as long
as we get a cheap way for userspace to check what flags are supported
(such as CHECK_FIELDS[1]). It would kind of suck if userspace would have
to do 50 syscalls to figure out what request_mask values are valid.

[1]: https://lore.kernel.org/all/20241010-extensible-structs-check_fields-v3-0-d2833dfe6edd@cyphar.com/

> 
> > +	c = get_task_cred(task);
> > +	if (!c)
> > +		return -ESRCH;
> 
> [...]
> 
> > +
> > +	/*
> > +	 * If userspace and the kernel have the same struct size it can just
> > +	 * be copied. If userspace provides an older struct, only the bits that
> > +	 * userspace knows about will be copied. If userspace provides a new
> > +	 * struct, only the bits that the kernel knows about will be copied and
> > +	 * the size value will be set to the size the kernel knows about.
> > +	 */
> > +	if (copy_to_user(uinfo, &kinfo, min(usize, sizeof(kinfo))))
> > +		return -EFAULT;
> 
> Which "size value" are you referring to here; I can't see it.
> 
> If user space has a bigger struct, should you perhaps zero-fill the part
> the kernel doesn't know about?
> 
> > +	return 0;
> > +}
> 
> Thanks,
> 
> jon
>

Aleksa Sarai Oct. 9, 2024, 9:06 p.m. UTC | #8

On 2024-10-10, Aleksa Sarai <cyphar@cyphar.com> wrote:
> On 2024-10-09, Jonathan Corbet <corbet@lwn.net> wrote:
> > luca.boccassi@gmail.com writes:
> > 
> > > As discussed at LPC24, add an ioctl with an extensible struct
> > > so that more parameters can be added later if needed. Start with
> > > returning pid/tgid/ppid and creds unconditionally, and cgroupid
> > > optionally.
> > 
> > I was looking this over, and a couple of questions came to mind...
> > 
> > > Signed-off-by: Luca Boccassi <luca.boccassi@gmail.com>
> > > ---
> > 
> > [...]
> > 
> > > diff --git a/fs/pidfs.c b/fs/pidfs.c
> > > index 80675b6bf884..15cdc7fe4968 100644
> > > --- a/fs/pidfs.c
> > > +++ b/fs/pidfs.c
> > > @@ -2,6 +2,7 @@
> > >  #include <linux/anon_inodes.h>
> > >  #include <linux/file.h>
> > >  #include <linux/fs.h>
> > > +#include <linux/cgroup.h>
> > >  #include <linux/magic.h>
> > >  #include <linux/mount.h>
> > >  #include <linux/pid.h>
> > > @@ -114,6 +115,83 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
> > >  	return poll_flags;
> > >  }
> > >  
> > > +static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long arg)
> > > +{
> > > +	struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg;
> > > +	size_t usize = _IOC_SIZE(cmd);
> > > +	struct pidfd_info kinfo = {};
> > > +	struct user_namespace *user_ns;
> > > +	const struct cred *c;
> > > +	__u64 request_mask;
> > > +
> > > +	if (!uinfo)
> > > +		return -EINVAL;
> > > +	if (usize < sizeof(struct pidfd_info))
> > > +		return -EINVAL; /* First version, no smaller struct possible */
> > > +
> > > +	if (copy_from_user(&request_mask, &uinfo->request_mask, sizeof(request_mask)))
> > > +		return -EFAULT;
> > 
> > You don't check request_mask for unrecognized flags, so user space will
> > not get an error if it puts random gunk there.  That, in turn, can make
> > it harder to add new options in the future.
> 
> In fairness, this is how statx works and statx does this to not require
> syscall retries to figure out what flags the current kernel supports and
> instead defers that to stx_mask.
> 
> However, I think verifying the value is slightly less fragile -- as long
> as we get a cheap way for userspace to check what flags are supported
> (such as CHECK_FIELDS[1]). It would kind of suck if userspace would have
> to do 50 syscalls to figure out what request_mask values are valid.

Unfortunately, we probably need to find a different way to do
CHECK_FIELDS for extensible-struct ioctls because CHECK_FIELDS uses the
top bit in a u64 but we can't set a size that large with ioctl
numbers...

Then again, _IOC_SIZEBITS is 14 so we could easily set the ioctl size to
something >PAGE_SIZE fairly easily at least...

> [1]: https://lore.kernel.org/all/20241010-extensible-structs-check_fields-v3-0-d2833dfe6edd@cyphar.com/
> 
> > 
> > > +	c = get_task_cred(task);
> > > +	if (!c)
> > > +		return -ESRCH;
> > 
> > [...]
> > 
> > > +
> > > +	/*
> > > +	 * If userspace and the kernel have the same struct size it can just
> > > +	 * be copied. If userspace provides an older struct, only the bits that
> > > +	 * userspace knows about will be copied. If userspace provides a new
> > > +	 * struct, only the bits that the kernel knows about will be copied and
> > > +	 * the size value will be set to the size the kernel knows about.
> > > +	 */
> > > +	if (copy_to_user(uinfo, &kinfo, min(usize, sizeof(kinfo))))
> > > +		return -EFAULT;
> > 
> > Which "size value" are you referring to here; I can't see it.
> > 
> > If user space has a bigger struct, should you perhaps zero-fill the part
> > the kernel doesn't know about?
> > 
> > > +	return 0;
> > > +}
> > 
> > Thanks,
> > 
> > jon
> > 
> 
> -- 
> Aleksa Sarai
> Senior Software Engineer (Containers)
> SUSE Linux GmbH
> <https://www.cyphar.com/>

Jonathan Corbet Oct. 9, 2024, 10:05 p.m. UTC | #9

Aleksa Sarai <cyphar@cyphar.com> writes:

>> In fairness, this is how statx works and statx does this to not require
>> syscall retries to figure out what flags the current kernel supports and
>> instead defers that to stx_mask.
>> 
>> However, I think verifying the value is slightly less fragile -- as long
>> as we get a cheap way for userspace to check what flags are supported
>> (such as CHECK_FIELDS[1]). It would kind of suck if userspace would have
>> to do 50 syscalls to figure out what request_mask values are valid.
>
> Unfortunately, we probably need to find a different way to do
> CHECK_FIELDS for extensible-struct ioctls because CHECK_FIELDS uses the
> top bit in a u64 but we can't set a size that large with ioctl
> numbers...

Add a separate PIDFD_GET_VALID_REQUEST_MASK ioctl()?

But then I'm bad at designing interfaces...

jon

Christian Brauner Oct. 10, 2024, 9:26 a.m. UTC | #10

> > +	/*
> > +	 * If userspace and the kernel have the same struct size it can just
> > +	 * be copied. If userspace provides an older struct, only the bits that
> > +	 * userspace knows about will be copied. If userspace provides a new
> > +	 * struct, only the bits that the kernel knows about will be copied and
> > +	 * the size value will be set to the size the kernel knows about.
> > +	 */
> > +	if (copy_to_user(uinfo, &kinfo, min(usize, sizeof(kinfo))))
> > +		return -EFAULT;
> 
> Which "size value" are you referring to here; I can't see it.

Luca did just copy my comment from another interface which has a
separate size parameter. This should indeed be fixed.

Christian Brauner Oct. 10, 2024, 9:36 a.m. UTC | #11

On Thu, Oct 10, 2024 at 07:56:53AM +1100, Aleksa Sarai wrote:
> On 2024-10-09, Jonathan Corbet <corbet@lwn.net> wrote:
> > luca.boccassi@gmail.com writes:
> > 
> > > As discussed at LPC24, add an ioctl with an extensible struct
> > > so that more parameters can be added later if needed. Start with
> > > returning pid/tgid/ppid and creds unconditionally, and cgroupid
> > > optionally.
> > 
> > I was looking this over, and a couple of questions came to mind...
> > 
> > > Signed-off-by: Luca Boccassi <luca.boccassi@gmail.com>
> > > ---
> > 
> > [...]
> > 
> > > diff --git a/fs/pidfs.c b/fs/pidfs.c
> > > index 80675b6bf884..15cdc7fe4968 100644
> > > --- a/fs/pidfs.c
> > > +++ b/fs/pidfs.c
> > > @@ -2,6 +2,7 @@
> > >  #include <linux/anon_inodes.h>
> > >  #include <linux/file.h>
> > >  #include <linux/fs.h>
> > > +#include <linux/cgroup.h>
> > >  #include <linux/magic.h>
> > >  #include <linux/mount.h>
> > >  #include <linux/pid.h>
> > > @@ -114,6 +115,83 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
> > >  	return poll_flags;
> > >  }
> > >  
> > > +static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long arg)
> > > +{
> > > +	struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg;
> > > +	size_t usize = _IOC_SIZE(cmd);
> > > +	struct pidfd_info kinfo = {};
> > > +	struct user_namespace *user_ns;
> > > +	const struct cred *c;
> > > +	__u64 request_mask;
> > > +
> > > +	if (!uinfo)
> > > +		return -EINVAL;
> > > +	if (usize < sizeof(struct pidfd_info))
> > > +		return -EINVAL; /* First version, no smaller struct possible */
> > > +
> > > +	if (copy_from_user(&request_mask, &uinfo->request_mask, sizeof(request_mask)))
> > > +		return -EFAULT;
> > 
> > You don't check request_mask for unrecognized flags, so user space will
> > not get an error if it puts random gunk there.  That, in turn, can make
> > it harder to add new options in the future.
> 
> In fairness, this is how statx works and statx does this to not require
> syscall retries to figure out what flags the current kernel supports and
> instead defers that to stx_mask.

pidfd_info overwrites the request_mask with what is supported by the
kernel. I don't think userspace setting random stuff in the request_mask
is a problem. It would already be a problem with statx() and we haven't
seen that so far.

If userspace happens to set a some random bit in the request_mask and
that bit ends up being used a few kernel releases later to e.g.,
retrieve additional information then all that happens is that userspace
would now receive information they didn't need. That's not a problem.

It is of course very different to e.g. adding a random bit in the flag
mask of clone3() or mount_setattr() or any system call that changes
kernel state based on the passed bits. In that case ignoring unknown
bits and then starting to use them is obviously a big problem.

The other related problem would be flag deprecation and reuse of a flag
which (CLONE_DETACHED -> CLONE_PIDFD) also is only a real problem for
system calls that alter kernel state.

So overally, I think ignoring uknown bits in the request mask is safe.
It needs to be documented of course.

Christian Brauner Oct. 10, 2024, 9:38 a.m. UTC | #12

On Thu, Oct 10, 2024 at 07:52:20AM +1100, Aleksa Sarai wrote:
> On 2024-10-08, Luca Boccassi <luca.boccassi@gmail.com> wrote:
> > On Tue, 8 Oct 2024 at 14:06, Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > On Tue, Oct 08, 2024 at 01:18:20PM GMT, luca.boccassi@gmail.com wrote:
> > > > From: Luca Boccassi <luca.boccassi@gmail.com>
> > > >
> > > > A common pattern when using pid fds is having to get information
> > > > about the process, which currently requires /proc being mounted,
> > > > resolving the fd to a pid, and then do manual string parsing of
> > > > /proc/N/status and friends. This needs to be reimplemented over
> > > > and over in all userspace projects (e.g.: I have reimplemented
> > > > resolving in systemd, dbus, dbus-daemon, polkit so far), and
> > > > requires additional care in checking that the fd is still valid
> > > > after having parsed the data, to avoid races.
> > > >
> > > > Having a programmatic API that can be used directly removes all
> > > > these requirements, including having /proc mounted.
> > > >
> > > > As discussed at LPC24, add an ioctl with an extensible struct
> > > > so that more parameters can be added later if needed. Start with
> > > > returning pid/tgid/ppid and creds unconditionally, and cgroupid
> > > > optionally.
> > > >
> > > > Signed-off-by: Luca Boccassi <luca.boccassi@gmail.com>
> > > > ---
> > > > v9: drop result_mask and reuse request_mask instead
> > > > v8: use RAII guard for rcu, call put_cred()
> > > > v7: fix RCU issue and style issue introduced by v6 found by reviewer
> > > > v6: use rcu_read_lock() when fetching cgroupid, use task_ppid_nr_ns() to
> > > >     get the ppid, return ESCHR if any of pid/tgid/ppid are 0 at the end
> > > >     of the call to avoid providing incomplete data, document what the
> > > >     callers should expect
> > > > v5: check again that the task hasn't exited immediately before copying
> > > >     the result out to userspace, to ensure we are not returning stale data
> > > >     add an ifdef around the cgroup structs usage to fix build errors when
> > > >     the feature is disabled
> > > > v4: fix arg check in pidfd_ioctl() by moving it after the new call
> > > > v3: switch from pid_vnr() to task_pid_vnr()
> > > > v2: Apply comments from Christian, apart from the one about pid namespaces
> > > >     as I need additional hints on how to implement it.
> > > >     Drop the security_context string as it is not the appropriate
> > > >     metadata to give userspace these days.
> > > >
> > > >  fs/pidfs.c                                    | 88 ++++++++++++++++++-
> > > >  include/uapi/linux/pidfd.h                    | 30 +++++++
> > > >  .../testing/selftests/pidfd/pidfd_open_test.c | 80 ++++++++++++++++-
> > > >  3 files changed, 194 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/fs/pidfs.c b/fs/pidfs.c
> > > > index 80675b6bf884..15cdc7fe4968 100644
> > > > --- a/fs/pidfs.c
> > > > +++ b/fs/pidfs.c
> > > > @@ -2,6 +2,7 @@
> > > >  #include <linux/anon_inodes.h>
> > > >  #include <linux/file.h>
> > > >  #include <linux/fs.h>
> > > > +#include <linux/cgroup.h>
> > > >  #include <linux/magic.h>
> > > >  #include <linux/mount.h>
> > > >  #include <linux/pid.h>
> > > > @@ -114,6 +115,83 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
> > > >       return poll_flags;
> > > >  }
> > > >
> > > > +static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long arg)
> > > > +{
> > > > +     struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg;
> > > > +     size_t usize = _IOC_SIZE(cmd);
> > > > +     struct pidfd_info kinfo = {};
> > > > +     struct user_namespace *user_ns;
> > > > +     const struct cred *c;
> > > > +     __u64 request_mask;
> > > > +
> > > > +     if (!uinfo)
> > > > +             return -EINVAL;
> > > > +     if (usize < sizeof(struct pidfd_info))
> > > > +             return -EINVAL; /* First version, no smaller struct possible */
> > > > +
> > > > +     if (copy_from_user(&request_mask, &uinfo->request_mask, sizeof(request_mask)))
> > > > +             return -EFAULT;
> > > > +
> > > > +     c = get_task_cred(task);
> > > > +     if (!c)
> > > > +             return -ESRCH;
> > > > +
> > > > +     /* Unconditionally return identifiers and credentials, the rest only on request */
> > > > +
> > > > +     user_ns = current_user_ns();
> > > > +     kinfo.ruid = from_kuid_munged(user_ns, c->uid);
> > > > +     kinfo.rgid = from_kgid_munged(user_ns, c->gid);
> > > > +     kinfo.euid = from_kuid_munged(user_ns, c->euid);
> > > > +     kinfo.egid = from_kgid_munged(user_ns, c->egid);
> > > > +     kinfo.suid = from_kuid_munged(user_ns, c->suid);
> > > > +     kinfo.sgid = from_kgid_munged(user_ns, c->sgid);
> > > > +     kinfo.fsuid = from_kuid_munged(user_ns, c->fsuid);
> > > > +     kinfo.fsgid = from_kgid_munged(user_ns, c->fsgid);
> > > > +     put_cred(c);
> > > > +
> > > > +#ifdef CONFIG_CGROUPS
> > > > +     if (request_mask & PIDFD_INFO_CGROUPID) {
> > > > +             struct cgroup *cgrp;
> > > > +
> > > > +             guard(rcu)();
> > > > +             cgrp = task_cgroup(task, pids_cgrp_id);
> > > > +             if (!cgrp)
> > > > +                     return -ENODEV;
> > >
> > > Afaict this means that the task has already exited. In other words, the
> > > cgroup id cannot be retrieved anymore for a task that has exited but not
> > > been reaped. Frankly, I would have expected the cgroup id to be
> > > retrievable until the task has been reaped but that's another
> > > discussion.
> > >
> > > My point is if you contrast this with the other information in here: If
> > > the task has exited but hasn't been reaped then you can still get
> > > credentials such as *uid/*gid, and pid namespace relative information
> > > such as pid/tgid/ppid.
> > >
> > > So really, I would argue that you don't want to fail this but only
> > > report 0 here. That's me working under the assumption that cgroup ids
> > > start from 1...
> > >
> > > /me checks
> > >
> > > Yes, they start from 1 so 0 is invalid.
> > >
> > > > +             kinfo.cgroupid = cgroup_id(cgrp);
> > >
> > > Fwiw, it looks like getting the cgroup id is basically just
> > > dereferencing pointers without having to hold any meaningful locks. So
> > > it should be fast. So making it unconditional seems fine to me.
> > 
> > There's an ifdef since it's an optional kernel feature, and there's
> > also this thing that we might not have it, so I'd rather keep the
> > flag, and set it only if we can get the information (instead of
> > failing). As a user that seems clearer to me.
> 
> I think we should keep the request_mask flag when returning from the
> kernel, but it's not necessary for the user to request it explicitly
> because it's cheap to get. This is how STATX_MNT_ID works and I think it
> makes sense to do it that way.

So what we should do is not require userspace to request it explicitly
but always raise the flag in request_mask when it's available. I agree.

Christian Brauner Oct. 10, 2024, 9:46 a.m. UTC | #13

On Thu, Oct 10, 2024 at 07:50:36AM +1100, Aleksa Sarai wrote:
> On 2024-10-08, luca.boccassi@gmail.com <luca.boccassi@gmail.com> wrote:
> > From: Luca Boccassi <luca.boccassi@gmail.com>
> > 
> > A common pattern when using pid fds is having to get information
> > about the process, which currently requires /proc being mounted,
> > resolving the fd to a pid, and then do manual string parsing of
> > /proc/N/status and friends. This needs to be reimplemented over
> > and over in all userspace projects (e.g.: I have reimplemented
> > resolving in systemd, dbus, dbus-daemon, polkit so far), and
> > requires additional care in checking that the fd is still valid
> > after having parsed the data, to avoid races.
> > 
> > Having a programmatic API that can be used directly removes all
> > these requirements, including having /proc mounted.
> > 
> > As discussed at LPC24, add an ioctl with an extensible struct
> > so that more parameters can be added later if needed. Start with
> > returning pid/tgid/ppid and creds unconditionally, and cgroupid
> > optionally.
> > 
> > Signed-off-by: Luca Boccassi <luca.boccassi@gmail.com>
> > ---
> > v9: drop result_mask and reuse request_mask instead
> > v8: use RAII guard for rcu, call put_cred()
> > v7: fix RCU issue and style issue introduced by v6 found by reviewer
> > v6: use rcu_read_lock() when fetching cgroupid, use task_ppid_nr_ns() to
> >     get the ppid, return ESCHR if any of pid/tgid/ppid are 0 at the end
> >     of the call to avoid providing incomplete data, document what the
> >     callers should expect
> > v5: check again that the task hasn't exited immediately before copying
> >     the result out to userspace, to ensure we are not returning stale data
> >     add an ifdef around the cgroup structs usage to fix build errors when
> >     the feature is disabled
> > v4: fix arg check in pidfd_ioctl() by moving it after the new call
> > v3: switch from pid_vnr() to task_pid_vnr()
> > v2: Apply comments from Christian, apart from the one about pid namespaces
> >     as I need additional hints on how to implement it.
> >     Drop the security_context string as it is not the appropriate
> >     metadata to give userspace these days.
> > 
> >  fs/pidfs.c                                    | 88 ++++++++++++++++++-
> >  include/uapi/linux/pidfd.h                    | 30 +++++++
> >  .../testing/selftests/pidfd/pidfd_open_test.c | 80 ++++++++++++++++-
> >  3 files changed, 194 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/pidfs.c b/fs/pidfs.c
> > index 80675b6bf884..15cdc7fe4968 100644
> > --- a/fs/pidfs.c
> > +++ b/fs/pidfs.c
> > @@ -2,6 +2,7 @@
> >  #include <linux/anon_inodes.h>
> >  #include <linux/file.h>
> >  #include <linux/fs.h>
> > +#include <linux/cgroup.h>
> >  #include <linux/magic.h>
> >  #include <linux/mount.h>
> >  #include <linux/pid.h>
> > @@ -114,6 +115,83 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
> >  	return poll_flags;
> >  }
> >  
> > +static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long arg)
> > +{
> > +	struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg;
> > +	size_t usize = _IOC_SIZE(cmd);
> > +	struct pidfd_info kinfo = {};
> > +	struct user_namespace *user_ns;
> > +	const struct cred *c;
> > +	__u64 request_mask;
> > +
> > +	if (!uinfo)
> > +		return -EINVAL;
> > +	if (usize < sizeof(struct pidfd_info))
> > +		return -EINVAL; /* First version, no smaller struct possible */
> > +
> > +	if (copy_from_user(&request_mask, &uinfo->request_mask, sizeof(request_mask)))
> > +		return -EFAULT;
> > +
> > +	c = get_task_cred(task);
> > +	if (!c)
> > +		return -ESRCH;
> > +
> > +	/* Unconditionally return identifiers and credentials, the rest only on request */
> > +
> > +	user_ns = current_user_ns();
> > +	kinfo.ruid = from_kuid_munged(user_ns, c->uid);
> > +	kinfo.rgid = from_kgid_munged(user_ns, c->gid);
> > +	kinfo.euid = from_kuid_munged(user_ns, c->euid);
> > +	kinfo.egid = from_kgid_munged(user_ns, c->egid);
> > +	kinfo.suid = from_kuid_munged(user_ns, c->suid);
> > +	kinfo.sgid = from_kgid_munged(user_ns, c->sgid);
> > +	kinfo.fsuid = from_kuid_munged(user_ns, c->fsuid);
> > +	kinfo.fsgid = from_kgid_munged(user_ns, c->fsgid);
> > +	put_cred(c);
> > +
> > +#ifdef CONFIG_CGROUPS
> > +	if (request_mask & PIDFD_INFO_CGROUPID) {
> > +		struct cgroup *cgrp;
> > +
> > +		guard(rcu)();
> > +		cgrp = task_cgroup(task, pids_cgrp_id);
> > +		if (!cgrp)
> > +			return -ENODEV;
> > +		kinfo.cgroupid = cgroup_id(cgrp);
> > +
> > +		kinfo.request_mask |= PIDFD_INFO_CGROUPID;
> > +	}
> > +#endif
> > +
> > +	/*
> > +	 * Copy pid/tgid last, to reduce the chances the information might be
> > +	 * stale. Note that it is not possible to ensure it will be valid as the
> > +	 * task might return as soon as the copy_to_user finishes, but that's ok
> > +	 * and userspace expects that might happen and can act accordingly, so
> > +	 * this is just best-effort. What we can do however is checking that all
> > +	 * the fields are set correctly, or return ESRCH to avoid providing
> > +	 * incomplete information. */
> > +
> > +	kinfo.ppid = task_ppid_nr_ns(task, NULL);
> > +	kinfo.tgid = task_tgid_vnr(task);
> > +	kinfo.pid = task_pid_vnr(task);
> > +
> > +	if (kinfo.pid == 0 || kinfo.tgid == 0 || (kinfo.ppid == 0 && kinfo.pid != 1))
> > +		return -ESRCH;
> > +
> > +	/*
> > +	 * If userspace and the kernel have the same struct size it can just
> > +	 * be copied. If userspace provides an older struct, only the bits that
> > +	 * userspace knows about will be copied. If userspace provides a new
> > +	 * struct, only the bits that the kernel knows about will be copied and
> > +	 * the size value will be set to the size the kernel knows about.
> > +	 */
> > +	if (copy_to_user(uinfo, &kinfo, min(usize, sizeof(kinfo))))
> > +		return -EFAULT;
> 
> If usize > ksize, we also want to clear_user() the trailing bytes to
> avoid userspace thinking that any garbage bytes they had are valid.
> 
> Also, you mention "the size value" but there is no size in pidfd_info. I
> don't think it's actually necessary to include such a field (especially
> when you have a statx-like request_mask), but it means you really should
> clear the trailing bytes to avoid userspace bugs.
> 
> I implemented all of these semantics as copy_struct_to_user() in the
> CHECK_FIELDS patch I sent a few weeks ago (I just sent v3[1]). Maybe you
> can cherry-pick this patch and use it? The semantics when we extend this
> pidfd_info to accept new request_mask values with larger structures is
> going to get a little ugly and copy_struct_to_user() makes this a little
> easier to deal with.
> 
> [1]: https://lore.kernel.org/all/20241010-extensible-structs-check_fields-v3-1-d2833dfe6edd@cyphar.com/

I agree. @Luca, you can either send the two patches together or I can
just port the patch to it. I don't mind.

> 
> > +
> > +	return 0;
> > +}
> > +
> >  static long pidfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> >  {
> >  	struct task_struct *task __free(put_task) = NULL;
> > @@ -122,13 +200,17 @@ static long pidfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> >  	struct ns_common *ns_common = NULL;
> >  	struct pid_namespace *pid_ns;
> >  
> > -	if (arg)
> > -		return -EINVAL;
> > -
> >  	task = get_pid_task(pid, PIDTYPE_PID);
> >  	if (!task)
> >  		return -ESRCH;
> >  
> > +	/* Extensible IOCTL that does not open namespace FDs, take a shortcut */
> > +	if (_IOC_NR(cmd) == _IOC_NR(PIDFD_GET_INFO))
> > +		return pidfd_info(task, cmd, arg);
> > +
> > +	if (arg)
> > +		return -EINVAL;
> > +
> >  	scoped_guard(task_lock, task) {
> >  		nsp = task->nsproxy;
> >  		if (nsp)
> > diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h
> > index 565fc0629fff..d685eeeedc51 100644
> > --- a/include/uapi/linux/pidfd.h
> > +++ b/include/uapi/linux/pidfd.h
> > @@ -16,6 +16,35 @@
> >  #define PIDFD_SIGNAL_THREAD_GROUP	(1UL << 1)
> >  #define PIDFD_SIGNAL_PROCESS_GROUP	(1UL << 2)
> >  
> > +/* Flags for pidfd_info. */
> > +#define PIDFD_INFO_CGROUPID		(1UL << 0)
> 
> While it isn't strictly necessary, maybe we should provide some
> always-set bits like statx does? While they would always be set, it

Yes, good idea.

> might incentivise programs to write code that checks if the request_mask
> bits are set after the ioctl(2) returns from the outset. Then again,
> PIDFD_INFO_CGROUPID is probably enough to justify writing code correctly
> from the outset.
> 
> > +
> > +struct pidfd_info {
> > +	/* Let userspace request expensive stuff explictly, and let the kernel
> > +	 * indicate whether it knows about it. */
> 
> I would prefer a slightly more informative comment (which mentions that
> this will also be used for extensibility), something like:
> 
> 
> /*
>  * This mask is similar to the request_mask in statx(2).
>  *
>  * Userspace indicates what extensions or expensive-to-calculate fields
>  * they want by setting the corresponding bits in request_mask.
>  *
>  * When filling the structure, the kernel will only set bits
>  * corresponding to the fields that were actually filled by the kernel.
>  * This also includes any future extensions that might be automatically
>  * filled. If the structure size is too small to contain a field
>  * (requested or not), to avoid confusion the request_mask will not
>  * contain a bit for that field.
>  *
>  * As such, userspace MUST verify that request_mask contains the
>  * corresponding flags after the ioctl(2) returns to ensure that it is
>  * using valid data.
>  */

I agree. This is a good comment.

Jonathan Corbet Oct. 10, 2024, 12:41 p.m. UTC | #14

Christian Brauner <brauner@kernel.org> writes:

> pidfd_info overwrites the request_mask with what is supported by the
> kernel. I don't think userspace setting random stuff in the request_mask
> is a problem. It would already be a problem with statx() and we haven't
> seen that so far.
>
> If userspace happens to set a some random bit in the request_mask and
> that bit ends up being used a few kernel releases later to e.g.,
> retrieve additional information then all that happens is that userspace
> would now receive information they didn't need. That's not a problem.

That, of course, assumes that there will never be a request_mask bit
that affects the information gathering in some other way -- say looking
in the parent namespace or such (a random example that just popped into
my undercaffeinated brain and is unlikely to be anything we actually
do).

But then, as I said, I'm bad at this :)

jon

David Laight Oct. 10, 2024, 1:16 p.m. UTC | #15

...
> pidfd_info overwrites the request_mask with what is supported by the
> kernel. I don't think userspace setting random stuff in the request_mask
> is a problem. It would already be a problem with statx() and we haven't
> seen that so far.

I don't think it is wise/necessary for the kernel to set bits that
weren't set in the request (for values that aren't being returned).

That would let you add some mutually exclusive options that use
the same part of the buffer area.

> If userspace happens to set a some random bit in the request_mask and
> that bit ends up being used a few kernel releases later to e.g.,
> retrieve additional information then all that happens is that userspace
> would now receive information they didn't need. That's not a problem.

Especially since the buffer is unlikely to be large enough.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Luca Boccassi Oct. 10, 2024, 3:55 p.m. UTC | #16

On Thu, 10 Oct 2024 at 10:46, Christian Brauner <brauner@kernel.org> wrote:
>
> On Thu, Oct 10, 2024 at 07:50:36AM +1100, Aleksa Sarai wrote:
> > On 2024-10-08, luca.boccassi@gmail.com <luca.boccassi@gmail.com> wrote:
> > > From: Luca Boccassi <luca.boccassi@gmail.com>
> > > +   /*
> > > +    * If userspace and the kernel have the same struct size it can just
> > > +    * be copied. If userspace provides an older struct, only the bits that
> > > +    * userspace knows about will be copied. If userspace provides a new
> > > +    * struct, only the bits that the kernel knows about will be copied and
> > > +    * the size value will be set to the size the kernel knows about.
> > > +    */
> > > +   if (copy_to_user(uinfo, &kinfo, min(usize, sizeof(kinfo))))
> > > +           return -EFAULT;
> >
> > If usize > ksize, we also want to clear_user() the trailing bytes to
> > avoid userspace thinking that any garbage bytes they had are valid.
> >
> > Also, you mention "the size value" but there is no size in pidfd_info. I
> > don't think it's actually necessary to include such a field (especially
> > when you have a statx-like request_mask), but it means you really should
> > clear the trailing bytes to avoid userspace bugs.
> >
> > I implemented all of these semantics as copy_struct_to_user() in the
> > CHECK_FIELDS patch I sent a few weeks ago (I just sent v3[1]). Maybe you
> > can cherry-pick this patch and use it? The semantics when we extend this
> > pidfd_info to accept new request_mask values with larger structures is
> > going to get a little ugly and copy_struct_to_user() makes this a little
> > easier to deal with.
> >
> > [1]: https://lore.kernel.org/all/20241010-extensible-structs-check_fields-v3-1-d2833dfe6edd@cyphar.com/
>
> I agree. @Luca, you can either send the two patches together or I can
> just port the patch to it. I don't mind.

I've updated for the latter, given that series is not merged yet, thanks.

Aleksa Sarai Oct. 13, 2024, 5:57 a.m. UTC | #17

On 2024-10-09, Jonathan Corbet <corbet@lwn.net> wrote:
> Aleksa Sarai <cyphar@cyphar.com> writes:
> 
> >> In fairness, this is how statx works and statx does this to not require
> >> syscall retries to figure out what flags the current kernel supports and
> >> instead defers that to stx_mask.
> >> 
> >> However, I think verifying the value is slightly less fragile -- as long
> >> as we get a cheap way for userspace to check what flags are supported
> >> (such as CHECK_FIELDS[1]). It would kind of suck if userspace would have
> >> to do 50 syscalls to figure out what request_mask values are valid.
> >
> > Unfortunately, we probably need to find a different way to do
> > CHECK_FIELDS for extensible-struct ioctls because CHECK_FIELDS uses the
> > top bit in a u64 but we can't set a size that large with ioctl
> > numbers...
> 
> Add a separate PIDFD_GET_VALID_REQUEST_MASK ioctl()?
> 
> But then I'm bad at designing interfaces...

This might be a good argument for making CHECK_FIELDS just (-size)
instead of setting the highest bit because that would work for any bit
size (though admittedly, doing (-size) for a 14-bit number would still
be a little weird).

> 
> jon

[v9] pidfd: add ioctl to retrieve pid info

Commit Message

Comments

Patch