mbox series

[v6,0/5] introduce PIDFD_SELF* sentinels

Message ID cover.1729926229.git.lorenzo.stoakes@oracle.com (mailing list archive)
Headers show
Series introduce PIDFD_SELF* sentinels | expand

Message

Lorenzo Stoakes Oct. 26, 2024, 7:24 a.m. UTC
If you wish to utilise a pidfd interface to refer to the current process or
thread it is rather cumbersome, requiring something like:

	int pidfd = pidfd_open(getpid(), 0 or PIDFD_THREAD);

	...

	close(pidfd);

Or the equivalent call opening /proc/self. It is more convenient to use a
sentinel value to indicate to an interface that accepts a pidfd that we
simply wish to refer to the current process thread.

This series introduces sentinels for this purposes which can be passed as
the pidfd in this instance rather than having to establish a dummy fd for
this purpose.

It is useful to refer to both the current thread from the userland's
perspective for which we use PIDFD_SELF, and the current process from the
userland's perspective, for which we use PIDFD_SELF_PROCESS.

There is unfortunately some confusion between the kernel and userland as to
what constitutes a process - a thread from the userland perspective is a
process in userland, and a userland process is a thread group (more
specifically the thread group leader from the kernel perspective). We
therefore alias things thusly:

* PIDFD_SELF_THREAD aliased by PIDFD_SELF - use PIDTYPE_PID.
* PIDFD_SELF_THREAD_GROUP alised by PIDFD_SELF_PROCESS - use PIDTYPE_TGID.

In all of the kernel code we refer to PIDFD_SELF_THREAD and
PIDFD_SELF_THREAD_GROUP. However we expect users to use PIDFD_SELF and
PIDFD_SELF_PROCESS.

This matters for cases where, for instance, a user unshare()'s FDs or does
thread-specific signal handling and where the user would be hugely confused
if the FDs referenced or signal processed referred to the thread group
leader rather than the individual thread.

We ensure that pidfd_send_signal() and pidfd_getfd() work correctly, and
assert as much in selftests. All other interfaces except setns() will work
implicitly with this new interface, however it doesn't make sense to test
waitid(P_PIDFD, ...) as waiting on ourselves is a blocking operation.

In the case of setns() we explicitly disallow use of PIDFD_SELF* as it
doesn't make sense to obtain the namespaces of our own process, and it
would require work to implement this functionality there that would be of
no use.

We also do not provide the ability to utilise PIDFD_SELF* in ordinary fd
operations such as open() or poll(), as this would require extensive work
and be of no real use.

v6:
* Avoid static inline in UAPI header as suggested by Pedro.
* Place PIDFD_SELF values out of range of errors and any other sentinel as
  suggested by Pedro.

v5:
* Fixup self test dependencies on pidfd/pidfd.h.
https://lore.kernel.org/linux-mm/cover.1729848252.git.lorenzo.stoakes@oracle.com/

v4:
* Avoid returning an fd in the __pidfd_get_pid() function as pointed out by
  Christian, instead simply always pin the pid and maintain fd scope in the
  helper alone.
* Add wrapper header file in tools/include/linux to allow for import of
  UAPI pidfd.h header without encountering the collision between system
  fcntl.h and linux/fcntl.h as discussed with Shuah and John.
* Fixup tests to import the UAPI pidfd.h header working around conflicts
  between system fcntl.h and linux/fcntl.h which the UAPI pidfd.h imports,
  as reported by Shuah.
* Use an int for pidfd_is_self_sentinel() to avoid any dependency on
  stdbool.h in userland.
https://lore.kernel.org/linux-mm/cover.1729198898.git.lorenzo.stoakes@oracle.com/

v3:
* Do not fput() an invalid fd as reported by kernel test bot.
* Fix unintended churn from moving variable declaration.
https://lore.kernel.org/linux-mm/cover.1729073310.git.lorenzo.stoakes@oracle.com/

v2:
* Fix tests as reported by Shuah.
* Correct RFC version lore link.
https://lore.kernel.org/linux-mm/cover.1728643714.git.lorenzo.stoakes@oracle.com/

Non-RFC v1:
* Removed RFC tag - there seems to be general consensus that this change is
  a good idea, but perhaps some debate to be had on implementation. It
  seems sensible then to move forward with the RFC flag removed.
* Introduced PIDFD_SELF_THREAD, PIDFD_SELF_THREAD_GROUP and their aliases
  PIDFD_SELF and PIDFD_SELF_PROCESS respectively.
* Updated testing accordingly.
https://lore.kernel.org/linux-mm/cover.1728578231.git.lorenzo.stoakes@oracle.com/

RFC version:
https://lore.kernel.org/linux-mm/cover.1727644404.git.lorenzo.stoakes@oracle.com/

Lorenzo Stoakes (5):
  pidfd: extend pidfd_get_pid() and de-duplicate pid lookup
  pidfd: add PIDFD_SELF_* sentinels to refer to own thread/process
  tools: testing: separate out wait_for_pid() into helper header
  selftests: pidfd: add pidfd.h UAPI wrapper
  selftests: pidfd: add tests for PIDFD_SELF_*

 include/linux/pid.h                           |  34 ++++-
 include/uapi/linux/pidfd.h                    |  10 ++
 kernel/exit.c                                 |   4 +-
 kernel/nsproxy.c                              |   1 +
 kernel/pid.c                                  |  65 +++++---
 kernel/signal.c                               |  29 +---
 tools/include/linux/pidfd.h                   |  14 ++
 tools/testing/selftests/cgroup/test_kill.c    |   2 +-
 .../pid_namespace/regression_enomem.c         |   2 +-
 tools/testing/selftests/pidfd/Makefile        |   3 +-
 tools/testing/selftests/pidfd/pidfd.h         |  28 +---
 .../selftests/pidfd/pidfd_getfd_test.c        | 141 ++++++++++++++++++
 tools/testing/selftests/pidfd/pidfd_helpers.h |  39 +++++
 .../selftests/pidfd/pidfd_setns_test.c        |  11 ++
 tools/testing/selftests/pidfd/pidfd_test.c    |  76 ++++++++--
 15 files changed, 371 insertions(+), 88 deletions(-)
 create mode 100644 tools/include/linux/pidfd.h
 create mode 100644 tools/testing/selftests/pidfd/pidfd_helpers.h

--
2.47.0

Comments

Christian Brauner Oct. 28, 2024, 3:34 p.m. UTC | #1
On Sat, Oct 26, 2024 at 08:24:58AM +0100, Lorenzo Stoakes wrote:
> It is useful to be able to utilise the pidfd mechanism to reference the
> current thread or process (from a userland point of view - thread group
> leader from the kernel's point of view).
> 
> Therefore introduce PIDFD_SELF_THREAD to refer to the current thread, and
> PIDFD_SELF_THREAD_GROUP to refer to the current thread group leader.
> 
> For convenience and to avoid confusion from userland's perspective we alias
> these:
> 
> * PIDFD_SELF is an alias for PIDFD_SELF_THREAD - This is nearly always what
>   the user will want to use, as they would find it surprising if for
>   instance fd's were unshared()'d and they wanted to invoke pidfd_getfd()
>   and that failed.
> 
> * PIDFD_SELF_PROCESS is an alias for PIDFD_SELF_THREAD_GROUP - Most users
>   have no concept of thread groups or what a thread group leader is, and
>   from userland's perspective and nomenclature this is what userland
>   considers to be a process.
> 
> Due to the refactoring of the central __pidfd_get_pid() function we can
> implement this functionality centrally, providing the use of this sentinel
> in most functionality which utilises pidfd's.
> 
> We need to explicitly adjust kernel_waitid_prepare() to permit this (though
> it wouldn't really make sense to use this there, we provide the ability for
> consistency).
> 
> We explicitly disallow use of this in setns(), which would otherwise have
> required explicit custom handling, as it doesn't make sense to set the
> current calling thread to join the namespace of itself.
> 
> As the callers of pidfd_get_pid() expect an increased reference count on
> the pid we do so in the self case, reducing churn and avoiding any breakage
> from existing logic which decrements this reference count.
> 
> This change implicitly provides PIDFD_SELF_* support in the waitid(P_PIDFS,
> ...), process_madvise(), process_mrelease(), pidfd_send_signal(), and
> pidfd_getfd() system calls.
> 
> Things such as polling a pidfs and general fd operations are not supported,
> this strictly provides the sentinel for APIs which explicitly accept a
> pidfd.
> 
> Suggested-by: Pedro Falcato <pedro.falcato@gmail.com>
> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---

Currently, a pidfd based system call like pidfd_send_signal() would
simply do:

fdget(pidfd);
// use struct pid
fdput(pidfd);

Where the lifetime of @pid is guaranteed by @file. And in the regular
case where there's only a single thread the file code will avoid taking
a reference. Thus, there's no reference count bump on fdget(), nor a
drop on fdput(), nor a get_pid() or put_pid().

With your patch series you will always cause reference counts on @pid to
be taken for everyone. And I wouldn't be surprised if we get performance
regressions for this.

In one of my earlier mails I had mused about a fdput() like primitive.
What I roughly, hastily, and under the influence of the flu, sketched in
the _completey untested_ patch I appended illustrates roughly what I had
been thinking about.

>  include/linux/pid.h        |  8 ++++--
>  include/uapi/linux/pidfd.h | 10 ++++++++
>  kernel/exit.c              |  4 ++-
>  kernel/nsproxy.c           |  1 +
>  kernel/pid.c               | 51 ++++++++++++++++++++++++--------------
>  5 files changed, 53 insertions(+), 21 deletions(-)
> 
> diff --git a/include/linux/pid.h b/include/linux/pid.h
> index d466890e1b35..3b2ac7567a88 100644
> --- a/include/linux/pid.h
> +++ b/include/linux/pid.h
> @@ -78,11 +78,15 @@ struct file;
>   * __pidfd_get_pid() - Retrieve a pid associated with the specified pidfd.
>   *
>   * @pidfd:      The pidfd whose pid we want, or the fd of a /proc/<pid> file if
> - *              @alloc_proc is also set.
> + *              @alloc_proc is also set, or PIDFD_SELF_* to refer to the current
> + *              thread or thread group leader.
>   * @allow_proc: If set, then an fd of a /proc/<pid> file can be passed instead
>   *              of a pidfd, and this will be used to determine the pid.
> +
>   * @flags:      Output variable, if non-NULL, then the file->f_flags of the
> - *              pidfd will be set here.
> + *              pidfd will be set here or If PIDFD_SELF_THREAD is set, this is
> + *              set to PIDFD_THREAD, otherwise if PIDFD_SELF_THREAD_GROUP then
> + *              this is set to zero.
>   *
>   * Returns: If successful, the pid associated with the pidfd, otherwise an
>   *          error.
> diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h
> index 565fc0629fff..6fe1d63b2086 100644
> --- a/include/uapi/linux/pidfd.h
> +++ b/include/uapi/linux/pidfd.h
> @@ -29,4 +29,14 @@
>  #define PIDFD_GET_USER_NAMESPACE              _IO(PIDFS_IOCTL_MAGIC, 9)
>  #define PIDFD_GET_UTS_NAMESPACE               _IO(PIDFS_IOCTL_MAGIC, 10)
> 
> +/*
> + * Special sentinel values which can be used to refer to the current thread or
> + * thread group leader (which from a userland perspective is the process).
> + */
> +#define PIDFD_SELF		PIDFD_SELF_THREAD
> +#define PIDFD_SELF_PROCESS	PIDFD_SELF_THREAD_GROUP
> +
> +#define PIDFD_SELF_THREAD	-10000 /* Current thread. */
> +#define PIDFD_SELF_THREAD_GROUP	-20000 /* Current thread group leader. */
> +
>  #endif /* _UAPI_LINUX_PIDFD_H */
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 619f0014c33b..e4f85ec4ba78 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -71,6 +71,7 @@
>  #include <linux/user_events.h>
>  #include <linux/uaccess.h>
> 
> +#include <uapi/linux/pidfd.h>
>  #include <uapi/linux/wait.h>
> 
>  #include <asm/unistd.h>
> @@ -1739,7 +1740,8 @@ int kernel_waitid_prepare(struct wait_opts *wo, int which, pid_t upid,
>  		break;
>  	case P_PIDFD:
>  		type = PIDTYPE_PID;
> -		if (upid < 0)
> +		if (upid < 0 && upid != PIDFD_SELF_THREAD &&
> +		    upid != PIDFD_SELF_THREAD_GROUP)
>  			return -EINVAL;
> 
>  		pid = pidfd_get_pid(upid, &f_flags);
> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index dc952c3b05af..d239f7eeaa1f 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -550,6 +550,7 @@ SYSCALL_DEFINE2(setns, int, fd, int, flags)
>  	struct nsset nsset = {};
>  	int err = 0;
> 
> +	/* If fd is PIDFD_SELF_*, implicitly fail here, as invalid. */
>  	if (!fd_file(f))
>  		return -EBADF;
> 
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 94c97559e5c5..0a1861b4422c 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -535,33 +535,48 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
>  }
>  EXPORT_SYMBOL_GPL(find_ge_pid);
> 
> +static struct pid *pidfd_get_pid_self(unsigned int pidfd, unsigned int *flags)

The @flags argument is unused afaict.

> +{
> +	bool is_thread = pidfd == PIDFD_SELF_THREAD;
> +	enum pid_type type = is_thread ? PIDTYPE_PID : PIDTYPE_TGID;
> +	struct pid *pid = *task_pid_ptr(current, type);
> +
> +	/* The caller expects an elevated reference count. */
> +	get_pid(pid);
> +	return pid;
> +}

Fwiw, what you've done here is essentially reimplement the already
existing get_task_pid() helper that you could simply use.

> +
>  struct pid *__pidfd_get_pid(unsigned int pidfd, bool allow_proc,
>  			    unsigned int *flags)
>  {
> -	struct pid *pid;
> -	struct fd f = fdget(pidfd);
> -	struct file *file = fd_file(f);
> +	if (pidfd == PIDFD_SELF_THREAD || pidfd == PIDFD_SELF_THREAD_GROUP) {
> +		return pidfd_get_pid_self(pidfd, flags);
> +	} else {

I think the else can just go and we can save an indentation level.

> +		struct pid *pid;
> +		struct fd f = fdget(pidfd);
> +		struct file *file = fd_file(f);
> 
> -	if (!file)
> -		return ERR_PTR(-EBADF);
> +		if (!file)
> +			return ERR_PTR(-EBADF);
> 
> -	pid = pidfd_pid(file);
> -	/* If we allow opening a pidfd via /proc/<pid>, do so. */
> -	if (IS_ERR(pid) && allow_proc)
> -		pid = tgid_pidfd_to_pid(file);
> +		pid = pidfd_pid(file);
> +		/* If we allow opening a pidfd via /proc/<pid>, do so. */
> +		if (IS_ERR(pid) && allow_proc)
> +			pid = tgid_pidfd_to_pid(file);
> 
> -	if (IS_ERR(pid)) {
> +		if (IS_ERR(pid)) {
> +			fdput(f);
> +			return pid;
> +		}
> +
> +		/* Pin pid before we release fd. */
> +		get_pid(pid);
> +		if (flags)
> +			*flags = file->f_flags;
>  		fdput(f);
> +
>  		return pid;
>  	}
> -
> -	/* Pin pid before we release fd. */
> -	get_pid(pid);
> -	if (flags)
> -		*flags = file->f_flags;
> -	fdput(f);
> -
> -	return pid;
>  }
> 
>  /**
> --
> 2.47.0
Lorenzo Stoakes Oct. 28, 2024, 4:06 p.m. UTC | #2
On Mon, Oct 28, 2024 at 04:34:33PM +0100, Christian Brauner wrote:
> On Sat, Oct 26, 2024 at 08:24:58AM +0100, Lorenzo Stoakes wrote:
> > It is useful to be able to utilise the pidfd mechanism to reference the
> > current thread or process (from a userland point of view - thread group
> > leader from the kernel's point of view).
> >
> > Therefore introduce PIDFD_SELF_THREAD to refer to the current thread, and
> > PIDFD_SELF_THREAD_GROUP to refer to the current thread group leader.
> >
> > For convenience and to avoid confusion from userland's perspective we alias
> > these:
> >
> > * PIDFD_SELF is an alias for PIDFD_SELF_THREAD - This is nearly always what
> >   the user will want to use, as they would find it surprising if for
> >   instance fd's were unshared()'d and they wanted to invoke pidfd_getfd()
> >   and that failed.
> >
> > * PIDFD_SELF_PROCESS is an alias for PIDFD_SELF_THREAD_GROUP - Most users
> >   have no concept of thread groups or what a thread group leader is, and
> >   from userland's perspective and nomenclature this is what userland
> >   considers to be a process.
> >
> > Due to the refactoring of the central __pidfd_get_pid() function we can
> > implement this functionality centrally, providing the use of this sentinel
> > in most functionality which utilises pidfd's.
> >
> > We need to explicitly adjust kernel_waitid_prepare() to permit this (though
> > it wouldn't really make sense to use this there, we provide the ability for
> > consistency).
> >
> > We explicitly disallow use of this in setns(), which would otherwise have
> > required explicit custom handling, as it doesn't make sense to set the
> > current calling thread to join the namespace of itself.
> >
> > As the callers of pidfd_get_pid() expect an increased reference count on
> > the pid we do so in the self case, reducing churn and avoiding any breakage
> > from existing logic which decrements this reference count.
> >
> > This change implicitly provides PIDFD_SELF_* support in the waitid(P_PIDFS,
> > ...), process_madvise(), process_mrelease(), pidfd_send_signal(), and
> > pidfd_getfd() system calls.
> >
> > Things such as polling a pidfs and general fd operations are not supported,
> > this strictly provides the sentinel for APIs which explicitly accept a
> > pidfd.
> >
> > Suggested-by: Pedro Falcato <pedro.falcato@gmail.com>
> > Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
>
> Currently, a pidfd based system call like pidfd_send_signal() would
> simply do:
>
> fdget(pidfd);
> // use struct pid
> fdput(pidfd);
>
> Where the lifetime of @pid is guaranteed by @file. And in the regular
> case where there's only a single thread the file code will avoid taking
> a reference. Thus, there's no reference count bump on fdget(), nor a
> drop on fdput(), nor a get_pid() or put_pid().

Right I missed that fdget() wouldn't take a reference count I assumed it
would be equivalent, my mistake.

>
> With your patch series you will always cause reference counts on @pid to
> be taken for everyone. And I wouldn't be surprised if we get performance
> regressions for this.

This was in response to you review saying I can't pass around a pointer to
the fd, originally I didn't do this.

This was the only way I could find to de-jank and make my shared function
not end up problematic in the light of wanting to keep the fd within a
single scope, I didn't realise that passing that by value would be ok.

But obviously hadn't realised that fdget()/fdput() sometimes doesn't change
a reference count, mea culpa on that not an fs person...

>
> In one of my earlier mails I had mused about a fdput() like primitive.
> What I roughly, hastily, and under the influence of the flu, sketched in
> the _completey untested_ patch I appended illustrates roughly what I had
> been thinking about.

OK, I was really uncertain as to what you meant regarding the scope of this
value so had assumed we couldn't do something like assigning the value like
that.

I guess I'll try to adapt that and respin a v7 when I get a chance.

>
> >  include/linux/pid.h        |  8 ++++--
> >  include/uapi/linux/pidfd.h | 10 ++++++++
> >  kernel/exit.c              |  4 ++-
> >  kernel/nsproxy.c           |  1 +
> >  kernel/pid.c               | 51 ++++++++++++++++++++++++--------------
> >  5 files changed, 53 insertions(+), 21 deletions(-)
> >
> > diff --git a/include/linux/pid.h b/include/linux/pid.h
> > index d466890e1b35..3b2ac7567a88 100644
> > --- a/include/linux/pid.h
> > +++ b/include/linux/pid.h
> > @@ -78,11 +78,15 @@ struct file;
> >   * __pidfd_get_pid() - Retrieve a pid associated with the specified pidfd.
> >   *
> >   * @pidfd:      The pidfd whose pid we want, or the fd of a /proc/<pid> file if
> > - *              @alloc_proc is also set.
> > + *              @alloc_proc is also set, or PIDFD_SELF_* to refer to the current
> > + *              thread or thread group leader.
> >   * @allow_proc: If set, then an fd of a /proc/<pid> file can be passed instead
> >   *              of a pidfd, and this will be used to determine the pid.
> > +
> >   * @flags:      Output variable, if non-NULL, then the file->f_flags of the
> > - *              pidfd will be set here.
> > + *              pidfd will be set here or If PIDFD_SELF_THREAD is set, this is
> > + *              set to PIDFD_THREAD, otherwise if PIDFD_SELF_THREAD_GROUP then
> > + *              this is set to zero.
> >   *
> >   * Returns: If successful, the pid associated with the pidfd, otherwise an
> >   *          error.
> > diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h
> > index 565fc0629fff..6fe1d63b2086 100644
> > --- a/include/uapi/linux/pidfd.h
> > +++ b/include/uapi/linux/pidfd.h
> > @@ -29,4 +29,14 @@
> >  #define PIDFD_GET_USER_NAMESPACE              _IO(PIDFS_IOCTL_MAGIC, 9)
> >  #define PIDFD_GET_UTS_NAMESPACE               _IO(PIDFS_IOCTL_MAGIC, 10)
> >
> > +/*
> > + * Special sentinel values which can be used to refer to the current thread or
> > + * thread group leader (which from a userland perspective is the process).
> > + */
> > +#define PIDFD_SELF		PIDFD_SELF_THREAD
> > +#define PIDFD_SELF_PROCESS	PIDFD_SELF_THREAD_GROUP
> > +
> > +#define PIDFD_SELF_THREAD	-10000 /* Current thread. */
> > +#define PIDFD_SELF_THREAD_GROUP	-20000 /* Current thread group leader. */
> > +
> >  #endif /* _UAPI_LINUX_PIDFD_H */
> > diff --git a/kernel/exit.c b/kernel/exit.c
> > index 619f0014c33b..e4f85ec4ba78 100644
> > --- a/kernel/exit.c
> > +++ b/kernel/exit.c
> > @@ -71,6 +71,7 @@
> >  #include <linux/user_events.h>
> >  #include <linux/uaccess.h>
> >
> > +#include <uapi/linux/pidfd.h>
> >  #include <uapi/linux/wait.h>
> >
> >  #include <asm/unistd.h>
> > @@ -1739,7 +1740,8 @@ int kernel_waitid_prepare(struct wait_opts *wo, int which, pid_t upid,
> >  		break;
> >  	case P_PIDFD:
> >  		type = PIDTYPE_PID;
> > -		if (upid < 0)
> > +		if (upid < 0 && upid != PIDFD_SELF_THREAD &&
> > +		    upid != PIDFD_SELF_THREAD_GROUP)
> >  			return -EINVAL;
> >
> >  		pid = pidfd_get_pid(upid, &f_flags);
> > diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> > index dc952c3b05af..d239f7eeaa1f 100644
> > --- a/kernel/nsproxy.c
> > +++ b/kernel/nsproxy.c
> > @@ -550,6 +550,7 @@ SYSCALL_DEFINE2(setns, int, fd, int, flags)
> >  	struct nsset nsset = {};
> >  	int err = 0;
> >
> > +	/* If fd is PIDFD_SELF_*, implicitly fail here, as invalid. */
> >  	if (!fd_file(f))
> >  		return -EBADF;
> >
> > diff --git a/kernel/pid.c b/kernel/pid.c
> > index 94c97559e5c5..0a1861b4422c 100644
> > --- a/kernel/pid.c
> > +++ b/kernel/pid.c
> > @@ -535,33 +535,48 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
> >  }
> >  EXPORT_SYMBOL_GPL(find_ge_pid);
> >
> > +static struct pid *pidfd_get_pid_self(unsigned int pidfd, unsigned int *flags)
>
> The @flags argument is unused afaict.

Oops will rework on v7.

>
> > +{
> > +	bool is_thread = pidfd == PIDFD_SELF_THREAD;
> > +	enum pid_type type = is_thread ? PIDTYPE_PID : PIDTYPE_TGID;
> > +	struct pid *pid = *task_pid_ptr(current, type);
> > +
> > +	/* The caller expects an elevated reference count. */
> > +	get_pid(pid);
> > +	return pid;
> > +}
>
> Fwiw, what you've done here is essentially reimplement the already
> existing get_task_pid() helper that you could simply use.

We're looking up PIDFD_SELF_* values here. So presumably you mean the:

	struct pid *pid = *task_pid_ptr(current, type);
	/* The caller expects an elevated reference count. */
	get_pid(pid);

Bit is duplicated vs. get_task_pid()?

I did that because it wasn't clear doing that under the RCU lock was
necessary or useful?

It seems useful still to have the PIDFD_SELF stuff qseparate, I can replace
those two lines with a call to get_task_pid() if you prefer? Unless you
meant something else?

>
> > +
> >  struct pid *__pidfd_get_pid(unsigned int pidfd, bool allow_proc,
> >  			    unsigned int *flags)
> >  {
> > -	struct pid *pid;
> > -	struct fd f = fdget(pidfd);
> > -	struct file *file = fd_file(f);
> > +	if (pidfd == PIDFD_SELF_THREAD || pidfd == PIDFD_SELF_THREAD_GROUP) {
> > +		return pidfd_get_pid_self(pidfd, flags);
> > +	} else {
>
> I think the else can just go and we can save an indentation level.

This has been raised a couple times before by other reviewers, this is just
so we can declare variables, especially the fd variable, which you were
very clear _must_ retain scope only where it used.

Otherwise I have to do something like;

	struct fd f = {};

	if (...) { return ...; }

	f = fdget(...);

This way we don't need to do that.

I mean probably the compiler would do the right thing but it just seems
ugly to assign/reassign a stack value like that.

Ah, I see struct fd is just a wrapper around an unsigned long, so probably
not a big deal to just leave it unassigned then.

This was the only reason I did this, I usually much prefer the guard
pattern.

OK if you're fine with this value being assigned like that then no problem
will change!

>
> > +		struct pid *pid;
> > +		struct fd f = fdget(pidfd);
> > +		struct file *file = fd_file(f);
> >
> > -	if (!file)
> > -		return ERR_PTR(-EBADF);
> > +		if (!file)
> > +			return ERR_PTR(-EBADF);
> >
> > -	pid = pidfd_pid(file);
> > -	/* If we allow opening a pidfd via /proc/<pid>, do so. */
> > -	if (IS_ERR(pid) && allow_proc)
> > -		pid = tgid_pidfd_to_pid(file);
> > +		pid = pidfd_pid(file);
> > +		/* If we allow opening a pidfd via /proc/<pid>, do so. */
> > +		if (IS_ERR(pid) && allow_proc)
> > +			pid = tgid_pidfd_to_pid(file);
> >
> > -	if (IS_ERR(pid)) {
> > +		if (IS_ERR(pid)) {
> > +			fdput(f);
> > +			return pid;
> > +		}
> > +
> > +		/* Pin pid before we release fd. */
> > +		get_pid(pid);
> > +		if (flags)
> > +			*flags = file->f_flags;
> >  		fdput(f);
> > +
> >  		return pid;
> >  	}
> > -
> > -	/* Pin pid before we release fd. */
> > -	get_pid(pid);
> > -	if (flags)
> > -		*flags = file->f_flags;
> > -	fdput(f);
> > -
> > -	return pid;
> >  }
> >
> >  /**
> > --
> > 2.47.0
Lorenzo Stoakes Oct. 30, 2024, 4:37 p.m. UTC | #3
On Mon, Oct 28, 2024 at 04:06:07PM +0000, Lorenzo Stoakes wrote:
> I guess I'll try to adapt that and respin a v7 when I get a chance.

Hm looking at this draft patch, it seems like a total rework of pidfd's
across the board right (now all pidfd's will need to be converted to
pid_fd)? Correct me if I'm wrong.

If only for the signal case, it seems like overkill to define a whole
pid_fd and to use this CLASS() wrapper just for this one instance.

If the intent is to convert _all_ pidfd's to use this type, it feels really
out of scope for this series and I think we'd probably instead want to go
off and do that as a separate series and put this on hold until that is
done.

If instead you mean that we ought to do something like this just for the
signal case, it feels like it'd be quite a bit of extra abstraction just
used in this one case but nowhere else, I think if you did an abstraction
like this it would _have_ to be across the board right?

I agree that the issue is with this one signal case that pins only the fd
(rather than this pid) where this 'pinning' doesn't _necessary_ mess around
with reference counts.

So we definitely must address this, but the issue you had with the first
approach was that I think (correct me if I'm wrong) I was passing a pointer
to a struct fd which is not permitted right?

Could we pass the struct fd by value to avoid this? I think we'd have to
unfortunately special-case this and probably duplicate some code which is a
pity as I liked the idea of abstracting everything to one place, but we can
obviously do that.

So I guess to TL;DR it, the options are:

1. Implement pid_fd everywhere, in which case I will leave off on
   this series and I guess, if I have time I could look at trying to
   implement that or perhaps you'd prefer to?

2. We are good for the sake of this series to special-case a pidfd_to_pid()
   implementation (used only by the pidfd_send_signal() syscall)

3. Something else, or I am misunderstanding your point :)

Let me know how you want me to proceed on this as we're at v6 already and I
want to be _really_ sure I'm doing what you want here.

Thanks!
Lorenzo Stoakes Nov. 8, 2024, 2:28 p.m. UTC | #4
On Wed, Oct 30, 2024 at 04:37:37PM +0000, Lorenzo Stoakes wrote:
> On Mon, Oct 28, 2024 at 04:06:07PM +0000, Lorenzo Stoakes wrote:
> > I guess I'll try to adapt that and respin a v7 when I get a chance.
>
> Hm looking at this draft patch, it seems like a total rework of pidfd's
> across the board right (now all pidfd's will need to be converted to
> pid_fd)? Correct me if I'm wrong.
>
> If only for the signal case, it seems like overkill to define a whole
> pid_fd and to use this CLASS() wrapper just for this one instance.
>
> If the intent is to convert _all_ pidfd's to use this type, it feels really
> out of scope for this series and I think we'd probably instead want to go
> off and do that as a separate series and put this on hold until that is
> done.
>
> If instead you mean that we ought to do something like this just for the
> signal case, it feels like it'd be quite a bit of extra abstraction just
> used in this one case but nowhere else, I think if you did an abstraction
> like this it would _have_ to be across the board right?
>
> I agree that the issue is with this one signal case that pins only the fd
> (rather than this pid) where this 'pinning' doesn't _necessary_ mess around
> with reference counts.
>
> So we definitely must address this, but the issue you had with the first
> approach was that I think (correct me if I'm wrong) I was passing a pointer
> to a struct fd which is not permitted right?
>
> Could we pass the struct fd by value to avoid this? I think we'd have to
> unfortunately special-case this and probably duplicate some code which is a
> pity as I liked the idea of abstracting everything to one place, but we can
> obviously do that.
>
> So I guess to TL;DR it, the options are:
>
> 1. Implement pid_fd everywhere, in which case I will leave off on
>    this series and I guess, if I have time I could look at trying to
>    implement that or perhaps you'd prefer to?
>
> 2. We are good for the sake of this series to special-case a pidfd_to_pid()
>    implementation (used only by the pidfd_send_signal() syscall)
>
> 3. Something else, or I am misunderstanding your point :)
>
> Let me know how you want me to proceed on this as we're at v6 already and I
> want to be _really_ sure I'm doing what you want here.
>
> Thanks!

Hi Christian,

Just a gentle nudge on this - as I need some guidance in order to know how
to move the series forwards.

Obviously no rush if your workload is high at the moment as this is pretty
low priority, but just in case you missed it :)

Thanks, Lorenzo
Lorenzo Stoakes Dec. 2, 2024, 10:52 a.m. UTC | #5
On Fri, Nov 08, 2024 at 02:28:14PM +0000, Lorenzo Stoakes wrote:
> On Wed, Oct 30, 2024 at 04:37:37PM +0000, Lorenzo Stoakes wrote:
> > On Mon, Oct 28, 2024 at 04:06:07PM +0000, Lorenzo Stoakes wrote:
> > > I guess I'll try to adapt that and respin a v7 when I get a chance.
> >
> > Hm looking at this draft patch, it seems like a total rework of pidfd's
> > across the board right (now all pidfd's will need to be converted to
> > pid_fd)? Correct me if I'm wrong.
> >
> > If only for the signal case, it seems like overkill to define a whole
> > pid_fd and to use this CLASS() wrapper just for this one instance.
> >
> > If the intent is to convert _all_ pidfd's to use this type, it feels really
> > out of scope for this series and I think we'd probably instead want to go
> > off and do that as a separate series and put this on hold until that is
> > done.
> >
> > If instead you mean that we ought to do something like this just for the
> > signal case, it feels like it'd be quite a bit of extra abstraction just
> > used in this one case but nowhere else, I think if you did an abstraction
> > like this it would _have_ to be across the board right?
> >
> > I agree that the issue is with this one signal case that pins only the fd
> > (rather than this pid) where this 'pinning' doesn't _necessary_ mess around
> > with reference counts.
> >
> > So we definitely must address this, but the issue you had with the first
> > approach was that I think (correct me if I'm wrong) I was passing a pointer
> > to a struct fd which is not permitted right?
> >
> > Could we pass the struct fd by value to avoid this? I think we'd have to
> > unfortunately special-case this and probably duplicate some code which is a
> > pity as I liked the idea of abstracting everything to one place, but we can
> > obviously do that.
> >
> > So I guess to TL;DR it, the options are:
> >
> > 1. Implement pid_fd everywhere, in which case I will leave off on
> >    this series and I guess, if I have time I could look at trying to
> >    implement that or perhaps you'd prefer to?
> >
> > 2. We are good for the sake of this series to special-case a pidfd_to_pid()
> >    implementation (used only by the pidfd_send_signal() syscall)
> >
> > 3. Something else, or I am misunderstanding your point :)
> >
> > Let me know how you want me to proceed on this as we're at v6 already and I
> > want to be _really_ sure I'm doing what you want here.
> >
> > Thanks!
>
> Hi Christian,
>
> Just a gentle nudge on this - as I need some guidance in order to know how
> to move the series forwards.
>
> Obviously no rush if your workload is high at the moment as this is pretty
> low priority, but just in case you missed it :)
>
> Thanks, Lorenzo

Hi Christian,

Just a ping on this now we're past the merge window and it's been over a
month.

It'd be good to at least get a polite ack to indicate you're aware even if
you don't have the time to respond right now.

If you'd prefer this series not to go ahead just let me know, but
unfortunately I really require your input to know how to move forward
otherwise I risk doing work that you might then reject.

Thanks, Lorenzo
Christian Brauner Dec. 2, 2024, 2:31 p.m. UTC | #6
On Wed, Oct 30, 2024 at 04:37:37PM +0000, Lorenzo Stoakes wrote:
> On Mon, Oct 28, 2024 at 04:06:07PM +0000, Lorenzo Stoakes wrote:
> > I guess I'll try to adapt that and respin a v7 when I get a chance.
> 
> Hm looking at this draft patch, it seems like a total rework of pidfd's
> across the board right (now all pidfd's will need to be converted to
> pid_fd)? Correct me if I'm wrong.
> 
> If only for the signal case, it seems like overkill to define a whole
> pid_fd and to use this CLASS() wrapper just for this one instance.
> 
> If the intent is to convert _all_ pidfd's to use this type, it feels really
> out of scope for this series and I think we'd probably instead want to go
> off and do that as a separate series and put this on hold until that is
> done.
> 
> If instead you mean that we ought to do something like this just for the
> signal case, it feels like it'd be quite a bit of extra abstraction just
> used in this one case but nowhere else, I think if you did an abstraction
> like this it would _have_ to be across the board right?
> 
> I agree that the issue is with this one signal case that pins only the fd
> (rather than this pid) where this 'pinning' doesn't _necessary_ mess around
> with reference counts.
> 
> So we definitely must address this, but the issue you had with the first
> approach was that I think (correct me if I'm wrong) I was passing a pointer
> to a struct fd which is not permitted right?
> 
> Could we pass the struct fd by value to avoid this? I think we'd have to
> unfortunately special-case this and probably duplicate some code which is a
> pity as I liked the idea of abstracting everything to one place, but we can
> obviously do that.
> 
> So I guess to TL;DR it, the options are:
> 
> 1. Implement pid_fd everywhere, in which case I will leave off on
>    this series and I guess, if I have time I could look at trying to
>    implement that or perhaps you'd prefer to?
> 
> 2. We are good for the sake of this series to special-case a pidfd_to_pid()
>    implementation (used only by the pidfd_send_signal() syscall)
> 
> 3. Something else, or I am misunderstanding your point :)
> 
> Let me know how you want me to proceed on this as we're at v6 already and I
> want to be _really_ sure I'm doing what you want here.

I don't think we get away with abstracting it in one place without this
ending up a pretty janky api. I need to go back and think about calling
conventions for all this stuff. For now I think I'm fine with something
like the below that abstracts the api to handle mm/ cleanly and then a
special-case for pidfd_send_signal():

diff --git a/kernel/pid.c b/kernel/pid.c
index 6131543e7c09..c1857c44d1a3 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -564,15 +564,29 @@ struct pid *pidfd_get_pid(unsigned int fd, unsigned int *flags)
  */
 struct task_struct *pidfd_get_task(int pidfd, unsigned int *flags)
 {
-       unsigned int f_flags;
+       unsigned int f_flags = 0;
        struct pid *pid;
        struct task_struct *task;
+       enum pid_type type;

-       pid = pidfd_get_pid(pidfd, &f_flags);
-       if (IS_ERR(pid))
-               return ERR_CAST(pid);
+       switch (pidfd) {
+       case PIDFD_SELF_THREAD:
+               type = PIDTYPE_PID;
+               pid = get_task_pid(current, type);
+               break;
+       case PIDFD_SELF_THREAD_GROUP:
+               type = PIDTYPE_TGID;
+               pid = get_task_pid(current, type);
+               break;
+       default:
+               pid = pidfd_get_pid(pidfd, &f_flags);
+               if (IS_ERR(pid))
+                       return ERR_CAST(pid);
+               type = PIDTYPE_TGID;
+               break;
+       }

-       task = get_pid_task(pid, PIDTYPE_TGID);
+       task = get_pid_task(pid, type);
        put_pid(pid);
        if (!task)
                return ERR_PTR(-ESRCH);

That would get you support for PIDFD_SELF_THREAD and
PIDFD_SELF_THREAD_GROUP for process_madvise() and process_mrelease().

And for pidfd_send_signal() we could just open code this for now:

diff --git a/kernel/signal.c b/kernel/signal.c
index 989b1cc9116a..a2e4e3a5ee42 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -3990,6 +3990,45 @@ static struct pid *pidfd_to_pid(const struct file *file)
 	(PIDFD_SIGNAL_THREAD | PIDFD_SIGNAL_THREAD_GROUP | \
 	 PIDFD_SIGNAL_PROCESS_GROUP)
 
+static int do_pidfd_send_signal(struct pid *pid, int sig, enum pid_type type,
+				siginfo_t __user *info, unsigned int flags)
+{
+	kernel_siginfo_t kinfo;
+
+	switch (flags) {
+	case PIDFD_SIGNAL_THREAD:
+		type = PIDTYPE_PID;
+		break;
+	case PIDFD_SIGNAL_THREAD_GROUP:
+		type = PIDTYPE_TGID;
+		break;
+	case PIDFD_SIGNAL_PROCESS_GROUP:
+		type = PIDTYPE_PGID;
+		break;
+	}
+
+	if (info) {
+		int ret = copy_siginfo_from_user_any(&kinfo, info);
+		if (unlikely(ret))
+			return ret;
+
+		if (unlikely(sig != kinfo.si_signo))
+			return -EINVAL;
+
+		/* Only allow sending arbitrary signals to yourself. */
+		if ((task_pid(current) != pid || type > PIDTYPE_TGID) &&
+		    (kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL))
+			return -EPERM;
+	} else {
+		prepare_kill_siginfo(sig, &kinfo, type);
+	}
+
+	if (type == PIDTYPE_PGID)
+		return kill_pgrp_info(sig, &kinfo, pid);
+
+	return kill_pid_info_type(sig, &kinfo, pid, type);
+}
+
 /**
  * sys_pidfd_send_signal - Signal a process through a pidfd
  * @pidfd:  file descriptor of the process
@@ -4009,7 +4048,6 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
 {
 	int ret;
 	struct pid *pid;
-	kernel_siginfo_t kinfo;
 	enum pid_type type;
 
 	/* Enforce flags be set to 0 until we add an extension. */
@@ -4021,56 +4059,39 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
 		return -EINVAL;
 
 	CLASS(fd, f)(pidfd);
-	if (fd_empty(f))
-		return -EBADF;
 
-	/* Is this a pidfd? */
-	pid = pidfd_to_pid(fd_file(f));
-	if (IS_ERR(pid))
-		return PTR_ERR(pid);
+	switch (pidfd) {
+	case PIDFD_SELF_THREAD:
+		pid = get_task_pid(current, PIDTYPE_PID);
+		type = PIDTYPE_PID;
+		break;
+	case PIDFD_SELF_THREAD_GROUP:
+		pid = get_task_pid(current, PIDTYPE_TGID);
+		type = PIDTYPE_TGID;
+		break;
+	default:
+		if (fd_empty(f))
+			return -EBADF;
 
-	if (!access_pidfd_pidns(pid))
-		return -EINVAL;
+		/* Is this a pidfd? */
+		pid = pidfd_to_pid(fd_file(f));
+		if (IS_ERR(pid))
+			return PTR_ERR(pid);
 
-	switch (flags) {
-	case 0:
+		if (!access_pidfd_pidns(pid))
+			return -EINVAL;
 		/* Infer scope from the type of pidfd. */
 		if (fd_file(f)->f_flags & PIDFD_THREAD)
 			type = PIDTYPE_PID;
 		else
 			type = PIDTYPE_TGID;
 		break;
-	case PIDFD_SIGNAL_THREAD:
-		type = PIDTYPE_PID;
-		break;
-	case PIDFD_SIGNAL_THREAD_GROUP:
-		type = PIDTYPE_TGID;
-		break;
-	case PIDFD_SIGNAL_PROCESS_GROUP:
-		type = PIDTYPE_PGID;
-		break;
 	}
 
-	if (info) {
-		ret = copy_siginfo_from_user_any(&kinfo, info);
-		if (unlikely(ret))
-			return ret;
-
-		if (unlikely(sig != kinfo.si_signo))
-			return -EINVAL;
-
-		/* Only allow sending arbitrary signals to yourself. */
-		if ((task_pid(current) != pid || type > PIDTYPE_TGID) &&
-		    (kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL))
-			return -EPERM;
-	} else {
-		prepare_kill_siginfo(sig, &kinfo, type);
-	}
-
-	if (type == PIDTYPE_PGID)
-		return kill_pgrp_info(sig, &kinfo, pid);
-	else
-		return kill_pid_info_type(sig, &kinfo, pid, type);
+	ret = do_pidfd_send_signal(pid, sig, type, info, flags);
+	if (fd_empty(f))
+		put_pid(pid);
+	return ret;
 }
 
 static int