mbox series

[v7,0/6] introduce PIDFD_SELF* sentinels

Message ID cover.1738268370.git.lorenzo.stoakes@oracle.com (mailing list archive)
Headers show
Series introduce PIDFD_SELF* sentinels | expand

Message

Lorenzo Stoakes Jan. 30, 2025, 8:40 p.m. UTC
If you wish to utilise a pidfd interface to refer to the current process or
thread it is rather cumbersome, requiring something like:

	int pidfd = pidfd_open(getpid(), 0 or PIDFD_THREAD);

	...

	close(pidfd);

Or the equivalent call opening /proc/self. It is more convenient to use a
sentinel value to indicate to an interface that accepts a pidfd that we
simply wish to refer to the current process thread.

This series introduces sentinels for this purposes which can be passed as
the pidfd in this instance rather than having to establish a dummy fd for
this purpose.

It is useful to refer to both the current thread from the userland's
perspective for which we use PIDFD_SELF, and the current process from the
userland's perspective, for which we use PIDFD_SELF_PROCESS.

There is unfortunately some confusion between the kernel and userland as to
what constitutes a process - a thread from the userland perspective is a
process in userland, and a userland process is a thread group (more
specifically the thread group leader from the kernel perspective). We
therefore alias things thusly:

* PIDFD_SELF_THREAD aliased by PIDFD_SELF - use PIDTYPE_PID.
* PIDFD_SELF_THREAD_GROUP alised by PIDFD_SELF_PROCESS - use PIDTYPE_TGID.

In all of the kernel code we refer to PIDFD_SELF_THREAD and
PIDFD_SELF_THREAD_GROUP. However we expect users to use PIDFD_SELF and
PIDFD_SELF_PROCESS.

This matters for cases where, for instance, a user unshare()'s FDs or does
thread-specific signal handling and where the user would be hugely confused
if the FDs referenced or signal processed referred to the thread group
leader rather than the individual thread.

For now we only adjust pidfd_get_task() and the pidfd_send_signal() system
call with specific handling for this, implementing this functionality for
process_madvise(), process_mrelease() (albeit, using it here wouldn't
really make sense) and pidfd_send_signal().

We defer making further changes, as this would require a significant rework
of the pidfd mechanism.

The motivating case here is to support PIDFD_SELF in process_madvise(), so
this suffices for immediate uses. Moving forward, this can be further
expanded to other uses.

v7:
* Reworked implementation according to Christian's requirements. We now
  only support process_madvise() and pidfd_send_signal() system calls with
  PIDFD_SELF as specified.
* Updated tests to account for broken pidfd_open_test.c implementation.
* Fixed missing includes in pidfd self tests.
* Removed tests relating to functionality no longer supported.
* Update guard pages test to use PIDFD_SELF.

v6:
* Avoid static inline in UAPI header as suggested by Pedro.
* Place PIDFD_SELF values out of range of errors and any other sentinel as
  suggested by Pedro.
https://lore.kernel.org/linux-mm/cover.1729926229.git.lorenzo.stoakes@oracle.com/

v5:
* Fixup self test dependencies on pidfd/pidfd.h.
https://lore.kernel.org/linux-mm/cover.1729848252.git.lorenzo.stoakes@oracle.com/

v4:
* Avoid returning an fd in the __pidfd_get_pid() function as pointed out by
  Christian, instead simply always pin the pid and maintain fd scope in the
  helper alone.
* Add wrapper header file in tools/include/linux to allow for import of
  UAPI pidfd.h header without encountering the collision between system
  fcntl.h and linux/fcntl.h as discussed with Shuah and John.
* Fixup tests to import the UAPI pidfd.h header working around conflicts
  between system fcntl.h and linux/fcntl.h which the UAPI pidfd.h imports,
  as reported by Shuah.
* Use an int for pidfd_is_self_sentinel() to avoid any dependency on
  stdbool.h in userland.
https://lore.kernel.org/linux-mm/cover.1729198898.git.lorenzo.stoakes@oracle.com/

v3:
* Do not fput() an invalid fd as reported by kernel test bot.
* Fix unintended churn from moving variable declaration.
https://lore.kernel.org/linux-mm/cover.1729073310.git.lorenzo.stoakes@oracle.com/

v2:
* Fix tests as reported by Shuah.
* Correct RFC version lore link.
https://lore.kernel.org/linux-mm/cover.1728643714.git.lorenzo.stoakes@oracle.com/

Non-RFC v1:
* Removed RFC tag - there seems to be general consensus that this change is
  a good idea, but perhaps some debate to be had on implementation. It
  seems sensible then to move forward with the RFC flag removed.
* Introduced PIDFD_SELF_THREAD, PIDFD_SELF_THREAD_GROUP and their aliases
  PIDFD_SELF and PIDFD_SELF_PROCESS respectively.
* Updated testing accordingly.
https://lore.kernel.org/linux-mm/cover.1728578231.git.lorenzo.stoakes@oracle.com/

RFC version:
https://lore.kernel.org/linux-mm/cover.1727644404.git.lorenzo.stoakes@oracle.com/


Lorenzo Stoakes (6):
  pidfd: add PIDFD_SELF* sentinels to refer to own thread/process
  selftests/pidfd: add missing system header imcludes to pidfd tests
  tools: testing: separate out wait_for_pid() into helper header
  selftests: pidfd: add pidfd.h UAPI wrapper
  selftests: pidfd: add tests for PIDFD_SELF_*
  selftests/mm: use PIDFD_SELF in guard pages test

 include/uapi/linux/pidfd.h                    |  24 ++++
 kernel/pid.c                                  |  24 +++-
 kernel/signal.c                               | 106 +++++++++++-------
 tools/include/linux/pidfd.h                   |  14 +++
 tools/testing/selftests/cgroup/test_kill.c    |   2 +-
 tools/testing/selftests/mm/Makefile           |   4 +
 tools/testing/selftests/mm/guard-pages.c      |  15 +--
 .../pid_namespace/regression_enomem.c         |   2 +-
 tools/testing/selftests/pidfd/Makefile        |   3 +-
 tools/testing/selftests/pidfd/pidfd.h         |  28 +----
 .../selftests/pidfd/pidfd_fdinfo_test.c       |   1 +
 tools/testing/selftests/pidfd/pidfd_helpers.h |  39 +++++++
 .../testing/selftests/pidfd/pidfd_open_test.c |   6 +-
 .../selftests/pidfd/pidfd_setns_test.c        |   1 +
 tools/testing/selftests/pidfd/pidfd_test.c    |  76 +++++++++++--
 15 files changed, 242 insertions(+), 103 deletions(-)
 create mode 100644 tools/include/linux/pidfd.h
 create mode 100644 tools/testing/selftests/pidfd/pidfd_helpers.h

--
2.48.1

Comments

Andrew Morton Jan. 30, 2025, 10:37 p.m. UTC | #1
On Thu, 30 Jan 2025 20:40:25 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> If you wish to utilise a pidfd interface to refer to the current process or
> thread it is rather cumbersome, requiring something like:
> 
> 	int pidfd = pidfd_open(getpid(), 0 or PIDFD_THREAD);
> 
> 	...
> 
> 	close(pidfd);
> 
> Or the equivalent call opening /proc/self. It is more convenient to use a
> sentinel value to indicate to an interface that accepts a pidfd that we
> simply wish to refer to the current process thread.
> 

The above code sequence doesn't seem at all onerous.  I'm not
understanding why it's worth altering the kernel to permit this little
shortcut?
Lorenzo Stoakes Jan. 30, 2025, 10:53 p.m. UTC | #2
On Thu, Jan 30, 2025 at 02:37:54PM -0800, Andrew Morton wrote:
> On Thu, 30 Jan 2025 20:40:25 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> > If you wish to utilise a pidfd interface to refer to the current process or
> > thread it is rather cumbersome, requiring something like:
> >
> > 	int pidfd = pidfd_open(getpid(), 0 or PIDFD_THREAD);
> >
> > 	...
> >
> > 	close(pidfd);
> >
> > Or the equivalent call opening /proc/self. It is more convenient to use a
> > sentinel value to indicate to an interface that accepts a pidfd that we
> > simply wish to refer to the current process thread.
> >
>
> The above code sequence doesn't seem at all onerous.  I'm not
> understanding why it's worth altering the kernel to permit this little
> shortcut?

In practice it adds quite a bit of overhead for something that whatever
mechanism is using the pidfd can avoid.

It was specifically intended for a real case of utilising
process_madvise(), using the newly extended ability to batch _any_
madvise() operations for the current process, like:

	if (process_madvise(PIDFD_SELF, iovec, 10, MADV_GUARD_INSTALL, 0)) {
	    ... error handling ...
	}

vs.

	pid_t pid = getpid();
	int pidfd = pidfd_open(pid, PIDFD_THREAD);

	if (pidfd < 0) {
	   ... error handling ...
	}

	if (process_madvise(PIDFD_SELF, iovec, 10, MADV_GUARD_INSTALL, 0)) {
	   ... cleanup pidfd ...
	   ... error handling ...
	}

	...

	... cleanup pidfd ...

So in practice, it's actually a lot more ceremony and noise. Suren has been
working with this code in practice and found this to be useful.

The suggestion to embed it as PIDFD_SELF rather than to pass it as a
process_madvise() flag was made on the original series where I extended its
functionality.

So in practice I think it's onerous enough to justify this, plus it allows
for a more fluent use of pidfd's in other cases where one is referring to
the same process/thread, to the extent that I've seen people commenting on
supporting it while sending series relating to pidfd.

Also Christian and others appear to support this idea.
Pedro Falcato Jan. 30, 2025, 11:10 p.m. UTC | #3
On Thu, Jan 30, 2025 at 10:53 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Thu, Jan 30, 2025 at 02:37:54PM -0800, Andrew Morton wrote:
> > On Thu, 30 Jan 2025 20:40:25 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> >
> > > If you wish to utilise a pidfd interface to refer to the current process or
> > > thread it is rather cumbersome, requiring something like:
> > >
> > >     int pidfd = pidfd_open(getpid(), 0 or PIDFD_THREAD);
> > >
> > >     ...
> > >
> > >     close(pidfd);
> > >
> > > Or the equivalent call opening /proc/self. It is more convenient to use a
> > > sentinel value to indicate to an interface that accepts a pidfd that we
> > > simply wish to refer to the current process thread.
> > >
> >
> > The above code sequence doesn't seem at all onerous.  I'm not
> > understanding why it's worth altering the kernel to permit this little
> > shortcut?
>
> In practice it adds quite a bit of overhead for something that whatever
> mechanism is using the pidfd can avoid.
>
> It was specifically intended for a real case of utilising
> process_madvise(), using the newly extended ability to batch _any_
> madvise() operations for the current process, like:
>
>         if (process_madvise(PIDFD_SELF, iovec, 10, MADV_GUARD_INSTALL, 0)) {
>             ... error handling ...
>         }
>
> vs.
>
>         pid_t pid = getpid();
>         int pidfd = pidfd_open(pid, PIDFD_THREAD);
>
>         if (pidfd < 0) {
>            ... error handling ...
>         }
>
>         if (process_madvise(PIDFD_SELF, iovec, 10, MADV_GUARD_INSTALL, 0)) {
>            ... cleanup pidfd ...
>            ... error handling ...
>         }
>
>         ...
>
>         ... cleanup pidfd ...
>
> So in practice, it's actually a lot more ceremony and noise. Suren has been
> working with this code in practice and found this to be useful.

It's also nice to add that people on the libc/allocator side should
also appreciate skipping pidfd_open's reliability concerns (mostly,
that RLIMIT_NOFILE Should Not(tm) ever affect thread spawning or a
malloc[1]). Besides the big syscall reduction and nice speedup, that
is.

[1] whether this is the already case is an exercise left to the
reader, but at the very least we should not add onto existing problems
Andrew Morton Jan. 30, 2025, 11:32 p.m. UTC | #4
On Thu, 30 Jan 2025 23:10:53 +0000 Pedro Falcato <pedro.falcato@gmail.com> wrote:

> On Thu, Jan 30, 2025 at 10:53 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > > The above code sequence doesn't seem at all onerous.  I'm not
> > > understanding why it's worth altering the kernel to permit this little
> > > shortcut?
> >
> > In practice it adds quite a bit of overhead for something that whatever
> > mechanism is using the pidfd can avoid.
> >
> > It was specifically intended for a real case of utilising
> > process_madvise(), using the newly extended ability to batch _any_
> > madvise() operations for the current process, like:
> >
> >         if (process_madvise(PIDFD_SELF, iovec, 10, MADV_GUARD_INSTALL, 0)) {
> >             ... error handling ...
> >         }
> >
> > vs.
> >
> >         pid_t pid = getpid();
> >         int pidfd = pidfd_open(pid, PIDFD_THREAD);
> >
> >         if (pidfd < 0) {
> >            ... error handling ...
> >         }
> >
> >         if (process_madvise(PIDFD_SELF, iovec, 10, MADV_GUARD_INSTALL, 0)) {
> >            ... cleanup pidfd ...
> >            ... error handling ...
> >         }
> >
> >         ...
> >
> >         ... cleanup pidfd ...
> >
> > So in practice, it's actually a lot more ceremony and noise. Suren has been
> > working with this code in practice and found this to be useful.
> 
> It's also nice to add that people on the libc/allocator side should
> also appreciate skipping pidfd_open's reliability concerns (mostly,
> that RLIMIT_NOFILE Should Not(tm) ever affect thread spawning or a
> malloc[1]). Besides the big syscall reduction and nice speedup, that
> is.
> 
> [1] whether this is the already case is an exercise left to the
> reader, but at the very least we should not add onto existing problems

Thanks.

Could we please get all the above spelled out much more thoroughly in
the [0/n] description (aka Patch Series Sales Brochure)?