diff mbox series

[v3,1/4] pidfs: improve multi-threaded exec and premature thread-group leader exit polling

Message ID 20250320-work-pidfs-thread_group-v3-1-b7e5f7e2c3b1@kernel.org (mailing list archive)
State New
Headers show
Series pidfs: handle multi-threaded exec and premature thread-group leader exit | expand

Commit Message

Christian Brauner March 20, 2025, 9:29 a.m. UTC
This is another attempt trying to make pidfd polling for multi-threaded
exec and premature thread-group leader exit consistent.

A quick recap of these two cases:

(1) During a multi-threaded exec by a subthread, i.e., non-thread-group
    leader thread, all other threads in the thread-group including the
    thread-group leader are killed and the struct pid of the
    thread-group leader will be taken over by the subthread that called
    exec. IOW, two tasks change their TIDs.

(2) A premature thread-group leader exit means that the thread-group
    leader exited before all of the other subthreads in the thread-group
    have exited.

Both cases lead to inconsistencies for pidfd polling with PIDFD_THREAD.
Any caller that holds a PIDFD_THREAD pidfd to the current thread-group
leader may or may not see an exit notification on the file descriptor
depending on when poll is performed. If the poll is performed before the
exec of the subthread has concluded an exit notification is generated
for the old thread-group leader. If the poll is performed after the exec
of the subthread has concluded no exit notification is generated for the
old thread-group leader.

The correct behavior would be to simply not generate an exit
notification on the struct pid of a subhthread exec because the struct
pid is taken over by the subthread and thus remains alive.

But this is difficult to handle because a thread-group may exit
prematurely as mentioned in (2). In that case an exit notification is
reliably generated but the subthreads may continue to run for an
indeterminate amount of time and thus also may exec at some point.

So far there was no way to distinguish between (1) and (2) internally.
This tiny series tries to address this problem by discarding
PIDFD_THREAD notification on premature thread-group leader exit.

If that works correctly then no exit notifications are generated for a
PIDFD_THREAD pidfd for a thread-group leader until all subthreads have
been reaped. If a subthread should exec aftewards no exit notification
will be generated until that task exits or it creates subthreads and
repeates the cycle.

Co-Developed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/pidfs.c      | 22 +++++++++++++++++++++-
 kernel/exit.c   | 12 +++++++++---
 kernel/signal.c |  6 ++++--
 3 files changed, 34 insertions(+), 6 deletions(-)

Comments

Oleg Nesterov March 20, 2025, 10:57 a.m. UTC | #1
Christian,

All the comments look misleading (and overcomplicated) to me.

See below, but first lets recall the commit 64bef697d33b75fc06c5789
("pidfd: implement PIDFD_THREAD flag for pidfd_open()") which says

    pidfd: implement PIDFD_THREAD flag for pidfd_open()

    With this flag:

            ....

            - pidfd_poll() succeeds when the task exits and becomes a
              zombie (iow, passes exit_notify()), even if it is a leader
              and thread-group is not empty.

This patch simply reverts this behaviour, the exiting leader will not
report the exit if it has sub-threads (alive or not). And afaics your
V1 tried to do the same. And this eliminates the

              This means that the behaviour of pidfd_poll(PIDFD_THREAD,
              pid-of-group-leader) is not well defined if it races with
              exec() from its sub-thread; ...

problem mentioned in the changelog. That is all.

IOW, with this change PIDFD_THREAD has no effect.

Except the pid_has_task() checks in sys_pidfd_open() paths, without
PIDFD_THREAD the target task must be a group leader.

On 03/20, Christian Brauner wrote:
>
> @@ -218,12 +218,32 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)

Your forgot to remove the no longer used

	bool thread = file->f_flags & PIDFD_THREAD;

above ;)

>  	/*
>  	 * Depending on PIDFD_THREAD, inform pollers when the thread
>  	 * or the whole thread-group exits.

See above (and below), this no longer depends on PIDFD_THREAD.

> +	else if (task->exit_state && !delay_group_leader(task))
>  		poll_flags = EPOLLIN | EPOLLRDNORM;

So with this change:

If the exiting task is a sub-thread, report EPOLLIN as before.
delay_group_leader() can't be true. In this case PIDFD_THREAD
must be set.

If the exiting task is a leader, we do not care about PIDFD_THREAD.
We report EPOLLIN only if it is the last/only thread.

> diff --git a/kernel/exit.c b/kernel/exit.c
> index 9916305e34d3..ce5cdad5ba9c 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -271,6 +271,9 @@ void release_task(struct task_struct *p)
>  		 * If we were the last child thread and the leader has
>  		 * exited already, and the leader's parent ignores SIGCHLD,
>  		 * then we are the one who should release the leader.
> +		 *
> +		 * This will also wake PIDFD_THREAD pidfds for the
> +		 * thread-group leader that already exited.
>  		 */
>  		zap_leader = do_notify_parent(leader, leader->exit_signal);

Again, this doesn't depend on PIDFD_THREAD.

> @@ -743,10 +746,13 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
>
>  	tsk->exit_state = EXIT_ZOMBIE;
>  	/*
> -	 * sub-thread or delay_group_leader(), wake up the
> -	 * PIDFD_THREAD waiters.
> +	 * Wake up PIDFD_THREAD waiters if this is a proper subthread
> +	 * exit. If this is a premature thread-group leader exit delay
> +	 * the notification until the last subthread exits. If a
> +	 * subthread should exec before then no notification will be
> +	 * generated.
>  	 */
> -	if (!thread_group_empty(tsk))
> +	if (!delay_group_leader(tsk))
>  		do_notify_pidfd(tsk);

The same...

> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2180,8 +2180,10 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
>  	WARN_ON_ONCE(!tsk->ptrace &&
>  	       (tsk->group_leader != tsk || !thread_group_empty(tsk)));
>  	/*
> -	 * tsk is a group leader and has no threads, wake up the
> -	 * non-PIDFD_THREAD waiters.
> +	 * This is a thread-group leader without subthreads so wake up
> +	 * the non-PIDFD_THREAD waiters. This also wakes the
> +	 * PIDFD_THREAD waiters for the thread-group leader in case it
> +	 * exited prematurely from release_task().
>  	 */

This too.

Oleg.
Oleg Nesterov March 20, 2025, 11:21 a.m. UTC | #2
And just for the record...

Consider the simplest case: a single-threaded not-ptraced process
exits. In this case do_notify_pidfd() will be called twice, from
exit_notify() and right after that from do_notify_parent().

We can cleanup this logic, but I don't think this is important and
this needs a separate patch.

(With or without this change: if the exiting task is ptraced or its
 parent exits without wait(), do_notify_pidfd() will be called even
 more times, but I think we do not care at all).

Oleg.
Christian Brauner March 20, 2025, 12:36 p.m. UTC | #3
On Thu, Mar 20, 2025 at 11:57:02AM +0100, Oleg Nesterov wrote:
> Christian,

Oleg!

> 
> All the comments look misleading (and overcomplicated) to me.
> 
> See below, but first lets recall the commit 64bef697d33b75fc06c5789
> ("pidfd: implement PIDFD_THREAD flag for pidfd_open()") which says
> 
>     pidfd: implement PIDFD_THREAD flag for pidfd_open()
> 
>     With this flag:
> 
>             ....
> 
>             - pidfd_poll() succeeds when the task exits and becomes a
>               zombie (iow, passes exit_notify()), even if it is a leader
>               and thread-group is not empty.
> 
> This patch simply reverts this behaviour, the exiting leader will not
> report the exit if it has sub-threads (alive or not). And afaics your
> V1 tried to do the same. And this eliminates the
> 
>               This means that the behaviour of pidfd_poll(PIDFD_THREAD,
>               pid-of-group-leader) is not well defined if it races with
>               exec() from its sub-thread; ...
> 
> problem mentioned in the changelog. That is all.
> 
> IOW, with this change PIDFD_THREAD has no effect.

But that's what I'm trying to say: This patch aligns the behavior of
thread-specific and non-thread-specific thread-group leader pidfds. IOW,
the behavior of:

pidfd_open(<thread-group-leader-pid>, 0)
pidfd_open(<thread-group-leader-pid>, PIDFD_THREAD)

is the same wrt to polling after this patch. That's also what the
selftests are designed to test and show.

> 
> Except the pid_has_task() checks in sys_pidfd_open() paths, without
> PIDFD_THREAD the target task must be a group leader.
> 
> On 03/20, Christian Brauner wrote:
> >
> > @@ -218,12 +218,32 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
> 
> Your forgot to remove the no longer used
> 
> 	bool thread = file->f_flags & PIDFD_THREAD;
> 
> above ;)

Thanks, fixed.

> 
> >  	/*
> >  	 * Depending on PIDFD_THREAD, inform pollers when the thread
> >  	 * or the whole thread-group exits.
> 
> See above (and below), this no longer depends on PIDFD_THREAD.
> 
> > +	else if (task->exit_state && !delay_group_leader(task))
> >  		poll_flags = EPOLLIN | EPOLLRDNORM;
> 
> So with this change:
> 
> If the exiting task is a sub-thread, report EPOLLIN as before.
> delay_group_leader() can't be true. In this case PIDFD_THREAD
> must be set.
> 
> If the exiting task is a leader, we do not care about PIDFD_THREAD.
> We report EPOLLIN only if it is the last/only thread.
> 
> > diff --git a/kernel/exit.c b/kernel/exit.c
> > index 9916305e34d3..ce5cdad5ba9c 100644
> > --- a/kernel/exit.c
> > +++ b/kernel/exit.c
> > @@ -271,6 +271,9 @@ void release_task(struct task_struct *p)
> >  		 * If we were the last child thread and the leader has
> >  		 * exited already, and the leader's parent ignores SIGCHLD,
> >  		 * then we are the one who should release the leader.
> > +		 *
> > +		 * This will also wake PIDFD_THREAD pidfds for the
> > +		 * thread-group leader that already exited.
> >  		 */
> >  		zap_leader = do_notify_parent(leader, leader->exit_signal);
> 
> Again, this doesn't depend on PIDFD_THREAD.

The comment is literally just saying that we delayed notification of
PIDFD_THREAD pidfds for a thread-group leader until now. After all its
subthreads are released instead of when the thread-group leader did
actually exit earlier.

> 
> > @@ -743,10 +746,13 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> >
> >  	tsk->exit_state = EXIT_ZOMBIE;
> >  	/*
> > -	 * sub-thread or delay_group_leader(), wake up the
> > -	 * PIDFD_THREAD waiters.
> > +	 * Wake up PIDFD_THREAD waiters if this is a proper subthread
> > +	 * exit. If this is a premature thread-group leader exit delay
> > +	 * the notification until the last subthread exits. If a
> > +	 * subthread should exec before then no notification will be
> > +	 * generated.
> >  	 */
> > -	if (!thread_group_empty(tsk))
> > +	if (!delay_group_leader(tsk))
> >  		do_notify_pidfd(tsk);
> 
> The same...

What you seem to be saying is that you want all references to
PIDFD_THREAD to be dropped in the comments because the behavior is now
identical. But what I would like to have is comments in the code that
illustrate where and how we guarantee this behavioral equivalency.

Because where the notifications happen does differ.

The delayed thread-group leader stuff is literally only apparent to
anyone who has stared and lived with these horrible behavioral warts of
early thread-group leader exit and subthread exec for a really long
time. For anyone else this isn't self-explanatory at all and each time
one has to go look at it it requires jumping around all the locations
where and how exit notifications are generated and piece together the
whole picture. It is laughably complex to follow.

So I'm wiping the comments but I very much disagree that they are
misleading/useless.

> 
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -2180,8 +2180,10 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
> >  	WARN_ON_ONCE(!tsk->ptrace &&
> >  	       (tsk->group_leader != tsk || !thread_group_empty(tsk)));
> >  	/*
> > -	 * tsk is a group leader and has no threads, wake up the
> > -	 * non-PIDFD_THREAD waiters.
> > +	 * This is a thread-group leader without subthreads so wake up
> > +	 * the non-PIDFD_THREAD waiters. This also wakes the
> > +	 * PIDFD_THREAD waiters for the thread-group leader in case it
> > +	 * exited prematurely from release_task().
> >  	 */
> 
> This too.

This one I agree is misplaced. It would be sufficient to have the
comment in release_task().

Christian
Christian Brauner March 20, 2025, 1:16 p.m. UTC | #4
On Thu, Mar 20, 2025 at 12:21:26PM +0100, Oleg Nesterov wrote:
> And just for the record...
> 
> Consider the simplest case: a single-threaded not-ptraced process
> exits. In this case do_notify_pidfd() will be called twice, from
> exit_notify() and right after that from do_notify_parent().

Yes.

> 
> We can cleanup this logic, but I don't think this is important and
> this needs a separate patch.

I would actually clean this up.

> 
> (With or without this change: if the exiting task is ptraced or its
>  parent exits without wait(), do_notify_pidfd() will be called even
>  more times, but I think we do not care at all).

Yes, though the ptrace codeflow is even more cursed so I'm not too
worried about making that clean (I wouldn't know how.).
Oleg Nesterov March 20, 2025, 2:02 p.m. UTC | #5
On 03/20, Christian Brauner wrote:
>
> What you seem to be saying is that you want all references to
> PIDFD_THREAD to be dropped in the comments because the behavior is now
> identical.

yes, to me the references to PIDFD_THREAD look as if PIDFD_THREAD
has some subtle differences in behavior.

With or without PIDFD_THREAD, do_notify_pidfd() is called and pidfd_poll()
returns EPOLLIN when this thread (leader or not) is ready for wait() from
the parent or debugger.

But!

> So I'm wiping the comments but I very much disagree that they are
> misleading/useless.

No, if you don't agree than do not remove the comments ;)


And... can you explain the motivation for this patch?

I mean... Again, the current PIDFD_THREAD/group-leader behavior is
not well defined, this is clear.

But if user-space does sys_pidfd_open(group_leader_pid) and needs the
"correct" EPOLLIN when the whole process exits, then it should not use
PIDFD_THREAD ?

Just in case, I am not arguing, I am just trying to understand.

Oleg.
Christian Brauner March 20, 2025, 2:32 p.m. UTC | #6
On Thu, Mar 20, 2025 at 03:02:00PM +0100, Oleg Nesterov wrote:
> On 03/20, Christian Brauner wrote:
> >
> > What you seem to be saying is that you want all references to
> > PIDFD_THREAD to be dropped in the comments because the behavior is now
> > identical.
> 
> yes, to me the references to PIDFD_THREAD look as if PIDFD_THREAD
> has some subtle differences in behavior.
> 
> With or without PIDFD_THREAD, do_notify_pidfd() is called and pidfd_poll()
> returns EPOLLIN when this thread (leader or not) is ready for wait() from
> the parent or debugger.
> 
> But!
> 
> > So I'm wiping the comments but I very much disagree that they are
> > misleading/useless.
> 
> No, if you don't agree than do not remove the comments ;)

No, it's fine. We always find some compromise and I've reworded the
comments substantially to not rely on PIDFD_THREAD at all. I always
appreciate the feedback, don't get me wrong!

> And... can you explain the motivation for this patch?

Yes, sure.

> 
> I mean... Again, the current PIDFD_THREAD/group-leader behavior is
> not well defined, this is clear.
> 
> But if user-space does sys_pidfd_open(group_leader_pid) and needs the
> "correct" EPOLLIN when the whole process exits, then it should not use
> PIDFD_THREAD ?
> 
> Just in case, I am not arguing, I am just trying to understand.

One driver is consistency. It's really weird to sometimes get exit
notifications and sometimes don't. It's easier to understand that we
delay notification until the thread-group is empty for a thread-based
pidfd for a thread-group leader rather than explaining de_thread()
timing issues for subthread exec.

But also remembering our earlier discussion on PIDFD_INFO_EXIT: If the
thread-group leader exits prematurely and userspace gets an exit
notification they end up with a Zombie they cannot (yet) reap.

I don't think we should carry that behavior over into the pidfd api. I'd
rather have it be so that if you get an exit notification it means that
you can may now reap the thing (I'm probably unaware of some ptrace
induced behavior that render this statement wrong.).
diff mbox series

Patch

diff --git a/fs/pidfs.c b/fs/pidfs.c
index a48cc44ced6f..f1c49a7540f3 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -218,12 +218,32 @@  static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
 	/*
 	 * Depending on PIDFD_THREAD, inform pollers when the thread
 	 * or the whole thread-group exits.
+	 *
+	 * There are two corner cases to consider:
+	 *
+	 * (1) If a thread-group leader of a thread-group with
+	 *     subthreads exits prematurely, i.e., before all of the
+	 *     subthreads of the thread-group have exited then no
+	 *     notification will be generated for PIDFD_THREAD pidfds
+	 *     referring to the thread-group leader.
+	 *
+	 *     The exit notification for the thread-group leader will be
+	 *     delayed until the last subthread of the thread-group
+	 *     exits.
+	 *
+	 * (2) If a subthread of a thread-group execs then the
+	 *     current thread-group leader will be SIGKILLed and the
+	 *     subthread will assume the struct pid of the now defunct
+	 *     old thread-group leader. No exit notification will be
+	 *     generated for PIDFD_THREAD pidfds referring to the old
+	 *     thread-group leader as they continue referring to the new
+	 *     thread-group leader.
 	 */
 	guard(rcu)();
 	task = pid_task(pid, PIDTYPE_PID);
 	if (!task)
 		poll_flags = EPOLLIN | EPOLLRDNORM | EPOLLHUP;
-	else if (task->exit_state && (thread || thread_group_empty(task)))
+	else if (task->exit_state && !delay_group_leader(task))
 		poll_flags = EPOLLIN | EPOLLRDNORM;
 
 	return poll_flags;
diff --git a/kernel/exit.c b/kernel/exit.c
index 9916305e34d3..ce5cdad5ba9c 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -271,6 +271,9 @@  void release_task(struct task_struct *p)
 		 * If we were the last child thread and the leader has
 		 * exited already, and the leader's parent ignores SIGCHLD,
 		 * then we are the one who should release the leader.
+		 *
+		 * This will also wake PIDFD_THREAD pidfds for the
+		 * thread-group leader that already exited.
 		 */
 		zap_leader = do_notify_parent(leader, leader->exit_signal);
 		if (zap_leader)
@@ -743,10 +746,13 @@  static void exit_notify(struct task_struct *tsk, int group_dead)
 
 	tsk->exit_state = EXIT_ZOMBIE;
 	/*
-	 * sub-thread or delay_group_leader(), wake up the
-	 * PIDFD_THREAD waiters.
+	 * Wake up PIDFD_THREAD waiters if this is a proper subthread
+	 * exit. If this is a premature thread-group leader exit delay
+	 * the notification until the last subthread exits. If a
+	 * subthread should exec before then no notification will be
+	 * generated.
 	 */
-	if (!thread_group_empty(tsk))
+	if (!delay_group_leader(tsk))
 		do_notify_pidfd(tsk);
 
 	if (unlikely(tsk->ptrace)) {
diff --git a/kernel/signal.c b/kernel/signal.c
index 081f19a24506..0ccef8783dff 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2180,8 +2180,10 @@  bool do_notify_parent(struct task_struct *tsk, int sig)
 	WARN_ON_ONCE(!tsk->ptrace &&
 	       (tsk->group_leader != tsk || !thread_group_empty(tsk)));
 	/*
-	 * tsk is a group leader and has no threads, wake up the
-	 * non-PIDFD_THREAD waiters.
+	 * This is a thread-group leader without subthreads so wake up
+	 * the non-PIDFD_THREAD waiters. This also wakes the
+	 * PIDFD_THREAD waiters for the thread-group leader in case it
+	 * exited prematurely from release_task().
 	 */
 	if (thread_group_empty(tsk))
 		do_notify_pidfd(tsk);