Message ID | 20250320-work-pidfs-thread_group-v3-1-b7e5f7e2c3b1@kernel.org (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | pidfs: handle multi-threaded exec and premature thread-group leader exit | expand |
Christian, All the comments look misleading (and overcomplicated) to me. See below, but first lets recall the commit 64bef697d33b75fc06c5789 ("pidfd: implement PIDFD_THREAD flag for pidfd_open()") which says pidfd: implement PIDFD_THREAD flag for pidfd_open() With this flag: .... - pidfd_poll() succeeds when the task exits and becomes a zombie (iow, passes exit_notify()), even if it is a leader and thread-group is not empty. This patch simply reverts this behaviour, the exiting leader will not report the exit if it has sub-threads (alive or not). And afaics your V1 tried to do the same. And this eliminates the This means that the behaviour of pidfd_poll(PIDFD_THREAD, pid-of-group-leader) is not well defined if it races with exec() from its sub-thread; ... problem mentioned in the changelog. That is all. IOW, with this change PIDFD_THREAD has no effect. Except the pid_has_task() checks in sys_pidfd_open() paths, without PIDFD_THREAD the target task must be a group leader. On 03/20, Christian Brauner wrote: > > @@ -218,12 +218,32 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts) Your forgot to remove the no longer used bool thread = file->f_flags & PIDFD_THREAD; above ;) > /* > * Depending on PIDFD_THREAD, inform pollers when the thread > * or the whole thread-group exits. See above (and below), this no longer depends on PIDFD_THREAD. > + else if (task->exit_state && !delay_group_leader(task)) > poll_flags = EPOLLIN | EPOLLRDNORM; So with this change: If the exiting task is a sub-thread, report EPOLLIN as before. delay_group_leader() can't be true. In this case PIDFD_THREAD must be set. If the exiting task is a leader, we do not care about PIDFD_THREAD. We report EPOLLIN only if it is the last/only thread. > diff --git a/kernel/exit.c b/kernel/exit.c > index 9916305e34d3..ce5cdad5ba9c 100644 > --- a/kernel/exit.c > +++ b/kernel/exit.c > @@ -271,6 +271,9 @@ void release_task(struct task_struct *p) > * If we were the last child thread and the leader has > * exited already, and the leader's parent ignores SIGCHLD, > * then we are the one who should release the leader. > + * > + * This will also wake PIDFD_THREAD pidfds for the > + * thread-group leader that already exited. > */ > zap_leader = do_notify_parent(leader, leader->exit_signal); Again, this doesn't depend on PIDFD_THREAD. > @@ -743,10 +746,13 @@ static void exit_notify(struct task_struct *tsk, int group_dead) > > tsk->exit_state = EXIT_ZOMBIE; > /* > - * sub-thread or delay_group_leader(), wake up the > - * PIDFD_THREAD waiters. > + * Wake up PIDFD_THREAD waiters if this is a proper subthread > + * exit. If this is a premature thread-group leader exit delay > + * the notification until the last subthread exits. If a > + * subthread should exec before then no notification will be > + * generated. > */ > - if (!thread_group_empty(tsk)) > + if (!delay_group_leader(tsk)) > do_notify_pidfd(tsk); The same... > --- a/kernel/signal.c > +++ b/kernel/signal.c > @@ -2180,8 +2180,10 @@ bool do_notify_parent(struct task_struct *tsk, int sig) > WARN_ON_ONCE(!tsk->ptrace && > (tsk->group_leader != tsk || !thread_group_empty(tsk))); > /* > - * tsk is a group leader and has no threads, wake up the > - * non-PIDFD_THREAD waiters. > + * This is a thread-group leader without subthreads so wake up > + * the non-PIDFD_THREAD waiters. This also wakes the > + * PIDFD_THREAD waiters for the thread-group leader in case it > + * exited prematurely from release_task(). > */ This too. Oleg.
And just for the record... Consider the simplest case: a single-threaded not-ptraced process exits. In this case do_notify_pidfd() will be called twice, from exit_notify() and right after that from do_notify_parent(). We can cleanup this logic, but I don't think this is important and this needs a separate patch. (With or without this change: if the exiting task is ptraced or its parent exits without wait(), do_notify_pidfd() will be called even more times, but I think we do not care at all). Oleg.
On Thu, Mar 20, 2025 at 11:57:02AM +0100, Oleg Nesterov wrote: > Christian, Oleg! > > All the comments look misleading (and overcomplicated) to me. > > See below, but first lets recall the commit 64bef697d33b75fc06c5789 > ("pidfd: implement PIDFD_THREAD flag for pidfd_open()") which says > > pidfd: implement PIDFD_THREAD flag for pidfd_open() > > With this flag: > > .... > > - pidfd_poll() succeeds when the task exits and becomes a > zombie (iow, passes exit_notify()), even if it is a leader > and thread-group is not empty. > > This patch simply reverts this behaviour, the exiting leader will not > report the exit if it has sub-threads (alive or not). And afaics your > V1 tried to do the same. And this eliminates the > > This means that the behaviour of pidfd_poll(PIDFD_THREAD, > pid-of-group-leader) is not well defined if it races with > exec() from its sub-thread; ... > > problem mentioned in the changelog. That is all. > > IOW, with this change PIDFD_THREAD has no effect. But that's what I'm trying to say: This patch aligns the behavior of thread-specific and non-thread-specific thread-group leader pidfds. IOW, the behavior of: pidfd_open(<thread-group-leader-pid>, 0) pidfd_open(<thread-group-leader-pid>, PIDFD_THREAD) is the same wrt to polling after this patch. That's also what the selftests are designed to test and show. > > Except the pid_has_task() checks in sys_pidfd_open() paths, without > PIDFD_THREAD the target task must be a group leader. > > On 03/20, Christian Brauner wrote: > > > > @@ -218,12 +218,32 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts) > > Your forgot to remove the no longer used > > bool thread = file->f_flags & PIDFD_THREAD; > > above ;) Thanks, fixed. > > > /* > > * Depending on PIDFD_THREAD, inform pollers when the thread > > * or the whole thread-group exits. > > See above (and below), this no longer depends on PIDFD_THREAD. > > > + else if (task->exit_state && !delay_group_leader(task)) > > poll_flags = EPOLLIN | EPOLLRDNORM; > > So with this change: > > If the exiting task is a sub-thread, report EPOLLIN as before. > delay_group_leader() can't be true. In this case PIDFD_THREAD > must be set. > > If the exiting task is a leader, we do not care about PIDFD_THREAD. > We report EPOLLIN only if it is the last/only thread. > > > diff --git a/kernel/exit.c b/kernel/exit.c > > index 9916305e34d3..ce5cdad5ba9c 100644 > > --- a/kernel/exit.c > > +++ b/kernel/exit.c > > @@ -271,6 +271,9 @@ void release_task(struct task_struct *p) > > * If we were the last child thread and the leader has > > * exited already, and the leader's parent ignores SIGCHLD, > > * then we are the one who should release the leader. > > + * > > + * This will also wake PIDFD_THREAD pidfds for the > > + * thread-group leader that already exited. > > */ > > zap_leader = do_notify_parent(leader, leader->exit_signal); > > Again, this doesn't depend on PIDFD_THREAD. The comment is literally just saying that we delayed notification of PIDFD_THREAD pidfds for a thread-group leader until now. After all its subthreads are released instead of when the thread-group leader did actually exit earlier. > > > @@ -743,10 +746,13 @@ static void exit_notify(struct task_struct *tsk, int group_dead) > > > > tsk->exit_state = EXIT_ZOMBIE; > > /* > > - * sub-thread or delay_group_leader(), wake up the > > - * PIDFD_THREAD waiters. > > + * Wake up PIDFD_THREAD waiters if this is a proper subthread > > + * exit. If this is a premature thread-group leader exit delay > > + * the notification until the last subthread exits. If a > > + * subthread should exec before then no notification will be > > + * generated. > > */ > > - if (!thread_group_empty(tsk)) > > + if (!delay_group_leader(tsk)) > > do_notify_pidfd(tsk); > > The same... What you seem to be saying is that you want all references to PIDFD_THREAD to be dropped in the comments because the behavior is now identical. But what I would like to have is comments in the code that illustrate where and how we guarantee this behavioral equivalency. Because where the notifications happen does differ. The delayed thread-group leader stuff is literally only apparent to anyone who has stared and lived with these horrible behavioral warts of early thread-group leader exit and subthread exec for a really long time. For anyone else this isn't self-explanatory at all and each time one has to go look at it it requires jumping around all the locations where and how exit notifications are generated and piece together the whole picture. It is laughably complex to follow. So I'm wiping the comments but I very much disagree that they are misleading/useless. > > > --- a/kernel/signal.c > > +++ b/kernel/signal.c > > @@ -2180,8 +2180,10 @@ bool do_notify_parent(struct task_struct *tsk, int sig) > > WARN_ON_ONCE(!tsk->ptrace && > > (tsk->group_leader != tsk || !thread_group_empty(tsk))); > > /* > > - * tsk is a group leader and has no threads, wake up the > > - * non-PIDFD_THREAD waiters. > > + * This is a thread-group leader without subthreads so wake up > > + * the non-PIDFD_THREAD waiters. This also wakes the > > + * PIDFD_THREAD waiters for the thread-group leader in case it > > + * exited prematurely from release_task(). > > */ > > This too. This one I agree is misplaced. It would be sufficient to have the comment in release_task(). Christian
On Thu, Mar 20, 2025 at 12:21:26PM +0100, Oleg Nesterov wrote: > And just for the record... > > Consider the simplest case: a single-threaded not-ptraced process > exits. In this case do_notify_pidfd() will be called twice, from > exit_notify() and right after that from do_notify_parent(). Yes. > > We can cleanup this logic, but I don't think this is important and > this needs a separate patch. I would actually clean this up. > > (With or without this change: if the exiting task is ptraced or its > parent exits without wait(), do_notify_pidfd() will be called even > more times, but I think we do not care at all). Yes, though the ptrace codeflow is even more cursed so I'm not too worried about making that clean (I wouldn't know how.).
On 03/20, Christian Brauner wrote: > > What you seem to be saying is that you want all references to > PIDFD_THREAD to be dropped in the comments because the behavior is now > identical. yes, to me the references to PIDFD_THREAD look as if PIDFD_THREAD has some subtle differences in behavior. With or without PIDFD_THREAD, do_notify_pidfd() is called and pidfd_poll() returns EPOLLIN when this thread (leader or not) is ready for wait() from the parent or debugger. But! > So I'm wiping the comments but I very much disagree that they are > misleading/useless. No, if you don't agree than do not remove the comments ;) And... can you explain the motivation for this patch? I mean... Again, the current PIDFD_THREAD/group-leader behavior is not well defined, this is clear. But if user-space does sys_pidfd_open(group_leader_pid) and needs the "correct" EPOLLIN when the whole process exits, then it should not use PIDFD_THREAD ? Just in case, I am not arguing, I am just trying to understand. Oleg.
On Thu, Mar 20, 2025 at 03:02:00PM +0100, Oleg Nesterov wrote: > On 03/20, Christian Brauner wrote: > > > > What you seem to be saying is that you want all references to > > PIDFD_THREAD to be dropped in the comments because the behavior is now > > identical. > > yes, to me the references to PIDFD_THREAD look as if PIDFD_THREAD > has some subtle differences in behavior. > > With or without PIDFD_THREAD, do_notify_pidfd() is called and pidfd_poll() > returns EPOLLIN when this thread (leader or not) is ready for wait() from > the parent or debugger. > > But! > > > So I'm wiping the comments but I very much disagree that they are > > misleading/useless. > > No, if you don't agree than do not remove the comments ;) No, it's fine. We always find some compromise and I've reworded the comments substantially to not rely on PIDFD_THREAD at all. I always appreciate the feedback, don't get me wrong! > And... can you explain the motivation for this patch? Yes, sure. > > I mean... Again, the current PIDFD_THREAD/group-leader behavior is > not well defined, this is clear. > > But if user-space does sys_pidfd_open(group_leader_pid) and needs the > "correct" EPOLLIN when the whole process exits, then it should not use > PIDFD_THREAD ? > > Just in case, I am not arguing, I am just trying to understand. One driver is consistency. It's really weird to sometimes get exit notifications and sometimes don't. It's easier to understand that we delay notification until the thread-group is empty for a thread-based pidfd for a thread-group leader rather than explaining de_thread() timing issues for subthread exec. But also remembering our earlier discussion on PIDFD_INFO_EXIT: If the thread-group leader exits prematurely and userspace gets an exit notification they end up with a Zombie they cannot (yet) reap. I don't think we should carry that behavior over into the pidfd api. I'd rather have it be so that if you get an exit notification it means that you can may now reap the thing (I'm probably unaware of some ptrace induced behavior that render this statement wrong.).
diff --git a/fs/pidfs.c b/fs/pidfs.c index a48cc44ced6f..f1c49a7540f3 100644 --- a/fs/pidfs.c +++ b/fs/pidfs.c @@ -218,12 +218,32 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts) /* * Depending on PIDFD_THREAD, inform pollers when the thread * or the whole thread-group exits. + * + * There are two corner cases to consider: + * + * (1) If a thread-group leader of a thread-group with + * subthreads exits prematurely, i.e., before all of the + * subthreads of the thread-group have exited then no + * notification will be generated for PIDFD_THREAD pidfds + * referring to the thread-group leader. + * + * The exit notification for the thread-group leader will be + * delayed until the last subthread of the thread-group + * exits. + * + * (2) If a subthread of a thread-group execs then the + * current thread-group leader will be SIGKILLed and the + * subthread will assume the struct pid of the now defunct + * old thread-group leader. No exit notification will be + * generated for PIDFD_THREAD pidfds referring to the old + * thread-group leader as they continue referring to the new + * thread-group leader. */ guard(rcu)(); task = pid_task(pid, PIDTYPE_PID); if (!task) poll_flags = EPOLLIN | EPOLLRDNORM | EPOLLHUP; - else if (task->exit_state && (thread || thread_group_empty(task))) + else if (task->exit_state && !delay_group_leader(task)) poll_flags = EPOLLIN | EPOLLRDNORM; return poll_flags; diff --git a/kernel/exit.c b/kernel/exit.c index 9916305e34d3..ce5cdad5ba9c 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -271,6 +271,9 @@ void release_task(struct task_struct *p) * If we were the last child thread and the leader has * exited already, and the leader's parent ignores SIGCHLD, * then we are the one who should release the leader. + * + * This will also wake PIDFD_THREAD pidfds for the + * thread-group leader that already exited. */ zap_leader = do_notify_parent(leader, leader->exit_signal); if (zap_leader) @@ -743,10 +746,13 @@ static void exit_notify(struct task_struct *tsk, int group_dead) tsk->exit_state = EXIT_ZOMBIE; /* - * sub-thread or delay_group_leader(), wake up the - * PIDFD_THREAD waiters. + * Wake up PIDFD_THREAD waiters if this is a proper subthread + * exit. If this is a premature thread-group leader exit delay + * the notification until the last subthread exits. If a + * subthread should exec before then no notification will be + * generated. */ - if (!thread_group_empty(tsk)) + if (!delay_group_leader(tsk)) do_notify_pidfd(tsk); if (unlikely(tsk->ptrace)) { diff --git a/kernel/signal.c b/kernel/signal.c index 081f19a24506..0ccef8783dff 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2180,8 +2180,10 @@ bool do_notify_parent(struct task_struct *tsk, int sig) WARN_ON_ONCE(!tsk->ptrace && (tsk->group_leader != tsk || !thread_group_empty(tsk))); /* - * tsk is a group leader and has no threads, wake up the - * non-PIDFD_THREAD waiters. + * This is a thread-group leader without subthreads so wake up + * the non-PIDFD_THREAD waiters. This also wakes the + * PIDFD_THREAD waiters for the thread-group leader in case it + * exited prematurely from release_task(). */ if (thread_group_empty(tsk)) do_notify_pidfd(tsk);