Message ID | 20250318-work-pidfs-thread_group-v2-1-2677898ffa2e@kernel.org (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | pidfs: handle multi-threaded exec and premature thread-group leader exit | expand |
On 03/18, Christian Brauner wrote: > > @@ -746,8 +751,23 @@ static void exit_notify(struct task_struct *tsk, int group_dead) > * sub-thread or delay_group_leader(), wake up the > * PIDFD_THREAD waiters. > */ > - if (!thread_group_empty(tsk)) > - do_notify_pidfd(tsk); > + if (!thread_group_empty(tsk)) { > + if (delay_group_leader(tsk)) { > + struct pid *pid; > + > + /* > + * This is a thread-group leader exiting before > + * all of its subthreads have exited allow pidfd > + * polling to detect this case and delay exit > + * notification until the last thread has > + * exited. > + */ > + pid = task_pid(tsk); > + WRITE_ONCE(pid->delayed_leader, 1); This is racy, tsk->exit_state is already set so pidfd_poll() can see task->exit_state && !pid->delayed_leader. But this is minor. I can't understand all these complications, probably because I barely slept tonight ;) I will re-read this patch again tomorrow, but could you explain why we can't simply use the trivial patch below? Oleg. --- diff --git a/fs/pidfs.c b/fs/pidfs.c index d980f779c213..8a95920aed98 100644 --- a/fs/pidfs.c +++ b/fs/pidfs.c @@ -210,7 +210,6 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct file *f) static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts) { struct pid *pid = pidfd_pid(file); - bool thread = file->f_flags & PIDFD_THREAD; struct task_struct *task; __poll_t poll_flags = 0; @@ -223,7 +222,7 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts) task = pid_task(pid, PIDTYPE_PID); if (!task) poll_flags = EPOLLIN | EPOLLRDNORM | EPOLLHUP; - else if (task->exit_state && (thread || thread_group_empty(task))) + else if (task->exit_state && !delay_group_leader(task)) poll_flags = EPOLLIN | EPOLLRDNORM; return poll_flags; diff --git a/kernel/exit.c b/kernel/exit.c index 9916305e34d3..356ca41d313b 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -746,7 +746,7 @@ static void exit_notify(struct task_struct *tsk, int group_dead) * sub-thread or delay_group_leader(), wake up the * PIDFD_THREAD waiters. */ - if (!thread_group_empty(tsk)) + if (!delay_group_leader(tsk)) do_notify_pidfd(tsk); if (unlikely(tsk->ptrace)) {
On Wed, Mar 19, 2025 at 03:00:52PM +0100, Oleg Nesterov wrote: > On 03/18, Christian Brauner wrote: > > > > @@ -746,8 +751,23 @@ static void exit_notify(struct task_struct *tsk, int group_dead) > > * sub-thread or delay_group_leader(), wake up the > > * PIDFD_THREAD waiters. > > */ > > - if (!thread_group_empty(tsk)) > > - do_notify_pidfd(tsk); > > + if (!thread_group_empty(tsk)) { > > + if (delay_group_leader(tsk)) { > > + struct pid *pid; > > + > > + /* > > + * This is a thread-group leader exiting before > > + * all of its subthreads have exited allow pidfd > > + * polling to detect this case and delay exit > > + * notification until the last thread has > > + * exited. > > + */ > > + pid = task_pid(tsk); > > + WRITE_ONCE(pid->delayed_leader, 1); > > This is racy, tsk->exit_state is already set so pidfd_poll() can see > task->exit_state && !pid->delayed_leader. You're right. I had not considered that. > But this is minor. I can't understand all these complications, > probably because I barely slept tonight ;) I will re-read this patch > again tomorrow, but could you explain why we can't simply use the > trivial patch below? Sure, if that works I'm more than happy if we run with this. > > Oleg. > --- > > diff --git a/fs/pidfs.c b/fs/pidfs.c > index d980f779c213..8a95920aed98 100644 > --- a/fs/pidfs.c > +++ b/fs/pidfs.c > @@ -210,7 +210,6 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct file *f) > static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts) > { > struct pid *pid = pidfd_pid(file); > - bool thread = file->f_flags & PIDFD_THREAD; > struct task_struct *task; > __poll_t poll_flags = 0; > > @@ -223,7 +222,7 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts) > task = pid_task(pid, PIDTYPE_PID); > if (!task) > poll_flags = EPOLLIN | EPOLLRDNORM | EPOLLHUP; > - else if (task->exit_state && (thread || thread_group_empty(task))) > + else if (task->exit_state && !delay_group_leader(task)) > poll_flags = EPOLLIN | EPOLLRDNORM; > > return poll_flags; > diff --git a/kernel/exit.c b/kernel/exit.c > index 9916305e34d3..356ca41d313b 100644 > --- a/kernel/exit.c > +++ b/kernel/exit.c > @@ -746,7 +746,7 @@ static void exit_notify(struct task_struct *tsk, int group_dead) > * sub-thread or delay_group_leader(), wake up the > * PIDFD_THREAD waiters. > */ > - if (!thread_group_empty(tsk)) > + if (!delay_group_leader(tsk)) > do_notify_pidfd(tsk); Two cases we need to handle: (1) thread-group leader exits prematurely and none of the subthreads ever exec. Once the last thread exits it'll notify the thread-specific and non-thread specific thread-group leader pidfd pollers from release_task(). (2) thread-group leader exits prematurely but one of the subthreads later execs. In this case we don't want any exit notification to be generated for thread-specific thread-group leaders. I was concerned that handling (2) would be more complex but it passes all the new tests so I won't complain about less code needed. ;) Do you want me to just dump your draft and slap a Co-Developed-by on it? Another idea I had that I would welcome your thoughts on: When a task execs we could indicate this by generating a POLLPRI event on the pidfd. If we wanted to be fine-grained we could generate POLLPRI | POLLRDUP if a subthread execs. The latter would give userspace a reliable way to detect this case and figure out that tasks changed TIDs.
diff --git a/fs/pidfs.c b/fs/pidfs.c index d980f779c213..3874ccc0f9d7 100644 --- a/fs/pidfs.c +++ b/fs/pidfs.c @@ -223,8 +223,31 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts) task = pid_task(pid, PIDTYPE_PID); if (!task) poll_flags = EPOLLIN | EPOLLRDNORM | EPOLLHUP; - else if (task->exit_state && (thread || thread_group_empty(task))) - poll_flags = EPOLLIN | EPOLLRDNORM; + else if (task->exit_state) { + if (thread) { + /* + * If this is a regular thread exit then notify + * the PIDFD_THREAD waiters. + * + * Don't notify in the following circumstances: + * + * (1) If this is a premature thread-group + * leader exit then delay the exit nofication + * until the last thread in the thread-group + * gets autoreaped as there might still be a + * thread that execs and revives the struct + * pid of the old thread-group leader. + * (2) There's a multi-threaded exec commencing + * and @pid is the current and therefore new + * thread-group leader's pid. + */ + if (likely(!READ_ONCE(pid->delayed_leader))) + poll_flags = EPOLLIN | EPOLLRDNORM; + } + + if (thread_group_empty(task)) + poll_flags = EPOLLIN | EPOLLRDNORM; + } return poll_flags; } diff --git a/include/linux/pid.h b/include/linux/pid.h index 98837a1ff0f3..e6dade16caad 100644 --- a/include/linux/pid.h +++ b/include/linux/pid.h @@ -55,7 +55,8 @@ struct upid { struct pid { refcount_t count; - unsigned int level; + u16 level; + u16 delayed_leader; spinlock_t lock; struct dentry *stashed; u64 ino; diff --git a/kernel/exit.c b/kernel/exit.c index 9916305e34d3..f01ee0a08707 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -267,6 +267,11 @@ void release_task(struct task_struct *p) leader = p->group_leader; if (leader != p && thread_group_empty(leader) && leader->exit_state == EXIT_ZOMBIE) { + struct pid *pid; + + pid = task_pid(leader); + WRITE_ONCE(pid->delayed_leader, 0); + /* * If we were the last child thread and the leader has * exited already, and the leader's parent ignores SIGCHLD, @@ -746,8 +751,23 @@ static void exit_notify(struct task_struct *tsk, int group_dead) * sub-thread or delay_group_leader(), wake up the * PIDFD_THREAD waiters. */ - if (!thread_group_empty(tsk)) - do_notify_pidfd(tsk); + if (!thread_group_empty(tsk)) { + if (delay_group_leader(tsk)) { + struct pid *pid; + + /* + * This is a thread-group leader exiting before + * all of its subthreads have exited allow pidfd + * polling to detect this case and delay exit + * notification until the last thread has + * exited. + */ + pid = task_pid(tsk); + WRITE_ONCE(pid->delayed_leader, 1); + } else { + do_notify_pidfd(tsk); + } + } if (unlikely(tsk->ptrace)) { int sig = thread_group_leader(tsk) && diff --git a/kernel/pid.c b/kernel/pid.c index 22f5d2b2e290..7b8ad2a8e74f 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -264,6 +264,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid, get_pid_ns(ns); refcount_set(&pid->count, 1); spin_lock_init(&pid->lock); + pid->delayed_leader = 0; for (type = 0; type < PIDTYPE_MAX; ++type) INIT_HLIST_HEAD(&pid->tasks[type]); @@ -386,6 +387,15 @@ void exchange_tids(struct task_struct *left, struct task_struct *right) struct hlist_head *head1 = &pid1->tasks[PIDTYPE_PID]; struct hlist_head *head2 = &pid2->tasks[PIDTYPE_PID]; + /* + * If delayed leader marker is set then this was a malformed + * thread-group exec. The thread-group leader had exited before + * all of its subthreads and then one of the subthreads execed. + * The struct pid continues it's existence so remove the + * premature thread-group leader exit indicator. + */ + WRITE_ONCE(pid2->delayed_leader, 0); + /* Swap the single entry tid lists */ hlists_swap_heads_rcu(head1, head2);
This is another attempt trying to make pidfd polling for multi-threaded exec and premature thread-group leader exit consistent. A quick recap of these two cases: (1) During a multi-threaded exec by a subthread, i.e., non-thread-group leader thread, all other threads in the thread-group including the thread-group leader are killed and the struct pid of the thread-group leader will be taken over by the subthread that called exec. IOW, two tasks change their TIDs. (2) A premature thread-group leader exit means that the thread-group leader exited before all of the other subthreads in the thread-group have exited. Both cases lead to inconsistencies for pidfd polling with PIDFD_THREAD. Any caller that holds a PIDFD_THREAD pidfd to the current thread-group leader may or may not see an exit notification on the file descriptor depending on when poll is performed. If the poll is performed before the exec of the subthread has concluded an exit notification is generated for the old thread-group leader. If the poll is performed after the exec of the subthread has concluded no exit notification is generated for the old thread-group leader. The correct behavior would be to simply not generate an exit notification on the struct pid of a subhthread exec because the struct pid is taken over by the subthread and thus remains alive. But this is difficult to handle because a thread-group may exit prematurely as mentioned in (2). In that case an exit notification is reliably generated but the subthreads may continue to run for an indeterminate amount of time and thus also may exec at some point. So far there was no way to distinguish between (1) and (2) internally. This tiny series tries to address this problem by remembering a premature leader exit in struct pid and forgetting it when a subthread execs and takes over the old thread-group leaders struct pid. This can be done without growing struct pid. The 32-bit pid namespace level indicator can be split into two 16-bit integers as only 32 levels of pid namespace nesting are allowed. Even with 16-bit the nesting level can in the future be bumped up to 65,535 (which isn't feasible/sensible for a lot of reasons). The second 16-bit are used as an indicator for a premature thread-group leader exec which is cleared when the last subthread gets autoreaped and the prematurely exited thread-group leader is notified. If that works correctly then no exit notifications are generated for a PIDFD_THREAD pidfd for a thread-group leader until all subthreads have been reaped. If a subthread should exec aftewards no exit notification will be generated until that task exits or it creates subthreads and repeates the cycle. Signed-off-by: Christian Brauner <brauner@kernel.org> --- fs/pidfs.c | 27 +++++++++++++++++++++++++-- include/linux/pid.h | 3 ++- kernel/exit.c | 24 ++++++++++++++++++++++-- kernel/pid.c | 10 ++++++++++ 4 files changed, 59 insertions(+), 5 deletions(-)