Message ID | 20250318-geknebelt-anekdote-87bdb6add5fd@brauner (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v2] pidfs: ensure that PIDFS_INFO_EXIT is available | expand |
I'll try to actually read this patch (and pidfs: improve multi-threaded exec and premature thread-group leader exit polling) tomorrow, but I am a bit confused after the quick glance... On 03/18, Christian Brauner wrote: > > +static inline bool pidfs_pid_valid(struct pid *pid, const struct path *path, > + unsigned int flags) > +{ > + enum pid_type type; > + > + if (flags & CLONE_PIDFD) > + return true; OK, this is clear. > + if (flags & PIDFD_THREAD) > + type = PIDTYPE_PID; > + else > + type = PIDTYPE_TGID; > + > + /* > + * Since pidfs_exit() is called before struct pid's task linkage > + * is removed the case where the task got reaped but a dentry > + * was already attached to struct pid and exit information was > + * recorded and published can be handled correctly. > + */ > + if (unlikely(!pid_has_task(pid, type))) { > + struct inode *inode = d_inode(path->dentry); > + return !!READ_ONCE(pidfs_i(inode)->exit_info); > + } Why pidfs_pid_valid() can't simply return false if !pid_has_task(pid,type) ? pidfd_open() paths check pid_has_task() too and fail if it returns NULL. If this task is already reaped when pidfs_pid_valid() is called, we can pretend it was reaped before sys_pidfd_open() was called? Oleg.
On Tue, Mar 18, 2025 at 03:46:54PM +0100, Oleg Nesterov wrote: > I'll try to actually read this patch (and pidfs: improve multi-threaded > exec and premature thread-group leader exit polling) tomorrow, but I am > a bit confused after the quick glance... > > > On 03/18, Christian Brauner wrote: > > > > +static inline bool pidfs_pid_valid(struct pid *pid, const struct path *path, > > + unsigned int flags) > > +{ > > + enum pid_type type; > > + > > + if (flags & CLONE_PIDFD) > > + return true; > > OK, this is clear. > > > + if (flags & PIDFD_THREAD) > > + type = PIDTYPE_PID; > > + else > > + type = PIDTYPE_TGID; > > + > > + /* > > + * Since pidfs_exit() is called before struct pid's task linkage > > + * is removed the case where the task got reaped but a dentry > > + * was already attached to struct pid and exit information was > > + * recorded and published can be handled correctly. > > + */ > > + if (unlikely(!pid_has_task(pid, type))) { > > + struct inode *inode = d_inode(path->dentry); > > + return !!READ_ONCE(pidfs_i(inode)->exit_info); > > + } > > Why pidfs_pid_valid() can't simply return false if !pid_has_task(pid,type) ? > > pidfd_open() paths check pid_has_task() too and fail if it returns NULL. > If this task is already reaped when pidfs_pid_valid() is called, we can > pretend it was reaped before sys_pidfd_open() was called? We could for sure but why would we. If we know that exit information is available then returning a pidfd can still be valuable for userspace as they can retrieve exit information via PIDFD_INFO_EXIT and it really doesn't hurt to do this.
On 03/19, Christian Brauner wrote: > > On Tue, Mar 18, 2025 at 03:46:54PM +0100, Oleg Nesterov wrote: > > > > Why pidfs_pid_valid() can't simply return false if !pid_has_task(pid,type) ? > > > > pidfd_open() paths check pid_has_task() too and fail if it returns NULL. > > If this task is already reaped when pidfs_pid_valid() is called, we can > > pretend it was reaped before sys_pidfd_open() was called? > > We could for sure but why would we. If we know that exit information is > available then returning a pidfd can still be valuable for userspace as > they can retrieve exit information via PIDFD_INFO_EXIT and it really > doesn't hurt to do this. OK, agreed. I thought I missed another subtle reason. Reviewed-by: Oleg Nesterov <oleg@redhat.com>
diff --git a/fs/pidfs.c b/fs/pidfs.c index 3c630e9d4a62..d980f779c213 100644 --- a/fs/pidfs.c +++ b/fs/pidfs.c @@ -753,8 +753,49 @@ static int pidfs_export_permission(struct handle_to_path_ctx *ctx, return 0; } +static inline bool pidfs_pid_valid(struct pid *pid, const struct path *path, + unsigned int flags) +{ + enum pid_type type; + + if (flags & CLONE_PIDFD) + return true; + + /* + * Make sure that if a pidfd is created PIDFD_INFO_EXIT + * information will be available. So after an inode for the + * pidfd has been allocated perform another check that the pid + * is still alive. If it is exit information is available even + * if the task gets reaped before the pidfd is returned to + * userspace. The only exception is CLONE_PIDFD where no task + * linkage has been established for @pid yet and the kernel is + * in the middle of process creation so there's nothing for + * pidfs to miss. + */ + if (flags & PIDFD_THREAD) + type = PIDTYPE_PID; + else + type = PIDTYPE_TGID; + + /* + * Since pidfs_exit() is called before struct pid's task linkage + * is removed the case where the task got reaped but a dentry + * was already attached to struct pid and exit information was + * recorded and published can be handled correctly. + */ + if (unlikely(!pid_has_task(pid, type))) { + struct inode *inode = d_inode(path->dentry); + return !!READ_ONCE(pidfs_i(inode)->exit_info); + } + + return true; +} + static struct file *pidfs_export_open(struct path *path, unsigned int oflags) { + if (!pidfs_pid_valid(d_inode(path->dentry)->i_private, path, oflags)) + return ERR_PTR(-ESRCH); + /* * Clear O_LARGEFILE as open_by_handle_at() forces it and raise * O_RDWR as pidfds always are. @@ -818,21 +859,30 @@ static struct file_system_type pidfs_type = { struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags) { - struct file *pidfd_file; - struct path path; + struct path path __free(path_put) = {}; int ret; + /* + * Ensure that CLONE_PIDFD can be passed as a flag without + * overloading other uapi pidfd flags. + */ + BUILD_BUG_ON(CLONE_PIDFD == PIDFD_THREAD); + BUILD_BUG_ON(CLONE_PIDFD == PIDFD_NONBLOCK); + ret = path_from_stashed(&pid->stashed, pidfs_mnt, get_pid(pid), &path); if (ret < 0) return ERR_PTR(ret); + if (!pidfs_pid_valid(pid, &path, flags)) + return ERR_PTR(-ESRCH); + + flags &= ~CLONE_PIDFD; pidfd_file = dentry_open(&path, flags, current_cred()); /* Raise PIDFD_THREAD explicitly as do_dentry_open() strips it. */ if (!IS_ERR(pidfd_file)) pidfd_file->f_flags |= (flags & PIDFD_THREAD); - path_put(&path); return pidfd_file; } diff --git a/kernel/fork.c b/kernel/fork.c index 8eac9cd3385b..2c25de14df02 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2425,8 +2425,11 @@ __latent_entropy struct task_struct *copy_process( if (clone_flags & CLONE_PIDFD) { int flags = (clone_flags & CLONE_THREAD) ? PIDFD_THREAD : 0; - /* Note that no task has been attached to @pid yet. */ - retval = __pidfd_prepare(pid, flags, &pidfile); + /* + * Note that no task has been attached to @pid yet indicate + * that via CLONE_PIDFD. + */ + retval = __pidfd_prepare(pid, flags | CLONE_PIDFD, &pidfile); if (retval < 0) goto bad_fork_free_pid; pidfd = retval;
When we currently create a pidfd we check that the task hasn't been reaped right before we create the pidfd. But it is of course possible that by the time we return the pidfd to userspace the task has already been reaped since we don't check again after having created a dentry for it. This was fine until now because that race was meaningless. But now that we provide PIDFD_INFO_EXIT it is a problem because it is possible that the kernel returns a reaped pidfd and it depends on the race whether PIDFD_INFO_EXIT information is available. This depends on if the task gets reaped before or after a dentry has been attached to struct pid. Make this consistent and only returned pidfds for reaped tasks if PIDFD_INFO_EXIT information is available. This is done by performing another check whether the task has been reaped right after we attached a dentry to struct pid. Since pidfs_exit() is called before struct pid's task linkage is removed the case where the task got reaped but a dentry was already attached to struct pid and exit information was recorded and published can be handled correctly. In that case we do return a pidfd for a reaped task like we would've before. Link: https://lore.kernel.org/r/20250316-kabel-fehden-66bdb6a83436@brauner Signed-off-by: Christian Brauner <brauner@kernel.org> --- Now tested and no regressions were observed. This contains minor changes. --- fs/pidfs.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++--- kernel/fork.c | 7 +++++-- 2 files changed, 58 insertions(+), 5 deletions(-)