Message ID | 20240608154835.GD7947@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | zap_pid_ns_processes: don't send SIGKILL to sub-threads | expand |
Oleg Nesterov <oleg@redhat.com> writes: > The comment above the idr_for_each_entry_continue() loop tries to explain > why we have to signal each thread in the namespace, but it is outdated. > This code no longer uses kill_proc_info(), we have a target task so we can > check thread_group_leader() and avoid the unnecessary group_send_sig_info. > Better yet, we can change pid_task() to use PIDTYPE_TGID rather than _PID, > this way it returns NULL if this pid is not a group-leader pid. > > Also, change this code to check SIGNAL_GROUP_EXIT, the exiting process / > thread doesn't necessarily has a pending SIGKILL. Either way these checks > are racy without siglock, so the patch uses data_race() to shut up KCSAN. You remove the comment but the meat of what it was trying to say remains true. For processes in a session or processes is a process group a list of all such processes is kept. No such list is kept for a pid namespace. So the best we can do is walk through the allocated pid numbers in the pid namespace. It would also help if this explains that in the case of SIGKILL complete_signal always sets SIGNAL_GROUP_EXIT which makes that a good check to use to see if the process has been killed (with SIGKILL). There are races with coredump here but *shrug* I don't think this changes behavior in that situation. Eric > Signed-off-by: Oleg Nesterov <oleg@redhat.com> > --- > kernel/pid_namespace.c | 13 +++---------- > 1 file changed, 3 insertions(+), 10 deletions(-) > > diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c > index 25f3cf679b35..0f9bd67c9e75 100644 > --- a/kernel/pid_namespace.c > +++ b/kernel/pid_namespace.c > @@ -191,21 +191,14 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns) > * The last thread in the cgroup-init thread group is terminating. > * Find remaining pid_ts in the namespace, signal and wait for them > * to exit. > - * > - * Note: This signals each threads in the namespace - even those that > - * belong to the same thread group, To avoid this, we would have > - * to walk the entire tasklist looking a processes in this > - * namespace, but that could be unnecessarily expensive if the > - * pid namespace has just a few processes. Or we need to > - * maintain a tasklist for each pid namespace. > - * > */ > rcu_read_lock(); > read_lock(&tasklist_lock); > nr = 2; > idr_for_each_entry_continue(&pid_ns->idr, pid, nr) { > - task = pid_task(pid, PIDTYPE_PID); > - if (task && !__fatal_signal_pending(task)) > + task = pid_task(pid, PIDTYPE_TGID); > + /* reading signal->flags is racy without sighand->siglock */ > + if (task && !(data_race(task->signal->flags) & SIGNAL_GROUP_EXIT)) > group_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_MAX); > } > read_unlock(&tasklist_lock);
On 06/13, Eric W. Biederman wrote: > > Oleg Nesterov <oleg@redhat.com> writes: > > > The comment above the idr_for_each_entry_continue() loop tries to explain > > why we have to signal each thread in the namespace, but it is outdated. > > This code no longer uses kill_proc_info(), we have a target task so we can > > check thread_group_leader() and avoid the unnecessary group_send_sig_info. > > Better yet, we can change pid_task() to use PIDTYPE_TGID rather than _PID, > > this way it returns NULL if this pid is not a group-leader pid. > > > > Also, change this code to check SIGNAL_GROUP_EXIT, the exiting process / > > thread doesn't necessarily has a pending SIGKILL. Either way these checks > > are racy without siglock, so the patch uses data_race() to shut up KCSAN. > > You remove the comment but the meat of what it was trying to say remains > true. For processes in a session or processes is a process group a list > of all such processes is kept. No such list is kept for a pid > namespace. So the best we can do is walk through the allocated pid > numbers in the pid namespace. OK, I'll recheck tomorrow. Yet I think it doesn't make sense to send SIGKILL to sub-threads, and the comment looks misleading today. This was the main motivation, but again, I'll recheck. > It would also help if this explains that in the case of SIGKILL > complete_signal always sets SIGNAL_GROUP_EXIT which makes that a good > check to use to see if the process has been killed (with SIGKILL). Well, if SIGNAL_GROUP_EXIT is set we do not care if this process was killed or not. It (the whole thread group) is going to exit, that is all. We can even remove this check, it is just the optimization, just like the current fatal_signal_pending() check. Oleg.
Oleg Nesterov <oleg@redhat.com> writes: > On 06/13, Eric W. Biederman wrote: >> >> Oleg Nesterov <oleg@redhat.com> writes: >> >> > The comment above the idr_for_each_entry_continue() loop tries to explain >> > why we have to signal each thread in the namespace, but it is outdated. >> > This code no longer uses kill_proc_info(), we have a target task so we can >> > check thread_group_leader() and avoid the unnecessary group_send_sig_info. >> > Better yet, we can change pid_task() to use PIDTYPE_TGID rather than _PID, >> > this way it returns NULL if this pid is not a group-leader pid. >> > >> > Also, change this code to check SIGNAL_GROUP_EXIT, the exiting process / >> > thread doesn't necessarily has a pending SIGKILL. Either way these checks >> > are racy without siglock, so the patch uses data_race() to shut up KCSAN. >> >> You remove the comment but the meat of what it was trying to say remains >> true. For processes in a session or processes is a process group a list >> of all such processes is kept. No such list is kept for a pid >> namespace. So the best we can do is walk through the allocated pid >> numbers in the pid namespace. > > OK, I'll recheck tomorrow. Yet I think it doesn't make sense to send > SIGKILL to sub-threads, and the comment looks misleading today. This was > the main motivation, but again, I'll recheck. Yes, we only need to send SIGKILL to only one thread. Of course there are a few weird cases with zombie leader threads, but I think the pattern you are using handles them. >> It would also help if this explains that in the case of SIGKILL >> complete_signal always sets SIGNAL_GROUP_EXIT which makes that a good >> check to use to see if the process has been killed (with SIGKILL). > > Well, if SIGNAL_GROUP_EXIT is set we do not care if this process was > killed or not. It (the whole thread group) is going to exit, that is all. > > We can even remove this check, it is just the optimization, just like > the current fatal_signal_pending() check. I just meant that the optimization is effective because group_send_sig_info calls complete_signal which sets SIGNAL_GROUP_EXIT. Which makes it an almost 100% accurate test, which makes it a very good optimization. Especially in the case of multi-threaded processes where the code will arrive there for every thread. Eric
On 06/13, Oleg Nesterov wrote: > > Well, if SIGNAL_GROUP_EXIT is set we do not care if this process was > killed or not. It (the whole thread group) is going to exit, that is all. OOPS. I forgot again that you removed SIGNAL_GROUP_COREDUMP and thus we can't rely on SIGNAL_GROUP_EXIT in this case. I still think this was not right, but it is too late to complain. Andrew, please drop this patch. Currently zap_pid_ns_processes() kills the coredumping tasks, with this patch it doesn't. Oleg.
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c index 25f3cf679b35..0f9bd67c9e75 100644 --- a/kernel/pid_namespace.c +++ b/kernel/pid_namespace.c @@ -191,21 +191,14 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns) * The last thread in the cgroup-init thread group is terminating. * Find remaining pid_ts in the namespace, signal and wait for them * to exit. - * - * Note: This signals each threads in the namespace - even those that - * belong to the same thread group, To avoid this, we would have - * to walk the entire tasklist looking a processes in this - * namespace, but that could be unnecessarily expensive if the - * pid namespace has just a few processes. Or we need to - * maintain a tasklist for each pid namespace. - * */ rcu_read_lock(); read_lock(&tasklist_lock); nr = 2; idr_for_each_entry_continue(&pid_ns->idr, pid, nr) { - task = pid_task(pid, PIDTYPE_PID); - if (task && !__fatal_signal_pending(task)) + task = pid_task(pid, PIDTYPE_TGID); + /* reading signal->flags is racy without sighand->siglock */ + if (task && !(data_race(task->signal->flags) & SIGNAL_GROUP_EXIT)) group_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_MAX); } read_unlock(&tasklist_lock);
The comment above the idr_for_each_entry_continue() loop tries to explain why we have to signal each thread in the namespace, but it is outdated. This code no longer uses kill_proc_info(), we have a target task so we can check thread_group_leader() and avoid the unnecessary group_send_sig_info. Better yet, we can change pid_task() to use PIDTYPE_TGID rather than _PID, this way it returns NULL if this pid is not a group-leader pid. Also, change this code to check SIGNAL_GROUP_EXIT, the exiting process / thread doesn't necessarily has a pending SIGKILL. Either way these checks are racy without siglock, so the patch uses data_race() to shut up KCSAN. Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- kernel/pid_namespace.c | 13 +++---------- 1 file changed, 3 insertions(+), 10 deletions(-)