Message ID | 20161105145623.GA21207@redhat.com (mailing list archive)
---|---
State | New, archived
Headers | show
Oleg Nesterov <oleg@redhat.com> writes:

> On 11/04, Oleg Nesterov wrote:
>>
>> On 11/04, Oleg Nesterov wrote:
>> >
>> > On 11/04, Eric W. Biederman wrote:
>> > >
>> > > The following mostly correct patch modifies zap_other_threads in
>> > > the case of a de_thread to not wait for zombies to be reaped. The only
>> > > case that cares is ptrace (as threads are self reaping). So I don't
>> > > think this will cause any problems except removing the strace -f race.
>> >
>> > From my previous email:
>> >
>> > So the only plan I currently have is to change de_thread() to wait until
>> > other threads pass exit_notify() or even exit_signals(), but I don't
>> > like this.
>> >
>> > And yes, I don't like this, but perhaps this is what we should do.
>> >
>> > The patch is incomplete and racy (afaics), and the SIGNAL_GROUP_EXIT
>> > checks don't look right, but of course technically this change should
>> > be simple enough.
>> >
>> > But not that simple. Just for example, the exiting sub-threads should
>> > not run with ->group_leader pointing to nowhere, in case it was reaped
>> > by de_thread.
>>
>> Not to mention other potential problems outside of ptrace/exec. For example,
>> userns_install() can fail after mt-exec even without ptrace, simply because
>> thread_group_empty() can be false. Sure, easy to fix, and probably _install()
>> should use signal->live anyway, but still.
>>
>> And I didn't mention the fun with sighand unsharing. We simply can't do this
>> until all sub-threads go away. IOW, your patch breaks the usage of ->siglock.
>> The execing thread and the zombie threads would use different locks to, say,
>> remove the task from the thread group. Again, this is fixable, but not that
>> simple.
>>
>> > And we have another problem with PTRACE_EVENT_EXIT which can lead to the
>> > same deadlock. Unfortunately, the semantics of PTRACE_EVENT_EXIT were never
>> > defined. But this change will add a user-visible change.
>> >
>> > And if we add user-visible changes, then perhaps we could simply untrace
>> > the traced sub-threads on exec. This change is simple; we do not even need
>> > to touch exec/de_thread, we could just change exit_notify() to ignore ->ptrace
>> > if exec is in progress. But I'm afraid we can't do this.
>
> So I was thinking about something like below. Untested, probably buggy/incomplete
> too, but hopefully it can work.
>
> flush_old_exec() calls the new kill_sub_threads() helper, which waits until
> all the sub-threads pass exit_notify().
>
> de_thread() is called after install_exec_creds(); it is simplified and waits
> for thread_group_empty() without cred_guard_mutex.
>
> But again, I do not really like this, and we need to do something with
> PTRACE_EVENT_EXIT anyway; that needs another/separate change. User-visible.
>
> And I disagree that this has nothing to do with cred_guard_mutex. In any
> case we should narrow its scope in the do_execve() path. Why do we take it
> so early? Why do we need to do, say, copy_strings() with this lock held?
> The original motivation for this is gone; acct_arg_size() can work just
> fine even if multiple threads call sys_execve().

The little piece of this puzzle that I understand is that we don't want
to ptrace_attach while a process is in the middle of exec. The name
cred_guard_mutex is odd for that, but that is what I see that lock
doing.

But ptrace really needs to consider either the original creds and mm or
the final creds and mm. Halfway states are a problem. The solution to
avoid that may simply be some code motion that allows the mutex to have
a smaller hold time.

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Oleg Nesterov <oleg@redhat.com> writes:

[... same quoted discussion as above ...]

> I'll try to discuss the possible changes in LSM hooks with Jann. I still
> think that this is what we actually need to do. At least try to do;
> possibly this is too complicated.

The code below looks interesting. Am I wrong, or do we get the
PTRACE_EVENT_EXIT case wrong for multi-threaded execs when we don't exec
from the primary thread? AKA I think the primary thread will pass
through ptrace_event(PTRACE_EVENT_EXIT) before we steal its thread, and
likewise the thread that calls exec won't pass through
ptrace_event(PTRACE_EVENT_EXIT). Which I suspect gives us quite a bit of
latitude to simply skip that notification entirely. We need to test to
be certain that both gdb and strace can cope.

But I do suspect we could just throw ptrace_event(PTRACE_EVENT_EXIT) out
in the case of a multi-threaded exec and no one would care.

Eric
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -855,13 +855,17 @@ static int load_elf_binary(struct linux_binprm *bprm)
        setup_new_exec(bprm);
        install_exec_creds(bprm);
 
+       retval = de_thread(current);
+       if (retval)
+               goto out_free_dentry;
+
        /* Do this so that we can load the interpreter, if need be.  We will
           change some of these later */
        retval = setup_arg_pages(bprm, randomize_stack_top(STACK_TOP),
                                 executable_stack);
        if (retval < 0)
                goto out_free_dentry;
-       
+
        current->mm->start_stack = bprm->p;
 
        /* Now we do a little grungy work by mmapping the ELF image into
diff --git a/fs/exec.c b/fs/exec.c
index 4e497b9..7246b9f 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1036,13 +1036,59 @@ static int exec_mmap(struct mm_struct *mm)
        return 0;
 }
 
+static int wait_for_notify_count(struct task_struct *tsk, struct signal_struct *sig)
+{
+       for (;;) {
+               if (unlikely(__fatal_signal_pending(tsk)))
+                       goto killed;
+               set_current_state(TASK_KILLABLE);
+               if (!sig->notify_count)
+                       break;
+               schedule();
+       }
+       __set_current_state(TASK_RUNNING);
+       return 0;
+
+killed:
+       /* protects against exit_notify() and __exit_signal() */
+       read_lock(&tasklist_lock);
+       sig->group_exit_task = NULL;
+       sig->notify_count = 0;
+       read_unlock(&tasklist_lock);
+       return -EINTR;
+}
+
+static int kill_sub_threads(struct task_struct *tsk)
+{
+       struct signal_struct *sig = tsk->signal;
+       int err = -EINTR;
+
+       if (thread_group_empty(tsk))
+               return 0;
+
+       read_lock(&tasklist_lock);
+       spin_lock_irq(&tsk->sighand->siglock);
+       if (!signal_group_exit(sig)) {
+               sig->group_exit_task = tsk;
+               sig->notify_count = -zap_other_threads(tsk);
+               err = 0;
+       }
+       spin_unlock_irq(&tsk->sighand->siglock);
+       read_unlock(&tasklist_lock);
+
+       if (!err)
+               err = wait_for_notify_count(tsk, sig);
+       return err;
+
+}
+
 /*
  * This function makes sure the current process has its own signal table,
  * so that flush_signal_handlers can later reset the handlers without
  * disturbing other processes.  (Other processes might share the signal
  * table via the CLONE_SIGHAND option to clone().)
  */
-static int de_thread(struct task_struct *tsk)
+int de_thread(struct task_struct *tsk)
 {
        struct signal_struct *sig = tsk->signal;
        struct sighand_struct *oldsighand = tsk->sighand;
@@ -1051,34 +1097,15 @@ static int de_thread(struct task_struct *tsk)
        if (thread_group_empty(tsk))
                goto no_thread_group;
 
-       /*
-        * Kill all other threads in the thread group.
-        */
        spin_lock_irq(lock);
-       if (signal_group_exit(sig)) {
-               /*
-                * Another group action in progress, just
-                * return so that the signal is processed.
-                */
-               spin_unlock_irq(lock);
-               return -EAGAIN;
-       }
-
-       sig->group_exit_task = tsk;
-       sig->notify_count = zap_other_threads(tsk);
+       sig->notify_count = sig->nr_threads;
        if (!thread_group_leader(tsk))
                sig->notify_count--;
-
-       while (sig->notify_count) {
-               __set_current_state(TASK_KILLABLE);
-               spin_unlock_irq(lock);
-               schedule();
-               if (unlikely(__fatal_signal_pending(tsk)))
-                       goto killed;
-               spin_lock_irq(lock);
-       }
        spin_unlock_irq(lock);
 
+       if (wait_for_notify_count(tsk, sig))
+               return -EINTR;
+
        /*
         * At this point all other threads have exited, all we have to
         * do is to wait for the thread group leader to become inactive,
@@ -1087,24 +1114,8 @@ static int de_thread(struct task_struct *tsk)
        if (!thread_group_leader(tsk)) {
                struct task_struct *leader = tsk->group_leader;
 
-               for (;;) {
-                       threadgroup_change_begin(tsk);
-                       write_lock_irq(&tasklist_lock);
-                       /*
-                        * Do this under tasklist_lock to ensure that
-                        * exit_notify() can't miss ->group_exit_task
-                        */
-                       sig->notify_count = -1;
-                       if (likely(leader->exit_state))
-                               break;
-                       __set_current_state(TASK_KILLABLE);
-                       write_unlock_irq(&tasklist_lock);
-                       threadgroup_change_end(tsk);
-                       schedule();
-                       if (unlikely(__fatal_signal_pending(tsk)))
-                               goto killed;
-               }
-
+               threadgroup_change_begin(tsk);
+               write_lock_irq(&tasklist_lock);
                /*
                 * The only record we have of the real-time age of a
                 * process, regardless of execs it's done, is start_time.
@@ -1162,10 +1173,9 @@ static int de_thread(struct task_struct *tsk)
                release_task(leader);
        }
 
+no_thread_group:
        sig->group_exit_task = NULL;
        sig->notify_count = 0;
-
-no_thread_group:
        /* we have changed execution domain */
        tsk->exit_signal = SIGCHLD;
@@ -1197,14 +1207,6 @@ static int de_thread(struct task_struct *tsk)
 
        BUG_ON(!thread_group_leader(tsk));
        return 0;
-
-killed:
-       /* protects against exit_notify() and __exit_signal() */
-       read_lock(&tasklist_lock);
-       sig->group_exit_task = NULL;
-       sig->notify_count = 0;
-       read_unlock(&tasklist_lock);
-       return -EAGAIN;
 }
 
 char *get_task_comm(char *buf, struct task_struct *tsk)
@@ -1239,7 +1241,7 @@ int flush_old_exec(struct linux_binprm * bprm)
         * Make sure we have a private signal table and that
         * we are unassociated from the previous thread group.
         */
-       retval = de_thread(current);
+       retval = kill_sub_threads(current);
        if (retval)
                goto out;
 
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -101,6 +101,7 @@ extern int __must_check remove_arg_zero(struct linux_binprm *);
 extern int search_binary_handler(struct linux_binprm *);
 extern int flush_old_exec(struct linux_binprm * bprm);
 extern void setup_new_exec(struct linux_binprm * bprm);
+extern int de_thread(struct task_struct *tsk);
 extern void would_dump(struct linux_binprm *, struct file *);
 
 extern int suid_dumpable;
diff --git a/kernel/exit.c b/kernel/exit.c
index 9d68c45..f3dd46d 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -690,8 +690,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
        if (tsk->exit_state == EXIT_DEAD)
                list_add(&tsk->ptrace_entry, &dead);
 
-       /* mt-exec, de_thread() is waiting for group leader */
-       if (unlikely(tsk->signal->notify_count < 0))
+       /* mt-exec, kill_sub_threads() is waiting for group exit */
+       if (unlikely(tsk->signal->notify_count < 0) &&
+           !++tsk->signal->notify_count)
                wake_up_process(tsk->signal->group_exit_task);
 
        write_unlock_irq(&tasklist_lock);
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1194,13 +1194,12 @@ int zap_other_threads(struct task_struct *p)
 
        while_each_thread(p, t) {
                task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
-               count++;
-
                /* Don't bother with already dead threads */
                if (t->exit_state)
                        continue;
                sigaddset(&t->pending.signal, SIGKILL);
                signal_wake_up(t, 1);
+               count++;
        }
 
        return count;