From patchwork Sat Nov 5 14:56:23 2016 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Oleg Nesterov X-Patchwork-Id: 9413715 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id 2D1006048F for ; Sat, 5 Nov 2016 14:56:39 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 1DCBA291A5 for ; Sat, 5 Nov 2016 14:56:39 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 120CE291E7; Sat, 5 Nov 2016 14:56:39 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.9 required=2.0 tests=BAYES_00,RCVD_IN_DNSWL_HI autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 59087291A5 for ; Sat, 5 Nov 2016 14:56:38 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755070AbcKEO4g (ORCPT ); Sat, 5 Nov 2016 10:56:36 -0400 Received: from mx1.redhat.com ([209.132.183.28]:39744 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755006AbcKEO4e (ORCPT ); Sat, 5 Nov 2016 10:56:34 -0400 Received: from int-mx09.intmail.prod.int.phx2.redhat.com (int-mx09.intmail.prod.int.phx2.redhat.com [10.5.11.22]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id D876C85545; Sat, 5 Nov 2016 14:56:32 +0000 (UTC) Received: from dhcp-27-174.brq.redhat.com ([10.34.27.30]) by int-mx09.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with SMTP id uA5EuOL5028768; Sat, 5 Nov 2016 10:56:24 -0400 Received: by dhcp-27-174.brq.redhat.com (nbSMTP-1.00) for uid 1000 oleg@redhat.com; Sat, 5 Nov 2016 15:56:32 +0100 (CET) Date: Sat, 5 Nov 2016 15:56:23 +0100 From: Oleg Nesterov To: "Eric W. Biederman" Cc: Jann Horn , Alexander Viro , Roland McGrath , John Johansen , James Morris , "Serge E. Hallyn" , Paul Moore , Stephen Smalley , Eric Paris , Casey Schaufler , Kees Cook , Andrew Morton , Janis Danisevskis , Seth Forshee , Thomas Gleixner , Benjamin LaHaise , Ben Hutchings , Andy Lutomirski , Linus Torvalds , Krister Johansen , linux-fsdevel@vger.kernel.org, linux-security-module@vger.kernel.org, security@kernel.org Subject: Re: [PATCH v3 1/8] exec: introduce cred_guard_light Message-ID: <20161105145623.GA21207@redhat.com> References: <1477863998-3298-1-git-send-email-jann@thejh.net> <1477863998-3298-2-git-send-email-jann@thejh.net> <20161102181806.GB1112@redhat.com> <20161102205011.GF8196@pc.thejh.net> <20161103181225.GA11212@redhat.com> <87k2cj2x6j.fsf@xmission.com> <87k2cjuw6h.fsf@xmission.com> <20161104180416.GA19221@redhat.com> <20161104184505.GA21320@redhat.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20161104184505.GA21320@redhat.com> User-Agent: Mutt/1.5.24 (2015-08-30) X-Scanned-By: MIMEDefang 2.68 on 10.5.11.22 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.28]); Sat, 05 Nov 2016 14:56:33 +0000 (UTC) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP On 11/04, Oleg Nesterov wrote: > > On 11/04, Oleg Nesterov wrote: > > > > On 11/04, Eric W. Biederman wrote: > > > > > > The following mostly correct patch modifies zap_other_threads in > > > the case of a de_thread to not wait for zombies to be reaped. The only > > > case that cares is ptrace (as threads are self reaping). So I don't > > > think this will cause any problems except removing the strace -f race. > > > > From my previous email: > > > > So the only plan I currently have is change de_thread() to wait until > > other threads pass exit_notify() or even exit_signals(), but I don't > > like this. > > > > And yes, I don't like this, but perhaps this is what we should do. > > > > The patch is incomplete and racy (afaics), and the SIGNAL_GROUP_EXIT > > checks doesn't look right, but off course technically this change should > > be simple enough. > > > > But not that simple. Just for example, the exiting sub-threads should > > not run with ->group_leader pointing to nowhere, in case it was reaped > > by de_thread. > > Not to mention other potential problems outside of ptrace/exec. For example > userns_install() can fail after mt-exec even without ptrace, simply because > thread_group_empty() can be false. Sure, easy to fix, and probably _install() > should use signal->live anyway, but still. > > And I didn't mention the fun with sighand unsharing. We simply can't do this > until all sub-threads go away. IOW, your patch breaks the usage of ->siglock. > The execing thread and the zombie threads will use different locks to, say, > remove the task from thread-group. Again, this is fixable, but not that > simple. > > > And we have another problem with PTRACE_EVENT_EXIT which can lead to the > > same deadlock. Unfortunately, the semantics of PTRACE_EVENT_EXIT was never > > defined. But this change will add the user-visible change. > > > > And if we add the user-visible changes, then perhaps we could simply untrace > > the traced sub-threads on exec. This change is simple, we do not even need > > to touch exec/de_thread, we could just change exit_notify() to ignore ->ptrace > > if exec is in progress. But I'm afraid we can't do this. So I was thinking about something like below. Untested, probably buggy/incomplete too, but hopefully can work. flush_old_exec() calls the new kill_sub_threads() helper which waits until all the sub-threads pass exit_notify(). de_thread() is called after install_exec_creds(), it is simplified and waits for thread_group_empty() without cred_guard_mutex. But again, I do not really like this, and we need to do something with PTRACE_EVENT_EXIT anyway, this needs another/separate change. User-visible. And I disagree that this has nothing to do with cred_guard_mutex. And in any case we should narrow its scope in do_execve() path. Why do we take it so early? Why do we need to do, say, copy_strings() with this lock held? The original motivation for this has gone, acct_arg_size() can work just fine even if multiple threads call sys_execve(). I'll try to discuss the possible changes in LSM hooks with Jann, I still think that this is what we actually need to do. At least try to do, possibly this is too complicated. Oleg. fs/binfmt_elf.c | 6 ++- fs/exec.c | 108 ++++++++++++++++++++++++------------------------ include/linux/binfmts.h | 1 + kernel/exit.c | 5 ++- kernel/signal.c | 3 +- 5 files changed, 65 insertions(+), 58 deletions(-) --- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -855,13 +855,17 @@ static int load_elf_binary(struct linux_binprm *bprm) setup_new_exec(bprm); install_exec_creds(bprm); + retval = de_thread(current); + if (retval) + goto out_free_dentry; + /* Do this so that we can load the interpreter, if need be. We will change some of these later */ retval = setup_arg_pages(bprm, randomize_stack_top(STACK_TOP), executable_stack); if (retval < 0) goto out_free_dentry; - + current->mm->start_stack = bprm->p; /* Now we do a little grungy work by mmapping the ELF image into diff --git a/fs/exec.c b/fs/exec.c index 4e497b9..7246b9f 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1036,13 +1036,59 @@ static int exec_mmap(struct mm_struct *mm) return 0; } +static int wait_for_notify_count(struct task_struct *tsk, struct signal_struct *sig) +{ + for (;;) { + if (unlikely(__fatal_signal_pending(tsk))) + goto killed; + set_current_state(TASK_KILLABLE); + if (!sig->notify_count) + break; + schedule(); + } + __set_current_state(TASK_RUNNING); + return 0; + +killed: + /* protects against exit_notify() and __exit_signal() */ + read_lock(&tasklist_lock); + sig->group_exit_task = NULL; + sig->notify_count = 0; + read_unlock(&tasklist_lock); + return -EINTR; +} + +static int kill_sub_threads(struct task_struct *tsk) +{ + struct signal_struct *sig = tsk->signal; + int err = -EINTR; + + if (thread_group_empty(tsk)) + return 0; + + read_lock(&tasklist_lock); + spin_lock_irq(&tsk->sighand->siglock); + if (!signal_group_exit(sig)) { + sig->group_exit_task = tsk; + sig->notify_count = -zap_other_threads(tsk); + err = 0; + } + spin_unlock_irq(&tsk->sighand->siglock); + read_unlock(&tasklist_lock); + + if (!err) + err = wait_for_notify_count(tsk, sig); + return err; + +} + /* * This function makes sure the current process has its own signal table, * so that flush_signal_handlers can later reset the handlers without * disturbing other processes. (Other processes might share the signal * table via the CLONE_SIGHAND option to clone().) */ -static int de_thread(struct task_struct *tsk) +int de_thread(struct task_struct *tsk) { struct signal_struct *sig = tsk->signal; struct sighand_struct *oldsighand = tsk->sighand; @@ -1051,34 +1097,15 @@ static int de_thread(struct task_struct *tsk) if (thread_group_empty(tsk)) goto no_thread_group; - /* - * Kill all other threads in the thread group. - */ spin_lock_irq(lock); - if (signal_group_exit(sig)) { - /* - * Another group action in progress, just - * return so that the signal is processed. - */ - spin_unlock_irq(lock); - return -EAGAIN; - } - - sig->group_exit_task = tsk; - sig->notify_count = zap_other_threads(tsk); + sig->notify_count = sig->nr_threads; if (!thread_group_leader(tsk)) sig->notify_count--; - - while (sig->notify_count) { - __set_current_state(TASK_KILLABLE); - spin_unlock_irq(lock); - schedule(); - if (unlikely(__fatal_signal_pending(tsk))) - goto killed; - spin_lock_irq(lock); - } spin_unlock_irq(lock); + if (wait_for_notify_count(tsk, sig)) + return -EINTR; + /* * At this point all other threads have exited, all we have to * do is to wait for the thread group leader to become inactive, @@ -1087,24 +1114,8 @@ static int de_thread(struct task_struct *tsk) if (!thread_group_leader(tsk)) { struct task_struct *leader = tsk->group_leader; - for (;;) { - threadgroup_change_begin(tsk); - write_lock_irq(&tasklist_lock); - /* - * Do this under tasklist_lock to ensure that - * exit_notify() can't miss ->group_exit_task - */ - sig->notify_count = -1; - if (likely(leader->exit_state)) - break; - __set_current_state(TASK_KILLABLE); - write_unlock_irq(&tasklist_lock); - threadgroup_change_end(tsk); - schedule(); - if (unlikely(__fatal_signal_pending(tsk))) - goto killed; - } - + threadgroup_change_begin(tsk); + write_lock_irq(&tasklist_lock); /* * The only record we have of the real-time age of a * process, regardless of execs it's done, is start_time. @@ -1162,10 +1173,9 @@ static int de_thread(struct task_struct *tsk) release_task(leader); } +no_thread_group: sig->group_exit_task = NULL; sig->notify_count = 0; - -no_thread_group: /* we have changed execution domain */ tsk->exit_signal = SIGCHLD; @@ -1197,14 +1207,6 @@ static int de_thread(struct task_struct *tsk) BUG_ON(!thread_group_leader(tsk)); return 0; - -killed: - /* protects against exit_notify() and __exit_signal() */ - read_lock(&tasklist_lock); - sig->group_exit_task = NULL; - sig->notify_count = 0; - read_unlock(&tasklist_lock); - return -EAGAIN; } char *get_task_comm(char *buf, struct task_struct *tsk) @@ -1239,7 +1241,7 @@ int flush_old_exec(struct linux_binprm * bprm) * Make sure we have a private signal table and that * we are unassociated from the previous thread group. */ - retval = de_thread(current); + retval = kill_sub_threads(current); if (retval) goto out; --- a/include/linux/binfmts.h +++ b/include/linux/binfmts.h @@ -101,6 +101,7 @@ extern int __must_check remove_arg_zero(struct linux_binprm *); extern int search_binary_handler(struct linux_binprm *); extern int flush_old_exec(struct linux_binprm * bprm); extern void setup_new_exec(struct linux_binprm * bprm); +extern int de_thread(struct task_struct *tsk); extern void would_dump(struct linux_binprm *, struct file *); extern int suid_dumpable; diff --git a/kernel/exit.c b/kernel/exit.c index 9d68c45..f3dd46d 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -690,8 +690,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead) if (tsk->exit_state == EXIT_DEAD) list_add(&tsk->ptrace_entry, &dead); - /* mt-exec, de_thread() is waiting for group leader */ - if (unlikely(tsk->signal->notify_count < 0)) + /* mt-exec, kill_sub_threads() is waiting for group exit */ + if (unlikely(tsk->signal->notify_count < 0) && + !++tsk->signal->notify_count) wake_up_process(tsk->signal->group_exit_task); write_unlock_irq(&tasklist_lock); --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1194,13 +1194,12 @@ int zap_other_threads(struct task_struct *p) while_each_thread(p, t) { task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK); - count++; - /* Don't bother with already dead threads */ if (t->exit_state) continue; sigaddset(&t->pending.signal, SIGKILL); signal_wake_up(t, 1); + count++; } return count;