From patchwork Fri Nov 4 15:00:38 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Eric W. Biederman"
X-Patchwork-Id: 9412719
From: ebiederm@xmission.com (Eric W. Biederman)
To: Oleg Nesterov
Cc: Jann Horn, Alexander Viro, Roland McGrath, John Johansen,
 James Morris, "Serge E. Hallyn", Paul Moore, Stephen Smalley,
 Eric Paris, Casey Schaufler, Kees Cook, Andrew Morton,
 Janis Danisevskis, Seth Forshee, Thomas Gleixner, Benjamin LaHaise,
 Ben Hutchings, Andy Lutomirski, Linus Torvalds, Krister Johansen,
 linux-fsdevel@vger.kernel.org, linux-security-module@vger.kernel.org,
 security@kernel.org
References: <1477863998-3298-1-git-send-email-jann@thejh.net>
 <1477863998-3298-2-git-send-email-jann@thejh.net>
 <20161102181806.GB1112@redhat.com>
 <20161102205011.GF8196@pc.thejh.net>
 <20161103181225.GA11212@redhat.com>
 <87k2cj2x6j.fsf@xmission.com>
Date: Fri, 04 Nov 2016 10:00:38 -0500
In-Reply-To: <87k2cj2x6j.fsf@xmission.com> (Eric W. Biederman's message of "Fri, 04 Nov 2016 08:26:28 -0500")
Message-ID: <87k2cjuw6h.fsf@xmission.com>
Subject: Re: [PATCH v3 1/8] exec: introduce cred_guard_light
X-Mailing-List: linux-fsdevel@vger.kernel.org

ebiederm@xmission.com (Eric W. Biederman) writes:

> Oleg Nesterov writes:
>
>> On 11/02, Jann Horn wrote:
>>>
>>> On Wed, Nov 02, 2016 at 07:18:06PM +0100, Oleg Nesterov wrote:
>>> > On 10/30, Jann Horn wrote:
>>> > >
>>> > > This is a new per-threadgroup lock that can often be taken instead of
>>> > > cred_guard_mutex and has less deadlock potential.
>>> > > I'm doing this because
>>> > > Oleg Nesterov mentioned the potential for deadlocks, in particular if a
>>> > > debugged task is stuck in execve, trying to get rid of a ptrace-stopped
>>> > > thread, and the debugger attempts to inspect procfs files of the debugged
>>> > > task.
>>> >
>>> > Yes, but let me repeat that we need to fix this anyway. So I don't really
>>> > understand why should we add yet another mutex.
>>>
>>> execve() only takes the new mutex immediately after de_thread(), so this
>>> problem shouldn't occur there.
>>
>> Yes, I see.
>>
>>> Basically, I think that I'm not making the
>>> problem worse with my patches this way.
>>
>> In a sense that it doesn't add the new deadlocks, I agree. But it adds
>> yet another per-process mutex while we already have the similar one,
>>
>>> I believe that it should be possible to convert most existing users of the
>>> cred_guard_mutex to the new cred_guard_light - exceptions to that that I
>>> see are:
>>>
>>> - PTRACE_ATTACH
>>
>> This is the main problem afaics. So "strace -f" can hang if it races
>> with mt-exec. And we need to fix this. I constantly forget about this
>> problem, but I tried many times to find a reasonable solution, still
>> can't.
>>
>> IMO, it would be nice to rework the lsm hooks, so that we could take
>> cred_guard_mutex after de_thread() (like your cred_guard_light) or
>> at least drop it earlier, but unlikely this is possible...
>>
>> So the only plan I currently have is change de_thread() to wait until
>> other threads pass exit_notify() or even exit_signals(), but I don't
>> like this.
>>
>>> - SECCOMP_FILTER_FLAG_TSYNC (sets NO_NEW_PRIVS on remote task)
>>
>> I forgot about this one... Need to re-check but at first glance this
>> is not a real problem.
>>
>>> Beyond that, conceptually, the new cred_guard_light could also be turned
>>> into a read-write mutex
>>
>> Not sure I understand how this can help... doesn't matter.
>>
>> My point is, imo you should not add the new mutex.
>> Just use the old
>> one in (say) 4/8 (which I do not personally like as you know ;), this
>> won't add the new problem.
>>
>>
>>> It seems to me like SECCOMP_FILTER_FLAG_TSYNC doesn't really have
>>> deadlocking issues.
>>
>> Yes, agreed.
>>
>>> PTRACE_ATTACH isn't that clear to me; if a debugger
>>> tries to attach to a newly spawned thread while another ptraced thread is
>>> dying because of de_thread() in a third thread, that might still cause
>>> the debugger to deadlock, right?
>>
>> This is the trivial test-case I wrote when the problem was initially
>> reported. And damn, I always knew that cred_guard_mutex needs fixes,
>> but somehow I completely forgot that it is used by PTRACE_ATTACH when
>> I was going to try to remove from fs/proc a long ago.
>>
>>     #include <pthread.h>
>>     #include <signal.h>
>>     #include <unistd.h>
>>     #include <sys/ptrace.h>
>>     #include <sys/types.h>
>>
>>     void *thread(void *arg)
>>     {
>>         ptrace(PTRACE_TRACEME, 0,0,0);
>>         return NULL;
>>     }
>>
>>     int main(void)
>>     {
>>         int pid = fork();
>>
>>         if (!pid) {
>>             pthread_t pt;
>>             pthread_create(&pt, NULL, thread, NULL);
>>             pthread_join(pt, NULL);
>>             execlp("echo", "echo", "passed", NULL);
>>         }
>>
>>         sleep(1);
>>         // or anything else which needs ->cred_guard_mutex,
>>         // say open(/proc/$pid/mem)
>>         ptrace(PTRACE_ATTACH, pid, 0,0);
>>         kill(pid, SIGCONT);
>>
>>         return 0;
>>     }
>>
>> The problem is trivial. The execing thread waits until its sub-thread
>> goes away, it should be reaped by the tracer, the tracer waits for
>> cred_guard_mutex.
>
> There is a bug here but I don't believe it has anything to do with
> the cred_guard_mutex.
>
> If we reach zap_other_threads fundamentally the tracer should not
> be able to block the traced thread from exiting. Those are the
> semantics described in the comments in the code.
>
> I have poked things a little and have a half fix for that but
> the fix appears to be the wrong, but enlightening.
>
> AKA the following prevents the hang of your test case.
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 75761acc77cf..a6f83450500e 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1200,7 +1200,7 @@ int zap_other_threads(struct task_struct *p)
>  		if (t->exit_state)
>  			continue;
>  		sigaddset(&t->pending.signal, SIGKILL);
> -		signal_wake_up(t, 1);
> +		signal_wake_up_state(t, TASK_WAKEKILL | __TASK_TRACED);
>  	}
>
>  	return count;
>
> It looks like somewhere on the exit path the traced thread is blocking
> without setting TASK_WAKEKILL.

Apologies, there was a testing mistake and that patch does not actually
help anything.

The following mostly correct patch modifies zap_other_threads in the
case of a de_thread to not wait for zombies to be reaped. The only case
that cares is ptrace (as threads are self reaping). So I don't think
this will cause any problems except removing the strace -f race.

Not waiting for zombies to be reaped in de_thread keeps the kernel from
holding the cred_guard_mutex while waiting for userspace, which should
mean we don't have to move it.

Not waiting for zombies to be reaped should also speed up mt-exec. So I
think this is a benefit all around.
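
To make the accounting change concrete, here is a rough user-space model
of what zap_other_threads counts before and after the patch. It is only
a sketch and not kernel code: struct toy_thread, toy_count_before and
toy_count_after are made-up names for illustration, and the group_exit
flag stands in for the SIGNAL_GROUP_EXIT test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a thread; NOT the kernel's struct task_struct. */
struct toy_thread {
	int exit_state;	/* 0 = still running, nonzero = zombie or dead */
};

/* Before the patch: every other thread is counted, zombies included. */
static int toy_count_before(const struct toy_thread *others, size_t n)
{
	assert(others != NULL || n == 0);
	return (int)n;
}

/*
 * After the patch: zombies still count for a group exit, but for
 * de_thread() (group_exit == false) only live threads are counted.
 */
static int toy_count_after(const struct toy_thread *others, size_t n,
			   bool group_exit)
{
	int count = 0;
	size_t i;

	assert(others != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (group_exit || !others[i].exit_state)
			count++;
	}
	return count;
}
```

With three other threads of which one is already a zombie, the old count
is 3 and the new count in the de_thread case is 2, so in this model exec
no longer waits on the zombie that only the tracer can reap.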

Eric

---
diff --git a/kernel/exit.c b/kernel/exit.c
index 9d68c45ebbe3..8c8556cab655 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -109,7 +109,8 @@ static void __exit_signal(struct task_struct *tsk)
 		 * If there is any task waiting for the group exit
 		 * then notify it:
 		 */
-		if (sig->notify_count > 0 && !--sig->notify_count)
+		if ((sig->flags & SIGNAL_GROUP_EXIT) &&
+		    sig->notify_count > 0 && !--sig->notify_count)
 			wake_up_process(sig->group_exit_task);
 
 		if (tsk == sig->curr_target)
@@ -690,6 +691,10 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
 	if (tsk->exit_state == EXIT_DEAD)
 		list_add(&tsk->ptrace_entry, &dead);
 
+	if (!(tsk->signal->flags & SIGNAL_GROUP_EXIT) &&
+	    tsk->signal->notify_count > 0 && !--tsk->signal->notify_count)
+		wake_up_process(tsk->signal->group_exit_task);
+
 	/* mt-exec, de_thread() is waiting for group leader */
 	if (unlikely(tsk->signal->notify_count < 0))
 		wake_up_process(tsk->signal->group_exit_task);
diff --git a/kernel/signal.c b/kernel/signal.c
index 75761acc77cf..a3a5cd8dad0f 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1194,7 +1194,9 @@ int zap_other_threads(struct task_struct *p)
 
 	while_each_thread(p, t) {
 		task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
-		count++;
+		if ((t->signal->flags & SIGNAL_GROUP_EXIT) ||
+		    !t->exit_state)
+			count++;
 
 		/* Don't bother with already dead threads */
 		if (t->exit_state)