diff mbox

[v3,1/8] exec: introduce cred_guard_light

Message ID 87k2cjuw6h.fsf@xmission.com (mailing list archive)
State New, archived
Headers show

Commit Message

Eric W. Biederman Nov. 4, 2016, 3 p.m. UTC
ebiederm@xmission.com (Eric W. Biederman) writes:

> Oleg Nesterov <oleg@redhat.com> writes:
>
>> On 11/02, Jann Horn wrote:
>>>
>>> On Wed, Nov 02, 2016 at 07:18:06PM +0100, Oleg Nesterov wrote:
>>> > On 10/30, Jann Horn wrote:
>>> > >
>>> > > This is a new per-threadgroup lock that can often be taken instead of
>>> > > cred_guard_mutex and has less deadlock potential. I'm doing this because
>>> > > Oleg Nesterov mentioned the potential for deadlocks, in particular if a
>>> > > debugged task is stuck in execve, trying to get rid of a ptrace-stopped
>>> > > thread, and the debugger attempts to inspect procfs files of the debugged
>>> > > task.
>>> >
>>> > Yes, but let me repeat that we need to fix this anyway. So I don't really
>>> > understand why should we add yet another mutex.
>>>
>>> execve() only takes the new mutex immediately after de_thread(), so this
>>> problem shouldn't occur there.
>>
>> Yes, I see.
>>
>>> Basically, I think that I'm not making the
>>> problem worse with my patches this way.
>>
>> In a sense that it doesn't add the new deadlocks, I agree. But it adds
>> yet another per-process mutex while we already have the similar one,
>>
>>> I believe that it should be possible to convert most existing users of the
>>> cred_guard_mutex to the new cred_guard_light - exceptions to that that I
>>> see are:
>>>
>>>  - PTRACE_ATTACH
>>
>> This is the main problem afaics. So "strace -f" can hang if it races
>> with mt-exec. And we need to fix this. I constantly forget about this
>> problem, but I tried many times to find a reasonable solution, still
>> can't.
>>
>> IMO, it would be nice to rework the lsm hooks, so that we could take
>> cred_guard_mutex after de_thread() (like your cred_guard_light) or
>> at least drop it earlier, but unlikely this is possible...
>>
>> So the only plan I currently have is change de_thread() to wait until
>> other threads pass exit_notify() or even exit_signals(), but I don't
>> like this.
>>
>>>  - SECCOMP_FILTER_FLAG_TSYNC (sets NO_NEW_PRIVS on remote task)
>>
>> I forgot about this one... Need to re-check but at first glance this
>> is not a real problem.
>>
>>> Beyond that, conceptually, the new cred_guard_light could also be turned
>>> into a read-write mutex
>>
>> Not sure I understand how this can help... doesn't matter.
>>
>> My point is, imo you should not add the new mutex. Just use the old
>> one in (say) 4/8 (which I do not personally like as you know ;), this
>> won't add the new problem.
>>
>>
>>> It seems to me like SECCOMP_FILTER_FLAG_TSYNC doesn't really have
>>> deadlocking issues.
>>
>> Yes, agreed.
>>
>>> PTRACE_ATTACH isn't that clear to me; if a debugger
>>> tries to attach to a newly spawned thread while another ptraced thread is
>>> dying because of de_thread() in a third thread, that might still cause
>>> the debugger to deadlock, right?
>>
>> This is the trivial test-case I wrote when the problem was initially
>> reported. And damn, I always knew that cred_guard_mutex needs fixes,
>> but somehow I completely forgot that it is used by PTRACE_ATTACH when
>> I was going to try to remove from fs/proc a long ago.
>>
>> 	void *thread(void *arg)
>> 	{
>> 		ptrace(PTRACE_TRACEME, 0,0,0);
>> 		return NULL;
>> 	}
>>
>> 	int main(void)
>> 	{
>> 		int pid = fork();
>>
>> 		if (!pid) {
>> 			pthread_t pt;
>> 			pthread_create(&pt, NULL, thread, NULL);
>> 			pthread_join(pt, NULL);
>> 			execlp("echo", "echo", "passed", NULL);
>> 		}
>>
>> 		sleep(1);
>> 		// or anything else which needs ->cred_guard_mutex,
>> 		// say open(/proc/$pid/mem)
>> 		ptrace(PTRACE_ATTACH, pid, 0,0);
>> 		kill(pid, SIGCONT);
>>
>> 		return 0;
>> 	}
>>
>> The problem is trivial. The execing thread waits until its sub-thread
>> goes away, it should be reaped by the tracer, the tracer waits for
>> cred_guard_mutex.
>
> There is a bug here but I don't believe it has anything to do with
> the cred_guard_mutex.
>
> If we reach zap_other_threads fundamentally the tracer should not
> be able to block the traced thread from exiting.  Those are the
> semantics described in the comments in the code.
>
> I have poked things a little and have a half fix for that but
> the fix appears to be the wrong, but enlightening.
>
> AKA the following prevents the hang of your test case.
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 75761acc77cf..a6f83450500e 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1200,7 +1200,7 @@ int zap_other_threads(struct task_struct *p)
>  		if (t->exit_state)
>  			continue;
>  		sigaddset(&t->pending.signal, SIGKILL);
> -		signal_wake_up(t, 1);
> +		signal_wake_up_state(t, TASK_WAKEKILL | __TASK_TRACED);
>  	}
>  
>  	return count;
>
> It looks like somewhere on the exit path the traced thread is blocking
> without setting TASK_WAKEKILL.

Apologies there was a testing mistake and that patch does not actually
help anything.

The following mostly correct patch modifies zap_other_threads in
the case of a de_thread to not wait for zombies to be reaped.  The only
case that cares is ptrace (as threads are self reaping).  So I don't
think this will cause any problems except removing the strace -f race.

Not waiting for zombies to be reaped in de_thread keeps the kernel from
holding the cred_guard_mutex while waiting for userspace.  Which should
mean we don't have to move it.

Not waiting for zombies to be reaped should also speed of mt-exec.  So I
think this is a benefit all around.




Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Oleg Nesterov Nov. 4, 2016, 6:04 p.m. UTC | #1
On 11/04, Eric W. Biederman wrote:
>
> The following mostly correct patch modifies zap_other_threads in
> the case of a de_thread to not wait for zombies to be reaped.  The only
> case that cares is ptrace (as threads are self reaping).  So I don't
> think this will cause any problems except removing the strace -f race.

From my previous email:

	So the only plan I currently have is change de_thread() to wait until
	other threads pass exit_notify() or even exit_signals(), but I don't
	like this.

And yes, I don't like this, but perhaps this is what we should do.

The patch is incomplete and racy (afaics), and the SIGNAL_GROUP_EXIT
checks doesn't look right, but off course technically this change should
be simple enough.

But not that simple. Just for example, the exiting sub-threads should
not run with ->group_leader pointing to nowhere, in case it was reaped
by de_thread.

And we have another problem with PTRACE_EVENT_EXIT which can lead to the
same deadlock. Unfortunately, the semantics of PTRACE_EVENT_EXIT was never
defined. But this change will add the user-visible change.

And if we add the user-visible changes, then perhaps we could simply untrace
the traced sub-threads on exec. This change is simple, we do not even need
to touch exec/de_thread, we could just change exit_notify() to ignore ->ptrace
if exec is in progress. But I'm afraid we can't do this.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Oleg Nesterov Nov. 4, 2016, 6:45 p.m. UTC | #2
On 11/04, Oleg Nesterov wrote:
>
> On 11/04, Eric W. Biederman wrote:
> >
> > The following mostly correct patch modifies zap_other_threads in
> > the case of a de_thread to not wait for zombies to be reaped.  The only
> > case that cares is ptrace (as threads are self reaping).  So I don't
> > think this will cause any problems except removing the strace -f race.
>
> From my previous email:
>
> 	So the only plan I currently have is change de_thread() to wait until
> 	other threads pass exit_notify() or even exit_signals(), but I don't
> 	like this.
>
> And yes, I don't like this, but perhaps this is what we should do.
>
> The patch is incomplete and racy (afaics), and the SIGNAL_GROUP_EXIT
> checks doesn't look right, but off course technically this change should
> be simple enough.
>
> But not that simple. Just for example, the exiting sub-threads should
> not run with ->group_leader pointing to nowhere, in case it was reaped
> by de_thread.

Not to mention other potential problems outside of ptrace/exec. For example
userns_install() can fail after mt-exec even without ptrace, simply because
thread_group_empty() can be false. Sure, easy to fix, and probably _install()
should use signal->live anyway, but still.

And I didn't mention the fun with sighand unsharing. We simply can't do this
until all sub-threads go away. IOW, your patch breaks the usage of ->siglock.
The execing thread and the zombie threads will use different locks to, say,
remove the task from thread-group. Again, this is fixable, but not that
simple.

> And we have another problem with PTRACE_EVENT_EXIT which can lead to the
> same deadlock. Unfortunately, the semantics of PTRACE_EVENT_EXIT was never
> defined. But this change will add the user-visible change.
>
> And if we add the user-visible changes, then perhaps we could simply untrace
> the traced sub-threads on exec. This change is simple, we do not even need
> to touch exec/de_thread, we could just change exit_notify() to ignore ->ptrace
> if exec is in progress. But I'm afraid we can't do this.

Eric, I hope you see my emails, I got the "Undelivered Mail Returned to Sender"
	...
	This is the mail system at host mail.kernel.org.
	...
	<ebiederm@xmission.com> (expanded from <security@kernel.org>): host
	    mx.xmission.com[166.70.12.20] said: 550-XM-RJCT16: SPF Failure
	    (ip=198.145.29.136, frm=oleg@redhat.com, 550 result=fail) (in reply to RCPT
	    TO command)

right now I have no idea what does this mean.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Kees Cook Nov. 8, 2016, 10:02 p.m. UTC | #3
On Fri, Nov 4, 2016 at 11:45 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> Eric, I hope you see my emails, I got the "Undelivered Mail Returned to Sender"
>         ...
>         This is the mail system at host mail.kernel.org.
>         ...
>         <ebiederm@xmission.com> (expanded from <security@kernel.org>): host
>             mx.xmission.com[166.70.12.20] said: 550-XM-RJCT16: SPF Failure
>             (ip=198.145.29.136, frm=oleg@redhat.com, 550 result=fail) (in reply to RCPT
>             TO command)
>
> right now I have no idea what does this mean.

This is a problem for Google folks too sometimes. This is saying that
xmission.com is checking redhat.com's SPF records and refusing to let
kernel.org deliver email as if it were redhat.com (due to
security@kernel.org being an alias not a mailing list). There aren't
good solutions for this, but best I've found is to have my
security@kernel.org alias be a @kernel.org address instead of an
@google.com address...

-Kees
Eric W. Biederman Nov. 8, 2016, 10:46 p.m. UTC | #4
Kees Cook <keescook@chromium.org> writes:

> On Fri, Nov 4, 2016 at 11:45 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> Eric, I hope you see my emails, I got the "Undelivered Mail Returned to Sender"
>>         ...

Oleg I can receive your messages directly and through vger.kernel.org
lists, but I can't receive them through the email reflector at
security@kernel.org.

>>         This is the mail system at host mail.kernel.org.
>>         ...
>>         <ebiederm@xmission.com> (expanded from <security@kernel.org>): host
>>             mx.xmission.com[166.70.12.20] said: 550-XM-RJCT16: SPF Failure
>>             (ip=198.145.29.136, frm=oleg@redhat.com, 550 result=fail) (in reply to RCPT
>>             TO command)
>>
>> right now I have no idea what does this mean.
>
> This is a problem for Google folks too sometimes. This is saying that
> xmission.com is checking redhat.com's SPF records and refusing to let
> kernel.org deliver email as if it were redhat.com (due to
> security@kernel.org being an alias not a mailing list). There aren't
> good solutions for this, but best I've found is to have my
> security@kernel.org alias be a @kernel.org address instead of an
> @google.com address...

Ugh.  Is even redhat configuring the redhat email to do that?
I will have to look.

Last I looked xmission.com was just enforcing the policy that the other
mail domains were asking to be enforced on themselves.  But those are
policies that are incompatible with mailing lists in general.  Although
I do get confused about which part SPF and DKIM play in this mess.

I just remember that the last several ``enhancements'' to email were
busily breaking mailing lists and I thought they were completely insane.
I can even find evidence that it is (or at least was) so bad that email
standards comittee member's can't comminicate with each other via email
lists.

vger.kernel.org appears to rewrite the envelope sender to avoid
problems.

If xmission is doing any more than just performing what the domain of
the senders of email asked them to do I will be happy to see if I can
to sort it out.

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benjamin LaHaise Nov. 8, 2016, 10:56 p.m. UTC | #5
On Tue, Nov 08, 2016 at 04:46:44PM -0600, Eric W. Biederman wrote:
> Kees Cook <keescook@chromium.org> writes:
...
> > This is a problem for Google folks too sometimes. This is saying that
> > xmission.com is checking redhat.com's SPF records and refusing to let
> > kernel.org deliver email as if it were redhat.com (due to
> > security@kernel.org being an alias not a mailing list). There aren't
> > good solutions for this, but best I've found is to have my
> > security@kernel.org alias be a @kernel.org address instead of an
> > @google.com address...
> 
> Ugh.  Is even redhat configuring the redhat email to do that?
> I will have to look.
> 
> Last I looked xmission.com was just enforcing the policy that the other
> mail domains were asking to be enforced on themselves.  But those are
> policies that are incompatible with mailing lists in general.  Although
> I do get confused about which part SPF and DKIM play in this mess.
> 
> I just remember that the last several ``enhancements'' to email were
> busily breaking mailing lists and I thought they were completely insane.
> I can even find evidence that it is (or at least was) so bad that email
> standards comittee member's can't comminicate with each other via email
> lists.
> 
> vger.kernel.org appears to rewrite the envelope sender to avoid
> problems.

Envelope sender rewriting is insufficient, the From: lines need to be 
rewritten to be compliant.  This is a pain in the ass for the @kvack.org 
mailing lists as well -- people with @google.com addresses don't see the 
mailing list postings of users from @google.com and other domains using 
"enhanced" email header "validation" techniques.

		-ben

> If xmission is doing any more than just performing what the domain of
> the senders of email asked them to do I will be happy to see if I can
> to sort it out.
> 
> Eric
Eric W. Biederman Nov. 8, 2016, 11:33 p.m. UTC | #6
Benjamin LaHaise <bcrl@kvack.org> writes:

> On Tue, Nov 08, 2016 at 04:46:44PM -0600, Eric W. Biederman wrote:
>> Kees Cook <keescook@chromium.org> writes:
> ...
>> > This is a problem for Google folks too sometimes. This is saying that
>> > xmission.com is checking redhat.com's SPF records and refusing to let
>> > kernel.org deliver email as if it were redhat.com (due to
>> > security@kernel.org being an alias not a mailing list). There aren't
>> > good solutions for this, but best I've found is to have my
>> > security@kernel.org alias be a @kernel.org address instead of an
>> > @google.com address...
>> 
>> Ugh.  Is even redhat configuring the redhat email to do that?
>> I will have to look.
>> 
>> Last I looked xmission.com was just enforcing the policy that the other
>> mail domains were asking to be enforced on themselves.  But those are
>> policies that are incompatible with mailing lists in general.  Although
>> I do get confused about which part SPF and DKIM play in this mess.
>> 
>> I just remember that the last several ``enhancements'' to email were
>> busily breaking mailing lists and I thought they were completely insane.
>> I can even find evidence that it is (or at least was) so bad that email
>> standards comittee member's can't comminicate with each other via email
>> lists.
>> 
>> vger.kernel.org appears to rewrite the envelope sender to avoid
>> problems.
>
> Envelope sender rewriting is insufficient, the From: lines need to be 
> rewritten to be compliant.  This is a pain in the ass for the @kvack.org 
> mailing lists as well -- people with @google.com addresses don't see the 
> mailing list postings of users from @google.com and other domains using 
> "enhanced" email header "validation" techniques.

That definitely happens in the worst case.  At least for Oleg something
less serious is happening because the from header does not get changed
and the email gets to me through the vger.kernel.org lists.

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/kernel/exit.c b/kernel/exit.c
index 9d68c45ebbe3..8c8556cab655 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -109,7 +109,8 @@  static void __exit_signal(struct task_struct *tsk)
 		 * If there is any task waiting for the group exit
 		 * then notify it:
 		 */
-		if (sig->notify_count > 0 && !--sig->notify_count)
+		if ((sig->flags & SIGNAL_GROUP_EXIT) &&
+		    sig->notify_count > 0 && !--sig->notify_count)
 			wake_up_process(sig->group_exit_task);
 
 		if (tsk == sig->curr_target)
@@ -690,6 +691,10 @@  static void exit_notify(struct task_struct *tsk, int group_dead)
 	if (tsk->exit_state == EXIT_DEAD)
 		list_add(&tsk->ptrace_entry, &dead);
 
+	if (!(tsk->signal->flags & SIGNAL_GROUP_EXIT) &&
+	    tsk->signal->notify_count > 0 && !--tsk->signal->notify_count)
+		wake_up_process(tsk->signal->group_exit_task);
+
 	/* mt-exec, de_thread() is waiting for group leader */
 	if (unlikely(tsk->signal->notify_count < 0))
 		wake_up_process(tsk->signal->group_exit_task);
diff --git a/kernel/signal.c b/kernel/signal.c
index 75761acc77cf..a3a5cd8dad0f 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1194,7 +1194,9 @@  int zap_other_threads(struct task_struct *p)
 
 	while_each_thread(p, t) {
 		task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
-		count++;
+		if ((t->signal->flags & SIGNAL_GROUP_EXIT) ||
+		    !t->exit_state)
+			count++;
 
 		/* Don't bother with already dead threads */
 		if (t->exit_state)