diff mbox series

[v2] exec: don't force_sigsegv processes with a pending fatal signal

Message ID 20190205025308.GA24455@visor (mailing list archive)
State New, archived
Headers show
Series [v2] exec: don't force_sigsegv processes with a pending fatal signal | expand

Commit Message

Ivan Delalande Feb. 5, 2019, 2:53 a.m. UTC
We were seeing unexplained segfaults in coreutils processes and other
basic utilities on systems with print-fatal-signals enabled:

	[  311.001986] potentially unexpected fatal signal 11.
	[  311.001993] CPU: 3 PID: 4565 Comm: tail Tainted: P           O    4.9.100.Ar-8497547.eostrunkkernel49 #1
	[  311.001995] task: ffff88021431b400 task.stack: ffffc90004cec000
	[  311.001997] RIP: 0023:[<00000000f7722c09>]  [<00000000f7722c09>] 0xf7722c09
	[  311.002003] RSP: 002b:00000000ffcc8aa4  EFLAGS: 00000296
	[  311.002004] RAX: fffffffffffffff2 RBX: 0000000057efc530 RCX: 0000000057efdb68
	[  311.002006] RDX: 0000000057effb60 RSI: 0000000057efdb68 RDI: 00000000f768f000
	[  311.002007] RBP: 0000000057efc530 R08: 0000000000000000 R09: 0000000000000000
	[  311.002008] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
	[  311.002009] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
	[  311.002011] FS:  0000000000000000(0000) GS:ffff88021e980000(0000) knlGS:0000000000000000
	[  311.002013] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
	[  311.002014] CR2: 00000000f77bf097 CR3: 0000000150f6f000 CR4: 00000000000406f0

We tracked these crashes down to binfmt_elf failing to load segments
for ld.so inside the kernel. Digging further, the actual problem
seems to occur when a process gets sigkilled while it is still being
loaded by the kernel. In our case when _do_page_fault goes for a retry
it will return early as it first checks for fatal_signal_pending(), so
load_elf_interp also returns with error and as a result
search_binary_handler will force_sigsegv() which is pretty confusing as
nothing actually failed here.


v2: add a message when load_binary fails, add a check for fatal signals
in signal_delivered (avoiding a single check in force_sigsegv as other
architectures use it directly and may have different expectations).

Thanks to Dmitry Safonov and Oleg Nesterov for their comments and
suggestions.

Fixes: 19d860a140be ("handle suicide on late failure exits in execve() in search_binary_handler()")
Reference: https://lkml.org/lkml/2013/2/14/5
Signed-off-by: Ivan Delalande <colona@arista.com>
---
 fs/exec.c       | 7 ++++++-
 kernel/signal.c | 6 +++---
 2 files changed, 9 insertions(+), 4 deletions(-)

Comments

Andrew Morton Feb. 5, 2019, 9:11 p.m. UTC | #1
On Mon, 4 Feb 2019 18:53:08 -0800 Ivan Delalande <colona@arista.com> wrote:

> We were seeing unexplained segfaults in coreutils processes and other
> basic utilities on systems with print-fatal-signals enabled:
> 
> 	[  311.001986] potentially unexpected fatal signal 11.
> 	[  311.001993] CPU: 3 PID: 4565 Comm: tail Tainted: P           O    4.9.100.Ar-8497547.eostrunkkernel49 #1
> 	[  311.001995] task: ffff88021431b400 task.stack: ffffc90004cec000
> 	[  311.001997] RIP: 0023:[<00000000f7722c09>]  [<00000000f7722c09>] 0xf7722c09
> 	[  311.002003] RSP: 002b:00000000ffcc8aa4  EFLAGS: 00000296
> 	[  311.002004] RAX: fffffffffffffff2 RBX: 0000000057efc530 RCX: 0000000057efdb68
> 	[  311.002006] RDX: 0000000057effb60 RSI: 0000000057efdb68 RDI: 00000000f768f000
> 	[  311.002007] RBP: 0000000057efc530 R08: 0000000000000000 R09: 0000000000000000
> 	[  311.002008] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> 	[  311.002009] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> 	[  311.002011] FS:  0000000000000000(0000) GS:ffff88021e980000(0000) knlGS:0000000000000000
> 	[  311.002013] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
> 	[  311.002014] CR2: 00000000f77bf097 CR3: 0000000150f6f000 CR4: 00000000000406f0
> 
> We tracked these crashes down to binfmt_elf failing to load segments
> for ld.so inside the kernel. Digging further, the actual problem
> seems to occur when a process gets sigkilled while it is still being
> loaded by the kernel. In our case when _do_page_fault goes for a retry
> it will return early as it first checks for fatal_signal_pending(), so
> load_elf_interp also returns with error and as a result
> search_binary_handler will force_sigsegv() which is pretty confusing as
> nothing actually failed here.
> 
> 
> v2: add a message when load_binary fails, add a check for fatal signals
> in signal_delivered (avoiding a single check in force_sigsegv as other
> architectures use it directly and may have different expectations).
> 
> Thanks to Dmitry Safonov and Oleg Nesterov for their comments and
> suggestions.
> 
> ...
>
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1660,7 +1660,12 @@ int search_binary_handler(struct linux_binprm *bprm)
>  		if (retval < 0 && !bprm->mm) {
>  			/* we got to flush_old_exec() and failed after it */
>  			read_unlock(&binfmt_lock);
> -			force_sigsegv(SIGSEGV, current);
> +			if (!fatal_signal_pending(current)) {
> +				if (print_fatal_signals)
> +					pr_info("load_binary() failed: %d\n",
> +						retval);

Should we be using print_fatal_signal() here?

> +				force_sigsegv(SIGSEGV, current);
> +			}
>  			return retval;
>  		}
>  		if (retval != -ENOEXEC || !bprm->file) {
> diff --git a/kernel/signal.c b/kernel/signal.c
> index e1d7ad8e6ab1..674076e63624 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2552,10 +2552,10 @@ static void signal_delivered(struct ksignal *ksig, int stepping)
>  
>  void signal_setup_done(int failed, struct ksignal *ksig, int stepping)
>  {
> -	if (failed)
> -		force_sigsegv(ksig->sig, current);
> -	else
> +	if (!failed)
>  		signal_delivered(ksig, stepping);
> +	else if (!fatal_signal_pending(current))
> +		force_sigsegv(ksig->sig, current);
>  }
Ivan Delalande Feb. 6, 2019, 3:10 a.m. UTC | #2
On Tue, Feb 05, 2019 at 01:11:19PM -0800, Andrew Morton wrote:
> On Mon, 4 Feb 2019 18:53:08 -0800 Ivan Delalande <colona@arista.com> wrote:
> > --- a/fs/exec.c
> > +++ b/fs/exec.c
> > @@ -1660,7 +1660,12 @@ int search_binary_handler(struct linux_binprm *bprm)
> >  		if (retval < 0 && !bprm->mm) {
> >  			/* we got to flush_old_exec() and failed after it */
> >  			read_unlock(&binfmt_lock);
> > -			force_sigsegv(SIGSEGV, current);
> > +			if (!fatal_signal_pending(current)) {
> > +				if (print_fatal_signals)
> > +					pr_info("load_binary() failed: %d\n",
> > +						retval);
> 
> Should we be using print_fatal_signal() here?

I don't think so, the force_sigsegv() already ensures it will be called
from get_signal() when the signal is handled, and so the process
information will be printed then.

> > +				force_sigsegv(SIGSEGV, current);
> > +			}
> >  			return retval;
> >  		}
> >  		if (retval != -ENOEXEC || !bprm->file) {


Thanks,
Eric W. Biederman Feb. 8, 2019, 5:13 a.m. UTC | #3
Ivan Delalande <colona@arista.com> writes:

> On Tue, Feb 05, 2019 at 01:11:19PM -0800, Andrew Morton wrote:
>> On Mon, 4 Feb 2019 18:53:08 -0800 Ivan Delalande <colona@arista.com> wrote:
>> > --- a/fs/exec.c
>> > +++ b/fs/exec.c
>> > @@ -1660,7 +1660,12 @@ int search_binary_handler(struct linux_binprm *bprm)
>> >  		if (retval < 0 && !bprm->mm) {
>> >  			/* we got to flush_old_exec() and failed after it */
>> >  			read_unlock(&binfmt_lock);
>> > -			force_sigsegv(SIGSEGV, current);
>> > +			if (!fatal_signal_pending(current)) {
>> > +				if (print_fatal_signals)
>> > +					pr_info("load_binary() failed: %d\n",
>> > +						retval);
>> 
>> Should we be using print_fatal_signal() here?
>
> I don't think so, the force_sigsegv() already ensures it will be called
> from get_signal() when the signal is handled, and so the process
> information will be printed then.
>
>> > +				force_sigsegv(SIGSEGV, current);
>> > +			}
>> >  			return retval;
>> >  		}
>> >  		if (retval != -ENOEXEC || !bprm->file) {
>


I just noticed this.  From  my patch queue that I intend to send to
Linus tomorrow. I think this change fixes your issue of getting
the SIGSEGV instead of the already pending fatal signal.

So I think this fixes your issue without any other code changes.
Ivan can you verify that the patch below is enough?


diff --git a/kernel/signal.c b/kernel/signal.c
index 9ca8e5278c8e..5424cb0006bc 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2393,6 +2393,11 @@ bool get_signal(struct ksignal *ksig)
                goto relock;
        }
 
+       /* Has this task already been marked for death? */
+       ksig->info.si_signo = signr = SIGKILL;
+       if (signal_group_exit(signal))
+               goto fatal;
+
        for (;;) {
                struct k_sigaction *ka;
 
@@ -2488,6 +2493,7 @@ bool get_signal(struct ksignal *ksig)
                        continue;
                }
 
+       fatal:
                spin_unlock_irq(&sighand->siglock);
Ivan Delalande Feb. 9, 2019, 12:16 a.m. UTC | #4
Hi Eric,

On Thu, Feb 07, 2019 at 11:13:59PM -0600, Eric W. Biederman wrote:
> I just noticed this.  From  my patch queue that I intend to send to
> Linus tomorrow. I think this change fixes your issue of getting
> the SIGSEGV instead of the already pending fatal signal.
> 
> So I think this fixes your issue without any other code changes.
> Ivan can you verify that the patch below is enough?

I was having issues with just this patch applied on top of v5.0-rc5 or
the latest master: defunct processes accumulating, exiting processes
that would hang forever, and some kernel functions eating all the CPU
(setup_sigcontext, common_interrupt, __clear_user, do_signal…).

But using your user-namespace.git/for-linus worked great and I've been
running my reproducer for a few hours now without issue. I'll probably
keep it running over the week-end as it has been unreliable at times,
but it looks promising so far.


A difference I've noticed with your tree (unrelated to my issue here but
that you may want to look at) is when I run my reproducer under
strace -f, I'm now getting quite a lot of "Exit of unknown pid 12345
ignored" warnings from strace, which I've never seen with mainline.
My reproducer simply fork-exec tail processes in a loop, and tries to
sigkill them in the parent with a variable delay.

Thank you,

> diff --git a/kernel/signal.c b/kernel/signal.c
> index 9ca8e5278c8e..5424cb0006bc 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2393,6 +2393,11 @@ bool get_signal(struct ksignal *ksig)
>                 goto relock;
>         }
>  
> +       /* Has this task already been marked for death? */
> +       ksig->info.si_signo = signr = SIGKILL;
> +       if (signal_group_exit(signal))
> +               goto fatal;
> +
>         for (;;) {
>                 struct k_sigaction *ka;
>  
> @@ -2488,6 +2493,7 @@ bool get_signal(struct ksignal *ksig)
>                         continue;
>                 }
>  
> +       fatal:
>                 spin_unlock_irq(&sighand->siglock);
>  
>
Eric W. Biederman Feb. 10, 2019, 5:05 p.m. UTC | #5
Ivan Delalande <colona@arista.com> writes:

> Hi Eric,
>
> On Thu, Feb 07, 2019 at 11:13:59PM -0600, Eric W. Biederman wrote:
>> I just noticed this.  From  my patch queue that I intend to send to
>> Linus tomorrow. I think this change fixes your issue of getting
>> the SIGSEGV instead of the already pending fatal signal.
>> 
>> So I think this fixes your issue without any other code changes.
>> Ivan can you verify that the patch below is enough?
>
> I was having issues with just this patch applied on top of v5.0-rc5 or
> the latest master: defunct processes accumulating, exiting processes
> that would hang forever, and some kernel functions eating all the CPU
> (setup_sigcontext, common_interrupt, __clear_user, do_signal…).
>
> But using your user-namespace.git/for-linus worked great and I've been
> running my reproducer for a few hours now without issue. I'll probably
> keep it running over the week-end as it has been unreliable at times,
> but it looks promising so far.

Sounds.  Thank you for finding my tree, and thank you for testing.

> A difference I've noticed with your tree (unrelated to my issue here but
> that you may want to look at) is when I run my reproducer under
> strace -f, I'm now getting quite a lot of "Exit of unknown pid 12345
> ignored" warnings from strace, which I've never seen with mainline.
> My reproducer simply fork-exec tail processes in a loop, and tries to
> sigkill them in the parent with a variable delay.

What was your base tree?

My best guess is that your SIGKILL is getting there before strace
realizes the process has been forked.  If we can understand the race
it is probably worth fixing.

Any chance you can post your reproducer.

It is possible it is my most recent fixes, or it is possible something
changed from the tree you were testing and the tree you are working
on.

Eric
Oleg Nesterov Feb. 11, 2019, 4:02 p.m. UTC | #6
On 02/08, Ivan Delalande wrote:
>
> A difference I've noticed with your tree (unrelated to my issue here but
> that you may want to look at) is when I run my reproducer under
> strace -f, I'm now getting quite a lot of "Exit of unknown pid 12345
> ignored" warnings from strace, which I've never seen with mainline.
> My reproducer simply fork-exec tail processes in a loop, and tries to
> sigkill them in the parent with a variable delay.

Hmm... may be because of PTRACE_EVENT_EXIT problem I mentioned in reply
to this change...

> 
> Thank you,
> 
> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index 9ca8e5278c8e..5424cb0006bc 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -2393,6 +2393,11 @@ bool get_signal(struct ksignal *ksig)
> >                 goto relock;
> >         }
> >  
> > +       /* Has this task already been marked for death? */
> > +       ksig->info.si_signo = signr = SIGKILL;
> > +       if (signal_group_exit(signal))
> > +               goto fatal;
> > +
> >         for (;;) {
> >                 struct k_sigaction *ka;
> >  
> > @@ -2488,6 +2493,7 @@ bool get_signal(struct ksignal *ksig)
> >                         continue;
> >                 }
> >  
> > +       fatal:
> >                 spin_unlock_irq(&sighand->siglock);
> >  
> >    
> 
> -- 
> Ivan Delalande
> Arista Networks
Oleg Nesterov Feb. 11, 2019, 5:12 p.m. UTC | #7
sorry, I couldn't look at this patch before.

On 02/04, Ivan Delalande wrote:
>
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1660,7 +1660,12 @@ int search_binary_handler(struct linux_binprm *bprm)
>  		if (retval < 0 && !bprm->mm) {
>  			/* we got to flush_old_exec() and failed after it */
>  			read_unlock(&binfmt_lock);
> -			force_sigsegv(SIGSEGV, current);
> +			if (!fatal_signal_pending(current)) {
> +				if (print_fatal_signals)
> +					pr_info("load_binary() failed: %d\n",
> +						retval);

I won't argue, but do we really want this spam?

> +				force_sigsegv(SIGSEGV, current);
> +			}
>  			return retval;
>  		}
>  		if (retval != -ENOEXEC || !bprm->file) {
> diff --git a/kernel/signal.c b/kernel/signal.c
> index e1d7ad8e6ab1..674076e63624 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2552,10 +2552,10 @@ static void signal_delivered(struct ksignal *ksig, int stepping)
>
>  void signal_setup_done(int failed, struct ksignal *ksig, int stepping)
>  {
> -	if (failed)
> -		force_sigsegv(ksig->sig, current);
> -	else
> +	if (!failed)
>  		signal_delivered(ksig, stepping);
> +	else if (!fatal_signal_pending(current))
> +		force_sigsegv(ksig->sig, current);

The changelog doesn't explain this change.

OK, I guess it comes from the previous discussion, setup_rt_frame() can equally fail
if fatal_signal_pending(). But this should be documented at least in the changelog,
and I still think we could simply change force_sigsegv() instead.

In any case, Eric has already mentioned that we going to give SIGKILL more priority,
so I think we can drop this patch?

Oleg.
Ivan Delalande Feb. 11, 2019, 11:20 p.m. UTC | #8
On Mon, Feb 11, 2019 at 06:12:53PM +0100, Oleg Nesterov wrote:
> sorry, I couldn't look at this patch before.
> 
> On 02/04, Ivan Delalande wrote:
> >
> > --- a/fs/exec.c
> > +++ b/fs/exec.c
> > @@ -1660,7 +1660,12 @@ int search_binary_handler(struct linux_binprm *bprm)
> >  		if (retval < 0 && !bprm->mm) {
> >  			/* we got to flush_old_exec() and failed after it */
> >  			read_unlock(&binfmt_lock);
> > -			force_sigsegv(SIGSEGV, current);
> > +			if (!fatal_signal_pending(current)) {
> > +				if (print_fatal_signals)
> > +					pr_info("load_binary() failed: %d\n",
> > +						retval);
> 
> I won't argue, but do we really want this spam?
> 
> > +				force_sigsegv(SIGSEGV, current);
> > +			}
> >  			return retval;
> >  		}
> >  		if (retval != -ENOEXEC || !bprm->file) {
> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index e1d7ad8e6ab1..674076e63624 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -2552,10 +2552,10 @@ static void signal_delivered(struct ksignal *ksig, int stepping)
> >
> >  void signal_setup_done(int failed, struct ksignal *ksig, int stepping)
> >  {
> > -	if (failed)
> > -		force_sigsegv(ksig->sig, current);
> > -	else
> > +	if (!failed)
> >  		signal_delivered(ksig, stepping);
> > +	else if (!fatal_signal_pending(current))
> > +		force_sigsegv(ksig->sig, current);
> 
> The changelog doesn't explain this change.
> 
> OK, I guess it comes from the previous discussion, setup_rt_frame() can equally fail
> if fatal_signal_pending(). But this should be documented at least in the changelog,
> and I still think we could simply change force_sigsegv() instead.
> 
> In any case, Eric has already mentioned that we going to give SIGKILL more priority,
> so I think we can drop this patch?

Yes, I've been running our tests on top of Eric's tree over the week-end
and haven't seen any new hit. I also see that Andrew has dropped the
patch from -mm, so no futher action should be required here.

Thank you for taking a look at the patch anyway.
Ivan Delalande Feb. 11, 2019, 11:25 p.m. UTC | #9
On Sun, Feb 10, 2019 at 11:05:52AM -0600, Eric W. Biederman wrote:
> Ivan Delalande <colona@arista.com> writes:
> > A difference I've noticed with your tree (unrelated to my issue here but
> > that you may want to look at) is when I run my reproducer under
> > strace -f, I'm now getting quite a lot of "Exit of unknown pid 12345
> > ignored" warnings from strace, which I've never seen with mainline.
> > My reproducer simply fork-exec tail processes in a loop, and tries to
> > sigkill them in the parent with a variable delay.
> 
> What was your base tree?

It was just off v5.0-rc5, and I didn't see these warnings on the last
few RCs either. Now I'm seeing them on vanilla v5.0-rc6 as well.

> My best guess is that your SIGKILL is getting there before strace
> realizes the process has been forked.  If we can understand the race
> it is probably worth fixing.
> 
> Any chance you can post your reproducer.

Sure, see the attachment. I think this is the simplest version where
these warnings show up. This one just forks/exec `tail -a` to make it
fail and exit 1 as soon as possible, and progressively increase the
delay between the fork and sigkill to try to hit our original issue,
stopping and restarting only after 10 completions of the child as the
timing varies a fair bit.

Running this program under `strace -f -o /dev/null` prints the warnings
almost instantly on my system.

> It is possible it is my most recent fixes, or it is possible something
> changed from the tree you were testing and the tree you are working
> on.

Thanks,
diff mbox series

Patch

diff --git a/fs/exec.c b/fs/exec.c
index fb72d36f7823..caf584064f89 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1660,7 +1660,12 @@  int search_binary_handler(struct linux_binprm *bprm)
 		if (retval < 0 && !bprm->mm) {
 			/* we got to flush_old_exec() and failed after it */
 			read_unlock(&binfmt_lock);
-			force_sigsegv(SIGSEGV, current);
+			if (!fatal_signal_pending(current)) {
+				if (print_fatal_signals)
+					pr_info("load_binary() failed: %d\n",
+						retval);
+				force_sigsegv(SIGSEGV, current);
+			}
 			return retval;
 		}
 		if (retval != -ENOEXEC || !bprm->file) {
diff --git a/kernel/signal.c b/kernel/signal.c
index e1d7ad8e6ab1..674076e63624 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2552,10 +2552,10 @@  static void signal_delivered(struct ksignal *ksig, int stepping)
 
 void signal_setup_done(int failed, struct ksignal *ksig, int stepping)
 {
-	if (failed)
-		force_sigsegv(ksig->sig, current);
-	else
+	if (!failed)
 		signal_delivered(ksig, stepping);
+	else if (!fatal_signal_pending(current))
+		force_sigsegv(ksig->sig, current);
 }
 
 /*