
[RFC,v9,02/13] x86: always set IF before oopsing from page fault

Message ID e6c57f675e5b53d4de266412aa526b7660c47918.1554248002.git.khalid.aziz@oracle.com (mailing list archive)
State New, archived
Series Add support for eXclusive Page Frame Ownership

Commit Message

Khalid Aziz April 3, 2019, 5:34 p.m. UTC
From: Tycho Andersen <tycho@tycho.ws>

Oopsing might kill the task, via rewind_stack_do_exit() at the bottom, and
that might sleep:

Aug 23 19:30:27 xpfo kernel: [   38.302714] BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:33
Aug 23 19:30:27 xpfo kernel: [   38.303837] in_atomic(): 0, irqs_disabled(): 1, pid: 1970, name: lkdtm_xpfo_test
Aug 23 19:30:27 xpfo kernel: [   38.304758] CPU: 3 PID: 1970 Comm: lkdtm_xpfo_test Tainted: G      D         4.13.0-rc5+ #228
Aug 23 19:30:27 xpfo kernel: [   38.305813] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
Aug 23 19:30:27 xpfo kernel: [   38.306926] Call Trace:
Aug 23 19:30:27 xpfo kernel: [   38.307243]  dump_stack+0x63/0x8b
Aug 23 19:30:27 xpfo kernel: [   38.307665]  ___might_sleep+0xec/0x110
Aug 23 19:30:27 xpfo kernel: [   38.308139]  __might_sleep+0x45/0x80
Aug 23 19:30:27 xpfo kernel: [   38.308593]  exit_signals+0x21/0x1c0
Aug 23 19:30:27 xpfo kernel: [   38.309046]  ? blocking_notifier_call_chain+0x11/0x20
Aug 23 19:30:27 xpfo kernel: [   38.309677]  do_exit+0x98/0xbf0
Aug 23 19:30:27 xpfo kernel: [   38.310078]  ? smp_reader+0x27/0x40 [lkdtm]
Aug 23 19:30:27 xpfo kernel: [   38.310604]  ? kthread+0x10f/0x150
Aug 23 19:30:27 xpfo kernel: [   38.311045]  ? read_user_with_flags+0x60/0x60 [lkdtm]
Aug 23 19:30:27 xpfo kernel: [   38.311680]  rewind_stack_do_exit+0x17/0x20

To be safe, let's just always enable irqs.

The particular case I'm hitting is:

Aug 23 19:30:27 xpfo kernel: [   38.278615]  __bad_area_nosemaphore+0x1a9/0x1d0
Aug 23 19:30:27 xpfo kernel: [   38.278617]  bad_area_nosemaphore+0xf/0x20
Aug 23 19:30:27 xpfo kernel: [   38.278618]  __do_page_fault+0xd1/0x540
Aug 23 19:30:27 xpfo kernel: [   38.278620]  ? irq_work_queue+0x9b/0xb0
Aug 23 19:30:27 xpfo kernel: [   38.278623]  ? wake_up_klogd+0x36/0x40
Aug 23 19:30:27 xpfo kernel: [   38.278624]  trace_do_page_fault+0x3c/0xf0
Aug 23 19:30:27 xpfo kernel: [   38.278625]  do_async_page_fault+0x14/0x60
Aug 23 19:30:27 xpfo kernel: [   38.278627]  async_page_fault+0x28/0x30

This happens when a fault in kernel space has been triggered by XPFO.

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
CC: x86@kernel.org
Tested-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Khalid Aziz <khalid@gonehiking.org>
---
 arch/x86/mm/fault.c | 6 ++++++
 1 file changed, 6 insertions(+)

Comments

Andy Lutomirski April 4, 2019, 12:12 a.m. UTC | #1
On Wed, Apr 3, 2019 at 10:36 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>
> From: Tycho Andersen <tycho@tycho.ws>
>
> Oopsing might kill the task, via rewind_stack_do_exit() at the bottom, and
> that might sleep:
>


> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 9d5c75f02295..7891add0913f 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -858,6 +858,12 @@ no_context(struct pt_regs *regs, unsigned long error_code,
>         /* Executive summary in case the body of the oops scrolled away */
>         printk(KERN_DEFAULT "CR2: %016lx\n", address);
>
> +       /*
> +        * We're about to oops, which might kill the task. Make sure we're
> +        * allowed to sleep.
> +        */
> +       flags |= X86_EFLAGS_IF;
> +
>         oops_end(flags, regs, sig);
>  }
>


NAK.  If there's a bug in rewind_stack_do_exit(), please fix it in
rewind_stack_do_exit().
Tycho Andersen April 4, 2019, 1:42 a.m. UTC | #2
On Wed, Apr 03, 2019 at 05:12:56PM -0700, Andy Lutomirski wrote:
> On Wed, Apr 3, 2019 at 10:36 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
> >
> > From: Tycho Andersen <tycho@tycho.ws>
> >
> > Oopsing might kill the task, via rewind_stack_do_exit() at the bottom, and
> > that might sleep:
> >
> 
> 
> > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > index 9d5c75f02295..7891add0913f 100644
> > --- a/arch/x86/mm/fault.c
> > +++ b/arch/x86/mm/fault.c
> > @@ -858,6 +858,12 @@ no_context(struct pt_regs *regs, unsigned long error_code,
> >         /* Executive summary in case the body of the oops scrolled away */
> >         printk(KERN_DEFAULT "CR2: %016lx\n", address);
> >
> > +       /*
> > +        * We're about to oops, which might kill the task. Make sure we're
> > +        * allowed to sleep.
> > +        */
> > +       flags |= X86_EFLAGS_IF;
> > +
> >         oops_end(flags, regs, sig);
> >  }
> >
> 
> 
> NAK.  If there's a bug in rewind_stack_do_exit(), please fix it in
> rewind_stack_do_exit().

[I trimmed the CC list since google rejected it with E2BIG :)]

I guess the problem is really that do_exit() (or really
exit_signals()) might sleep. Maybe we should put an irq_enable() at
the beginning of do_exit() instead and fix this problem for all
arches?

Tycho
Andy Lutomirski April 4, 2019, 4:12 a.m. UTC | #3
On Wed, Apr 3, 2019 at 6:42 PM Tycho Andersen <tycho@tycho.ws> wrote:
>
> On Wed, Apr 03, 2019 at 05:12:56PM -0700, Andy Lutomirski wrote:
> > On Wed, Apr 3, 2019 at 10:36 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
> > >
> > > From: Tycho Andersen <tycho@tycho.ws>
> > >
> > > Oopsing might kill the task, via rewind_stack_do_exit() at the bottom, and
> > > that might sleep:
> > >
> >
> >
> > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > > index 9d5c75f02295..7891add0913f 100644
> > > --- a/arch/x86/mm/fault.c
> > > +++ b/arch/x86/mm/fault.c
> > > @@ -858,6 +858,12 @@ no_context(struct pt_regs *regs, unsigned long error_code,
> > >         /* Executive summary in case the body of the oops scrolled away */
> > >         printk(KERN_DEFAULT "CR2: %016lx\n", address);
> > >
> > > +       /*
> > > +        * We're about to oops, which might kill the task. Make sure we're
> > > +        * allowed to sleep.
> > > +        */
> > > +       flags |= X86_EFLAGS_IF;
> > > +
> > >         oops_end(flags, regs, sig);
> > >  }
> > >
> >
> >
> > NAK.  If there's a bug in rewind_stack_do_exit(), please fix it in
> > rewind_stack_do_exit().
>
> [I trimmed the CC list since google rejected it with E2BIG :)]
>
> I guess the problem is really that do_exit() (or really
> exit_signals()) might sleep. Maybe we should put an irq_enable() at
> the beginning of do_exit() instead and fix this problem for all
> arches?
>

Hmm.  do_exit() isn't really meant to be "try your best to leave the
system somewhat usable without returning" -- it's a function that,
other than in OOPSes, is called from a well-defined state.  So I think
rewind_stack_do_exit() is probably a better spot.  But we need to
rewind the stack and *then* turn on IRQs, since we otherwise risk
exploding quite badly.
Tycho Andersen April 4, 2019, 3:47 p.m. UTC | #4
On Wed, Apr 03, 2019 at 09:12:16PM -0700, Andy Lutomirski wrote:
> On Wed, Apr 3, 2019 at 6:42 PM Tycho Andersen <tycho@tycho.ws> wrote:
> >
> > On Wed, Apr 03, 2019 at 05:12:56PM -0700, Andy Lutomirski wrote:
> > > On Wed, Apr 3, 2019 at 10:36 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
> > > >
> > > > From: Tycho Andersen <tycho@tycho.ws>
> > > >
> > > > Oopsing might kill the task, via rewind_stack_do_exit() at the bottom, and
> > > > that might sleep:
> > > >
> > >
> > >
> > > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > > > index 9d5c75f02295..7891add0913f 100644
> > > > --- a/arch/x86/mm/fault.c
> > > > +++ b/arch/x86/mm/fault.c
> > > > @@ -858,6 +858,12 @@ no_context(struct pt_regs *regs, unsigned long error_code,
> > > >         /* Executive summary in case the body of the oops scrolled away */
> > > >         printk(KERN_DEFAULT "CR2: %016lx\n", address);
> > > >
> > > > +       /*
> > > > +        * We're about to oops, which might kill the task. Make sure we're
> > > > +        * allowed to sleep.
> > > > +        */
> > > > +       flags |= X86_EFLAGS_IF;
> > > > +
> > > >         oops_end(flags, regs, sig);
> > > >  }
> > > >
> > >
> > >
> > > NAK.  If there's a bug in rewind_stack_do_exit(), please fix it in
> > > rewind_stack_do_exit().
> >
> > [I trimmed the CC list since google rejected it with E2BIG :)]
> >
> > I guess the problem is really that do_exit() (or really
> > exit_signals()) might sleep. Maybe we should put an irq_enable() at
> > the beginning of do_exit() instead and fix this problem for all
> > arches?
> >
> 
> Hmm.  do_exit() isn't really meant to be "try your best to leave the
> system somewhat usable without returning" -- it's a function that,
> other than in OOPSes, is called from a well-defined state.  So I think
> rewind_stack_do_exit() is probably a better spot.  But we need to
> rewind the stack and *then* turn on IRQs, since we otherwise risk
> exploding quite badly.

Ok, sounds good. I guess we can include something like this patch in
the next series.

Thanks,

Tycho


From 34dce229a4f43f90db823671eb0b8da7c4906045 Mon Sep 17 00:00:00 2001
From: Tycho Andersen <tycho@tycho.ws>
Date: Thu, 4 Apr 2019 09:41:32 -0600
Subject: [PATCH] x86/entry: re-enable interrupts before exiting

If the kernel oopses in an interrupt, nothing re-enables interrupts:

Aug 23 19:30:27 xpfo kernel: [   38.302714] BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:33
Aug 23 19:30:27 xpfo kernel: [   38.303837] in_atomic(): 0, irqs_disabled(): 1, pid: 1970, name: lkdtm_xpfo_test
Aug 23 19:30:27 xpfo kernel: [   38.304758] CPU: 3 PID: 1970 Comm: lkdtm_xpfo_test Tainted: G      D         4.13.0-rc5+ #228
Aug 23 19:30:27 xpfo kernel: [   38.305813] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
Aug 23 19:30:27 xpfo kernel: [   38.306926] Call Trace:
Aug 23 19:30:27 xpfo kernel: [   38.307243]  dump_stack+0x63/0x8b
Aug 23 19:30:27 xpfo kernel: [   38.307665]  ___might_sleep+0xec/0x110
Aug 23 19:30:27 xpfo kernel: [   38.308139]  __might_sleep+0x45/0x80
Aug 23 19:30:27 xpfo kernel: [   38.308593]  exit_signals+0x21/0x1c0
Aug 23 19:30:27 xpfo kernel: [   38.309046]  ? blocking_notifier_call_chain+0x11/0x20
Aug 23 19:30:27 xpfo kernel: [   38.309677]  do_exit+0x98/0xbf0
Aug 23 19:30:27 xpfo kernel: [   38.310078]  ? smp_reader+0x27/0x40 [lkdtm]
Aug 23 19:30:27 xpfo kernel: [   38.310604]  ? kthread+0x10f/0x150
Aug 23 19:30:27 xpfo kernel: [   38.311045]  ? read_user_with_flags+0x60/0x60 [lkdtm]
Aug 23 19:30:27 xpfo kernel: [   38.311680]  rewind_stack_do_exit+0x17/0x20

do_exit() expects to be called in a well-defined environment, so let's
re-enable interrupts after unwinding the stack, in case they were disabled.

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
---
 arch/x86/entry/entry_32.S | 6 ++++++
 arch/x86/entry/entry_64.S | 6 ++++++
 2 files changed, 12 insertions(+)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index d309f30cf7af..8ddb7b41669d 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1507,6 +1507,12 @@ ENTRY(rewind_stack_do_exit)
 	movl	PER_CPU_VAR(cpu_current_top_of_stack), %esi
 	leal	-TOP_OF_KERNEL_STACK_PADDING-PTREGS_SIZE(%esi), %esp
 
+	/*
+	 * If we oopsed in an interrupt handler, interrupts may be off. Let's turn
+	 * them back on before going back to "normal" code.
+	 */
+	sti
+
 	call	do_exit
 1:	jmp 1b
 END(rewind_stack_do_exit)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1f0efdb7b629..c0759f3e3ad2 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1672,5 +1672,11 @@ ENTRY(rewind_stack_do_exit)
 	leaq	-PTREGS_SIZE(%rax), %rsp
 	UNWIND_HINT_FUNC sp_offset=PTREGS_SIZE
 
+	/*
+	 * If we oopsed in an interrupt handler, interrupts may be off. Let's turn
+	 * them back on before going back to "normal" code.
+	 */
+	sti
+
 	call	do_exit
 END(rewind_stack_do_exit)
Sebastian Andrzej Siewior April 4, 2019, 4:23 p.m. UTC | #5
- stepping on del button while browsing through CCs.
On 2019-04-04 09:47:27 [-0600], Tycho Andersen wrote:
> > Hmm.  do_exit() isn't really meant to be "try your best to leave the
> > system somewhat usable without returning" -- it's a function that,
> > other than in OOPSes, is called from a well-defined state.  So I think
> > rewind_stack_do_exit() is probably a better spot.  But we need to
> > rewind the stack and *then* turn on IRQs, since we otherwise risk
> > exploding quite badly.
> 
> Ok, sounds good. I guess we can include something like this patch in
> the next series.

The tracing infrastructure probably doesn't know that the interrupts are
back on. Also if you were holding a spin lock then your preempt count
isn't 0 which means that might_sleep() will trigger a splat (in your
backtrace it was zero).
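
To make the point above concrete, here is a rough userspace model (not
kernel code) of the condition a might_sleep()-style check looks at. It
assumes the check warns when either interrupts are disabled or the preempt
count is nonzero, which is only an approximation of the real logic. The
struct and function names are hypothetical stand-ins invented for this
sketch.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-CPU state the check inspects. */
struct cpu_state {
	int preempt_count;   /* nonzero while, e.g., a spinlock is held */
	bool irqs_disabled;  /* models EFLAGS.IF being clear */
};

/*
 * Approximation of the might_sleep() condition: sleeping is only
 * allowed with interrupts on AND a preempt count of zero.
 */
static bool model_would_splat(const struct cpu_state *s)
{
	return s->irqs_disabled || s->preempt_count != 0;
}
```

In this model the state from the backtrace (preempt count 0, irqs
disabled) warns, and turning interrupts back on clears it. A state with a
held spinlock (preempt count 1) still warns after interrupts are enabled,
which is the case a bare sti does not cover.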

> Thanks,
> 
> Tycho
Sebastian
Thomas Gleixner April 4, 2019, 4:28 p.m. UTC | #6
On Thu, 4 Apr 2019, Tycho Andersen wrote:
>  	leaq	-PTREGS_SIZE(%rax), %rsp
>  	UNWIND_HINT_FUNC sp_offset=PTREGS_SIZE
>  
> +	/*
> +	 * If we oopsed in an interrupt handler, interrupts may be off. Let's turn
> +	 * them back on before going back to "normal" code.
> +	 */
> +	sti

That breaks the paravirt muck and tracing/lockdep.

ENABLE_INTERRUPTS() is what you want plus TRACE_IRQS_ON to keep the tracer
and lockdep happy.

Thanks,

	tglx
Andy Lutomirski April 4, 2019, 5:11 p.m. UTC | #7
> On Apr 4, 2019, at 10:28 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
>> On Thu, 4 Apr 2019, Tycho Andersen wrote:
>>    leaq    -PTREGS_SIZE(%rax), %rsp
>>    UNWIND_HINT_FUNC sp_offset=PTREGS_SIZE
>> 
>> +    /*
>> +     * If we oopsed in an interrupt handler, interrupts may be off. Let's turn
>> +     * them back on before going back to "normal" code.
>> +     */
>> +    sti
> 
> That breaks the paravirt muck and tracing/lockdep.
> 
> ENABLE_INTERRUPTS() is what you want plus TRACE_IRQS_ON to keep the tracer
> and lockdep happy.
> 
> 

I’m sure we’ll find some other thing we forgot to reset eventually, so let’s do this in C.  Change the call do_exit to call __finish_rewind_stack_do_exit and add the latter as a C function that does local_irq_enable() and do_exit().
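
The shape of that suggestion can be sketched as a small userspace model.
This is an illustration under stated assumptions, not the eventual kernel
patch: the model_ prefixed names are hypothetical stand-ins for
local_irq_enable(), do_exit() and the proposed
__finish_rewind_stack_do_exit(), and the interrupt flag is just a global
bool. In the real kernel do_exit() does not return; here it only records
what it saw.

```c
#include <stdbool.h>

/* Models EFLAGS.IF; false means interrupts are disabled. */
static bool irqs_enabled;
/* Records whether the modelled do_exit() ran with interrupts on. */
static bool exit_saw_irqs_on;

/* Stand-in for local_irq_enable(). */
static void model_local_irq_enable(void)
{
	irqs_enabled = true;
}

/* Stand-in for do_exit(); the real one never returns. */
static void model_do_exit(long code)
{
	(void)code;
	exit_saw_irqs_on = irqs_enabled;
}

/*
 * Stand-in for the proposed C helper: the assembly stub would rewind
 * the stack and then call this instead of calling do_exit() directly.
 */
static void model_finish_rewind_stack_do_exit(long code)
{
	model_local_irq_enable();
	model_do_exit(code);
}
```

The appeal of doing this in C is that local_irq_enable() already goes
through the paravirt and irq-tracing hooks, so the assembly does not have
to open-code them, and any further state that turns out to need resetting
has an obvious place to go.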

Patch

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 9d5c75f02295..7891add0913f 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -858,6 +858,12 @@  no_context(struct pt_regs *regs, unsigned long error_code,
 	/* Executive summary in case the body of the oops scrolled away */
 	printk(KERN_DEFAULT "CR2: %016lx\n", address);
 
+	/*
+	 * We're about to oops, which might kill the task. Make sure we're
+	 * allowed to sleep.
+	 */
+	flags |= X86_EFLAGS_IF;
+
 	oops_end(flags, regs, sig);
 }