| Message ID | 1531308586-29340-8-git-send-email-joro@8bytes.org (mailing list archive) |
|---|---|
| State | New, archived |
> On Jul 11, 2018, at 4:29 AM, Joerg Roedel <joro@8bytes.org> wrote:
>
> From: Joerg Roedel <jroedel@suse.de>
>
> Use the entry-stack as a trampoline to enter the kernel. The
> entry-stack is already in the cpu_entry_area and will be
> mapped to userspace when PTI is enabled.
>
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
> arch/x86/entry/entry_32.S        | 136 +++++++++++++++++++++++++++++++--------
> arch/x86/include/asm/switch_to.h |   6 +-
> arch/x86/kernel/asm-offsets.c    |   1 +
> arch/x86/kernel/cpu/common.c     |   5 +-
> arch/x86/kernel/process.c        |   2 -
> arch/x86/kernel/process_32.c     |  10 +--
> 6 files changed, 121 insertions(+), 39 deletions(-)
>
> diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
> index 61303fa..528db7d 100644
> --- a/arch/x86/entry/entry_32.S
> +++ b/arch/x86/entry/entry_32.S
> @@ -154,25 +154,36 @@
>
> #endif /* CONFIG_X86_32_LAZY_GS */
>
> -.macro SAVE_ALL pt_regs_ax=%eax
> +.macro SAVE_ALL pt_regs_ax=%eax switch_stacks=0
> 	cld
> +	/* Push segment registers and %eax */
> 	PUSH_GS
> 	pushl	%fs
> 	pushl	%es
> 	pushl	%ds
> 	pushl	\pt_regs_ax
> +
> +	/* Load kernel segments */
> +	movl	$(__USER_DS), %eax

If \pt_regs_ax != %eax, then this will behave oddly.  Maybe it’s okay.
But I don’t see why this change was needed at all.

> +	movl	%eax, %ds
> +	movl	%eax, %es
> +	movl	$(__KERNEL_PERCPU), %eax
> +	movl	%eax, %fs
> +	SET_KERNEL_GS %eax
> +
> +	/* Push integer registers and complete PT_REGS */
> 	pushl	%ebp
> 	pushl	%edi
> 	pushl	%esi
> 	pushl	%edx
> 	pushl	%ecx
> 	pushl	%ebx
> -	movl	$(__USER_DS), %edx
> -	movl	%edx, %ds
> -	movl	%edx, %es
> -	movl	$(__KERNEL_PERCPU), %edx
> -	movl	%edx, %fs
> -	SET_KERNEL_GS %edx
> +
> +	/* Switch to kernel stack if necessary */
> +.if \switch_stacks > 0
> +	SWITCH_TO_KERNEL_STACK
> +.endif
> +
> .endm
>
> /*
> @@ -269,6 +280,72 @@
> .Lend_\@:
> #endif /* CONFIG_X86_ESPFIX32 */
> .endm
> +
> +
> +/*
> + * Called with pt_regs fully populated and kernel segments loaded,
> + * so we can access PER_CPU and use the integer registers.
> + *
> + * We need to be very careful here with the %esp switch, because an NMI
> + * can happen everywhere. If the NMI handler finds itself on the
> + * entry-stack, it will overwrite the task-stack and everything we
> + * copied there. So allocate the stack-frame on the task-stack and
> + * switch to it before we do any copying.

Ick, right.  Same with machine check, though.  You could alternatively
fix it by running NMIs on an irq stack if the irq count is zero.  How
confident are you that you got #MC right?

> + */
> +.macro SWITCH_TO_KERNEL_STACK
> +
> +	ALTERNATIVE     "", "jmp .Lend_\@", X86_FEATURE_XENPV
> +
> +	/* Are we on the entry stack? Bail out if not! */
> +	movl	PER_CPU_VAR(cpu_entry_area), %edi
> +	addl	$CPU_ENTRY_AREA_entry_stack, %edi
> +	cmpl	%esp, %edi
> +	jae	.Lend_\@

That’s an alarming assumption about the address space layout.  How
about an xor and an and instead of cmpl?  As it stands, if the address
layout ever changes, the failure may be rather subtle.

Anyway, wouldn’t it be easier to solve this by just not switching
stacks on entries from kernel mode and making the entry stack bigger?
Stick an assertion in the scheduling code that we’re not on an entry
stack, perhaps.
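A rough sketch of the xor-plus-and test suggested above, for illustration only (it is not part of the posted patch); it assumes the entry stack is aligned to its power-of-two size, which is the same assumption the MASK_entry_stack definition in this series relies on, and the register choice is arbitrary:

	/* Sketch: bail out unless %esp lies inside the per-cpu entry-stack window */
	movl	PER_CPU_VAR(cpu_entry_area), %edi
	addl	$CPU_ENTRY_AREA_entry_stack, %edi	/* entry-stack base */
	xorl	%esp, %edi				/* address bits that differ */
	andl	$(MASK_entry_stack), %edi		/* keep only bits above the stack size */
	jnz	.Lend_\@				/* not on the entry stack, bail out */

Unlike the cmpl/jae version, this does not care what sits above or below the entry stack in the address space; it only tests membership in the entry-stack window itself.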
Hi Andy,

thanks for your valuable feedback.

On Thu, Jul 12, 2018 at 02:09:45PM -0700, Andy Lutomirski wrote:
> > On Jul 11, 2018, at 4:29 AM, Joerg Roedel <joro@8bytes.org> wrote:
> > -.macro SAVE_ALL pt_regs_ax=%eax
> > +.macro SAVE_ALL pt_regs_ax=%eax switch_stacks=0
> > 	cld
> > +	/* Push segment registers and %eax */
> > 	PUSH_GS
> > 	pushl	%fs
> > 	pushl	%es
> > 	pushl	%ds
> > 	pushl	\pt_regs_ax
> > +
> > +	/* Load kernel segments */
> > +	movl	$(__USER_DS), %eax
>
> If \pt_regs_ax != %eax, then this will behave oddly.  Maybe it’s okay.
> But I don’t see why this change was needed at all.

This is a left-over from a previous approach I tried and then abandoned
later. You are right, it is not needed.

> > +/*
> > + * Called with pt_regs fully populated and kernel segments loaded,
> > + * so we can access PER_CPU and use the integer registers.
> > + *
> > + * We need to be very careful here with the %esp switch, because an NMI
> > + * can happen everywhere. If the NMI handler finds itself on the
> > + * entry-stack, it will overwrite the task-stack and everything we
> > + * copied there. So allocate the stack-frame on the task-stack and
> > + * switch to it before we do any copying.
>
> Ick, right.  Same with machine check, though.  You could alternatively
> fix it by running NMIs on an irq stack if the irq count is zero.  How
> confident are you that you got #MC right?

Pretty confident, #MC uses the exception entry path which also handles
entry-stack and user-cr3 correctly. It might go through the slow
paranoid exit path, but that's okay for #MC I guess.

And when the #MC happens while we switch to the task stack and do the
copying, the same precautions as for NMI apply.

> > + */
> > +.macro SWITCH_TO_KERNEL_STACK
> > +
> > +	ALTERNATIVE     "", "jmp .Lend_\@", X86_FEATURE_XENPV
> > +
> > +	/* Are we on the entry stack? Bail out if not! */
> > +	movl	PER_CPU_VAR(cpu_entry_area), %edi
> > +	addl	$CPU_ENTRY_AREA_entry_stack, %edi
> > +	cmpl	%esp, %edi
> > +	jae	.Lend_\@
>
> That’s an alarming assumption about the address space layout.  How
> about an xor and an and instead of cmpl?  As it stands, if the address
> layout ever changes, the failure may be rather subtle.

Right, I'll implement a more restrictive check.

> Anyway, wouldn’t it be easier to solve this by just not switching
> stacks on entries from kernel mode and making the entry stack bigger?
> Stick an assertion in the scheduling code that we’re not on an entry
> stack, perhaps.

That'll save us the check whether we are on the entry stack and replace
it with a check whether we are coming from user/vm86 mode. I don't think
that this will simplify things much, and I am a bit afraid that it'll
break unwritten assumptions elsewhere. It is probably something we can
look into later, separately from the basic pti-x32 enablement.

Thanks,

	Joerg
On Fri, Jul 13, 2018 at 3:56 AM, Joerg Roedel <joro@8bytes.org> wrote:
> Hi Andy,
>
> thanks for your valuable feedback.
>
> On Thu, Jul 12, 2018 at 02:09:45PM -0700, Andy Lutomirski wrote:
>> > On Jul 11, 2018, at 4:29 AM, Joerg Roedel <joro@8bytes.org> wrote:
>> > -.macro SAVE_ALL pt_regs_ax=%eax
>> > +.macro SAVE_ALL pt_regs_ax=%eax switch_stacks=0
>> > 	cld
>> > +	/* Push segment registers and %eax */
>> > 	PUSH_GS
>> > 	pushl	%fs
>> > 	pushl	%es
>> > 	pushl	%ds
>> > 	pushl	\pt_regs_ax
>> > +
>> > +	/* Load kernel segments */
>> > +	movl	$(__USER_DS), %eax
>>
>> If \pt_regs_ax != %eax, then this will behave oddly.  Maybe it’s okay.
>> But I don’t see why this change was needed at all.
>
> This is a left-over from a previous approach I tried and then abandoned
> later. You are right, it is not needed.
>
>> > +/*
>> > + * Called with pt_regs fully populated and kernel segments loaded,
>> > + * so we can access PER_CPU and use the integer registers.
>> > + *
>> > + * We need to be very careful here with the %esp switch, because an NMI
>> > + * can happen everywhere. If the NMI handler finds itself on the
>> > + * entry-stack, it will overwrite the task-stack and everything we
>> > + * copied there. So allocate the stack-frame on the task-stack and
>> > + * switch to it before we do any copying.
>>
>> Ick, right.  Same with machine check, though.  You could alternatively
>> fix it by running NMIs on an irq stack if the irq count is zero.  How
>> confident are you that you got #MC right?
>
> Pretty confident, #MC uses the exception entry path which also handles
> entry-stack and user-cr3 correctly. It might go through the slow
> paranoid exit path, but that's okay for #MC I guess.
>
> And when the #MC happens while we switch to the task stack and do the
> copying, the same precautions as for NMI apply.
>
>> > + */
>> > +.macro SWITCH_TO_KERNEL_STACK
>> > +
>> > +	ALTERNATIVE     "", "jmp .Lend_\@", X86_FEATURE_XENPV
>> > +
>> > +	/* Are we on the entry stack? Bail out if not! */
>> > +	movl	PER_CPU_VAR(cpu_entry_area), %edi
>> > +	addl	$CPU_ENTRY_AREA_entry_stack, %edi
>> > +	cmpl	%esp, %edi
>> > +	jae	.Lend_\@
>>
>> That’s an alarming assumption about the address space layout.  How
>> about an xor and an and instead of cmpl?  As it stands, if the address
>> layout ever changes, the failure may be rather subtle.
>
> Right, I'll implement a more restrictive check.

But the check needs to be correct or we'll mess up, right?  I think
the code will be much more robust and easier to review if you check
"on the entry stack" instead of ">= the entry stack".  (Or <= -- I can
never remember how this works in AT&T syntax.)

>
>> Anyway, wouldn’t it be easier to solve this by just not switching
>> stacks on entries from kernel mode and making the entry stack bigger?
>> Stick an assertion in the scheduling code that we’re not on an entry
>> stack, perhaps.
>
> That'll save us the check whether we are on the entry stack and replace
> it with a check whether we are coming from user/vm86 mode. I don't think
> that this will simplify things much, and I am a bit afraid that it'll
> break unwritten assumptions elsewhere. It is probably something we can
> look into later, separately from the basic pti-x32 enablement.
>

Fair enough.  There's also the issue that NMI still has to switch CR3
if it hits with the wrong CR3.

I personally much prefer checking whether you came from user mode
rather than the stack address, but I'm okay with either approach here.
On Fri, Jul 13, 2018 at 10:21:39AM -0700, Andy Lutomirski wrote:
> On Fri, Jul 13, 2018 at 3:56 AM, Joerg Roedel <joro@8bytes.org> wrote:
> > Right, I'll implement a more restrictive check.
>
> But the check needs to be correct or we'll mess up, right?  I think
> the code will be much more robust and easier to review if you check
> "on the entry stack" instead of ">= the entry stack".  (Or <= -- I can
> never remember how this works in AT&T syntax.)

Yeah, I re-used the check implemented on the NMI entry path; it checks
exactly for the entry-stack range.

Regards,

	Joerg
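For comparison, an explicit range check along the lines of the NMI entry path could look roughly like the sketch below; register and label choice are illustrative only, and this is not lifted from any posted version of the patch. The point is that it only accepts a stack pointer that really lies within the entry stack:

	/* Sketch: proceed only if  base <= %esp < base + SIZEOF_entry_stack */
	movl	PER_CPU_VAR(cpu_entry_area), %edi
	addl	$CPU_ENTRY_AREA_entry_stack, %edi	/* entry-stack base */
	cmpl	%esp, %edi
	ja	.Lend_\@				/* %esp below the entry stack */
	addl	$SIZEOF_entry_stack, %edi		/* one past the top of the entry stack */
	cmpl	%esp, %edi
	jbe	.Lend_\@				/* %esp at or above the top */

Like the xor/and variant, this triggers the stack copy only when %esp really is on the entry stack, instead of assuming that everything above the entry-stack base needs switching.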
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 61303fa..528db7d 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -154,25 +154,36 @@

 #endif /* CONFIG_X86_32_LAZY_GS */

-.macro SAVE_ALL pt_regs_ax=%eax
+.macro SAVE_ALL pt_regs_ax=%eax switch_stacks=0
 	cld
+	/* Push segment registers and %eax */
 	PUSH_GS
 	pushl	%fs
 	pushl	%es
 	pushl	%ds
 	pushl	\pt_regs_ax
+
+	/* Load kernel segments */
+	movl	$(__USER_DS), %eax
+	movl	%eax, %ds
+	movl	%eax, %es
+	movl	$(__KERNEL_PERCPU), %eax
+	movl	%eax, %fs
+	SET_KERNEL_GS %eax
+
+	/* Push integer registers and complete PT_REGS */
 	pushl	%ebp
 	pushl	%edi
 	pushl	%esi
 	pushl	%edx
 	pushl	%ecx
 	pushl	%ebx
-	movl	$(__USER_DS), %edx
-	movl	%edx, %ds
-	movl	%edx, %es
-	movl	$(__KERNEL_PERCPU), %edx
-	movl	%edx, %fs
-	SET_KERNEL_GS %edx
+
+	/* Switch to kernel stack if necessary */
+.if \switch_stacks > 0
+	SWITCH_TO_KERNEL_STACK
+.endif
+
 .endm

 /*
@@ -269,6 +280,72 @@
 .Lend_\@:
 #endif /* CONFIG_X86_ESPFIX32 */
 .endm
+
+
+/*
+ * Called with pt_regs fully populated and kernel segments loaded,
+ * so we can access PER_CPU and use the integer registers.
+ *
+ * We need to be very careful here with the %esp switch, because an NMI
+ * can happen everywhere. If the NMI handler finds itself on the
+ * entry-stack, it will overwrite the task-stack and everything we
+ * copied there. So allocate the stack-frame on the task-stack and
+ * switch to it before we do any copying.
+ */
+.macro SWITCH_TO_KERNEL_STACK
+
+	ALTERNATIVE     "", "jmp .Lend_\@", X86_FEATURE_XENPV
+
+	/* Are we on the entry stack? Bail out if not! */
+	movl	PER_CPU_VAR(cpu_entry_area), %edi
+	addl	$CPU_ENTRY_AREA_entry_stack, %edi
+	cmpl	%esp, %edi
+	jae	.Lend_\@
+
+	/* Load stack pointer into %esi and %edi */
+	movl	%esp, %esi
+	movl	%esi, %edi
+
+	/* Move %edi to the top of the entry stack */
+	andl	$(MASK_entry_stack), %edi
+	addl	$(SIZEOF_entry_stack), %edi
+
+	/* Load top of task-stack into %edi */
+	movl	TSS_entry_stack(%edi), %edi
+
+	/* Bytes to copy */
+	movl	$PTREGS_SIZE, %ecx
+
+#ifdef CONFIG_VM86
+	testl	$X86_EFLAGS_VM, PT_EFLAGS(%esi)
+	jz	.Lcopy_pt_regs_\@
+
+	/*
+	 * Stack-frame contains 4 additional segment registers when
+	 * coming from VM86 mode
+	 */
+	addl	$(4 * 4), %ecx
+
+.Lcopy_pt_regs_\@:
+#endif
+
+	/* Allocate frame on task-stack */
+	subl	%ecx, %edi
+
+	/* Switch to task-stack */
+	movl	%edi, %esp
+
+	/*
+	 * We are now on the task-stack and can safely copy over the
+	 * stack-frame
+	 */
+	shrl	$2, %ecx
+	cld
+	rep movsl
+
+.Lend_\@:
+.endm
+
 /*
  * %eax: prev task
  * %edx: next task
@@ -461,6 +538,7 @@ ENTRY(xen_sysenter_target)
  */
 ENTRY(entry_SYSENTER_32)
 	movl	TSS_entry_stack(%esp), %esp
+
 .Lsysenter_past_esp:
 	pushl	$__USER_DS		/* pt_regs->ss */
 	pushl	%ebp			/* pt_regs->sp (stashed in bp) */
@@ -469,7 +547,7 @@ ENTRY(entry_SYSENTER_32)
 	pushl	$__USER_CS		/* pt_regs->cs */
 	pushl	$0			/* pt_regs->ip = 0 (placeholder) */
 	pushl	%eax			/* pt_regs->orig_ax */
-	SAVE_ALL pt_regs_ax=$-ENOSYS	/* save rest */
+	SAVE_ALL pt_regs_ax=$-ENOSYS	/* save rest, stack already switched */

 	/*
 	 * SYSENTER doesn't filter flags, so we need to clear NT, AC
@@ -580,7 +658,8 @@ ENDPROC(entry_SYSENTER_32)
 ENTRY(entry_INT80_32)
 	ASM_CLAC
 	pushl	%eax			/* pt_regs->orig_ax */
-	SAVE_ALL pt_regs_ax=$-ENOSYS	/* save rest */
+
+	SAVE_ALL pt_regs_ax=$-ENOSYS switch_stacks=1	/* save rest */

 	/*
 	 * User mode is traced as though IRQs are on, and the interrupt gate
@@ -677,7 +756,8 @@ END(irq_entries_start)
 common_interrupt:
 	ASM_CLAC
 	addl	$-0x80, (%esp)		/* Adjust vector into the [-256, -1] range */
-	SAVE_ALL
+
+	SAVE_ALL switch_stacks=1
 	ENCODE_FRAME_POINTER
 	TRACE_IRQS_OFF
 	movl	%esp, %eax
@@ -685,16 +765,16 @@ common_interrupt:
 	jmp	ret_from_intr
 ENDPROC(common_interrupt)

-#define BUILD_INTERRUPT3(name, nr, fn)	\
-ENTRY(name)				\
-	ASM_CLAC;			\
-	pushl	$~(nr);			\
-	SAVE_ALL;			\
-	ENCODE_FRAME_POINTER;		\
-	TRACE_IRQS_OFF			\
-	movl	%esp, %eax;		\
-	call	fn;			\
-	jmp	ret_from_intr;		\
+#define BUILD_INTERRUPT3(name, nr, fn)		\
+ENTRY(name)					\
+	ASM_CLAC;				\
+	pushl	$~(nr);				\
+	SAVE_ALL switch_stacks=1;		\
+	ENCODE_FRAME_POINTER;			\
+	TRACE_IRQS_OFF				\
+	movl	%esp, %eax;			\
+	call	fn;				\
+	jmp	ret_from_intr;			\
 ENDPROC(name)

 #define BUILD_INTERRUPT(name, nr)		\
@@ -926,16 +1006,20 @@ common_exception:
 	pushl	%es
 	pushl	%ds
 	pushl	%eax
+	movl	$(__USER_DS), %eax
+	movl	%eax, %ds
+	movl	%eax, %es
+	movl	$(__KERNEL_PERCPU), %eax
+	movl	%eax, %fs
 	pushl	%ebp
 	pushl	%edi
 	pushl	%esi
 	pushl	%edx
 	pushl	%ecx
 	pushl	%ebx
+	SWITCH_TO_KERNEL_STACK
 	ENCODE_FRAME_POINTER
 	cld
-	movl	$(__KERNEL_PERCPU), %ecx
-	movl	%ecx, %fs
 	UNWIND_ESPFIX_STACK
 	GS_TO_REG %ecx
 	movl	PT_GS(%esp), %edi	# get the function address
@@ -943,9 +1027,6 @@ common_exception:
 	movl	$-1, PT_ORIG_EAX(%esp)	# no syscall to restart
 	REG_TO_PTGS %ecx
 	SET_KERNEL_GS %ecx
-	movl	$(__USER_DS), %ecx
-	movl	%ecx, %ds
-	movl	%ecx, %es
 	TRACE_IRQS_OFF
 	movl	%esp, %eax		# pt_regs pointer
 	CALL_NOSPEC %edi
@@ -964,6 +1045,7 @@ ENTRY(debug)
 	 */
 	ASM_CLAC
 	pushl	$-1			# mark this as an int
+
 	SAVE_ALL
 	ENCODE_FRAME_POINTER
 	xorl	%edx, %edx		# error code 0
@@ -999,6 +1081,7 @@ END(debug)
  */
ENTRY(nmi)
 	ASM_CLAC
+
#ifdef CONFIG_X86_ESPFIX32
 	pushl	%eax
 	movl	%ss, %eax
@@ -1066,7 +1149,8 @@ END(nmi)
ENTRY(int3)
 	ASM_CLAC
 	pushl	$-1			# mark this as an int
-	SAVE_ALL
+
+	SAVE_ALL switch_stacks=1
 	ENCODE_FRAME_POINTER
 	TRACE_IRQS_OFF
 	xorl	%edx, %edx		# zero error code
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index eb5f799..20e5f7ab 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -89,13 +89,9 @@ static inline void refresh_sysenter_cs(struct thread_struct *thread)
 /* This is used when switching tasks or entering/exiting vm86 mode. */
 static inline void update_sp0(struct task_struct *task)
 {
-	/* On x86_64, sp0 always points to the entry trampoline stack, which is constant: */
-#ifdef CONFIG_X86_32
-	load_sp0(task->thread.sp0);
-#else
+	/* sp0 always points to the entry trampoline stack, which is constant: */
 	if (static_cpu_has(X86_FEATURE_XENPV))
 		load_sp0(task_top_of_stack(task));
-#endif
 }

 #endif /* _ASM_X86_SWITCH_TO_H */
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index a1e1628..01de31d 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -103,6 +103,7 @@ void common(void) {
 	OFFSET(CPU_ENTRY_AREA_entry_trampoline, cpu_entry_area, entry_trampoline);
 	OFFSET(CPU_ENTRY_AREA_entry_stack, cpu_entry_area, entry_stack_page);
 	DEFINE(SIZEOF_entry_stack, sizeof(struct entry_stack));
+	DEFINE(MASK_entry_stack, (~(sizeof(struct entry_stack) - 1)));

 	/* Offset for sp0 and sp1 into the tss_struct */
 	OFFSET(TSS_sp0, tss_struct, x86_tss.sp0);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index eb4cb3e..43a927e 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1804,11 +1804,12 @@ void cpu_init(void)
 	enter_lazy_tlb(&init_mm, curr);

 	/*
-	 * Initialize the TSS.  Don't bother initializing sp0, as the initial
-	 * task never enters user mode.
+	 * Initialize the TSS.  sp0 points to the entry trampoline stack
+	 * regardless of what task is running.
 	 */
 	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 	load_TR_desc();
+	load_sp0((unsigned long)(cpu_entry_stack(cpu) + 1));

 	load_mm_ldt(&init_mm);

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 30ca2d1..c93fcfd 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -57,14 +57,12 @@ __visible DEFINE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss_rw) = {
 	 */
 	.sp0 = (1UL << (BITS_PER_LONG-1)) + 1,

-#ifdef CONFIG_X86_64
 	/*
 	 * .sp1 is cpu_current_top_of_stack.  The init task never
 	 * runs user code, but cpu_current_top_of_stack should still
 	 * be well defined before the first context switch.
 	 */
 	.sp1 = TOP_OF_INIT_STACK,
-#endif

 #ifdef CONFIG_X86_32
 	.ss0 = __KERNEL_DS,
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index ec62cc7..04bbf93 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -287,10 +287,12 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	 */
 	update_sp0(next_p);
 	refresh_sysenter_cs(next);
-	this_cpu_write(cpu_current_top_of_stack,
-		       (unsigned long)task_stack_page(next_p) +
-		       THREAD_SIZE);
-	/* SYSENTER reads the task-stack from tss.sp1 */
+	this_cpu_write(cpu_current_top_of_stack, task_top_of_stack(next_p));
+	/*
+	 * TODO: Find a way to let cpu_current_top_of_stack point to
+	 * cpu_tss_rw.x86_tss.sp1. Doing so now results in stack corruption with
+	 * iret exceptions.
+	 */
 	this_cpu_write(cpu_tss_rw.x86_tss.sp1, next_p->thread.sp0);

 	/*