Message ID | 8d36dd9b2430b61db64333af7b911d0bca7d5d2f.1468270393.git.luto@kernel.org |
---|---|
State | New, archived |
* Andy Lutomirski <luto@kernel.org> wrote:

> This allows x86_64 kernels to enable vmapped stacks. There are a
> couple of interesting bits.

> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -92,6 +92,7 @@ config X86
>  	select HAVE_ARCH_TRACEHOOK
>  	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
>  	select HAVE_EBPF_JIT			if X86_64
> +	select HAVE_ARCH_VMAP_STACK		if X86_64

So what is the performance impact? Because I think we should consider enabling
this feature by default on x86 - but the way it's selected here it will be
default-off.

On the plus side: the debuggability and reliability improvements are real and
making it harder for exploits to use kernel stack overflows is a nice bonus as
well. There's two performance effects:

 - vmalloc now potentially moves into the thread pool create/destroy hot path.

 - we use 4K TLBs for kernel stacks instead of 2MB TLBs.

The TLB effect should be relatively modest on modern CPUs, given that the
kernel stack size is limited and 4K TLBs are plenty.

The vmalloc() part should be measured I suspect.

Thanks,

	Ingo
On Wed, Jul 13, 2016 at 12:53 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@kernel.org> wrote:
>
>> This allows x86_64 kernels to enable vmapped stacks. There are a
>> couple of interesting bits.
>
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -92,6 +92,7 @@ config X86
>>  	select HAVE_ARCH_TRACEHOOK
>>  	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
>>  	select HAVE_EBPF_JIT			if X86_64
>> +	select HAVE_ARCH_VMAP_STACK		if X86_64
>
> So what is the performance impact?

Seems to be a very slight speedup (0.5 µs or so) on my silly benchmark
(pthread_create, pthread_join in a loop).  It should be a small slowdown
on workloads that create many threads all at once, thus defeating the
stack cache.  It should be a *large* speedup on any workload that would
trigger compaction on clone() to satisfy the high-order allocation.

> Because I think we should consider enabling this feature by default on x86 - but
> the way it's selected here it will be default-off.
>
> On the plus side: the debuggability and reliability improvements are real and
> making it harder for exploits to use kernel stack overflows is a nice bonus as
> well. There's two performance effects:

Agreed.  At the very least, I want to wait until after net-next gets
pulled to flip the default to y.  I'm also a bit concerned about more
random driver issues that I haven't found yet.  I suppose we could
flip the default to y for a few -rc releases and see what, if
anything, shakes loose.

--Andy
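For reference, a minimal sketch of the kind of microbenchmark described above
(pthread_create followed by pthread_join in a loop).  The iteration count, the
no-op thread body, and the CLOCK_MONOTONIC timing are illustrative choices,
not details taken from the thread:

#include <pthread.h>
#include <stdio.h>
#include <time.h>

static void *noop(void *arg)
{
	return arg;		/* thread does nothing; we only measure create+join */
}

int main(void)
{
	enum { ITERS = 100000 };	/* illustrative iteration count */
	struct timespec start, end;
	pthread_t tid;
	double ns;
	int i;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < ITERS; i++) {
		if (pthread_create(&tid, NULL, noop, NULL))
			return 1;
		pthread_join(tid, NULL);
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
	printf("%.2f us per create+join\n", ns / ITERS / 1000.0);
	return 0;
}

Built with something like "gcc -O2 bench.c -lpthread", comparing the reported
per-iteration time with the vmapped-stack config on and off gives the sort of
number quoted above.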
* Andy Lutomirski <luto@amacapital.net> wrote:

> On Wed, Jul 13, 2016 at 12:53 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > * Andy Lutomirski <luto@kernel.org> wrote:
> >
> >> This allows x86_64 kernels to enable vmapped stacks. There are a
> >> couple of interesting bits.
> >
> >> --- a/arch/x86/Kconfig
> >> +++ b/arch/x86/Kconfig
> >> @@ -92,6 +92,7 @@ config X86
> >>  	select HAVE_ARCH_TRACEHOOK
> >>  	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
> >>  	select HAVE_EBPF_JIT			if X86_64
> >> +	select HAVE_ARCH_VMAP_STACK		if X86_64
> >
> > So what is the performance impact?
>
> Seems to be a very slight speedup (0.5 µs or so) on my silly benchmark
> (pthread_create, pthread_join in a loop). [...]

Music to my ears - although TBH there's probably two opposing forces: advantages
from the cache versus (possibly very minor, if measurable at all) disadvantages
from the 4K granularity.

> [...] It should be a small slowdown on workloads that create many threads all
> at once, thus defeating the stack cache. It should be a *large* speedup on any
> workload that would trigger compaction on clone() to satisfy the high-order
> allocation.
>
> > Because I think we should consider enabling this feature by default on x86 - but
> > the way it's selected here it will be default-off.
> >
> > On the plus side: the debuggability and reliability improvements are real and
> > making it harder for exploits to use kernel stack overflows is a nice bonus as
> > well. There's two performance effects:
>
> Agreed.  At the very least, I want to wait until after net-next gets
> pulled to flip the default to y.  I'm also a bit concerned about more
> random driver issues that I haven't found yet.  I suppose we could
> flip the default to y for a few -rc releases and see what, if
> anything, shakes loose.

So I'd prefer the following approach: to apply it to a v4.8-rc1 base in ~2 weeks
and keep it default-y for much of the next development cycle. If no serious
problems are found in those ~2 months then send it to Linus in that fashion. We
can still turn it off by default (or re-spin the whole approach) if it turns out
to be too risky.

Exposing it as default-n for even a small amount of time will massively reduce
the testing we'll get, as most people will just use the N setting (often without
noticing).

Plus this also gives net-next and other preparatory patches applied directly to
maintainer trees time to trickle upstream.

Thanks,

	Ingo
On Thu, Jul 14, 2016 at 1:34 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@amacapital.net> wrote:
>
>> On Wed, Jul 13, 2016 at 12:53 AM, Ingo Molnar <mingo@kernel.org> wrote:
>> >
>> > * Andy Lutomirski <luto@kernel.org> wrote:
>> >
>> >> This allows x86_64 kernels to enable vmapped stacks. There are a
>> >> couple of interesting bits.
>> >
>> >> --- a/arch/x86/Kconfig
>> >> +++ b/arch/x86/Kconfig
>> >> @@ -92,6 +92,7 @@ config X86
>> >>  	select HAVE_ARCH_TRACEHOOK
>> >>  	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
>> >>  	select HAVE_EBPF_JIT			if X86_64
>> >> +	select HAVE_ARCH_VMAP_STACK		if X86_64
>> >
>> > So what is the performance impact?
>>
>> Seems to be a very slight speedup (0.5 µs or so) on my silly benchmark
>> (pthread_create, pthread_join in a loop). [...]
>
> Music to my ears - although TBH there's probably two opposing forces: advantages
> from the cache versus (possibly very minor, if measurable at all) disadvantages
> from the 4K granularity.

True.  It's also plausible that there will be different lock contention
issues on very large systems with vmalloc vs using the page allocator.

And there's one other issue.  The patchset will considerably increase
the frequency with which vmap gets called, which will make the giant
hole in Chris Metcalf's isolation series more apparent.  But that's not
really a regression -- the problem is pre-existing, and I've pointed it
out a few times before.  Arguably it's even a benefit -- someone will
have to *fix* it now :)

> So I'd prefer the following approach: to apply it to a v4.8-rc1 base in ~2 weeks
> and keep it default-y for much of the next development cycle. If no serious
> problems are found in those ~2 months then send it to Linus in that fashion. We
> can still turn it off by default (or re-spin the whole approach) if it turns out
> to be too risky.
>
> Exposing it as default-n for even a small amount of time will massively reduce the
> testing we'll get, as most people will just use the N setting (often without
> noticing).
>
> Plus this also gives net-next and other preparatory patches applied directly to
> maintainer trees time to trickle upstream.

Works for me.

If you think it makes sense, I can split out a bunch of the x86 and mm
preparatory patches and, if it reorders cleanly and still works, the
THREAD_INFO_IN_TASK thing.  Those could plausibly go in 4.7.  Would you
like me to see how that goes?  (I'll verify that it builds, that the
0day bot likes it, and that the final result of the patchset is exactly
identical to what I already sent.)

--Andy
* Andy Lutomirski <luto@amacapital.net> wrote:

> If you think it makes sense, I can split out a bunch of the x86 and mm
> preparatory patches and, if it reorders cleanly and still works, the
> THREAD_INFO_IN_TASK thing.  Those could plausibly go in 4.7.  Would
> you like me to see how that goes?  (I'll verify that it builds, that
> the 0day bot likes it, and that the final result of the patchset is
> exactly identical to what I already sent.)

Yeah, doing that would be useful - the smaller the riskier bits end up
being the better.

Thanks,

	Ingo
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d9a94da0c29f..afdcf96ef109 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -92,6 +92,7 @@ config X86
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_EBPF_JIT			if X86_64
+	select HAVE_ARCH_VMAP_STACK		if X86_64
 	select HAVE_CC_STACKPROTECTOR
 	select HAVE_CMPXCHG_DOUBLE
 	select HAVE_CMPXCHG_LOCAL
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 8f321a1b03a1..14e4b20f0aaf 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -8,6 +8,28 @@ struct tss_struct;
 void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
 		      struct tss_struct *tss);
 
+/* This runs on the previous thread's stack. */
+static inline void prepare_switch_to(struct task_struct *prev,
+				     struct task_struct *next)
+{
+#ifdef CONFIG_VMAP_STACK
+	/*
+	 * If we switch to a stack that has a top-level paging entry
+	 * that is not present in the current mm, the resulting #PF
+	 * will be promoted to a double-fault and we'll panic.  Probe
+	 * the new stack now so that vmalloc_fault can fix up the page
+	 * tables if needed.  This can only happen if we use a stack
+	 * in vmap space.
+	 *
+	 * We assume that the stack is aligned so that it never spans
+	 * more than one top-level paging entry.
+	 *
+	 * To minimize cache pollution, just follow the stack pointer.
+	 */
+	READ_ONCE(*(unsigned char *)next->thread.sp);
+#endif
+}
+
 #ifdef CONFIG_X86_32
 
 #ifdef CONFIG_CC_STACKPROTECTOR
@@ -39,6 +61,8 @@ do {									\
 	 */								\
 	unsigned long ebx, ecx, edx, esi, edi;				\
 									\
+	prepare_switch_to(prev, next);					\
+									\
 	asm volatile("pushl %%ebp\n\t"		/* save    EBP   */	\
 		     "movl %%esp,%[prev_sp]\n\t"  /* save    ESP   */	\
 		     "movl %[next_sp],%%esp\n\t"  /* restore ESP   */	\
@@ -103,7 +127,9 @@ do {									\
  * clean in kernel mode, with the possible exception of IOPL.  Kernel IOPL
  * has no effect.
  */
-#define switch_to(prev, next, last) \
+#define switch_to(prev, next, last)					  \
+	prepare_switch_to(prev, next);					  \
+									  \
 	asm volatile(SAVE_CONTEXT					  \
 	     "movq %%rsp,%P[threadrsp](%[prev])\n\t" /* save RSP */	  \
 	     "movq %P[threadrsp](%[next]),%%rsp\n\t" /* restore RSP */	  \
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 00f03d82e69a..f0345284c30b 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -292,12 +292,30 @@ DO_ERROR(X86_TRAP_NP,     SIGBUS,  "segment not present",	segment_not_present)
 DO_ERROR(X86_TRAP_SS,     SIGBUS,  "stack segment",		stack_segment)
 DO_ERROR(X86_TRAP_AC,     SIGBUS,  "alignment check",		alignment_check)
 
+#ifdef CONFIG_VMAP_STACK
+static void __noreturn handle_stack_overflow(const char *message,
+					     struct pt_regs *regs,
+					     unsigned long fault_address)
+{
+	printk(KERN_EMERG "BUG: stack guard page was hit at %p (stack is %p..%p)\n",
+		 (void *)fault_address, current->stack,
+		 (char *)current->stack + THREAD_SIZE - 1);
+	die(message, regs, 0);
+
+	/* Be absolutely certain we don't return. */
+	panic(message);
+}
+#endif
+
 #ifdef CONFIG_X86_64
 /* Runs on IST stack */
 dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 {
 	static const char str[] = "double fault";
 	struct task_struct *tsk = current;
+#ifdef CONFIG_VMAP_STACK
+	unsigned long cr2;
+#endif
 
 #ifdef CONFIG_X86_ESPFIX64
 	extern unsigned char native_irq_return_iret[];
@@ -332,6 +350,50 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 	tsk->thread.error_code = error_code;
 	tsk->thread.trap_nr = X86_TRAP_DF;
 
+#ifdef CONFIG_VMAP_STACK
+	/*
+	 * If we overflow the stack into a guard page, the CPU will fail
+	 * to deliver #PF and will send #DF instead.  Similarly, if we
+	 * take any non-IST exception while too close to the bottom of
+	 * the stack, the processor will get a page fault while
+	 * delivering the exception and will generate a double fault.
+	 *
+	 * According to the SDM (footnote in 6.15 under "Interrupt 14 -
+	 * Page-Fault Exception (#PF)"):
+	 *
+	 *   Processors update CR2 whenever a page fault is detected. If a
+	 *   second page fault occurs while an earlier page fault is being
+	 *   delivered, the faulting linear address of the second fault will
+	 *   overwrite the contents of CR2 (replacing the previous
+	 *   address). These updates to CR2 occur even if the page fault
+	 *   results in a double fault or occurs during the delivery of a
+	 *   double fault.
+	 *
+	 * The logic below has a small possibility of incorrectly diagnosing
+	 * some errors as stack overflows.  For example, if the IDT or GDT
+	 * gets corrupted such that #GP delivery fails due to a bad descriptor
+	 * causing #GP and we hit this condition while CR2 coincidentally
+	 * points to the stack guard page, we'll think we overflowed the
+	 * stack.  Given that we're going to panic one way or another
+	 * if this happens, this isn't necessarily worth fixing.
+	 *
+	 * If necessary, we could improve the test by only diagnosing
+	 * a stack overflow if the saved RSP points within 47 bytes of
+	 * the bottom of the stack: if RSP == tsk_stack + 48 and we
+	 * take an exception, the stack is already aligned and there
+	 * will be enough room for SS, RSP, RFLAGS, CS, RIP, and a
+	 * possible error code, so a stack overflow would *not* double
+	 * fault.  With any less space left, exception delivery could
+	 * fail, and, as a practical matter, we've overflowed the
+	 * stack even if the actual trigger for the double fault was
+	 * something else.
+	 */
+	cr2 = read_cr2();
+	if ((unsigned long)task_stack_page(tsk) - 1 - cr2 < PAGE_SIZE)
+		handle_stack_overflow(
+			"kernel stack overflow (double-fault)", regs, cr2);
+#endif
+
 #ifdef CONFIG_DOUBLEFAULT
 	df_debug(regs, error_code);
 #endif
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5643fd0b1a7d..fbf036ae72ac 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -77,10 +77,25 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	unsigned cpu = smp_processor_id();
 
 	if (likely(prev != next)) {
+		if (IS_ENABLED(CONFIG_VMAP_STACK)) {
+			/*
+			 * If our current stack is in vmalloc space and isn't
+			 * mapped in the new pgd, we'll double-fault.  Forcibly
+			 * map it.
+			 */
+			unsigned int stack_pgd_index = pgd_index(current_stack_pointer());
+
+			pgd_t *pgd = next->pgd + stack_pgd_index;
+
+			if (unlikely(pgd_none(*pgd)))
+				set_pgd(pgd, init_mm.pgd[stack_pgd_index]);
+		}
+
 #ifdef CONFIG_SMP
 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 		this_cpu_write(cpu_tlbstate.active_mm, next);
 #endif
+
 		cpumask_set_cpu(cpu, mm_cpumask(next));
 
 		/*
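The overflow test in do_double_fault() above relies on unsigned wraparound:
task_stack_page() is the lowest address of the stack, so the subtraction is
small only when CR2 falls in the single guard page directly below it.  A tiny
userspace sketch of just that arithmetic, with made-up addresses (not kernel
code, only an illustration of the check):

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SIZE 4096UL

/* Same shape as the test in do_double_fault(): true iff fault_addr lies in
 * the one page immediately below stack_base.  If fault_addr >= stack_base,
 * the unsigned subtraction wraps to a huge value and the test fails. */
static bool in_guard_page(unsigned long stack_base, unsigned long fault_addr)
{
	return stack_base - 1 - fault_addr < PAGE_SIZE;
}

int main(void)
{
	unsigned long base = 0xffffc90000004000UL;	/* hypothetical vmalloc'd stack base */

	printf("%d\n", in_guard_page(base, base - 8));			/* 1: 8 bytes into the guard page */
	printf("%d\n", in_guard_page(base, base - PAGE_SIZE));		/* 1: first byte of the guard page */
	printf("%d\n", in_guard_page(base, base - PAGE_SIZE - 1));	/* 0: below the guard page */
	printf("%d\n", in_guard_page(base, base + 64));			/* 0: inside the stack itself */
	return 0;
}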
This allows x86_64 kernels to enable vmapped stacks.  There are a
couple of interesting bits.

First, x86 lazily faults in top-level paging entries for the vmalloc
area.  This won't work if we get a page fault while trying to access
the stack: the CPU will promote it to a double-fault and we'll die.
To avoid this problem, probe the new stack when switching stacks and
forcibly populate the pgd entry for the stack when switching mms.

Second, once we have guard pages around the stack, we'll want to
detect and handle stack overflow.

I didn't enable it on x86_32.  We'd need to rework the double-fault
code a bit and I'm concerned about running out of vmalloc virtual
addresses under some workloads.

This patch, by itself, will behave somewhat erratically when the
stack overflows while RSP is still more than a few tens of bytes
above the bottom of the stack.  Specifically, we'll get #PF and make
it to no_context and then oops without reliably triggering a
double-fault, and no_context doesn't know about stack overflows.
The next patch will improve that case.

Thank you to Nadav and Brian for helping me pay enough attention to
the SDM to hopefully get this right.

Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/Kconfig                 |  1 +
 arch/x86/include/asm/switch_to.h | 28 +++++++++++++++++-
 arch/x86/kernel/traps.c          | 62 ++++++++++++++++++++++++++++++++++++++++
 arch/x86/mm/tlb.c                | 15 ++++++++++
 4 files changed, 105 insertions(+), 1 deletion(-)
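As background for the "forcibly populate the pgd entry" step above: with
4-level paging on x86_64, each top-level (PGD) entry covers a 512 GiB slice of
the address space, so pgd_index() is just bits 39..47 of the address.  A
process pgd only carries the kernel entries that existed when it was created;
vmalloc-area entries added later are normally synced lazily via
vmalloc_fault(), which can't run when the faulting access is the stack itself,
hence the set_pgd() in switch_mm_irqs_off().  A standalone sketch of the index
calculation, using made-up example addresses inside the x86_64 vmalloc range:

#include <stdio.h>

#define PGDIR_SHIFT	39
#define PTRS_PER_PGD	512

/* Mirrors the kernel's pgd_index() for 4-level x86_64 paging. */
static unsigned long pgd_index(unsigned long address)
{
	return (address >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1);
}

int main(void)
{
	/* Two hypothetical vmalloc'd stacks, 512 GiB apart. */
	unsigned long stack_a = 0xffffc90000004000UL;
	unsigned long stack_b = 0xffffc98000004000UL;

	printf("stack_a uses pgd slot %lu\n", pgd_index(stack_a));
	printf("stack_b uses pgd slot %lu\n", pgd_index(stack_b));
	return 0;
}

The two stacks land in adjacent PGD slots.  That is exactly the situation the
tlb.c hunk guards against: if the incoming mm was created before the second
slot was populated in init_mm, switching onto a stack under that slot would
touch an unmapped top-level entry and double-fault.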