Message ID: 20180123075358.nztpyxympwfkyi2a@gmail.com (mailing list archive)
State: New, archived
* Ingo Molnar <mingo@kernel.org> wrote:

> Is there a testcase for the SkyLake 16-deep-call-stack problem that I could run?
> Is there a description of the exact speculative execution vulnerability that has
> to be addressed to begin with?

Ok, so for now I'm assuming that this is the 16-entry return-stack-buffer underflow
condition where SkyLake falls back to the branch predictor (while other CPUs wrap
the buffer).

> If this approach is workable I'd much prefer it to any MSR writes in the syscall
> entry path not just because it's fast enough in practice to not be turned off by
> everyone, but also because everyone would agree that per function call overhead
> needs to go away on new CPUs. Both deployment and backporting is also _much_ more
> flexible, simpler, faster and more complete than microcode/firmware or compiler
> based solutions.
>
> Assuming the vulnerability can be addressed via this route that is, which is a big
> assumption!

So I talked this over with PeterZ, and I think it's all doable:

 - the CALL __fentry__ callbacks maintain the depth tracking (on the kernel
   stack, fast to access), and issue an "RSB-stuffing sequence" when depth
   reaches 16 entries.

 - "the RSB-stuffing sequence" is a return trampoline that pushes a CALL on the
   stack which is executed on the RET.

 - All asynchronous contexts (IRQs, NMIs, etc.) stuff the RSB before IRET. (The
   tracking could probably be made IRQ and maybe even NMI safe, but the
   worst-case nesting scenarios make my head ache.)

I.e. IBRS can be mostly replaced with a kernel based solution that is better than
IBRS and which does not negatively impact any other non-SkyLake CPUs or general
code quality.

I.e. a full upstream Spectre solution.

Thanks,

	Ingo
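For context, a minimal sketch of what such an "RSB-stuffing sequence" can look
like. This is an illustrative assumption, not the kernel's actual macro: the
16-iteration count matches the Skylake RSB size discussed here, each CALL pushes
one RSB entry, the pause/lfence loop is a speculation trap in case the CPU ever
consumes one of the bogus entries, and the final LEA drops the 16 return
addresses from the real stack.

	.rept 16			/* one CALL per RSB entry to refill */
	call	2f			/* pushes an RSB entry + a stack slot */
1:	pause				/* speculation trap: if the CPU ever */
	lfence				/* "returns" here speculatively, spin */
	jmp	1b
2:
	.endr
	lea	16*8(%rsp), %rsp	/* drop the 16 return addresses pushed above */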
On Tue, 2018-01-23 at 08:53 +0100, Ingo Molnar wrote:
>
> The patch below demonstrates the principle, it forcibly enables dynamic ftrace
> patching (CONFIG_DYNAMIC_FTRACE=y et al) and turns mcount/__fentry__ into a RET:
>
>   ffffffff81a01a40 <__fentry__>:
>   ffffffff81a01a40:       c3                      retq
>
> This would have to be extended with (very simple) call stack depth tracking (just
> 3 more instructions would do in the fast path I believe) and a suitable SkyLake
> workaround (and also has to play nice with the ftrace callbacks).
>
> On non-SkyLake the overhead would be 0 cycles.

The overhead of forcing CONFIG_DYNAMIC_FTRACE=y is precisely zero cycles? That
seems a little optimistic. ;) I'll grant you that if it goes straight to a 'ret'
it isn't *that* high, though.

> On SkyLake this would add an overhead of maybe 2-3 cycles per function call and
> obviously all this code and data would be very cache hot. Given that the average
> number of function calls per system call is around a dozen, this would be _much_
> faster than any microcode/MSR based approach.

That's kind of neat, except you don't want it at the top of the function; you
want it at the bottom.

If you could hijack the *return* site, then you could check for underflow and
stuff the RSB right there. But in __fentry__ there's not a lot you can do other
than complain that something bad is going to happen in the future. You know that
a string of 16+ rets is going to happen, but you've got no gadget in *there* to
deal with it when it does.

HJ did have patches to turn 'ret' into a form of retpoline, which I don't think
ever even got performance-tested. They'd have forced a mispredict on *every* ret.

A cheaper option might be to turn 'ret' into a 'jmp skylake_ret_hack', which on
pre-SKL will be a bare 'ret', while SKL+ can do the counting (in conjunction with
a 'per_cpu(call_depth)++' in __fentry__) and stuff the RSB before actually
returning, when appropriate.

By the time you've made it work properly, I suspect we're approaching the
barf-factor of IBRS, for a less complete solution.

> Is there a testcase for the SkyLake 16-deep-call-stack problem that I could run?

Andi's been experimenting at
https://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc.git/log/?h=spec/deep-chain-3

> Is there a description of the exact speculative execution vulnerability that has
> to be addressed to begin with?

"It takes predictions from the generic branch target buffer when the RSB
underflows."

IBRS filters what can come from the BTB, and resolves the problem that way.
Retpoline avoids the indirect branches that on *earlier* CPUs were the only
things that would use the offending predictions. But on SKL, now 'ret' is one of
the problematic instructions too. Fun! :)

> If this approach is workable I'd much prefer it to any MSR writes in the syscall
> entry path not just because it's fast enough in practice to not be turned off by
> everyone, but also because everyone would agree that per function call overhead
> needs to go away on new CPUs. Both deployment and backporting is also _much_ more
> flexible, simpler, faster and more complete than microcode/firmware or compiler
> based solutions.
>
> Assuming the vulnerability can be addressed via this route that is, which is a big
> assumption!

I think it's close. There are some other cases which empty the RSB, like sleeping
and loading microcode, which can happily be special-cased.

Andi's rounded up many of the remaining details already at
https://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc.git/log/?h=spec/skl-rsb-3

And there's SMI, which is a pain, but I think Linus is right that we can possibly
just stick our fingers in our ears and pretend we didn't hear about that one, as
it's likely to be hard to trigger (famous last words).

On the whole, though, I think you can see why we're keeping IBRS around for now,
sent out purely as an RFC and rebased on top of the stuff we're *actually*
sending to Linus for inclusion. When we have a clear idea of what we're doing for
Skylake, it'll be useful to have a proper comparison of the security, the
performance and the "ick" factor of whatever we come up with, vs. IBRS.

Right now the plan is just "screw Skylake"; we'll just forget it's a special
snowflake and treat it like everything else, except for a bit of extra
RSB-stuffing on context switch (since we had to add that for !SMEP anyway). And
that's not *entirely* unreasonable, but as I said I'd *really* like to have a
decent analysis of the implications of that, not just some hand-wavy "nah, it'll
be fine".
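To make the 'jmp skylake_ret_hack' idea above concrete, here is one possible
shape of such a return-side stub. Everything in it is an assumption for
illustration only: the symbol name comes from the mail above, the underflow test
is deliberately simplistic, and STUFF_RSB stands for an RSB-stuffing sequence
like the one sketched earlier in the thread. On pre-SKL the whole thing would be
patched down to a bare 'ret'.

	/* Hypothetical sketch -- not real kernel code.  Assumes __fentry__
	 * does 'incl PER_CPU_VAR(call_depth)' on every traced call. */
skylake_ret_hack:
	decl	PER_CPU_VAR(call_depth)
	jns	1f			/* tracked depth still >= 0: RSB has entries left */
	movl	$0, PER_CPU_VAR(call_depth)
	STUFF_RSB			/* placeholder: refill the RSB before returning */
1:
	ret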
On Tue, 2018-01-23 at 10:27 +0100, Ingo Molnar wrote:
> * Ingo Molnar <mingo@kernel.org> wrote:
> >
> > Is there a testcase for the SkyLake 16-deep-call-stack problem that I could run?
> > Is there a description of the exact speculative execution vulnerability that has
> > to be addressed to begin with?
>
> Ok, so for now I'm assuming that this is the 16-entry return-stack-buffer
> underflow condition where SkyLake falls back to the branch predictor (while other
> CPUs wrap the buffer).

Yep.

> > If this approach is workable I'd much prefer it to any MSR writes in the syscall
> > entry path not just because it's fast enough in practice to not be turned off by
> > everyone, but also because everyone would agree that per function call overhead
> > needs to go away on new CPUs. Both deployment and backporting is also _much_ more
> > flexible, simpler, faster and more complete than microcode/firmware or compiler
> > based solutions.
> >
> > Assuming the vulnerability can be addressed via this route that is, which is a big
> > assumption!
>
> So I talked this over with PeterZ, and I think it's all doable:
>
>  - the CALL __fentry__ callbacks maintain the depth tracking (on the kernel
>    stack, fast to access), and issue an "RSB-stuffing sequence" when depth
>    reaches 16 entries.
>
>  - "the RSB-stuffing sequence" is a return trampoline that pushes a CALL on the
>    stack which is executed on the RET.

That's neat. We'll want to make sure the unwinder can cope, but hey, Peter
*loves* hacking objtool, right? :)

>  - All asynchronous contexts (IRQs, NMIs, etc.) stuff the RSB before IRET. (The
>    tracking could probably be made IRQ and maybe even NMI safe, but the
>    worst-case nesting scenarios make my head ache.)
>
> I.e. IBRS can be mostly replaced with a kernel based solution that is better than
> IBRS and which does not negatively impact any other non-SkyLake CPUs or general
> code quality.
>
> I.e. a full upstream Spectre solution.

Sounds good. I look forward to seeing it.

In the meantime I'll resend the basic bits of the feature detection, and
especially turning off KPTI when RDCL_NO is set.

We do also want to do IBPB even with retpoline, so I'll send those patches for
KVM and context switch. There is some bikeshedding to be done there about the
precise conditions under which we do it.

Finally, KVM should be *exposing* IBRS to guests even if we don't use it
ourselves. We'll do that too.
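For context, the "turning off KPTI when RDCL_NO is set" check boils down to
reading one architectural MSR. A minimal sketch is below; the 'skip_kpti' label
is hypothetical, and in practice the MSR may only be read once CPUID (leaf 7)
has advertised ARCH_CAPABILITIES support, otherwise the RDMSR faults.

	/* Sketch: skip KPTI if the CPU declares itself not vulnerable to
	 * Meltdown.  IA32_ARCH_CAPABILITIES is MSR 0x10a; bit 0 is RDCL_NO. */
	movl	$0x10a, %ecx		/* IA32_ARCH_CAPABILITIES */
	rdmsr
	testl	$1, %eax		/* RDCL_NO set? */
	jnz	skip_kpti		/* hypothetical label */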
* David Woodhouse <dwmw2@infradead.org> wrote:

> On Tue, 2018-01-23 at 08:53 +0100, Ingo Molnar wrote:
> >
> > The patch below demonstrates the principle, it forcibly enables dynamic ftrace
> > patching (CONFIG_DYNAMIC_FTRACE=y et al) and turns mcount/__fentry__ into a RET:
> >
> >   ffffffff81a01a40 <__fentry__>:
> >   ffffffff81a01a40:       c3                      retq
> >
> > This would have to be extended with (very simple) call stack depth tracking (just
> > 3 more instructions would do in the fast path I believe) and a suitable SkyLake
> > workaround (and also has to play nice with the ftrace callbacks).
> >
> > On non-SkyLake the overhead would be 0 cycles.
>
> The overhead of forcing CONFIG_DYNAMIC_FTRACE=y is precisely zero
> cycles? That seems a little optimistic. ;)

The overhead of the quick hack patch I sent to show what exact code I mean is
obviously not zero.

The overhead of using my proposed solution, to utilize the function call callback
that CONFIG_DYNAMIC_FTRACE=y provides, is exactly zero on non-SkyLake systems
where the callback is patched out, on typical Linux distros.

The callback is widely enabled on distro kernels:

   Fedora:                     CONFIG_DYNAMIC_FTRACE=y
   Ubuntu:                     CONFIG_DYNAMIC_FTRACE=y
   OpenSuse (default flavor):  CONFIG_DYNAMIC_FTRACE=y

BTW., the reason this is enabled on all distro kernels is because the overhead is
a single patched-in NOP instruction in the function epilogue, when tracing is
disabled. So it's not even a CALL+RET - it's a patched in NOP.

Thanks,

	Ingo
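To illustrate what that patched-in NOP means at the instruction level, a sketch
follows; the function name is made up, and the exact 5-byte NOP encoding ftrace
picks is CPU-dependent (as Thomas notes further down the thread).

	/* As built with gcc -pg -mfentry (tracing-capable): */
some_kernel_function:
	call	__fentry__		/* e8 xx xx xx xx -- a 5-byte call */
	/* ... function body ... */

	/* After ftrace patches the site out (tracing disabled), the same
	 * 5 bytes are a single NOP, e.g.: */
some_kernel_function:
	nopl	0x0(%rax,%rax,1)	/* 0f 1f 44 00 00 -- one 5-byte NOP */
	/* ... function body ... */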
* David Woodhouse <dwmw2@infradead.org> wrote:

> > On SkyLake this would add an overhead of maybe 2-3 cycles per function call and
> > obviously all this code and data would be very cache hot. Given that the average
> > number of function calls per system call is around a dozen, this would be _much_
> > faster than any microcode/MSR based approach.
>
> That's kind of neat, except you don't want it at the top of the
> function; you want it at the bottom.
>
> If you could hijack the *return* site, then you could check for
> underflow and stuff the RSB right there. But in __fentry__ there's not
> a lot you can do other than complain that something bad is going to
> happen in the future. You know that a string of 16+ rets is going to
> happen, but you've got no gadget in *there* to deal with it when it
> does.

No, it can be done with the existing CALL instrumentation callback that
CONFIG_DYNAMIC_FTRACE=y provides, by pushing a RET trampoline on the stack from
the CALL trampoline - see my previous email.

> HJ did have patches to turn 'ret' into a form of retpoline, which I
> don't think ever even got performance-tested.

Return instrumentation is possible as well, but there are two major drawbacks:

 - GCC support for it is not as widely available, and return instrumentation is
   less tested in Linux kernel contexts.

 - A major point of my suggestion is that CONFIG_DYNAMIC_FTRACE=y is already
   enabled in distros here and today, so the runtime overhead to non-SkyLake
   CPUs would be literally zero, while still allowing the RSB vulnerability to
   be fixed on SkyLake.

Thanks,

	Ingo
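A rough sketch of the mechanism being described - redirecting the *parent*
return through a stuffing trampoline from inside __fentry__ - is below. It is
conceptual only: with -mfentry the call is the first instruction of the
function, so at this point 8(%rsp) holds the return address back to the caller;
the 16-call threshold and the trampoline symbol are assumptions, and where the
original return address gets parked (e.g. a per-thread shadow stack, as Thomas
discusses later in the thread) is the hard part and is deliberately left out.

__fentry__:
	incl	PER_CPU_VAR(call_depth)
	cmpl	$16, PER_CPU_VAR(call_depth)
	jb	1f			/* fast path: only 3 added instructions */
	/* Slow path sketch: park the original value of 8(%rsp) somewhere
	 * retrievable, then point the parent return at a trampoline that
	 * stuffs the RSB and jumps to the saved address. */
	pushq	%rax
	leaq	rsb_refill_trampoline(%rip), %rax	/* hypothetical symbol */
	movq	%rax, 16(%rsp)		/* parent return addr, one slot deeper now */
	popq	%rax
1:
	retq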
On Tue, 2018-01-23 at 11:15 +0100, Ingo Molnar wrote:
>
> BTW., the reason this is enabled on all distro kernels is because the overhead is
> a single patched-in NOP instruction in the function epilogue, when tracing is
> disabled. So it's not even a CALL+RET - it's a patched in NOP.

Hm? We still have GCC emitting 'call __fentry__' don't we? Would be nice to get
to the point where we can patch *that* out into a NOP... or are you saying we
already can?

But this is a digression. I was being pedantic about the "0 cycles" but sure,
this would be perfectly tolerable.
* David Woodhouse <dwmw2@infradead.org> wrote:

> On Tue, 2018-01-23 at 11:15 +0100, Ingo Molnar wrote:
> >
> > BTW., the reason this is enabled on all distro kernels is because the overhead
> > is a single patched-in NOP instruction in the function epilogue, when tracing
> > is disabled. So it's not even a CALL+RET - it's a patched in NOP.
>
> Hm? We still have GCC emitting 'call __fentry__' don't we? Would be nice to get
> to the point where we can patch *that* out into a NOP... or are you saying we
> already can?

Yes, we already can and do patch the 'call __fentry__/mcount' call site into a
NOP today - all 50,000+ call sites on a typical distro kernel. We have done so
for a long time - this is all a well established, working mechanism.

> But this is a digression. I was being pedantic about the "0 cycles" but sure,
> this would be perfectly tolerable.

It's not a digression, in two ways:

 - I wanted to make it clear that for distro kernels it _is_ a zero cycles
   overhead mechanism for non-SkyLake CPUs, literally.

 - I noticed that Meltdown and the CR3 writes for PTI appear to have established
   a kind of ... insensitivity and numbness to kernel micro-costs, which peaked
   with the per-syscall MSR write nonsense patch of the SkyLake workaround. That
   attitude is totally unacceptable to me as x86 maintainer, and yes, every
   cycle still counts.

Thanks,

	Ingo
On 01/23/2018 01:27 AM, Ingo Molnar wrote:
>
>  - All asynchronous contexts (IRQs, NMIs, etc.) stuff the RSB before IRET. (The
>    tracking could probably be made IRQ and maybe even NMI safe, but the
>    worst-case nesting scenarios make my head ache.)

This all sounds totally workable to me. We talked about using ftrace itself to
track call depth, but it would be unusable in production, of course. This seems
workable, though.

You're also totally right about the zero overhead on most kernels with it turned
off when we don't need RSB underflow protection (basically pre-Skylake).

I also agree that the safe thing to do is to just stuff before iret. I bet we
can get an ftrace-driven RSB tracker working precisely enough even with NMIs,
but it's way simpler to just stuff and be done with it for now.
On Tue, 23 Jan 2018, Ingo Molnar wrote:
> * David Woodhouse <dwmw2@infradead.org> wrote:
> > > On SkyLake this would add an overhead of maybe 2-3 cycles per function call and
> > > obviously all this code and data would be very cache hot. Given that the average
> > > number of function calls per system call is around a dozen, this would be _much_
> > > faster than any microcode/MSR based approach.
> >
> > That's kind of neat, except you don't want it at the top of the
> > function; you want it at the bottom.
> >
> > If you could hijack the *return* site, then you could check for
> > underflow and stuff the RSB right there. But in __fentry__ there's not
> > a lot you can do other than complain that something bad is going to
> > happen in the future. You know that a string of 16+ rets is going to
> > happen, but you've got no gadget in *there* to deal with it when it
> > does.
>
> No, it can be done with the existing CALL instrumentation callback that
> CONFIG_DYNAMIC_FTRACE=y provides, by pushing a RET trampoline on the stack from
> the CALL trampoline - see my previous email.
>
> > HJ did have patches to turn 'ret' into a form of retpoline, which I
> > don't think ever even got performance-tested.
>
> Return instrumentation is possible as well, but there are two major drawbacks:
>
>  - GCC support for it is not as widely available, and return instrumentation is
>    less tested in Linux kernel contexts.
>
>  - A major point of my suggestion is that CONFIG_DYNAMIC_FTRACE=y is already
>    enabled in distros here and today, so the runtime overhead to non-SkyLake
>    CPUs would be literally zero, while still allowing the RSB vulnerability to
>    be fixed on SkyLake.

I played around with that a bit during the week and it turns out to be less
simple than you thought.

1) Injecting a trampoline return only works for functions which have all
   arguments in registers. For functions with arguments on the stack, like all
   varargs functions, this breaks because the function won't find its arguments
   anymore.

   I have not yet found a way to figure out reliably which functions have
   arguments on the stack. It might be an option to simply ignore them.

   The workaround is to replace the original return address on the stack with
   the trampoline and store the original return address in a per-thread stack,
   which I implemented. But this sucks performance-wise badly.

2) Doing the whole dance on function entry has a real downside because you
   refill the RSB on every 15th return no matter whether it's required or not.
   That really gives a very prominent performance hit.

An alternative idea is to do the following (not yet implemented):

__fentry__:
	incl	PER_CPU_VAR(call_depth)
	retq

and use -mfunction-return=thunk-extern, which is available on retpoline enabled
compilers. That's a reasonable requirement because without retpoline the whole
SKL magic is pointless anyway.

-mfunction-return=thunk-extern issues

	jmp	__x86_return_thunk

instead of 'ret'. In the thunk we can do the whole shebang of mitigation. That
jump can be identified at build time and it can be patched into a 'ret' for
unaffected CPUs.

Ideally we do the patching at build time and only patch the jump in when SKL is
detected or paranoia requests it.

We could actually look into that for tracing as well. The only reason why we
don't do that is to select the ideal NOP for the CPU the kernel runs on, which
obviously cannot be known at build time.

__x86_return_thunk would look like this:

__x86_return_thunk:
	testl	$0xf, PER_CPU_VAR(call_depth)
	jnz	1f
	stuff_rsb
1:
	decl	PER_CPU_VAR(call_depth)
	ret

The call_depth variable would be reset on context switch.

Though that has another problem: tail calls. Tail calls will invoke the
__fentry__ call of the tail-called function, which makes the call_depth counter
unbalanced. Tail calls can be prevented by using -fno-optimize-sibling-calls,
but that probably sucks as well.

Yet another possibility is to avoid the function entry and accounting magic and
use the generic gcc return thunk:

__x86_return_thunk:
	call	L2
L1:
	pause
	lfence
	jmp	L1
L2:
	lea	8(%rsp), %rsp|lea 4(%esp), %esp
	ret

which basically refills the RSB on every return. That can be inline or extern,
but in both cases we should be able to patch it out.

I have no idea how that affects performance, but it might be worthwhile to
experiment with that.

If nobody beats me to it, I'll play around with that some more after vacation.

Thanks,

	tglx
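For the "patch the jump into a ret for unaffected CPUs" part, a minimal sketch
of how hand-written assembly return sites could use the existing ALTERNATIVE
machinery is below. The feature bit name is made up for illustration;
compiler-emitted 'jmp __x86_return_thunk' sites would instead need the
build-time annotation and boot-time rewriting described above, since ALTERNATIVE
only covers sites we write ourselves. The context-switch reset is likewise only
sketched.

	/* Hypothetical: return via a macro so the thunk can be patched back
	 * to a bare 'ret' on CPUs that don't need the mitigation.
	 * X86_FEATURE_NO_RSB_UNDERFLOW is a made-up feature bit. */
	.macro RET_MAYBE_THUNKED
	ALTERNATIVE "jmp __x86_return_thunk", "ret", X86_FEATURE_NO_RSB_UNDERFLOW
	.endm

	/* And the reset-on-context-switch part, e.g. in the switch path: */
	movl	$0, PER_CPU_VAR(call_depth)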
On Sun, 2018-02-04 at 19:43 +0100, Thomas Gleixner wrote:
>
> __x86_return_thunk would look like this:
>
> __x86_return_thunk:
>	testl	$0xf, PER_CPU_VAR(call_depth)
>	jnz	1f
>	stuff_rsb
> 1:
>	decl	PER_CPU_VAR(call_depth)
>	ret
>
> The call_depth variable would be reset on context switch.

Note that the 'jnz' can be predicted taken there, allowing the CPU to speculate
all the way to the 'ret'... and beyond.
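One possible way to illustrate the trade-off behind this concern: a speculation
barrier on the path to the 'ret' would keep the CPU from running ahead of the
unresolved 'jnz', but at the cost of an LFENCE on every return, which may defeat
the point of the counting scheme in the first place. This is a hedged sketch
only, reusing the placeholder names from Thomas's mail, not a proposal from the
thread.

__x86_return_thunk:
	testl	$0xf, PER_CPU_VAR(call_depth)
	jnz	1f
	stuff_rsb			/* placeholder from the sketch above */
1:
	lfence				/* don't execute the ret speculatively
					 * before the jnz has resolved */
	decl	PER_CPU_VAR(call_depth)
	ret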
On Sun, 2018-02-04 at 19:43 +0100, Thomas Gleixner wrote:
> Yet another possibility is to avoid the function entry and accounting magic
> and use the generic gcc return thunk:
>
> __x86_return_thunk:
>	call	L2
> L1:
>	pause
>	lfence
>	jmp	L1
> L2:
>	lea	8(%rsp), %rsp|lea 4(%esp), %esp
>	ret
>
> which basically refills the RSB on every return. That can be inline or
> extern, but in both cases we should be able to patch it out.
>
> I have no idea how that affects performance, but it might be worthwhile to
> experiment with that.

That was what I had in mind when I asked HJ to add -mfunction-return.

I suspect the performance hit would be significant because it would cause a
prediction miss on *every* return.

But as I said, let's implement what we can without IBRS for Skylake, then we can
compare the two options for performance, security coverage and general
fugliness.
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 423e4b64e683..df471538a79c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -133,6 +133,8 @@ config X86
 	select HAVE_DMA_CONTIGUOUS
 	select HAVE_DYNAMIC_FTRACE
 	select HAVE_DYNAMIC_FTRACE_WITH_REGS
+	select DYNAMIC_FTRACE
+	select DYNAMIC_FTRACE_WITH_REGS
 	select HAVE_EBPF_JIT			if X86_64
 	select HAVE_EFFICIENT_UNALIGNED_ACCESS
 	select HAVE_EXIT_THREAD
@@ -140,6 +142,7 @@ config X86
 	select HAVE_FTRACE_MCOUNT_RECORD
 	select HAVE_FUNCTION_GRAPH_TRACER
 	select HAVE_FUNCTION_TRACER
+	select FUNCTION_TRACER
 	select HAVE_GCC_PLUGINS
 	select HAVE_HW_BREAKPOINT
 	select HAVE_IDE
diff --git a/arch/x86/kernel/ftrace_64.S b/arch/x86/kernel/ftrace_64.S
index 7cb8ba08beb9..1e219e0f2887 100644
--- a/arch/x86/kernel/ftrace_64.S
+++ b/arch/x86/kernel/ftrace_64.S
@@ -19,6 +19,7 @@ EXPORT_SYMBOL(__fentry__)
 # define function_hook	mcount
 EXPORT_SYMBOL(mcount)
 #endif
+	ret
 
 /* All cases save the original rbp (8 bytes) */
 #ifdef CONFIG_FRAME_POINTER