Message ID | 20190427100639.15074-4-nstange@suse.de (mailing list archive)
---|---
State | New
Series | x86/ftrace: make ftrace_int3_handler() not to skip fops invocation
On Sat, Apr 27, 2019 at 12:06:38PM +0200, Nicolai Stange wrote:
> ftrace_int3_handler()'s context is different from the interrupted call
> instruction's one, obviously. In order to be able to emulate the call
> within the original context, make ftrace_int3_handler() set its iret
> frame's ->ip to some helper stub. Upon return from the trap, this stub will
> then mimic the call by pushing the return address onto the stack and
> issuing a jmp to the target address. As described above, the jmp target
> will be either of ftrace_ops_list_func() or ftrace_regs_caller(). Provide
> one such stub implementation for each of the two cases.

Yuck; I'd much rather we get that static_call() stuff sorted such that
text_poke() and poke_int3_handler() can do CALL emulation.

Given all the back and forth, I think the solution where we shift
pt_regs a bit to allow the emulated PUSH is a viable solution; eg. I
think we collectively hated it least.
[ Added Linus ]

On Sat, 27 Apr 2019 12:26:57 +0200 Peter Zijlstra <peterz@infradead.org> wrote:

> On Sat, Apr 27, 2019 at 12:06:38PM +0200, Nicolai Stange wrote:
> > ftrace_int3_handler()'s context is different from the interrupted call
> > instruction's one, obviously. In order to be able to emulate the call
> > within the original context, make ftrace_int3_handler() set its iret
> > frame's ->ip to some helper stub. Upon return from the trap, this stub will
> > then mimic the call by pushing the return address onto the stack and
> > issuing a jmp to the target address. As described above, the jmp target
> > will be either of ftrace_ops_list_func() or ftrace_regs_caller(). Provide
> > one such stub implementation for each of the two cases.
>
> Yuck; I'd much rather we get that static_call() stuff sorted such that
> text_poke() and poke_int3_handler() can do CALL emulation.
>
> Given all the back and forth, I think the solution where we shift
> pt_regs a bit to allow the emulated PUSH is a viable solution; eg. I
> think we collectively hated it least.

Note, this use case is slightly different than the static calls but has
the same issue. That issue is that we need a way to simulate a "call"
function from int3, and there's no good way to handle that case right now.

The static call needs to modify the call site but make the call still work
from int3 context. This use case is similar, but the issue is with a "bug"
in the function tracing implementation that has become an issue with live
kernel patching, which is built on top of it.

The issue is this:

For optimization reasons, if there's only a single user of a function it
gets its own trampoline that sets up the call to its callback and calls
that callback directly:

 <function being traced>
    call custom trampoline

 <custom trampoline>
    save regs to call C code
    call custom_callback
    restore regs
    ret

If more than one callback is attached to that function, then we need to
call the iterator that loops through all registered callbacks of the
function tracer and checks their hashes to see if they want to trace this
function or not.

 <function being traced>
    call iterator_trampoline

 <iterator_trampoline>
    save regs to call C code
    call iterator
    restore regs
    ret

What happens in that transition? We add an "int3" at the function being
traced, and change:

    call custom_trampoline

to

    call iterator_trampoline

But if the function being traced is executed during this transition, the
int3 just makes it act like a nop and returns past the call. For tracing
this is a bug, because we just missed a function that should be traced.
For live kernel patching, this could be more devastating because the
original "buggy" patched function is called, and this could cause subtle
problems.

If we can add a slot to the int3 handler that we can use to emulate a
call (perhaps add another slot to the pt_regs structure and call it a
link register), it would make it easier to solve both this and the static
call problems.

-- Steve
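For reference, the "act like a nop" behaviour described above is what the
pre-patch ftrace_int3_handler() in arch/x86/kernel/ftrace.c does; it is the
code being replaced in the diffs further down this thread, and looks
essentially like this:

    /*
     * A breakpoint was added to the code address we are about to modify.
     * While the transition is in progress, treat the patched call site as
     * if it were a nop: step over the 5-byte slot and return.
     */
    int ftrace_int3_handler(struct pt_regs *regs)
    {
    	unsigned long ip;

    	if (WARN_ON_ONCE(!regs))
    		return 0;

    	ip = regs->ip - 1;	/* regs->ip points just past the int3 byte */
    	if (!ftrace_location(ip) && !is_ftrace_caller(ip))
    		return 0;	/* not an ftrace-owned breakpoint */

    	regs->ip += MCOUNT_INSN_SIZE - 1;	/* skip the call slot */

    	return 1;
    }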
On Sun, Apr 28, 2019 at 10:38 AM Steven Rostedt <rostedt@goodmis.org> wrote: > > For optimization reasons, if there's only a single user of a function > it gets its own trampoline that sets up the call to its callback and > calls that callback directly: So this is the same issue as the static calls, and it has exactly the same solution. Which I already outlined once, and nobody wrote the code for. So here's a COMPLETELY UNTESTED patch that only works (_if_ it works) for (a) 64-bit (b) SMP but that's just because I've hardcoded the percpu segment handling. It does *not* emulate the "call" in the BP handler itself, instead if replace the %ip (the same way all the other BP handlers replace the %ip) with a code sequence that just does push %gs:bp_call_return jmp *%gs:bp_call_target after having filled in those per-cpu things. The reason they are percpu is that after the %ip has been changed, the target CPU goes its merry way, and doesn't wait for the text--poke semaphore serialization. But since we have interrupts disabled on that CPU, we know that *another* text poke won't be coming around and changing the values. THIS IS ENTIRELY UNTESTED! I've built it, and it at least seems to build, although with warnings arch/x86/kernel/alternative.o: warning: objtool: emulate_call_irqoff()+0x9: indirect jump found in RETPOLINE build arch/x86/kernel/alternative.o: warning: objtool: emulate_call_irqon()+0x8: indirect jump found in RETPOLINE build arch/x86/kernel/alternative.o: warning: objtool: emulate_call_irqoff()+0x9: sibling call from callable instruction with modified stack frame arch/x86/kernel/alternative.o: warning: objtool: emulate_call_irqon()+0x8: sibling call from callable instruction with modified stack frame that will need the appropriate "ignore this case" annotations that I didn't do. Do I expect it to work? No. I'm sure there's some silly mistake here, but the point of the patch is to show it as an example, so that it can actually be tested. With this, it should be possible (under the text rewriting lock) to do replace_call(callsite, newcallopcode, callsize, calltargettarget); to do the static rewriting of the call at "callsite" to have the new call target. And again. Untested. But doesn't need any special code in the entry path, and the concept is simple even if there are probably stupid bugs just because it's entirely untested. Oh, and did I mention that I didn't test this? Linus arch/x86/kernel/alternative.c | 54 ++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 51 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c index 9a79c7808f9c..92b59958cff3 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -739,7 +739,11 @@ static void do_sync_core(void *info) } static bool bp_patching_in_progress; -static void *bp_int3_handler, *bp_int3_addr; +static void *bp_int3_handler_irqoff, *bp_int3_handler_irqon, *bp_int3_addr; +static void *bp_int3_call_target, *bp_int3_call_return; + +static DEFINE_PER_CPU(void *, bp_call_return); +static DEFINE_PER_CPU(void *, bp_call_target); int poke_int3_handler(struct pt_regs *regs) { @@ -762,7 +766,22 @@ int poke_int3_handler(struct pt_regs *regs) return 0; /* set up the specified breakpoint handler */ - regs->ip = (unsigned long) bp_int3_handler; + regs->ip = (unsigned long) bp_int3_handler_irqon; + + /* + * If we want an irqoff irq3 handler, and interrupts were + * on, we turn them off and use the special irqoff handler + * instead. 
+ */ + if (bp_int3_handler_irqoff) { + this_cpu_write(bp_call_target, bp_int3_call_target); + this_cpu_write(bp_call_return, bp_int3_call_return); + + if (regs->flags & X86_EFLAGS_IF) { + regs->flags &= ~X86_EFLAGS_IF; + regs->ip = (unsigned long) bp_int3_handler_irqoff; + } + } return 1; } @@ -792,7 +811,7 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler) { unsigned char int3 = 0xcc; - bp_int3_handler = handler; + bp_int3_handler_irqon = handler; bp_int3_addr = (u8 *)addr + sizeof(int3); bp_patching_in_progress = true; @@ -830,7 +849,36 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler) * the writing of the new instruction. */ bp_patching_in_progress = false; + bp_int3_handler_irqoff = NULL; return addr; } +extern asmlinkage void emulate_call_irqon(void); +extern asmlinkage void emulate_call_irqoff(void); + +asm( + ".text\n" + ".global emulate_call_irqoff\n" + ".type emulate_call_irqoff, @function\n" + "emulate_call_irqoff:\n\t" + "push %gs:bp_call_return\n\t" + "sti\n\t" + "jmp *%gs:bp_call_target\n" + ".size emulate_call_irqoff, .-emulate_call_irqoff\n" + + ".global emulate_call_irqon\n" + ".type emulate_call_irqon, @function\n" + "emulate_call_irqon:\n\t" + "push %gs:bp_call_return\n\t" + "jmp *%gs:bp_call_target\n" + ".size emulate_call_irqon, .-emulate_call_irqon\n" + ".previous\n"); + +void replace_call(void *addr, const void *opcode, size_t len, void *target) +{ + bp_int3_call_target = target; + bp_int3_call_return = addr + len; + bp_int3_handler_irqoff = emulate_call_irqoff; + text_poke_bp(addr, opcode, len, emulate_call_irqon); +}
On Mon, Apr 29, 2019 at 11:06 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> It does *not* emulate the "call" in the BP handler itself, instead it
> replaces the %ip (the same way all the other BP handlers replace the
> %ip) with a code sequence that just does
>
>     push %gs:bp_call_return
>     jmp *%gs:bp_call_target
>
> after having filled in those per-cpu things.

Note that if you read the patch, you'll see that my explanation glossed
over the "what if an interrupt happens" part. Which is handled by having
two handlers, one for "interrupts were already disabled" and one for
"interrupts were enabled, so I disabled them before entering the handler".

The second handler does the same push/jmp sequence, but has a "sti" before
the jmp. Because of the one-instruction sti shadow, interrupts won't
actually be enabled until after the jmp instruction has completed, and
thus the "push/jmp" is atomic wrt regular interrupts.

It's not safe wrt NMI, of course, but since NMI won't be rescheduling, and
since any SMP IPI won't be punching through that sequence anyway, it's
still atomic wrt _another_ text_poke() attempt coming in and re-using the
bp_call_return/target slots.

So yeah, it's not "one-liner" trivial, but it's not like it's complicated
either, and it actually matches the existing "call this code to emulate
the replaced instruction".

So I'd much rather have a couple of tens of lines of code here that still
acts pretty much exactly like all the other rewriting does, rather than
play subtle games with the entry stack frame.

Finally: there might be other situations where you want to have this kind
of "pseudo-atomic" replacement sequence, so I think while it's a hack
specific to emulating a "call" instruction, I don't think it is
conceptually limited to just that case.

                Linus
On Mon, Apr 29, 2019 at 11:29 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Mon, Apr 29, 2019 at 11:06 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > It does *not* emulate the "call" in the BP handler itself, instead it
> > replaces the %ip (the same way all the other BP handlers replace the
> > %ip) with a code sequence that just does
> >
> >     push %gs:bp_call_return
> >     jmp *%gs:bp_call_target
> >
> > after having filled in those per-cpu things.
>
> Note that if you read the patch, you'll see that my explanation
> glossed over the "what if an interrupt happens" part. Which is handled
> by having two handlers, one for "interrupts were already disabled" and
> one for "interrupts were enabled, so I disabled them before entering
> the handler".

This is quite a cute solution.

> The second handler does the same push/jmp sequence, but has a "sti"
> before the jmp. Because of the one-instruction sti shadow, interrupts
> won't actually be enabled until after the jmp instruction has
> completed, and thus the "push/jmp" is atomic wrt regular interrupts.
>
> It's not safe wrt NMI, of course, but since NMI won't be rescheduling,
> and since any SMP IPI won't be punching through that sequence anyway,
> it's still atomic wrt _another_ text_poke() attempt coming in and
> re-using the bp_call_return/target slots.

I'm less than 100% convinced about this argument. Sure, an NMI right
there won't cause a problem. But an NMI followed by an interrupt will
kill us if preemption is on. I can think of three solutions:

1. Assume that all CPUs (and all relevant hypervisors!) either mask NMIs
   in the STI shadow or return from NMIs with interrupts masked for one
   instruction. Ditto for MCE. This seems too optimistic.

2. Put a fixup in the NMI handler and MCE handler: if the return address
   is one of these magic jumps, clear IF and back up IP by one byte. This
   should *work*, but it's a bit ugly.

3. Use a different magic sequence:

       push %gs:bp_call_return
       int3

   and have the int3 handler adjust regs->ip and regs->flags as
   appropriate.

I think I like #3 the best, even though it's twice as slow.

FWIW, kernel shadow stack patches will show up eventually, and none of
these approaches are compatible as is. Perhaps the actual sequence should
be this, instead:

   bp_int3_fixup_irqsoff:
       call 1f
   1:  int3

   bp_int3_fixup_irqson:
       call 1f
   1:  int3

and the int3 handler will update the conventional return address *and*
the shadow return address. Linus, what do you think about this variant?

Finally, with my maintainer hat on: if anyone actually wants to merge
this thing, I want to see a test case, exercised regularly (every boot if
configured, perhaps) that actually *runs* this code. Hitting it in
practice will be rare, and I want bugs to be caught.
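A minimal sketch of how the int3-handler side of option #3 above could
look, under the assumption that the *_end labels mark the address right
after each stub's trailing int3 and that bp_call_target was filled in by
the first breakpoint hit. This is one possible reading of the proposal,
not code from any patch in this thread:

    	/*
    	 * Trailing int3 of a hypothetical "push %gs:bp_call_return; int3"
    	 * stub: the return address is already on the stack, so just
    	 * branch to the call target.  For the variant used when
    	 * interrupts had been enabled, turn them back on atomically via
    	 * the iret that ends this trap.
    	 */
    	if (regs->ip == (unsigned long)bp_int3_fixup_irqson_end) {
    		regs->ip = (unsigned long)this_cpu_read(bp_call_target);
    		return 1;
    	}
    	if (regs->ip == (unsigned long)bp_int3_fixup_irqsoff_end) {
    		regs->ip = (unsigned long)this_cpu_read(bp_call_target);
    		regs->flags |= X86_EFLAGS_IF;	/* interrupts come back on at iret */
    		return 1;
    	}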
On Mon, 29 Apr 2019 11:06:58 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> +void replace_call(void *addr, const void *opcode, size_t len, void *target)
> +{
> +	bp_int3_call_target = target;
> +	bp_int3_call_return = addr + len;
> +	bp_int3_handler_irqoff = emulate_call_irqoff;
> +	text_poke_bp(addr, opcode, len, emulate_call_irqon);
> +}

Note, the function tracer does not use text poke. It does it in batch
mode. It can update over 40,000 calls in one go:

  add int3 breakpoint to all 40,000 call sites.
  sync()
  change the last four bytes of each of those call sites
  sync()
  remove int3 from the 40,000 call sites with the new call.

It's a bit more intrusive than the static call updates we were discussing
before.

-- Steve
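Schematically, the batch transition Steve describes looks like the
pseudocode below; the helper names are placeholders for the real
ftrace/text-patching primitives, not existing kernel API:

    static void ftrace_batch_transition(struct dyn_ftrace *recs, int count)
    {
    	int i;

    	for (i = 0; i < count; i++)			/* step 1: breakpoint every site */
    		write_first_byte(recs[i].ip, 0xcc);	/* int3 */
    	sync_cores();					/* all CPUs now see the int3 */

    	for (i = 0; i < count; i++)			/* step 2: rewrite the call offset */
    		write_tail_bytes(recs[i].ip, new_call_offset(&recs[i]));
    	sync_cores();

    	for (i = 0; i < count; i++)			/* step 3: restore the first byte */
    		write_first_byte(recs[i].ip, new_first_byte(&recs[i]));
    	sync_cores();
    }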
On Mon, Apr 29, 2019 at 11:53 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > > On Mon, Apr 29, 2019, 11:42 Andy Lutomirski <luto@kernel.org> wrote: >> >> >> I'm less than 100% convinced about this argument. Sure, an NMI right >> there won't cause a problem. But an NMI followed by an interrupt will >> kill us if preemption is on. I can think of three solutions: > > > No, because either the sti shadow disables nmi too (that's the case on some CPUs at least) or the iret from nmi does. > > Otherwise you could never trust the whole sti shadow thing - and it very much is part of the architecture. > Is this documented somewhere? And do you actually believe that this is true under KVM, Hyper-V, etc? As I recall, Andrew Cooper dug in to the way that VMX dealt with this stuff and concluded that the SDM was blatantly wrong in many cases, which leads me to believe that Xen HVM/PVH is the *only* hypervisor that gets it right. Steven's point about batched updates is quite valid, though. My personal favorite solution to this whole mess is to rework the whole thing so that the int3 handler simply returns and retries and to replace the sync_core() broadcast with an SMI broadcast. I don't know whether this will actually work on real CPUs and on VMs and whether it's going to crash various BIOSes out there.
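For contrast, the "return and retry" idea mentioned above would shrink the
breakpoint handling for these sites to roughly the fragment below; the
cost is that the CPU keeps re-taking the trap until the patching sequence
finishes, which is why the sync_core()/SMI question matters. This is only
an illustration of the idea, not a proposed patch:

    	if (ftrace_location(ip) || is_ftrace_caller(ip)) {
    		/*
    		 * Don't emulate anything: back up to the trapped byte and
    		 * re-execute it.  Once the new instruction is fully
    		 * written, the int3 is gone and the real call (or nop)
    		 * runs.
    		 */
    		regs->ip -= 1;
    		return 1;
    	}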
On Mon, 29 Apr 2019 11:59:04 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> I really don't care. Just do what I suggested, and if you have numbers to
> show problems, then maybe I'll care.

Are you suggesting that I rewrite the code to do it one function at a
time? This has always been batch mode. This is not something new. The
function tracer has been around longer than the text poke code.

> Right now you're just making excuses for this. I described the solution
> months ago, now I've written a patch, if that's not good enough then we can
> just skip this all entirely.
>
> Honestly, if you need to rewrite tens of thousands of calls, maybe you're
> doing something wrong?

 # cd /sys/kernel/debug/tracing
 # cat available_filter_functions | wc -l
 45856
 # cat enabled_functions | wc -l
 0
 # echo function > current_tracer
 # cat enabled_functions | wc -l
 45856

There, I just enabled 45,856 function call sites in one shot!

How else do you want to update them? Every function in the kernel has a
nop that turns into a call to the ftrace handler; if I add another user
of that code, it will change each one as well.

-- Steve
On Mon, Apr 29, 2019 at 12:13 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Mon, Apr 29, 2019, 12:02 Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>
>> If nmi were to break it, it would be a cpu bug.
>
> Side note: we *already* depend on sti shadow working in other parts of
> the kernel, namely sti->iret.

Where? STI; IRET would be nuts. Before:

    commit 4214a16b02971c60960afd675d03544e109e0d75
    Author: Andy Lutomirski <luto@kernel.org>
    Date:   Thu Apr 2 17:12:12 2015 -0700

        x86/asm/entry/64/compat: Use SYSRETL to return from compat mode SYSENTER

we did sti; sysexit, but, when we discussed this, I don't recall anyone
speaking up in favor of the safety of the old code.

Not to mention that the crash we'll get if we get an NMI and a
rescheduling interrupt in this path will be very, very hard to debug.
On Mon, Apr 29, 2019 at 12:07 PM Steven Rostedt <rostedt@goodmis.org> wrote: > > Are you suggesting that I rewrite the code to do it one function at a > time? This has always been batch mode. This is not something new. The > function tracer has been around longer than the text poke code. Only do the 'call' instructions one at a time. Why would you change _existing_ code? Linus
On Mon, Apr 29, 2019 at 12:24 PM Andy Lutomirski <luto@kernel.org> wrote: > > Side note: we *already* depend on sti shadow working in other parts of the kernel, namely sti->iret. > > Where? STI; IRET would be nuts. Sorry, not 'sti;iret' but 'sti;sysexit' > before commit 4214a16b02971c60960afd675d03544e109e0d75 > x86/asm/entry/64/compat: Use SYSRETL to return from compat mode SYSENTER > > we did sti; sysxit, but, when we discussed this, I don't recall anyone > speaking up in favor of the safely of the old code. We still have that sti sysexit in the 32-bit code. Linus
On Mon, Apr 29, 2019 at 12:02 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > If nmi were to break it, it would be a cpu bug. I'm pretty sure I've > seen the "shadow stops even nmi" documented for some uarch, but as > mentioned it's not necessarily the only way to guarantee the shadow. In fact, the documentation is simply the official Intel instruction docs for "STI": The IF flag and the STI and CLI instructions do not prohibit the generation of exceptions and NMI interrupts. NMI interrupts (and SMIs) may be blocked for one macroinstruction following an STI. note the "may be blocked". As mentioned, that's just one option for not having NMI break the STI shadow guarantee, but it's clearly one that Intel has done at times, and clearly even documents as having done so. There is absolutely no question that the sti shadow is real, and that people have depended on it for _decades_. It would be a horrible errata if the shadow can just be made to go away by randomly getting an NMI or SMI. Linus
On Mon, Apr 29, 2019 at 1:06 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> Only do the 'call' instructions one at a time. Why would you change
> _existing_ code?

Side note: if you want to, you can easily batch up rewriting 'call'
instructions to the same target using the exact same code.

You just need to change the int3 handler case to calculate
bp_int3_call_return not from the fixed one-time address, but with
something like

    this_cpu_write(bp_call_return, int3_address-1+bp_int3_call_size);

instead (and you'd need to also teach the function that there's not just
a single int3 live at a time)

                Linus
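Spelled out, that handler change would look something like this;
bp_int3_call_size is taken from the mail above and is not an existing
symbol, and regs->ip at the trap points one byte past the int3, i.e. at
call_site + 1:

    	/* the emulated call must return to call_site + size of the call insn */
    	this_cpu_write(bp_call_target, bp_int3_call_target);
    	this_cpu_write(bp_call_return,
    		       (void *)(regs->ip - 1 + bp_int3_call_size));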
On Mon, 29 Apr 2019 13:06:17 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, Apr 29, 2019 at 12:07 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > Are you suggesting that I rewrite the code to do it one function at a
> > time? This has always been batch mode. This is not something new. The
> > function tracer has been around longer than the text poke code.
>
> Only do the 'call' instructions one at a time. Why would you change
> _existing_ code?

The function tracing is a call instruction. On boot:

 <function_X>:
    nop
    blah
    blah

After a callback to function tracing is registered:

 <function_X>
    call custom_trampoline
    blah
    blah

If a second callback is attached to that function:

 <function_X>
    call iterator_trampoline
    blah
    blah

The update from "call custom_trampoline" to "call iterator_trampoline" is
where we have an issue.

We could make this a special case where we do this one at a time, but
currently the code is all the same, looking at tables to determine what
to do. Which is one of three:

 1) change nop to call function
 2) change call function to nop
 3) update call function to another call function

#3 is where we have an issue. But if we want this to be different, we
would need to change the code significantly, and know that we are only
updating calls to calls. Which would take a bit of accounting to see if
that's the change that is being made.

This thread started because that #3 operation causes a call to be missed:
we turn it into a nop while we make the transition, where in reality it
needs to be a call to one of the two functions in the transition.

-- Steve
On Mon, Apr 29, 2019 at 1:30 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> The update from "call custom_trampoline" to "call iterator_trampoline"
> is where we have an issue.

So it has never worked. Just tell people that they have two choices:

 - you do the careful rewriting, which takes more time

 - you do it by rewriting as nop and then back, which is what historically
   has been done, and that is fast and simple, because there's no need to
   be careful.

Really. I find your complaints completely incomprehensible. You've never
rewritten call instructions atomically before, and now you complain about
it being slightly more expensive to do it when I give you the code? Yes it
damn well will be slightly more expensive. Deal with it.

Btw, once again - I several months ago also gave a suggestion on how it
could be done batch-mode by having lots of those small stubs and just
generating them dynamically.

You never wrote that code *EITHER*. It's been *months*.

So now I've written the non-batch-mode code for you, and you just
*complain* about it?

I'm done with this discussion. I'm totally fed up.

                Linus
On Mon, Apr 29, 2019 at 11:57 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> > Otherwise you could never trust the whole sti shadow thing - and it
> > very much is part of the architecture.
>
> Is this documented somewhere?

Btw, if you really don't trust the sti shadow despite it going all the way
back to the 8086, then you could instead make the irqoff code do

    push %gs:bp_call_return
    push %gs:bp_call_target
    sti
    ret

which just keeps interrupts explicitly disabled over the whole use of the
percpu data. The actual "ret" instruction doesn't matter, it's not going
to change in this model (where the code isn't dynamically generated or
changed).

So I claim that it will still be protected by the sti shadow, but when
written that way it doesn't actually matter, and you could reschedule
immediately after the sti (getting an interrupt there might make the stack
frame look odd, but it doesn't really affect anything else)

                Linus
On Mon, 29 Apr 2019 14:38:35 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Mon, Apr 29, 2019 at 1:30 PM Steven Rostedt <rostedt@goodmis.org> wrote: > > > > The update from "call custom_trampoline" to "call iterator_trampoline" > > is where we have an issue. > > So it has never worked. Just tell people that they have two chocies: The custom call to iterator translation never worked. One option is to always call the iterator, and just take the hit. There's another solution to to make permanent updates that would handle the live patching case, but not for the tracing case. It is more critical for live patching than tracing (missed traced function is not as critical as running buggy code unexpectedly). I could look to work on that instead. > > - you do the careful rewriting, which takes more time Which would be the method I said making the call to call a special case. > > - you do it by rewriting as nop and then back, which is what > historically has been done, and that is fast and simple, because > there's no need to be careful. Except in the live kernel patching case where you just put back the buggy function. > > Really. I find your complaints completely incomprehensible. You've > never rewritten call instructions atomically before, and now you > complain about it being slightly more expensive to do it when I give > you the code? Yes it damn well will be slightly more expensive. Deal > with it. I wasn't complaining about the expense of doing it that way. I was stating that it would take more time to get it right as it as it would require a different implementation for the call to call case. > > Btw, once again - I several months ago also gave a suggestion on how > it could be done batch-mode by having lots of those small stubs and > just generating them dynamically. > > You never wrote that code *EITHER*. It's been *months*. Note, I wasn't the one writing the code back then either. I posted a proof of concept for a similar topic (trace events) for the purpose of bringing back the performance lost due to the retpolines (something like 8% hit). Josh took the initiative to do that work, but I don't remember ever getting to a consensus on a solution to that. Yes you had given us ideas, but I remember people (like Andy) having concerns with it. But because it was just an optimization change, and Josh had other things to work on, that work just stalled. But I just found out that the function tracing code has the same issue, but this can cause real problems with live kernel patching. This is why this patch set started. > > So now I've written the non-batch-mode code for you, and you just > *complain* about it? Because this is a different story. The trace event code is updated one at a time (there's patches out there to do it in batch). But this is not trace events. This is function tracing which only does batching. > > I'm done with this discussion. I'm totally fed up. > Note, I wasn't writing this code anyway as I didn't have the time to. I was helping others who took the initiative to do this work. But I guess they didn't have time to move it forward either. For the live kernel patching case, I'll start working on the permanent changes, where once a function entry is changed, it can never be put back. Then we wont have an issue with the live kernel patching case, but only for tracing. But the worse thing there is you miss tracing functions while an update is being made. If Nicolai has time, perhaps he can try out the method you suggest and see if we can move this forward. 
-- Steve
On Mon, Apr 29, 2019 at 01:16:10PM -0700, Linus Torvalds wrote: > On Mon, Apr 29, 2019 at 12:02 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > If nmi were to break it, it would be a cpu bug. I'm pretty sure I've > > seen the "shadow stops even nmi" documented for some uarch, but as > > mentioned it's not necessarily the only way to guarantee the shadow. > > In fact, the documentation is simply the official Intel instruction > docs for "STI": > > The IF flag and the STI and CLI instructions do not prohibit the > generation of exceptions and NMI interrupts. NMI interrupts (and > SMIs) may be blocked for one macroinstruction following an STI. > > note the "may be blocked". As mentioned, that's just one option for > not having NMI break the STI shadow guarantee, but it's clearly one > that Intel has done at times, and clearly even documents as having > done so. > > There is absolutely no question that the sti shadow is real, and that > people have depended on it for _decades_. It would be a horrible > errata if the shadow can just be made to go away by randomly getting > an NMI or SMI. FWIW, Lakemont (Quark) doesn't block NMI/SMI in the STI shadow, but I'm not sure that counters the "horrible errata" statement ;-). SMI+RSM saves and restores STI blocking in that case, but AFAICT NMI has no such protection and will effectively break the shadow on its IRET. All other (modern) Intel uArchs block NMI in the shadow and also enforce STI_BLOCKING==0 when injecting an NMI via VM-Enter, i.e. prevent a VMM from breaking the shadow so long as the VMM preserves the shadow info. KVM is generally ok with respect to STI blocking, but ancient versions didn't migrate STI blocking and there's currently a hole where single-stepping a guest (from host userspace) could drop STI_BLOCKING if a different VM-Exit occurs between the single-step #DB VM-Exit and the instruction in the shadow. Though "don't do that" may be a reasonable answer in that case.
On Mon, Apr 29, 2019 at 3:08 PM Sean Christopherson <sean.j.christopherson@intel.com> wrote: > > FWIW, Lakemont (Quark) doesn't block NMI/SMI in the STI shadow, but I'm > not sure that counters the "horrible errata" statement ;-). SMI+RSM saves > and restores STI blocking in that case, but AFAICT NMI has no such > protection and will effectively break the shadow on its IRET. Ugh. I can't say I care deeply about Quark (ie never seemed to go anywhere), but it's odd. I thought it was based on a Pentium core (or i486+?). Are you saying those didn't do it either? I have this dim memory about talking about this with some (AMD?) engineer, and having an alternative approach for the sti shadow wrt NMI - basically not checking interrupts in the instruction you return to with 'iret'. I don't think it was even conditional on the "iret from NMI", I think it was basically any iret also did the sti shadow thing. But I can find no actual paper to back that up, so this may be me just making sh*t up. > KVM is generally ok with respect to STI blocking, but ancient versions > didn't migrate STI blocking and there's currently a hole where > single-stepping a guest (from host userspace) could drop STI_BLOCKING > if a different VM-Exit occurs between the single-step #DB VM-Exit and the > instruction in the shadow. Though "don't do that" may be a reasonable > answer in that case. I thought the sti shadow blocked the single-step exception too? I know "mov->ss" does block debug interrupts too. Or are you saying that it's some "single step by emulation" that just miss setting the STI_BLOCKING flag? Linus
On Mon, Apr 29, 2019 at 03:22:09PM -0700, Linus Torvalds wrote: > On Mon, Apr 29, 2019 at 3:08 PM Sean Christopherson > <sean.j.christopherson@intel.com> wrote: > > > > FWIW, Lakemont (Quark) doesn't block NMI/SMI in the STI shadow, but I'm > > not sure that counters the "horrible errata" statement ;-). SMI+RSM saves > > and restores STI blocking in that case, but AFAICT NMI has no such > > protection and will effectively break the shadow on its IRET. > > Ugh. I can't say I care deeply about Quark (ie never seemed to go > anywhere), but it's odd. I thought it was based on a Pentium core (or > i486+?). Are you saying those didn't do it either? It's 486 based, but either way I suspect the answer is "yes". IIRC, Knights Corner, a.k.a. Larrabee, also had funkiness around SMM and that was based on P54C, though I'm struggling to recall exactly what the Larrabee weirdness was. > I have this dim memory about talking about this with some (AMD?) > engineer, and having an alternative approach for the sti shadow wrt > NMI - basically not checking interrupts in the instruction you return > to with 'iret'. I don't think it was even conditional on the "iret > from NMI", I think it was basically any iret also did the sti shadow > thing. > > But I can find no actual paper to back that up, so this may be me just > making sh*t up. If Intel CPUs ever did anything like that on IRET it's long gone. > > KVM is generally ok with respect to STI blocking, but ancient versions > > didn't migrate STI blocking and there's currently a hole where > > single-stepping a guest (from host userspace) could drop STI_BLOCKING > > if a different VM-Exit occurs between the single-step #DB VM-Exit and the > > instruction in the shadow. Though "don't do that" may be a reasonable > > answer in that case. > > I thought the sti shadow blocked the single-step exception too? I know > "mov->ss" does block debug interrupts too. {MOV,POP}SS blocks #DBs, STI does not. > Or are you saying that it's some "single step by emulation" that just > miss setting the STI_BLOCKING flag? This is the case I was talking about for KVM. KVM supports single-stepping the guest from userpace and uses EFLAGS.TF to do so (since it works on both Intel and AMD). VMX has a consistency check that fails VM-Entry if STI_BLOCKING=1, EFLAGS.TF==1, IA32_DEBUGCTL.BTF=0 and there isn't a pending single-step #DB, and so KVM clears STI_BLOCKING immediately before entering the guest when single-stepping the guest. If a VM-Exit occurs immediately after VM-Entry, e.g. due to hardware interrupt, then KVM will see STI_BLOCKING=0 when processing guest events in its run loop and will inject any pending interrupts. I *think* the KVM behavior can be fixed, e.g. I'm not entirely sure why KVM takes this approach instead of setting PENDING_DBG.BS, but that's probably a moot point.
On Mon, Apr 29, 2019 at 05:08:46PM -0700, Sean Christopherson wrote: > On Mon, Apr 29, 2019 at 03:22:09PM -0700, Linus Torvalds wrote: > > On Mon, Apr 29, 2019 at 3:08 PM Sean Christopherson > > <sean.j.christopherson@intel.com> wrote: > > > > > > FWIW, Lakemont (Quark) doesn't block NMI/SMI in the STI shadow, but I'm > > > not sure that counters the "horrible errata" statement ;-). SMI+RSM saves > > > and restores STI blocking in that case, but AFAICT NMI has no such > > > protection and will effectively break the shadow on its IRET. > > > > Ugh. I can't say I care deeply about Quark (ie never seemed to go > > anywhere), but it's odd. I thought it was based on a Pentium core (or > > i486+?). Are you saying those didn't do it either? > > It's 486 based, but either way I suspect the answer is "yes". IIRC, > Knights Corner, a.k.a. Larrabee, also had funkiness around SMM and that > was based on P54C, though I'm struggling to recall exactly what the > Larrabee weirdness was. Aha! Found an ancient comment that explicitly states P5 does not block NMI/SMI in the STI shadow, while P6 does block NMI/SMI.
On Mon, Apr 29, 2019 at 5:45 PM Sean Christopherson <sean.j.christopherson@intel.com> wrote: > > On Mon, Apr 29, 2019 at 05:08:46PM -0700, Sean Christopherson wrote: > > > > It's 486 based, but either way I suspect the answer is "yes". IIRC, > > Knights Corner, a.k.a. Larrabee, also had funkiness around SMM and that > > was based on P54C, though I'm struggling to recall exactly what the > > Larrabee weirdness was. > > Aha! Found an ancient comment that explicitly states P5 does not block > NMI/SMI in the STI shadow, while P6 does block NMI/SMI. Ok, so the STI shadow really wouldn't be reliable on those machines. Scary. Of course, the good news is that hopefully nobody has them any more, and if they do, they presumably don't use fancy NMI profiling etc, so any actual NMI's are probably relegated purely to largely rare and effectively fatal errors anyway (ie memory parity errors). Linus
Steven Rostedt <rostedt@goodmis.org> writes: > On Mon, 29 Apr 2019 14:38:35 -0700 > Linus Torvalds <torvalds@linux-foundation.org> wrote: > >> On Mon, Apr 29, 2019 at 1:30 PM Steven Rostedt <rostedt@goodmis.org> wrote: >> > >> > The update from "call custom_trampoline" to "call iterator_trampoline" >> > is where we have an issue. >> >> So it has never worked. Just tell people that they have two chocies: > > The custom call to iterator translation never worked. One option is to > always call the iterator, and just take the hit. There's another > solution to to make permanent updates that would handle the live > patching case, but not for the tracing case. It is more critical for > live patching than tracing (missed traced function is not as critical > as running buggy code unexpectedly). I could look to work on that > instead. Making the live patching case permanent would disable tracing of live patched functions though? For unconditionally calling the iterator: I don't have numbers, but would expect that always searching through something like 50-70 live patching ftrace_ops from some hot path will be noticeable. > If Nicolai has time, perhaps he can try out the method you suggest and > see if we can move this forward. You mean making ftrace handle call->call transitions one by one in non-batch mode, right? Sure, I can do that. Another question though: is there anything that prevents us from calling ftrace_ops_list_func() directly from ftrace_int3_handler()? We don't have parent_ip, but if that is used for reporting only, maybe setting it to some ftrace_is_in_int3_and_doesnt_now_the_parent() will work? Graph tracing entries would still be lost, but well... Thanks, Nicolai
On Mon, Apr 29, 2019 at 07:26:02PM -0700, Linus Torvalds wrote: > On Mon, Apr 29, 2019 at 5:45 PM Sean Christopherson > <sean.j.christopherson@intel.com> wrote: > > > > On Mon, Apr 29, 2019 at 05:08:46PM -0700, Sean Christopherson wrote: > > > > > > It's 486 based, but either way I suspect the answer is "yes". IIRC, > > > Knights Corner, a.k.a. Larrabee, also had funkiness around SMM and that > > > was based on P54C, though I'm struggling to recall exactly what the > > > Larrabee weirdness was. > > > > Aha! Found an ancient comment that explicitly states P5 does not block > > NMI/SMI in the STI shadow, while P6 does block NMI/SMI. > > Ok, so the STI shadow really wouldn't be reliable on those machines. Scary. > > Of course, the good news is that hopefully nobody has them any more, > and if they do, they presumably don't use fancy NMI profiling etc, so > any actual NMI's are probably relegated purely to largely rare and > effectively fatal errors anyway (ie memory parity errors). We do have KNC perf support, if that chip has 'issues'... Outside of that, we only do perf from P6 onwards. With P4 support being in dubious shape, because it's massively weird and 'nobody' still has those machines.
On Mon, Apr 29, 2019 at 02:52:50PM -0400, Steven Rostedt wrote: > On Mon, 29 Apr 2019 11:06:58 -0700 > Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > +void replace_call(void *addr, const void *opcode, size_t len, void *target) > > +{ > > + bp_int3_call_target = target; > > + bp_int3_call_return = addr + len; > > + bp_int3_handler_irqoff = emulate_call_irqoff; > > + text_poke_bp(addr, opcode, len, emulate_call_irqon); > > +} > > Note, the function tracer does not use text poke. It does it in batch > mode. It can update over 40,000 calls in one go: Note that Daniel is adding batch stuff to text_poke(), because jump_label/static_key stuffs also end up wanting to change more than one site at a time and IPI spraying the machine for every single instance is hurting. So ideally ftrace would start to use the 'normal' code when that happens.
On Mon, 29 Apr 2019, Linus Torvalds wrote: > > > It's 486 based, but either way I suspect the answer is "yes". IIRC, > > > Knights Corner, a.k.a. Larrabee, also had funkiness around SMM and that > > > was based on P54C, though I'm struggling to recall exactly what the > > > Larrabee weirdness was. > > > > Aha! Found an ancient comment that explicitly states P5 does not block > > NMI/SMI in the STI shadow, while P6 does block NMI/SMI. > > Ok, so the STI shadow really wouldn't be reliable on those machines. Scary. > > Of course, the good news is that hopefully nobody has them any more, and > if they do, they presumably don't use fancy NMI profiling etc, so any > actual NMI's are probably relegated purely to largely rare and > effectively fatal errors anyway (ie memory parity errors). FWIW, if that thing has local apic (I have no idea, I've never seen Quark myself), then NMIs are used to trigger all-cpu backtrace as well. Which still can be done in situations where the kernel is then expected to continue running undisrupted (softlockup, sysrq, hung task detector, ...). Nothing to really worry about in the particular case of this HW perhaps, sure.
On Mon, Apr 29, 2019 at 03:06:30PM -0700, Linus Torvalds wrote:
> On Mon, Apr 29, 2019 at 11:57 AM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > > Otherwise you could never trust the whole sti shadow thing - and it
> > > very much is part of the architecture.
> >
> > Is this documented somewhere?
>
> Btw, if you really don't trust the sti shadow despite it going all the
> way back to the 8086, then you could instead make the irqoff code do
>
>     push %gs:bp_call_return
>     push %gs:bp_call_target
>     sti
>     ret

This variant cures the RETPOLINE complaint; due to there not actually
being an indirect jump anymore.

And it cures the sibling call complaint, but trades it for "return with
modified stack frame".

Something like so is clean:

+extern asmlinkage void emulate_call_irqon(void);
+extern asmlinkage void emulate_call_irqoff(void);
+
+asm(
+	".text\n"
+	".global emulate_call_irqoff\n"
+	".type emulate_call_irqoff, @function\n"
+	"emulate_call_irqoff:\n\t"
+		"push %gs:bp_call_return\n\t"
+		"push %gs:bp_call_target\n\t"
+		"sti\n\t"
+		"ret\n"
+	".size emulate_call_irqoff, .-emulate_call_irqoff\n"
+
+	".global emulate_call_irqon\n"
+	".type emulate_call_irqon, @function\n"
+	"emulate_call_irqon:\n\t"
+		"push %gs:bp_call_return\n\t"
+		"push %gs:bp_call_target\n\t"
+		"ret\n"
+	".size emulate_call_irqon, .-emulate_call_irqon\n"
+	".previous\n");
+
+STACK_FRAME_NON_STANDARD(emulate_call_irqoff);
+STACK_FRAME_NON_STANDARD(emulate_call_irqon);
On Tue, 30 Apr 2019 12:46:48 +0200 Peter Zijlstra <peterz@infradead.org> wrote: > On Mon, Apr 29, 2019 at 02:52:50PM -0400, Steven Rostedt wrote: > > On Mon, 29 Apr 2019 11:06:58 -0700 > > Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > > > +void replace_call(void *addr, const void *opcode, size_t len, void *target) > > > +{ > > > + bp_int3_call_target = target; > > > + bp_int3_call_return = addr + len; > > > + bp_int3_handler_irqoff = emulate_call_irqoff; > > > + text_poke_bp(addr, opcode, len, emulate_call_irqon); > > > +} > > > > Note, the function tracer does not use text poke. It does it in batch > > mode. It can update over 40,000 calls in one go: > > Note that Daniel is adding batch stuff to text_poke(), because > jump_label/static_key stuffs also end up wanting to change more than one > site at a time and IPI spraying the machine for every single instance is > hurting. > > So ideally ftrace would start to use the 'normal' code when that > happens. Sure, but that's a completely different issue. We would need to solve the 'emulate' call for batch mode first. -- Steve
On Mon, Apr 29, 2019 at 01:07:33PM -0700, Linus Torvalds wrote:
> On Mon, Apr 29, 2019 at 12:24 PM Andy Lutomirski <luto@kernel.org> wrote:
> > > Side note: we *already* depend on sti shadow working in other parts
> > > of the kernel, namely sti->iret.
> >
> > Where? STI; IRET would be nuts.
>
> Sorry, not 'sti;iret' but 'sti;sysexit'
>
> > before commit 4214a16b02971c60960afd675d03544e109e0d75
> > x86/asm/entry/64/compat: Use SYSRETL to return from compat mode SYSENTER
> >
> > we did sti; sysexit, but, when we discussed this, I don't recall anyone
> > speaking up in favor of the safety of the old code.
>
> We still have that sti sysexit in the 32-bit code.

We also have both: "STI; HLT" and "STI; MWAIT" where we rely on the STI
shadow. Getting an NMI in between shouldn't hurt too much, but if that in
turn can lead to an actual interrupt happening, we're up some creek
without no paddle.

Most modern systems don't use either anymore though. As
mwait_idle_with_hints() relies on MWAIT ECX[0]=1 to allow MWAIT to wake
from pending interrupts.
On Tue, Apr 30, 2019 at 09:44:45AM -0400, Steven Rostedt wrote: > On Tue, 30 Apr 2019 12:46:48 +0200 > Peter Zijlstra <peterz@infradead.org> wrote: > > > On Mon, Apr 29, 2019 at 02:52:50PM -0400, Steven Rostedt wrote: > > > On Mon, 29 Apr 2019 11:06:58 -0700 > > > Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > > > > > +void replace_call(void *addr, const void *opcode, size_t len, void *target) > > > > +{ > > > > + bp_int3_call_target = target; > > > > + bp_int3_call_return = addr + len; > > > > + bp_int3_handler_irqoff = emulate_call_irqoff; > > > > + text_poke_bp(addr, opcode, len, emulate_call_irqon); > > > > +} > > > > > > Note, the function tracer does not use text poke. It does it in batch > > > mode. It can update over 40,000 calls in one go: > > > > Note that Daniel is adding batch stuff to text_poke(), because > > jump_label/static_key stuffs also end up wanting to change more than one > > site at a time and IPI spraying the machine for every single instance is > > hurting. > > > > So ideally ftrace would start to use the 'normal' code when that > > happens. > > Sure, but that's a completely different issue. We would need to solve > the 'emulate' call for batch mode first. I don't see a problem there; when INT3 happens; you bsearch() the batch array to find the one you hit, you frob that into the percpu variables and do the thing. Or am I loosing the plot again?
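A sketch of the lookup Peter describes; the batch array layout, the
comparator, and the way the result feeds Linus's per-cpu slots are
assumptions here rather than an existing interface, and the array is
assumed to be sorted by address:

    #include <linux/bsearch.h>

    struct batch_entry {
    	void		*addr;		/* address of the patched instruction */
    	void		*call_target;	/* target of the emulated call */
    	size_t		len;		/* length of the patched instruction */
    };

    static int batch_cmp(const void *key, const void *elt)
    {
    	const struct batch_entry *e = elt;

    	if (key < e->addr)
    		return -1;
    	if (key > e->addr)
    		return 1;
    	return 0;
    }

    /* called from poke_int3_handler() while a batch is in flight */
    static bool poke_int3_batch(struct pt_regs *regs,
    			    struct batch_entry *batch, int num)
    {
    	void *ip = (void *)(regs->ip - 1);	/* the int3 byte itself */
    	struct batch_entry *e;

    	e = bsearch(ip, batch, num, sizeof(*e), batch_cmp);
    	if (!e)
    		return false;

    	this_cpu_write(bp_call_target, e->call_target);
    	this_cpu_write(bp_call_return, e->addr + e->len);
    	regs->ip = (unsigned long)emulate_call_irqon;	/* or the irqoff variant */
    	return true;
    }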
On Tue, 30 Apr 2019 16:20:47 +0200 Peter Zijlstra <peterz@infradead.org> wrote: > > > > Sure, but that's a completely different issue. We would need to solve > > the 'emulate' call for batch mode first. > > I don't see a problem there; when INT3 happens; you bsearch() the batch > array to find the one you hit, you frob that into the percpu variables > and do the thing. Or am I loosing the plot again? I may need to tweak Linus's patch slightly, but I may be able to get it to work with the batch mode. -- Steve
On Tue, Apr 30, 2019 at 6:56 AM Peter Zijlstra <peterz@infradead.org> wrote: > > On Mon, Apr 29, 2019 at 01:07:33PM -0700, Linus Torvalds wrote: > > > > We still have that sti sysexit in the 32-bit code. > > We also have both: "STI; HLT" and "STI; MWAIT" where we rely on the STI > shadow. I guess the good news is that in all cases we really only ever protect against a very unlikely race, and if the race happens it's not actually fatal. Yes, if we get an NMI and then an interrupt in between the "st;hlt" we might wait for the next interrupt and get a (potentially fairly horrible) latency issue. I guess that with maximal luck it might be a one-shot timer and not get re-armed, but it sounds very very very unlikely. Googling around, I actually find a patch from Avi Kivity from back in 2010 for this exact issue, apparently because kvm got this case wrong and somebody hit it. The patch never made it upstream exactly because kvm could be fixed and people decided that most real hardware didn't have the issue in the first place. In the discussion I found, Peter Anvin tried to get confirmation from AMD engineers about this too, but I don't see any resolution. Realistically, I don't think you can hit the problem in practice. The only way to hit that incredibly small race of "one instruction, *both* NMI and interrupts" is to have a lot of interrupts going all at the same time, but that will also then solve the latency problem, so the very act of triggering it will also fix it. I don't see any case where it's really bad. The "sti sysexit" race is similar, just about latency of user space signal reporting (and perhaps any pending TIF_WORK_xyz flags). So maybe we don't care deeply about the sti shadow. It's a potential latecy problem when broken, but not a huge issue. And for the instruction rewriting hack, moving to "push+sti+ret" also makes a lost sti shadow just a "possibly odd stack frame visibility" issue rather than anything deeply fatal. We can probably just write it off as "some old CPU's (and a smattering or very rare and not relevant new ones) have potential but unlikely latency issues because of a historical CPU mis-design - don't do perf on them". Linus
On Tue, Apr 30, 2019 at 9:06 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Tue, Apr 30, 2019 at 6:56 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Realistically, I don't think you can hit the problem in practice. The
> only way to hit that incredibly small race of "one instruction, *both*
> NMI and interrupts" is to have a lot of interrupts going all at the
> same time, but that will also then solve the latency problem, so the
> very act of triggering it will also fix it.
>
> I don't see any case where it's really bad. The "sti sysexit" race is
> similar, just about latency of user space signal reporting (and
> perhaps any pending TIF_WORK_xyz flags).

In the worst case, it actually kills the machine. Last time I tracked a
bug like this down, I think the issue was that we got preempted after the
last TIF_ check, entered a VM, exited, context switched back, and switched
to user mode without noticing that there was a pending KVM user return
notifier. This left us with bogus CPU state and the machine exploded.

Linus, can I ask you to reconsider your opposition to Josh's other
approach of just shifting the stack on int3 entry? I agree that it's ugly,
but the ugliness is easily manageable and fairly self-contained. We add a
little bit of complication to the entry asm (but it's not like it's
unprecedented -- the entry asm does all kinds of stack rearrangement due
to IST and PTI crap already), and we add an int3_emulate_call(struct
pt_regs *regs, unsigned long target) helper that has appropriate
assertions that the stack is okay and emulates the call. And that's it.

In contrast, your approach involves multiple asm trampolines, hash tables,
batching complications, and sti shadows.

As an additional argument, with the stack-shifting approach, it runs on
*every int3 from kernel mode*. This means that we can do something like
this:

static bool int3_emulate_call_okay(struct pt_regs *regs)
{
	unsigned long available_stack = regs->sp - (unsigned long);
	return available_stack >= sizeof(long);
}

void do_int3(...)
{
	WARN_ON_ONCE(!user_mode(regs) && !int3_emulate_call_okay(regs));
	...;
}

static void int3_emulate_call(struct pt_regs *regs, unsigned long target)
{
	BUG_ON(user_mode(regs) || !int3_emulate_call_okay(regs));
	regs->sp -= sizeof(unsigned long);
	*(unsigned long *)regs->sp = target;
	/* CET SHSTK fixup goes here */
}

Obviously the CET SHSTK fixup might be rather nasty, but I suspect it's a
solvable problem.

A major benefit of this is that the entry asm nastiness will get exercised
all the time, and, if we screw it up, the warning will fire. This is the
basic principle behind why the entry stuff *works* these days. I've put a
lot of effort into making sure that running kernels with
CONFIG_DEBUG_ENTRY and running the selftests actually exercises the nasty
cases.

--Andy
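Once the entry code has made room below regs->sp, "emulating the call" for
an ftrace site reduces to a few lines in the spirit of the helper above.
This is a sketch under that assumption only; MCOUNT_INSN_SIZE is the
5-byte call slot, and the fixup that eventually lands upstream may differ:

    /* emulate "call func" at the mcount site the int3 sits on */
    static void ftrace_int3_emulate_call(struct pt_regs *regs, unsigned long func)
    {
    	/* regs->ip points just past the int3; the next insn follows the call slot */
    	unsigned long return_ip = regs->ip - 1 + MCOUNT_INSN_SIZE;

    	regs->sp -= sizeof(unsigned long);	/* the spare word left by the entry code */
    	*(unsigned long *)regs->sp = return_ip;	/* what the real call would push */
    	regs->ip = func;			/* ...and where it would jump */
    }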
On Tue, 30 Apr 2019 09:33:51 -0700 Andy Lutomirski <luto@kernel.org> wrote: > Linus, can I ask you to reconsider your opposition to Josh's other > approach of just shifting the stack on int3 entry? I agree that it's > ugly, but the ugliness is easily manageable and fairly self-contained. > We add a little bit of complication to the entry asm (but it's not > like it's unprecedented -- the entry asm does all kinds of stack > rearrangement due to IST and PTI crap already), and we add an > int3_emulate_call(struct pt_regs *regs, unsigned long target) helper > that has appropriate assertions that the stack is okay and emulates > the call. And that's it. I also prefer Josh's stack shift solution, as I personally believe that's a cleaner solution. But I went ahead and implemented Linus's version to get it working for ftrace. Here's the code, and it survived some preliminary tests. There's three places that use the update code. One is the start of every function call (yes, I counted that as one, and that case is determined by: ftrace_location(ip)). The other is the trampoline itself has an update. That could also be converted to a text poke, but for now its here as it was written before text poke existed. The third place is actually a jump (to the function graph code). But that can be safely skipped if we are converting it, as it only goes from jump to nop, or nop to jump. The trampolines reflect this. Also, as NMI code is traced by ftrace, I had to duplicate the trampolines for the nmi case (but only for the interrupts disabled case as NMIs don't have interrupts enabled). -- Steve diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c index ef49517f6bb2..bf320bf791dd 100644 --- a/arch/x86/kernel/ftrace.c +++ b/arch/x86/kernel/ftrace.c @@ -17,6 +17,7 @@ #include <linux/uaccess.h> #include <linux/ftrace.h> #include <linux/percpu.h> +#include <linux/frame.h> #include <linux/sched.h> #include <linux/slab.h> #include <linux/init.h> @@ -232,6 +233,9 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr, static unsigned long ftrace_update_func; +/* Used within inline asm below */ +unsigned long ftrace_update_func_call; + static int update_ftrace_func(unsigned long ip, void *new) { unsigned char old[MCOUNT_INSN_SIZE]; @@ -259,6 +263,8 @@ int ftrace_update_ftrace_func(ftrace_func_t func) unsigned char *new; int ret; + ftrace_update_func_call = (unsigned long)func; + new = ftrace_call_replace(ip, (unsigned long)func); ret = update_ftrace_func(ip, new); @@ -280,6 +286,70 @@ static nokprobe_inline int is_ftrace_caller(unsigned long ip) return 0; } +extern asmlinkage void ftrace_emulate_call_irqon(void); +extern asmlinkage void ftrace_emulate_call_irqoff(void); +extern asmlinkage void ftrace_emulate_call_nmi(void); +extern asmlinkage void ftrace_emulate_call_update_irqoff(void); +extern asmlinkage void ftrace_emulate_call_update_irqon(void); +extern asmlinkage void ftrace_emulate_call_update_nmi(void); + +static DEFINE_PER_CPU(void *, ftrace_bp_call_return); +static DEFINE_PER_CPU(void *, ftrace_bp_call_nmi_return); + +asm( + ".text\n" + ".global ftrace_emulate_call_irqoff\n" + ".type ftrace_emulate_call_irqoff, @function\n" + "ftrace_emulate_call_irqoff:\n\t" + "push %gs:ftrace_bp_call_return\n\t" + "sti\n\t" + "jmp ftrace_caller\n" + ".size ftrace_emulate_call_irqoff, .-ftrace_emulate_call_irqoff\n" + + ".global ftrace_emulate_call_irqon\n" + ".type ftrace_emulate_call_irqon, @function\n" + "ftrace_emulate_call_irqon:\n\t" + "push %gs:ftrace_bp_call_return\n\t" + "jmp 
ftrace_caller\n" + ".size ftrace_emulate_call_irqon, .-ftrace_emulate_call_irqon\n" + + ".global ftrace_emulate_call_nmi\n" + ".type ftrace_emulate_call_nmi, @function\n" + "ftrace_emulate_call_nmi:\n\t" + "push %gs:ftrace_bp_call_nmi_return\n\t" + "jmp ftrace_caller\n" + ".size ftrace_emulate_call_nmi, .-ftrace_emulate_call_nmi\n" + + ".global ftrace_emulate_call_update_irqoff\n" + ".type ftrace_emulate_call_update_irqoff, @function\n" + "ftrace_emulate_call_update_irqoff:\n\t" + "push %gs:ftrace_bp_call_return\n\t" + "sti\n\t" + "jmp *ftrace_update_func_call\n" + ".size ftrace_emulate_call_update_irqoff, .-ftrace_emulate_call_update_irqoff\n" + + ".global ftrace_emulate_call_update_irqon\n" + ".type ftrace_emulate_call_update_irqon, @function\n" + "ftrace_emulate_call_update_irqon:\n\t" + "push %gs:ftrace_bp_call_return\n\t" + "jmp *ftrace_update_func_call\n" + ".size ftrace_emulate_call_update_irqon, .-ftrace_emulate_call_update_irqon\n" + + ".global ftrace_emulate_call_update_nmi\n" + ".type ftrace_emulate_call_update_nmi, @function\n" + "ftrace_emulate_call_update_nmi:\n\t" + "push %gs:ftrace_bp_call_nmi_return\n\t" + "jmp *ftrace_update_func_call\n" + ".size ftrace_emulate_call_update_nmi, .-ftrace_emulate_call_update_nmi\n" + ".previous\n"); + +STACK_FRAME_NON_STANDARD(ftrace_emulate_call_irqoff); +STACK_FRAME_NON_STANDARD(ftrace_emulate_call_irqon); +STACK_FRAME_NON_STANDARD(ftrace_emulate_call_nmi); +STACK_FRAME_NON_STANDARD(ftrace_emulate_call_update_irqoff); +STACK_FRAME_NON_STANDARD(ftrace_emulate_call_update_irqon); +STACK_FRAME_NON_STANDARD(ftrace_emulate_call_update_nmi); + /* * A breakpoint was added to the code address we are about to * modify, and this is the handle that will just skip over it. @@ -295,10 +365,40 @@ int ftrace_int3_handler(struct pt_regs *regs) return 0; ip = regs->ip - 1; - if (!ftrace_location(ip) && !is_ftrace_caller(ip)) + if (ftrace_location(ip)) { + if (in_nmi()) { + this_cpu_write(ftrace_bp_call_nmi_return, (void *)ip + MCOUNT_INSN_SIZE); + regs->ip = (unsigned long) ftrace_emulate_call_nmi; + return 1; + } + this_cpu_write(ftrace_bp_call_return, (void *)ip + MCOUNT_INSN_SIZE); + if (regs->flags & X86_EFLAGS_IF) { + regs->flags &= ~X86_EFLAGS_IF; + regs->ip = (unsigned long) ftrace_emulate_call_irqoff; + } else { + regs->ip = (unsigned long) ftrace_emulate_call_irqon; + } + } else if (is_ftrace_caller(ip)) { + /* If it's a jump, just need to skip it */ + if (!ftrace_update_func_call) { + regs->ip += MCOUNT_INSN_SIZE -1; + return 1; + } + if (in_nmi()) { + this_cpu_write(ftrace_bp_call_nmi_return, (void *)ip + MCOUNT_INSN_SIZE); + regs->ip = (unsigned long) ftrace_emulate_call_update_nmi; + return 1; + } + this_cpu_write(ftrace_bp_call_return, (void *)ip + MCOUNT_INSN_SIZE); + if (regs->flags & X86_EFLAGS_IF) { + regs->flags &= ~X86_EFLAGS_IF; + regs->ip = (unsigned long) ftrace_emulate_call_update_irqoff; + } else { + regs->ip = (unsigned long) ftrace_emulate_call_update_irqon; + } + } else { return 0; - - regs->ip += MCOUNT_INSN_SIZE - 1; + } return 1; } @@ -859,6 +959,8 @@ void arch_ftrace_update_trampoline(struct ftrace_ops *ops) func = ftrace_ops_get_func(ops); + ftrace_update_func_call = (unsigned long)func; + /* Do a safe modify in case the trampoline is executing */ new = ftrace_call_replace(ip, (unsigned long)func); ret = update_ftrace_func(ip, new); @@ -960,6 +1062,7 @@ static int ftrace_mod_jmp(unsigned long ip, void *func) { unsigned char *new; + ftrace_update_func_call = 0; new = ftrace_jmp_replace(ip, (unsigned long)func); return 
update_ftrace_func(ip, new);
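For anyone who wants to play with the control-flow trick outside the kernel, the same "stash the return address, point the saved ip at a stub, let the stub push and jmp after the trap returns" dance can be reproduced from a user-space SIGTRAP handler. The sketch below is purely illustrative and not part of the patch: every name in it is made up for the demo, it assumes x86-64 Linux with GCC, and it glosses over details (red zone, signal safety) that the kernel code above has to care about.

/*
 * Illustrative user-space analogue of the int3 call emulation above.
 * The SIGTRAP handler, like ftrace_int3_handler(), cannot perform the
 * call from trap context, so it stashes the return address and points
 * the saved RIP at a stub that does the push + jmp afterwards.
 */
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <ucontext.h>

unsigned long saved_return;		/* stand-in for the per-CPU slot */

void target_func(void)
{
	puts("emulated call reached target_func()");
}

/* The stub: push the stashed return address, then jump to the target. */
asm(".text\n"
    ".type call_stub, @function\n"
    "call_stub:\n\t"
    "pushq saved_return(%rip)\n\t"
    "jmp target_func\n"
    ".size call_stub, .-call_stub\n");
extern void call_stub(void);

static void trap_handler(int sig, siginfo_t *info, void *ucontext)
{
	ucontext_t *uc = ucontext;

	/*
	 * RIP was saved pointing just past the 1-byte int3.  Pretend the
	 * patched instruction is 5 bytes, like an ftrace call site, and
	 * make the emulated call return to the instruction following it.
	 */
	saved_return = uc->uc_mcontext.gregs[REG_RIP] - 1 + 5;
	uc->uc_mcontext.gregs[REG_RIP] = (greg_t)(uintptr_t)call_stub;
}

int main(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = trap_handler;
	sa.sa_flags = SA_SIGINFO;
	sigaction(SIGTRAP, &sa, NULL);

	/*
	 * A 5-byte "patch site": int3 followed by four NOPs.  The clobber
	 * list reflects that the emulated call runs C code here.
	 */
	asm volatile("int3; .byte 0x90, 0x90, 0x90, 0x90"
		     ::: "memory", "cc", "rax", "rcx", "rdx", "rsi", "rdi",
		       "r8", "r9", "r10", "r11");

	puts("back in main() after the emulated call");
	return 0;
}

Running it should print the target's message followed by the line after the "patch site", i.e. the emulated call both reaches the target and returns to the right place, which is exactly what the trampolined ftrace case needs.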
On Tue, 30 Apr 2019 13:03:59 -0400 Steven Rostedt <rostedt@goodmis.org> wrote: > I also prefer Josh's stack shift solution, as I personally believe > that's a cleaner solution. But I went ahead and implemented Linus's > version to get it working for ftrace. Here's the code, and it survived > some preliminary tests. Well, it passed the second level of tests. If this is the way we are going, I could add comments to the code, apply it to my queue, run it through the rest of my test suite and have it ready for the merge window. I may add a stable tag to it so it goes back to where live kernel patching was added, as it fixes a potential bug there. -- Steve
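For contrast, the stack-shift alternative referenced above needs little more than the int3_emulate_call() helper Andy sketched earlier in the thread. A rough sketch of what that could look like, assuming the int3 entry asm has already opened a one-word gap below the iret frame (the gap being the contentious part), is:

static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val)
{
	/*
	 * Only safe if the int3 entry code reserved an unused word below
	 * the saved pt_regs, so growing the original stack here cannot
	 * overwrite anything the interrupted context owns.
	 */
	regs->sp -= sizeof(unsigned long);
	*(unsigned long *)regs->sp = val;
}

static inline void int3_emulate_call(struct pt_regs *regs, unsigned long func)
{
	/* int3 is 1 byte; the call being emulated is 5 bytes. */
	int3_emulate_push(regs, regs->ip - 1 + 5);
	regs->ip = func;
}

With something like that in place, ftrace_int3_handler() could simply emulate the call to ftrace_regs_caller() (or the updated target) and return, with no stubs and no per-context return-address bookkeeping at all.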
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile index 00b7e27bc2b7..0b63ae02b1f3 100644 --- a/arch/x86/kernel/Makefile +++ b/arch/x86/kernel/Makefile @@ -93,6 +93,7 @@ obj-$(CONFIG_LIVEPATCH) += livepatch.o obj-$(CONFIG_FUNCTION_TRACER) += ftrace_$(BITS).o obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += ftrace.o obj-$(CONFIG_FTRACE_SYSCALLS) += ftrace.o +obj-$(CONFIG_DYNAMIC_FTRACE) += ftrace_int3_stubs.o obj-$(CONFIG_X86_TSC) += trace_clock.o obj-$(CONFIG_KEXEC_CORE) += machine_kexec_$(BITS).o obj-$(CONFIG_KEXEC_CORE) += relocate_kernel_$(BITS).o crash.o diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c index ef49517f6bb2..917494f35cba 100644 --- a/arch/x86/kernel/ftrace.c +++ b/arch/x86/kernel/ftrace.c @@ -280,25 +280,84 @@ static nokprobe_inline int is_ftrace_caller(unsigned long ip) return 0; } -/* - * A breakpoint was added to the code address we are about to - * modify, and this is the handle that will just skip over it. - * We are either changing a nop into a trace call, or a trace - * call to a nop. While the change is taking place, we treat - * it just like it was a nop. - */ +extern void ftrace_graph_call(void); + +asmlinkage void ftrace_int3_stub_regs_caller(void); +asmlinkage void ftrace_int3_stub_list_func(void); + +/* A breakpoint was added to the code address we are about to modify. */ int ftrace_int3_handler(struct pt_regs *regs) { unsigned long ip; + bool is_ftrace_location; + struct ftrace_int3_stack *stack; + int slot; if (WARN_ON_ONCE(!regs)) return 0; ip = regs->ip - 1; - if (!ftrace_location(ip) && !is_ftrace_caller(ip)) + is_ftrace_location = ftrace_location(ip); + if (!is_ftrace_location && !is_ftrace_caller(ip)) return 0; - regs->ip += MCOUNT_INSN_SIZE - 1; + ip += MCOUNT_INSN_SIZE; + + if (!is_ftrace_location && + ftrace_update_func == (unsigned long)ftrace_graph_call) { + /* + * The insn at ftrace_graph_call is being changed from a + * nop to a jmp or vice versa. Treat it as a nop and + * skip over it. + */ + regs->ip = ip; + return 1; + } + + /* + * The insn having the breakpoint on it is either some mcount + * call site or one of ftrace_call, ftrace_regs_call and their + * equivalents within some trampoline. The currently pending + * transition is known to turn the insn from a nop to a call, + * from a call to a nop or to change the target address of an + * existing call. We're going to emulate a call to the most + * generic implementation capable of handling any possible + * configuration. For the mcount sites that would be + * ftrace_regs_caller() and for the remaining calls, which all + * have got some ftrace_func_t target, ftrace_ops_list_func() + * will do the right thing. + * + * However, the call insn can't get emulated from this trap + * handler here. Rewrite the iret frame's ->ip value to one of + * the ftrace_int3_stub instances which will then setup + * everything in the original context. The address following + * the current insn will be passed to the stub via the + * ftrace_int3_stack. + */ + stack = ¤t_thread_info()->ftrace_int3_stack; + if (WARN_ON_ONCE(stack->depth >= 4)) { + /* + * This should not happen as at most one stack slot is + * required per the contexts "normal", "irq", + * "softirq" and "nmi" each. However, be conservative + * and treat it like a nop. + */ + regs->ip = ip; + return 1; + } + + /* + * Make sure interrupts will see the incremented ->depth value + * before writing the stack entry. 
+ */ + slot = stack->depth; + WRITE_ONCE(stack->depth, slot + 1); + WRITE_ONCE(stack->slots[slot], ip); + + if (is_ftrace_location) + regs->ip = (unsigned long)ftrace_int3_stub_regs_caller; + else + regs->ip = (unsigned long)ftrace_int3_stub_list_func; return 1; } @@ -949,8 +1008,6 @@ void arch_ftrace_trampoline_free(struct ftrace_ops *ops) #ifdef CONFIG_FUNCTION_GRAPH_TRACER #ifdef CONFIG_DYNAMIC_FTRACE -extern void ftrace_graph_call(void); - static unsigned char *ftrace_jmp_replace(unsigned long ip, unsigned long addr) { return ftrace_text_replace(0xe9, ip, addr); diff --git a/arch/x86/kernel/ftrace_int3_stubs.S b/arch/x86/kernel/ftrace_int3_stubs.S new file mode 100644 index 000000000000..ef5f580450bb --- /dev/null +++ b/arch/x86/kernel/ftrace_int3_stubs.S @@ -0,0 +1,61 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* Copyright (C) 2019 SUSE Linux GmbH */ + +#include <asm/asm.h> +#include <asm/percpu.h> +#include <asm/unwind_hints.h> +#include <asm/asm-offsets.h> +#include <linux/linkage.h> + +#ifdef CONFIG_X86_64 +#define WORD_SIZE 8 +#else +#define WORD_SIZE 4 +#endif + +.macro ftrace_int3_stub call_target + /* + * We got here from ftrace_int3_handler() because a breakpoint + * on a call insn currently being modified has been + * hit. ftrace_int3_handler() can't emulate the function call + * directly, because it's running at a different position on + * the stack, obviously. Hence it sets the regs->ip to this + * stub so that we end up here upon the iret from the int3 + * handler. The stack is now in its original state and we can + * emulate the function call insn by pushing the return + * address onto the stack and jumping to the call target. The + * desired return address has been put onto the ftrace_int3_stack + * kept within struct thread_info. + */ + UNWIND_HINT_EMPTY + /* Reserve room for the emulated call's return address. */ + sub $WORD_SIZE, %_ASM_SP + /* + * Pop the return address from ftrace_int3_ip_stack and write + * it to the location just reserved. Be careful to retrieve + * the address before decrementing ->depth in order to protect + * against nested contexts clobbering it. + */ + push %_ASM_AX + push %_ASM_CX + push %_ASM_DX + mov PER_CPU_VAR(current_task), %_ASM_AX + mov TASK_TI_ftrace_int3_depth(%_ASM_AX), %_ASM_CX + dec %_ASM_CX + mov TASK_TI_ftrace_int3_slots(%_ASM_AX, %_ASM_CX, WORD_SIZE), %_ASM_DX + mov %_ASM_CX, TASK_TI_ftrace_int3_depth(%_ASM_AX) + mov %_ASM_DX, 3*WORD_SIZE(%_ASM_SP) + pop %_ASM_DX + pop %_ASM_CX + pop %_ASM_AX + /* Finally, transfer control to the target function. */ + jmp \call_target +.endm + +ENTRY(ftrace_int3_stub_regs_caller) + ftrace_int3_stub ftrace_regs_caller +END(ftrace_int3_stub_regs_caller) + +ENTRY(ftrace_int3_stub_list_func) + ftrace_int3_stub ftrace_ops_list_func +END(ftrace_int3_stub_list_func)
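For readers following along without the rest of the series: the ftrace_int3_stack that both ftrace_int3_handler() and the stubs above index comes from an earlier patch that is not shown here. Its assumed shape, together with the asm-offsets entries the stub macro relies on, would be roughly the following (my reconstruction from the depth check and the TASK_TI_* offsets, not the actual patch):

/* Assumed from the earlier patch in this series (not shown here). */
struct ftrace_int3_stack {
	int		depth;
	unsigned long	slots[4];	/* task, irq, softirq and NMI level */
};

/* Lives in struct thread_info, which x86 embeds in struct task_struct. */
struct thread_info {
	unsigned long			flags;
	/* ... */
	struct ftrace_int3_stack	ftrace_int3_stack;
};

/* And in arch/x86/kernel/asm-offsets.c, something along the lines of: */
OFFSET(TASK_TI_ftrace_int3_depth, task_struct,
       thread_info.ftrace_int3_stack.depth);
OFFSET(TASK_TI_ftrace_int3_slots, task_struct,
       thread_info.ftrace_int3_stack.slots);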
With dynamic ftrace, ftrace patches call sites on x86 in three steps: 1. put a breakpoint at the location to be patched, 2. update the call site and 3. finally remove the breakpoint again. Note that the breakpoint handler, ftrace_int3_handler(), simply makes execution skip over the instruction being patched. This patching happens in the following circumstances: a.) the global ftrace_trace_function changes and the call sites at ftrace_call and ftrace_regs_call get patched, b.) an ftrace_ops' ->func changes and the call site in its trampoline gets patched, c.) graph tracing gets enabled/disabled and the jump site at ftrace_graph_call gets patched, d.) a mcount site gets converted from nop -> call, call -> nop, or call -> call. The last case, i.e. a mcount call getting redirected, is possible in e.g. a transition from trampolined to not trampolined upon a user enabling function tracing on a live patched function. ftrace_int3_handler() simply skipping over the updated insn is quite problematic in the context of live patching, because it means that a live patched function temporarily gets reverted to its unpatched original, and this breaks the live patching consistency model. But even without live patching, it is desirable to avoid missing traces when making changes to the tracing setup. Make ftrace_int3_handler() not skip over the fops invocation, but instead modify the interrupted control flow to issue a call as needed. Case c.) from the list above can be ignored, because there a jmp instruction gets changed to a nop or vice versa. The remaining cases a.), b.) and d.) all involve call instructions. For a.) and b.), the call always goes to some ftrace_func_t and the generic ftrace_ops_list_func() implementation will be able to demultiplex and do the right thing. For case d.), the call target is one of ftrace_caller(), ftrace_regs_caller() or some ftrace_ops' trampoline. Because providing the register state won't cause any harm for !FTRACE_OPS_FL_SAVE_REGS ftrace_ops, ftrace_regs_caller() would be a suitable target capable of handling any case. ftrace_int3_handler()'s context is obviously different from that of the interrupted call instruction. In order to be able to emulate the call within the original context, make ftrace_int3_handler() set its iret frame's ->ip to some helper stub. Upon return from the trap, this stub will then mimic the call by pushing the return address onto the stack and issuing a jmp to the target address. As described above, the jmp target will be either ftrace_ops_list_func() or ftrace_regs_caller(). Provide one such stub implementation for each of the two cases. Finally, the desired return address, which is derived from the trapping ->ip, must get passed from ftrace_int3_handler() to these stubs. Make ftrace_int3_handler() push it onto the ftrace_int3_stack introduced by an earlier patch and let the stubs consume it. Be careful to use proper compiler barriers such that nested int3 handling from e.g. irqs won't clobber entries owned by outer instances. Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Nicolai Stange <nstange@suse.de> --- arch/x86/kernel/Makefile | 1 + arch/x86/kernel/ftrace.c | 79 +++++++++++++++++++++++++++++++------ arch/x86/kernel/ftrace_int3_stubs.S | 61 ++++++++++++++++++++++++++++ 3 files changed, 130 insertions(+), 11 deletions(-) create mode 100644 arch/x86/kernel/ftrace_int3_stubs.S
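To make the "ftrace_ops_list_func() will be able to demultiplex" argument a bit more concrete: stripped of recursion and RCU protection, the list func is essentially an iterator over every registered ftrace_ops that lets each op filter by its own hash, so calling it from an emulated call site can never invoke a callback that did not ask to trace that function. A greatly simplified sketch (the real implementation lives in kernel/trace/ftrace.c and handles recursion, RCU and the control ops):

static void ftrace_ops_list_func_sketch(unsigned long ip, unsigned long parent_ip,
					struct ftrace_ops *op_unused,
					struct pt_regs *regs)
{
	struct ftrace_ops *op;

	/*
	 * Walk all registered ftrace_ops; each op's hash decides whether
	 * it wants to trace this ip, so no callback runs spuriously.
	 */
	for (op = ftrace_ops_list; op != &ftrace_list_end; op = op->next) {
		if (ftrace_ops_test(op, ip, regs))
			op->func(ip, parent_ip, op, regs);
	}
}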