Message ID | 1354817466.30905.13.camel@linaro1.home (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Thu, 2012-12-06 at 18:11 +0000, Jon Medhurst (Tixy) wrote: > When the generic ftrace implementation modifies code for trace-points it > uses stop_machine() to call ftrace_modify_all_code() on one CPU. This > ultimately calls the ARM specific function ftrace_modify_code() which > updates the instruction and then does flush_icache_range(). As this > cache flushing only operates on the local CPU then other cores may end > up executing the old instruction if it's still in their icaches. > > This may or may not cause problems for the use of ftrace on kernels > compiled for ARM instructions. However, Thumb2 instructions can straddle > two cache lines so its possible for half the old instruction to be in > the cache and half the new one, leading to the CPU executing garbage. Hmm, your use of "may or may not" seems as you may not know this answer. I wonder if you can use the break point method as x86 does now, and remove the stop machine completely. Basically this is how it works: add sw breakpoints to all locations to modify (the bp handler just does a nop over the instruction). send an IPI to all CPUs to flush their icache. Modify the non breakpoint part of the instruction with the new instruction. send an IPI to all CPUs to flush their icache Replace the breakpoint with the finished instruction. Then you don't suffer the stomp_machine() latency hit. The system will slow a bit due to the breakpoints but there wont be a huge "halt" in the middle of processing. -- Steve > > This patch fixes this situation by providing an arch-specific > implementation of arch_ftrace_update_code() which ensures that after one > core has modified all the code, the other cores invalidate their icaches > before continuing. > > Signed-off-by: Jon Medhurst <tixy@linaro.org> > ---
On Thu, 2012-12-06 at 14:19 -0500, Steven Rostedt wrote: > Hmm, your use of "may or may not" seems as you may not know this answer. > I wonder if you can use the break point method as x86 does now, and > remove the stop machine completely. Basically this is how it works: > > add sw breakpoints to all locations to modify (the bp handler just does > a nop over the instruction). > > send an IPI to all CPUs to flush their icache. > > Modify the non breakpoint part of the instruction with the new > instruction. > > send an IPI to all CPUs to flush their icache > > Replace the breakpoint with the finished instruction. If I understand correctly then this method can't work on ARM because a 'software breakpoint' is 'replace an instruction with a known undefined instruction _of the same size_'. It haa to be the same size because code like this: it eq /* If condition code 'eq' true */ insA /* then execute this instruction */ insB /* Always execute this */ if we replace insA with a breakpoint which is shorter, then we have it eq /* If condition code 'eq' true */ bkpt /* then execute the breakpoint */ insA-part2 /* Always execute this garbage */ insB /* Always execute this */ and to complicate matters more, the 'it' instruction can make up to the next four instructions conditional, so you can't reverse decode the instruction stream reliably to even detect such code. And further, it's implementation defined (up to who every creates the silicon) whether an undefined instructions actually causes an abort when it occurs in such an 'it' block, it may just execute as a nop. Welcome to the work of ARM :-)
On Fri, 2012-12-07 at 09:22 +0000, Jon Medhurst (Tixy) wrote: > On Thu, 2012-12-06 at 14:19 -0500, Steven Rostedt wrote: > > Hmm, your use of "may or may not" seems as you may not know this answer. > > I wonder if you can use the break point method as x86 does now, and > > remove the stop machine completely. Basically this is how it works: > > > > add sw breakpoints to all locations to modify (the bp handler just does > > a nop over the instruction). > > > > send an IPI to all CPUs to flush their icache. > > > > Modify the non breakpoint part of the instruction with the new > > instruction. > > > > send an IPI to all CPUs to flush their icache > > > > Replace the breakpoint with the finished instruction. > > If I understand correctly then this method can't work on ARM because a > 'software breakpoint' is 'replace an instruction with a known undefined > instruction _of the same size_'. It haa to be the same size because code > like this: > > it eq /* If condition code 'eq' true */ > insA /* then execute this instruction */ > insB /* Always execute this */ > > if we replace insA with a breakpoint which is shorter, then we have > > it eq /* If condition code 'eq' true */ > bkpt /* then execute the breakpoint */ > insA-part2 /* Always execute this garbage */ Why always execute the garbage? Do what we do in x86, where the breakpoint is only 1 byte and the instruction being replaced is 5 bytes. The breakpoint handler returns to the instruction after the "garbage" (insB). > insB /* Always execute this */ > > and to complicate matters more, the 'it' instruction can make up to the > next four instructions conditional, so you can't reverse decode the > instruction stream reliably to even detect such code. > > And further, it's implementation defined (up to who every creates the > silicon) whether an undefined instructions actually causes an abort when > it occurs in such an 'it' block, it may just execute as a nop. > > Welcome to the work of ARM :-) > But also realize that function tracing is special :-) We have no cases like this. The instruction being replaced is a call to mcount. In fact, we replace it at boot with a nop. And this method only replaces that nop into a call to function tracer, or replaces the call to function tracer back to a nop. Always at the start of the function, and never involved with conditionals. This limitation that function tracing imposes on what we replace makes things a bit more sane in how we replace it. -- Steve
On Fri, 2012-12-07 at 09:03 -0500, Steven Rostedt wrote: > On Fri, 2012-12-07 at 09:22 +0000, Jon Medhurst (Tixy) wrote: > > On Thu, 2012-12-06 at 14:19 -0500, Steven Rostedt wrote: > > > Hmm, your use of "may or may not" seems as you may not know this answer. > > > I wonder if you can use the break point method as x86 does now, and > > > remove the stop machine completely. Basically this is how it works: > > > > > > add sw breakpoints to all locations to modify (the bp handler just does > > > a nop over the instruction). > > > > > > send an IPI to all CPUs to flush their icache. > > > > > > Modify the non breakpoint part of the instruction with the new > > > instruction. > > > > > > send an IPI to all CPUs to flush their icache > > > > > > Replace the breakpoint with the finished instruction. > > > > If I understand correctly then this method can't work on ARM because a > > 'software breakpoint' is 'replace an instruction with a known undefined > > instruction _of the same size_'. It haa to be the same size because code > > like this: > > > > it eq /* If condition code 'eq' true */ > > insA /* then execute this instruction */ > > insB /* Always execute this */ > > > > if we replace insA with a breakpoint which is shorter, then we have > > > > it eq /* If condition code 'eq' true */ > > bkpt /* then execute the breakpoint */ > > insA-part2 /* Always execute this garbage */ > > Why always execute the garbage? Do what we do in x86, where the > breakpoint is only 1 byte and the instruction being replaced is 5 bytes. We don't get any say in the matter, if the condition is false, then the CPU will skip over bkpt and go on to execute insA-part2, that's how the instruction set works. If the condition is true, then it's implementation defined whether the CPU will skip bkt or not. The 'it' instruction is a separate instruction to insA, it's not any kind of prefix used to make a more complex single instruction. You can do something like: iteet eq /* if-then-else-else-then */ insA /* executed if 'eq' */ insB /* executed if not 'eq' */ insC /* executed if not 'eq' */ insD /* executed if 'eq' */ which is five separate CPU instructions, and you can get interrupted after any of them (the current state of conditional instruction execution is stored in the status register). Replacing insB with a shorter 'breakpoint' would give... iteet eq /* if-then-else-else-then */ insA /* executed if 'eq' */ bkpt /* executed if not 'eq' (implementation defined) */ insB-part2 /* executed if not 'eq' (garbage) */ insC /* executed if 'eq' */ insD /* always executed */ which is not good ;-) > The breakpoint handler returns to the instruction after the > "garbage" (insB). > > > insB /* Always execute this */ > > > > and to complicate matters more, the 'it' instruction can make up to the > > next four instructions conditional, so you can't reverse decode the > > instruction stream reliably to even detect such code. > > > > And further, it's implementation defined (up to who every creates the > > silicon) whether an undefined instructions actually causes an abort when > > it occurs in such an 'it' block, it may just execute as a nop. > > > > Welcome to the work of ARM :-) > > > > But also realize that function tracing is special :-) We have no cases > like this. The instruction being replaced is a call to mcount. In fact, > we replace it at boot with a nop. And this method only replaces that nop > into a call to function tracer, or replaces the call to function tracer > back to a nop. Always at the start of the function, and never involved > with conditionals. This limitation that function tracing imposes on what > we replace makes things a bit more sane in how we replace it. Then perhaps the method you suggest will work on ARM :-). However, that is not something I personally propose to implement at this time. (I was doing my good Samaritan act by trying to fix the crashes which another team was getting when trying to use ftrace.)
On Fri, 2012-12-07 at 14:55 +0000, Jon Medhurst (Tixy) wrote: > > > But also realize that function tracing is special :-) We have no cases > > like this. The instruction being replaced is a call to mcount. In fact, > > we replace it at boot with a nop. And this method only replaces that nop > > into a call to function tracer, or replaces the call to function tracer > > back to a nop. Always at the start of the function, and never involved > > with conditionals. This limitation that function tracing imposes on what > > we replace makes things a bit more sane in how we replace it. > > Then perhaps the method you suggest will work on ARM :-). However, that > is not something I personally propose to implement at this time. (I was > doing my good Samaritan act by trying to fix the crashes which another > team was getting when trying to use ftrace.) > I'm not NACKing your previous patch, I was just suggesting to bring ARM up to the future :-) I have no problems with the patch, but I just want to put it out there that there's better ways. It's part of the remove stomp_machine() crusade ;-) -- Steve
On Fri, 2012-12-07 at 10:28 -0500, Steven Rostedt wrote: > I'm not NACKing your previous patch, I was just suggesting to bring ARM > up to the future :-) ARM is the future ;-) > > I have no problems with the patch, but I just want to put it out there > that there's better ways. It's part of the remove stomp_machine() > crusade ;-) Indeed, when I first cam across stop_machine(), I though 'yuck!', especially when I realised Linux scheduling seems to mean the system goes idle until the the next tick triggers scheduling of these 'highest priority' threads.
On Fri, 2012-12-07 at 15:40 +0000, Jon Medhurst (Tixy) wrote: > On Fri, 2012-12-07 at 10:28 -0500, Steven Rostedt wrote: > > I'm not NACKing your previous patch, I was just suggesting to bring ARM > > up to the future :-) > > ARM is the future ;-) > I had a strong feeling you would reply with that. -- Steve
On Fri, Dec 07, 2012 at 10:28:54AM -0500, Steven Rostedt wrote: > On Fri, 2012-12-07 at 14:55 +0000, Jon Medhurst (Tixy) wrote: > > > > > But also realize that function tracing is special :-) We have no cases > > > like this. The instruction being replaced is a call to mcount. In fact, > > > we replace it at boot with a nop. And this method only replaces that nop > > > into a call to function tracer, or replaces the call to function tracer > > > back to a nop. Always at the start of the function, and never involved > > > with conditionals. This limitation that function tracing imposes on what > > > we replace makes things a bit more sane in how we replace it. > > > > Then perhaps the method you suggest will work on ARM :-). However, that > > is not something I personally propose to implement at this time. (I was > > doing my good Samaritan act by trying to fix the crashes which another > > team was getting when trying to use ftrace.) > > > > I'm not NACKing your previous patch, I was just suggesting to bring ARM > up to the future :-) > > I have no problems with the patch, but I just want to put it out there > that there's better ways. It's part of the remove stomp_machine() > crusade ;-) That's fine if there are better ways. If your view is that this would bring things "up to the future" consider this: what you suggest is possible with the standard ARM 32-bit instruction set. With the more modern Thumb instruction set, because we now effectively have prefixes, where those prefixes control the execution of the following instructions, what you suggest becomes no longer possible. So, it's not a question of bringing stuff up to the future at all... you can call it a design regression of you will, but you're really making demands about how CPUs work which are outside of your remit. Think of this a bit like you changing the opcodes immediately following a 'LOCK' prefix on x86. I suspect divorsing the following opcodes from its prefix would be very bad for the instructions atomicity.
On Fri, 2012-12-07 at 16:23 +0000, Russell King - ARM Linux wrote: > That's fine if there are better ways. If your view is that this would > bring things "up to the future" consider this: what you suggest is possible > with the standard ARM 32-bit instruction set. With the more modern Thumb > instruction set, because we now effectively have prefixes, where those > prefixes control the execution of the following instructions, what you > suggest becomes no longer possible. > > So, it's not a question of bringing stuff up to the future at all... you > can call it a design regression of you will, but you're really making > demands about how CPUs work which are outside of your remit. > > Think of this a bit like you changing the opcodes immediately following a > 'LOCK' prefix on x86. I suspect divorsing the following opcodes from its > prefix would be very bad for the instructions atomicity. But what about the limitations that the function tracer imposes on the code that gets modified by stop_machine()? 1) the original code is simply a call to mcount 2) on boot up, that call gets converted into a nop 3) the code that gets changed will only be converting a nop to a call into the function tracer, and back again. IOW, it's a very limited subset of the ARM assembly that gets touched. I'm not sure what the op codes are for the above, but I can imagine they don't impose the prefixes as you described. If that's the case, is it still possible to change to the breakpoint method? -- Steve
On Fri, Dec 07, 2012 at 11:36:40AM -0500, Steven Rostedt wrote: > On Fri, 2012-12-07 at 16:23 +0000, Russell King - ARM Linux wrote: > > > That's fine if there are better ways. If your view is that this would > > bring things "up to the future" consider this: what you suggest is possible > > with the standard ARM 32-bit instruction set. With the more modern Thumb > > instruction set, because we now effectively have prefixes, where those > > prefixes control the execution of the following instructions, what you > > suggest becomes no longer possible. > > > > So, it's not a question of bringing stuff up to the future at all... you > > can call it a design regression of you will, but you're really making > > demands about how CPUs work which are outside of your remit. > > > > Think of this a bit like you changing the opcodes immediately following a > > 'LOCK' prefix on x86. I suspect divorsing the following opcodes from its > > prefix would be very bad for the instructions atomicity. > > But what about the limitations that the function tracer imposes on the > code that gets modified by stop_machine()? > > 1) the original code is simply a call to mcount > > 2) on boot up, that call gets converted into a nop > > 3) the code that gets changed will only be converting a nop to a call > into the function tracer, and back again. > > IOW, it's a very limited subset of the ARM assembly that gets touched. > I'm not sure what the op codes are for the above, but I can imagine they > don't impose the prefixes as you described. > > If that's the case, is it still possible to change to the breakpoint > method? I have no idea; I've no idea how ftrace works on ARM. That's something other people use and deal with. Last (and only) time I used the built-in kernel tracing facilities I ended up giving up with it and going back to using my sched-clock+record+printk based approaches instead on account of the kernels built-in tracing being far too heavy.
On Fri, 2012-12-07 at 16:45 +0000, Russell King - ARM Linux wrote: > On Fri, Dec 07, 2012 at 11:36:40AM -0500, Steven Rostedt wrote: > > But what about the limitations that the function tracer imposes on the > > code that gets modified by stop_machine()? > > > > 1) the original code is simply a call to mcount > > > > 2) on boot up, that call gets converted into a nop > > > > 3) the code that gets changed will only be converting a nop to a call > > into the function tracer, and back again. > > > > IOW, it's a very limited subset of the ARM assembly that gets touched. > > I'm not sure what the op codes are for the above, but I can imagine they > > don't impose the prefixes as you described. > > > > If that's the case, is it still possible to change to the breakpoint > > method? > > I have no idea; I've no idea how ftrace works on ARM. I know how ftrace works on ARM, I'm just asking about the way the architecture works in general. So to answer my question, you don't need to know anything about ftrace. I'll make my question more general: If I have a nop, that is a size of a call (branch and link), which is near the beginning of a function and not part of any conditional, and I want to convert it into a call (branch and link), would adding a breakpoint to it, modifying it to the call, and then removing the breakpoint be possible? Of course it would require syncing in between steps, but my question is, if the above is possible on a thumb2 ARM processor? > That's something > other people use and deal with. Last (and only) time I used the built-in > kernel tracing facilities I ended up giving up with it and going back to > using my sched-clock+record+printk based approaches instead on account > of the kernels built-in tracing being far too heavy. Too bad. Which tracing facilities did you use? Function tracing? And IIRC, ARM originally had only the static tracing, which was extremely heavy weight. Have you tried tracepoints? Also, have you tried my favorite way of debugging: trace_printk(). It acts just like printk but instead of recording to the console, it records into the ftrace buffer which can be read via the /debug/tracing/trace or dumped to the console with a sysrq-z. -- Steve
On Fri, 2012-12-07 at 12:13 -0500, Steven Rostedt wrote: > I'll make my question more general: > > If I have a nop, that is a size of a call (branch and link), which is > near the beginning of a function and not part of any conditional, and I > want to convert it into a call (branch and link), would adding a > breakpoint to it, modifying it to the call, and then removing the > breakpoint be possible? Of course it would require syncing in between > steps, but my question is, if the above is possible on a thumb2 ARM > processor? I believe so. The details are (repeating your earlier explanation) ... 1. Replace first half of nop with 16bit 'breakpoint' instruction. 2. Sync.(cache flush to PoU + IPIs to make other cores invalidate the icache for changed part of the nop instruction). 3. Replace second half of nop with second half of the call instruction. 4. Sync. 5. Replace the breakpoint with the first half of the call instruction. 6. Sync And if any core execute the breakpoint instruction, then the handler ensures execution continues at the instruction after the nop were trying to replace. However, wouldn't we need any of this breakpoint malarkey, why not just just use a 16-bit branch instruction which branches over the second half of the nop? :-)
On Fri, 2012-12-07 at 17:45 +0000, Jon Medhurst (Tixy) wrote: > On Fri, 2012-12-07 at 12:13 -0500, Steven Rostedt wrote: > > I'll make my question more general: > > > > If I have a nop, that is a size of a call (branch and link), which is > > near the beginning of a function and not part of any conditional, and I > > want to convert it into a call (branch and link), would adding a > > breakpoint to it, modifying it to the call, and then removing the > > breakpoint be possible? Of course it would require syncing in between > > steps, but my question is, if the above is possible on a thumb2 ARM > > processor? > > I believe so. The details are (repeating your earlier explanation) ... > > 1. Replace first half of nop with 16bit 'breakpoint' instruction. > > 2. Sync.(cache flush to PoU + IPIs to make other cores invalidate the > icache for changed part of the nop instruction). > > 3. Replace second half of nop with second half of the call instruction. > > 4. Sync. > > 5. Replace the breakpoint with the first half of the call instruction. > > 6. Sync > > And if any core execute the breakpoint instruction, then the handler > ensures execution continues at the instruction after the nop were trying > to replace. Exactly! > > However, wouldn't we need any of this breakpoint malarkey, why not just > just use a 16-bit branch instruction which branches over the second half > of the nop? :-) If you can get away with that, sure. Or better yet. If the arch supports it, you can do what I did with powerpc. That was just replace the nop with the 32bit branch, and the 32bit branch with a 32bit nop. No nops. No multiple steps in between. I just did the swap of all function tracepoints in one fell swoop, and then did the icache sync. Now that's if the arch doesn't have issues with swapping code like this. Can a 32bit branch-and-link be spread across cache lines? On x86 the call is 5 bytes and can be. Thus, we were forced to do the breakpoint because we don't know how the instructions are laid out on the cache lines. If 32bit can't be swapped but 16bit never crosses cache lines, then your approach may also work. -- Steve
On Fri, Dec 07, 2012 at 12:13:56PM -0500, Steven Rostedt wrote: > On Fri, 2012-12-07 at 16:45 +0000, Russell King - ARM Linux wrote: > > On Fri, Dec 07, 2012 at 11:36:40AM -0500, Steven Rostedt wrote: > > > > But what about the limitations that the function tracer imposes on the > > > code that gets modified by stop_machine()? > > > > > > 1) the original code is simply a call to mcount > > > > > > 2) on boot up, that call gets converted into a nop > > > > > > 3) the code that gets changed will only be converting a nop to a call > > > into the function tracer, and back again. > > > > > > IOW, it's a very limited subset of the ARM assembly that gets touched. > > > I'm not sure what the op codes are for the above, but I can imagine they > > > don't impose the prefixes as you described. > > > > > > If that's the case, is it still possible to change to the breakpoint > > > method? > > > > I have no idea; I've no idea how ftrace works on ARM. > > I know how ftrace works on ARM, I'm just asking about the way the > architecture works in general. So to answer my question, you don't need > to know anything about ftrace. I'll make my question more general: > > If I have a nop, that is a size of a call (branch and link), which is > near the beginning of a function and not part of any conditional, and I > want to convert it into a call (branch and link), would adding a > breakpoint to it, modifying it to the call, and then removing the > breakpoint be possible? Of course it would require syncing in between > steps, but my question is, if the above is possible on a thumb2 ARM > processor? So, you're asking me to wave hands in the air, make guesses and hope that I hit the situation you're knowledgable of without actually telling me anything. Great - you really know how to frustrate people... If you're saying that the nop was created at _compile_ time, to be a 32-bit instruction then maybe - but you have a problem. That 32-bit instruction may stradle a 32-bit boundary (worse if it stradles a page), and _any_ changes to that instruction will not be atomic - other CPUs will see the store as two separate operations which, given the right timing may create an illegal instruction. Even changing it to a breakpoint is potentially problematical. So we'd need to ensure that no other CPU was executing the code while we modify it. Now, if you're going to say that ftrace inserts a 32-bit nop with appropriate alignment constraints at _compile_ time, then maybe that would work, but then your update to the instruction might as well just be NOP->BL because that's a word-write to an aligned address which will be atomic (in so far as either the entire instruction has been updated _or_ none of the instruction has been updated.) In a previous email you intimated that these NOPs are inserted by ftrace at boot time. Given that these NOPs would have to be 32-bit instructions, I'd hope that they're also replacing 32-bit instructions and not two 16-bit instructions which might be prefixed by a "if-then" instruction. Maybe now you'll provide some information on how ftrace works as you should now realise that your "simple question" doesn't have a simple answer. > > That's something > > other people use and deal with. Last (and only) time I used the built-in > > kernel tracing facilities I ended up giving up with it and going back to > > using my sched-clock+record+printk based approaches instead on account > > of the kernels built-in tracing being far too heavy. > > Too bad. Which tracing facilities did you use? Function tracing? And > IIRC, ARM originally had only the static tracing, which was extremely > heavy weight. Have you tried tracepoints? Also, have you tried my > favorite way of debugging: trace_printk(). It acts just like printk but > instead of recording to the console, it records into the ftrace buffer > which can be read via the /debug/tracing/trace or dumped to the console > with a sysrq-z. TBH I don't remember, it was a few years ago that I last had to measure stuff.
On Fri, 2012-12-07 at 13:06 -0500, Steven Rostedt wrote: > If you can get away with that, sure. Or better yet. If the arch supports > it, you can do what I did with powerpc. That was just replace the nop > with the 32bit branch, and the 32bit branch with a 32bit nop. No nops. s/No nops/No breakpoints/ > No multiple steps in between. I just did the swap of all function > tracepoints in one fell swoop, and then did the icache sync. > -- Steve
On Fri, 2012-12-07 at 13:06 -0500, Steven Rostedt wrote: > > > > However, wouldn't we need any of this breakpoint malarkey, why not just > > just use a 16-bit branch instruction which branches over the second half > > of the nop? :-) > > If you can get away with that, sure. Or better yet. If the arch supports > it, you can do what I did with powerpc. That was just replace the nop > with the 32bit branch, and the 32bit branch with a 32bit nop. No nops. > No multiple steps in between. I just did the swap of all function > tracepoints in one fell swoop, and then did the icache sync. But that gets back to the original problem that the 32bit instructions can straddle two cache lines. > > Now that's if the arch doesn't have issues with swapping code like this. > Can a 32bit branch-and-link be spread across cache lines? Yes, Thumb2 instructions are either 16 or 32bit and are aligned to 16bit boundaries. So 16bit instructions never cross cache lines and all 32bit instructions can. > On x86 the > call is 5 bytes and can be. Thus, we were forced to do the breakpoint > because we don't know how the instructions are laid out on the cache > lines. > > If 32bit can't be swapped but 16bit never crosses cache lines, then your > approach may also work. It should.
[ Added hpa, as he knows a bit about x86, and breakpoints ] On Fri, 2012-12-07 at 18:13 +0000, Russell King - ARM Linux wrote: > So, you're asking me to wave hands in the air, make guesses and hope that > I hit the situation you're knowledgable of without actually telling me > anything. Great - you really know how to frustrate people... Sorry, I thought I was telling you something. I guess we have a bit of a disconnect. Jon did hit what I was trying to ask. > > If you're saying that the nop was created at _compile_ time, to be a 32-bit Actually the call is created at compile time. When compiled with -pg, gcc will insert branch and link calls to a special function (actually implemented in assembly) to "mcount". Looking at the arm implementation, it seems that the branch and link is 32bits (just confirming). On boot up, all calls to mcount are converted to 32bit nops. > instruction then maybe - but you have a problem. That 32-bit instruction > may stradle a 32-bit boundary (worse if it stradles a page), and _any_ > changes to that instruction will not be atomic - other CPUs will see the > store as two separate operations which, given the right timing may create > an illegal instruction. This is the same as x86. > > Even changing it to a breakpoint is potentially problematical. So we'd > need to ensure that no other CPU was executing the code while we modify > it. This is not the same as x86, I guess because x86 has a one byte breakpoint. Thus, it is stated in the x86 architecture (I believe, Peter, you can correct me if I'm wrong), that the only "safe" thing that can modify code, is a software breakpoint. Are you saying that thumb does not guarantee even software breakpoints from being added atomically? Doesn't that kill the purpose of a breakpoint? > > Now, if you're going to say that ftrace inserts a 32-bit nop with > appropriate alignment constraints at _compile_ time, then maybe that would > work, but then your update to the instruction might as well just be NOP->BL > because that's a word-write to an aligned address which will be atomic (in > so far as either the entire instruction has been updated _or_ none of the > instruction has been updated.) That's how it's done on powerpc. > > In a previous email you intimated that these NOPs are inserted by ftrace at > boot time. Given that these NOPs would have to be 32-bit instructions, I'd > hope that they're also replacing 32-bit instructions and not two 16-bit > instructions which might be prefixed by a "if-then" instruction. At compile time, it's a call to mcount, inserted by gcc, and at the beginning of functions, thus they should never be prefixed by a "if-then". > > Maybe now you'll provide some information on how ftrace works as you should > now realise that your "simple question" doesn't have a simple answer. Maybe, I've described it above, but I'll repeat it in more detail here. Ftrace uses the gcc -pg option that adds a call to 'mcount' to the beginning of almost every function. It does not add mcount to inlined functions, so these functions are truly functions and not inlined into other functions. Also some functions are manually ignored in the kernel when we annotate them with 'notrace'. Just to make things consistent, as gcc doesn't always honor the 'inline' code, I've defined inline to also contain 'notrace' as well. The 'mcount' function has to be written in assembly. On arm, it's implemented in arch/arm/kernel/entry-common.S. It looks like this: ENTRY(mcount) stmdb sp!, {lr} ldr lr, [fp, #-4] ldmia sp!, {pc} ENDPROC(mcount) Note, I removed the #ifdef/#else of CONFIG_DYNAMIC_FTRACE, because this modification only happens with that config enabled. So I only showed that version. During compile time, the recordmcount.c code is run against all .o files, and records the location of the mcount callers. It then creates a table of those locations and links them back into the .o file. On boot up (before SMP starts), this table is referenced, and all the calls to mcount are converted into 32 bit nops. Before this conversion, any code that hits the call will simply return back (as you can see by the above mcount definition). When we enable tracing, currently we use stop_machine(), all the nops in functions to be traced are then converted to another call, but not to mcount (as that just returns), but instead to 'ftrace_caller'. This function is also written in assembly and it handles the tracing when hit. When tracing is disabled, the same thing happens but we convert the call sites into nops. Does this make more sense? > > > > That's something > > > other people use and deal with. Last (and only) time I used the built-in > > > kernel tracing facilities I ended up giving up with it and going back to > > > using my sched-clock+record+printk based approaches instead on account > > > of the kernels built-in tracing being far too heavy. > > > > Too bad. Which tracing facilities did you use? Function tracing? And > > IIRC, ARM originally had only the static tracing, which was extremely > > heavy weight. Have you tried tracepoints? Also, have you tried my > > favorite way of debugging: trace_printk(). It acts just like printk but > > instead of recording to the console, it records into the ftrace buffer > > which can be read via the /debug/tracing/trace or dumped to the console > > with a sysrq-z. > > TBH I don't remember, it was a few years ago that I last had to measure > stuff. Yeah, things have improved since then :-) -- Steve
Hi Steve, On Fri, Dec 07, 2012 at 06:43:25PM +0000, Steven Rostedt wrote: > On Fri, 2012-12-07 at 18:13 +0000, Russell King - ARM Linux wrote: > > Even changing it to a breakpoint is potentially problematical. So we'd > > need to ensure that no other CPU was executing the code while we modify > > it. > > This is not the same as x86, I guess because x86 has a one byte > breakpoint. Thus, it is stated in the x86 architecture (I believe, > Peter, you can correct me if I'm wrong), that the only "safe" thing that > can modify code, is a software breakpoint. > > Are you saying that thumb does not guarantee even software breakpoints > from being added atomically? Doesn't that kill the purpose of a > breakpoint? For ARMv7, there are small subsets of instructions for ARM and Thumb which are guaranteed to be atomic wrt concurrent modification and execution of the instruction stream between different processors: Thumb: The 16-bit encodings of the B, NOP, BKPT, and SVC instructions. ARM: The B, BL, NOP, BKPT, SVC, HVC, and SMC instructions. but before your eyes light up at the presence of the BKPT instruction in that list; we don't actually use that in Linux and instead leave it for external (i.e. JTAG) debuggers so that they can operate without getting tangled up with spurious traps from the OS. Linux actually picks its own undefined instructions, which are obviously not included in the lists above. Also note that the 16-bit limitation for Thumb instructions above can actually be used to modify *half* of a BL instruction but, to keep things exciting, the PC-relative immediate is split across the two halves. However, you could in theory mess around with bottom 10 bits or so, depending on the exact encoding... Obviously this doesn't preclude the need for cache maintenance on both D and I side, but the atomicity guarantees are as I've described above. Will
On Fri, 2012-12-07 at 19:02 +0000, Will Deacon wrote: > For ARMv7, there are small subsets of instructions for ARM and Thumb which > are guaranteed to be atomic wrt concurrent modification and execution of > the instruction stream between different processors: > > Thumb: The 16-bit encodings of the B, NOP, BKPT, and SVC instructions. > ARM: The B, BL, NOP, BKPT, SVC, HVC, and SMC instructions. > > but before your eyes light up at the presence of the BKPT instruction in > that list; we don't actually use that in Linux and instead leave it for > external (i.e. JTAG) debuggers so that they can operate without getting > tangled up with spurious traps from the OS. Linux actually picks its own > undefined instructions, which are obviously not included in the lists above. My eyes actually lit up with the B instruction :-) As Jon showed, we could use a 16bit jump instead. Add the B to jump over the other half of the call (to all places that you want to modify). Send a sync to all CPUs to flush their caches. Modify the other half of the call, send another sync, and then modify the first half. > > Also note that the 16-bit limitation for Thumb instructions above can > actually be used to modify *half* of a BL instruction but, to keep things > exciting, the PC-relative immediate is split across the two halves. However, > you could in theory mess around with bottom 10 bits or so, depending on the > exact encoding... > > Obviously this doesn't preclude the need for cache maintenance on both D and > I side, but the atomicity guarantees are as I've described above. Right. Thanks for the update. -- Steve
Hi Jon, Back-pedalling a bit here, but I'm confused by one of your points below: On Fri, Dec 07, 2012 at 05:45:47PM +0000, Jon Medhurst (Tixy) wrote: > On Fri, 2012-12-07 at 12:13 -0500, Steven Rostedt wrote: > > I'll make my question more general: > > > > If I have a nop, that is a size of a call (branch and link), which is > > near the beginning of a function and not part of any conditional, and I > > want to convert it into a call (branch and link), would adding a > > breakpoint to it, modifying it to the call, and then removing the > > breakpoint be possible? Of course it would require syncing in between > > steps, but my question is, if the above is possible on a thumb2 ARM > > processor? > > I believe so. The details are (repeating your earlier explanation) ... > > 1. Replace first half of nop with 16bit 'breakpoint' instruction. Sort of -- you'd actually need 2x16-bit nops to make this work. > 2. Sync.(cache flush to PoU + IPIs to make other cores invalidate the > icache for changed part of the nop instruction). Why do you need to use IPIs for I-cache invalidation on other cores? For ARMv7 SMP (i.e. the multi-processing extensions) doing I-cache invalidation by MVA to PoU will be broadcast to the applicable domain for the shareability attributes of the address. So if you do icimvau with an inner-shareable virtual address, it will be broadcast by the hardware. > However, wouldn't we need any of this breakpoint malarkey, why not just > just use a 16-bit branch instruction which branches over the second half > of the nop? :-) Yes, and I think if you do use two 16-bit nops, you can even get rid of all the intermediate `sync' operations (I guess you might want one at the end if you want the call to become visible at a particular point). Will
On Fri, 2012-12-07 at 19:02 +0000, Will Deacon wrote: > For ARMv7, there are small subsets of instructions for ARM and Thumb which > are guaranteed to be atomic wrt concurrent modification and execution of > the instruction stream between different processors: > > Thumb: The 16-bit encodings of the B, NOP, BKPT, and SVC instructions. > ARM: The B, BL, NOP, BKPT, SVC, HVC, and SMC instructions. > So this means for things like kprobes which can modify arbitrary kernel code we are going to need to continue to always use some form of stop_the_whole_system() function? Also, kprobes currently uses patch_text() which only uses stop_machine for Thumb2 instructions which straddle a word boundary, so this needs changing?
On Mon, Dec 10, 2012 at 11:04:05AM +0000, Jon Medhurst (Tixy) wrote: > On Fri, 2012-12-07 at 19:02 +0000, Will Deacon wrote: > > For ARMv7, there are small subsets of instructions for ARM and Thumb which > > are guaranteed to be atomic wrt concurrent modification and execution of > > the instruction stream between different processors: > > > > Thumb: The 16-bit encodings of the B, NOP, BKPT, and SVC instructions. > > ARM: The B, BL, NOP, BKPT, SVC, HVC, and SMC instructions. > > > > So this means for things like kprobes which can modify arbitrary kernel > code we are going to need to continue to always use some form of > stop_the_whole_system() function? > > Also, kprobes currently uses patch_text() which only uses stop_machine > for Thumb2 instructions which straddle a word boundary, so this needs > changing? Yes; if you're modifying instructions other than those mentioned above, then you'll need to synchronise the CPUs, update the instructions, perform cache-maintenance on the writing CPU and then execute an isb on the executing core (this last bit isn't needed if you're going to go through an exception return to get back to the new code -- depends on how your stop/resume code works). For ftrace we can (hopefully) avoid a lot of this when we have known points of modification. Will
On Mon, 2012-12-10 at 10:04 +0000, Will Deacon wrote: > Hi Jon, > > Back-pedalling a bit here, but I'm confused by one of your points below: > > On Fri, Dec 07, 2012 at 05:45:47PM +0000, Jon Medhurst (Tixy) wrote: > > On Fri, 2012-12-07 at 12:13 -0500, Steven Rostedt wrote: > > > I'll make my question more general: > > > > > > If I have a nop, that is a size of a call (branch and link), which is > > > near the beginning of a function and not part of any conditional, and I > > > want to convert it into a call (branch and link), would adding a > > > breakpoint to it, modifying it to the call, and then removing the > > > breakpoint be possible? Of course it would require syncing in between > > > steps, but my question is, if the above is possible on a thumb2 ARM > > > processor? > > > > I believe so. The details are (repeating your earlier explanation) ... > > > > 1. Replace first half of nop with 16bit 'breakpoint' instruction. > > Sort of -- you'd actually need 2x16-bit nops to make this work. Why? > > > 2. Sync.(cache flush to PoU + IPIs to make other cores invalidate the > > icache for changed part of the nop instruction). > > Why do you need to use IPIs for I-cache invalidation on other cores? For > ARMv7 SMP (i.e. the multi-processing extensions) doing I-cache invalidation > by MVA to PoU will be broadcast to the applicable domain for the > shareability attributes of the address. So if you do icimvau with an > inner-shareable virtual address, it will be broadcast by the hardware. > > > However, wouldn't we need any of this breakpoint malarkey, why not just > > just use a 16-bit branch instruction which branches over the second half > > of the nop? :-) > > Yes, and I think if you do use two 16-bit nops, you can even get rid of all > the intermediate `sync' operations (I guess you might want one at the end if > you want the call to become visible at a particular point). Wont work. We are replacing a 32bit call with a nop. That nop must also be 32bits, because we could eventually replace the nop(s) with a 32bit call. Basically, we can never allow the second 16bit part ever be the next instruction. If the first 16bit nop is executed, and then the task gets preempted. The nops get converted to a 32bit call. The task gets scheduled again and now is executing the second 16bits of the 32bit call and we get unexpected (probably crashing) results. By having either a 16bit breakpoint whose handler returns after the second 16bit part, or a 16bit jump that simply jumps over the second half, then all this should work. When the CPU processes a 32bit instruction, it either processes all or non of it, correct? -- Steve
On Mon, Dec 10, 2012 at 01:02:17PM +0000, Steven Rostedt wrote: > On Mon, 2012-12-10 at 10:04 +0000, Will Deacon wrote: > > Hi Jon, > > > > Back-pedalling a bit here, but I'm confused by one of your points below: > > > > On Fri, Dec 07, 2012 at 05:45:47PM +0000, Jon Medhurst (Tixy) wrote: > > > On Fri, 2012-12-07 at 12:13 -0500, Steven Rostedt wrote: > > > > I'll make my question more general: > > > > > > > > If I have a nop, that is a size of a call (branch and link), which is > > > > near the beginning of a function and not part of any conditional, and I > > > > want to convert it into a call (branch and link), would adding a > > > > breakpoint to it, modifying it to the call, and then removing the > > > > breakpoint be possible? Of course it would require syncing in between > > > > steps, but my question is, if the above is possible on a thumb2 ARM > > > > processor? > > > > > > I believe so. The details are (repeating your earlier explanation) ... > > > > > > 1. Replace first half of nop with 16bit 'breakpoint' instruction. > > > > Sort of -- you'd actually need 2x16-bit nops to make this work. > > Why? Because the architecture doesn't provide any guarantees about concurrent modification of 32-bit nop instructions. If you stop the world every time, fine, but that's what we're trying to avoid, right? > > > However, wouldn't we need any of this breakpoint malarkey, why not just > > > just use a 16-bit branch instruction which branches over the second half > > > of the nop? :-) > > > > Yes, and I think if you do use two 16-bit nops, you can even get rid of all > > the intermediate `sync' operations (I guess you might want one at the end if > > you want the call to become visible at a particular point). > > Wont work. We are replacing a 32bit call with a nop. That nop must also > be 32bits, because we could eventually replace the nop(s) with a 32bit > call. Basically, we can never allow the second 16bit part ever be the > next instruction. If the first 16bit nop is executed, and then the task > gets preempted. The nops get converted to a 32bit call. The task gets > scheduled again and now is executing the second 16bits of the 32bit call > and we get unexpected (probably crashing) results. Damn, I didn't realise you wanted to put the 32-bit call back on pre-emption. Still, the `sync' is not needed when patching in a b for a nop. > By having either a 16bit breakpoint whose handler returns after the > second 16bit part, or a 16bit jump that simply jumps over the second > half, then all this should work. When the CPU processes a 32bit > instruction, it either processes all or non of it, correct? If you have two 16-bit nops, patching the first to branch over the second will work. Will
Steven Rostedt wrote: > > Yes, and I think if you do use two 16-bit nops, you can even get rid of all > > the intermediate `sync' operations (I guess you might want one at the end if > > you want the call to become visible at a particular point). > > Wont work. We are replacing a 32bit call with a nop. That nop must also > be 32bits, because we could eventually replace the nop(s) with a 32bit > call. Basically, we can never allow the second 16bit part ever be the > next instruction. If the first 16bit nop is executed, and then the task > gets preempted. The nops get converted to a 32bit call. The task gets > scheduled again and now is executing the second 16bits of the 32bit call > and we get unexpected (probably crashing) results. > > By having either a 16bit breakpoint whose handler returns after the > second 16bit part, or a 16bit jump that simply jumps over the second > half, then all this should work. When the CPU processes a 32bit > instruction, it either processes all or non of it, correct? Sounds good, except what Will wrote a few days ago: On Fri, 2012-12-07 at 19:02 +0000, Will Deacon wrote: > For ARMv7, there are small subsets of instructions for ARM and Thumb which > are guaranteed to be atomic wrt concurrent modification and execution of > the instruction stream between different processors: > > Thumb: The 16-bit encodings of the B, NOP, BKPT, and SVC instructions. > ARM: The B, BL, NOP, BKPT, SVC, HVC, and SMC instructions. Thumb 32-bit ftrace call isn't in the above list. Questions: does the above concurrent modification guarantee require both the old instruction _and_ the new one to be among those listed, or is it enough to be just the new one (for example when setting a normal software breakpoint, that would be useful)? Can it be the old one and not the new (for example when removing a software breakpoint, that would be useful)? Does that subset mean replacing any of the listed instructions by any of the others is ok, or any of the listed with another of the same type? (I guess as a matter of architecture design, it makes sense to guarantee only a short list, because of occasions when the hardware, or a software emulation through traps, or a simulation, might read the instruction memory more than once.) This is what makes me wonder, if it's safe to replace the 32-bit mcount call with a 16-bit short jump: > On Mon, Dec 10, 2012 at 11:04:05AM +0000, Jon Medhurst (Tixy) wrote: > > So this means for things like kprobes which can modify arbitrary kernel > > code we are going to need to continue to always use some form of > > stop_the_whole_system() function? > > > > Also, kprobes currently uses patch_text() which only uses stop_machine > > for Thumb2 instructions which straddle a word boundary, so this needs > > changing? Will Deacon replied: > Yes; if you're modifying instructions other than those mentioned above, then > you'll need to synchronise the CPUs, update the instructions, perform > cache-maintenance on the writing CPU and then execute an isb on the > executing core (this last bit isn't needed if you're going to go through an > exception return to get back to the new code -- depends on how your > stop/resume code works). If I've understood that exchange, it implies that using patch_text() to replace an instruction not in the list of special ones, with a trap or jump, isn't ok? And so it's ok to replace the NOP with a short branch (since 16-bit "B" is in the list), but it's not ok to replace 16-bit "B" with the 32-bit ftrace call; and the same going the other way? Best, -- Jamie
On Mon, Dec 10, 2012 at 08:02:17AM -0500, Steven Rostedt wrote: > On Mon, 2012-12-10 at 10:04 +0000, Will Deacon wrote: > > Yes, and I think if you do use two 16-bit nops, you can even get rid of all > > the intermediate `sync' operations (I guess you might want one at the end if > > you want the call to become visible at a particular point). > > Wont work. We are replacing a 32bit call with a nop. That nop must also > be 32bits, because we could eventually replace the nop(s) with a 32bit > call. ... which, if it's misaligned to a 32-bit boundary, which can happen with Thumb-2 code, will require the replacement to be done atomically; you will need to use stop_machine() to ensure that other CPUs don't try to execute the instruction mid-way through modification... as I have already explained in my previous mails.
On Mon, 2012-12-10 at 11:24 +0000, Will Deacon wrote: > On Mon, Dec 10, 2012 at 11:04:05AM +0000, Jon Medhurst (Tixy) wrote: > > On Fri, 2012-12-07 at 19:02 +0000, Will Deacon wrote: > > > For ARMv7, there are small subsets of instructions for ARM and Thumb which > > > are guaranteed to be atomic wrt concurrent modification and execution of > > > the instruction stream between different processors: > > > > > > Thumb: The 16-bit encodings of the B, NOP, BKPT, and SVC instructions. > > > ARM: The B, BL, NOP, BKPT, SVC, HVC, and SMC instructions. > > > > > > > So this means for things like kprobes which can modify arbitrary kernel > > code we are going to need to continue to always use some form of > > stop_the_whole_system() function? > > > > Also, kprobes currently uses patch_text() which only uses stop_machine > > for Thumb2 instructions which straddle a word boundary, so this needs > > changing? > > Yes; if you're modifying instructions other than those mentioned above, then > you'll need to synchronise the CPUs, update the instructions, perform > cache-maintenance on the writing CPU and then execute an isb on the > executing core (this last bit isn't needed if you're going to go through an > exception return to get back to the new code -- depends on how your > stop/resume code works). Yeah, kprobe optimizing will probably require stop_machine() always, as it's modifying random code, or adding breakpoints into random places. That's another adventure to deal with at another time. > > For ftrace we can (hopefully) avoid a lot of this when we have known points > of modification. I'm also thinking about tracepoints which behave almost the same as ftrace. They have nop place holders too. They happen to be 32bits too, but may only need to be 16 bit. The way tracepoints work is with the use of asm goto. For example we have: arch/arm/include/asm/jump_label.h #ifdef CONFIG_THUMB2_KERNEL #define JUMP_LABEL_NOP "nop.w" #else #define JUMP_LABEL_NOP "nop" #endif static __always_inline bool arch_static_branch(struct static_key *key) { asm goto("1:\n\t" JUMP_LABEL_NOP "\n\t" ".pushsection __jump_table, \"aw\"\n\t" ".word 1b, %l[l_yes], %c0\n\t" ".popsection\n\t" : : "i" (key) : : l_yes); return false; l_yes: return true; } Tracepoints use the jump-label "static branch" logic, which uses a gcc 4.6 feature called asm goto. The asm goto allows the internal asm to reference a label outside the asm stamement and the compiler is aware that the asm statement may jump to that label. Thus the compiler treats that asm statement as a possible branch to the given label and it wont optimize required statements after the asm, if they are needed for the jump to the label. Now in include/linux/tracepoint.h we have: static inline void trace_##name(proto) \ { \ if (static_key_false(&__tracepoint_##name.key)) \ __DO_TRACE(&__tracepoint_##name, \ TP_PROTO(data_proto), \ TP_ARGS(data_args), \ TP_CONDITION(cond),,); \ } \ Where the static_key_false() is an "unlikely" version of the static_branch() that tells gcc the result of the if statement goes into the unlikely location (end of function perhaps). But this doesn't guarantee that it becomes part of some if statement, so this doesn't have all the limitations that ftrace mcount call has. -- Steve
On Mon, 2012-12-10 at 13:57 +0000, Russell King - ARM Linux wrote: > On Mon, Dec 10, 2012 at 08:02:17AM -0500, Steven Rostedt wrote: > > On Mon, 2012-12-10 at 10:04 +0000, Will Deacon wrote: > > > Yes, and I think if you do use two 16-bit nops, you can even get rid of all > > > the intermediate `sync' operations (I guess you might want one at the end if > > > you want the call to become visible at a particular point). > > > > Wont work. We are replacing a 32bit call with a nop. That nop must also > > be 32bits, because we could eventually replace the nop(s) with a 32bit > > call. > > ... which, if it's misaligned to a 32-bit boundary, which can happen with > Thumb-2 code, will require the replacement to be done atomically; you will > need to use stop_machine() to ensure that other CPUs don't try to execute > the instruction mid-way through modification... as I have already > explained in my previous mails. If there's no way to modify a 32bit operation without stop_machine(), ever with a breakpoint, than we can stop the discussion here. ARM will forever require stop_machine() for use with tracepoints and ftrace. Too bad, as ARM was the x86 competitor. Here's something that x86 has a one up on ARM. -- Steve
On Mon, Dec 10, 2012 at 09:06:05AM -0500, Steven Rostedt wrote: > On Mon, 2012-12-10 at 13:57 +0000, Russell King - ARM Linux wrote: > > On Mon, Dec 10, 2012 at 08:02:17AM -0500, Steven Rostedt wrote: > > > On Mon, 2012-12-10 at 10:04 +0000, Will Deacon wrote: > > > > Yes, and I think if you do use two 16-bit nops, you can even get rid of all > > > > the intermediate `sync' operations (I guess you might want one at the end if > > > > you want the call to become visible at a particular point). > > > > > > Wont work. We are replacing a 32bit call with a nop. That nop must also > > > be 32bits, because we could eventually replace the nop(s) with a 32bit > > > call. > > > > ... which, if it's misaligned to a 32-bit boundary, which can happen with > > Thumb-2 code, will require the replacement to be done atomically; you will > > need to use stop_machine() to ensure that other CPUs don't try to execute > > the instruction mid-way through modification... as I have already > > explained in my previous mails. > > If there's no way to modify a 32bit operation without stop_machine(), > ever with a breakpoint, than we can stop the discussion here. ARM will > forever require stop_machine() for use with tracepoints and ftrace. Too > bad, as ARM was the x86 competitor. Here's something that x86 has a one > up on ARM. You think that kind of blackmail makes a difference? Look closely at what I've written - I didn't say that there's no way to modify any 32-bit operation without stop_machine().
On Mon, 2012-12-10 at 14:07 +0000, Russell King - ARM Linux wrote: > > > If there's no way to modify a 32bit operation without stop_machine(), > > ever with a breakpoint, than we can stop the discussion here. ARM will > > forever require stop_machine() for use with tracepoints and ftrace. Too > > bad, as ARM was the x86 competitor. Here's something that x86 has a one > > up on ARM. > > You think that kind of blackmail makes a difference? I'm not trying to blackmail. How does one blackmail to change facts? I was just showing my disappointment, as I'm starting to grow fond of ARM. Although I wouldn't mind sending out some blackmail to the ARM HW vendors (got any pictures of them cheating on their spouses?) to make them come up with a standardize way to make all this easy :-) > Look closely at what > I've written - I didn't say that there's no way to modify any 32-bit > operation without stop_machine(). Again, you and I are having a disconnect. I'm not a HW expert. I'm trying to get a total understanding of what you, Will, Jon and others are trying to say. Lets look at what you last said: > ... which, if it's misaligned to a 32-bit boundary, which can happen with > Thumb-2 code, will require the replacement to be done atomically; you will > need to use stop_machine() to ensure that other CPUs don't try to execute > the instruction mid-way through modification... as I have already > explained in my previous mails. I'm confused to what is wrong to "misaligned to a 32-bit boundery". Isn't it best if it is on a 32-bit boundary? Or do you mean that it's misaligned across a 32-bit boundary? I guess I just read it wrong. Either way, I said there's probably no guarantee that the 32-bit calls to mcount that gcc has inserted (or the tracepoints) are going to be aligned to 32-bit boundaries. But I'm wondering if that's still a problem. Let's look at the ways another CPU could get the 32-bit instruction if it is misaligned, and across two different cache lines, or even two different pages: 1) the CPU gets the full 32bits as it was on the other CPU, or how it will be. 2) The CPU gets the first 16bits as it was on the other CPU an the second 16bits with the update. 3) The CPU gets the first 16bits with the update and the second 16bits as it use to be. The first case isn't interesting, so lets jump to the 2 and 3rd cases. On an update of a 32bit nop to a 16bit breakpoint or branch (jump over second half). As we only modify the first half of the 32bit operation, and the second half still contains the second half of the 32bit nop, the other CPU will see: <first 16bits of 32bit nop><second 16bits of 32bit nop> That doesn't look so bad. Or: <breakpoint/branch><second 16bits of 32bit nop> If it sees the breakpoint or branch, then it should just jump over the second part of the nop, and not execute it. If this is an issue, then we need stop_machine() otherwise, I'll continue. Now before changing the second half of the 32bits, we send a sync (IPI maybe) to all other CPUs, so that we now guarantee that all CPUs sees the added breakpoint/branch. As the breakpoint and branch will return pass the second half of the nop, modifying it shouldn't bother the other CPUs. If it does, then we have to stay with stop_machine(). If not, let me continue. We need to send another sync to guarantee that all the other CPUs see the second half of the jump to ftrace code. Now the second half matches the second half of the jump to the ftrace code, and the first half is still the breakpoint/branch. So lets change that to the first half of the jump to the ftrace code. Now the other CPUs will see: <breakpoint/branch><second 16bits of jump to ftrace> The above should still skip the second half. Or: <first 16bits of jump to ftrace><second 16bits of jump to ftrace> The above is what we want CPUs to see. If my assumptions about the changes above can't be done, then yes we are stuck with stop_machine(). If they can be done, then we have hope to remove it. -- Steve
Hi Jamie, Thanks for summarising the thread so far. On Mon, Dec 10, 2012 at 01:40:01PM +0000, Jamie Lokier wrote: > On Fri, 2012-12-07 at 19:02 +0000, Will Deacon wrote: > > For ARMv7, there are small subsets of instructions for ARM and Thumb which > > are guaranteed to be atomic wrt concurrent modification and execution of > > the instruction stream between different processors: > > > > Thumb: The 16-bit encodings of the B, NOP, BKPT, and SVC instructions. > > ARM: The B, BL, NOP, BKPT, SVC, HVC, and SMC instructions. > > Thumb 32-bit ftrace call isn't in the above list. > > Questions: does the above concurrent modification guarantee require > both the old instruction _and_ the new one to be among those listed, > or is it enough to be just the new one (for example when setting a > normal software breakpoint, that would be useful)? Can it be the old > one and not the new (for example when removing a software breakpoint, > that would be useful)? Does that subset mean replacing any of the > listed instructions by any of the others is ok, or any of the listed > with another of the same type? Yes, the target instruction also has to be listed. However, I only described the simple cases above... there are a number of exceptions when it comes to 32-bit Thumb-2 encodings of BL (I hinted at one of them before): - The two halfwords of a 32-bit BL instruction can each be replaced with the relevant halfword from another BL instruction. This basically means you can change the immediate fields, as I mentioned earlier. - The most-significant halfword of a 32-bit BL instruction can be replaced with B, BKPT or SVC (i.e. not NOP). - A 16-bit B, BKPT, or SVC instruction can be replaced with the most-significant halfword of a BL instruction. The latter points mean we can effectively nop out bl instructions, and put them back again. > This is what makes me wonder, if it's safe to replace the 32-bit > mcount call with a 16-bit short jump: > > > On Mon, Dec 10, 2012 at 11:04:05AM +0000, Jon Medhurst (Tixy) wrote: > > > So this means for things like kprobes which can modify arbitrary kernel > > > code we are going to need to continue to always use some form of > > > stop_the_whole_system() function? > > > > > > Also, kprobes currently uses patch_text() which only uses stop_machine > > > for Thumb2 instructions which straddle a word boundary, so this needs > > > changing? > > Will Deacon replied: > > Yes; if you're modifying instructions other than those mentioned above, then > > you'll need to synchronise the CPUs, update the instructions, perform > > cache-maintenance on the writing CPU and then execute an isb on the > > executing core (this last bit isn't needed if you're going to go through an > > exception return to get back to the new code -- depends on how your > > stop/resume code works). > > If I've understood that exchange, it implies that using patch_text() > to replace an instruction not in the list of special ones, with a trap > or jump, isn't ok? And so it's ok to replace the NOP with a short > branch (since 16-bit "B" is in the list), but it's not ok to replace > 16-bit "B" with the 32-bit ftrace call; and the same going the other way? Well, these restrictions only apply if you want to avoid synchronising the CPUs. Hopefully my explanation above answers your questions! Will
On Mon, Dec 10, 2012 at 09:46:41AM -0500, Steven Rostedt wrote: > Again, you and I are having a disconnect. I'm not a HW expert. I'm > trying to get a total understanding of what you, Will, Jon and others > are trying to say. Well, there's people who think that you're intentionally trying to wind me up (I'm not alone in this opinion; believe me, I checked with someone else taking part in this thread and they said as much...) > > ... which, if it's misaligned to a 32-bit boundary, which can happen with > > Thumb-2 code, will require the replacement to be done atomically; you will > > need to use stop_machine() to ensure that other CPUs don't try to execute > > the instruction mid-way through modification... as I have already > > explained in my previous mails. > > I'm confused to what is wrong to "misaligned to a 32-bit boundery". > Isn't it best if it is on a 32-bit boundary? Or do you mean that it's > misaligned across a 32-bit boundary? I guess I just read it wrong. What I mean is a store of 32-bit size to an address which is not numerically an integer multiple of four. To see why this is a problem, take a moment to think about how you'd update a misaligned 32-bit value on a 32-bit bus with byte enables. You need to do it as two transactions. If your bus is 64-bits wide, then the problem potentially becomes one where there's an issue if it crosses a 64-bit boundary. Continue for larger bus widths... Now add in the effect of caching with its cache line boundaries, and what the effects are if a write crosses the cache line boundary (which means it ends up with two separate validity bits etc.) Lastly, remember that ARM CPUs have a Harvard cache architecture; that means that the data paths are entirely separate from the instruction paths - and in some cases that goes all the way to the memory controller, but that's not relevant. The relevant point here is that the point in the pathways where the instruction and data paths unite can be quite some distance _outside_ of the CPU. What this all means is that a misaligned 32-bit store can ultimately appear as two separate 16-bit stores, which may be interleaved by other bus activity. Whether that is visible to other CPUs in a SMP system as two separate 16-bit stores or not isn't well defined. x86 in this regard is beautiful; it's fully coherent with everything. It enforces correctness for almost every situation. It manages this by using a hell of a lot of logic to do interlocking and ensure correct ordering. If you want that from an ARM CPU then you'd probably need a comparible amount of logic - and power - to be able to do that. > Either way, I said there's probably no guarantee that the 32-bit calls > to mcount that gcc has inserted (or the tracepoints) are going to be > aligned to 32-bit boundaries. Correct; there is no guarantee of that what so ever when building for Thumb-2. > But I'm wondering if that's still a > problem. Let's look at the ways another CPU could get the 32-bit > instruction if it is misaligned, and across two different cache lines, > or even two different pages: > > > 1) the CPU gets the full 32bits as it was on the other CPU, or how it > will be. > > 2) The CPU gets the first 16bits as it was on the other CPU an the > second 16bits with the update. > > 3) The CPU gets the first 16bits with the update and the second 16bits > as it use to be. > > > The first case isn't interesting, so lets jump to the 2 and 3rd cases. > > On an update of a 32bit nop to a 16bit breakpoint or branch (jump over > second half). Err. Let me remind you what you said in the message which I replied to earlier today: We are replacing a 32bit call with a nop. That nop must also ^^^^^ be 32bits, because we could eventually replace the nop(s) with a 32bit ^^^^^^ call. Maybe that's sloppy language, but I tend to read what's written and interpret it as written... so to now say about 16-bit breakpoint or branch instructions to me sounds like changing the point of discussion.
On Mon, 2012-12-10 at 15:25 +0000, Russell King - ARM Linux wrote: > On Mon, Dec 10, 2012 at 09:46:41AM -0500, Steven Rostedt wrote: > > Again, you and I are having a disconnect. I'm not a HW expert. I'm > > trying to get a total understanding of what you, Will, Jon and others > > are trying to say. > > Well, there's people who think that you're intentionally trying to wind > me up (I'm not alone in this opinion; believe me, I checked with someone > else taking part in this thread and they said as much...) I'm sorry that you and others feel that way. What benefit would I get to do such a thing? That would be counter productive on my part, and honestly, that's the last thing I want to do. I think it may just be that my general personality and my way of writing may be of issue. Ask others that deal with me. I'm not saying anything differently to you than I do to anyone else. But that's just me. I'm really not taking any of this personal nor am I trying to piss you or anyone else off. I'm just trying to talk technical here. > > > > ... which, if it's misaligned to a 32-bit boundary, which can happen with > > > Thumb-2 code, will require the replacement to be done atomically; you will > > > need to use stop_machine() to ensure that other CPUs don't try to execute > > > the instruction mid-way through modification... as I have already > > > explained in my previous mails. > > > > I'm confused to what is wrong to "misaligned to a 32-bit boundery". > > Isn't it best if it is on a 32-bit boundary? Or do you mean that it's > > misaligned across a 32-bit boundary? I guess I just read it wrong. > > What I mean is a store of 32-bit size to an address which is not > numerically an integer multiple of four. > > To see why this is a problem, take a moment to think about how you'd > update a misaligned 32-bit value on a 32-bit bus with byte enables. > You need to do it as two transactions. > > If your bus is 64-bits wide, then the problem potentially becomes one > where there's an issue if it crosses a 64-bit boundary. Continue for > larger bus widths... > > Now add in the effect of caching with its cache line boundaries, and > what the effects are if a write crosses the cache line boundary (which > means it ends up with two separate validity bits etc.) > > Lastly, remember that ARM CPUs have a Harvard cache architecture; that > means that the data paths are entirely separate from the instruction > paths - and in some cases that goes all the way to the memory controller, > but that's not relevant. The relevant point here is that the point in > the pathways where the instruction and data paths unite can be quite > some distance _outside_ of the CPU. > > What this all means is that a misaligned 32-bit store can ultimately > appear as two separate 16-bit stores, which may be interleaved by > other bus activity. Whether that is visible to other CPUs in a SMP > system as two separate 16-bit stores or not isn't well defined. > > x86 in this regard is beautiful; it's fully coherent with everything. > It enforces correctness for almost every situation. It manages this > by using a hell of a lot of logic to do interlocking and ensure > correct ordering. If you want that from an ARM CPU then you'd probably > need a comparible amount of logic - and power - to be able to do that. I'm going to be bluntly honest here, and say that I do not fully understand all the intrinsic operations that you've explained above. This may be part of the frustration that you have with me. Not that I'm trying to piss you off, but the fact that I'm not as well versed of all the details that goes on inside an ARM processor. Yes, I've been spoiled with x86, but I'm trying hard to understand the differences with ARM. I may be asking stupid questions that are very obvious to you, but to me, this is a new world. Please have some patience with me. > > > Either way, I said there's probably no guarantee that the 32-bit calls > > to mcount that gcc has inserted (or the tracepoints) are going to be > > aligned to 32-bit boundaries. > > Correct; there is no guarantee of that what so ever when building for > Thumb-2. > > > But I'm wondering if that's still a > > problem. Let's look at the ways another CPU could get the 32-bit > > instruction if it is misaligned, and across two different cache lines, > > or even two different pages: > > > > > > 1) the CPU gets the full 32bits as it was on the other CPU, or how it > > will be. > > > > 2) The CPU gets the first 16bits as it was on the other CPU an the > > second 16bits with the update. > > > > 3) The CPU gets the first 16bits with the update and the second 16bits > > as it use to be. > > > > > > The first case isn't interesting, so lets jump to the 2 and 3rd cases. > > > > On an update of a 32bit nop to a 16bit breakpoint or branch (jump over > > second half). > > Err. Let me remind you what you said in the message which I replied to > earlier today: > > We are replacing a 32bit call with a nop. That nop must also > ^^^^^ > be 32bits, because we could eventually replace the nop(s) with a 32bit > ^^^^^^ > call. > > Maybe that's sloppy language, but I tend to read what's written and > interpret it as written... so to now say about 16-bit breakpoint or > branch instructions to me sounds like changing the point of discussion. The grand view is to change a 32bit nop to a 32bit branch and link, and vice-versa. But because the 32bit operation may cross cache-lines or pages, there's no safe way to to do that directly (I'm assuming from everything that I've been told). Thus, the idea is to break the conversion up into steps where we only change half of the instruction. Changing the first half to be either a breakpoint or a branch that skips the second half. Considering that no matter how the processor sees the result, it wont be an issue. This is where there's a lot of assumptions that I'm trying to understand, and where you may be frustrated with me. If we can change the first half of the instruction with a 16bit operation that skips the second half, then it may be possible to change the entire 32bit op with another 32bit op in a series of steps, as I explained several times. -- Steve
On Mon, 2012-12-10 at 10:04 +0000, Will Deacon wrote: > Hi Jon, > > Back-pedalling a bit here, but I'm confused by one of your points below: > > On Fri, Dec 07, 2012 at 05:45:47PM +0000, Jon Medhurst (Tixy) wrote: > > On Fri, 2012-12-07 at 12:13 -0500, Steven Rostedt wrote: > > > I'll make my question more general: > > > > > > If I have a nop, that is a size of a call (branch and link), which is > > > near the beginning of a function and not part of any conditional, and I > > > want to convert it into a call (branch and link), would adding a > > > breakpoint to it, modifying it to the call, and then removing the > > > breakpoint be possible? Of course it would require syncing in between > > > steps, but my question is, if the above is possible on a thumb2 ARM > > > processor? > > > > I believe so. The details are (repeating your earlier explanation) ... > > > > 1. Replace first half of nop with 16bit 'breakpoint' instruction. > > Sort of -- you'd actually need 2x16-bit nops to make this work. > > > 2. Sync.(cache flush to PoU + IPIs to make other cores invalidate the > > icache for changed part of the nop instruction). > > Why do you need to use IPIs for I-cache invalidation on other cores? For > ARMv7 SMP (i.e. the multi-processing extensions) doing I-cache invalidation > by MVA to PoU will be broadcast to the applicable domain for the > shareability attributes of the address. So if you do icimvau with an > inner-shareable virtual address, it will be broadcast by the hardware. That was a clue I was missing, and it means that my patch which spawned this thread is flawed. The original problem I was trying to cure was random crashes whilst ftrace_modify_all_code() was going round modifying kernel functions, and I fixed this by getting all cores to __flush_icache_all() after the modifications had been made. But if cache flushes are broadcast across all cores then my reasoning for the fix is wrong. As this only seems to surface on TC2 perhaps CCI doesn't do the magic we want, or we have it misconfigured, or were been hit by cache differences between A7 and A15? (I've seen comments somewhere which says A7 has VIPT aliasing I-cache, and A15 is PIPT non-aliasing). Will need to some more detailed investigation when I get time.
diff --git a/arch/arm/kernel/ftrace.c b/arch/arm/kernel/ftrace.c index 34e5664..38b670c 100644 --- a/arch/arm/kernel/ftrace.c +++ b/arch/arm/kernel/ftrace.c @@ -14,6 +14,7 @@ #include <linux/ftrace.h> #include <linux/uaccess.h> +#include <linux/stop_machine.h> #include <asm/cacheflush.h> #include <asm/opcodes.h> @@ -156,6 +157,39 @@ int ftrace_make_nop(struct module *mod, return ret; } +struct afmc_data { + int command; + atomic_t cpu; + atomic_t done; +}; + +static int __arch_ftrace_modify_code(void *data) +{ + struct afmc_data *afmcd = data; + + if (atomic_inc_return(&afmcd->cpu) == num_online_cpus()) { + /* Last cpu to get into this function does the actual work */ + ftrace_modify_all_code(afmcd->command); + wmb(); + atomic_set(&afmcd->done, true); + } else { + /* Other cpus wait for the code modifications to be done */ + rmb(); + while (!atomic_read(&afmcd->done)) + cpu_relax(); + /* Ensure icache is consistent with the code changes */ + __flush_icache_all(); + } + + return 0; +} + +void arch_ftrace_update_code(int command) +{ + struct afmc_data afmcd = { command }; + stop_machine(__arch_ftrace_modify_code, &afmcd, cpu_online_mask); +} + int __init ftrace_dyn_arch_init(void *data) { *(unsigned long *)data = 0;
When the generic ftrace implementation modifies code for trace-points it uses stop_machine() to call ftrace_modify_all_code() on one CPU. This ultimately calls the ARM specific function ftrace_modify_code() which updates the instruction and then does flush_icache_range(). As this cache flushing only operates on the local CPU then other cores may end up executing the old instruction if it's still in their icaches. This may or may not cause problems for the use of ftrace on kernels compiled for ARM instructions. However, Thumb2 instructions can straddle two cache lines so its possible for half the old instruction to be in the cache and half the new one, leading to the CPU executing garbage. This patch fixes this situation by providing an arch-specific implementation of arch_ftrace_update_code() which ensures that after one core has modified all the code, the other cores invalidate their icaches before continuing. Signed-off-by: Jon Medhurst <tixy@linaro.org> --- arch/arm/kernel/ftrace.c | 34 ++++++++++++++++++++++++++++++++++ 1 file changed, 34 insertions(+)