Message ID | 20240628-dev-andyc-dyn-ftrace-v4-v2-3-1e5f4cb1f049@sifive.com (mailing list archive) |
---|---|
State | Handled Elsewhere |
Headers | show |
Series | riscv: ftrace: atmoic patching and preempt improvements | expand |
Andy Chiu <andy.chiu@sifive.com> writes: > We use an AUIPC+JALR pair to jump into a ftrace trampoline. Since > instruction fetch can break down to 4 byte at a time, it is impossible > to update two instructions without a race. In order to mitigate it, we > initialize the patchable entry to AUIPC + NOP4. Then, the run-time code > patching can change NOP4 to JALR to eable/disable ftrcae from a enable ftrace > function. This limits the reach of each ftrace entry to +-2KB displacing > from ftrace_caller. > > Starting from the trampoline, we add a level of indirection for it to > reach ftrace caller target. Now, it loads the target address from a > memory location, then perform the jump. This enable the kernel to update > the target atomically. The +-2K limit is for direct calls, right? ...and this I would say breaks DIRECT_CALLS (which should be implemented using call_ops later)? Björn
Björn Töpel <bjorn@kernel.org> writes: > Andy Chiu <andy.chiu@sifive.com> writes: > >> We use an AUIPC+JALR pair to jump into a ftrace trampoline. Since >> instruction fetch can break down to 4 byte at a time, it is impossible >> to update two instructions without a race. In order to mitigate it, we >> initialize the patchable entry to AUIPC + NOP4. Then, the run-time code >> patching can change NOP4 to JALR to eable/disable ftrcae from a > enable ftrace > >> function. This limits the reach of each ftrace entry to +-2KB displacing >> from ftrace_caller. >> >> Starting from the trampoline, we add a level of indirection for it to >> reach ftrace caller target. Now, it loads the target address from a >> memory location, then perform the jump. This enable the kernel to update >> the target atomically. > > The +-2K limit is for direct calls, right? > > ...and this I would say breaks DIRECT_CALLS (which should be implemented > using call_ops later)? Thinking a bit more, and re-reading the series. This series is good work, and it's a big improvement for DYNAMIC_FTRACE, but +int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr) +{ + unsigned long distance, orig_addr; + + orig_addr = (unsigned long)&ftrace_caller; + distance = addr > orig_addr ? addr - orig_addr : orig_addr - addr; + if (distance > JALR_RANGE) + return -EINVAL; + + return __ftrace_modify_call(rec->ip, addr, false); +} + breaks WITH_DIRECT_CALLS. The direct trampoline will *never* be within the JALR_RANGE. Unless we're happy with a break (I'm not) -- I really think Puranjay's CALL_OPS patch needs to be baked in in the series! Björn
On Wed, Aug 14, 2024 at 02:57:52PM +0200, Björn Töpel wrote: > Björn Töpel <bjorn@kernel.org> writes: > > > Andy Chiu <andy.chiu@sifive.com> writes: > > > >> We use an AUIPC+JALR pair to jump into a ftrace trampoline. Since > >> instruction fetch can break down to 4 byte at a time, it is impossible > >> to update two instructions without a race. In order to mitigate it, we > >> initialize the patchable entry to AUIPC + NOP4. Then, the run-time code > >> patching can change NOP4 to JALR to eable/disable ftrcae from a > > enable ftrace > > > >> function. This limits the reach of each ftrace entry to +-2KB displacing > >> from ftrace_caller. > >> > >> Starting from the trampoline, we add a level of indirection for it to > >> reach ftrace caller target. Now, it loads the target address from a > >> memory location, then perform the jump. This enable the kernel to update > >> the target atomically. > > > > The +-2K limit is for direct calls, right? > > > > ...and this I would say breaks DIRECT_CALLS (which should be implemented > > using call_ops later)? > > Thinking a bit more, and re-reading the series. > > This series is good work, and it's a big improvement for DYNAMIC_FTRACE, > but > > +int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr) > +{ > + unsigned long distance, orig_addr; > + > + orig_addr = (unsigned long)&ftrace_caller; > + distance = addr > orig_addr ? addr - orig_addr : orig_addr - addr; > + if (distance > JALR_RANGE) > + return -EINVAL; > + > + return __ftrace_modify_call(rec->ip, addr, false); > +} > + > > breaks WITH_DIRECT_CALLS. The direct trampoline will *never* be within > the JALR_RANGE. Yes, it is hardly possible that a direct trampoline will be within the range. Recently I have been thinking some solutions to address the issue. One solution is replaying AUIPC at function entries. The idea has two sides. First, if we are returning back to the second instruction at trap return, then do sepc -= 4 so it executes the up-to-date AUIPC. The other side is to fire synchronous IPI that does remote fence.i at right timings to prevent concurrent executing on a mix of old and new instructions. Consider replacing instructions at a function's patchable entry with the following sequence: Initial state: -------------- 0: AUIPC 4: JALR Step1: write(0, "J +8") fence w,w send sync local+remote fence.i ------------------------ 0: J +8 4: JALR Step2: write(4, "JALR'") fence w,w send sync local+remote fence.i ------------------------ 0: J +8 4: JALR' Step3: write(0, "AUIPC'") fence w,w send sync local+remote fence.i (to activate the call) ----------------------- 0: AUIPC' 4: JALR' The following execution sequences are acceptable: - AUIPC, JALR - J +8, (skipping {JALR | JALR'}) - AUIPC', JALR' And here are sequences that we want to prevent: - AUIPC', JALR - AUIPC, JALR' The local core should never execute the forbidden sequence. By listing all possible combinations of executing sequence on a remote core, we can find that the dangerous seqence is impossible to happen: let f be the fence.i at step 1, 2, 3. And let numbers be the location of code being executed. Mathematically, here are all combinations at a site happening on a remote core: fff04 -- updated seq ff0f4 -- impossible, would be ff0f04, updated seq ff04f -- impossible, would be ff08f, safe seq f0ff4 -- impossible, would be f0ff04, updated seq f0f4f -- impossible, would be f0f08f (safe), or f0f0f04 (updated) f04ff -- impossible, would be f08ff, safe seq 0fff4 -- impossible, would be 0fff04, updated seq 0ff4f -- impossible, would be 0ff08f (safe), or 0ff0f04 (updated) 0f4ff -- impossible, would be 0f08ff (safe), 0f0f08f (safe), 0f0f0f04 (updated) 04fff -- old seq After the 1st 'fence.i', remote cores should observe (J +8, JALR) or (J +8, JALR') After the 2nd 'fence.i', remote cores should observe (J +8, JALR') or (AUIPC', JALR') After the 3rd 'fence.i', remote cores should observe (AUIPC', JALR') Remote cores should never execute (AUIPC',JALR) or (AUIPC,JALR') To correctly implement the solution, the trap return code must match JALR and adjust sepc only for patchable function entries. This is undocumently possible because we use t0 as source and destination registers for JALR at function entries. Compiler never generates JALR that uses the same register pattern. Another solution is inspired by zcmt, and perhaps we can optimize it if the hardware does support zcmt. First, we allocate a page and divide it into two halves. The first half of the page are 255 x 8B destination addresses. Then, starting from offset 2056, the second half of the page is composed by a series of 2 x 4 Byte instructions: 0: ftrace_tramp_1 8: ftrace_tramp_2 ... 2040: ftrace_tramp_255 2048: ftrace_tramp_256 (not used when configured with 255 tramps) 2056: ld t1, -2048(t1) jr t1 ld t1, -2048(t1) jr t1 ... 4088: ld t1, -2048(t1) jr t1 4096: It is possible to expand to 511 trampolines by adding a page below, and making a load+jr sequence from +2040 offset. When the kernel boots, we direct AUIPCs at patchable entries to the page, and disable the call by setting the second instruction to NOP4. Then, we can effectively enable/disable/modify a call by setting only the instruction at JALR. It is possible to utilize most of the current patch set to achieve atomic patching. A missing part is to allocate and manage trampolines for ftrace users. Thanks, Andy
Andy Chiu <andybnac@gmail.com> writes: > On Wed, Aug 14, 2024 at 02:57:52PM +0200, Björn Töpel wrote: >> Björn Töpel <bjorn@kernel.org> writes: >> >> > Andy Chiu <andy.chiu@sifive.com> writes: >> > >> >> We use an AUIPC+JALR pair to jump into a ftrace trampoline. Since >> >> instruction fetch can break down to 4 byte at a time, it is impossible >> >> to update two instructions without a race. In order to mitigate it, we >> >> initialize the patchable entry to AUIPC + NOP4. Then, the run-time code >> >> patching can change NOP4 to JALR to eable/disable ftrcae from a >> > enable ftrace >> > >> >> function. This limits the reach of each ftrace entry to +-2KB displacing >> >> from ftrace_caller. >> >> >> >> Starting from the trampoline, we add a level of indirection for it to >> >> reach ftrace caller target. Now, it loads the target address from a >> >> memory location, then perform the jump. This enable the kernel to update >> >> the target atomically. >> > >> > The +-2K limit is for direct calls, right? >> > >> > ...and this I would say breaks DIRECT_CALLS (which should be implemented >> > using call_ops later)? >> >> Thinking a bit more, and re-reading the series. >> >> This series is good work, and it's a big improvement for DYNAMIC_FTRACE, >> but >> >> +int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr) >> +{ >> + unsigned long distance, orig_addr; >> + >> + orig_addr = (unsigned long)&ftrace_caller; >> + distance = addr > orig_addr ? addr - orig_addr : orig_addr - addr; >> + if (distance > JALR_RANGE) >> + return -EINVAL; >> + >> + return __ftrace_modify_call(rec->ip, addr, false); >> +} >> + >> >> breaks WITH_DIRECT_CALLS. The direct trampoline will *never* be within >> the JALR_RANGE. > > > Yes, it is hardly possible that a direct trampoline will be within the > range. > > Recently I have been thinking some solutions to address the issue. One > solution is replaying AUIPC at function entries. The idea has two sides. > First, if we are returning back to the second instruction at trap return, > then do sepc -= 4 so it executes the up-to-date AUIPC. The other side is > to fire synchronous IPI that does remote fence.i at right timings to > prevent concurrent executing on a mix of old and new instructions. > > Consider replacing instructions at a function's patchable entry with the > following sequence: > > Initial state: > -------------- > 0: AUIPC > 4: JALR > > Step1: > write(0, "J +8") > fence w,w > send sync local+remote fence.i > ------------------------ > 0: J +8 > 4: JALR > > Step2: > write(4, "JALR'") > fence w,w > send sync local+remote fence.i > ------------------------ > 0: J +8 > 4: JALR' > > Step3: > write(0, "AUIPC'") > fence w,w > send sync local+remote fence.i (to activate the call) > ----------------------- > 0: AUIPC' > 4: JALR' > > The following execution sequences are acceptable: > - AUIPC, JALR > - J +8, (skipping {JALR | JALR'}) > - AUIPC', JALR' > > And here are sequences that we want to prevent: > - AUIPC', JALR > - AUIPC, JALR' > > The local core should never execute the forbidden sequence. > > By listing all possible combinations of executing sequence on a remote > core, we can find that the dangerous seqence is impossible to happen: > > let f be the fence.i at step 1, 2, 3. And let numbers be the location of > code being executed. Mathematically, here are all combinations at a site > happening on a remote core: > > fff04 -- updated seq > ff0f4 -- impossible, would be ff0f04, updated seq > ff04f -- impossible, would be ff08f, safe seq > f0ff4 -- impossible, would be f0ff04, updated seq > f0f4f -- impossible, would be f0f08f (safe), or f0f0f04 (updated) > f04ff -- impossible, would be f08ff, safe seq > 0fff4 -- impossible, would be 0fff04, updated seq > 0ff4f -- impossible, would be 0ff08f (safe), or 0ff0f04 (updated) > 0f4ff -- impossible, would be 0f08ff (safe), 0f0f08f (safe), 0f0f0f04 (updated) > 04fff -- old seq > > After the 1st 'fence.i', remote cores should observe (J +8, JALR) or (J +8, JALR') > After the 2nd 'fence.i', remote cores should observe (J +8, JALR') or (AUIPC', JALR') > After the 3rd 'fence.i', remote cores should observe (AUIPC', JALR') > > Remote cores should never execute (AUIPC',JALR) or (AUIPC,JALR') > > To correctly implement the solution, the trap return code must match JALR > and adjust sepc only for patchable function entries. This is undocumently > possible because we use t0 as source and destination registers for JALR > at function entries. Compiler never generates JALR that uses the same > register pattern. > > Another solution is inspired by zcmt, and perhaps we can optimize it if > the hardware does support zcmt. First, we allocate a page and divide it > into two halves. The first half of the page are 255 x 8B destination > addresses. Then, starting from offset 2056, the second half of the page > is composed by a series of 2 x 4 Byte instructions: > > 0: ftrace_tramp_1 > 8: ftrace_tramp_2 > ... > 2040: ftrace_tramp_255 > 2048: ftrace_tramp_256 (not used when configured with 255 tramps) > 2056: > ld t1, -2048(t1) > jr t1 > ld t1, -2048(t1) > jr t1 > ... > 4088: > ld t1, -2048(t1) > jr t1 > 4096: > > It is possible to expand to 511 trampolines by adding a page > below, and making a load+jr sequence from +2040 offset. > > When the kernel boots, we direct AUIPCs at patchable entries to the page, > and disable the call by setting the second instruction to NOP4. Then, we > can effectively enable/disable/modify a call by setting only the > instruction at JALR. It is possible to utilize most of the current patch > set to achieve atomic patching. A missing part is to allocate and manage > trampolines for ftrace users. (I will need to digest above in detail!) I don't think it's a good idea to try to handle direct calls w/o call_ops. What I was trying to say is "add the call_ops patch to your series, so that direct calls aren't broken". If direct calls depend on call_ops -- sure, no worries. But don't try to get direct calls W/O call_ops. That's a whole new bag of worms. Some more high-level thoughts: ...all this to workaround where we don't want the call_ops overhead? Is there really a use-case with a platform that doesn't handle the text overhead of call_ops? Maybe I'm missing context here... but I'd say, let's follow what arm64 did (but obviously w/o the BL direct call optimization, and always jump to a trampoline -- since that's not possible with RISC-V branch length), and just do the call_ops way. Then, as a second step, and if there are platforms that care, think about a variant w/o call_ops. Or what I wrote in the first section: 1. Keep this patch set 2. ...but add call_ops to it, and require call_ops for direct calls. Just my $.02. Björn
Björn Töpel <bjorn@kernel.org> 於 2024年9月11日 週三 下午10:37寫道: > > Andy Chiu <andybnac@gmail.com> writes: > > > On Wed, Aug 14, 2024 at 02:57:52PM +0200, Björn Töpel wrote: > >> Björn Töpel <bjorn@kernel.org> writes: > >> > >> > Andy Chiu <andy.chiu@sifive.com> writes: > >> > > >> >> We use an AUIPC+JALR pair to jump into a ftrace trampoline. Since > >> >> instruction fetch can break down to 4 byte at a time, it is impossible > >> >> to update two instructions without a race. In order to mitigate it, we > >> >> initialize the patchable entry to AUIPC + NOP4. Then, the run-time code > >> >> patching can change NOP4 to JALR to eable/disable ftrcae from a > >> > enable ftrace > >> > > >> >> function. This limits the reach of each ftrace entry to +-2KB displacing > >> >> from ftrace_caller. > >> >> > >> >> Starting from the trampoline, we add a level of indirection for it to > >> >> reach ftrace caller target. Now, it loads the target address from a > >> >> memory location, then perform the jump. This enable the kernel to update > >> >> the target atomically. > >> > > >> > The +-2K limit is for direct calls, right? > >> > > >> > ...and this I would say breaks DIRECT_CALLS (which should be implemented > >> > using call_ops later)? > >> > >> Thinking a bit more, and re-reading the series. > >> > >> This series is good work, and it's a big improvement for DYNAMIC_FTRACE, > >> but > >> > >> +int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr) > >> +{ > >> + unsigned long distance, orig_addr; > >> + > >> + orig_addr = (unsigned long)&ftrace_caller; > >> + distance = addr > orig_addr ? addr - orig_addr : orig_addr - addr; > >> + if (distance > JALR_RANGE) > >> + return -EINVAL; > >> + > >> + return __ftrace_modify_call(rec->ip, addr, false); > >> +} > >> + > >> > >> breaks WITH_DIRECT_CALLS. The direct trampoline will *never* be within > >> the JALR_RANGE. > > > > > > Yes, it is hardly possible that a direct trampoline will be within the > > range. > > > > Recently I have been thinking some solutions to address the issue. One > > solution is replaying AUIPC at function entries. The idea has two sides. > > First, if we are returning back to the second instruction at trap return, > > then do sepc -= 4 so it executes the up-to-date AUIPC. The other side is > > to fire synchronous IPI that does remote fence.i at right timings to > > prevent concurrent executing on a mix of old and new instructions. > > > > Consider replacing instructions at a function's patchable entry with the > > following sequence: > > > > Initial state: > > -------------- > > 0: AUIPC > > 4: JALR > > > > Step1: > > write(0, "J +8") > > fence w,w > > send sync local+remote fence.i > > ------------------------ > > 0: J +8 > > 4: JALR > > > > Step2: > > write(4, "JALR'") > > fence w,w > > send sync local+remote fence.i > > ------------------------ > > 0: J +8 > > 4: JALR' > > > > Step3: > > write(0, "AUIPC'") > > fence w,w > > send sync local+remote fence.i (to activate the call) > > ----------------------- > > 0: AUIPC' > > 4: JALR' > > > > The following execution sequences are acceptable: > > - AUIPC, JALR > > - J +8, (skipping {JALR | JALR'}) > > - AUIPC', JALR' > > > > And here are sequences that we want to prevent: > > - AUIPC', JALR > > - AUIPC, JALR' > > > > The local core should never execute the forbidden sequence. > > > > By listing all possible combinations of executing sequence on a remote > > core, we can find that the dangerous seqence is impossible to happen: > > > > let f be the fence.i at step 1, 2, 3. And let numbers be the location of > > code being executed. Mathematically, here are all combinations at a site > > happening on a remote core: > > > > fff04 -- updated seq > > ff0f4 -- impossible, would be ff0f04, updated seq > > ff04f -- impossible, would be ff08f, safe seq > > f0ff4 -- impossible, would be f0ff04, updated seq > > f0f4f -- impossible, would be f0f08f (safe), or f0f0f04 (updated) > > f04ff -- impossible, would be f08ff, safe seq > > 0fff4 -- impossible, would be 0fff04, updated seq > > 0ff4f -- impossible, would be 0ff08f (safe), or 0ff0f04 (updated) > > 0f4ff -- impossible, would be 0f08ff (safe), 0f0f08f (safe), 0f0f0f04 (updated) > > 04fff -- old seq > > > > After the 1st 'fence.i', remote cores should observe (J +8, JALR) or (J +8, JALR') > > After the 2nd 'fence.i', remote cores should observe (J +8, JALR') or (AUIPC', JALR') > > After the 3rd 'fence.i', remote cores should observe (AUIPC', JALR') > > > > Remote cores should never execute (AUIPC',JALR) or (AUIPC,JALR') > > > > To correctly implement the solution, the trap return code must match JALR > > and adjust sepc only for patchable function entries. This is undocumently > > possible because we use t0 as source and destination registers for JALR > > at function entries. Compiler never generates JALR that uses the same > > register pattern. > > > > Another solution is inspired by zcmt, and perhaps we can optimize it if > > the hardware does support zcmt. First, we allocate a page and divide it > > into two halves. The first half of the page are 255 x 8B destination > > addresses. Then, starting from offset 2056, the second half of the page > > is composed by a series of 2 x 4 Byte instructions: > > > > 0: ftrace_tramp_1 > > 8: ftrace_tramp_2 > > ... > > 2040: ftrace_tramp_255 > > 2048: ftrace_tramp_256 (not used when configured with 255 tramps) > > 2056: > > ld t1, -2048(t1) > > jr t1 > > ld t1, -2048(t1) > > jr t1 > > ... > > 4088: > > ld t1, -2048(t1) > > jr t1 > > 4096: > > > > It is possible to expand to 511 trampolines by adding a page > > below, and making a load+jr sequence from +2040 offset. > > > > When the kernel boots, we direct AUIPCs at patchable entries to the page, > > and disable the call by setting the second instruction to NOP4. Then, we > > can effectively enable/disable/modify a call by setting only the > > instruction at JALR. It is possible to utilize most of the current patch > > set to achieve atomic patching. A missing part is to allocate and manage > > trampolines for ftrace users. > > (I will need to digest above in detail!) > > I don't think it's a good idea to try to handle direct calls w/o > call_ops. What I was trying to say is "add the call_ops patch to your > series, so that direct calls aren't broken". If direct calls depend on > call_ops -- sure, no worries. But don't try to get direct calls W/O > call_ops. That's a whole new bag of worms. > > Some more high-level thoughts: ...all this to workaround where we don't > want the call_ops overhead? Is there really a use-case with a platform > that doesn't handle the text overhead of call_ops? Sorry for making any confusions. I have no strong personal preference on what we should do. Just want to have a technical discussion on what is possible if we want to optimize code size. > > Maybe I'm missing context here... but I'd say, let's follow what arm64 > did (but obviously w/o the BL direct call optimization, and always jump > to a trampoline -- since that's not possible with RISC-V branch length), > and just do the call_ops way. > > Then, as a second step, and if there are platforms that care, think > about a variant w/o call_ops. > > Or what I wrote in the first section: > > 1. Keep this patch set > 2. ...but add call_ops to it, and require call_ops for direct calls. > > Just my $.02. > > > Björn
Tao Chiu <andybnac@gmail.com> writes: > Björn Töpel <bjorn@kernel.org> 於 2024年9月11日 週三 下午10:37寫道: > >> >> Andy Chiu <andybnac@gmail.com> writes: >> >> > On Wed, Aug 14, 2024 at 02:57:52PM +0200, Björn Töpel wrote: >> >> Björn Töpel <bjorn@kernel.org> writes: >> >> >> >> > Andy Chiu <andy.chiu@sifive.com> writes: >> >> > >> >> >> We use an AUIPC+JALR pair to jump into a ftrace trampoline. Since >> >> >> instruction fetch can break down to 4 byte at a time, it is impossible >> >> >> to update two instructions without a race. In order to mitigate it, we >> >> >> initialize the patchable entry to AUIPC + NOP4. Then, the run-time code >> >> >> patching can change NOP4 to JALR to eable/disable ftrcae from a >> >> > enable ftrace >> >> > >> >> >> function. This limits the reach of each ftrace entry to +-2KB displacing >> >> >> from ftrace_caller. >> >> >> >> >> >> Starting from the trampoline, we add a level of indirection for it to >> >> >> reach ftrace caller target. Now, it loads the target address from a >> >> >> memory location, then perform the jump. This enable the kernel to update >> >> >> the target atomically. >> >> > >> >> > The +-2K limit is for direct calls, right? >> >> > >> >> > ...and this I would say breaks DIRECT_CALLS (which should be implemented >> >> > using call_ops later)? >> >> >> >> Thinking a bit more, and re-reading the series. >> >> >> >> This series is good work, and it's a big improvement for DYNAMIC_FTRACE, >> >> but >> >> >> >> +int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr) >> >> +{ >> >> + unsigned long distance, orig_addr; >> >> + >> >> + orig_addr = (unsigned long)&ftrace_caller; >> >> + distance = addr > orig_addr ? addr - orig_addr : orig_addr - addr; >> >> + if (distance > JALR_RANGE) >> >> + return -EINVAL; >> >> + >> >> + return __ftrace_modify_call(rec->ip, addr, false); >> >> +} >> >> + >> >> >> >> breaks WITH_DIRECT_CALLS. The direct trampoline will *never* be within >> >> the JALR_RANGE. >> > >> > >> > Yes, it is hardly possible that a direct trampoline will be within the >> > range. >> > >> > Recently I have been thinking some solutions to address the issue. One >> > solution is replaying AUIPC at function entries. The idea has two sides. >> > First, if we are returning back to the second instruction at trap return, >> > then do sepc -= 4 so it executes the up-to-date AUIPC. The other side is >> > to fire synchronous IPI that does remote fence.i at right timings to >> > prevent concurrent executing on a mix of old and new instructions. >> > >> > Consider replacing instructions at a function's patchable entry with the >> > following sequence: >> > >> > Initial state: >> > -------------- >> > 0: AUIPC >> > 4: JALR >> > >> > Step1: >> > write(0, "J +8") >> > fence w,w >> > send sync local+remote fence.i >> > ------------------------ >> > 0: J +8 >> > 4: JALR >> > >> > Step2: >> > write(4, "JALR'") >> > fence w,w >> > send sync local+remote fence.i >> > ------------------------ >> > 0: J +8 >> > 4: JALR' >> > >> > Step3: >> > write(0, "AUIPC'") >> > fence w,w >> > send sync local+remote fence.i (to activate the call) >> > ----------------------- >> > 0: AUIPC' >> > 4: JALR' >> > >> > The following execution sequences are acceptable: >> > - AUIPC, JALR >> > - J +8, (skipping {JALR | JALR'}) >> > - AUIPC', JALR' >> > >> > And here are sequences that we want to prevent: >> > - AUIPC', JALR >> > - AUIPC, JALR' >> > >> > The local core should never execute the forbidden sequence. >> > >> > By listing all possible combinations of executing sequence on a remote >> > core, we can find that the dangerous seqence is impossible to happen: >> > >> > let f be the fence.i at step 1, 2, 3. And let numbers be the location of >> > code being executed. Mathematically, here are all combinations at a site >> > happening on a remote core: >> > >> > fff04 -- updated seq >> > ff0f4 -- impossible, would be ff0f04, updated seq >> > ff04f -- impossible, would be ff08f, safe seq >> > f0ff4 -- impossible, would be f0ff04, updated seq >> > f0f4f -- impossible, would be f0f08f (safe), or f0f0f04 (updated) >> > f04ff -- impossible, would be f08ff, safe seq >> > 0fff4 -- impossible, would be 0fff04, updated seq >> > 0ff4f -- impossible, would be 0ff08f (safe), or 0ff0f04 (updated) >> > 0f4ff -- impossible, would be 0f08ff (safe), 0f0f08f (safe), 0f0f0f04 (updated) >> > 04fff -- old seq >> > >> > After the 1st 'fence.i', remote cores should observe (J +8, JALR) or (J +8, JALR') >> > After the 2nd 'fence.i', remote cores should observe (J +8, JALR') or (AUIPC', JALR') >> > After the 3rd 'fence.i', remote cores should observe (AUIPC', JALR') >> > >> > Remote cores should never execute (AUIPC',JALR) or (AUIPC,JALR') >> > >> > To correctly implement the solution, the trap return code must match JALR >> > and adjust sepc only for patchable function entries. This is undocumently >> > possible because we use t0 as source and destination registers for JALR >> > at function entries. Compiler never generates JALR that uses the same >> > register pattern. >> > >> > Another solution is inspired by zcmt, and perhaps we can optimize it if >> > the hardware does support zcmt. First, we allocate a page and divide it >> > into two halves. The first half of the page are 255 x 8B destination >> > addresses. Then, starting from offset 2056, the second half of the page >> > is composed by a series of 2 x 4 Byte instructions: >> > >> > 0: ftrace_tramp_1 >> > 8: ftrace_tramp_2 >> > ... >> > 2040: ftrace_tramp_255 >> > 2048: ftrace_tramp_256 (not used when configured with 255 tramps) >> > 2056: >> > ld t1, -2048(t1) >> > jr t1 >> > ld t1, -2048(t1) >> > jr t1 >> > ... >> > 4088: >> > ld t1, -2048(t1) >> > jr t1 >> > 4096: >> > >> > It is possible to expand to 511 trampolines by adding a page >> > below, and making a load+jr sequence from +2040 offset. >> > >> > When the kernel boots, we direct AUIPCs at patchable entries to the page, >> > and disable the call by setting the second instruction to NOP4. Then, we >> > can effectively enable/disable/modify a call by setting only the >> > instruction at JALR. It is possible to utilize most of the current patch >> > set to achieve atomic patching. A missing part is to allocate and manage >> > trampolines for ftrace users. >> >> (I will need to digest above in detail!) >> >> I don't think it's a good idea to try to handle direct calls w/o >> call_ops. What I was trying to say is "add the call_ops patch to your >> series, so that direct calls aren't broken". If direct calls depend on >> call_ops -- sure, no worries. But don't try to get direct calls W/O >> call_ops. That's a whole new bag of worms. >> >> Some more high-level thoughts: ...all this to workaround where we don't >> want the call_ops overhead? Is there really a use-case with a platform >> that doesn't handle the text overhead of call_ops? > > Sorry for making any confusions. I have no strong personal preference > on what we should do. Just want to have a technical discussion on what > is possible if we want to optimize code size. Understood! I'm -- as you know -- really eager to have a text patching mechanism that is not horrible. ;-) I'd rather wait with the text size optimizations. TL;DR -- your series is fine from my POV, but it's missing call_ops, so that it doesn't break direct calls. I will take your patch series, and Puranjay's call_ops series see how they play together. We can discuss it at LPC next week! Cheers, Björn
diff --git a/arch/riscv/include/asm/ftrace.h b/arch/riscv/include/asm/ftrace.h index 5f81c53dbfd9..7199383f8c02 100644 --- a/arch/riscv/include/asm/ftrace.h +++ b/arch/riscv/include/asm/ftrace.h @@ -81,6 +81,7 @@ struct dyn_arch_ftrace { #define JALR_T0 (0x000282e7) #define AUIPC_T0 (0x00000297) #define NOP4 (0x00000013) +#define JALR_RANGE (JALR_SIGN_MASK - 1) #define to_jalr_t0(offset) \ (((offset & JALR_OFFSET_MASK) << JALR_SHIFT) | JALR_T0) @@ -118,6 +119,9 @@ do { \ * Let auipc+jalr be the basic *mcount unit*, so we make it 8 bytes here. */ #define MCOUNT_INSN_SIZE 8 +#define MCOUNT_AUIPC_SIZE 4 +#define MCOUNT_JALR_SIZE 4 +#define MCOUNT_NOP4_SIZE 4 #ifndef __ASSEMBLY__ struct dyn_ftrace; diff --git a/arch/riscv/kernel/ftrace.c b/arch/riscv/kernel/ftrace.c index 4b95c574fd04..5ebe412280ef 100644 --- a/arch/riscv/kernel/ftrace.c +++ b/arch/riscv/kernel/ftrace.c @@ -64,42 +64,64 @@ static int ftrace_check_current_call(unsigned long hook_pos, return 0; } -static int __ftrace_modify_call(unsigned long hook_pos, unsigned long target, - bool enable, bool ra) +static int __ftrace_modify_call(unsigned long hook_pos, unsigned long target, bool validate) { unsigned int call[2]; - unsigned int nops[2] = {NOP4, NOP4}; + unsigned int replaced[2]; + + make_call_t0(hook_pos, target, call); - if (ra) - make_call_ra(hook_pos, target, call); - else - make_call_t0(hook_pos, target, call); + if (validate) { + /* + * Read the text we want to modify; + * return must be -EFAULT on read error + */ + if (copy_from_kernel_nofault(replaced, (void *)hook_pos, + MCOUNT_INSN_SIZE)) + return -EFAULT; + + if (replaced[0] != call[0]) { + pr_err("%p: expected (%08x) but got (%08x)\n", + (void *)hook_pos, call[0], replaced[0]); + return -EINVAL; + } + } - /* Replace the auipc-jalr pair at once. Return -EPERM on write error. */ - if (patch_insn_write((void *)hook_pos, enable ? call : nops, MCOUNT_INSN_SIZE)) + /* Replace the jalr at once. Return -EPERM on write error. */ + if (patch_insn_write((void *)(hook_pos + MCOUNT_AUIPC_SIZE), call + 1, MCOUNT_JALR_SIZE)) return -EPERM; return 0; } -int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr) +static int __ftrace_modify_call_site(ftrace_func_t *hook_pos, ftrace_func_t target, bool enable) { - unsigned int call[2]; + ftrace_func_t call = target; + ftrace_func_t nops = &ftrace_stub; - make_call_t0(rec->ip, addr, call); - - if (patch_insn_write((void *)rec->ip, call, MCOUNT_INSN_SIZE)) - return -EPERM; + WRITE_ONCE(*hook_pos, enable ? call : nops); return 0; } +int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr) +{ + unsigned long distance, orig_addr; + + orig_addr = (unsigned long)&ftrace_caller; + distance = addr > orig_addr ? addr - orig_addr : orig_addr - addr; + if (distance > JALR_RANGE) + return -EINVAL; + + return __ftrace_modify_call(rec->ip, addr, false); +} + int ftrace_make_nop(struct module *mod, struct dyn_ftrace *rec, unsigned long addr) { - unsigned int nops[2] = {NOP4, NOP4}; + unsigned int nops[1] = {NOP4}; - if (patch_insn_write((void *)rec->ip, nops, MCOUNT_INSN_SIZE)) + if (patch_insn_write((void *)(rec->ip + MCOUNT_AUIPC_SIZE), nops, MCOUNT_NOP4_SIZE)) return -EPERM; return 0; @@ -114,21 +136,23 @@ int ftrace_make_nop(struct module *mod, struct dyn_ftrace *rec, */ int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec) { + unsigned int nops[2]; int out; + make_call_t0(rec->ip, &ftrace_caller, nops); + nops[1] = NOP4; + mutex_lock(&text_mutex); - out = ftrace_make_nop(mod, rec, MCOUNT_ADDR); + out = patch_insn_write((void *)rec->ip, nops, MCOUNT_INSN_SIZE); mutex_unlock(&text_mutex); return out; } +ftrace_func_t ftrace_call_dest = ftrace_stub; int ftrace_update_ftrace_func(ftrace_func_t func) { - int ret = __ftrace_modify_call((unsigned long)&ftrace_call, - (unsigned long)func, true, true); - - return ret; + return __ftrace_modify_call_site(&ftrace_call_dest, func, true); } struct ftrace_modify_param { @@ -182,7 +206,7 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr, if (ret) return ret; - return __ftrace_modify_call(caller, addr, true, false); + return __ftrace_modify_call(caller, addr, true); } #endif @@ -217,17 +241,17 @@ void ftrace_graph_func(unsigned long ip, unsigned long parent_ip, prepare_ftrace_return(&fregs->ra, ip, fregs->s0); } #else /* CONFIG_DYNAMIC_FTRACE_WITH_ARGS */ -extern void ftrace_graph_call(void); +ftrace_func_t ftrace_graph_call_dest = ftrace_stub; int ftrace_enable_ftrace_graph_caller(void) { - return __ftrace_modify_call((unsigned long)&ftrace_graph_call, - (unsigned long)&prepare_ftrace_return, true, true); + return __ftrace_modify_call_site(&ftrace_graph_call_dest, + &prepare_ftrace_return, true); } int ftrace_disable_ftrace_graph_caller(void) { - return __ftrace_modify_call((unsigned long)&ftrace_graph_call, - (unsigned long)&prepare_ftrace_return, false, true); + return __ftrace_modify_call_site(&ftrace_graph_call_dest, + &prepare_ftrace_return, false); } #endif /* CONFIG_DYNAMIC_FTRACE_WITH_ARGS */ #endif /* CONFIG_DYNAMIC_FTRACE */ diff --git a/arch/riscv/kernel/mcount-dyn.S b/arch/riscv/kernel/mcount-dyn.S index e988bd26b28b..bc06e8ab81cf 100644 --- a/arch/riscv/kernel/mcount-dyn.S +++ b/arch/riscv/kernel/mcount-dyn.S @@ -162,7 +162,8 @@ SYM_FUNC_START(ftrace_caller) mv a3, sp SYM_INNER_LABEL(ftrace_call, SYM_L_GLOBAL) - call ftrace_stub + REG_L ra, ftrace_call_dest + jalr 0(ra) #ifdef CONFIG_FUNCTION_GRAPH_TRACER addi a0, sp, ABI_RA @@ -172,7 +173,8 @@ SYM_INNER_LABEL(ftrace_call, SYM_L_GLOBAL) mv a2, s0 #endif SYM_INNER_LABEL(ftrace_graph_call, SYM_L_GLOBAL) - call ftrace_stub + REG_L ra, ftrace_graph_call_dest + jalr 0(ra) #endif RESTORE_ABI jr t0 @@ -185,7 +187,8 @@ SYM_FUNC_START(ftrace_caller) PREPARE_ARGS SYM_INNER_LABEL(ftrace_call, SYM_L_GLOBAL) - call ftrace_stub + REG_L ra, ftrace_call_dest + jalr 0(ra) RESTORE_ABI_REGS bnez t1, .Ldirect
We use an AUIPC+JALR pair to jump into a ftrace trampoline. Since instruction fetch can break down to 4 byte at a time, it is impossible to update two instructions without a race. In order to mitigate it, we initialize the patchable entry to AUIPC + NOP4. Then, the run-time code patching can change NOP4 to JALR to eable/disable ftrcae from a function. This limits the reach of each ftrace entry to +-2KB displacing from ftrace_caller. Starting from the trampoline, we add a level of indirection for it to reach ftrace caller target. Now, it loads the target address from a memory location, then perform the jump. This enable the kernel to update the target atomically. The ordering of reading/updating the targert address should be guarded by generic ftrace code, where it sends smp_rmb ipi. Signed-off-by: Andy Chiu <andy.chiu@sifive.com> --- arch/riscv/include/asm/ftrace.h | 4 +++ arch/riscv/kernel/ftrace.c | 80 ++++++++++++++++++++++++++--------------- arch/riscv/kernel/mcount-dyn.S | 9 +++-- 3 files changed, 62 insertions(+), 31 deletions(-)