Message ID | 20250114175143.81438-26-vschneid@redhat.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | context_tracking,x86: Defer some IPIs until a user->kernel transition | expand |
On Tue, Jan 14, 2025, Valentin Schneider wrote: > text_poke_bp_batch() sends IPIs to all online CPUs to synchronize > them vs the newly patched instruction. CPUs that are executing in userspace > do not need this synchronization to happen immediately, and this is > actually harmful interference for NOHZ_FULL CPUs. ... > This leaves us with static keys and static calls. ... > @@ -2317,11 +2334,20 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries > * First step: add a int3 trap to the address that will be patched. > */ > for (i = 0; i < nr_entries; i++) { > - tp[i].old = *(u8 *)text_poke_addr(&tp[i]); > - text_poke(text_poke_addr(&tp[i]), &int3, INT3_INSN_SIZE); > + void *addr = text_poke_addr(&tp[i]); > + > + /* > + * There's no safe way to defer IPIs for patching text in > + * .noinstr, record whether there is at least one such poke. > + */ > + if (is_kernel_noinstr_text((unsigned long)addr)) > + cond = NULL; Maybe pre-check "cond", especially if multiple ranges need to be checked? I.e. if (cond && is_kernel_noinstr_text(...)) > + > + tp[i].old = *((u8 *)addr); > + text_poke(addr, &int3, INT3_INSN_SIZE); > } > > - text_poke_sync(); > + __text_poke_sync(cond); > > /* > * Second step: update all but the first byte of the patched range. ... > +/** > + * is_kernel_noinstr_text - checks if the pointer address is located in the > + * .noinstr section > + * > + * @addr: address to check > + * > + * Returns: true if the address is located in .noinstr, false otherwise. > + */ > +static inline bool is_kernel_noinstr_text(unsigned long addr) > +{ > + return addr >= (unsigned long)__noinstr_text_start && > + addr < (unsigned long)__noinstr_text_end; > +} This doesn't do the right thing for modules, which matters because KVM can be built as a module on x86, and because context tracking understands transitions to GUEST mode, i.e. CPUs that are running in a KVM guest will be treated as not being in the kernel, and thus will have IPIs deferred. If KVM uses a static key or branch between guest_state_enter_irqoff() and guest_state_exit_irqoff(), the patching code won't wait for CPUs to exit guest mode, i.e. KVM could theoretically use the wrong static path. I don't expect this to ever cause problems in practice, because patching code in KVM's VM-Enter/VM-Exit path that has *functional* implications, while CPUs are actively running guest code, would be all kinds of crazy. But I do think we should plug the hole. If this issue is unique to KVM, i.e. is not a generic problem for all modules (I assume module code generally isn't allowed in the entry path, even via NMI?), one idea would be to let KVM register its noinstr section for text poking.
On Tue, Jan 14, 2025, Valentin Schneider wrote: > static __always_inline void arch_context_tracking_work(enum ct_work work) > { > switch (work) { > - case CT_WORK_n: > - // Do work... > + case CT_WORK_SYNC: > + sync_core(); Not your bug, but serialize() needs to be __always_inline. Not sure what exactly caused it to be uninlined, but this is the obvious new usage. vmlinux.o: warning: objtool: __static_call_update_early+0x4e: call to serialize() leaves .noinstr.text section vmlinux.o: warning: objtool: ct_work_flush+0x69: call to serialize() leaves .noinstr.text section
On Tue, Jan 14, 2025, Sean Christopherson wrote: > On Tue, Jan 14, 2025, Valentin Schneider wrote: > > +/** > > + * is_kernel_noinstr_text - checks if the pointer address is located in the > > + * .noinstr section > > + * > > + * @addr: address to check > > + * > > + * Returns: true if the address is located in .noinstr, false otherwise. > > + */ > > +static inline bool is_kernel_noinstr_text(unsigned long addr) > > +{ > > + return addr >= (unsigned long)__noinstr_text_start && > > + addr < (unsigned long)__noinstr_text_end; > > +} > > This doesn't do the right thing for modules, which matters because KVM can be > built as a module on x86, and because context tracking understands transitions > to GUEST mode, i.e. CPUs that are running in a KVM guest will be treated as not > being in the kernel, and thus will have IPIs deferred. If KVM uses a static key > or branch between guest_state_enter_irqoff() and guest_state_exit_irqoff(), the > patching code won't wait for CPUs to exit guest mode, i.e. KVM could theoretically > use the wrong static path. > > I don't expect this to ever cause problems in practice, because patching code in > KVM's VM-Enter/VM-Exit path that has *functional* implications, while CPUs are > actively running guest code, would be all kinds of crazy. But I do think we > should plug the hole. > > If this issue is unique to KVM, i.e. is not a generic problem for all modules (I > assume module code generally isn't allowed in the entry path, even via NMI?), one > idea would be to let KVM register its noinstr section for text poking. Another idea would be to track which keys/branches are tagged noinstr, i.e. generate the information at compile time instead of doing lookups at runtime. The biggest downside I can think of is that it would require plumbing in the information to text_poke_bp_batch().
On 14/01/25 13:13, Sean Christopherson wrote: > On Tue, Jan 14, 2025, Valentin Schneider wrote: >> text_poke_bp_batch() sends IPIs to all online CPUs to synchronize >> them vs the newly patched instruction. CPUs that are executing in userspace >> do not need this synchronization to happen immediately, and this is >> actually harmful interference for NOHZ_FULL CPUs. > > ... > >> This leaves us with static keys and static calls. > > ... > >> @@ -2317,11 +2334,20 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries >> * First step: add a int3 trap to the address that will be patched. >> */ >> for (i = 0; i < nr_entries; i++) { >> - tp[i].old = *(u8 *)text_poke_addr(&tp[i]); >> - text_poke(text_poke_addr(&tp[i]), &int3, INT3_INSN_SIZE); >> + void *addr = text_poke_addr(&tp[i]); >> + >> + /* >> + * There's no safe way to defer IPIs for patching text in >> + * .noinstr, record whether there is at least one such poke. >> + */ >> + if (is_kernel_noinstr_text((unsigned long)addr)) >> + cond = NULL; > > Maybe pre-check "cond", especially if multiple ranges need to be checked? I.e. > > if (cond && is_kernel_noinstr_text(...)) >> + >> + tp[i].old = *((u8 *)addr); >> + text_poke(addr, &int3, INT3_INSN_SIZE); >> } >> >> - text_poke_sync(); >> + __text_poke_sync(cond); >> >> /* >> * Second step: update all but the first byte of the patched range. > > ... > >> +/** >> + * is_kernel_noinstr_text - checks if the pointer address is located in the >> + * .noinstr section >> + * >> + * @addr: address to check >> + * >> + * Returns: true if the address is located in .noinstr, false otherwise. >> + */ >> +static inline bool is_kernel_noinstr_text(unsigned long addr) >> +{ >> + return addr >= (unsigned long)__noinstr_text_start && >> + addr < (unsigned long)__noinstr_text_end; >> +} > > This doesn't do the right thing for modules, which matters because KVM can be > built as a module on x86, and because context tracking understands transitions > to GUEST mode, i.e. CPUs that are running in a KVM guest will be treated as not > being in the kernel, and thus will have IPIs deferred. If KVM uses a static key > or branch between guest_state_enter_irqoff() and guest_state_exit_irqoff(), the > patching code won't wait for CPUs to exit guest mode, i.e. KVM could theoretically > use the wrong static path. > AFAICT guest_state_{enter,exit}_irqoff() are only used in noinstr functions and thus such a static key usage should at the very least be caught and warned about by objtool - when this isn't built as a module. I never really thought about noinstr sections for modules; I can get objtool to warn about a non-noinstr allowed key being used in e.g. vmx_vcpu_enter_exit() just by feeding it the vmx.o: arch/x86/kvm/vmx/vmx.o: warning: objtool: vmx_vcpu_enter_exit.isra.0+0x0: dummykey: non-RO static key usage in noinstr ...but that requires removing a lot of code first because objtool stops earlier in its noinstr checks as it hits functions it doesn't have full information on, e.g. arch/x86/kvm/vmx/vmx.o: warning: objtool: vmx_vcpu_enter_exit+0x21c: call to __ct_user_enter() leaves .noinstr.text section __ct_user_enter() *is* noinstr, but you don't get that from just the header prototype. > I don't expect this to ever cause problems in practice, because patching code in > KVM's VM-Enter/VM-Exit path that has *functional* implications, while CPUs are > actively running guest code, would be all kinds of crazy. But I do think we > should plug the hole. > > If this issue is unique to KVM, i.e. is not a generic problem for all modules (I > assume module code generally isn't allowed in the entry path, even via NMI?), one > idea would be to let KVM register its noinstr section for text poking.
On 14/01/25 13:48, Sean Christopherson wrote: > On Tue, Jan 14, 2025, Sean Christopherson wrote: >> On Tue, Jan 14, 2025, Valentin Schneider wrote: >> > +/** >> > + * is_kernel_noinstr_text - checks if the pointer address is located in the >> > + * .noinstr section >> > + * >> > + * @addr: address to check >> > + * >> > + * Returns: true if the address is located in .noinstr, false otherwise. >> > + */ >> > +static inline bool is_kernel_noinstr_text(unsigned long addr) >> > +{ >> > + return addr >= (unsigned long)__noinstr_text_start && >> > + addr < (unsigned long)__noinstr_text_end; >> > +} >> >> This doesn't do the right thing for modules, which matters because KVM can be >> built as a module on x86, and because context tracking understands transitions >> to GUEST mode, i.e. CPUs that are running in a KVM guest will be treated as not >> being in the kernel, and thus will have IPIs deferred. If KVM uses a static key >> or branch between guest_state_enter_irqoff() and guest_state_exit_irqoff(), the >> patching code won't wait for CPUs to exit guest mode, i.e. KVM could theoretically >> use the wrong static path. >> >> I don't expect this to ever cause problems in practice, because patching code in >> KVM's VM-Enter/VM-Exit path that has *functional* implications, while CPUs are >> actively running guest code, would be all kinds of crazy. But I do think we >> should plug the hole. >> >> If this issue is unique to KVM, i.e. is not a generic problem for all modules (I >> assume module code generally isn't allowed in the entry path, even via NMI?), one >> idea would be to let KVM register its noinstr section for text poking. > > Another idea would be to track which keys/branches are tagged noinstr, i.e. generate > the information at compile time instead of doing lookups at runtime. The biggest > downside I can think of is that it would require plumbing in the information to > text_poke_bp_batch(). IIUC that's what I went for in v3: https://lore.kernel.org/lkml/20241119153502.41361-11-vschneid@redhat.com but, modules notwithstanding, simply checking if the patched instruction is in .noinstr was a lot neater.
On Fri, Jan 17, 2025, Valentin Schneider wrote: > On 14/01/25 13:13, Sean Christopherson wrote: > > On Tue, Jan 14, 2025, Valentin Schneider wrote: > >> +/** > >> + * is_kernel_noinstr_text - checks if the pointer address is located in the > >> + * .noinstr section > >> + * > >> + * @addr: address to check > >> + * > >> + * Returns: true if the address is located in .noinstr, false otherwise. > >> + */ > >> +static inline bool is_kernel_noinstr_text(unsigned long addr) > >> +{ > >> + return addr >= (unsigned long)__noinstr_text_start && > >> + addr < (unsigned long)__noinstr_text_end; > >> +} > > > > This doesn't do the right thing for modules, which matters because KVM can be > > built as a module on x86, and because context tracking understands transitions > > to GUEST mode, i.e. CPUs that are running in a KVM guest will be treated as not > > being in the kernel, and thus will have IPIs deferred. If KVM uses a static key > > or branch between guest_state_enter_irqoff() and guest_state_exit_irqoff(), the > > patching code won't wait for CPUs to exit guest mode, i.e. KVM could theoretically > > use the wrong static path. > > AFAICT guest_state_{enter,exit}_irqoff() are only used in noinstr functions > and thus such a static key usage should at the very least be caught and > warned about by objtool - when this isn't built as a module. That doesn't magically do the right thing though. If KVM is built as a module, is_kernel_noinstr_text() will get false negatives even for static keys/branches that are annotaed as NOINSTR.
diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h index 5f3b2d0977235..485b32881fde5 100644 --- a/arch/x86/include/asm/context_tracking_work.h +++ b/arch/x86/include/asm/context_tracking_work.h @@ -2,11 +2,13 @@ #ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H #define _ASM_X86_CONTEXT_TRACKING_WORK_H +#include <asm/sync_core.h> + static __always_inline void arch_context_tracking_work(enum ct_work work) { switch (work) { - case CT_WORK_n: - // Do work... + case CT_WORK_SYNC: + sync_core(); break; case CT_WORK_MAX: WARN_ON_ONCE(true); diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h index ab9e143ec9fea..9dfa46f721c1d 100644 --- a/arch/x86/include/asm/text-patching.h +++ b/arch/x86/include/asm/text-patching.h @@ -33,6 +33,7 @@ extern void apply_relocation(u8 *buf, const u8 * const instr, size_t instrlen, u */ extern void *text_poke(void *addr, const void *opcode, size_t len); extern void text_poke_sync(void); +extern void text_poke_sync_deferrable(void); extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len); extern void *text_poke_copy(void *addr, const void *opcode, size_t len); #define text_poke_copy text_poke_copy diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c index 243843e44e89d..633deea8a89cb 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -18,6 +18,7 @@ #include <linux/mmu_context.h> #include <linux/bsearch.h> #include <linux/sync_core.h> +#include <linux/context_tracking.h> #include <asm/text-patching.h> #include <asm/alternative.h> #include <asm/sections.h> @@ -2109,9 +2110,24 @@ static void do_sync_core(void *info) sync_core(); } +static bool do_sync_core_defer_cond(int cpu, void *info) +{ + return !ct_set_cpu_work(cpu, CT_WORK_SYNC); +} + +static void __text_poke_sync(smp_cond_func_t cond_func) +{ + on_each_cpu_cond(cond_func, do_sync_core, NULL, 1); +} + void text_poke_sync(void) { - on_each_cpu(do_sync_core, NULL, 1); + __text_poke_sync(NULL); +} + +void text_poke_sync_deferrable(void) +{ + __text_poke_sync(do_sync_core_defer_cond); } /* @@ -2282,6 +2298,7 @@ static int tp_vec_nr; */ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries) { + smp_cond_func_t cond = do_sync_core_defer_cond; unsigned char int3 = INT3_INSN_OPCODE; unsigned int i; int do_sync; @@ -2317,11 +2334,20 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries * First step: add a int3 trap to the address that will be patched. */ for (i = 0; i < nr_entries; i++) { - tp[i].old = *(u8 *)text_poke_addr(&tp[i]); - text_poke(text_poke_addr(&tp[i]), &int3, INT3_INSN_SIZE); + void *addr = text_poke_addr(&tp[i]); + + /* + * There's no safe way to defer IPIs for patching text in + * .noinstr, record whether there is at least one such poke. + */ + if (is_kernel_noinstr_text((unsigned long)addr)) + cond = NULL; + + tp[i].old = *((u8 *)addr); + text_poke(addr, &int3, INT3_INSN_SIZE); } - text_poke_sync(); + __text_poke_sync(cond); /* * Second step: update all but the first byte of the patched range. @@ -2383,7 +2409,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries * not necessary and we'd be safe even without it. But * better safe than sorry (plus there's not only Intel). */ - text_poke_sync(); + __text_poke_sync(cond); } /* @@ -2404,7 +2430,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries } if (do_sync) - text_poke_sync(); + __text_poke_sync(cond); /* * Remove and wait for refs to be zero. diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c index 72e6a45e7ec24..c2fd2578fd5fc 100644 --- a/arch/x86/kernel/kprobes/core.c +++ b/arch/x86/kernel/kprobes/core.c @@ -817,7 +817,7 @@ void arch_arm_kprobe(struct kprobe *p) u8 int3 = INT3_INSN_OPCODE; text_poke(p->addr, &int3, 1); - text_poke_sync(); + text_poke_sync_deferrable(); perf_event_text_poke(p->addr, &p->opcode, 1, &int3, 1); } @@ -827,7 +827,7 @@ void arch_disarm_kprobe(struct kprobe *p) perf_event_text_poke(p->addr, &int3, 1, &p->opcode, 1); text_poke(p->addr, &p->opcode, 1); - text_poke_sync(); + text_poke_sync_deferrable(); } void arch_remove_kprobe(struct kprobe *p) diff --git a/arch/x86/kernel/kprobes/opt.c b/arch/x86/kernel/kprobes/opt.c index 36d6809c6c9e1..b2ce4d9c3ba56 100644 --- a/arch/x86/kernel/kprobes/opt.c +++ b/arch/x86/kernel/kprobes/opt.c @@ -513,11 +513,11 @@ void arch_unoptimize_kprobe(struct optimized_kprobe *op) JMP32_INSN_SIZE - INT3_INSN_SIZE); text_poke(addr, new, INT3_INSN_SIZE); - text_poke_sync(); + text_poke_sync_deferrable(); text_poke(addr + INT3_INSN_SIZE, new + INT3_INSN_SIZE, JMP32_INSN_SIZE - INT3_INSN_SIZE); - text_poke_sync(); + text_poke_sync_deferrable(); perf_event_text_poke(op->kp.addr, old, JMP32_INSN_SIZE, new, JMP32_INSN_SIZE); } diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c index 8984abd91c001..acc810150b76c 100644 --- a/arch/x86/kernel/module.c +++ b/arch/x86/kernel/module.c @@ -194,7 +194,7 @@ static int write_relocate_add(Elf64_Shdr *sechdrs, write, apply); if (!early) { - text_poke_sync(); + text_poke_sync_deferrable(); mutex_unlock(&text_mutex); } diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h index c768de6f19a9a..1b3ba3084fd2f 100644 --- a/include/asm-generic/sections.h +++ b/include/asm-generic/sections.h @@ -199,6 +199,21 @@ static inline bool is_kernel_inittext(unsigned long addr) addr < (unsigned long)_einittext; } + +/** + * is_kernel_noinstr_text - checks if the pointer address is located in the + * .noinstr section + * + * @addr: address to check + * + * Returns: true if the address is located in .noinstr, false otherwise. + */ +static inline bool is_kernel_noinstr_text(unsigned long addr) +{ + return addr >= (unsigned long)__noinstr_text_start && + addr < (unsigned long)__noinstr_text_end; +} + /** * __is_kernel_text - checks if the pointer address is located in the * .text section diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h index c68245f8d77c5..931ade1dcbcc2 100644 --- a/include/linux/context_tracking_work.h +++ b/include/linux/context_tracking_work.h @@ -5,12 +5,12 @@ #include <linux/bitops.h> enum { - CT_WORK_n_OFFSET, + CT_WORK_SYNC, CT_WORK_MAX_OFFSET }; enum ct_work { - CT_WORK_n = BIT(CT_WORK_n_OFFSET), + CT_WORK_SYNC = BIT(CT_WORK_SYNC_OFFSET), CT_WORK_MAX = BIT(CT_WORK_MAX_OFFSET) };