Message ID | 20241119153502.41361-14-vschneid@redhat.com (mailing list archive) |
---|---|
State | New |
Series | context_tracking,x86: Defer some IPIs until a user->kernel transition |
On Tue, Nov 19, 2024 at 04:35:00PM +0100, Valentin Schneider wrote:

> +void noinstr __flush_tlb_all_noinstr(void)
> +{
> +	/*
> +	 * This is for invocation in early entry code that cannot be
> +	 * instrumented. A RMW to CR4 works for most cases, but relies on
> +	 * being able to flip either of the PGE or PCIDE bits. Flipping CR4.PCID
> +	 * would require also resetting CR3.PCID, so just try with CR4.PGE, else
> +	 * do the CR3 write.
> +	 *
> +	 * XXX: this gives paravirt the finger.
> +	 */
> +	if (cpu_feature_enabled(X86_FEATURE_PGE))
> +		__native_tlb_flush_global_noinstr(this_cpu_read(cpu_tlbstate.cr4));
> +	else
> +		native_flush_tlb_local_noinstr();
> +}

Urgh, so that's a lot of ugleh, and cr4 has that pinning stuff and gah.

Why not always just do the CR3 write and call it a day? That should also
work for paravirt, no? Just make the whole write_cr3 thing noinstr and
voila.
On Wed, Nov 20, 2024 at 04:22:16PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 19, 2024 at 04:35:00PM +0100, Valentin Schneider wrote:
>
> > +void noinstr __flush_tlb_all_noinstr(void)
> > +{
> > +	/*
> > +	 * This is for invocation in early entry code that cannot be
> > +	 * instrumented. A RMW to CR4 works for most cases, but relies on
> > +	 * being able to flip either of the PGE or PCIDE bits. Flipping CR4.PCID
> > +	 * would require also resetting CR3.PCID, so just try with CR4.PGE, else
> > +	 * do the CR3 write.
> > +	 *
> > +	 * XXX: this gives paravirt the finger.
> > +	 */
> > +	if (cpu_feature_enabled(X86_FEATURE_PGE))
> > +		__native_tlb_flush_global_noinstr(this_cpu_read(cpu_tlbstate.cr4));
> > +	else
> > +		native_flush_tlb_local_noinstr();
> > +}
>
> Urgh, so that's a lot of ugleh, and cr4 has that pinning stuff and gah.
>
> Why not always just do the CR3 write and call it a day? That should also
> work for paravirt, no? Just make the whole write_cr3 thing noinstr and
> voila.

Oh gawd, just having looked at xen_write_cr3() this might not be
entirely trivial to mark noinstr :/
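A minimal sketch of the CR3-only flush under discussion, for reference. The helper name is made up for illustration; native_write_cr3() and __native_read_cr3() are the existing primitives, and this reload is exactly what native_flush_tlb_local() already does:

/*
 * Re-writing CR3 with its current value invalidates every TLB entry
 * that is not marked _PAGE_GLOBAL. Illustrative name, real primitives.
 */
static noinstr void flush_tlb_local_cr3_sketch(void)
{
	native_write_cr3(__native_read_cr3());
}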
On 20/11/24 16:32, Peter Zijlstra wrote:
> On Wed, Nov 20, 2024 at 04:22:16PM +0100, Peter Zijlstra wrote:
>> On Tue, Nov 19, 2024 at 04:35:00PM +0100, Valentin Schneider wrote:
>>
>> > +void noinstr __flush_tlb_all_noinstr(void)
>> > +{
>> > +	/*
>> > +	 * This is for invocation in early entry code that cannot be
>> > +	 * instrumented. A RMW to CR4 works for most cases, but relies on
>> > +	 * being able to flip either of the PGE or PCIDE bits. Flipping CR4.PCID
>> > +	 * would require also resetting CR3.PCID, so just try with CR4.PGE, else
>> > +	 * do the CR3 write.
>> > +	 *
>> > +	 * XXX: this gives paravirt the finger.
>> > +	 */
>> > +	if (cpu_feature_enabled(X86_FEATURE_PGE))
>> > +		__native_tlb_flush_global_noinstr(this_cpu_read(cpu_tlbstate.cr4));
>> > +	else
>> > +		native_flush_tlb_local_noinstr();
>> > +}
>>
>> Urgh, so that's a lot of ugleh, and cr4 has that pinning stuff and gah.
>>
>> Why not always just do the CR3 write and call it a day? That should also
>> work for paravirt, no? Just make the whole write_cr3 thing noinstr and
>> voila.
>
> Oh gawd, just having looked at xen_write_cr3() this might not be
> entirely trivial to mark noinstr :/

... I hadn't even seen that.

AIUI the CR3 RMW is not "enough" if we have PGE enabled, because then
global pages aren't flushed.

The question becomes: what is held in global pages and do we care about
that when it comes to vmalloc()? I'm starting to think no, but this is x86,
I don't know what surprises are waiting for me.

I see e.g. ds_clear_cea() clears PTEs that can have the _PAGE_GLOBAL flag,
and it correctly uses the non-deferrable flush_tlb_kernel_range().
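A reduced, illustrative version of the ds_clear_cea() pattern mentioned above (the helper name and body are simplified for the example): code tearing down PTEs that may carry _PAGE_GLOBAL must use an immediate ranged flush, since a deferred CR3 reload would leave the stale global TLB entries intact.

/*
 * Illustrative only: after clearing possibly-global kernel PTEs, the
 * flush must not be deferred; flush_tlb_kernel_range() IPIs as needed.
 */
static void clear_global_ptes_sketch(unsigned long start, unsigned long end)
{
	/* ... PTEs in [start, end) are cleared here ... */
	flush_tlb_kernel_range(start, end);
}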
On Wed, Nov 20, 2024 at 06:24:56PM +0100, Valentin Schneider wrote:

> > Oh gawd, just having looked at xen_write_cr3() this might not be
> > entirely trivial to mark noinstr :/
>
> ... I hadn't even seen that.
>
> AIUI the CR3 RMW is not "enough" if we have PGE enabled, because then
> global pages aren't flushed.
>
> The question becomes: what is held in global pages and do we care about
> that when it comes to vmalloc()? I'm starting to think no, but this is x86,
> I don't know what surprises are waiting for me.
>
> I see e.g. ds_clear_cea() clears PTEs that can have the _PAGE_GLOBAL flag,
> and it correctly uses the non-deferrable flush_tlb_kernel_range().

I always forget what we use global pages for, dhansen might know, but
let me try and have a look.

I *think* we only have GLOBAL on kernel text, and that only sometimes.
On 11/21/24 03:12, Peter Zijlstra wrote:
>> I see e.g. ds_clear_cea() clears PTEs that can have the _PAGE_GLOBAL flag,
>> and it correctly uses the non-deferrable flush_tlb_kernel_range().
>
> I always forget what we use global pages for, dhansen might know, but
> let me try and have a look.
>
> I *think* we only have GLOBAL on kernel text, and that only sometimes.

I think you're remembering how _PAGE_GLOBAL gets used when KPTI is in play.

Ignoring KPTI for a sec... We use _PAGE_GLOBAL for all kernel mappings.
Before PCIDs, global mappings let the kernel TLB entries live across CR3
writes. When PCIDs are in play, global mappings let two different ASIDs
share TLB entries.

When KPTI is around, the kernel writes CR3 at user/kernel switches to make
sure secrets are unmapped and can't be leaked by Meltdown. But unmapping
those secrets doesn't do squat if they were mapped globally since they'll
still be in the TLB and quite usable. There, we're more judicious and only
mark performance-sensitive things that are not secret to be global, like
kernel text.
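A sketch of the policy described above, using a made-up helper (the real decisions live in the kernel's page-attribute and KPTI setup code): without KPTI, kernel mappings can all be _PAGE_GLOBAL; with KPTI, only non-secret, performance-sensitive mappings such as entry text keep it.

/*
 * Not kernel code -- just the rule in C form: global everywhere
 * without KPTI, global only for non-secret text when KPTI is on.
 */
static pteval_t kernel_global_policy(pteval_t prot, bool kpti_enabled,
				     bool nonsecret_text)
{
	if (!kpti_enabled || nonsecret_text)
		return prot | _PAGE_GLOBAL;

	return prot & ~_PAGE_GLOBAL;
}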
On Thu, Nov 21, 2024 at 07:07:44AM -0800, Dave Hansen wrote:
> On 11/21/24 03:12, Peter Zijlstra wrote:
> >> I see e.g. ds_clear_cea() clears PTEs that can have the _PAGE_GLOBAL flag,
> >> and it correctly uses the non-deferrable flush_tlb_kernel_range().
> >
> > I always forget what we use global pages for, dhansen might know, but
> > let me try and have a look.
> >
> > I *think* we only have GLOBAL on kernel text, and that only sometimes.
>
> I think you're remembering how _PAGE_GLOBAL gets used when KPTI is in play.

Yah, I suppose I am. That was the last time I had a good look at this
stuff :-)

> Ignoring KPTI for a sec... We use _PAGE_GLOBAL for all kernel mappings.
> Before PCIDs, global mappings let the kernel TLB entries live across CR3
> writes. When PCIDs are in play, global mappings let two different ASIDs
> share TLB entries.

Hurmph.. bah. That means we do need that horrible CR4 dance :/
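The "CR4 dance" is the PGE toggle that the patch below wraps in __native_tlb_flush_global(): clearing CR4.PGE flushes the entire TLB, global entries included, and writing the old value back re-enables global pages. Sketched here with a hypothetical helper name:

/*
 * Toggling CR4.PGE off and back on is the architectural way to flush
 * global TLB entries without INVPCID.
 */
static inline void cr4_pge_dance_sketch(unsigned long cr4)
{
	native_write_cr4(cr4 ^ X86_CR4_PGE);	/* PGE cleared: full flush */
	native_write_cr4(cr4);			/* restore PGE */
}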
diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
index 2c66687ce00e2..9d4f021b5a45b 100644
--- a/arch/x86/include/asm/context_tracking_work.h
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -3,6 +3,7 @@
 #define _ASM_X86_CONTEXT_TRACKING_WORK_H
 
 #include <asm/sync_core.h>
+#include <asm/tlbflush.h>
 
 static __always_inline void arch_context_tracking_work(int work)
 {
@@ -10,6 +11,9 @@ static __always_inline void arch_context_tracking_work(int work)
 	case CONTEXT_WORK_SYNC:
 		sync_core();
 		break;
+	case CONTEXT_WORK_TLBI:
+		__flush_tlb_all_noinstr();
+		break;
 	}
 }
 
diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index aec6e2d3aa1d5..b97157a69d48e 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -74,6 +74,7 @@ static inline unsigned long native_read_cr4(void)
 	return val;
 }
 
+void native_write_cr4_noinstr(unsigned long val);
 void native_write_cr4(unsigned long val);
 
 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 69e79fff41b80..a653b5f47f0e6 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -18,6 +18,7 @@ DECLARE_PER_CPU(u64, tlbstate_untag_mask);
 
 void __flush_tlb_all(void);
+void noinstr __flush_tlb_all_noinstr(void);
 
 #define TLB_FLUSH_ALL			-1UL
 #define TLB_GENERATION_INVALID		0
 
@@ -418,9 +419,20 @@ static inline void cpu_tlbstate_update_lam(unsigned long lam, u64 untag_mask)
 #endif
 #endif /* !MODULE */
 
+#define __NATIVE_TLB_FLUSH_GLOBAL(suffix, cr4)		\
+	native_write_cr4##suffix(cr4 ^ X86_CR4_PGE);	\
+	native_write_cr4##suffix(cr4)
+#define NATIVE_TLB_FLUSH_GLOBAL(cr4) __NATIVE_TLB_FLUSH_GLOBAL(, cr4)
+#define NATIVE_TLB_FLUSH_GLOBAL_NOINSTR(cr4) __NATIVE_TLB_FLUSH_GLOBAL(_noinstr, cr4)
+
 static inline void __native_tlb_flush_global(unsigned long cr4)
 {
-	native_write_cr4(cr4 ^ X86_CR4_PGE);
-	native_write_cr4(cr4);
+	NATIVE_TLB_FLUSH_GLOBAL(cr4);
 }
+
+static inline void __native_tlb_flush_global_noinstr(unsigned long cr4)
+{
+	NATIVE_TLB_FLUSH_GLOBAL_NOINSTR(cr4);
+}
+
 #endif /* _ASM_X86_TLBFLUSH_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index a5f221ea56888..a84bb8511650b 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -424,7 +424,7 @@ void native_write_cr0(unsigned long val)
 }
 EXPORT_SYMBOL(native_write_cr0);
 
-void __no_profile native_write_cr4(unsigned long val)
+noinstr void native_write_cr4_noinstr(unsigned long val)
 {
 	unsigned long bits_changed = 0;
 
@@ -442,6 +442,10 @@ void __no_profile native_write_cr4(unsigned long val)
 			  bits_changed);
 	}
 }
+void native_write_cr4(unsigned long val)
+{
+	native_write_cr4_noinstr(val);
+}
 #if IS_MODULE(CONFIG_LKDTM)
 EXPORT_SYMBOL_GPL(native_write_cr4);
 #endif
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 86593d1b787d8..973a4ab3f53b3 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -256,7 +256,7 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
  *
  * See SWITCH_TO_USER_CR3.
  */
-static inline void invalidate_user_asid(u16 asid)
+static __always_inline void invalidate_user_asid(u16 asid)
 {
 	/* There is no user ASID if address space separation is off */
 	if (!IS_ENABLED(CONFIG_MITIGATION_PAGE_TABLE_ISOLATION))
@@ -1198,7 +1198,7 @@ STATIC_NOPV void native_flush_tlb_global(void)
 /*
  * Flush the entire current user mapping
  */
-STATIC_NOPV void native_flush_tlb_local(void)
+static noinstr void native_flush_tlb_local_noinstr(void)
 {
 	/*
 	 * Preemption or interrupts must be disabled to protect the access
@@ -1213,6 +1213,11 @@ STATIC_NOPV void native_flush_tlb_local(void)
 	native_write_cr3(__native_read_cr3());
 }
 
+STATIC_NOPV void native_flush_tlb_local(void)
+{
+	native_flush_tlb_local_noinstr();
+}
+
 void flush_tlb_local(void)
 {
 	__flush_tlb_local();
@@ -1240,6 +1245,23 @@ void __flush_tlb_all(void)
 }
 EXPORT_SYMBOL_GPL(__flush_tlb_all);
 
+void noinstr __flush_tlb_all_noinstr(void)
+{
+	/*
+	 * This is for invocation in early entry code that cannot be
+	 * instrumented. A RMW to CR4 works for most cases, but relies on
+	 * being able to flip either of the PGE or PCIDE bits. Flipping CR4.PCID
+	 * would require also resetting CR3.PCID, so just try with CR4.PGE, else
+	 * do the CR3 write.
+	 *
+	 * XXX: this gives paravirt the finger.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_PGE))
+		__native_tlb_flush_global_noinstr(this_cpu_read(cpu_tlbstate.cr4));
+	else
+		native_flush_tlb_local_noinstr();
+}
+
 void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 {
 	struct flush_tlb_info *info;
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
index 13fc97b395030..47d5ced39a43a 100644
--- a/include/linux/context_tracking_work.h
+++ b/include/linux/context_tracking_work.h
@@ -6,11 +6,13 @@
 
 enum {
 	CONTEXT_WORK_SYNC_OFFSET,
+	CONTEXT_WORK_TLBI_OFFSET,
 	CONTEXT_WORK_MAX_OFFSET
 };
 
 enum ct_work {
 	CONTEXT_WORK_SYNC	= BIT(CONTEXT_WORK_SYNC_OFFSET),
+	CONTEXT_WORK_TLBI	= BIT(CONTEXT_WORK_TLBI_OFFSET),
 	CONTEXT_WORK_MAX	= BIT(CONTEXT_WORK_MAX_OFFSET)
 };
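A hypothetical sketch of how the deferred bit is consumed (the actual plumbing lives in the context-tracking patches earlier in this series; ct_work_flush_sketch() and its 'work' argument are assumptions for illustration): on the next user->kernel transition, each pending bit is dispatched through the arch_context_tracking_work() switch added above.

/*
 * Hypothetical consumer: 'work' holds the deferred-work bits that
 * accumulated while the CPU ran in userspace; each set bit is handled
 * on kernel entry, CONTEXT_WORK_TLBI included.
 */
static __always_inline void ct_work_flush_sketch(unsigned long work)
{
	int bit;

	for_each_set_bit(bit, &work, CONTEXT_WORK_MAX_OFFSET)
		arch_context_tracking_work(BIT(bit));
}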
Kernel TLB invalidation IPIs are a common source of interference on
NOHZ_FULL CPUs. Given NOHZ_FULL CPUs executing in userspace are not
accessing any kernel addresses, these invalidations do not need to happen
immediately, and can be deferred until the next user->kernel transition.

Add a minimal, noinstr-compliant variant of __flush_tlb_all() that doesn't
try to leverage INVPCID. To achieve this:
o Add a noinstr variant of native_write_cr4(), keeping native_write_cr4()
  as "only" __no_profile.
o Make invalidate_user_asid() __always_inline

XXX: what about paravirt? Should these be instead new operations made
available through pv_ops.mmu.*? x86_64_start_kernel() uses
__native_tlb_flush_global() regardless of paravirt, so I'm thinking the
paravirt variants are more optimizations than hard requirements?

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/include/asm/context_tracking_work.h |  4 +++
 arch/x86/include/asm/special_insns.h         |  1 +
 arch/x86/include/asm/tlbflush.h              | 16 ++++++++++--
 arch/x86/kernel/cpu/common.c                 |  6 ++++-
 arch/x86/mm/tlb.c                            | 26 ++++++++++++++++++--
 include/linux/context_tracking_work.h        |  2 ++
 6 files changed, 50 insertions(+), 5 deletions(-)