Message ID: 20200727043132.15082-5-ricardo.neri-calderon@linux.intel.com (mailing list archive)
State: New, archived
On July 26, 2020 9:31:32 PM PDT, Ricardo Neri <ricardo.neri-calderon@linux.intel.com> wrote:
>The SERIALIZE instruction gives software a way to force the processor to
>complete all modifications to flags, registers and memory from previous
>instructions and drain all buffered writes to memory before the next
>instruction is fetched and executed. Thus, it serves the purpose of
>sync_core(). Use it when available.
>
>Use boot_cpu_has() and not static_cpu_has(); the most critical paths
>(returning to user mode and from interrupt and NMI) will not reach
>sync_core().
>
>[...]
>
>+static inline void serialize(void)
>+{
>+	asm volatile(".byte 0xf, 0x1, 0xe8");
>+}
>
>[...]
>
>+	if (boot_cpu_has(X86_FEATURE_SERIALIZE)) {
>+		serialize();
>+		return;
>+	}
>+
> 	iret_to_self();
> }

Any reason to not make sync_core() an inline with alternatives?

For a really overengineered solution, but which might perform
unnecessarily poorly on existing hardware:

	asm volatile("1: .byte 0xf, 0x1, 0xe8; 2:"
		     _ASM_EXTABLE(1b,2b));
On July 26, 2020 10:55:15 PM PDT, hpa@zytor.com wrote:
>On July 26, 2020 9:31:32 PM PDT, Ricardo Neri <ricardo.neri-calderon@linux.intel.com> wrote:
>>[full patch quoted]
>
>Any reason to not make sync_core() an inline with alternatives?
>
>For a really overengineered solution, but which might perform
>unnecessarily poorly on existing hardware:
>
>	asm volatile("1: .byte 0xf, 0x1, 0xe8; 2:"
>		     _ASM_EXTABLE(1b,2b));

(and : : : "memory" of course.)
On Sun, Jul 26, 2020 at 09:31:32PM -0700, Ricardo Neri wrote:
> +static inline void serialize(void)
> +{
> +	asm volatile(".byte 0xf, 0x1, 0xe8");
> +}

Can we pretty please have a comment with the binutils version that has
the mnemonic? Such that when we increase the required binutils version
we can grep for this.
On Sun, Jul 26, 2020 at 09:31:32PM -0700, Ricardo Neri wrote:
> @@ -75,6 +77,12 @@ static inline void sync_core(void)
>  	 * Like all of Linux's memory ordering operations, this is a
>  	 * compiler barrier as well.
>  	 */
> +
> +	if (boot_cpu_has(X86_FEATURE_SERIALIZE)) {
> +		serialize();
> +		return;
> +	}
> +
>  	iret_to_self();

I was sorta expecting something like:

	alternative(IRET_TO_SELF, SERIALIZE, X86_FEATURE_SERIALIZE);

But instead you used boot_cpu_has() which is a dynamic test.
On Sun, Jul 26, 2020 at 10:55:15PM -0700, hpa@zytor.com wrote:
> For a really overengineered solution, but which might perform
> unnecessarily poorly on existing hardware:
>
> 	asm volatile("1: .byte 0xf, 0x1, 0xe8; 2:"
> 		     _ASM_EXTABLE(1b,2b));

Ha! cute, you take an #UD ?

We could optimize the #UD exception handler for this I suppose, but that
makes it an even worse hack. The simple alternative() seems like a much
simpler approach.
On July 27, 2020 1:20:03 AM PDT, peterz@infradead.org wrote:
>On Sun, Jul 26, 2020 at 09:31:32PM -0700, Ricardo Neri wrote:
>> +static inline void serialize(void)
>> +{
>> +	asm volatile(".byte 0xf, 0x1, 0xe8");
>> +}
>
>Can we pretty please have a comment with the binutils version that has
>the mnemonic? Such that when we increase the required binutils version
>we can grep for this.

It also needs : : : "memory" to be a barrier.
On July 27, 2020 1:36:19 AM PDT, peterz@infradead.org wrote:
>On Sun, Jul 26, 2020 at 10:55:15PM -0700, hpa@zytor.com wrote:
>> For a really overengineered solution, but which might perform
>> unnecessarily poorly on existing hardware:
>>
>> 	asm volatile("1: .byte 0xf, 0x1, 0xe8; 2:"
>> 		     _ASM_EXTABLE(1b,2b));
>
>Ha! cute, you take an #UD ?
>
>We could optimize the #UD exception handler for this I suppose, but that
>makes it an even worse hack. The simple alternative() seems like a much
>simpler approach.

If this is in any way performance critical, then no :)

Taking the #UD has the cute property that we end up IRET on the way
back, so we don't even need a fix-up path.
On Mon, Jul 27, 2020 at 05:49:28AM -0700, hpa@zytor.com wrote:
> On July 27, 2020 1:36:19 AM PDT, peterz@infradead.org wrote:
> >On Sun, Jul 26, 2020 at 10:55:15PM -0700, hpa@zytor.com wrote:
> >> For a really overengineered solution, but which might perform
> >> unnecessarily poorly on existing hardware:
> >>
> >> 	asm volatile("1: .byte 0xf, 0x1, 0xe8; 2:"
> >> 		     _ASM_EXTABLE(1b,2b));
> >
> >Ha! cute, you take an #UD ?
> >
> >We could optimize the #UD exception handler for this I suppose, but
> >that makes it an even worse hack. The simple alternative() seems like
> >a much simpler approach.
>
> If this is in any way performance critical, then no :)

Yeah, I'm not sure.. the 'funny' thing is that we typically call
sync_core() from an IPI anyway. And the synchronous broadcast IPI is by
far the most expensive part of that.

Something like this...

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 20e07feb4064..528e049ee1d9 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -989,12 +989,13 @@ void *text_poke_kgdb(void *addr, const void *opcode, size_t len)
 
 static void do_sync_core(void *info)
 {
-	sync_core();
+	/* IRET implies sync_core() */
 }
 
 void text_poke_sync(void)
 {
 	on_each_cpu(do_sync_core, NULL, 1);
+	sync_core();
 }
 
 struct text_poke_loc {

> Taking the #UD has the cute property that we end up IRET on the way
> back, so we don't even need a fix-up path.

I got that, what I had in mind was making sure #UD avoids the overhead
of doing exception entry/exit by adding an early exit.

Something like so (note: SERIALIZE encodes as 0F 01 E8):

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 8493f55e1167..a3f41d645944 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -96,6 +96,16 @@ __always_inline int is_valid_bugaddr(unsigned long addr)
 	return *(unsigned short *)addr == INSN_UD2;
 }
 
+__always_inline int handle_serialize(struct pt_regs *regs)
+{
+	const char serialize[3] = { 0x0f, 0x01, 0xe8 };
+
+	if (regs->ip < TASK_SIZE_MAX)
+		return 0;
+
+	return !memcmp((const void *)regs->ip, serialize, 3);
+}
+
 static nokprobe_inline int
 do_trap_no_signal(struct task_struct *tsk, int trapnr, const char *str,
 		  struct pt_regs *regs, long error_code)
@@ -252,8 +262,13 @@ DEFINE_IDTENTRY_RAW(exc_invalid_op)
 	 * handle it before exception entry to avoid recursive WARN
 	 * in case exception entry is the one triggering WARNs.
 	 */
-	if (!user_mode(regs) && handle_bug(regs))
-		return;
+	if (!user_mode(regs)) {
+		if (handle_bug(regs))
+			return;
+
+		if (handle_serialize(regs))
+			return;
+	}
 
 	state = idtentry_enter(regs);
 	instrumentation_begin();
On Mon, Jul 27, 2020 at 03:05:36PM +0200, peterz@infradead.org wrote:
> Yeah, I'm not sure.. the 'funny' thing is that we typically call
> sync_core() from an IPI anyway. And the synchronous broadcast IPI is
> by far the most expensive part of that.
>
> [alternative.c diff snipped]

So 'people' have wanted to optimize this for NOHZ_FULL and I suppose
virt as well.

IFF VMENTER is serializing, I suppose we can simply do something like:

	bool text_poke_cond(int cpu, void *info)
	{
		/*
		 * If we observe the vCPU is preempted, it will do VMENTER,
		 * no point in sending an IPI to SERIALIZE.
		 */
		return !vcpu_is_preempted(cpu);
	}

	void text_poke_sync(void)
	{
		smp_call_function_many_cond(cpu_possible_mask,
					    do_sync_core, NULL, 1,
					    text_poke_cond);
		sync_core();
	}

The 'same' for NOHZ_FULL, except we need to cmpxchg a value such that
if the cmpxchg() succeeds we know the CPU is in userspace and will
SERIALIZE on the next entry. Much like kvm_flush_tlb_others().

Anyway, that's all hand-wavey.. I'll let someone that cares about those
things write actual patches :-)
> For a really overengineered solution, but which might perform
> unnecessarily poorly on existing hardware:
>
> 	asm volatile("1: .byte 0xf, 0x1, 0xe8; 2:"
> 		     _ASM_EXTABLE(1b,2b));

You win the prize for the smallest code. Might need (the already large)
comment to double in size to explain the subtleties!

-Tony
On Mon, Jul 27, 2020 at 10:20:03AM +0200, peterz@infradead.org wrote:
> On Sun, Jul 26, 2020 at 09:31:32PM -0700, Ricardo Neri wrote:
> > +static inline void serialize(void)
> > +{
> > +	asm volatile(".byte 0xf, 0x1, 0xe8");
> > +}
>
> Can we pretty please have a comment with the binutils version that has
> the mnemonic? Such that when we increase the required binutils version
> we can grep for this.

This is supported since binutils 2.35[1]. I'll add a comment with the
clarification.

Thanks and BR,
Ricardo

[1]. https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob_plain;f=gas/NEWS;;hb=refs/tags/binutils-2_35
On Mon, Jul 27, 2020 at 05:47:32AM -0700, hpa@zytor.com wrote:
> On July 27, 2020 1:20:03 AM PDT, peterz@infradead.org wrote:
> >On Sun, Jul 26, 2020 at 09:31:32PM -0700, Ricardo Neri wrote:
> >> +static inline void serialize(void)
> >> +{
> >> +	asm volatile(".byte 0xf, 0x1, 0xe8");
> >> +}
> >
> >Can we pretty please have a comment with the binutils version that has
> >the mnemonic? Such that when we increase the required binutils version
> >we can grep for this.
>
> It also needs : : : "memory" to be a barrier.

Sure Peter, I will make this change.

Thanks and BR,
Ricardo
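[Editor's note: folding both review comments into the posted helper would presumably give something like the following sketch. This is an illustration of the requested changes, not the final merged code; the comment wording is ours.]

	/*
	 * Sketch of serialize() with both review comments applied:
	 * a note on the binutils version so the raw opcode can be
	 * replaced with the mnemonic later, and a "memory" clobber
	 * so the compiler treats the instruction as a barrier.
	 */
	static inline void serialize(void)
	{
		/* SERIALIZE opcode is 0F 01 E8; mnemonic needs binutils >= 2.35 */
		asm volatile(".byte 0xf, 0x1, 0xe8" ::: "memory");
	}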
On Mon, Jul 27, 2020 at 03:30:20PM +0200, peterz@infradead.org wrote:
> On Mon, Jul 27, 2020 at 03:05:36PM +0200, peterz@infradead.org wrote:
> > Yeah, I'm not sure.. the 'funny' thing is that we typically call
> > sync_core() from an IPI anyway. And the synchronous broadcast IPI is
> > by far the most expensive part of that.
> >
> > [alternative.c diff snipped]
>
> So 'people' have wanted to optimize this for NOHZ_FULL and I suppose
> virt as well.
>
> IFF VMENTER is serializing, I suppose we can simply do something like:
>
> 	bool text_poke_cond(int cpu, void *info)
> 	{
> 		/*
> 		 * If we observe the vCPU is preempted, it will do VMENTER,
> 		 * no point in sending an IPI to SERIALIZE.
> 		 */
> 		return !vcpu_is_preempted(cpu);
> 	}
>
> 	void text_poke_sync(void)
> 	{
> 		smp_call_function_many_cond(cpu_possible_mask,
> 					    do_sync_core, NULL, 1,
> 					    text_poke_cond);
> 		sync_core();
> 	}
>
> The 'same' for NOHZ_FULL, except we need to cmpxchg a value such that
> if the cmpxchg() succeeds we know the CPU is in userspace and will
> SERIALIZE on the next entry. Much like kvm_flush_tlb_others().
>
> Anyway, that's all hand-wavey.. I'll let someone that cares about those
> things write actual patches :-)

I think I got a little lost here. If I understand correctly, there are
two alternatives to implement support for serialize better:

  a) alternative(IRET_TO_SELF, SERIALIZE, X86_FEATURE_SERIALIZE); or
  b) asm volatile("1:.byte 0xf, 0x1, 0xe8;2:" _ASM_EXTABLE(1b,2b))

a) would be the traditional and simpler solution. b) would rely on
causing an #UD and getting an IRET on existing hardware. b) would need
some more optimization work when handling the exception and a few
reworks on the poke patching code.

Which option should I focus on? Which option would be more
desirable/better?

Thanks and BR,
Ricardo
On Mon, Jul 27, 2020 at 09:41:14PM -0700, Ricardo Neri wrote:
> I think I got a little lost here.

Hehe, sorry. I got carried away, it's just that recently people
expressed interest in 'fixing' some of the text_poke_sync() issues
again.

> If I understand correctly, there are two alternatives to implement
> support for serialize better:
>
>   a) alternative(IRET_TO_SELF, SERIALIZE, X86_FEATURE_SERIALIZE); or
>   b) asm volatile("1:.byte 0xf, 0x1, 0xe8;2:" _ASM_EXTABLE(1b,2b))
>
> a) would be the traditional and simpler solution. b) would rely on
> causing an #UD and getting an IRET on existing hardware. b) would need
> some more optimization work when handling the exception and a few
> reworks on the poke patching code.
>
> Which option should I focus on? Which option would be more
> desirable/better?

I'd say go with (a) for now. We can always go overboard later ;-)
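[Editor's note: for readers unfamiliar with the alternatives framework, option (a) could be sketched roughly as below. This is purely illustrative and is not the posted patch: alternative() takes raw instruction strings, so the IRET-to-self path would in practice need an asm-level routine; "sync_core_iret" here is a placeholder name, not an existing kernel symbol.]

	/*
	 * Hypothetical sketch of option (a): the alternatives framework
	 * patches the SERIALIZE opcode over the fallback at boot on CPUs
	 * that have X86_FEATURE_SERIALIZE, avoiding the runtime
	 * boot_cpu_has() test.
	 */
	static inline void sync_core(void)
	{
		alternative("call sync_core_iret",	/* old: IRET-to-self */
			    ".byte 0xf, 0x1, 0xe8",	/* new: SERIALIZE */
			    X86_FEATURE_SERIALIZE);
	}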
diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 59a3e13204c3..0a2a60bba282 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -234,6 +234,11 @@ static inline void clwb(volatile void *__p)
 
 #define nop() asm volatile ("nop")
 
+static inline void serialize(void)
+{
+	asm volatile(".byte 0xf, 0x1, 0xe8");
+}
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_SPECIAL_INSNS_H */
diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h
index fdb5b356e59b..bf132c09d61b 100644
--- a/arch/x86/include/asm/sync_core.h
+++ b/arch/x86/include/asm/sync_core.h
@@ -5,6 +5,7 @@
 #include <linux/preempt.h>
 #include <asm/processor.h>
 #include <asm/cpufeature.h>
+#include <asm/special_insns.h>
 
 #ifdef CONFIG_X86_32
 static inline void iret_to_self(void)
@@ -54,7 +55,8 @@ static inline void iret_to_self(void)
 static inline void sync_core(void)
 {
 	/*
-	 * There are quite a few ways to do this.  IRET-to-self is nice
+	 * Hardware can do this for us if SERIALIZE is available. Otherwise,
+	 * there are quite a few ways to do this.  IRET-to-self is nice
 	 * because it works on every CPU, at any CPL (so it's compatible
 	 * with paravirtualization), and it never exits to a hypervisor.
 	 * The only down sides are that it's a bit slow (it seems to be
@@ -75,6 +77,12 @@ static inline void sync_core(void)
 	 * Like all of Linux's memory ordering operations, this is a
 	 * compiler barrier as well.
 	 */
+
+	if (boot_cpu_has(X86_FEATURE_SERIALIZE)) {
+		serialize();
+		return;
+	}
+
 	iret_to_self();
 }
The SERIALIZE instruction gives software a way to force the processor to
complete all modifications to flags, registers and memory from previous
instructions and drain all buffered writes to memory before the next
instruction is fetched and executed. Thus, it serves the purpose of
sync_core(). Use it when available.

Use boot_cpu_has() and not static_cpu_has(); the most critical paths
(returning to user mode and from interrupt and NMI) will not reach
sync_core().

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Cathy Zhang <cathy.zhang@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kyung Min Park <kyung.min.park@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: linux-edac@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Tony Luck <tony.luck@intel.com>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
 arch/x86/include/asm/special_insns.h |  5 +++++
 arch/x86/include/asm/sync_core.h     | 10 +++++++++-
 2 files changed, 14 insertions(+), 1 deletion(-)