Message ID | 58656398-2d64-48b8-9ddc-c6836847a586@suse.com (mailing list archive) |
---|---|
State | New, archived |
Series | x86: serializing of non-serializing MSR writes |
On 28/02/2024 2:48 pm, Jan Beulich wrote:
> Linux commit 25a068b8e9a4e ("x86/apic: Add extra serialization for non-serializing MSRs") explains why an MFENCE+LFENCE pair is generally needed ahead of ICR writes in x2APIC mode, and also why at least in theory such is also needed ahead of TSC_DEADLINE writes. A comment of our own in send_IPI_mask_x2apic_phys() further explains a condition under which the LFENCE can be avoided.
>
> Further Linux commit 04c3024560d3 ("x86/barrier: Do not serialize MSR accesses on AMD") explains that this barrier isn't needed on AMD or Hygon, and is in fact hampering performance in a measurable way.
>
> Introduce a similarly named helper function, but with a parameter allowing callers to specify whether a memory access will follow, thus permitting the LFENCE to be omitted.
>
> Putting an instance in apic_wait_icr_idle() is to be on the safe side. The one case where it was clearly missing is in send_IPI_shortcut(), which is also used in x2APIC mode when called from send_IPI_mask().
>
> Function comment shamelessly borrowed (but adapted) from Linux.
>
> Fixes: 5500d265a2a8 ("x86/smp: use APIC ALLBUT destination shorthand when possible")
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> I question the need for a fence ahead of writing TSC_DEADLINE: The Linux commit message talks about LVT being accessed via MMIO in xAPIC mode, but that should not be relevant here: It's all the local CPU, so there ought to not be visibility concerns (much like for a self-IPI no fence is needed ahead of the ICR write). If that wasn't needed, we could further use alternatives patching to remove the fence also from apic_wait_icr_idle() when in xAPIC mode. (And only then would I agree to have APIC in the feature identifier, like Linux has it.)
>
> A number of apic_write() may better be turned into apic_mem_write(), in particular e.g. the ones in send_IPI_mask_{flat,phys}(). That way it would be quite a bit easier to spot paths taken only in xAPIC mode.
>
> The INIT-INIT-SIPI sequence for AP startup doesn't use any barrier, also not in Linux afaics. I can't explain the lack thereof, though.

I have some opinions about what Linux did here... I don't think a single vendor/arch-neutral helper can possibly be right.

It is vendor and uarch dependent which WRMSRs transparently degrade to WRMSRNS, and it is vendor dependent which serialising sequence to use (if any). e.g. AMD have recently (Zen2 uarch I believe) retroactively defined FS/GS_BASE to be non-serialising. (And this is another CPUID bit we need to OR backwards in time.)

Furthermore, IIRC AMD still owe us an update to the APM; the APM currently says that a serialising sequence is needed for ICR. I'm told this isn't actually true, but I'm also very wary of making an adjustment which is directly contradicted by the docs.

The Linux change explains why in principle the IPI can be emitted before the stores are visible.

This does actually explain TSC_DEADLINE too. Setting a deadline in the past gets you an interrupt immediately, and if you combine that with a WRMSR being ordered ahead of an MFENCE, then causality is violated. You'll take the interrupt on whichever instruction boundary has most recently retired, which can be the wrong side of the WRMSR triggering the interrupt, at which point you'll livelock taking timer interrupts and re-arming the timer in the past.

Now, for fencing, things are more complicated. AMD define MFENCE as architecturally serialising. Intel do not, hence why apparently a WRMSR can possibly move across it. The LFENCE is added for its new speculative property of dispatch serialising.

We don't actually care about architecturally serialising. If someone is interacting with these MSRs with relevant data in WC memory, then they get to keep all resulting pieces.

However, we do care about plain stores, and for that we do need an MFENCE;LFENCE on Intel. (The jury is out on whether a single SERIALIZE[sic] would be better, but it should have the correct semantics architecturally speaking.)

In particular, ...

>
> --- a/xen/arch/x86/apic.c
> +++ b/xen/arch/x86/apic.c
> @@ -1309,6 +1309,12 @@ int reprogram_timer(s_time_t timeout)
>
>      if ( tdt_enabled )
>      {
> +        /*
> +         * WRMSR to TSC_DEADLINE is not serializing. We could pass @timeout
> +         * here, but the passed value is preferably compile-time-constant.
> +         */
> +        weak_wrmsr_fence(false);
> +
>          wrmsrl(MSR_IA32_TSC_DEADLINE, timeout ? stime2tsc(timeout) : 0);
>          return 1;
>      }
> --- a/xen/arch/x86/cpu/amd.c
> +++ b/xen/arch/x86/cpu/amd.c
> @@ -675,8 +675,12 @@ void amd_log_freq(const struct cpuinfo_x
>
>  void cf_check early_init_amd(struct cpuinfo_x86 *c)
>  {
> -        if (c == &boot_cpu_data)
> +        if (c == &boot_cpu_data) {
> +                /* No fencing needed ahead of certain MSR writes. */
> +                setup_force_cpu_cap(X86_FEATURE_NO_WRMSR_FENCE);
> +
>                  amd_init_levelling();
> +        }
>
>          ctxt_switch_levelling(NULL);
>  }
> --- a/xen/arch/x86/genapic/x2apic.c
> +++ b/xen/arch/x86/genapic/x2apic.c
> @@ -97,15 +97,15 @@ static void cf_check send_IPI_mask_x2api
>
>      /*
>       * Ensure that any synchronisation data written in program order by this
> -     * CPU is seen by notified remote CPUs. The WRMSR contained within
> -     * apic_icr_write() can otherwise be executed early.
> +     * CPU is seen by notified remote CPUs. The WRMSR contained in the loop
> +     * below can otherwise be executed early.
>       *
> -     * The reason smp_mb() is sufficient here is subtle: the register arguments
> +     * The reason MFENCE is sufficient here is subtle: the register arguments
>       * to WRMSR must depend on a memory read executed after the barrier. This
>       * is guaranteed by cpu_physical_id(), which reads from a global array (and
>       * so cannot be hoisted above the barrier even by a clever compiler).
>       */
> -    smp_mb();
> +    weak_wrmsr_fence(true);
>
>      local_irq_save(flags);
>
> @@ -130,7 +130,7 @@ static void cf_check send_IPI_mask_x2api
>      const cpumask_t *cluster_cpus;
>      unsigned long flags;
>
> -    smp_mb(); /* See above for an explanation. */
> +    weak_wrmsr_fence(true); /* See above for an explanation. */
>
>      local_irq_save(flags);
>
> --- a/xen/arch/x86/include/asm/cpufeatures.h
> +++ b/xen/arch/x86/include/asm/cpufeatures.h
> @@ -24,7 +24,7 @@ XEN_CPUFEATURE(APERFMPERF, X86_SY
>  XEN_CPUFEATURE(MFENCE_RDTSC, X86_SYNTH( 9)) /* MFENCE synchronizes RDTSC */
>  XEN_CPUFEATURE(XEN_SMEP, X86_SYNTH(10)) /* SMEP gets used by Xen itself */
>  XEN_CPUFEATURE(XEN_SMAP, X86_SYNTH(11)) /* SMAP gets used by Xen itself */
> -/* Bit 12 unused. */
> +XEN_CPUFEATURE(NO_WRMSR_FENCE, X86_SYNTH(12)) /* No MFENCE{,+LFENCE} ahead of certain WRMSR. */
>  XEN_CPUFEATURE(IND_THUNK_LFENCE, X86_SYNTH(13)) /* Use IND_THUNK_LFENCE */
>  XEN_CPUFEATURE(IND_THUNK_JMP, X86_SYNTH(14)) /* Use IND_THUNK_JMP */
>  XEN_CPUFEATURE(SC_NO_BRANCH_HARDEN, X86_SYNTH(15)) /* (Disable) Conditional branch hardening */
> --- a/xen/arch/x86/include/asm/msr.h
> +++ b/xen/arch/x86/include/asm/msr.h
> @@ -97,6 +97,25 @@ static inline void msr_split(struct cpu_
>      regs->rax = (uint32_t)val;
>  }
>
> +/*
> + * Make previous memory operations globally visible before a WRMSR. Most
> + * WRMSRs are full serializing instructions themselves and do not require this
> + * barrier. This may only be required for the TSC_DEADLINE and x2APIC MSRs.
> + *
> + * MFENCE makes writes visible, but only affects load/store instructions.
> + * WRMSR is unfortunately not a load/store instruction and is unaffected by
> + * MFENCE.

[ On Intel. AMD didn't end up with this (mis)behaviour. ]

> The LFENCE ensures that the WRMSR is not reordered, but callers
> + * can indicate to avoid it when they have a suitable memory access between
> + * the invocation of this function and the WRMSR in question.

... this makes no sense.

We need the LFENCE for dispatch serialising properties, not its load ordering properties. What use will other memory have, when the entire problem is that WRMSR doesn't interact with them?

Worse, it's a Spectre-v1 gadget and we're now acutely familiar with how the CPU will find its way around these. So even expressing "I critically need the LFENCE" still gets you pot luck on whether it has any effect against a causality-violating WRMSR.

~Andrew

> + */
> +static inline void weak_wrmsr_fence(bool have_mem_access)
> +{
> +    alternative("mfence", "", X86_FEATURE_NO_WRMSR_FENCE);
> +
> +    if ( !have_mem_access )
> +        alternative("lfence", "", X86_FEATURE_NO_WRMSR_FENCE);
> +}
> +
>  static inline uint64_t rdtsc(void)
>  {
>      uint32_t low, high;
> --- a/xen/arch/x86/smp.c
> +++ b/xen/arch/x86/smp.c
> @@ -39,7 +39,10 @@ static unsigned int prepare_ICR2(unsigne
>  void apic_wait_icr_idle(void)
>  {
>      if ( x2apic_enabled )
> +    {
> +        weak_wrmsr_fence(false);
>          return;
> +    }
>
>      while ( apic_read(APIC_ICR) & APIC_ICR_BUSY )
>          cpu_relax();
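To make the fencing trade-off discussed above concrete, here is a minimal, illustrative C sketch (not part of the patch or the thread) of the instruction sequences the proposed weak_wrmsr_fence() is expected to reduce to after alternatives patching, assuming X86_FEATURE_NO_WRMSR_FENCE ends up set on AMD/Hygon and clear on Intel; the runtime flag merely stands in for the patched alternative():

#include <stdbool.h>

/*
 * Illustrative only: the effective sequences weak_wrmsr_fence() would
 * produce once alternatives patching has run, assuming the synthetic
 * feature bit is forced on AMD/Hygon and absent on Intel.
 */
static inline void weak_wrmsr_fence_effect(bool have_mem_access, bool no_fence_needed)
{
    if ( no_fence_needed )
        return;                                /* AMD/Hygon: both fences patched out */

    /* Intel: make prior stores globally visible before the WRMSR. */
    asm volatile ( "mfence" ::: "memory" );

    /*
     * Intel, and no dependent memory access sits between the fence and the
     * WRMSR: additionally dispatch-serialise so the WRMSR cannot execute
     * ahead of the MFENCE.
     */
    if ( !have_mem_access )
        asm volatile ( "lfence" ::: "memory" );
}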
On 29.02.2024 02:10, Andrew Cooper wrote:
> On 28/02/2024 2:48 pm, Jan Beulich wrote:
>> Linux commit 25a068b8e9a4e ("x86/apic: Add extra serialization for non-serializing MSRs") explains why an MFENCE+LFENCE pair is generally needed ahead of ICR writes in x2APIC mode, and also why at least in theory such is also needed ahead of TSC_DEADLINE writes. A comment of our own in send_IPI_mask_x2apic_phys() further explains a condition under which the LFENCE can be avoided.
>>
>> Further Linux commit 04c3024560d3 ("x86/barrier: Do not serialize MSR accesses on AMD") explains that this barrier isn't needed on AMD or Hygon, and is in fact hampering performance in a measurable way.
>>
>> Introduce a similarly named helper function, but with a parameter allowing callers to specify whether a memory access will follow, thus permitting the LFENCE to be omitted.
>>
>> Putting an instance in apic_wait_icr_idle() is to be on the safe side. The one case where it was clearly missing is in send_IPI_shortcut(), which is also used in x2APIC mode when called from send_IPI_mask().
>>
>> Function comment shamelessly borrowed (but adapted) from Linux.
>>
>> Fixes: 5500d265a2a8 ("x86/smp: use APIC ALLBUT destination shorthand when possible")
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> I question the need for a fence ahead of writing TSC_DEADLINE: The Linux commit message talks about LVT being accessed via MMIO in xAPIC mode, but that should not be relevant here: It's all the local CPU, so there ought to not be visibility concerns (much like for a self-IPI no fence is needed ahead of the ICR write). If that wasn't needed, we could further use alternatives patching to remove the fence also from apic_wait_icr_idle() when in xAPIC mode. (And only then would I agree to have APIC in the feature identifier, like Linux has it.)
>>
>> A number of apic_write() may better be turned into apic_mem_write(), in particular e.g. the ones in send_IPI_mask_{flat,phys}(). That way it would be quite a bit easier to spot paths taken only in xAPIC mode.
>>
>> The INIT-INIT-SIPI sequence for AP startup doesn't use any barrier, also not in Linux afaics. I can't explain the lack thereof, though.
>
> I have some opinions about what Linux did here... I don't think a single vendor/arch-neutral helper can possibly be right.
>
> It is vendor and uarch dependent which WRMSRs transparently degrade to WRMSRNS, and it is vendor dependent which serialising sequence to use (if any). e.g. AMD have recently (Zen2 uarch I believe) retroactively defined FS/GS_BASE to be non-serialising. (And this is another CPUID bit we need to OR backwards in time.)
>
> Furthermore, IIRC AMD still owe us an update to the APM; the APM currently says that a serialising sequence is needed for ICR. I'm told this isn't actually true, but I'm also very wary of making an adjustment which is directly contradicted by the docs.

I can see you wanting the doc to be corrected. What I'm having trouble with is you having indicated (long ago) that we can avoid this fence on AMD, just to now effectively object to me (finally) getting around to actually doing so?

> The Linux change explains why in principle the IPI can be emitted before the stores are visible.
>
> This does actually explain TSC_DEADLINE too. Setting a deadline in the past gets you an interrupt immediately, and if you combine that with a WRMSR being ordered ahead of an MFENCE, then causality is violated. You'll take the interrupt on whichever instruction boundary has most recently retired, which can be the wrong side of the WRMSR triggering the interrupt, at which point you'll livelock taking timer interrupts and re-arming the timer in the past.

Are you saying the interrupt is raised ahead of the insn retiring? That would be concerning, imo.

Irrespective of that, as to live-locking: If that really were a possible issue, moving lapic_timer_on() ahead of local_irq_enable() in acpi_processor_idle() and mwait_idle() would avoid that; the sole other use of reprogram_timer() already runs with IRQs off.

> Now, for fencing, things are more complicated. AMD define MFENCE as architecturally serialising. Intel do not, hence why apparently a WRMSR can possibly move across it. The LFENCE is added for its new speculative property of dispatch serialising.
>
> We don't actually care about architecturally serialising. If someone is interacting with these MSRs with relevant data in WC memory, then they get to keep all resulting pieces.
>
> However, we do care about plain stores, and for that we do need an MFENCE;LFENCE on Intel. (The jury is out on whether a single SERIALIZE[sic] would be better, but it should have the correct semantics architecturally speaking.)
>
> In particular, ...
>
>> --- a/xen/arch/x86/genapic/x2apic.c
>> +++ b/xen/arch/x86/genapic/x2apic.c
>> @@ -97,15 +97,15 @@ static void cf_check send_IPI_mask_x2api
>>
>>      /*
>>       * Ensure that any synchronisation data written in program order by this
>> -     * CPU is seen by notified remote CPUs. The WRMSR contained within
>> -     * apic_icr_write() can otherwise be executed early.
>> +     * CPU is seen by notified remote CPUs. The WRMSR contained in the loop
>> +     * below can otherwise be executed early.
>>       *
>> -     * The reason smp_mb() is sufficient here is subtle: the register arguments
>> +     * The reason MFENCE is sufficient here is subtle: the register arguments
>>       * to WRMSR must depend on a memory read executed after the barrier. This
>>       * is guaranteed by cpu_physical_id(), which reads from a global array (and
>>       * so cannot be hoisted above the barrier even by a clever compiler).
>>       */
>> -    smp_mb();
>> +    weak_wrmsr_fence(true);
>>
>>      local_irq_save(flags);
>>
>> @@ -130,7 +130,7 @@ static void cf_check send_IPI_mask_x2api
>>      const cpumask_t *cluster_cpus;
>>      unsigned long flags;
>>
>> -    smp_mb(); /* See above for an explanation. */
>> +    weak_wrmsr_fence(true); /* See above for an explanation. */
>>
>>      local_irq_save(flags);
>>
>> --- a/xen/arch/x86/include/asm/cpufeatures.h
>> +++ b/xen/arch/x86/include/asm/cpufeatures.h
>> @@ -24,7 +24,7 @@ XEN_CPUFEATURE(APERFMPERF, X86_SY
>>  XEN_CPUFEATURE(MFENCE_RDTSC, X86_SYNTH( 9)) /* MFENCE synchronizes RDTSC */
>>  XEN_CPUFEATURE(XEN_SMEP, X86_SYNTH(10)) /* SMEP gets used by Xen itself */
>>  XEN_CPUFEATURE(XEN_SMAP, X86_SYNTH(11)) /* SMAP gets used by Xen itself */
>> -/* Bit 12 unused. */
>> +XEN_CPUFEATURE(NO_WRMSR_FENCE, X86_SYNTH(12)) /* No MFENCE{,+LFENCE} ahead of certain WRMSR. */
>>  XEN_CPUFEATURE(IND_THUNK_LFENCE, X86_SYNTH(13)) /* Use IND_THUNK_LFENCE */
>>  XEN_CPUFEATURE(IND_THUNK_JMP, X86_SYNTH(14)) /* Use IND_THUNK_JMP */
>>  XEN_CPUFEATURE(SC_NO_BRANCH_HARDEN, X86_SYNTH(15)) /* (Disable) Conditional branch hardening */
>> --- a/xen/arch/x86/include/asm/msr.h
>> +++ b/xen/arch/x86/include/asm/msr.h
>> @@ -97,6 +97,25 @@ static inline void msr_split(struct cpu_
>>      regs->rax = (uint32_t)val;
>>  }
>>
>> +/*
>> + * Make previous memory operations globally visible before a WRMSR. Most
>> + * WRMSRs are full serializing instructions themselves and do not require this
>> + * barrier. This may only be required for the TSC_DEADLINE and x2APIC MSRs.
>> + *
>> + * MFENCE makes writes visible, but only affects load/store instructions.
>> + * WRMSR is unfortunately not a load/store instruction and is unaffected by
>> + * MFENCE.
>
> [ On Intel. AMD didn't end up with this (mis)behaviour. ]
>
>> The LFENCE ensures that the WRMSR is not reordered, but callers
>> + * can indicate to avoid it when they have a suitable memory access between
>> + * the invocation of this function and the WRMSR in question.
>
> ... this makes no sense.
>
> We need the LFENCE for dispatch serialising properties, not its load ordering properties. What use will other memory have, when the entire problem is that WRMSR doesn't interact with them?

Are you suggesting the comment (and code) in send_IPI_mask_x2apic_*() (left visible in context further up) are wrong then? I consider it correct (looking forward to seeing you prove it wrong), and with that having a way to avoid the LFENCE looks correct to me.

Plus the comment here doesn't say "load ordering" anywhere. It's strictly execution ordering, guaranteed by a memory access the WRMSR input is dependent upon. For load ordering, MFENCE alone would be enough.

> Worse, it's a Spectre-v1 gadget and we're now acutely familiar with how the CPU will find its way around these. So even expressing "I critically need the LFENCE" still gets you pot luck on whether it has any effect against a causality-violating WRMSR.

Hmm, besides me possibly taking this as "drop this patch" (which could do with making explicit, if that was meant), I'm afraid I can't view this remark as actionable in any way. Yet I firmly expect an IPI to not be raised speculatively (and, as said above, nor an LAPIC timer interrupt), so I'm even having trouble seeing how this would form a Spectre-v1 gadget.

Jan
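The re-ordering Jan proposes for the idle paths can be sketched as follows. This is a hypothetical illustration only, assuming the current code enables IRQs before re-arming the timer; idle_exit_sketch() is an invented name standing in for the relevant tail of acpi_processor_idle()/mwait_idle(), not their actual structure:

/* Declarations for the Xen functions referred to above (illustrative). */
void lapic_timer_on(void);
void local_irq_enable(void);

/*
 * Hypothetical idle-exit ordering per the suggestion above: re-arm the
 * deadline timer while IRQs are still masked, so that a deadline already
 * in the past is delivered only once interrupts are re-enabled, on a
 * well-defined instruction boundary, instead of racing with the enable.
 */
static void idle_exit_sketch(void)
{
    lapic_timer_on();      /* re-arm TSC_DEADLINE with IRQs still off */
    local_irq_enable();    /* any pending timer interrupt is taken here */
}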
--- a/xen/arch/x86/apic.c
+++ b/xen/arch/x86/apic.c
@@ -1309,6 +1309,12 @@ int reprogram_timer(s_time_t timeout)
 
     if ( tdt_enabled )
     {
+        /*
+         * WRMSR to TSC_DEADLINE is not serializing. We could pass @timeout
+         * here, but the passed value is preferably compile-time-constant.
+         */
+        weak_wrmsr_fence(false);
+
         wrmsrl(MSR_IA32_TSC_DEADLINE, timeout ? stime2tsc(timeout) : 0);
         return 1;
     }
--- a/xen/arch/x86/cpu/amd.c
+++ b/xen/arch/x86/cpu/amd.c
@@ -675,8 +675,12 @@ void amd_log_freq(const struct cpuinfo_x
 
 void cf_check early_init_amd(struct cpuinfo_x86 *c)
 {
-        if (c == &boot_cpu_data)
+        if (c == &boot_cpu_data) {
+                /* No fencing needed ahead of certain MSR writes. */
+                setup_force_cpu_cap(X86_FEATURE_NO_WRMSR_FENCE);
+
                 amd_init_levelling();
+        }
 
         ctxt_switch_levelling(NULL);
 }
--- a/xen/arch/x86/genapic/x2apic.c
+++ b/xen/arch/x86/genapic/x2apic.c
@@ -97,15 +97,15 @@ static void cf_check send_IPI_mask_x2api
 
     /*
      * Ensure that any synchronisation data written in program order by this
-     * CPU is seen by notified remote CPUs. The WRMSR contained within
-     * apic_icr_write() can otherwise be executed early.
+     * CPU is seen by notified remote CPUs. The WRMSR contained in the loop
+     * below can otherwise be executed early.
      *
-     * The reason smp_mb() is sufficient here is subtle: the register arguments
+     * The reason MFENCE is sufficient here is subtle: the register arguments
      * to WRMSR must depend on a memory read executed after the barrier. This
      * is guaranteed by cpu_physical_id(), which reads from a global array (and
      * so cannot be hoisted above the barrier even by a clever compiler).
      */
-    smp_mb();
+    weak_wrmsr_fence(true);
 
     local_irq_save(flags);
 
@@ -130,7 +130,7 @@ static void cf_check send_IPI_mask_x2api
     const cpumask_t *cluster_cpus;
     unsigned long flags;
 
-    smp_mb(); /* See above for an explanation. */
+    weak_wrmsr_fence(true); /* See above for an explanation. */
 
     local_irq_save(flags);
 
--- a/xen/arch/x86/include/asm/cpufeatures.h
+++ b/xen/arch/x86/include/asm/cpufeatures.h
@@ -24,7 +24,7 @@ XEN_CPUFEATURE(APERFMPERF, X86_SY
 XEN_CPUFEATURE(MFENCE_RDTSC, X86_SYNTH( 9)) /* MFENCE synchronizes RDTSC */
 XEN_CPUFEATURE(XEN_SMEP, X86_SYNTH(10)) /* SMEP gets used by Xen itself */
 XEN_CPUFEATURE(XEN_SMAP, X86_SYNTH(11)) /* SMAP gets used by Xen itself */
-/* Bit 12 unused. */
+XEN_CPUFEATURE(NO_WRMSR_FENCE, X86_SYNTH(12)) /* No MFENCE{,+LFENCE} ahead of certain WRMSR. */
 XEN_CPUFEATURE(IND_THUNK_LFENCE, X86_SYNTH(13)) /* Use IND_THUNK_LFENCE */
 XEN_CPUFEATURE(IND_THUNK_JMP, X86_SYNTH(14)) /* Use IND_THUNK_JMP */
 XEN_CPUFEATURE(SC_NO_BRANCH_HARDEN, X86_SYNTH(15)) /* (Disable) Conditional branch hardening */
--- a/xen/arch/x86/include/asm/msr.h
+++ b/xen/arch/x86/include/asm/msr.h
@@ -97,6 +97,25 @@ static inline void msr_split(struct cpu_
     regs->rax = (uint32_t)val;
 }
 
+/*
+ * Make previous memory operations globally visible before a WRMSR. Most
+ * WRMSRs are full serializing instructions themselves and do not require this
+ * barrier. This may only be required for the TSC_DEADLINE and x2APIC MSRs.
+ *
+ * MFENCE makes writes visible, but only affects load/store instructions.
+ * WRMSR is unfortunately not a load/store instruction and is unaffected by
+ * MFENCE. The LFENCE ensures that the WRMSR is not reordered, but callers
+ * can indicate to avoid it when they have a suitable memory access between
+ * the invocation of this function and the WRMSR in question.
+ */
+static inline void weak_wrmsr_fence(bool have_mem_access)
+{
+    alternative("mfence", "", X86_FEATURE_NO_WRMSR_FENCE);
+
+    if ( !have_mem_access )
+        alternative("lfence", "", X86_FEATURE_NO_WRMSR_FENCE);
+}
+
 static inline uint64_t rdtsc(void)
 {
     uint32_t low, high;
--- a/xen/arch/x86/smp.c
+++ b/xen/arch/x86/smp.c
@@ -39,7 +39,10 @@ static unsigned int prepare_ICR2(unsigne
 void apic_wait_icr_idle(void)
 {
     if ( x2apic_enabled )
+    {
+        weak_wrmsr_fence(false);
         return;
+    }
 
     while ( apic_read(APIC_ICR) & APIC_ICR_BUSY )
         cpu_relax();
Linux commit 25a068b8e9a4e ("x86/apic: Add extra serialization for non-serializing MSRs") explains why an MFENCE+LFENCE pair is generally needed ahead of ICR writes in x2APIC mode, and also why at least in theory such is also needed ahead of TSC_DEADLINE writes. A comment of our own in send_IPI_mask_x2apic_phys() further explains a condition under which the LFENCE can be avoided.

Further Linux commit 04c3024560d3 ("x86/barrier: Do not serialize MSR accesses on AMD") explains that this barrier isn't needed on AMD or Hygon, and is in fact hampering performance in a measurable way.

Introduce a similarly named helper function, but with a parameter allowing callers to specify whether a memory access will follow, thus permitting the LFENCE to be omitted.

Putting an instance in apic_wait_icr_idle() is to be on the safe side. The one case where it was clearly missing is in send_IPI_shortcut(), which is also used in x2APIC mode when called from send_IPI_mask().

Function comment shamelessly borrowed (but adapted) from Linux.

Fixes: 5500d265a2a8 ("x86/smp: use APIC ALLBUT destination shorthand when possible")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I question the need for a fence ahead of writing TSC_DEADLINE: The Linux commit message talks about LVT being accessed via MMIO in xAPIC mode, but that should not be relevant here: It's all the local CPU, so there ought to not be visibility concerns (much like for a self-IPI no fence is needed ahead of the ICR write). If that wasn't needed, we could further use alternatives patching to remove the fence also from apic_wait_icr_idle() when in xAPIC mode. (And only then would I agree to have APIC in the feature identifier, like Linux has it.)

A number of apic_write() may better be turned into apic_mem_write(), in particular e.g. the ones in send_IPI_mask_{flat,phys}(). That way it would be quite a bit easier to spot paths taken only in xAPIC mode.

The INIT-INIT-SIPI sequence for AP startup doesn't use any barrier, also not in Linux afaics. I can't explain the lack thereof, though.
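As an illustration of the condition under which the commit message says the LFENCE can be omitted, here is a simplified, self-contained sketch (not the actual Xen code; cpu_to_apicid[], MSR_X2APIC_ICR and send_ipi_phys_sketch() are stand-ins for cpu_physical_id(), the real MSR definition and send_IPI_mask_x2apic_phys()): the WRMSR's register inputs are computed from a load issued after the MFENCE, so the WRMSR cannot execute before that load, and the load cannot become visible ahead of the fence.

#include <stdint.h>

#define MSR_X2APIC_ICR 0x830U   /* architectural x2APIC ICR MSR index */

/* Stand-in for the global array cpu_physical_id() reads from. */
extern uint32_t cpu_to_apicid[];

static void send_ipi_phys_sketch(unsigned int cpu, uint8_t vector)
{
    uint64_t icr;

    /* Make prior stores (the synchronisation data) globally visible. */
    asm volatile ( "mfence" ::: "memory" );

    /*
     * The destination APIC ID is loaded from memory after the fence; the
     * WRMSR below consumes it, so it cannot be dispatched before the load
     * completes, and the load cannot pass the MFENCE.  This is why the
     * LFENCE half of the fence pair can be skipped on this path.
     */
    icr = ((uint64_t)cpu_to_apicid[cpu] << 32) | vector;

    asm volatile ( "wrmsr" ::
                   "c" (MSR_X2APIC_ICR),
                   "a" ((uint32_t)icr), "d" ((uint32_t)(icr >> 32)) );
}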