
x86: serializing of non-serializing MSR writes

Message ID 58656398-2d64-48b8-9ddc-c6836847a586@suse.com (mailing list archive)
State New, archived
Series x86: serializing of non-serializing MSR writes

Commit Message

Jan Beulich Feb. 28, 2024, 2:48 p.m. UTC
Linux commit 25a068b8e9a4e ("x86/apic: Add extra serialization for non-
serializing MSRs") explains why an MFENCE+LFENCE pair is generally
needed ahead of ICR writes in x2APIC mode, and also why, at least in
theory, one is needed ahead of TSC_DEADLINE writes. A comment of
our own in send_IPI_mask_x2apic_phys() further explains a condition
under which the LFENCE can be avoided.

Further Linux commit 04c3024560d3 ("x86/barrier: Do not serialize MSR
accesses on AMD") explains that this barrier isn't needed on AMD or
Hygon, and is in fact hampering performance in a measurable way.

Introduce a similarly named helper function, but with a parameter
allowing callers to specify whether a memory access will follow, thus
permitting the LFENCE to be omitted.
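
To illustrate the intended calling convention (a sketch only; wrmsrl()
and MSR_IA32_TSC_DEADLINE are as in the tree, while the ICR constant
and the value names here are placeholders):

    /* A dependent memory access sits between the fence and the WRMSR: */
    weak_wrmsr_fence(true);                   /* MFENCE only */
    wrmsrl(MSR_X2APIC_ICR, icr_from_memory);  /* placeholder names */

    /* No such access follows, so keep the LFENCE too: */
    weak_wrmsr_fence(false);                  /* MFENCE + LFENCE */
    wrmsrl(MSR_IA32_TSC_DEADLINE, deadline);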

Putting an instance in apic_wait_icr_idle() is to be on the safe side.
The one case where it was clearly missing is in send_IPI_shortcut(),
which is also used in x2APIC mode when called from send_IPI_mask().

Function comment shamelessly borrowed (but adapted) from Linux.

Fixes: 5500d265a2a8 ("x86/smp: use APIC ALLBUT destination shorthand when possible")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I question the need for a fence ahead of writing TSC_DEADLINE: The Linux
commit message talks about LVT being accessed via MMIO in xAPIC mode,
but that should not be relevant here: It's all the local CPU, so there
ought not to be visibility concerns (much like for a self-IPI no fence
is needed ahead of the ICR write). If that wasn't needed, we could
further use alternatives patching to remove the fence also from
apic_wait_icr_idle() when in xAPIC mode. (And only then would I agree to
have APIC in the feature identifier, like Linux has it.)

A number of apic_write() calls may better be turned into apic_mem_write(), in
particular e.g. the ones in send_IPI_mask_{flat,phys}(). That way it
would be quite a bit easier to spot paths taken only in xAPIC mode.

The INIT-INIT-SIPI sequence for AP startup doesn't use any barrier, nor
does Linux's afaics. I can't explain the lack thereof, though.

Comments

Andrew Cooper Feb. 29, 2024, 1:10 a.m. UTC | #1
On 28/02/2024 2:48 pm, Jan Beulich wrote:
> Linux commit 25a068b8e9a4e ("x86/apic: Add extra serialization for non-
> serializing MSRs") explains why an MFENCE+LFENCE pair is generally
> needed ahead of ICR writes in x2APIC mode, and also why, at least in
> theory, one is needed ahead of TSC_DEADLINE writes. A comment of
> our own in send_IPI_mask_x2apic_phys() further explains a condition
> under which the LFENCE can be avoided.
>
> Further Linux commit 04c3024560d3 ("x86/barrier: Do not serialize MSR
> accesses on AMD") explains that this barrier isn't needed on AMD or
> Hygon, and is in fact hampering performance in a measurable way.
>
> Introduce a similarly named helper function, but with a parameter
> allowing callers to specify whether a memory access will follow, thus
> permitting the LFENCE to be omitted.
>
> Putting an instance in apic_wait_icr_idle() is to be on the safe side.
> The one case where it was clearly missing is in send_IPI_shortcut(),
> which is also used in x2APIC mode when called from send_IPI_mask().
>
> Function comment shamelessly borrowed (but adapted) from Linux.
>
> Fixes: 5500d265a2a8 ("x86/smp: use APIC ALLBUT destination shorthand when possible")
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> I question the need for a fence ahead of writing TSC_DEADLINE: The Linux
> commit message talks about LVT being accessed via MMIO in xAPIC mode,
> but that should not be relevant here: It's all the local CPU, so there
> ought not to be visibility concerns (much like for a self-IPI no fence
> is needed ahead of the ICR write). If that wasn't needed, we could
> further use alternatives patching to remove the fence also from
> apic_wait_icr_idle() when in xAPIC mode. (And only then would I agree to
> have APIC in the feature identifier, like Linux has it.)
>
> A number of apic_write() calls may better be turned into apic_mem_write(), in
> particular e.g. the ones in send_IPI_mask_{flat,phys}(). That way it
> would be quite a bit easier to spot paths taken only in xAPIC mode.
>
> The INIT-INIT-SIPI sequence for AP startup doesn't use any barrier, nor
> does Linux's afaics. I can't explain the lack thereof, though.

I have some opinions about what Linux did here...  I don't think a
single vendor/arch-neutral helper can possibly be right.

It is vendor and uarch dependent which WRMSRs transparently degrade to
WRMSRNS, and it is vendor dependent which serialising sequence to use
(if any).  e.g. AMD have recently (Zen2 uarch, I believe) retroactively
defined FS/GS_BASE to be non-serialising.  (And this is another CPUID
bit we need to OR backwards in time.)
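
(As a minimal sketch of what OR-ing such a bit backwards might look
like; the feature name and the family check here are invented purely
for illustration:)

    /* Hypothetical: synthesise the bit on parts predating its CPUID leaf. */
    if ( c->x86_vendor == X86_VENDOR_AMD && c->x86 >= 0x17 )
        __set_bit(X86_FEATURE_NSC_FSGSBASE /* invented */,
                  c->x86_capability);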

Furthermore, IIRC AMD still owe us an update to the APM; the APM
currently says that a serialising sequence is needed for ICR.  I'm told
this isn't actually true, but I'm also very wary of making an adjustment
which is directly contradicted by the docs.


The Linux change explains why in principle the IPI can be emitted before
the stores are visible.

This does actually explain TSC_DEADLINE too.  Setting a deadline in the
past gets you an interrupt immediately, and if you combine that with a
WRMSR being ordered ahead of an MFENCE, then causality is violated.
You'll take the interrupt on whichever instruction boundary most
recently retired, which can be the wrong side of the WRMSR triggering
the interrupt, at which point you'll livelock taking timer interrupts
and re-arming the timer in the past.
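
(A hypothetical sketch of the failure mode; not code from the patch,
and with the fence written out directly for clarity:)

    static void rearm(void)
    {
        uint64_t past = rdtsc() - 1;            /* deadline already expired */

        asm volatile ( "mfence" ::: "memory" ); /* no LFENCE after it */
        wrmsrl(MSR_IA32_TSC_DEADLINE, past);    /* fires immediately */
        /*
         * If the WRMSR executes ahead of the MFENCE, the interrupt can be
         * taken on an instruction boundary preceding the WRMSR; the handler
         * then re-arms with another stale deadline, and the cycle repeats.
         */
    }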


Now, for fencing, things are more complicated.  AMD define MFENCE as
architecturally serialising.  Intel do not, which is why a WRMSR can
apparently move across it.  The LFENCE is added for its newer
speculative property of dispatch serialising.

We don't actually care about architecturally serialising.  If someone is
interacting with these MSRs with relevant data in WC memory, then they
get to keep all resulting pieces.

However, we do care about plain stores, and for that we do need an
MFENCE;LFENCE on Intel.  (The jury is out on whether a single
SERIALIZE[sic] would be better, but it should have the correct semantics
architecturally speaking.)
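
(For reference, a sketch of what a SERIALIZE-based variant could look
like; it assumes CPUID.7.0:EDX[14] advertises the instruction, and uses
the raw encoding in case the assembler lacks the mnemonic:)

    static inline void serialize(void)
    {
        /* SERIALIZE (NP 0F 01 E8): architecturally serialising; whether it
         * beats MFENCE;LFENCE in practice is the open question above. */
        asm volatile ( ".byte 0x0f, 0x01, 0xe8" ::: "memory" );
    }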

In particular, ...


>
> --- a/xen/arch/x86/apic.c
> +++ b/xen/arch/x86/apic.c
> @@ -1309,6 +1309,12 @@ int reprogram_timer(s_time_t timeout)
>  
>      if ( tdt_enabled )
>      {
> +        /*
> +         * WRMSR to TSC_DEADLINE is not serializing.  We could pass @timeout
> +         * here, but the passed value is preferably compile-time-constant.
> +         */
> +        weak_wrmsr_fence(false);
> +
>          wrmsrl(MSR_IA32_TSC_DEADLINE, timeout ? stime2tsc(timeout) : 0);
>          return 1;
>      }
> --- a/xen/arch/x86/cpu/amd.c
> +++ b/xen/arch/x86/cpu/amd.c
> @@ -675,8 +675,12 @@ void amd_log_freq(const struct cpuinfo_x
>  
>  void cf_check early_init_amd(struct cpuinfo_x86 *c)
>  {
> -	if (c == &boot_cpu_data)
> +	if (c == &boot_cpu_data) {
> +		/* No fencing needed ahead of certain MSR writes. */
> +		setup_force_cpu_cap(X86_FEATURE_NO_WRMSR_FENCE);
> +
>  		amd_init_levelling();
> +	}
>  
>  	ctxt_switch_levelling(NULL);
>  }
> --- a/xen/arch/x86/genapic/x2apic.c
> +++ b/xen/arch/x86/genapic/x2apic.c
> @@ -97,15 +97,15 @@ static void cf_check send_IPI_mask_x2api
>  
>      /*
>       * Ensure that any synchronisation data written in program order by this
> -     * CPU is seen by notified remote CPUs. The WRMSR contained within
> -     * apic_icr_write() can otherwise be executed early.
> +     * CPU is seen by notified remote CPUs. The WRMSR contained in the loop
> +     * below can otherwise be executed early.
>       * 
> -     * The reason smp_mb() is sufficient here is subtle: the register arguments
> +     * The reason MFENCE is sufficient here is subtle: the register arguments
>       * to WRMSR must depend on a memory read executed after the barrier. This
>       * is guaranteed by cpu_physical_id(), which reads from a global array (and
>       * so cannot be hoisted above the barrier even by a clever compiler).
>       */
> -    smp_mb();
> +    weak_wrmsr_fence(true);
>  
>      local_irq_save(flags);
>  
> @@ -130,7 +130,7 @@ static void cf_check send_IPI_mask_x2api
>      const cpumask_t *cluster_cpus;
>      unsigned long flags;
>  
> -    smp_mb(); /* See above for an explanation. */
> +    weak_wrmsr_fence(true); /* See above for an explanation. */
>  
>      local_irq_save(flags);
>  
> --- a/xen/arch/x86/include/asm/cpufeatures.h
> +++ b/xen/arch/x86/include/asm/cpufeatures.h
> @@ -24,7 +24,7 @@ XEN_CPUFEATURE(APERFMPERF,        X86_SY
>  XEN_CPUFEATURE(MFENCE_RDTSC,      X86_SYNTH( 9)) /* MFENCE synchronizes RDTSC */
>  XEN_CPUFEATURE(XEN_SMEP,          X86_SYNTH(10)) /* SMEP gets used by Xen itself */
>  XEN_CPUFEATURE(XEN_SMAP,          X86_SYNTH(11)) /* SMAP gets used by Xen itself */
> -/* Bit 12 unused. */
> +XEN_CPUFEATURE(NO_WRMSR_FENCE,    X86_SYNTH(12)) /* No MFENCE{,+LFENCE} ahead of certain WRMSR. */
>  XEN_CPUFEATURE(IND_THUNK_LFENCE,  X86_SYNTH(13)) /* Use IND_THUNK_LFENCE */
>  XEN_CPUFEATURE(IND_THUNK_JMP,     X86_SYNTH(14)) /* Use IND_THUNK_JMP */
>  XEN_CPUFEATURE(SC_NO_BRANCH_HARDEN, X86_SYNTH(15)) /* (Disable) Conditional branch hardening */
> --- a/xen/arch/x86/include/asm/msr.h
> +++ b/xen/arch/x86/include/asm/msr.h
> @@ -97,6 +97,25 @@ static inline void msr_split(struct cpu_
>      regs->rax = (uint32_t)val;
>  }
>  
> +/*
> + * Make previous memory operations globally visible before a WRMSR.  Most
> + * WRMSRs are full serializing instructions themselves and do not require this
> + * barrier.  This may only be required for the TSC_DEADLINE and x2APIC MSRs.
> + *
> + * MFENCE makes writes visible, but only affects load/store instructions.
> + * WRMSR is unfortunately not a load/store instruction and is unaffected by
> + * MFENCE.

[ On Intel.   AMD didn't end up with this (mis)behaviour. ]

>   The LFENCE ensures that the WRMSR is not reordered, but callers
> + * can indicate to avoid it when they have a suitable memory access between
> + * the invocation of this function and the WRMSR in question.

... this makes no sense.

We need the LFENCE for its dispatch serialising properties, not its
load ordering properties.  What use will other memory accesses be, when
the entire problem is that WRMSR doesn't interact with them?

Worse, it's a Spectre-v1 gadget and we're now acutely familiar with how
the CPU will find its way around these.  So even expressing "I
critically need the LFENCE" still gets you pot luck on whether it has
any effect against a causality-violating WRMSR.

~Andrew

> + */
> +static inline void weak_wrmsr_fence(bool have_mem_access)
> +{
> +    alternative("mfence", "", X86_FEATURE_NO_WRMSR_FENCE);
> +
> +    if ( !have_mem_access )
> +        alternative("lfence", "", X86_FEATURE_NO_WRMSR_FENCE);
> +}
> +
>  static inline uint64_t rdtsc(void)
>  {
>      uint32_t low, high;
> --- a/xen/arch/x86/smp.c
> +++ b/xen/arch/x86/smp.c
> @@ -39,7 +39,10 @@ static unsigned int prepare_ICR2(unsigne
>  void apic_wait_icr_idle(void)
>  {
>      if ( x2apic_enabled )
> +    {
> +        weak_wrmsr_fence(false);
>          return;
> +    }
>  
>      while ( apic_read(APIC_ICR) & APIC_ICR_BUSY )
>          cpu_relax();
Jan Beulich Feb. 29, 2024, 9:02 a.m. UTC | #2
On 29.02.2024 02:10, Andrew Cooper wrote:
> On 28/02/2024 2:48 pm, Jan Beulich wrote:
>> Linux commit 25a068b8e9a4e ("x86/apic: Add extra serialization for non-
>> serializing MSRs") explains why an MFENCE+LFENCE pair is generally
>> needed ahead of ICR writes in x2APIC mode, and also why, at least in
>> theory, one is needed ahead of TSC_DEADLINE writes. A comment of
>> our own in send_IPI_mask_x2apic_phys() further explains a condition
>> under which the LFENCE can be avoided.
>>
>> Further Linux commit 04c3024560d3 ("x86/barrier: Do not serialize MSR
>> accesses on AMD") explains that this barrier isn't needed on AMD or
>> Hygon, and is in fact hampering performance in a measurable way.
>>
>> Introduce a similarly named helper function, but with a parameter
>> allowing callers to specify whether a memory access will follow, thus
>> permitting the LFENCE to be omitted.
>>
>> Putting an instance in apic_wait_icr_idle() is to be on the safe side.
>> The one case where it was clearly missing is in send_IPI_shortcut(),
>> which is also used in x2APIC mode when called from send_IPI_mask().
>>
>> Function comment shamelessly borrowed (but adapted) from Linux.
>>
>> Fixes: 5500d265a2a8 ("x86/smp: use APIC ALLBUT destination shorthand when possible")
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> I question the need for a fence ahead of writing TSC_DEADLINE: The Linux
>> commit message talks about LVT being accessed via MMIO in xAPIC mode,
>> but that should not be relevant here: It's all the local CPU, so there
>> ought not to be visibility concerns (much like for a self-IPI no fence
>> is needed ahead of the ICR write). If that wasn't needed, we could
>> further use alternatives patching to remove the fence also from
>> apic_wait_icr_idle() when in xAPIC mode. (And only then would I agree to
>> have APIC in the feature identifier, like Linux has it.)
>>
>> A number of apic_write() calls may better be turned into apic_mem_write(), in
>> particular e.g. the ones in send_IPI_mask_{flat,phys}(). That way it
>> would be quite a bit easier to spot paths taken only in xAPIC mode.
>>
>> The INIT-INIT-SIPI sequence for AP startup doesn't use any barrier, nor
>> does Linux's afaics. I can't explain the lack thereof, though.
> 
> I have some opinions about what Linux did here...  I don't think a
> single vendor/arch-neutral helper can possibly be right.
> 
> It is vendor and uarch dependent which WRMSRs transparently degrade to
> WRMSRNS, and it is vendor dependent which serialising sequence to use
> (if any).  e.g. AMD have recently (Zen2 uarch, I believe) retroactively
> defined FS/GS_BASE to be non-serialising.  (And this is another CPUID
> bit we need to OR backwards in time.)
> 
> Furthermore, IIRC AMD still owe us an update to the APM; the APM
> currently says that a serialising sequence is needed for ICR.  I'm told
> this isn't actually true, but I'm also very wary of making an adjustment
> which is directly contradicted by the docs.

I can see you wanting the doc to be corrected. What I'm having trouble
with is that you indicated (long ago) that we can avoid this fence on
AMD, only to now effectively object to me (finally) getting around to
actually doing so.

> The Linux change explains why in principle the IPI can be emitted before
> the stores are visible.
> 
> This does actually explain TSC_DEADLINE too.  Setting a deadline in the
> past gets you an interrupt immediately, and if you combine that with a
> WRMSR being ordered ahead of an MFENCE, then causality is violated.
> You'll take the interrupt on whichever instruction boundary most
> recently retired, which can be the wrong side of the WRMSR triggering
> the interrupt, at which point you'll livelock taking timer interrupts
> and re-arming the timer in the past.

Are you saying the interrupt is raised ahead of the insn retiring? That
would be concerning, imo.

Irrespective of that, as to live-locking: If that were really a
possible issue, moving lapic_timer_on() ahead of local_irq_enable() in
acpi_processor_idle() and mwait_idle() would avoid it, as sketched
below; the sole other use of reprogram_timer() already runs with IRQs
off.
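
(Sketch of that reordering; illustrative only, not the actual
acpi_processor_idle() code:)

    lapic_timer_on();    /* may write a TSC_DEADLINE already in the past,
                          * but IRQs are still off at this point */
    local_irq_enable();  /* a pending timer interrupt is taken here, only
                          * after the WRMSR has fully retired */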

> Now, for fencing, things are more complicated.  AMD define MFENCE as
> architecturally serialising.  Intel do not, which is why a WRMSR can
> apparently move across it.  The LFENCE is added for its newer
> speculative property of dispatch serialising.
> 
> We don't actually care about architecturally serialising.  If someone is
> interacting with these MSRs with relevant data in WC memory, then they
> get to keep all resulting pieces.
> 
> However, we do care about plain stores, and for that we do need an
> MFENCE;LFENCE on Intel.  (The jury is out on whether a single
> SERIALIZE[sic] would be better, but it should have the correct semantics
> architecturally speaking.)
> 
> In particular, ...
> 
>> --- a/xen/arch/x86/genapic/x2apic.c
>> +++ b/xen/arch/x86/genapic/x2apic.c
>> @@ -97,15 +97,15 @@ static void cf_check send_IPI_mask_x2api
>>  
>>      /*
>>       * Ensure that any synchronisation data written in program order by this
>> -     * CPU is seen by notified remote CPUs. The WRMSR contained within
>> -     * apic_icr_write() can otherwise be executed early.
>> +     * CPU is seen by notified remote CPUs. The WRMSR contained in the loop
>> +     * below can otherwise be executed early.
>>       * 
>> -     * The reason smp_mb() is sufficient here is subtle: the register arguments
>> +     * The reason MFENCE is sufficient here is subtle: the register arguments
>>       * to WRMSR must depend on a memory read executed after the barrier. This
>>       * is guaranteed by cpu_physical_id(), which reads from a global array (and
>>       * so cannot be hoisted above the barrier even by a clever compiler).
>>       */
>> -    smp_mb();
>> +    weak_wrmsr_fence(true);
>>  
>>      local_irq_save(flags);
>>  
>> @@ -130,7 +130,7 @@ static void cf_check send_IPI_mask_x2api
>>      const cpumask_t *cluster_cpus;
>>      unsigned long flags;
>>  
>> -    smp_mb(); /* See above for an explanation. */
>> +    weak_wrmsr_fence(true); /* See above for an explanation. */
>>  
>>      local_irq_save(flags);
>>  
>> --- a/xen/arch/x86/include/asm/cpufeatures.h
>> +++ b/xen/arch/x86/include/asm/cpufeatures.h
>> @@ -24,7 +24,7 @@ XEN_CPUFEATURE(APERFMPERF,        X86_SY
>>  XEN_CPUFEATURE(MFENCE_RDTSC,      X86_SYNTH( 9)) /* MFENCE synchronizes RDTSC */
>>  XEN_CPUFEATURE(XEN_SMEP,          X86_SYNTH(10)) /* SMEP gets used by Xen itself */
>>  XEN_CPUFEATURE(XEN_SMAP,          X86_SYNTH(11)) /* SMAP gets used by Xen itself */
>> -/* Bit 12 unused. */
>> +XEN_CPUFEATURE(NO_WRMSR_FENCE,    X86_SYNTH(12)) /* No MFENCE{,+LFENCE} ahead of certain WRMSR. */
>>  XEN_CPUFEATURE(IND_THUNK_LFENCE,  X86_SYNTH(13)) /* Use IND_THUNK_LFENCE */
>>  XEN_CPUFEATURE(IND_THUNK_JMP,     X86_SYNTH(14)) /* Use IND_THUNK_JMP */
>>  XEN_CPUFEATURE(SC_NO_BRANCH_HARDEN, X86_SYNTH(15)) /* (Disable) Conditional branch hardening */
>> --- a/xen/arch/x86/include/asm/msr.h
>> +++ b/xen/arch/x86/include/asm/msr.h
>> @@ -97,6 +97,25 @@ static inline void msr_split(struct cpu_
>>      regs->rax = (uint32_t)val;
>>  }
>>  
>> +/*
>> + * Make previous memory operations globally visible before a WRMSR.  Most
>> + * WRMSRs are full serializing instructions themselves and do not require this
>> + * barrier.  This may only be required for the TSC_DEADLINE and x2APIC MSRs.
>> + *
>> + * MFENCE makes writes visible, but only affects load/store instructions.
>> + * WRMSR is unfortunately not a load/store instruction and is unaffected by
>> + * MFENCE.
> 
> [ On Intel.   AMD didn't end up with this (mis)behaviour. ]
> 
>>   The LFENCE ensures that the WRMSR is not reordered, but callers
>> + * can indicate to avoid it when they have a suitable memory access between
>> + * the invocation of this function and the WRMSR in question.
> 
> ... this makes no sense.
> 
> We need the LFENCE for its dispatch serialising properties, not its
> load ordering properties.  What use will other memory accesses be, when
> the entire problem is that WRMSR doesn't interact with them?

Are you suggesting the comment (and code) in send_IPI_mask_x2apic_*()
(left visible in context further up) are wrong then? I consider them
correct (looking forward to seeing you prove them wrong), and with that
having a way to avoid the LFENCE looks correct to me. Plus the comment
here doesn't say "load ordering" anywhere. It's strictly execution
ordering, guaranteed by a memory access the WRMSR input depends
upon. For load ordering, MFENCE alone would be enough.
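
(Spelling out the dependency chain being relied upon; illustrative,
with the ICR value composition abridged:)

    weak_wrmsr_fence(true);     /* MFENCE only */
    /*
     * cpu_physical_id() reads from a global array, so the load cannot be
     * executed ahead of the MFENCE; the WRMSR's register inputs in turn
     * depend on the loaded value, so the WRMSR cannot execute ahead of
     * the load either.
     */
    apic_wrmsr(APIC_ICR, cfg | ((uint64_t)cpu_physical_id(cpu) << 32));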

> Worse, it's a Spectre-v1 gadget and we're now acutely familiar with how
> the CPU will find its way around these.  So even expressing "I
> critically need the LFENCE" still gets you pot luck on whether it has
> any effect against a causality-violating WRMSR.

Hmm, besides me possibly taking this as "drop this patch" (which could
do with being made explicit, if that is what was meant), I'm afraid I
can't view this remark as actionable in any way. Yet I firmly expect an
IPI not to be raised speculatively (and, as said above, neither would an
LAPIC timer interrupt be), so I'm even having trouble seeing how this
would form a Spectre-v1 gadget.

Jan

Patch

--- a/xen/arch/x86/apic.c
+++ b/xen/arch/x86/apic.c
@@ -1309,6 +1309,12 @@  int reprogram_timer(s_time_t timeout)
 
     if ( tdt_enabled )
     {
+        /*
+         * WRMSR to TSC_DEADLINE is not serializing.  We could pass @timeout
+         * here, but the passed value is preferably compile-time-constant.
+         */
+        weak_wrmsr_fence(false);
+
         wrmsrl(MSR_IA32_TSC_DEADLINE, timeout ? stime2tsc(timeout) : 0);
         return 1;
     }
--- a/xen/arch/x86/cpu/amd.c
+++ b/xen/arch/x86/cpu/amd.c
@@ -675,8 +675,12 @@  void amd_log_freq(const struct cpuinfo_x
 
 void cf_check early_init_amd(struct cpuinfo_x86 *c)
 {
-	if (c == &boot_cpu_data)
+	if (c == &boot_cpu_data) {
+		/* No fencing needed ahead of certain MSR writes. */
+		setup_force_cpu_cap(X86_FEATURE_NO_WRMSR_FENCE);
+
 		amd_init_levelling();
+	}
 
 	ctxt_switch_levelling(NULL);
 }
--- a/xen/arch/x86/genapic/x2apic.c
+++ b/xen/arch/x86/genapic/x2apic.c
@@ -97,15 +97,15 @@  static void cf_check send_IPI_mask_x2api
 
     /*
      * Ensure that any synchronisation data written in program order by this
-     * CPU is seen by notified remote CPUs. The WRMSR contained within
-     * apic_icr_write() can otherwise be executed early.
+     * CPU is seen by notified remote CPUs. The WRMSR contained in the loop
+     * below can otherwise be executed early.
      * 
-     * The reason smp_mb() is sufficient here is subtle: the register arguments
+     * The reason MFENCE is sufficient here is subtle: the register arguments
      * to WRMSR must depend on a memory read executed after the barrier. This
      * is guaranteed by cpu_physical_id(), which reads from a global array (and
      * so cannot be hoisted above the barrier even by a clever compiler).
      */
-    smp_mb();
+    weak_wrmsr_fence(true);
 
     local_irq_save(flags);
 
@@ -130,7 +130,7 @@  static void cf_check send_IPI_mask_x2api
     const cpumask_t *cluster_cpus;
     unsigned long flags;
 
-    smp_mb(); /* See above for an explanation. */
+    weak_wrmsr_fence(true); /* See above for an explanation. */
 
     local_irq_save(flags);
 
--- a/xen/arch/x86/include/asm/cpufeatures.h
+++ b/xen/arch/x86/include/asm/cpufeatures.h
@@ -24,7 +24,7 @@  XEN_CPUFEATURE(APERFMPERF,        X86_SY
 XEN_CPUFEATURE(MFENCE_RDTSC,      X86_SYNTH( 9)) /* MFENCE synchronizes RDTSC */
 XEN_CPUFEATURE(XEN_SMEP,          X86_SYNTH(10)) /* SMEP gets used by Xen itself */
 XEN_CPUFEATURE(XEN_SMAP,          X86_SYNTH(11)) /* SMAP gets used by Xen itself */
-/* Bit 12 unused. */
+XEN_CPUFEATURE(NO_WRMSR_FENCE,    X86_SYNTH(12)) /* No MFENCE{,+LFENCE} ahead of certain WRMSR. */
 XEN_CPUFEATURE(IND_THUNK_LFENCE,  X86_SYNTH(13)) /* Use IND_THUNK_LFENCE */
 XEN_CPUFEATURE(IND_THUNK_JMP,     X86_SYNTH(14)) /* Use IND_THUNK_JMP */
 XEN_CPUFEATURE(SC_NO_BRANCH_HARDEN, X86_SYNTH(15)) /* (Disable) Conditional branch hardening */
--- a/xen/arch/x86/include/asm/msr.h
+++ b/xen/arch/x86/include/asm/msr.h
@@ -97,6 +97,25 @@  static inline void msr_split(struct cpu_
     regs->rax = (uint32_t)val;
 }
 
+/*
+ * Make previous memory operations globally visible before a WRMSR.  Most
+ * WRMSRs are full serializing instructions themselves and do not require this
+ * barrier.  This may only be required for the TSC_DEADLINE and x2APIC MSRs.
+ *
+ * MFENCE makes writes visible, but only affects load/store instructions.
+ * WRMSR is unfortunately not a load/store instruction and is unaffected by
+ * MFENCE.  The LFENCE ensures that the WRMSR is not reordered, but callers
+ * can indicate to avoid it when they have a suitable memory access between
+ * the invocation of this function and the WRMSR in question.
+ */
+static inline void weak_wrmsr_fence(bool have_mem_access)
+{
+    alternative("mfence", "", X86_FEATURE_NO_WRMSR_FENCE);
+
+    if ( !have_mem_access )
+        alternative("lfence", "", X86_FEATURE_NO_WRMSR_FENCE);
+}
+
 static inline uint64_t rdtsc(void)
 {
     uint32_t low, high;
--- a/xen/arch/x86/smp.c
+++ b/xen/arch/x86/smp.c
@@ -39,7 +39,10 @@  static unsigned int prepare_ICR2(unsigne
 void apic_wait_icr_idle(void)
 {
     if ( x2apic_enabled )
+    {
+        weak_wrmsr_fence(false);
         return;
+    }
 
     while ( apic_read(APIC_ICR) & APIC_ICR_BUSY )
         cpu_relax();