diff mbox series

[2/2] x86/mtrr: Skip cache flushes on CPUs with cache self-snooping

Message ID d982278f-daf9-67e5-f9b2-775105ef3233@suse.com (mailing list archive)
State Superseded
Headers show
Series x86: port self-snoop related patches from Linux | expand

Commit Message

Jan Beulich July 18, 2019, 12:10 p.m. UTC
From: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>

Programming MTRR registers in multi-processor systems is a rather lengthy
process. Furthermore, all processors must program these registers in lock
step and with interrupts disabled; the process also involves flushing
caches and TLBs twice. As a result, the process may take a considerable
amount of time.

On some platforms, this can lead to a large skew of the refined-jiffies
clock source. Early when booting, if no other clock is available (e.g.,
booting with hpet=disabled), the refined-jiffies clock source is used to
monitor the TSC clock source. If the skew of refined-jiffies is too large,
Linux wrongly assumes that the TSC is unstable:

   clocksource: timekeeping watchdog on CPU1: Marking clocksource
                'tsc-early' as unstable because the skew is too large:
   clocksource: 'refined-jiffies' wd_now: fffedc10 wd_last:
                fffedb90 mask: ffffffff
   clocksource: 'tsc-early' cs_now: 5eccfddebc cs_last: 5e7e3303d4
                mask: ffffffffffffffff
   tsc: Marking TSC unstable due to clocksource watchdog

As per measurements, around 98% of the time needed by the procedure to
program MTRRs in multi-processor systems is spent flushing caches with
wbinvd(). As per the Section 11.11.8 of the Intel 64 and IA 32
Architectures Software Developer's Manual, it is not necessary to flush
caches if the CPU supports cache self-snooping. Thus, skipping the cache
flushes can reduce by several tens of milliseconds the time needed to
complete the programming of the MTRR registers:

Platform                      	Before	   After
104-core (208 Threads) Skylake  1437ms      28ms
   2-core (  4 Threads) Haswell   114ms       2ms

Reported-by: Mohammad Etemadi <mohammad.etemadi@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
[Linux commit fd329f276ecaad7a371d6f91b9bbea031d0c3440]

Use alternatives patching instead of static_cpu_has() (which we don't
have [yet]).

Interestingly we've been lacking the 2nd wbinvd(), which I'm taking the
liberty here.

Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>

Comments

Andrew Cooper July 18, 2019, 12:29 p.m. UTC | #1
On 18/07/2019 13:10, Jan Beulich wrote:
> From: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
>
> Programming MTRR registers in multi-processor systems is a rather lengthy
> process. Furthermore, all processors must program these registers in lock
> step and with interrupts disabled; the process also involves flushing
> caches and TLBs twice. As a result, the process may take a considerable
> amount of time.
>
> On some platforms, this can lead to a large skew of the refined-jiffies
> clock source. Early when booting, if no other clock is available (e.g.,
> booting with hpet=disabled), the refined-jiffies clock source is used to
> monitor the TSC clock source. If the skew of refined-jiffies is too large,
> Linux wrongly assumes that the TSC is unstable:
>
>    clocksource: timekeeping watchdog on CPU1: Marking clocksource
>                 'tsc-early' as unstable because the skew is too large:
>    clocksource: 'refined-jiffies' wd_now: fffedc10 wd_last:
>                 fffedb90 mask: ffffffff
>    clocksource: 'tsc-early' cs_now: 5eccfddebc cs_last: 5e7e3303d4
>                 mask: ffffffffffffffff
>    tsc: Marking TSC unstable due to clocksource watchdog
>
> As per measurements, around 98% of the time needed by the procedure to
> program MTRRs in multi-processor systems is spent flushing caches with
> wbinvd(). As per the Section 11.11.8 of the Intel 64 and IA 32
> Architectures Software Developer's Manual, it is not necessary to flush
> caches if the CPU supports cache self-snooping. Thus, skipping the cache
> flushes can reduce by several tens of milliseconds the time needed to
> complete the programming of the MTRR registers:
>
> Platform                      	Before	   After
> 104-core (208 Threads) Skylake  1437ms      28ms
>    2-core (  4 Threads) Haswell   114ms       2ms
>
> Reported-by: Mohammad Etemadi <mohammad.etemadi@intel.com>
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> [Linux commit fd329f276ecaad7a371d6f91b9bbea031d0c3440]
>
> Use alternatives patching instead of static_cpu_has() (which we don't
> have [yet]).
> Interestingly we've been lacking the 2nd wbinvd(), which I'm taking the
> liberty here.
>
> Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
diff mbox series

Patch

--- a/xen/arch/x86/cpu/mtrr/generic.c
+++ b/xen/arch/x86/cpu/mtrr/generic.c
@@ -450,7 +450,14 @@  static bool prepare_set(void)
  
  	/*  Enter the no-fill (CD=1, NW=0) cache mode and flush caches. */
  	write_cr0(read_cr0() | X86_CR0_CD);
-	wbinvd();
+
+	/*
+	 * Cache flushing is the most time-consuming step when programming
+	 * the MTRRs. Fortunately, as per the Intel Software Development
+	 * Manual, we can skip it if the processor supports cache self-
+	 * snooping.
+	 */
+	alternative("wbinvd", "", X86_FEATURE_SS);
  
  	cr4 = read_cr4();
  	if (cr4 & X86_CR4_PGE)
@@ -466,6 +473,9 @@  static bool prepare_set(void)
  	/*  Disable MTRRs, and set the default type to uncached  */
  	mtrr_wrmsr(MSR_MTRRdefType, deftype & ~0xcff);
  
+	/* Again, only flush caches if we have to. */
+	alternative("wbinvd", "", X86_FEATURE_SS);
+
  	return cr4 & X86_CR4_PGE;
  }