diff mbox series

[RFC,v4,4/8] x86/smp: Allow calling mwait_play_dead with arbitrary hint

Message ID 20241125132029.7241-5-patryk.wlazlyn@linux.intel.com (mailing list archive)
State Changes Requested, archived
Headers show
Series SRF: Fix offline CPU preventing pc6 entry | expand

Commit Message

Patryk Wlazlyn Nov. 25, 2024, 1:20 p.m. UTC
The MWAIT instruction needs different hints on different CPUs to reach
the most specific idle states. The current hint calculation* in
mwait_play_dead() code works in practice on current hardware, but it
fails on a recent one, Intel's Sierra Forest and possibly some future ones.
Those newer CPUs' power efficiency suffers when the CPU is put offline.

 * The current calculation is the for loop inspecting edx in
   mwait_play_dead()

The current implementation for looking up the mwait hint for the deepest
cstate, in mwait_play_dead() code works by inspecting CPUID leaf 0x5 and
calculates the mwait hint based on the number of reported substates.
This approach depends on the hints associated with them to be continuous
in the range [0, NUM_SUBSTATES-1]. This continuity is not documented and
is not met on the recent Intel platforms.

For example, Intel's Sierra Forest report two cstates with two substates
each in cpuid leaf 5:

  Name*   target cstate    target subcstate (mwait hint)
  ===========================================================
  C1      0x00             0x00
  C1E     0x00             0x01

  --      0x10             ----

  C6S     0x20             0x22
  C6P     0x20             0x23

  --      0x30             ----

  /* No more (sub)states all the way down to the end. */
  ===========================================================

   * Names of the cstates are not included in the CPUID leaf 0x5, they are
     taken from the product specific documentation.

Notice that hints 0x20 and 0x21 are skipped entirely for the target
cstate 0x20 (C6), being a cause of the problem for the current cpuid
leaf 0x5 algorithm.

Allow cpuidle code to provide mwait play dead loop with known, mwait
hint for the deepest idle state on a given platform, skipping the cpuid
based calculation.

Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
---
 arch/x86/include/asm/smp.h |  3 ++
 arch/x86/kernel/smpboot.c  | 86 ++++++++++++++++++++------------------
 2 files changed, 49 insertions(+), 40 deletions(-)

Comments

Rafael J. Wysocki Nov. 25, 2024, 2:39 p.m. UTC | #1
On Mon, Nov 25, 2024 at 2:21 PM Patryk Wlazlyn
<patryk.wlazlyn@linux.intel.com> wrote:
>
> The MWAIT instruction needs different hints on different CPUs to reach
> the most specific idle states. The current hint calculation* in
> mwait_play_dead() code works in practice on current hardware, but it
> fails on a recent one, Intel's Sierra Forest and possibly some future ones.
> Those newer CPUs' power efficiency suffers when the CPU is put offline.
>
>  * The current calculation is the for loop inspecting edx in
>    mwait_play_dead()
>
> The current implementation for looking up the mwait hint for the deepest
> cstate, in mwait_play_dead() code works by inspecting CPUID leaf 0x5 and
> calculates the mwait hint based on the number of reported substates.
> This approach depends on the hints associated with them to be continuous
> in the range [0, NUM_SUBSTATES-1]. This continuity is not documented and
> is not met on the recent Intel platforms.
>
> For example, Intel's Sierra Forest report two cstates with two substates
> each in cpuid leaf 5:
>
>   Name*   target cstate    target subcstate (mwait hint)
>   ===========================================================
>   C1      0x00             0x00
>   C1E     0x00             0x01
>
>   --      0x10             ----
>
>   C6S     0x20             0x22
>   C6P     0x20             0x23
>
>   --      0x30             ----
>
>   /* No more (sub)states all the way down to the end. */
>   ===========================================================
>
>    * Names of the cstates are not included in the CPUID leaf 0x5, they are
>      taken from the product specific documentation.
>
> Notice that hints 0x20 and 0x21 are skipped entirely for the target
> cstate 0x20 (C6), being a cause of the problem for the current cpuid
> leaf 0x5 algorithm.
>
> Allow cpuidle code to provide mwait play dead loop with known, mwait
> hint for the deepest idle state on a given platform, skipping the cpuid
> based calculation.
>
> Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>

I'm going to risk saying that the changelog doesn't match the code
changes in the patch any more.

The code changes are actually relatively straightforward: The bottom
half of mwait_play_dead() is split off, so it can be called from
multiple places.

The other places from which to call it are cpuidle drivers
implementing :enter_dead() callbacks that may want to use MWAIT as the
idle state entry method.  The ACPI processor_idle driver and
intel_idle will be updated by subsequent patches to do so.

The reason for it is mostly consistency: If the cpuidle driver uses a
specific idle state for things like suspend-to-idle, it is better to
let it decide what idle state to use for "play_dead" because it may
know better.

Another reason is what mwait_play_dead() does to determine the MWAIT
argument (referred to as a "hint"), but IMO it belongs in a changelog
of a later patch because this one doesn't actually do anything about
it.  In fact, it is not expected to change the behavior of the code.

> ---
>  arch/x86/include/asm/smp.h |  3 ++
>  arch/x86/kernel/smpboot.c  | 86 ++++++++++++++++++++------------------
>  2 files changed, 49 insertions(+), 40 deletions(-)
>
> diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
> index ca073f40698f..d12fab4a83c5 100644
> --- a/arch/x86/include/asm/smp.h
> +++ b/arch/x86/include/asm/smp.h
> @@ -114,6 +114,7 @@ void wbinvd_on_cpu(int cpu);
>  int wbinvd_on_all_cpus(void);
>
>  void smp_kick_mwait_play_dead(void);
> +void mwait_play_dead_with_hint(unsigned long hint);
>
>  void native_smp_send_reschedule(int cpu);
>  void native_send_call_func_ipi(const struct cpumask *mask);
> @@ -164,6 +165,8 @@ static inline struct cpumask *cpu_llc_shared_mask(int cpu)
>  {
>         return (struct cpumask *)cpumask_of(0);
>  }
> +
> +static inline void mwait_play_dead_with_hint(unsigned long eax_hint) { }
>  #endif /* CONFIG_SMP */
>
>  #ifdef CONFIG_DEBUG_NMI_SELFTEST
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index b5a8f0891135..d0464c7a0af5 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1272,13 +1272,57 @@ void play_dead_common(void)
>         local_irq_disable();
>  }
>
> +void __noreturn mwait_play_dead_with_hint(unsigned long eax_hint)
> +{
> +       struct mwait_cpu_dead *md = this_cpu_ptr(&mwait_cpu_dead);
> +
> +       /* Set up state for the kexec() hack below */
> +       md->status = CPUDEAD_MWAIT_WAIT;
> +       md->control = CPUDEAD_MWAIT_WAIT;
> +
> +       wbinvd();
> +
> +       while (1) {
> +               /*
> +                * The CLFLUSH is a workaround for erratum AAI65 for
> +                * the Xeon 7400 series.  It's not clear it is actually
> +                * needed, but it should be harmless in either case.
> +                * The WBINVD is insufficient due to the spurious-wakeup
> +                * case where we return around the loop.
> +                */
> +               mb();
> +               clflush(md);
> +               mb();
> +               __monitor(md, 0, 0);
> +               mb();
> +               __mwait(eax_hint, 0);
> +
> +               if (READ_ONCE(md->control) == CPUDEAD_MWAIT_KEXEC_HLT) {
> +                       /*
> +                        * Kexec is about to happen. Don't go back into mwait() as
> +                        * the kexec kernel might overwrite text and data including
> +                        * page tables and stack. So mwait() would resume when the
> +                        * monitor cache line is written to and then the CPU goes
> +                        * south due to overwritten text, page tables and stack.
> +                        *
> +                        * Note: This does _NOT_ protect against a stray MCE, NMI,
> +                        * SMI. They will resume execution at the instruction
> +                        * following the HLT instruction and run into the problem
> +                        * which this is trying to prevent.
> +                        */
> +                       WRITE_ONCE(md->status, CPUDEAD_MWAIT_KEXEC_HLT);
> +                       while(1)
> +                               native_halt();
> +               }
> +       }
> +}
> +
>  /*
>   * We need to flush the caches before going to sleep, lest we have
>   * dirty data in our caches when we come back up.
>   */
>  static inline void mwait_play_dead(void)
>  {
> -       struct mwait_cpu_dead *md = this_cpu_ptr(&mwait_cpu_dead);
>         unsigned int eax, ebx, ecx, edx;
>         unsigned int highest_cstate = 0;
>         unsigned int highest_subcstate = 0;
> @@ -1316,45 +1360,7 @@ static inline void mwait_play_dead(void)
>                         (highest_subcstate - 1);
>         }
>
> -       /* Set up state for the kexec() hack below */
> -       md->status = CPUDEAD_MWAIT_WAIT;
> -       md->control = CPUDEAD_MWAIT_WAIT;
> -
> -       wbinvd();
> -
> -       while (1) {
> -               /*
> -                * The CLFLUSH is a workaround for erratum AAI65 for
> -                * the Xeon 7400 series.  It's not clear it is actually
> -                * needed, but it should be harmless in either case.
> -                * The WBINVD is insufficient due to the spurious-wakeup
> -                * case where we return around the loop.
> -                */
> -               mb();
> -               clflush(md);
> -               mb();
> -               __monitor(md, 0, 0);
> -               mb();
> -               __mwait(eax, 0);
> -
> -               if (READ_ONCE(md->control) == CPUDEAD_MWAIT_KEXEC_HLT) {
> -                       /*
> -                        * Kexec is about to happen. Don't go back into mwait() as
> -                        * the kexec kernel might overwrite text and data including
> -                        * page tables and stack. So mwait() would resume when the
> -                        * monitor cache line is written to and then the CPU goes
> -                        * south due to overwritten text, page tables and stack.
> -                        *
> -                        * Note: This does _NOT_ protect against a stray MCE, NMI,
> -                        * SMI. They will resume execution at the instruction
> -                        * following the HLT instruction and run into the problem
> -                        * which this is trying to prevent.
> -                        */
> -                       WRITE_ONCE(md->status, CPUDEAD_MWAIT_KEXEC_HLT);
> -                       while(1)
> -                               native_halt();
> -               }
> -       }
> +       mwait_play_dead_with_hint(eax);
>  }
>
>  /*
> --
> 2.47.0
>
>
diff mbox series

Patch

diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index ca073f40698f..d12fab4a83c5 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -114,6 +114,7 @@  void wbinvd_on_cpu(int cpu);
 int wbinvd_on_all_cpus(void);
 
 void smp_kick_mwait_play_dead(void);
+void mwait_play_dead_with_hint(unsigned long hint);
 
 void native_smp_send_reschedule(int cpu);
 void native_send_call_func_ipi(const struct cpumask *mask);
@@ -164,6 +165,8 @@  static inline struct cpumask *cpu_llc_shared_mask(int cpu)
 {
 	return (struct cpumask *)cpumask_of(0);
 }
+
+static inline void mwait_play_dead_with_hint(unsigned long eax_hint) { }
 #endif /* CONFIG_SMP */
 
 #ifdef CONFIG_DEBUG_NMI_SELFTEST
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index b5a8f0891135..d0464c7a0af5 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1272,13 +1272,57 @@  void play_dead_common(void)
 	local_irq_disable();
 }
 
+void __noreturn mwait_play_dead_with_hint(unsigned long eax_hint)
+{
+	struct mwait_cpu_dead *md = this_cpu_ptr(&mwait_cpu_dead);
+
+	/* Set up state for the kexec() hack below */
+	md->status = CPUDEAD_MWAIT_WAIT;
+	md->control = CPUDEAD_MWAIT_WAIT;
+
+	wbinvd();
+
+	while (1) {
+		/*
+		 * The CLFLUSH is a workaround for erratum AAI65 for
+		 * the Xeon 7400 series.  It's not clear it is actually
+		 * needed, but it should be harmless in either case.
+		 * The WBINVD is insufficient due to the spurious-wakeup
+		 * case where we return around the loop.
+		 */
+		mb();
+		clflush(md);
+		mb();
+		__monitor(md, 0, 0);
+		mb();
+		__mwait(eax_hint, 0);
+
+		if (READ_ONCE(md->control) == CPUDEAD_MWAIT_KEXEC_HLT) {
+			/*
+			 * Kexec is about to happen. Don't go back into mwait() as
+			 * the kexec kernel might overwrite text and data including
+			 * page tables and stack. So mwait() would resume when the
+			 * monitor cache line is written to and then the CPU goes
+			 * south due to overwritten text, page tables and stack.
+			 *
+			 * Note: This does _NOT_ protect against a stray MCE, NMI,
+			 * SMI. They will resume execution at the instruction
+			 * following the HLT instruction and run into the problem
+			 * which this is trying to prevent.
+			 */
+			WRITE_ONCE(md->status, CPUDEAD_MWAIT_KEXEC_HLT);
+			while(1)
+				native_halt();
+		}
+	}
+}
+
 /*
  * We need to flush the caches before going to sleep, lest we have
  * dirty data in our caches when we come back up.
  */
 static inline void mwait_play_dead(void)
 {
-	struct mwait_cpu_dead *md = this_cpu_ptr(&mwait_cpu_dead);
 	unsigned int eax, ebx, ecx, edx;
 	unsigned int highest_cstate = 0;
 	unsigned int highest_subcstate = 0;
@@ -1316,45 +1360,7 @@  static inline void mwait_play_dead(void)
 			(highest_subcstate - 1);
 	}
 
-	/* Set up state for the kexec() hack below */
-	md->status = CPUDEAD_MWAIT_WAIT;
-	md->control = CPUDEAD_MWAIT_WAIT;
-
-	wbinvd();
-
-	while (1) {
-		/*
-		 * The CLFLUSH is a workaround for erratum AAI65 for
-		 * the Xeon 7400 series.  It's not clear it is actually
-		 * needed, but it should be harmless in either case.
-		 * The WBINVD is insufficient due to the spurious-wakeup
-		 * case where we return around the loop.
-		 */
-		mb();
-		clflush(md);
-		mb();
-		__monitor(md, 0, 0);
-		mb();
-		__mwait(eax, 0);
-
-		if (READ_ONCE(md->control) == CPUDEAD_MWAIT_KEXEC_HLT) {
-			/*
-			 * Kexec is about to happen. Don't go back into mwait() as
-			 * the kexec kernel might overwrite text and data including
-			 * page tables and stack. So mwait() would resume when the
-			 * monitor cache line is written to and then the CPU goes
-			 * south due to overwritten text, page tables and stack.
-			 *
-			 * Note: This does _NOT_ protect against a stray MCE, NMI,
-			 * SMI. They will resume execution at the instruction
-			 * following the HLT instruction and run into the problem
-			 * which this is trying to prevent.
-			 */
-			WRITE_ONCE(md->status, CPUDEAD_MWAIT_KEXEC_HLT);
-			while(1)
-				native_halt();
-		}
-	}
+	mwait_play_dead_with_hint(eax);
 }
 
 /*