diff mbox series

x86/idle: prevent entering C6 with in service interrupts on Intel

Message ID 20200507132236.26010-1-roger.pau@citrix.com (mailing list archive)
State Superseded
Headers show
Series x86/idle: prevent entering C6 with in service interrupts on Intel | expand

Commit Message

Roger Pau Monné May 7, 2020, 1:22 p.m. UTC
Apply a workaround for Intel errata CLX30: "A Pending Fixed Interrupt
May Be Dispatched Before an Interrupt of The Same Priority Completes".

It's not clear which models are affected, as the errata is listed in
the "Second Generation Intel Xeon Scalable Processors" specification
update, but the issue has been seen as far back as Nehalem processors.
Apply the workaround to all Intel processors, the condition can be
relaxed later.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 docs/misc/xen-command-line.pandoc |  8 ++++++++
 xen/arch/x86/acpi/cpu_idle.c      | 22 +++++++++++++++++++++-
 xen/arch/x86/cpu/mwait-idle.c     |  3 +++
 xen/include/asm-x86/cpuidle.h     |  1 +
 4 files changed, 33 insertions(+), 1 deletion(-)

Comments

Jan Beulich May 8, 2020, 1:46 p.m. UTC | #1
On 07.05.2020 15:22, Roger Pau Monne wrote:
> Apply a workaround for Intel errata CLX30: "A Pending Fixed Interrupt
> May Be Dispatched Before an Interrupt of The Same Priority Completes".
> 
> It's not clear which models are affected, as the errata is listed in
> the "Second Generation Intel Xeon Scalable Processors" specification
> update, but the issue has been seen as far back as Nehalem processors.
> Apply the workaround to all Intel processors, the condition can be
> relaxed later.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
>  docs/misc/xen-command-line.pandoc |  8 ++++++++
>  xen/arch/x86/acpi/cpu_idle.c      | 22 +++++++++++++++++++++-
>  xen/arch/x86/cpu/mwait-idle.c     |  3 +++
>  xen/include/asm-x86/cpuidle.h     |  1 +
>  4 files changed, 33 insertions(+), 1 deletion(-)
> 
> diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
> index ee12b0f53f..6e868a2185 100644
> --- a/docs/misc/xen-command-line.pandoc
> +++ b/docs/misc/xen-command-line.pandoc
> @@ -652,6 +652,14 @@ Specify the size of the console debug trace buffer. By specifying `cpu:`
>  additionally a trace buffer of the specified size is allocated per cpu.
>  The debug trace feature is only enabled in debugging builds of Xen.
>  
> +### disable-c6-isr
> +> `= <boolean>`
> +
> +> Default: `true for Intel CPUs`
> +
> +Workaround for Intel errata CLX30. Prevent entering C6 idle states with in
> +service local APIC interrupts. Enabled by default for all Intel CPUs.
> +
>  ### dma_bits
>  > `= <integer>`
>  
> diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
> index b83446e77d..5023fea148 100644
> --- a/xen/arch/x86/acpi/cpu_idle.c
> +++ b/xen/arch/x86/acpi/cpu_idle.c
> @@ -573,6 +573,25 @@ static bool errata_c6_eoi_workaround(void)
>      return (fix_needed && cpu_has_pending_apic_eoi());
>  }
>  
> +static int8_t __read_mostly disable_c6_isr = -1;
> +boolean_param("disable-c6-isr", disable_c6_isr);
> +
> +/*
> + * Errata CLX30: A Pending Fixed Interrupt May Be Dispatched Before an
> + * Interrupt of The Same Priority Completes.
> + *
> + * Prevent entering C6 if there are pending lapic interrupts, or else the
> + * processor might dispatch further pending interrupts before the first one has
> + * been completed.
> + */
> +bool errata_c6_isr_workaround(void)
> +{
> +    if ( unlikely(disable_c6_isr == -1) )
> +        disable_c6_isr = boot_cpu_data.x86_vendor == X86_VENDOR_INTEL;
> +
> +    return disable_c6_isr && cpu_has_pending_apic_eoi();

This check being the same as in errata_c6_eoi_workaround(),
why don't you simply extend that function? (See also below.)
Also both variable and command line option descriptor could
go inside the function, to limit their scopes.

> @@ -676,7 +695,8 @@ static void acpi_processor_idle(void)
>          return;
>      }
>  
> -    if ( (cx->type == ACPI_STATE_C3) && errata_c6_eoi_workaround() )
> +    if ( (cx->type == ACPI_STATE_C3) &&
> +         (errata_c6_eoi_workaround() || errata_c6_isr_workaround()) )
>          cx = power->safe_state;

I realize you only add to existing code, but I'm afraid this
existing code cannot be safely added to. Already prior to
your change there was a curious mismatch of C3 and c6 on this
line, and I don't see how ACPI_STATE_C3 correlates with
"core C6" state. Now this may have been the convention for
Nehalem/Westmere systems, but already the mwait-idle entries
for these CPU models have 4 entries (albeit such that they
match this scheme). As a result I think this at the very
least needs to be >=, not ==, even more so now that you want
to cover all Intel CPUs.

Obviously (I think) it is a mistake that mwait-idle doesn't
also activate this workaround, which imo is another reason to
stick to just errata_c6_eoi_workaround().

> --- a/xen/arch/x86/cpu/mwait-idle.c
> +++ b/xen/arch/x86/cpu/mwait-idle.c
> @@ -770,6 +770,9 @@ static void mwait_idle(void)
>  		return;
>  	}
>  
> +	if (cx->type == ACPI_STATE_C3 && errata_c6_isr_workaround())
> +		cx = power->safe_state;

Here it becomes even more relevant I think that == not be
used, as the static tables list deeper C-states; it's just
that the SKX table, which also gets used for CLX afaict, has
no entries beyond C6-SKX. Others with deeper ones presumably
have the deeper C-states similarly affected (or not) by this
erratum.

For reference, mwait_idle_cpu_init() has

		hint = flg2MWAIT(cpuidle_state_table[cstate].flags);
		state = MWAIT_HINT2CSTATE(hint) + 1;
                ...
		cx->type = state;

i.e. the value you compare against is derived from the static
table entries. For Nehalem/Westmere this means that what goes
under ACPI_STATE_C3 is indeed C6-NHM, and this looks to match
for all the non-Atoms, but for none of the Atoms. Now Atoms
could easily be unaffected, but (just to take an example) if
C6-SNB was affected, surely C7-SNB would be, too.

Jan
Roger Pau Monné May 8, 2020, 5:24 p.m. UTC | #2
On Fri, May 08, 2020 at 03:46:08PM +0200, Jan Beulich wrote:
> On 07.05.2020 15:22, Roger Pau Monne wrote:
> > diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
> > index b83446e77d..5023fea148 100644
> > --- a/xen/arch/x86/acpi/cpu_idle.c
> > +++ b/xen/arch/x86/acpi/cpu_idle.c
> > @@ -573,6 +573,25 @@ static bool errata_c6_eoi_workaround(void)
> >      return (fix_needed && cpu_has_pending_apic_eoi());
> >  }
> >  
> > +static int8_t __read_mostly disable_c6_isr = -1;
> > +boolean_param("disable-c6-isr", disable_c6_isr);
> > +
> > +/*
> > + * Errata CLX30: A Pending Fixed Interrupt May Be Dispatched Before an
> > + * Interrupt of The Same Priority Completes.
> > + *
> > + * Prevent entering C6 if there are pending lapic interrupts, or else the
> > + * processor might dispatch further pending interrupts before the first one has
> > + * been completed.
> > + */
> > +bool errata_c6_isr_workaround(void)
> > +{
> > +    if ( unlikely(disable_c6_isr == -1) )
> > +        disable_c6_isr = boot_cpu_data.x86_vendor == X86_VENDOR_INTEL;
> > +
> > +    return disable_c6_isr && cpu_has_pending_apic_eoi();
> 
> This check being the same as in errata_c6_eoi_workaround(),
> why don't you simply extend that function? (See also below.)
> Also both variable and command line option descriptor could
> go inside the function, to limit their scopes.

Since this is actually a superset (as it covers all Intel CPUs), I
should delete the previous (more restricted) workaround matching
logic.

> > @@ -676,7 +695,8 @@ static void acpi_processor_idle(void)
> >          return;
> >      }
> >  
> > -    if ( (cx->type == ACPI_STATE_C3) && errata_c6_eoi_workaround() )
> > +    if ( (cx->type == ACPI_STATE_C3) &&
> > +         (errata_c6_eoi_workaround() || errata_c6_isr_workaround()) )
> >          cx = power->safe_state;
> 
> I realize you only add to existing code, but I'm afraid this
> existing code cannot be safely added to. Already prior to
> your change there was a curious mismatch of C3 and c6 on this
> line, and I don't see how ACPI_STATE_C3 correlates with
> "core C6" state. Now this may have been the convention for
> Nehalem/Westmere systems, but already the mwait-idle entries
> for these CPU models have 4 entries (albeit such that they
> match this scheme). As a result I think this at the very
> least needs to be >=, not ==, even more so now that you want
> to cover all Intel CPUs.

Hm, I think this is because AFAICT acpi_processor_idle only
understands up to ACPI_STATE_C3, passing a type > ACPI_STATE_C3 will
just cause it to fallback to C0. I've adjusted the comparison to use
>= instead, as a safety imporvement in case we add support for more
states to acpi_processor_idle.

> Obviously (I think) it is a mistake that mwait-idle doesn't
> also activate this workaround, which imo is another reason to
> stick to just errata_c6_eoi_workaround().

I assumed the previous workaround didn't apply to any of the CPUs
supported by the mwait-idle driver, since it seems to only support
recent-ish models.

> > --- a/xen/arch/x86/cpu/mwait-idle.c
> > +++ b/xen/arch/x86/cpu/mwait-idle.c
> > @@ -770,6 +770,9 @@ static void mwait_idle(void)
> >  		return;
> >  	}
> >  
> > +	if (cx->type == ACPI_STATE_C3 && errata_c6_isr_workaround())
> > +		cx = power->safe_state;
> 
> Here it becomes even more relevant I think that == not be
> used, as the static tables list deeper C-states; it's just
> that the SKX table, which also gets used for CLX afaict, has
> no entries beyond C6-SKX. Others with deeper ones presumably
> have the deeper C-states similarly affected (or not) by this
> erratum.
> 
> For reference, mwait_idle_cpu_init() has
> 
> 		hint = flg2MWAIT(cpuidle_state_table[cstate].flags);
> 		state = MWAIT_HINT2CSTATE(hint) + 1;
>                 ...
> 		cx->type = state;
> 
> i.e. the value you compare against is derived from the static
> table entries. For Nehalem/Westmere this means that what goes
> under ACPI_STATE_C3 is indeed C6-NHM, and this looks to match
> for all the non-Atoms, but for none of the Atoms. Now Atoms
> could easily be unaffected, but (just to take an example) if
> C6-SNB was affected, surely C7-SNB would be, too.

Yes, I've adjusted this to use cx->type >= 3 instead. I have to admit
I'm confused by the name of the states being C6-* while the cstate
value reported by Xen will be 3 (I would instead expect the type to be
6 in order to match the name).

Thanks, Roger.
Jan Beulich May 11, 2020, 10:40 a.m. UTC | #3
On 08.05.2020 19:24, Roger Pau Monné wrote:
> On Fri, May 08, 2020 at 03:46:08PM +0200, Jan Beulich wrote:
>> On 07.05.2020 15:22, Roger Pau Monne wrote:
>>> --- a/xen/arch/x86/cpu/mwait-idle.c
>>> +++ b/xen/arch/x86/cpu/mwait-idle.c
>>> @@ -770,6 +770,9 @@ static void mwait_idle(void)
>>>  		return;
>>>  	}
>>>  
>>> +	if (cx->type == ACPI_STATE_C3 && errata_c6_isr_workaround())
>>> +		cx = power->safe_state;
>>
>> Here it becomes even more relevant I think that == not be
>> used, as the static tables list deeper C-states; it's just
>> that the SKX table, which also gets used for CLX afaict, has
>> no entries beyond C6-SKX. Others with deeper ones presumably
>> have the deeper C-states similarly affected (or not) by this
>> erratum.
>>
>> For reference, mwait_idle_cpu_init() has
>>
>> 		hint = flg2MWAIT(cpuidle_state_table[cstate].flags);
>> 		state = MWAIT_HINT2CSTATE(hint) + 1;
>>                 ...
>> 		cx->type = state;
>>
>> i.e. the value you compare against is derived from the static
>> table entries. For Nehalem/Westmere this means that what goes
>> under ACPI_STATE_C3 is indeed C6-NHM, and this looks to match
>> for all the non-Atoms, but for none of the Atoms. Now Atoms
>> could easily be unaffected, but (just to take an example) if
>> C6-SNB was affected, surely C7-SNB would be, too.
> 
> Yes, I've adjusted this to use cx->type >= 3 instead. I have to admit
> I'm confused by the name of the states being C6-* while the cstate
> value reported by Xen will be 3 (I would instead expect the type to be
> 6 in order to match the name).

Well, the problem is the disagreement in numbering between ACPI
and what the various CPU specs actually use.

Jan
diff mbox series

Patch

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index ee12b0f53f..6e868a2185 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -652,6 +652,14 @@  Specify the size of the console debug trace buffer. By specifying `cpu:`
 additionally a trace buffer of the specified size is allocated per cpu.
 The debug trace feature is only enabled in debugging builds of Xen.
 
+### disable-c6-isr
+> `= <boolean>`
+
+> Default: `true for Intel CPUs`
+
+Workaround for Intel errata CLX30. Prevent entering C6 idle states with in
+service local APIC interrupts. Enabled by default for all Intel CPUs.
+
 ### dma_bits
 > `= <integer>`
 
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index b83446e77d..5023fea148 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -573,6 +573,25 @@  static bool errata_c6_eoi_workaround(void)
     return (fix_needed && cpu_has_pending_apic_eoi());
 }
 
+static int8_t __read_mostly disable_c6_isr = -1;
+boolean_param("disable-c6-isr", disable_c6_isr);
+
+/*
+ * Errata CLX30: A Pending Fixed Interrupt May Be Dispatched Before an
+ * Interrupt of The Same Priority Completes.
+ *
+ * Prevent entering C6 if there are pending lapic interrupts, or else the
+ * processor might dispatch further pending interrupts before the first one has
+ * been completed.
+ */
+bool errata_c6_isr_workaround(void)
+{
+    if ( unlikely(disable_c6_isr == -1) )
+        disable_c6_isr = boot_cpu_data.x86_vendor == X86_VENDOR_INTEL;
+
+    return disable_c6_isr && cpu_has_pending_apic_eoi();
+}
+
 void update_last_cx_stat(struct acpi_processor_power *power,
                          struct acpi_processor_cx *cx, uint64_t ticks)
 {
@@ -676,7 +695,8 @@  static void acpi_processor_idle(void)
         return;
     }
 
-    if ( (cx->type == ACPI_STATE_C3) && errata_c6_eoi_workaround() )
+    if ( (cx->type == ACPI_STATE_C3) &&
+         (errata_c6_eoi_workaround() || errata_c6_isr_workaround()) )
         cx = power->safe_state;
 
 
diff --git a/xen/arch/x86/cpu/mwait-idle.c b/xen/arch/x86/cpu/mwait-idle.c
index b81937966e..e14cdaeed7 100644
--- a/xen/arch/x86/cpu/mwait-idle.c
+++ b/xen/arch/x86/cpu/mwait-idle.c
@@ -770,6 +770,9 @@  static void mwait_idle(void)
 		return;
 	}
 
+	if (cx->type == ACPI_STATE_C3 && errata_c6_isr_workaround())
+		cx = power->safe_state;
+
 	eax = cx->address;
 	cstate = ((eax >> MWAIT_SUBSTATE_SIZE) & MWAIT_CSTATE_MASK) + 1;
 
diff --git a/xen/include/asm-x86/cpuidle.h b/xen/include/asm-x86/cpuidle.h
index 5d7dffd228..8b9d6fdb15 100644
--- a/xen/include/asm-x86/cpuidle.h
+++ b/xen/include/asm-x86/cpuidle.h
@@ -26,4 +26,5 @@  void update_idle_stats(struct acpi_processor_power *,
 void update_last_cx_stat(struct acpi_processor_power *,
                          struct acpi_processor_cx *, uint64_t);
 
+bool errata_c6_isr_workaround(void);
 #endif /* __X86_ASM_CPUIDLE_H__ */