diff mbox series

x86/ucode: Refresh raw CPU policy after microcode load

Message ID 20230503185813.3050382-1-andrew.cooper3@citrix.com (mailing list archive)
State Superseded
Headers show
Series x86/ucode: Refresh raw CPU policy after microcode load | expand

Commit Message

Andrew Cooper May 3, 2023, 6:58 p.m. UTC
Loading microcode can cause new features to appear.  This has happened
routinely since Spectre/Meltdown, and even the presence of new status bits can
mean the administrator has no further work to perform.

Refresh the raw CPU policy after late microcode load, so xen-cpuid can reflect
the updated state of the system.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>

This is also the first step of being able to livepatch support for new
functionality in microcode.
---
 xen/arch/x86/cpu-policy.c             | 6 +++---
 xen/arch/x86/cpu/microcode/core.c     | 4 ++++
 xen/arch/x86/include/asm/cpu-policy.h | 6 ++++++
 3 files changed, 13 insertions(+), 3 deletions(-)

Comments

Jan Beulich May 4, 2023, 7:08 a.m. UTC | #1
On 03.05.2023 20:58, Andrew Cooper wrote:
> Loading microcode can cause new features to appear.

Or disappear (LWP)? While I don't think we want to panic() in this
case (we do on the S3 resume path when recheck_cpu_features() fails
on the BSP), it would seem to me that we want to go a step further
than you do and at least warn when a feature went amiss. Or is that
too much of a scope-creeping request right here?

> @@ -677,6 +678,9 @@ static long cf_check microcode_update_helper(void *data)
>          spin_lock(&microcode_mutex);
>          microcode_update_cache(patch);
>          spin_unlock(&microcode_mutex);
> +
> +        /* Refresh the raw CPU policy, in case the features have changed. */
> +        calculate_raw_cpu_policy();

I understand this is in line with what we do during boot, but there
and here I wonder whether this wouldn't better deal with possible
asymmetries (e.g. in case ucode loading failed on one of the CPUs),
along the lines of what we do near the end of identify_cpu() for
APs. (Unlike the question higher up, this is definitely only a
remark here, not something I'd consider dealing with right in this
change.)

Jan
Andrew Cooper May 4, 2023, 8:08 a.m. UTC | #2
On 04/05/2023 8:08 am, Jan Beulich wrote:
> On 03.05.2023 20:58, Andrew Cooper wrote:
>> Loading microcode can cause new features to appear.
> Or disappear (LWP)? While I don't think we want to panic() in this
> case (we do on the S3 resume path when recheck_cpu_features() fails
> on the BSP), it would seem to me that we want to go a step further
> than you do and at least warn when a feature went amiss. Or is that
> too much of a scope-creeping request right here?

You're correct that I ought to discuss the disappear case.  But like
livepatching, it's firmly in the realm of "the end user takes
responsibility for trying this in their test system before running it in
production".

For LWP specifically, we ought to explicitly permit its disappearance in
recheck_cpu_features(), because this specific example is known to exist,
and known safe as Xen never used or virtualised LWP functionality. 
Crashing on S3

>
>> @@ -677,6 +678,9 @@ static long cf_check microcode_update_helper(void *data)
>>          spin_lock(&microcode_mutex);
>>          microcode_update_cache(patch);
>>          spin_unlock(&microcode_mutex);
>> +
>> +        /* Refresh the raw CPU policy, in case the features have changed. */
>> +        calculate_raw_cpu_policy();
> I understand this is in line with what we do during boot, but there
> and here I wonder whether this wouldn't better deal with possible
> asymmetries (e.g. in case ucode loading failed on one of the CPUs),
> along the lines of what we do near the end of identify_cpu() for
> APs. (Unlike the question higher up, this is definitely only a
> remark here, not something I'd consider dealing with right in this
> change.)

Asymmetry is an increasingly theoretical problem.  Yeah, it exists in
principle, but Xen has no way of letting you explicitly get into that
situation.

This too falls firmly into the "end user takes responsibility for
testing it properly first" category.

We have explicit symmetric assumptions/requirements elsewhere (e.g. for
a single system, there's 1 correct ucode blob).

We can acknowledge that asymmetry exists, but there is basically nothing
Xen can do about it other than highlight that something is very wrong on
the system.  Odds are that a system which gets into such a state won't
survive much longer.

~Andrew
Jan Beulich May 4, 2023, 8:17 a.m. UTC | #3
On 04.05.2023 10:08, Andrew Cooper wrote:
> On 04/05/2023 8:08 am, Jan Beulich wrote:
>> On 03.05.2023 20:58, Andrew Cooper wrote:
>>> Loading microcode can cause new features to appear.
>> Or disappear (LWP)? While I don't think we want to panic() in this
>> case (we do on the S3 resume path when recheck_cpu_features() fails
>> on the BSP), it would seem to me that we want to go a step further
>> than you do and at least warn when a feature went amiss. Or is that
>> too much of a scope-creeping request right here?
> 
> You're correct that I ought to discuss the disappear case.  But like
> livepatching, it's firmly in the realm of "the end user takes
> responsibility for trying this in their test system before running it in
> production".

Okay, with the case at least suitably mentioned
Reviewed-by: Jan Beulich <jbeulich@suse.com>

> For LWP specifically, we ought to explicitly permit its disappearance in
> recheck_cpu_features(), because this specific example is known to exist,
> and known safe as Xen never used or virtualised LWP functionality. 

Right, but iirc we did expose it to guests earlier on. And I've used it as
a known example only anyway. Who knows what vendors decide to make disappear
the next time round ...

> Crashing on S3

I may guess you meant to continue "... is bad anyway"?

>>> @@ -677,6 +678,9 @@ static long cf_check microcode_update_helper(void *data)
>>>          spin_lock(&microcode_mutex);
>>>          microcode_update_cache(patch);
>>>          spin_unlock(&microcode_mutex);
>>> +
>>> +        /* Refresh the raw CPU policy, in case the features have changed. */
>>> +        calculate_raw_cpu_policy();
>> I understand this is in line with what we do during boot, but there
>> and here I wonder whether this wouldn't better deal with possible
>> asymmetries (e.g. in case ucode loading failed on one of the CPUs),
>> along the lines of what we do near the end of identify_cpu() for
>> APs. (Unlike the question higher up, this is definitely only a
>> remark here, not something I'd consider dealing with right in this
>> change.)
> 
> Asymmetry is an increasingly theoretical problem.  Yeah, it exists in
> principle, but Xen has no way of letting you explicitly get into that
> situation.
> 
> This too falls firmly into the "end user takes responsibility for
> testing it properly first" category.
> 
> We have explicit symmetric assumptions/requirements elsewhere (e.g. for
> a single system, there's 1 correct ucode blob).
> 
> We can acknowledge that asymmetry exists, but there is basically nothing
> Xen can do about it other than highlight that something is very wrong on
> the system.  Odds are that a system which gets into such a state won't
> survive much longer.

Indeed. Hence the desire to at least log the fact, such that investigation
of the sudden death won't take long.

Jan
Roger Pau Monné May 4, 2023, 8:22 a.m. UTC | #4
On Thu, May 04, 2023 at 09:08:02AM +0100, Andrew Cooper wrote:
> On 04/05/2023 8:08 am, Jan Beulich wrote:
> > On 03.05.2023 20:58, Andrew Cooper wrote:
> >> Loading microcode can cause new features to appear.
> > Or disappear (LWP)? While I don't think we want to panic() in this
> > case (we do on the S3 resume path when recheck_cpu_features() fails
> > on the BSP), it would seem to me that we want to go a step further
> > than you do and at least warn when a feature went amiss. Or is that
> > too much of a scope-creeping request right here?
> 
> You're correct that I ought to discuss the disappear case.  But like
> livepatching, it's firmly in the realm of "the end user takes
> responsibility for trying this in their test system before running it in
> production".
> 
> For LWP specifically, we ought to explicitly permit its disappearance in
> recheck_cpu_features(), because this specific example is known to exist,
> and known safe as Xen never used or virtualised LWP functionality. 
> Crashing on S3

Printing disappearing features might be a nice bonus, but could be
done from the tool that loads the microcode itself, no need to do such
work in the hypervisor IMO.

> >
> >> @@ -677,6 +678,9 @@ static long cf_check microcode_update_helper(void *data)
> >>          spin_lock(&microcode_mutex);
> >>          microcode_update_cache(patch);
> >>          spin_unlock(&microcode_mutex);
> >> +
> >> +        /* Refresh the raw CPU policy, in case the features have changed. */
> >> +        calculate_raw_cpu_policy();
> > I understand this is in line with what we do during boot, but there
> > and here I wonder whether this wouldn't better deal with possible
> > asymmetries (e.g. in case ucode loading failed on one of the CPUs),
> > along the lines of what we do near the end of identify_cpu() for
> > APs. (Unlike the question higher up, this is definitely only a
> > remark here, not something I'd consider dealing with right in this
> > change.)
> 
> Asymmetry is an increasingly theoretical problem.  Yeah, it exists in
> principle, but Xen has no way of letting you explicitly get into that
> situation.
> 
> This too falls firmly into the "end user takes responsibility for
> testing it properly first" category.
> 
> We have explicit symmetric assumptions/requirements elsewhere (e.g. for
> a single system, there's 1 correct ucode blob).
> 
> We can acknowledge that asymmetry exists, but there is basically nothing
> Xen can do about it other than highlight that something is very wrong on
> the system.  Odds are that a system which gets into such a state won't
> survive much longer.

Would it make sense to only update the CPU policy if updated ==
nr_cores?

Thanks, Roger.
Jan Beulich May 4, 2023, 9:07 a.m. UTC | #5
On 04.05.2023 10:22, Roger Pau Monné wrote:
> On Thu, May 04, 2023 at 09:08:02AM +0100, Andrew Cooper wrote:
>> On 04/05/2023 8:08 am, Jan Beulich wrote:
>>> On 03.05.2023 20:58, Andrew Cooper wrote:
>>>> Loading microcode can cause new features to appear.
>>> Or disappear (LWP)? While I don't think we want to panic() in this
>>> case (we do on the S3 resume path when recheck_cpu_features() fails
>>> on the BSP), it would seem to me that we want to go a step further
>>> than you do and at least warn when a feature went amiss. Or is that
>>> too much of a scope-creeping request right here?
>>
>> You're correct that I ought to discuss the disappear case.  But like
>> livepatching, it's firmly in the realm of "the end user takes
>> responsibility for trying this in their test system before running it in
>> production".
>>
>> For LWP specifically, we ought to explicitly permit its disappearance in
>> recheck_cpu_features(), because this specific example is known to exist,
>> and known safe as Xen never used or virtualised LWP functionality. 
>> Crashing on S3
> 
> Printing disappearing features might be a nice bonus, but could be
> done from the tool that loads the microcode itself, no need to do such
> work in the hypervisor IMO.

Except that the system may not last long enough for the (or any) tool
to actually make it to producing such output, let alone any human
actually observing it (when that output isn't necessarily going to
some remote system).

Jan
Andrew Cooper May 4, 2023, 9:28 a.m. UTC | #6
On 04/05/2023 10:07 am, Jan Beulich wrote:
> On 04.05.2023 10:22, Roger Pau Monné wrote:
>> On Thu, May 04, 2023 at 09:08:02AM +0100, Andrew Cooper wrote:
>>> On 04/05/2023 8:08 am, Jan Beulich wrote:
>>>> On 03.05.2023 20:58, Andrew Cooper wrote:
>>>>> Loading microcode can cause new features to appear.
>>>> Or disappear (LWP)? While I don't think we want to panic() in this
>>>> case (we do on the S3 resume path when recheck_cpu_features() fails
>>>> on the BSP), it would seem to me that we want to go a step further
>>>> than you do and at least warn when a feature went amiss. Or is that
>>>> too much of a scope-creeping request right here?
>>> You're correct that I ought to discuss the disappear case.  But like
>>> livepatching, it's firmly in the realm of "the end user takes
>>> responsibility for trying this in their test system before running it in
>>> production".
>>>
>>> For LWP specifically, we ought to explicitly permit its disappearance in
>>> recheck_cpu_features(), because this specific example is known to exist,
>>> and known safe as Xen never used or virtualised LWP functionality. 
>>> Crashing on S3
>> Printing disappearing features might be a nice bonus, but could be
>> done from the tool that loads the microcode itself, no need to do such
>> work in the hypervisor IMO.

Xen is really the only entity in a position to judge when stuff has gone
away, so this can't really be deferred to another tool.

We have the X86_FEATURE_* names during boot for parsing the
{dom0-}cpuid= command line options, but we don't keep this beyond init

> Except that the system may not last long enough for the (or any) tool
> to actually make it to producing such output, let alone any human
> actually observing it (when that output isn't necessarily going to
> some remote system).

Yeah, `xl dmesg`/serial/whatever is the one place liable for anything
like this to get noticed.

~Andrew
Andrew Cooper May 4, 2023, 10:19 a.m. UTC | #7
On 04/05/2023 9:17 am, Jan Beulich wrote:
> On 04.05.2023 10:08, Andrew Cooper wrote:
>> On 04/05/2023 8:08 am, Jan Beulich wrote:
>>> On 03.05.2023 20:58, Andrew Cooper wrote:
>>>> Loading microcode can cause new features to appear.
>>> Or disappear (LWP)? While I don't think we want to panic() in this
>>> case (we do on the S3 resume path when recheck_cpu_features() fails
>>> on the BSP), it would seem to me that we want to go a step further
>>> than you do and at least warn when a feature went amiss. Or is that
>>> too much of a scope-creeping request right here?
>> You're correct that I ought to discuss the disappear case.  But like
>> livepatching, it's firmly in the realm of "the end user takes
>> responsibility for trying this in their test system before running it in
>> production".
> Okay, with the case at least suitably mentioned
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

Thanks,

>> For LWP specifically, we ought to explicitly permit its disappearance in
>> recheck_cpu_features(), because this specific example is known to exist,
>> and known safe as Xen never used or virtualised LWP functionality. 
> Right, but iirc we did expose it to guests earlier on. And I've used it as
> a known example only anyway. Who knows what vendors decide to make disappear
> the next time round ...

It's true that we used to expose the CPUID bit to guests, but IIRC we
never virtualised it correctly which was my justification for hiding the
feature bit when I was doing the original work to rationalise what
guests saw.

 
Removing LWP was an extraordinary situation, and AMD didn't do it lightly.

What they did was sacrifice a fairly expensive LWP errata workaround to
make space ucode space for IBPB instead.  They hid the CPUID bit (and
specifically by modifying the CPUID mask MSR only) because there were no
known production users of LWP, and a consequence of the patch, an errata
got unfixed.

On real AMD systems which used to enumerate LWP, and subsequently ceased
to, it was only "normal virt" levels of feature hiding.  The LWP CPUID
leaf still has real data in it, and you can still use %xcr0.lwp, and the
LWP MSRs still function.

Software which was genuinely using LWP, not tickling the errata case,
and late loaded this patch wouldn't actually have anything malfunction
underfoot.

>> Crashing on S3
> I may guess you meant to continue "... is bad anyway"?

Oops.  I was clearly in a rush for my morning meeting.  What I meant was
"we probably shouldn't crash on S3 resume in this known-ok case".

~Andrew
diff mbox series

Patch

diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index a58bf6cad54e..ef6a2d0d180a 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -15,7 +15,7 @@ 
 #include <asm/setup.h>
 #include <asm/xstate.h>
 
-struct cpu_policy __ro_after_init     raw_cpu_policy;
+struct cpu_policy __read_mostly       raw_cpu_policy;
 struct cpu_policy __ro_after_init    host_cpu_policy;
 #ifdef CONFIG_PV
 struct cpu_policy __ro_after_init  pv_max_cpu_policy;
@@ -343,7 +343,7 @@  static void recalculate_misc(struct cpu_policy *p)
     }
 }
 
-static void __init calculate_raw_policy(void)
+void calculate_raw_cpu_policy(void)
 {
     struct cpu_policy *p = &raw_cpu_policy;
 
@@ -655,7 +655,7 @@  static void __init calculate_hvm_def_policy(void)
 
 void __init init_guest_cpu_policies(void)
 {
-    calculate_raw_policy();
+    calculate_raw_cpu_policy();
     calculate_host_policy();
 
     if ( IS_ENABLED(CONFIG_PV) )
diff --git a/xen/arch/x86/cpu/microcode/core.c b/xen/arch/x86/cpu/microcode/core.c
index 61cd36d601d6..cd456c476fbf 100644
--- a/xen/arch/x86/cpu/microcode/core.c
+++ b/xen/arch/x86/cpu/microcode/core.c
@@ -34,6 +34,7 @@ 
 #include <xen/watchdog.h>
 
 #include <asm/apic.h>
+#include <asm/cpu-policy.h>
 #include <asm/delay.h>
 #include <asm/nmi.h>
 #include <asm/processor.h>
@@ -677,6 +678,9 @@  static long cf_check microcode_update_helper(void *data)
         spin_lock(&microcode_mutex);
         microcode_update_cache(patch);
         spin_unlock(&microcode_mutex);
+
+        /* Refresh the raw CPU policy, in case the features have changed. */
+        calculate_raw_cpu_policy();
     }
     else
         microcode_free_patch(patch);
diff --git a/xen/arch/x86/include/asm/cpu-policy.h b/xen/arch/x86/include/asm/cpu-policy.h
index b361537a602b..99d5a8e67eeb 100644
--- a/xen/arch/x86/include/asm/cpu-policy.h
+++ b/xen/arch/x86/include/asm/cpu-policy.h
@@ -24,4 +24,10 @@  void init_dom0_cpuid_policy(struct domain *d);
 /* Clamp the CPUID policy to reality. */
 void recalculate_cpuid_policy(struct domain *d);
 
+/*
+ * Collect the raw CPUID and MSR values.  Called during boot, and after late
+ * microcode loading.
+ */
+void calculate_raw_cpu_policy(void);
+
 #endif /* X86_CPU_POLICY_H */