diff mbox series

[2/3] x86/ucode: Fold microcode_update_cpu() and fix error handling

Message ID 20241107122117.4073266-3-andrew.cooper3@citrix.com (mailing list archive)
State Superseded
Headers show
Series x86/ucode: Simplify/fix loading paths further | expand

Commit Message

Andrew Cooper Nov. 7, 2024, 12:21 p.m. UTC
Fold microcode_update_cpu() into its single remaining caller and simplify the
logic by removing the patch != NULL path with microcode_mutex held.

Explain why we bother grabbing the microcode revision even if we can't load
microcode.

Furthermore, delete the -EIO path.  An error updating microcode on AP boot or
S3 resume is certainly bad, but freeing the cache is about the worst possible
action we can take in response; it prevents subsequent APs from taking an
update they might have accepted.

This avoids a double collect_cpu_info() call on each AP.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/cpu/microcode/core.c | 49 +++++++++----------------------
 1 file changed, 14 insertions(+), 35 deletions(-)

Comments

Jan Beulich Nov. 12, 2024, 10:45 a.m. UTC | #1
On 07.11.2024 13:21, Andrew Cooper wrote:
> Fold microcode_update_cpu() into its single remaining caller and simplify the
> logic by removing the patch != NULL path with microcode_mutex held.
> 
> Explain why we bother grabbing the microcode revision even if we can't load
> microcode.
> 
> Furthermore, delete the -EIO path.  An error updating microcode on AP boot or
> S3 resume is certainly bad, but freeing the cache is about the worst possible
> action we can take in response; it prevents subsequent APs from taking an
> update they might have accepted.

I'm afraid I disagree here, but I also disagree with the present error handling.
-EIO indicates the patch didn't apply. Why would there be any hope that any
other CPU would accept it? We're assuming fully symmetric hardware, after all.
However, imo it's not -EIO that ought to be special cased, but success and
-EEXIST. In all other cases the same error will re-surface for other CPUs. Plus
by not cleaning the cache we prevent an older revision to be installed (without
forcing its installation).

Keeping what's cached might be an option, but then followed by cleaning the
cache unless at least one CPU actually accepted the ucode.

Jan
Andrew Cooper Nov. 12, 2024, 12:55 p.m. UTC | #2
On 12/11/2024 10:45 am, Jan Beulich wrote:
> On 07.11.2024 13:21, Andrew Cooper wrote:
>> Fold microcode_update_cpu() into its single remaining caller and simplify the
>> logic by removing the patch != NULL path with microcode_mutex held.
>>
>> Explain why we bother grabbing the microcode revision even if we can't load
>> microcode.
>>
>> Furthermore, delete the -EIO path.  An error updating microcode on AP boot or
>> S3 resume is certainly bad, but freeing the cache is about the worst possible
>> action we can take in response; it prevents subsequent APs from taking an
>> update they might have accepted.
> I'm afraid I disagree here, but I also disagree with the present error handling.
> -EIO indicates the patch didn't apply. Why would there be any hope that any
> other CPU would accept it?

-EIO is "something went wrong".

On modern systems this can include "checksum didn't match because
there's a bad SRAM cell".  This is literally one of the failures leading
to the introduction of In-Field-Scan.

Individual cores really can fail in a way which won't be the same
elsewhere in the system.

> Keeping what's cached might be an option, but then followed by cleaning the
> cache unless at least one CPU actually accepted the ucode.

We already have that behaviour.


We cache speculatively on boot, even if the BSP doesn't need to load,
because APs might need to.  This really is the best we can do.

The only other time the cache gets modified is after a late-load attempt
which reported success.  There are still a lot of partial-failure error
cases to handle less badly, but that needs yet more untangling before it
can be addressed adequately.

~Andrew
Jan Beulich Nov. 12, 2024, 2:24 p.m. UTC | #3
On 12.11.2024 13:55, Andrew Cooper wrote:
> On 12/11/2024 10:45 am, Jan Beulich wrote:
>> On 07.11.2024 13:21, Andrew Cooper wrote:
>>> Fold microcode_update_cpu() into its single remaining caller and simplify the
>>> logic by removing the patch != NULL path with microcode_mutex held.
>>>
>>> Explain why we bother grabbing the microcode revision even if we can't load
>>> microcode.
>>>
>>> Furthermore, delete the -EIO path.  An error updating microcode on AP boot or
>>> S3 resume is certainly bad, but freeing the cache is about the worst possible
>>> action we can take in response; it prevents subsequent APs from taking an
>>> update they might have accepted.
>> I'm afraid I disagree here, but I also disagree with the present error handling.
>> -EIO indicates the patch didn't apply. Why would there be any hope that any
>> other CPU would accept it?
> 
> -EIO is "something went wrong".
> 
> On modern systems this can include "checksum didn't match because
> there's a bad SRAM cell".  This is literally one of the failures leading
> to the introduction of In-Field-Scan.
> 
> Individual cores really can fail in a way which won't be the same
> elsewhere in the system.

Hmm, well, slightly hesitantly
Acked-by: Jan Beulich <jbeulich@suse.com>

Ideally with a remark added to the description that there is known room
for further improvement.

>> Keeping what's cached might be an option, but then followed by cleaning the
>> cache unless at least one CPU actually accepted the ucode.
> 
> We already have that behaviour.
> 
> 
> We cache speculatively on boot, even if the BSP doesn't need to load,
> because APs might need to.  This really is the best we can do.

That's a different scenario. If we ended up with ucode which no single
CPU accepts, there's hardly much point in caching that ucode. This
specifically is meant not to include the case where simply all CPUs are
already up-to-date. The one largely theoretical case where caching may
still make sense is for CPU hotplug, where the hot-plugged CPU(s) may
accept what all boot-time CPUs refused.

Jan
diff mbox series

Patch

diff --git a/xen/arch/x86/cpu/microcode/core.c b/xen/arch/x86/cpu/microcode/core.c
index d9406ec3fd34..fd4b08b45388 100644
--- a/xen/arch/x86/cpu/microcode/core.c
+++ b/xen/arch/x86/cpu/microcode/core.c
@@ -263,40 +263,6 @@  static bool cf_check wait_cpu_callout(unsigned int nr)
     return atomic_read(&cpu_out) >= nr;
 }
 
-/*
- * Load a microcode update to current CPU.
- *
- * If no patch is provided, the cached patch will be loaded. Microcode update
- * during APs bringup and CPU resuming falls into this case.
- */
-static int microcode_update_cpu(const struct microcode_patch *patch,
-                                unsigned int flags)
-{
-    int err;
-
-    alternative_vcall(ucode_ops.collect_cpu_info);
-
-    spin_lock(&microcode_mutex);
-    if ( patch )
-        err = alternative_call(ucode_ops.apply_microcode, patch, flags);
-    else if ( microcode_cache )
-    {
-        err = alternative_call(ucode_ops.apply_microcode, microcode_cache,
-                               flags);
-        if ( err == -EIO )
-        {
-            microcode_free_patch(microcode_cache);
-            microcode_cache = NULL;
-        }
-    }
-    else
-        /* No patch to update */
-        err = -ENOENT;
-    spin_unlock(&microcode_mutex);
-
-    return err;
-}
-
 static bool wait_for_state(typeof(loading_state) state)
 {
     typeof(loading_state) cur_state;
@@ -702,13 +668,26 @@  int microcode_update(XEN_GUEST_HANDLE(const_void) buf,
 /* Load a cached update to current cpu */
 int microcode_update_one(void)
 {
+    int rc;
+
+    /*
+     * This path is used for APs and S3 resume.  Read the microcode revision
+     * if possible, even if we can't load microcode.
+     */
     if ( ucode_ops.collect_cpu_info )
         alternative_vcall(ucode_ops.collect_cpu_info);
 
     if ( !ucode_ops.apply_microcode )
         return -EOPNOTSUPP;
 
-    return microcode_update_cpu(NULL, 0);
+    spin_lock(&microcode_mutex);
+    if ( microcode_cache )
+        rc = alternative_call(ucode_ops.apply_microcode, microcode_cache, 0);
+    else
+        rc = -ENOENT;
+    spin_unlock(&microcode_mutex);
+
+    return rc;
 }
 
 /*