Message ID | 20241107122117.4073266-3-andrew.cooper3@citrix.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | x86/ucode: Simplify/fix loading paths further | expand |
On 07.11.2024 13:21, Andrew Cooper wrote: > Fold microcode_update_cpu() into its single remaining caller and simplify the > logic by removing the patch != NULL path with microcode_mutex held. > > Explain why we bother grabbing the microcode revision even if we can't load > microcode. > > Furthermore, delete the -EIO path. An error updating microcode on AP boot or > S3 resume is certainly bad, but freeing the cache is about the worst possible > action we can take in response; it prevents subsequent APs from taking an > update they might have accepted. I'm afraid I disagree here, but I also disagree with the present error handling. -EIO indicates the patch didn't apply. Why would there be any hope that any other CPU would accept it? We're assuming fully symmetric hardware, after all. However, imo it's not -EIO that ought to be special cased, but success and -EEXIST. In all other cases the same error will re-surface for other CPUs. Plus by not cleaning the cache we prevent an older revision to be installed (without forcing its installation). Keeping what's cached might be an option, but then followed by cleaning the cache unless at least one CPU actually accepted the ucode. Jan
On 12/11/2024 10:45 am, Jan Beulich wrote: > On 07.11.2024 13:21, Andrew Cooper wrote: >> Fold microcode_update_cpu() into its single remaining caller and simplify the >> logic by removing the patch != NULL path with microcode_mutex held. >> >> Explain why we bother grabbing the microcode revision even if we can't load >> microcode. >> >> Furthermore, delete the -EIO path. An error updating microcode on AP boot or >> S3 resume is certainly bad, but freeing the cache is about the worst possible >> action we can take in response; it prevents subsequent APs from taking an >> update they might have accepted. > I'm afraid I disagree here, but I also disagree with the present error handling. > -EIO indicates the patch didn't apply. Why would there be any hope that any > other CPU would accept it? -EIO is "something went wrong". On modern systems this can include "checksum didn't match because there's a bad SRAM cell". This is literally one of the failures leading to the introduction of In-Field-Scan. Individual cores really can fail in a way which won't be the same elsewhere in the system. > Keeping what's cached might be an option, but then followed by cleaning the > cache unless at least one CPU actually accepted the ucode. We already have that behaviour. We cache speculatively on boot, even if the BSP doesn't need to load, because APs might need to. This really is the best we can do. The only other time the cache gets modified is after a late-load attempt which reported success. There are still a lot of partial-failure error cases to handle less badly, but that needs yet more untangling before it can be addressed adequately. ~Andrew
On 12.11.2024 13:55, Andrew Cooper wrote: > On 12/11/2024 10:45 am, Jan Beulich wrote: >> On 07.11.2024 13:21, Andrew Cooper wrote: >>> Fold microcode_update_cpu() into its single remaining caller and simplify the >>> logic by removing the patch != NULL path with microcode_mutex held. >>> >>> Explain why we bother grabbing the microcode revision even if we can't load >>> microcode. >>> >>> Furthermore, delete the -EIO path. An error updating microcode on AP boot or >>> S3 resume is certainly bad, but freeing the cache is about the worst possible >>> action we can take in response; it prevents subsequent APs from taking an >>> update they might have accepted. >> I'm afraid I disagree here, but I also disagree with the present error handling. >> -EIO indicates the patch didn't apply. Why would there be any hope that any >> other CPU would accept it? > > -EIO is "something went wrong". > > On modern systems this can include "checksum didn't match because > there's a bad SRAM cell". This is literally one of the failures leading > to the introduction of In-Field-Scan. > > Individual cores really can fail in a way which won't be the same > elsewhere in the system. Hmm, well, slightly hesitantly Acked-by: Jan Beulich <jbeulich@suse.com> Ideally with a remark added to the description that there is known room for further improvement. >> Keeping what's cached might be an option, but then followed by cleaning the >> cache unless at least one CPU actually accepted the ucode. > > We already have that behaviour. > > > We cache speculatively on boot, even if the BSP doesn't need to load, > because APs might need to. This really is the best we can do. That's a different scenario. If we ended up with ucode which no single CPU accepts, there's hardly much point in caching that ucode. This specifically is meant not to include the case where simply all CPUs are already up-to-date. The one largely theoretical case where caching may still make sense is for CPU hotplug, where the hot-plugged CPU(s) may accept what all boot-time CPUs refused. Jan
diff --git a/xen/arch/x86/cpu/microcode/core.c b/xen/arch/x86/cpu/microcode/core.c index d9406ec3fd34..fd4b08b45388 100644 --- a/xen/arch/x86/cpu/microcode/core.c +++ b/xen/arch/x86/cpu/microcode/core.c @@ -263,40 +263,6 @@ static bool cf_check wait_cpu_callout(unsigned int nr) return atomic_read(&cpu_out) >= nr; } -/* - * Load a microcode update to current CPU. - * - * If no patch is provided, the cached patch will be loaded. Microcode update - * during APs bringup and CPU resuming falls into this case. - */ -static int microcode_update_cpu(const struct microcode_patch *patch, - unsigned int flags) -{ - int err; - - alternative_vcall(ucode_ops.collect_cpu_info); - - spin_lock(µcode_mutex); - if ( patch ) - err = alternative_call(ucode_ops.apply_microcode, patch, flags); - else if ( microcode_cache ) - { - err = alternative_call(ucode_ops.apply_microcode, microcode_cache, - flags); - if ( err == -EIO ) - { - microcode_free_patch(microcode_cache); - microcode_cache = NULL; - } - } - else - /* No patch to update */ - err = -ENOENT; - spin_unlock(µcode_mutex); - - return err; -} - static bool wait_for_state(typeof(loading_state) state) { typeof(loading_state) cur_state; @@ -702,13 +668,26 @@ int microcode_update(XEN_GUEST_HANDLE(const_void) buf, /* Load a cached update to current cpu */ int microcode_update_one(void) { + int rc; + + /* + * This path is used for APs and S3 resume. Read the microcode revision + * if possible, even if we can't load microcode. + */ if ( ucode_ops.collect_cpu_info ) alternative_vcall(ucode_ops.collect_cpu_info); if ( !ucode_ops.apply_microcode ) return -EOPNOTSUPP; - return microcode_update_cpu(NULL, 0); + spin_lock(µcode_mutex); + if ( microcode_cache ) + rc = alternative_call(ucode_ops.apply_microcode, microcode_cache, 0); + else + rc = -ENOENT; + spin_unlock(µcode_mutex); + + return rc; } /*
Fold microcode_update_cpu() into its single remaining caller and simplify the logic by removing the patch != NULL path with microcode_mutex held. Explain why we bother grabbing the microcode revision even if we can't load microcode. Furthermore, delete the -EIO path. An error updating microcode on AP boot or S3 resume is certainly bad, but freeing the cache is about the worst possible action we can take in response; it prevents subsequent APs from taking an update they might have accepted. This avoids a double collect_cpu_info() call on each AP. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> --- CC: Jan Beulich <JBeulich@suse.com> CC: Roger Pau Monné <roger.pau@citrix.com> --- xen/arch/x86/cpu/microcode/core.c | 49 +++++++++---------------------- 1 file changed, 14 insertions(+), 35 deletions(-)