Message ID | 20241001123807.605-2-alejandro.vallejo@cloud.com (mailing list archive) |
---|---|
State | Superseded |
Series | x86: Expose consistent topology to guests |
On 01.10.2024 14:37, Alejandro Vallejo wrote:
> --- a/xen/lib/x86/policy.c
> +++ b/xen/lib/x86/policy.c
> @@ -15,7 +15,16 @@ int x86_cpu_policies_are_compatible(const struct cpu_policy *host,
>  #define FAIL_MSR(m) \
>      do { e.msr = (m); goto out; } while ( 0 )
>
> -    if ( guest->basic.max_leaf > host->basic.max_leaf )
> +    /*
> +     * Old AMD hardware doesn't expose topology information in leaf 0xb. We
> +     * want to emulate that leaf with credible information because it must be
> +     * present on systems in which we emulate the x2APIC.
> +     *
> +     * For that reason, allow the max basic guest leaf to be larger than the
> +     * hosts' up until 0xb.
> +     */
> +    if ( guest->basic.max_leaf > 0xb &&
> +         guest->basic.max_leaf > host->basic.max_leaf )
>          FAIL_CPUID(0, NA);
>
>      if ( guest->feat.max_subleaf > host->feat.max_subleaf )

I'm concerned by this in multiple ways:

1) It's pretty ad hoc, and hence doesn't make clear how to deal with similar
   situations in the future.

2) Why would we permit going up to leaf 0xb when x2APIC is off in the
   respective leaf?

3) We similarly force a higher extended leaf in order to accommodate the
   LFENCE-is-dispatch-serializing bit. Yet there's no similar extra logic
   there in the function here.

4) While there the guest vs host check won't matter, the situation with AMX
   and AVX10 leaves imo still wants considering here right away. IOW (taken
   together with at least 3) above) I think we need to first settle on a model
   for collectively all max (sub)leaf handling. That in particular needs to
   properly spell out who's responsible for what (tool stack vs Xen).

Jan
Hi,

On Wed Oct 9, 2024 at 10:40 AM BST, Jan Beulich wrote:
> On 01.10.2024 14:37, Alejandro Vallejo wrote:
> > --- a/xen/lib/x86/policy.c
> > +++ b/xen/lib/x86/policy.c
> > @@ -15,7 +15,16 @@ int x86_cpu_policies_are_compatible(const struct cpu_policy *host,
> >  #define FAIL_MSR(m) \
> >      do { e.msr = (m); goto out; } while ( 0 )
> >
> > -    if ( guest->basic.max_leaf > host->basic.max_leaf )
> > +    /*
> > +     * Old AMD hardware doesn't expose topology information in leaf 0xb. We
> > +     * want to emulate that leaf with credible information because it must be
> > +     * present on systems in which we emulate the x2APIC.
> > +     *
> > +     * For that reason, allow the max basic guest leaf to be larger than the
> > +     * hosts' up until 0xb.
> > +     */
> > +    if ( guest->basic.max_leaf > 0xb &&
> > +         guest->basic.max_leaf > host->basic.max_leaf )
> >          FAIL_CPUID(0, NA);
> >
> >      if ( guest->feat.max_subleaf > host->feat.max_subleaf )
>
> I'm concerned by this in multiple ways:
>
> 1) It's pretty ad hoc, and hence doesn't make clear how to deal with similar
>    situations in the future.

I agree. I don't have a principled suggestion for how to deal with other cases
where we might have to bump the max leaf. It may be safe (as it is here, because
everything below it is either used or unimplemented), but AFAIU some leaves
might be problematic to expose, even as zeroes. I suspect that's the problem
you hint at later on about AMX and AVX10?

>
> 2) Why would we permit going up to leaf 0xb when x2APIC is off in the
>    respective leaf?

I assume you mean when the x2APIC is not emulated? One reason is to avoid a
migration barrier, as otherwise we can't migrate VMs created on "leaf
0xb"-capable hardware to non-"leaf 0xb"-capable hardware even though the
migration is perfectly safe.

Also, it's benign and simplifies everything. Otherwise we have to find out
during early creation not only whether the host has leaf 0xb, but also whether
we're emulating an x2APIC or not.

Furthermore, not doing this would actively prevent emulating an x2APIC on AMD
Lisbon-like silicon even though it's fine to do so. Note that we have a broken
invariant in existing code where the x2APIC is emulated and leaf 0xb is not
exposed at all; not even to show the x2APIC IDs.

>
> 3) We similarly force a higher extended leaf in order to accommodate the
>    LFENCE-is-dispatch-serializing bit. Yet there's no similar extra logic
>    there in the function here.

That's done on the host policy though, so there's no clash. In
calculate_host_policy()...

```
    /*
     * For AMD/Hygon hardware before Zen3, we unilaterally modify LFENCE to be
     * dispatch serialising for Spectre mitigations. Extend max_extd_leaf
     * beyond what hardware supports, to include the feature leaf containing
     * this information.
     */
    if ( cpu_has_lfence_dispatch )
        max_extd_leaf = max(max_extd_leaf, 0x80000021U);
```

One could imagine doing the same for leaf 0xb and dropping this patch, but then
we'd have to synthesise something on that leaf for hardware that doesn't have
it, which is a lot more annoying.

>
> 4) While there the guest vs host check won't matter, the situation with AMX
>    and AVX10 leaves imo still wants considering here right away. IOW (taken
>    together with at least 3) above) I think we need to first settle on a model
>    for collectively all max (sub)leaf handling. That in particular needs to
>    properly spell out who's responsible for what (tool stack vs Xen).

I'm not sure I follow. What's the situation with AMX and AVX10 that you refer
to? I'd assume that making ad-hoc decisions on this is pretty much unavoidable,
but maybe the solution to the problem you mention would highlight a more
general approach.

>
> Jan

Cheers,
Alejandro
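To make "synthesise something on that leaf" more concrete, here is a minimal sketch of how leaf 0xb contents could be derived from a guest topology rather than from hardware. It is not taken from this series: `struct cpuid_leaf`, `order_of()` and `synth_leaf_b()` are hypothetical names, and a two-level (SMT, core) topology is assumed. The register layout follows the architectural definition of the leaf: EAX[4:0] is the x2APIC ID shift to the next level, EBX[15:0] the number of logical processors at this level, ECX[7:0] the subleaf number, ECX[15:8] the level type (1 = SMT, 2 = core) and EDX the x2APIC ID.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for one CPUID result (EAX, EBX, ECX, EDX). */
struct cpuid_leaf { uint32_t a, b, c, d; };

/* Smallest number of bits able to number n distinct items (n >= 1). */
static unsigned int order_of(unsigned int n)
{
    unsigned int bits = 0;

    assert(n >= 1);
    while ( (1U << bits) < n )
        bits++;

    return bits;
}

/*
 * Hypothetical helper: build leaf 0xb for one vCPU from the guest topology.
 * Only the SMT and core levels are modelled; any other subleaf reports an
 * invalid level (type 0, EAX = EBX = 0).
 */
static struct cpuid_leaf synth_leaf_b(unsigned int subleaf,
                                      unsigned int threads_per_core,
                                      unsigned int cores_per_pkg,
                                      uint32_t x2apic_id)
{
    /* ECX[7:0] echoes the subleaf, EDX always carries the x2APIC ID. */
    struct cpuid_leaf l = { .c = subleaf & 0xff, .d = x2apic_id };
    unsigned int smt_shift = order_of(threads_per_core);
    unsigned int core_shift = smt_shift + order_of(cores_per_pkg);

    switch ( subleaf )
    {
    case 0: /* SMT level */
        l.a = smt_shift;
        l.b = threads_per_core;
        l.c |= 1U << 8;
        break;

    case 1: /* Core level */
        l.a = core_shift;
        l.b = threads_per_core * cores_per_pkg;
        l.c |= 2U << 8;
        break;

    default: /* Invalid level */
        break;
    }

    return l;
}
```

With 2 threads per core and 4 cores per package, subleaf 0 reports a shift of 1 and subleaf 1 a shift of 3, so the package ID starts at bit 3 of the x2APIC ID.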
On 01/10/2024 1:37 pm, Alejandro Vallejo wrote:
> diff --git a/xen/lib/x86/policy.c b/xen/lib/x86/policy.c
> index f033d22785be..63bc96451d2c 100644
> --- a/xen/lib/x86/policy.c
> +++ b/xen/lib/x86/policy.c
> @@ -15,7 +15,16 @@ int x86_cpu_policies_are_compatible(const struct cpu_policy *host,
>  #define FAIL_MSR(m) \
>      do { e.msr = (m); goto out; } while ( 0 )
>
> -    if ( guest->basic.max_leaf > host->basic.max_leaf )
> +    /*
> +     * Old AMD hardware doesn't expose topology information in leaf 0xb. We
> +     * want to emulate that leaf with credible information because it must be
> +     * present on systems in which we emulate the x2APIC.
> +     *
> +     * For that reason, allow the max basic guest leaf to be larger than the
> +     * hosts' up until 0xb.
> +     */
> +    if ( guest->basic.max_leaf > 0xb &&
> +         guest->basic.max_leaf > host->basic.max_leaf )
>          FAIL_CPUID(0, NA);
>
>      if ( guest->feat.max_subleaf > host->feat.max_subleaf )

This isn't appropriate. You're violating the property that a guest policy
should be bounded by the max (called host here).

Instead, the max policy's basic.max_leaf should be at least 0xb. This is
another case where the max wants to be higher, with default inheriting from
host.

But it occurs to me that with max being a not-sane-in-isolation policy anyway,
we probably want to end up setting max->$foo.max_leaf to the appropriate
ARRAY_SIZE()'s.

For any leaves within the bounds that Xen knows, it's fine for the toolstack to
set values there even beyond host max leaf, as long as it doesn't violate other
consistency checks.

So, please fix this by adjusting calculate_{pv,hvm}_{max,default}_policy() in
Xen.

Sadly I think it will end up a tad ugly, but don't worry too much. We've got a
number of host==default but max special fields, and I'm wanting to get a few
more examples before choosing how best to rationalise it to reduce opencoding.

~Andrew
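As a rough illustration of the model Andrew describes, the sketch below keeps the compatibility check strict (guest bounded by max) and instead raises the max policy's max_leaf to cover every leaf the hypervisor has storage for, while the default policy keeps inheriting the host value. Everything here is a simplified assumption: `struct mini_policy`, the array size of 0xe and the helper names are hypothetical stand-ins, not Xen's `struct cpu_policy` or its `calculate_*_policy()` functions, and "ARRAY_SIZE()" is read as "highest index the array can hold".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

struct leaf { uint32_t a, b, c, d; };

/* Hypothetical, cut-down policy: only the basic leaves are modelled. */
struct mini_policy {
    uint32_t max_leaf;
    struct leaf basic[0xe];  /* assumed number of basic leaves Xen knows */
};

/* Max policy: start from host, then allow every leaf we have storage for. */
static void calc_max_policy(struct mini_policy *max,
                            const struct mini_policy *host)
{
    assert(host->max_leaf < ARRAY_SIZE(host->basic));

    *max = *host;
    max->max_leaf = ARRAY_SIZE(max->basic) - 1;
}

/* Default policy: inherits the host's max_leaf unchanged. */
static void calc_default_policy(struct mini_policy *def,
                                const struct mini_policy *host)
{
    *def = *host;
}

/* The compatibility check stays strict: a guest is bounded by max. */
static bool guest_within_max(const struct mini_policy *max,
                             const struct mini_policy *guest)
{
    return guest->max_leaf <= max->max_leaf;
}
```

Under these assumptions a host reporting a max basic leaf of 0x6 (an illustrative value) still yields a max policy that accepts a guest asking for leaf 0xb, without special-casing 0xb in the check itself.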
On 09.10.2024 17:57, Alejandro Vallejo wrote:
> On Wed Oct 9, 2024 at 10:40 AM BST, Jan Beulich wrote:
>> On 01.10.2024 14:37, Alejandro Vallejo wrote:
>>> --- a/xen/lib/x86/policy.c
>>> +++ b/xen/lib/x86/policy.c
>>> @@ -15,7 +15,16 @@ int x86_cpu_policies_are_compatible(const struct cpu_policy *host,
>>>  #define FAIL_MSR(m) \
>>>      do { e.msr = (m); goto out; } while ( 0 )
>>>
>>> -    if ( guest->basic.max_leaf > host->basic.max_leaf )
>>> +    /*
>>> +     * Old AMD hardware doesn't expose topology information in leaf 0xb. We
>>> +     * want to emulate that leaf with credible information because it must be
>>> +     * present on systems in which we emulate the x2APIC.
>>> +     *
>>> +     * For that reason, allow the max basic guest leaf to be larger than the
>>> +     * hosts' up until 0xb.
>>> +     */
>>> +    if ( guest->basic.max_leaf > 0xb &&
>>> +         guest->basic.max_leaf > host->basic.max_leaf )
>>>          FAIL_CPUID(0, NA);
>>>
>>>      if ( guest->feat.max_subleaf > host->feat.max_subleaf )
>>
>> I'm concerned by this in multiple ways:
>>
>> 1) It's pretty ad hoc, and hence doesn't make clear how to deal with similar
>>    situations in the future.
>
> I agree. I don't have a principled suggestion for how to deal with other cases
> where we might have to bump the max leaf. It may be safe (as it is here, because
> everything below it is either used or unimplemented), but AFAIU some leaves
> might be problematic to expose, even as zeroes. I suspect that's the problem
> you hint at later on about AMX and AVX10?

Not exactly, but perhaps somewhat related (see below).

>> 2) Why would we permit going up to leaf 0xb when x2APIC is off in the
>>    respective leaf?
>
> I assume you mean when the x2APIC is not emulated? One reason is to avoid a
> migration barrier, as otherwise we can't migrate VMs created on "leaf
> 0xb"-capable hardware to non-"leaf 0xb"-capable hardware even though the
> migration is perfectly safe.

Leaf 0xb ought to be synthesized anyway (to match the guest's topology);
hardware capabilities hence don't matter here.

> Also, it's benign and simplifies everything. Otherwise we have to find out
> during early creation not only whether the host has leaf 0xb, but also whether
> we're emulating an x2APIC or not.

The policy passed by the tool stack will tell you what the choice there was.

> Furthermore, not doing this would actively prevent emulating an x2APIC on AMD
> Lisbon-like silicon even though it's fine to do so.

I'm afraid I don't understand this. If the tool stack cleared the x2APIC bit,
x2APIC ought to not be emulated. If it sets it (as permitted by the max
policy), x2APIC would be emulated.

> Note that we have a broken
> invariant in existing code where the x2APIC is emulated and leaf 0xb is not
> exposed at all; not even to show the x2APIC IDs.

Well, fixing this is what this series is about, isn't it?

>> 3) We similarly force a higher extended leaf in order to accommodate the
>>    LFENCE-is-dispatch-serializing bit. Yet there's no similar extra logic
>>    there in the function here.
>
> That's done on the host policy though, so there's no clash.

There's no clash, sure, but ...

> In calculate_host_policy()...
>
> ```
>     /*
>      * For AMD/Hygon hardware before Zen3, we unilaterally modify LFENCE to be
>      * dispatch serialising for Spectre mitigations. Extend max_extd_leaf
>      * beyond what hardware supports, to include the feature leaf containing
>      * this information.
>      */
>     if ( cpu_has_lfence_dispatch )
>         max_extd_leaf = max(max_extd_leaf, 0x80000021U);
> ```
>
> One could imagine doing the same for leaf 0xb and dropping this patch, but then
> we'd have to synthesise something on that leaf for hardware that doesn't have
> it, which is a lot more annoying.

... we're doing things one way there and another way here. Which is generally
undesirable imo.

>> 4) While there the guest vs host check won't matter, the situation with AMX
>>    and AVX10 leaves imo still wants considering here right away. IOW (taken
>>    together with at least 3) above) I think we need to first settle on a model
>>    for collectively all max (sub)leaf handling. That in particular needs to
>>    properly spell out who's responsible for what (tool stack vs Xen).
>
> I'm not sure I follow. What's the situation with AMX and AVX10 that you refer
> to?

See the prereq series to both, most recently posted at
https://lists.xen.org/archives/html/xen-devel/2024-08/msg00591.html
That's hacky; Andrew has indicated that he'd like to take care of this (mostly)
in the tool stack instead. Yet so far nothing has surfaced, hence I'm keeping
this dependency for both series.

Jan

> I'd assume that making ad-hoc decisions on this is pretty much unavoidable,
> but maybe the solution to the problem you mention would highlight a more
> general approach.
>
> Cheers,
> Alejandro
diff --git a/tools/tests/cpu-policy/test-cpu-policy.c b/tools/tests/cpu-policy/test-cpu-policy.c
index 301df2c00285..9216010b1c5d 100644
--- a/tools/tests/cpu-policy/test-cpu-policy.c
+++ b/tools/tests/cpu-policy/test-cpu-policy.c
@@ -586,6 +586,10 @@ static void test_is_compatible_success(void)
                 .platform_info.cpuid_faulting = true,
             },
         },
+        {
+            .name = "Host missing leaf 0xb, Guest wanted",
+            .guest.basic.max_leaf = 0xb,
+        },
     };
     struct cpu_policy_errors no_errors = INIT_CPU_POLICY_ERRORS;
@@ -614,7 +618,7 @@ static void test_is_compatible_failure(void)
     } tests[] = {
         {
             .name = "Host basic.max_leaf out of range",
-            .guest.basic.max_leaf = 1,
+            .guest.basic.max_leaf = 0xc,
             .e = { 0, -1, -1 },
         },
         {
diff --git a/xen/lib/x86/policy.c b/xen/lib/x86/policy.c
index f033d22785be..63bc96451d2c 100644
--- a/xen/lib/x86/policy.c
+++ b/xen/lib/x86/policy.c
@@ -15,7 +15,16 @@ int x86_cpu_policies_are_compatible(const struct cpu_policy *host,
 #define FAIL_MSR(m) \
     do { e.msr = (m); goto out; } while ( 0 )

-    if ( guest->basic.max_leaf > host->basic.max_leaf )
+    /*
+     * Old AMD hardware doesn't expose topology information in leaf 0xb. We
+     * want to emulate that leaf with credible information because it must be
+     * present on systems in which we emulate the x2APIC.
+     *
+     * For that reason, allow the max basic guest leaf to be larger than the
+     * hosts' up until 0xb.
+     */
+    if ( guest->basic.max_leaf > 0xb &&
+         guest->basic.max_leaf > host->basic.max_leaf )
         FAIL_CPUID(0, NA);

     if ( guest->feat.max_subleaf > host->feat.max_subleaf )
Allow a guest policy to have up to leaf 0xb even if the host doesn't. Otherwise
it's not possible to show leaf 0xb to guests we're emulating an x2APIC for on
old AMD machines.

No externally visible changes though because toolstack doesn't yet populate
that leaf.

Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
---
 tools/tests/cpu-policy/test-cpu-policy.c |  6 +++++-
 xen/lib/x86/policy.c                     | 11 ++++++++++-
 2 files changed, 15 insertions(+), 2 deletions(-)
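For readers who want to poke at the relaxed check in isolation, the following is a minimal standalone sketch of the condition this patch introduces. `struct mini_policy` and `max_leaf_compatible()` are hypothetical, cut-down stand-ins for `struct cpu_policy` and the relevant part of `x86_cpu_policies_are_compatible()`; the real function reports failures through `FAIL_CPUID()` rather than a boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-in for Xen's struct cpu_policy. */
struct mini_policy {
    uint32_t basic_max_leaf;  /* mirrors basic.max_leaf */
};

#define TOPO_LEAF 0xbU

/* Returns true when the guest's max basic leaf is acceptable on this host. */
static bool max_leaf_compatible(const struct mini_policy *host,
                                const struct mini_policy *guest)
{
    assert(host && guest);

    /* Leaves up to 0xb are always allowed; beyond that the host bounds. */
    if ( guest->basic_max_leaf > TOPO_LEAF &&
         guest->basic_max_leaf > host->basic_max_leaf )
        return false;

    return true;
}
```

This mirrors the two test cases touched by the patch: with a zero-initialised host policy, a guest max leaf of 0xb is accepted, while 0xc is still rejected.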