diff mbox series

[25/25] KVM: x86: Add CPUID bits missing from KVM_GET_SUPPORTED_CPUID

Message ID 20240812224820.34826-26-rick.p.edgecombe@intel.com (mailing list archive)
State New, archived
Headers show
Series TDX vCPU/VM creation | expand

Commit Message

Rick Edgecombe Aug. 12, 2024, 10:48 p.m. UTC
Originally, the plan was to filter the directly configurable CPUID bits
exposed by KVM_TDX_CAPABILITIES, and the final configured bit values
provided by KVM_TDX_GET_CPUID. However, several issues were found with
this. Both the filtering done with KVM_TDX_CAPABILITIES and
KVM_TDX_GET_CPUID had the issue that the get_supported_cpuid() provided
default values instead of supported masks for multi-bit fields (i.e. those
encoding a multi-bit number).

For KVM_TDX_CAPABILITIES, there was also the problem of bits that are
actually supported by KVM, but missing from get_supported_cpuid() for one
reason or another. These include X86_FEATURE_MWAIT, X86_FEATURE_HT and
X86_FEATURE_TSC_DEADLINE_TIMER. This is currently worked around in QEMU by
adjusting which features are expected. Some of these are going to be added
to get_supported_cpuid(), and that is probably the right long term fix.

For KVM_TDX_GET_CPUID, there is another problem. Some CPUID bits are fixed
on by the TDX module, but unsupported by KVM. This means that the TD will
have them set, but KVM and userspace won't know about them. This class of
bits is dealt with by having QEMU expect not to see them. The bits include:
X86_FEATURE_HYPERVISOR. The proper fix for this specifically is probably to
change KVM to show it as supported (currently a patch exists). But this
scenario could be expected in the end of TDX module ever setting and
default 1, or fixed 1 bits. It would be good to have discussion on whether
KVM community should mandate that this doesn't happen.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
uAPI breakout v1:
 - New patch
---
 arch/x86/kvm/vmx/tdx.c | 96 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 95 insertions(+), 1 deletion(-)

Comments

Chao Gao Aug. 13, 2024, 11:34 a.m. UTC | #1
On Mon, Aug 12, 2024 at 03:48:20PM -0700, Rick Edgecombe wrote:
>Originally, the plan was to filter the directly configurable CPUID bits
>exposed by KVM_TDX_CAPABILITIES, and the final configured bit values
>provided by KVM_TDX_GET_CPUID. However, several issues were found with
>this. Both the filtering done with KVM_TDX_CAPABILITIES and
>KVM_TDX_GET_CPUID had the issue that the get_supported_cpuid() provided
>default values instead of supported masks for multi-bit fields (i.e. those
>encoding a multi-bit number).
>
>For KVM_TDX_CAPABILITIES, there was also the problem of bits that are
>actually supported by KVM, but missing from get_supported_cpuid() for one
>reason or another. These include X86_FEATURE_MWAIT, X86_FEATURE_HT and
>X86_FEATURE_TSC_DEADLINE_TIMER. This is currently worked around in QEMU by
>adjusting which features are expected. Some of these are going to be added
>to get_supported_cpuid(), and that is probably the right long term fix.
>
>For KVM_TDX_GET_CPUID, there is another problem. Some CPUID bits are fixed
>on by the TDX module, but unsupported by KVM. This means that the TD will
>have them set, but KVM and userspace won't know about them. This class of

What's the problem of having KVM and userspace see some unsupported bits set?

>bits is dealt with by having QEMU expect not to see them. The bits include:
>X86_FEATURE_HYPERVISOR. The proper fix for this specifically is probably to
>change KVM to show it as supported (currently a patch exists). But this
>scenario could be expected in the end of TDX module ever setting and
>default 1, or fixed 1 bits. It would be good to have discussion on whether
>KVM community should mandate that this doesn't happen.

Just my two cents:

Mandating that all fixed-1 bits be supported by KVM would be a burden for both
KVM and the TDX module: the TDX module couldn't add any fixed-1 bits until KVM
supports them, and KVM shouldn't drop any feature that was ever a fixed-1 bit
in any TDX module. I don't think this is a good idea. TDX module support for a
feature will likely be ready earlier than KVM's, as TDX module is smaller and
is developed inside Intel. Requiring the TDX module to avoid adding fixed-1
bits doesn't make much sense, as making all features configurable would
increase its complexity.

I think adding new fixed-1 bits is fine as long as they don't break KVM, i.e.,
KVM shouldn't need to take any action for the new fixed-1 bits, like
saving/restoring more host CPU states across TD-enter/exit or emulating
CPUID/MSR accesses from guests
Xiaoyao Li Aug. 13, 2024, 3:14 p.m. UTC | #2
On 8/13/2024 7:34 PM, Chao Gao wrote:
> I think adding new fixed-1 bits is fine as long as they don't break KVM, i.e.,
> KVM shouldn't need to take any action for the new fixed-1 bits, like
> saving/restoring more host CPU states across TD-enter/exit or emulating
> CPUID/MSR accesses from guests

I disagree. Adding new fixed-1 bits in a newer TDX module can lead to a 
different TD with same cpu model.

People may argue that for the new features that have no vmcs control bit 
(usually the new instruction) face the similar issue. Booting a VM with 
same cpu model on a new platform with such new feature leads to the VM 
actually can use the new feature.

However, for the perspective of CPUID, VMM at least can make sure it 
unchanged, though guest can access the feature even when guest CPUID 
tells no such feature. This is virtualization hole. no one like it.
Rick Edgecombe Aug. 13, 2024, 6:45 p.m. UTC | #3
On Tue, 2024-08-13 at 19:34 +0800, Chao Gao wrote:
> Mandating that all fixed-1 bits be supported by KVM would be a burden for both
> KVM and the TDX module: the TDX module couldn't add any fixed-1 bits until KVM
> supports them, and 

> KVM shouldn't drop any feature that was ever a fixed-1 bit
> in any TDX module.

Honest question...can/does this happen for normal VMs? KVM dropping support for
features? I think I recall even MPX getting limped along for backward
compatibility reasons.

>  I don't think this is a good idea. TDX module support for a
> feature will likely be ready earlier than KVM's, as TDX module is smaller and
> is developed inside Intel. Requiring the TDX module to avoid adding fixed-1
> bits doesn't make much sense, as making all features configurable would
> increase its complexity.
> 
> I think adding new fixed-1 bits is fine as long as they don't break KVM, i.e.,
> KVM shouldn't need to take any action for the new fixed-1 bits, like
> saving/restoring more host CPU states across TD-enter/exit or emulating
> CPUID/MSR accesses from guests

If these would only be simple features, then I'd wonder how much complexity
making them configurable would really add to the TDX module.

I think there are more concerns than just TDX module breaking KVM. (my 2 cents
would be that it should just be considered a TDX module bug) But KVM should also
want to avoid getting boxed into some ABI. For example a a new userspace
developed against a new TDX module, but old KVM could start using some new
feature that KVM would want to handle differently. As you point out KVM
implementation could happen later, at which point userspace could already expect
a certain behavior. Then KVM would have to have some other opt in for it's
preferred behavior.

Now, that is comparing *sometimes* KVM needing to have an opt-in, with TDX
module *always* needing an opt-in. But I don't see how never having fixed bits
is more complex for KVM.
Chao Gao Aug. 14, 2024, 12:47 a.m. UTC | #4
On Tue, Aug 13, 2024 at 11:14:31PM +0800, Xiaoyao Li wrote:
>On 8/13/2024 7:34 PM, Chao Gao wrote:
>> I think adding new fixed-1 bits is fine as long as they don't break KVM, i.e.,
>> KVM shouldn't need to take any action for the new fixed-1 bits, like
>> saving/restoring more host CPU states across TD-enter/exit or emulating
>> CPUID/MSR accesses from guests
>
>I disagree. Adding new fixed-1 bits in a newer TDX module can lead to a
>different TD with same cpu model.

The new TDX module simply doesn't support old CPU models. QEMU can report an
error and define a new CPU model that works with the TDX module. Sometimes,
CPUs may drop features; this may cause KVM to not support some features and
in turn some old CPU models having those features cannot be supported. is it a
requirement for TDX modules alone that old CPU models must always be supported?

>
>People may argue that for the new features that have no vmcs control bit
>(usually the new instruction) face the similar issue. Booting a VM with same
>cpu model on a new platform with such new feature leads to the VM actually
>can use the new feature.
>
>However, for the perspective of CPUID, VMM at least can make sure it
>unchanged, though guest can access the feature even when guest CPUID tells no
>such feature. This is virtualization hole. no one like it.
Sean Christopherson Aug. 14, 2024, 1:10 a.m. UTC | #5
On Tue, Aug 13, 2024, Rick P Edgecombe wrote:
> On Tue, 2024-08-13 at 19:34 +0800, Chao Gao wrote:
> > Mandating that all fixed-1 bits be supported by KVM would be a burden for both
> > KVM and the TDX module: the TDX module couldn't add any fixed-1 bits until KVM
> > supports them, and 
> 
> > KVM shouldn't drop any feature that was ever a fixed-1 bit
> > in any TDX module.
> 
> Honest question...can/does this happen for normal VMs? KVM dropping support for
> features?

Almost never.  KVM still supports Intel CPUs without virtual NMI support, which
IIRC was something like one SKU of Yonah that was 32-bit only.  Keeping backwards
compability is annoying from time to time, but it's generally not that much of a
maintenance burden.  The only CPUs I really wish had never existed are those that
have EPT without A/D bits.  Other than that, maintaining support for old CPUs
doesn't hinder us too much.

> I think I recall even MPX getting limped along for backward compatibility reasons.

Yep, KVM still supports virtualizing MPX.
Sean Christopherson Aug. 14, 2024, 1:16 a.m. UTC | #6
On Wed, Aug 14, 2024, Chao Gao wrote:
> On Tue, Aug 13, 2024 at 11:14:31PM +0800, Xiaoyao Li wrote:
> >On 8/13/2024 7:34 PM, Chao Gao wrote:
> >> I think adding new fixed-1 bits is fine as long as they don't break KVM, i.e.,
> >> KVM shouldn't need to take any action for the new fixed-1 bits, like
> >> saving/restoring more host CPU states across TD-enter/exit or emulating
> >> CPUID/MSR accesses from guests
> >
> >I disagree. Adding new fixed-1 bits in a newer TDX module can lead to a
> >different TD with same cpu model.
> 
> The new TDX module simply doesn't support old CPU models.

What happens if the new TDX module is needed to fix a security issue?  Or if a
customer wants to support a heterogenous migration pool, and older (physical)
CPUs don't support the feature?  Or if a customer wants to continue hosting
existing VM shapes on newer hardware?

> QEMU can report an error and define a new CPU model that works with the TDX
> module. Sometimes, CPUs may drop features;

Very, very rarely.  And when it does happen, there are years of warning before
the features are dropped.

> this may cause KVM to not support some features and in turn some old CPU
> models having those features cannot be supported.  is it a requirement for
> TDX modules alone that old CPU models must always be supported?

Not a hard requirement, but a pretty firm one.  There needs to be sane, reasonable
behavior, or we're going to have problems.
Chao Gao Aug. 14, 2024, 10:46 a.m. UTC | #7
On Tue, Aug 13, 2024 at 06:16:10PM -0700, Sean Christopherson wrote:
>On Wed, Aug 14, 2024, Chao Gao wrote:
>> On Tue, Aug 13, 2024 at 11:14:31PM +0800, Xiaoyao Li wrote:
>> >On 8/13/2024 7:34 PM, Chao Gao wrote:
>> >> I think adding new fixed-1 bits is fine as long as they don't break KVM, i.e.,
>> >> KVM shouldn't need to take any action for the new fixed-1 bits, like
>> >> saving/restoring more host CPU states across TD-enter/exit or emulating
>> >> CPUID/MSR accesses from guests
>> >
>> >I disagree. Adding new fixed-1 bits in a newer TDX module can lead to a
>> >different TD with same cpu model.
>> 
>> The new TDX module simply doesn't support old CPU models.
>
>What happens if the new TDX module is needed to fix a security issue?  Or if a
>customer wants to support a heterogenous migration pool, and older (physical)
>CPUs don't support the feature?  Or if a customer wants to continue hosting
>existing VM shapes on newer hardware?
>
>> QEMU can report an error and define a new CPU model that works with the TDX
>> module. Sometimes, CPUs may drop features;
>
>Very, very rarely.  And when it does happen, there are years of warning before
>the features are dropped.
>
>> this may cause KVM to not support some features and in turn some old CPU
>> models having those features cannot be supported.  is it a requirement for
>> TDX modules alone that old CPU models must always be supported?
>
>Not a hard requirement, but a pretty firm one.  There needs to be sane, reasonable
>behavior, or we're going to have problems.

OK. So, the expectation is the TDX module should avoid adding new fixed-1 bits.

I suppose this also applies to "native" CPUID bits, which are not configurable
and simply reflected as native values to TDs.

One scenario where "fixed-1" bits can help is: we discover a security issue and
release a microcode update to expose a feature indicating which CPUs are
vulnerable. if the TDX module allows the VMM to configure the feature as 0
(i.e., not vulnerable) on vulnerable CPUs, a TD might incorrectly assume it's
not vulnerable, creating a security issue.

I think in above case, the TDX module has to add a "fixed-1" bit. An example of
such a feature is RRSBA in the IA32_ARCH_CAPABILITIES MSR.
Chao Gao Aug. 14, 2024, 11:36 a.m. UTC | #8
On Wed, Aug 14, 2024 at 02:45:50AM +0800, Edgecombe, Rick P wrote:
>On Tue, 2024-08-13 at 19:34 +0800, Chao Gao wrote:
>> Mandating that all fixed-1 bits be supported by KVM would be a burden for both
>> KVM and the TDX module: the TDX module couldn't add any fixed-1 bits until KVM
>> supports them, and 
>
>> KVM shouldn't drop any feature that was ever a fixed-1 bit
>> in any TDX module.
>
>Honest question...can/does this happen for normal VMs? KVM dropping support for
>features? I think I recall even MPX getting limped along for backward
>compatibility reasons.
>
>>  I don't think this is a good idea. TDX module support for a
>> feature will likely be ready earlier than KVM's, as TDX module is smaller and
>> is developed inside Intel. Requiring the TDX module to avoid adding fixed-1
>> bits doesn't make much sense, as making all features configurable would
>> increase its complexity.
>> 
>> I think adding new fixed-1 bits is fine as long as they don't break KVM, i.e.,
>> KVM shouldn't need to take any action for the new fixed-1 bits, like
>> saving/restoring more host CPU states across TD-enter/exit or emulating
>> CPUID/MSR accesses from guests
>
>If these would only be simple features, then I'd wonder how much complexity
>making them configurable would really add to the TDX module.
>
>I think there are more concerns than just TDX module breaking KVM. (my 2 cents
>would be that it should just be considered a TDX module bug) But KVM should also
>want to avoid getting boxed into some ABI. For example a a new userspace
>developed against a new TDX module, but old KVM could start using some new
>feature that KVM would want to handle differently. As you point out KVM
>implementation could happen later, at which point userspace could already expect
>a certain behavior. Then KVM would have to have some other opt in for it's
>preferred behavior.

I don't fully understand "getting boxed into some ABI". But filtering out
unsupported bits could also cause ABI breakage if those bits later become
supported and are no longer filtered, but userspace may still expect them to be
cleared.

It seems that KVM would have to refuse to work with the TDX module if it
detects some fixed-1/native bits are unsupported/unknown.

But if we do that, IIUC, disabling certain features using the "clearcpuid="
kernel cmdline on the host may cause KVM to be incompatible with the TDX
module. Anyway, this is probably a minor issue.

>
>Now, that is comparing *sometimes* KVM needing to have an opt-in, with TDX
>module *always* needing an opt-in. But I don't see how never having fixed bits
>is more complex for KVM.
Sean Christopherson Aug. 14, 2024, 1:35 p.m. UTC | #9
On Wed, Aug 14, 2024, Chao Gao wrote:
> On Tue, Aug 13, 2024 at 06:16:10PM -0700, Sean Christopherson wrote:
> >On Wed, Aug 14, 2024, Chao Gao wrote:
> >> On Tue, Aug 13, 2024 at 11:14:31PM +0800, Xiaoyao Li wrote:
> >> >On 8/13/2024 7:34 PM, Chao Gao wrote:
> >> >> I think adding new fixed-1 bits is fine as long as they don't break KVM, i.e.,
> >> >> KVM shouldn't need to take any action for the new fixed-1 bits, like
> >> >> saving/restoring more host CPU states across TD-enter/exit or emulating
> >> >> CPUID/MSR accesses from guests
> >> >
> >> >I disagree. Adding new fixed-1 bits in a newer TDX module can lead to a
> >> >different TD with same cpu model.
> >> 
> >> The new TDX module simply doesn't support old CPU models.
> >
> >What happens if the new TDX module is needed to fix a security issue?  Or if a
> >customer wants to support a heterogenous migration pool, and older (physical)
> >CPUs don't support the feature?  Or if a customer wants to continue hosting
> >existing VM shapes on newer hardware?
> >
> >> QEMU can report an error and define a new CPU model that works with the TDX
> >> module. Sometimes, CPUs may drop features;
> >
> >Very, very rarely.  And when it does happen, there are years of warning before
> >the features are dropped.
> >
> >> this may cause KVM to not support some features and in turn some old CPU
> >> models having those features cannot be supported.  is it a requirement for
> >> TDX modules alone that old CPU models must always be supported?
> >
> >Not a hard requirement, but a pretty firm one.  There needs to be sane, reasonable
> >behavior, or we're going to have problems.
> 
> OK. So, the expectation is the TDX module should avoid adding new fixed-1 bits.
> 
> I suppose this also applies to "native" CPUID bits, which are not configurable
> and simply reflected as native values to TDs.

Yes, unless all of Intel's customers are ok with the effective restriction that
the *only* valid vCPU model for a TDX VM is the real underlying CPU model.  To
me, that seems like a poor bet to make.  The cost of allowing feature bits to be
flexible isn't _that_ high, versus the potential cost of forcing customers to
change how they operate and manage VM shapes, CPU/platform upgrades, etc.

Maybe Intel has already had those conversations with product folk and everyone
is ok with the restriction, it just seems like very avoidable pain to me.

> One scenario where "fixed-1" bits can help is: we discover a security issue and
> release a microcode update to expose a feature indicating which CPUs are
> vulnerable. if the TDX module allows the VMM to configure the feature as 0
> (i.e., not vulnerable) on vulnerable CPUs, a TD might incorrectly assume it's
> not vulnerable, creating a security issue.
>
> I think in above case, the TDX module has to add a "fixed-1" bit. An example of
> such a feature is RRSBA in the IA32_ARCH_CAPABILITIES MSR.

That would be fine, I would classify that as reasonable.  However, that scenario
doesn't really work in practice, at least not the way Intel probably hopes it
plays out.  For the new fixed-1 bit to provide value, it would require a guest
reboot and likely a guets kernel upgrade.
Rick Edgecombe Aug. 14, 2024, 5:17 p.m. UTC | #10
On Wed, 2024-08-14 at 19:36 +0800, Chao Gao wrote:
> > I think there are more concerns than just TDX module breaking KVM. (my 2
> > cents
> > would be that it should just be considered a TDX module bug) But KVM should
> > also
> > want to avoid getting boxed into some ABI. For example a a new userspace
> > developed against a new TDX module, but old KVM could start using some new
> > feature that KVM would want to handle differently. As you point out KVM
> > implementation could happen later, at which point userspace could already
> > expect
> > a certain behavior. Then KVM would have to have some other opt in for it's
> > preferred behavior.
> 
> I don't fully understand "getting boxed into some ABI". But filtering out
> unsupported bits could also cause ABI breakage if those bits later become
> supported and are no longer filtered, but userspace may still expect them to
> be
> cleared.

Hmm, any change to the kernel could cause a backwards compatibility issue. But
if KVM doesn't support the bit, I would hope userspace wouldn't have developed
some problematic behavior around it already.

I guess the problem would be if, as is currently implemented in the QEMU
patches, userspace checks for unexpected bits and errors out. This would fail
similarly if the bits were not filtered by KVM and TDX module was updated to a
version with new fixed bits, so it kind of leads you down the road to "fixed
bits are a problem", doesn't it?

> 
> It seems that KVM would have to refuse to work with the TDX module if it
> detects some fixed-1/native bits are unsupported/unknown.

I don't think there is really a way to detect this, without encoding a bunch of
CPUID rules into KVM.

> 
> But if we do that, IIUC, disabling certain features using the "clearcpuid="
> kernel cmdline on the host may cause KVM to be incompatible with the TDX
> module. Anyway, this is probably a minor issue.

True, I would think that would be out of scope for backwards compatibility
though.
Rick Edgecombe Aug. 14, 2024, 5:35 p.m. UTC | #11
On Wed, 2024-08-14 at 06:35 -0700, Sean Christopherson wrote:
> > One scenario where "fixed-1" bits can help is: we discover a security issue
> > and
> > release a microcode update to expose a feature indicating which CPUs are
> > vulnerable. if the TDX module allows the VMM to configure the feature as 0
> > (i.e., not vulnerable) on vulnerable CPUs, a TD might incorrectly assume
> > it's
> > not vulnerable, creating a security issue.
> > 
> > I think in above case, the TDX module has to add a "fixed-1" bit. An example
> > of
> > such a feature is RRSBA in the IA32_ARCH_CAPABILITIES MSR.
> 
> That would be fine, I would classify that as reasonable.  However, that
> scenario
> doesn't really work in practice, at least not the way Intel probably hopes it
> plays out.  For the new fixed-1 bit to provide value, it would require a guest
> reboot and likely a guets kernel upgrade.

If we allow "reasonable" fixed bits, we need to decide how to handle any that
KVM sees but doesn't know about. Not filtering them is simpler to implement.
Filtering them seems a little more controlled to me.

It might depend on how reasonable, "reasonable" turns out. Maybe we give not
filtering a try and see how it goes. If we run into a problem, we can filter new
bits from that point, and add a quirk for whatever the issue is. I'm still on
the fence.
Sean Christopherson Aug. 14, 2024, 9:22 p.m. UTC | #12
On Wed, Aug 14, 2024, Rick P Edgecombe wrote:
> On Wed, 2024-08-14 at 06:35 -0700, Sean Christopherson wrote:
> > > One scenario where "fixed-1" bits can help is: we discover a security issue
> > > and
> > > release a microcode update to expose a feature indicating which CPUs are
> > > vulnerable. if the TDX module allows the VMM to configure the feature as 0
> > > (i.e., not vulnerable) on vulnerable CPUs, a TD might incorrectly assume
> > > it's
> > > not vulnerable, creating a security issue.
> > > 
> > > I think in above case, the TDX module has to add a "fixed-1" bit. An example
> > > of
> > > such a feature is RRSBA in the IA32_ARCH_CAPABILITIES MSR.
> > 
> > That would be fine, I would classify that as reasonable.  However, that
> > scenario
> > doesn't really work in practice, at least not the way Intel probably hopes it
> > plays out.  For the new fixed-1 bit to provide value, it would require a guest
> > reboot and likely a guets kernel upgrade.
> 
> If we allow "reasonable" fixed bits, we need to decide how to handle any that
> KVM sees but doesn't know about. Not filtering them is simpler to implement.
> Filtering them seems a little more controlled to me.
> 
> It might depend on how reasonable, "reasonable" turns out. Maybe we give not
> filtering a try and see how it goes. If we run into a problem, we can filter new
> bits from that point, and add a quirk for whatever the issue is. I'm still on
> the fence.

As I see it, it's ultimately unlikely to be KVM's problem.  If Intel ships a
TDX-Module that does bad things, and someone's setup breaks when they upgrade to
that TDX-Module, then their gripe is with Intel.  KVM can't do anything to remedy
the problem.

If the upgrade breaks a setup because it confuses _KVM_, then I'll care, but I
suspect/hope that won't happen in practice, purely because KVM has so little
visiblity into the guest, i.e. doesn't care what is/isn't advertised to the guest.

FWIW, AMD has effectively gone the "fixed-1" route for a few things[*], e.g. KVM
can't intercept XCR0 or XSS writes.  And while I detest the behavior, I haven't
refused to merge support for SEV-ES+.  I just grumble every time it comes up :-)

[*] https://lore.kernel.org/all/ZUQvNIE9iU5TqJfw@google.com
Paolo Bonzini Sept. 10, 2024, 5:52 p.m. UTC | #13
On 8/13/24 00:48, Rick Edgecombe wrote:
> Originally, the plan was to filter the directly configurable CPUID bits
> exposed by KVM_TDX_CAPABILITIES, and the final configured bit values
> provided by KVM_TDX_GET_CPUID. However, several issues were found with
> this. Both the filtering done with KVM_TDX_CAPABILITIES and
> KVM_TDX_GET_CPUID had the issue that the get_supported_cpuid() provided
> default values instead of supported masks for multi-bit fields (i.e. those
> encoding a multi-bit number).
> 
> For KVM_TDX_CAPABILITIES, there was also the problem of bits that are
> actually supported by KVM, but missing from get_supported_cpuid() for one
> reason or another. These include X86_FEATURE_MWAIT, X86_FEATURE_HT and
> X86_FEATURE_TSC_DEADLINE_TIMER. This is currently worked around in QEMU by
> adjusting which features are expected. Some of these are going to be added
> to get_supported_cpuid(), and that is probably the right long term fix.

There are several cases here:

- MWAIT is hidden because it's hard to virtualize its C-state parameters

- HT is hidden because it depends on the topology, and cannot be added 
blindly.

- TSC_DEADLINE_TIMER is queried with KVM_CHECK_EXTENSION for historical 
reasons

There are basically two kinds of userspace:

- those that fetch KVM_GET_SUPPORED_CPUID and pass it blindly to 
KVM_SET_CPUID2.  These mostly work, though they may miss a feature or 
three (e.g. the TSC deadline timer).

- those that know each bit and make an informed decision on what to 
enable; for those, KVM_GET_SUPPORTED_CPUID is just guidance.

Because of this, KVM_GET_SUPPORTED_CPUID doesn't return bits that are 
one; it returns a mix of:

- maximum supported values (e.g. CPUID[7,0].EAX)

- values from the host (e.g. FMS or model name)

- supported features

It's an awfully defined API but it is easier to use than it sounds (some 
of the quirks are being documented in 
Documentation/virt/kvm/x86/errata.rst and 
Documentation/virt/kvm/x86/api.rst).  The idea is that, if userspace 
manages individual CPUID bits, it already knows what can be one anyway.

This is the kind of API that we need to present for TDX, even if the 
details on how to get the supported CPUID are different.  Not because 
it's a great API, but rather because it's a known API.

The difference between this and KVM_GET_SUPPORTED_CPUID are small, but 
the main one is X86_FEATURE_HYPERVISOR (I am not sure whether to make it 
different with respect to X86_FEATURE_TSC_DEADLINE_TIMER; leaning 
towards no).

We may also need a second ioctl specifically to return the fixed-1 bits. 
  Asking Xiaoyao for input with regard to what he'd like to have in QEMU.

> +	entry = kvm_find_cpuid_entry2((*cpuid)->entries, (*cpuid)->nent, 0x0, 0);
> +	if (WARN_ON(!entry))
> +		goto err;
> +	/* Fixup of maximum basic leaf */
> +	entry->eax |= 0x000000FF;
> +
> +	entry = kvm_find_cpuid_entry2((*cpuid)->entries, (*cpuid)->nent, 0x1, 0);
> +	if (WARN_ON(!entry))
> +		goto err;
> +	/* Fixup of FMS */
> +	entry->eax |= 0x0fff3fff;
> +	/* Fixup of maximum logical processors per package */
> +	entry->ebx |= 0x00ff0000;
> +

I see now why you could blindly AND things in patch 24.

However, the right mode of operation is still to pick manually which 
bits to AND.

Paolo
Xiaoyao Li Sept. 12, 2024, 7:48 a.m. UTC | #14
On 9/11/2024 1:52 AM, Paolo Bonzini wrote:
> On 8/13/24 00:48, Rick Edgecombe wrote:
>> Originally, the plan was to filter the directly configurable CPUID bits
>> exposed by KVM_TDX_CAPABILITIES, and the final configured bit values
>> provided by KVM_TDX_GET_CPUID. However, several issues were found with
>> this. Both the filtering done with KVM_TDX_CAPABILITIES and
>> KVM_TDX_GET_CPUID had the issue that the get_supported_cpuid() provided
>> default values instead of supported masks for multi-bit fields (i.e. 
>> those
>> encoding a multi-bit number).
>>
>> For KVM_TDX_CAPABILITIES, there was also the problem of bits that are
>> actually supported by KVM, but missing from get_supported_cpuid() for one
>> reason or another. These include X86_FEATURE_MWAIT, X86_FEATURE_HT and
>> X86_FEATURE_TSC_DEADLINE_TIMER. This is currently worked around in 
>> QEMU by
>> adjusting which features are expected. 

I'm not sure what issuee/problem can be worked around in QEMU.

QEMU doesn't expect these bit are reported by KVM as supported for TDX. 
QEMU just accepts the result reported by KVM.

The problem is, TDX module and the hardware allow these bits be 
configured for TD guest, but KVM doesn't allow. It leads to users cannot 
create a TD with these bits on.

QEMU cannot work around this problem.

>> Some of these are going to be 
>> added
>> to get_supported_cpuid(), and that is probably the right long term fix.
> 
> There are several cases here:
> 
> - MWAIT is hidden because it's hard to virtualize its C-state parameters
> 
> - HT is hidden because it depends on the topology, and cannot be added 
> blindly.
> 
> - TSC_DEADLINE_TIMER is queried with KVM_CHECK_EXTENSION for historical 
> reasons
> 
> There are basically two kinds of userspace:
> 
> - those that fetch KVM_GET_SUPPORED_CPUID and pass it blindly to 
> KVM_SET_CPUID2.  These mostly work, though they may miss a feature or 
> three (e.g. the TSC deadline timer).
> 
> - those that know each bit and make an informed decision on what to 
> enable; for those, KVM_GET_SUPPORTED_CPUID is just guidance.
> 
> Because of this, KVM_GET_SUPPORTED_CPUID doesn't return bits that are 
> one; it returns a mix of:
> 
> - maximum supported values (e.g. CPUID[7,0].EAX)
> 
> - values from the host (e.g. FMS or model name)
> 
> - supported features
> 
> It's an awfully defined API but it is easier to use than it sounds (some 
> of the quirks are being documented in 
> Documentation/virt/kvm/x86/errata.rst and 
> Documentation/virt/kvm/x86/api.rst).  The idea is that, if userspace 
> manages individual CPUID bits, it already knows what can be one anyway.
> 
> This is the kind of API that we need to present for TDX, even if the 
> details on how to get the supported CPUID are different.  Not because 
> it's a great API, but rather because it's a known API.

However there are differences for TDX. For legacy VMs, the result of 
KVM_GET_SUPPORTED_CPUID isn't used to filter the input of KVM_SET_CPUID2.

But for TDX, it needs to filter the input of KVM_TDX_VM_INIT.CPUID[] 
because TDX module only allows the bits that are reported as 
configurable to be set to 1.

> The difference between this and KVM_GET_SUPPORTED_CPUID are small, but 
> the main one is X86_FEATURE_HYPERVISOR (I am not sure whether to make it 
> different with respect to X86_FEATURE_TSC_DEADLINE_TIMER; leaning 
> towards no).
> 
> We may also need a second ioctl specifically to return the fixed-1 bits. 
>   Asking Xiaoyao for input with regard to what he'd like to have in QEMU.

With current designed API, QEMU can only know which bits are 
configurable before KVM_TDX_VM_INIT, i.e., which bits can be set to 1 or 
0 freely.

For other bits not reported as configurable, QEMU can know the exact 
value of them via KVM_TDX_GET_CPUID, after KVM_TDX_VM_INIT and before 
TD's running. With it, QEMU can validate the return value is matched 
with what QEMU wants to set that determined by users input. If not 
matched, QEMU can provide some warnings like what for legacy VMs:

   - TDX doesn't support requested feature: CPUID.01H.ECX.tsc-deadline 
[bit 24]
   - TDX forcibly sets features: CPUID.01H:ECX.hypervisor [bit 31]

If there are ioctls to report the fixed0 bits and fixed1 bits for TDX, 
QEMU can validate the user's configuration earlier.

>> +    entry = kvm_find_cpuid_entry2((*cpuid)->entries, (*cpuid)->nent, 
>> 0x0, 0);
>> +    if (WARN_ON(!entry))
>> +        goto err;
>> +    /* Fixup of maximum basic leaf */
>> +    entry->eax |= 0x000000FF;
>> +
>> +    entry = kvm_find_cpuid_entry2((*cpuid)->entries, (*cpuid)->nent, 
>> 0x1, 0);
>> +    if (WARN_ON(!entry))
>> +        goto err;
>> +    /* Fixup of FMS */
>> +    entry->eax |= 0x0fff3fff;
>> +    /* Fixup of maximum logical processors per package */
>> +    entry->ebx |= 0x00ff0000;
>> +
> 
> I see now why you could blindly AND things in patch 24.
> 
> However, the right mode of operation is still to pick manually which 
> bits to AND.
> 
> Paolo
>
Paolo Bonzini Sept. 12, 2024, 2:09 p.m. UTC | #15
On Thu, Sep 12, 2024 at 9:48 AM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
> On 9/11/2024 1:52 AM, Paolo Bonzini wrote:
> > On 8/13/24 00:48, Rick Edgecombe wrote:
> >> For KVM_TDX_CAPABILITIES, there was also the problem of bits that are
> >> actually supported by KVM, but missing from get_supported_cpuid() for one
> >> reason or another. These include X86_FEATURE_MWAIT, X86_FEATURE_HT and
> >> X86_FEATURE_TSC_DEADLINE_TIMER. This is currently worked around in
> >> QEMU by
> >> adjusting which features are expected.
>
> I'm not sure what issue/problem can be worked around in QEMU.
> QEMU doesn't expect these bit are reported by KVM as supported for TDX.
> QEMU just accepts the result reported by KVM.

QEMU already adds some extra bits, for example:

        ret |= CPUID_EXT_HYPERVISOR;
        if (kvm_irqchip_in_kernel() &&
                kvm_check_extension(s, KVM_CAP_TSC_DEADLINE_TIMER)) {
            ret |= CPUID_EXT_TSC_DEADLINE_TIMER;
        }

> The problem is, TDX module and the hardware allow these bits be
> configured for TD guest, but KVM doesn't allow. It leads to users cannot
> create a TD with these bits on.

KVM is not going to have any checks, it's only going to pass the
CPUID to the TDX module and return an error if the check fails
in the TDX module.

KVM can have a TDX-specific version of KVM_GET_SUPPORTED_CPUID, so
that we can keep a variant of the "get supported bits and pass them
to KVM_SET_CPUID2" logic, but that's it.

> > This is the kind of API that we need to present for TDX, even if the
> > details on how to get the supported CPUID are different.  Not because
> > it's a great API, but rather because it's a known API.
>
> However there are differences for TDX. For legacy VMs, the result of
> KVM_GET_SUPPORTED_CPUID isn't used to filter the input of KVM_SET_CPUID2.
> But for TDX, it needs to filter the input of KVM_TDX_VM_INIT.CPUID[]
> because TDX module only allows the bits that are reported as
> configurable to be set to 1.

Yes, that's userspace's responsibility.

> With current designed API, QEMU can only know which bits are
> configurable before KVM_TDX_VM_INIT, i.e., which bits can be set to 1 or
> 0 freely.

The API needs userspace to have full knowledge of the
requirements of the TDX module, if it wants to change the
defaults provided by KVM.

This is the same as for non-TDX VMs (including SNP).  The only
difference is that TDX and SNP fails, while non-confidential VMs
get slightly garbage CPUID.

> For other bits not reported as configurable, QEMU can know the exact
> value of them via KVM_TDX_GET_CPUID, after KVM_TDX_VM_INIT and before
> TD's running. With it, QEMU can validate the return value is matched
> with what QEMU wants to set that determined by users input. If not
> matched, QEMU can provide some warnings like what for legacy VMs:
>
>    - TDX doesn't support requested feature: CPUID.01H.ECX.tsc-deadline
> [bit 24]
>    - TDX forcibly sets features: CPUID.01H:ECX.hypervisor [bit 31]
>
> If there are ioctls to report the fixed0 bits and fixed1 bits for TDX,
> QEMU can validate the user's configuration earlier.

Yes, that's fine.

Paolo
Xiaoyao Li Sept. 12, 2024, 2:45 p.m. UTC | #16
On 9/12/2024 10:09 PM, Paolo Bonzini wrote:
> On Thu, Sep 12, 2024 at 9:48 AM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
>> On 9/11/2024 1:52 AM, Paolo Bonzini wrote:
>>> On 8/13/24 00:48, Rick Edgecombe wrote:
>>>> For KVM_TDX_CAPABILITIES, there was also the problem of bits that are
>>>> actually supported by KVM, but missing from get_supported_cpuid() for one
>>>> reason or another. These include X86_FEATURE_MWAIT, X86_FEATURE_HT and
>>>> X86_FEATURE_TSC_DEADLINE_TIMER. This is currently worked around in
>>>> QEMU by
>>>> adjusting which features are expected.
>>
>> I'm not sure what issue/problem can be worked around in QEMU.
>> QEMU doesn't expect these bit are reported by KVM as supported for TDX.
>> QEMU just accepts the result reported by KVM.
> 
> QEMU already adds some extra bits, for example:
> 
>          ret |= CPUID_EXT_HYPERVISOR;
>          if (kvm_irqchip_in_kernel() &&
>                  kvm_check_extension(s, KVM_CAP_TSC_DEADLINE_TIMER)) {
>              ret |= CPUID_EXT_TSC_DEADLINE_TIMER;
>          }
> 
>> The problem is, TDX module and the hardware allow these bits be
>> configured for TD guest, but KVM doesn't allow. It leads to users cannot
>> create a TD with these bits on.
> 
> KVM is not going to have any checks, it's only going to pass the
> CPUID to the TDX module and return an error if the check fails
> in the TDX module.

If so, new feature can be enabled for TDs out of KVM's control.

Is it acceptable?

> KVM can have a TDX-specific version of KVM_GET_SUPPORTED_CPUID, so
> that we can keep a variant of the "get supported bits and pass them
> to KVM_SET_CPUID2" logic, but that's it.
> 
>>> This is the kind of API that we need to present for TDX, even if the
>>> details on how to get the supported CPUID are different.  Not because
>>> it's a great API, but rather because it's a known API.
>>
>> However there are differences for TDX. For legacy VMs, the result of
>> KVM_GET_SUPPORTED_CPUID isn't used to filter the input of KVM_SET_CPUID2.
>> But for TDX, it needs to filter the input of KVM_TDX_VM_INIT.CPUID[]
>> because TDX module only allows the bits that are reported as
>> configurable to be set to 1.
> 
> Yes, that's userspace's responsibility.
> 
>> With current designed API, QEMU can only know which bits are
>> configurable before KVM_TDX_VM_INIT, i.e., which bits can be set to 1 or
>> 0 freely.
> 
> The API needs userspace to have full knowledge of the
> requirements of the TDX module, if it wants to change the
> defaults provided by KVM.
> 
> This is the same as for non-TDX VMs (including SNP).  The only
> difference is that TDX and SNP fails, while non-confidential VMs
> get slightly garbage CPUID.
> 
>> For other bits not reported as configurable, QEMU can know the exact
>> value of them via KVM_TDX_GET_CPUID, after KVM_TDX_VM_INIT and before
>> TD's running. With it, QEMU can validate the return value is matched
>> with what QEMU wants to set that determined by users input. If not
>> matched, QEMU can provide some warnings like what for legacy VMs:
>>
>>     - TDX doesn't support requested feature: CPUID.01H.ECX.tsc-deadline
>> [bit 24]
>>     - TDX forcibly sets features: CPUID.01H:ECX.hypervisor [bit 31]
>>
>> If there are ioctls to report the fixed0 bits and fixed1 bits for TDX,
>> QEMU can validate the user's configuration earlier.
> 
> Yes, that's fine.
> 
> Paolo
>
Paolo Bonzini Sept. 12, 2024, 2:48 p.m. UTC | #17
On Thu, Sep 12, 2024 at 4:45 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
> > KVM is not going to have any checks, it's only going to pass the
> > CPUID to the TDX module and return an error if the check fails
> > in the TDX module.
>
> If so, new feature can be enabled for TDs out of KVM's control.
>
> Is it acceptable?

It's the same as for non-TDX VMs, I think it's acceptable.

Paolo
Rick Edgecombe Sept. 12, 2024, 3:07 p.m. UTC | #18
On Thu, 2024-09-12 at 16:09 +0200, Paolo Bonzini wrote:
> 
> > The problem is, TDX module and the hardware allow these bits be
> > configured for TD guest, but KVM doesn't allow. It leads to users cannot
> > create a TD with these bits on.
> 
> KVM is not going to have any checks, it's only going to pass the
> CPUID to the TDX module and return an error if the check fails
> in the TDX module.

Ok. 

> 
> KVM can have a TDX-specific version of KVM_GET_SUPPORTED_CPUID, so
> that we can keep a variant of the "get supported bits and pass them
> to KVM_SET_CPUID2" logic, but that's it.

Can you clarify what you mean here when you say TDX-specific version of
KVM_GET_SUPPORTED_CPUID?

We have two things kind of like that implemented in this series:
1. KVM_TDX_GET_CPUID, which returns the CPUID bits actually set in the TD
2. KVM_TDX_CAPABILITIES, which returns CPUID bits that TDX module allows full
control over (i.e. what we have been calling directly configurable CPUID bits)

KVM_TDX_GET_CPUID->KVM_SET_CPUID2 kind of works like
KVM_GET_SUPPORTED_CPUID->KVM_SET_CPUID2, so I think that is what you mean, but
just want to confirm.

We can't get the needed information (fixed bits, etc) to create a TDX
KVM_GET_SUPPORTED_CPUID today from the TDX module, so we would have to encode it
into KVM. This was NAKed by Sean at some point. We have started looking into
exposing the needed info in the TDX module, but it is just starting.
Xiaoyao Li Sept. 12, 2024, 3:26 p.m. UTC | #19
On 9/12/2024 10:48 PM, Paolo Bonzini wrote:
> On Thu, Sep 12, 2024 at 4:45 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
>>> KVM is not going to have any checks, it's only going to pass the
>>> CPUID to the TDX module and return an error if the check fails
>>> in the TDX module.
>>
>> If so, new feature can be enabled for TDs out of KVM's control.
>>
>> Is it acceptable?
> 
> It's the same as for non-TDX VMs, I think it's acceptable.

another question is for patch 24, will we keep the filtering of the 
configurable CPUDIDs in KVM_TDX_CAPABILITIES with KVM_GET_SUPPORTED_CPUID?

> Paolo
>
Paolo Bonzini Sept. 12, 2024, 3:37 p.m. UTC | #20
On Thu, Sep 12, 2024 at 5:08 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
> > KVM can have a TDX-specific version of KVM_GET_SUPPORTED_CPUID, so
> > that we can keep a variant of the "get supported bits and pass them
> > to KVM_SET_CPUID2" logic, but that's it.
>
> Can you clarify what you mean here when you say TDX-specific version of
> KVM_GET_SUPPORTED_CPUID?
>
> We have two things kind of like that implemented in this series:
> 1. KVM_TDX_GET_CPUID, which returns the CPUID bits actually set in the TD
> 2. KVM_TDX_CAPABILITIES, which returns CPUID bits that TDX module allows full
> control over (i.e. what we have been calling directly configurable CPUID bits)
>
> KVM_TDX_GET_CPUID->KVM_SET_CPUID2 kind of works like
> KVM_GET_SUPPORTED_CPUID->KVM_SET_CPUID2, so I think that is what you mean, but
> just want to confirm.

Yes, that's correct.

> We can't get the needed information (fixed bits, etc) to create a TDX
> KVM_GET_SUPPORTED_CPUID today from the TDX module, so we would have to encode it
> into KVM. This was NAKed by Sean at some point. We have started looking into
> exposing the needed info in the TDX module, but it is just starting.

I think a bare minimum of this API is needed (adding HYPERVISOR,
and masking TDX-supported features against what KVM supports).
It's too much of a fundamental step in KVM's configuration API.

I am not sure if there are other fixed-1 bits than HYPERVISOR as of
today.  But in any case, if the TDX module breaks it unilaterally by
adding more fixed-1 bits, that's a problem for Intel not for KVM.

On the other hand is KVM_TDX_CAPABILITIES even needed?  If userspace
can replace that with hardcoded logic or info from the infamous JSON
file, that would work.

Paolo
Rick Edgecombe Sept. 12, 2024, 4:38 p.m. UTC | #21
On Thu, 2024-09-12 at 17:37 +0200, Paolo Bonzini wrote:
> Yes, that's correct.

Thanks.

> 
> > We can't get the needed information (fixed bits, etc) to create a TDX
> > KVM_GET_SUPPORTED_CPUID today from the TDX module, so we would have to
> > encode it
> > into KVM. This was NAKed by Sean at some point. We have started looking into
> > exposing the needed info in the TDX module, but it is just starting.
> 
> I think a bare minimum of this API is needed (adding HYPERVISOR,
> and masking TDX-supported features against what KVM supports).
> It's too much of a fundamental step in KVM's configuration API.

Ok so we want KVM_TDX_CAPABILITIES to filter bits, but not KVM_TDX_GET_CPUID.

> 
> I am not sure if there are other fixed-1 bits than HYPERVISOR as of
> today.  But in any case, if the TDX module breaks it unilaterally by
> adding more fixed-1 bits, that's a problem for Intel not for KVM.
> 
> On the other hand is KVM_TDX_CAPABILITIES even needed?  If userspace
> can replace that with hardcoded logic or info from the infamous JSON
> file, that would work.

The directly configurable CPUID bits will grow over time. So if we don't expose
the supported ones, userspace will have to guess which ones it can set at that
point.

But as long as the list doesn't shrink we could encode the directly configurable
data in userspace for now, then add an API later when the list of bits grows. If
the API is not present, userspace can assume it's only the original list.
Sean Christopherson Sept. 12, 2024, 4:42 p.m. UTC | #22
On Thu, Sep 12, 2024, Paolo Bonzini wrote:
> On Thu, Sep 12, 2024 at 4:45 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
> > > KVM is not going to have any checks, it's only going to pass the
> > > CPUID to the TDX module and return an error if the check fails
> > > in the TDX module.
> >
> > If so, new feature can be enabled for TDs out of KVM's control.
> >
> > Is it acceptable?
> 
> It's the same as for non-TDX VMs, I think it's acceptable.

No?  IIUC, it's not the same.

E.g. KVM doesn't yet support CET, and while userspace can enumerate CET support
to VMs all it wants, guests will never be able to set CR4.CET and thus can't
actually enable CET.

IIUC, the proposal here is to allow userspace to configure the features that are
exposed _and enabled_ for a TDX VM without any enforcement from KVM.

CET might be a bad example because it looks like it's controlled by TDCS.XFAM, but
presumably there are other CPUID-based features that would actively enable some
feature for a TDX VM.

For HYPERVISOR and TSC_DEADLINE_TIMER, I would much prefer to fix those KVM warts,
and have already posted patches[1][2] to do exactly that.

With those out of the way, are there any other CPUID-based features that KVM
supports, but doesn't advertise?  Ignore MWAIT, it's a special case and isn't
allowed in TDX VMs anyways.

[1] https://lore.kernel.org/all/20240517173926.965351-34-seanjc@google.com
[2] https://lore.kernel.org/all/20240517173926.965351-35-seanjc@google.com
Paolo Bonzini Sept. 12, 2024, 6:29 p.m. UTC | #23
On Thu, Sep 12, 2024 at 6:42 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Sep 12, 2024, Paolo Bonzini wrote:
> > On Thu, Sep 12, 2024 at 4:45 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
> > > > KVM is not going to have any checks, it's only going to pass the
> > > > CPUID to the TDX module and return an error if the check fails
> > > > in the TDX module.
> > >
> > > If so, new feature can be enabled for TDs out of KVM's control.
> > >
> > > Is it acceptable?
> >
> > It's the same as for non-TDX VMs, I think it's acceptable.
>
> No?  IIUC, it's not the same.
>
> E.g. KVM doesn't yet support CET, and while userspace can enumerate CET support
> to VMs all it wants, guests will never be able to set CR4.CET and thus can't
> actually enable CET.
>
> IIUC, the proposal here is to allow userspace to configure the features that are
> exposed _and enabled_ for a TDX VM without any enforcement from KVM.

Yeah, that's correct, on the other hand a lot of features are just
new instructions and no new registers.  Those pass under the radar
and in fact you can even use them if the CPUID bit is 0 (of course).
Others are just data, and again you can pass any crap you'd like.

And for SNP we had the case where we are forced to leave features
enabled if their state is in the VMSA, because we cannot block
writes to XCR0 and XSS that we'd like to be invalid.

> CET might be a bad example because it looks like it's controlled by TDCS.XFAM, but
> presumably there are other CPUID-based features that would actively enable some
> feature for a TDX VM.

XFAM is controlled by userspace though, not KVM, so we've got no
control on that either.

> For HYPERVISOR and TSC_DEADLINE_TIMER, I would much prefer to fix those KVM warts,
> and have already posted patches[1][2] to do exactly that.
>
> With those out of the way, are there any other CPUID-based features that KVM
> supports, but doesn't advertise?  Ignore MWAIT, it's a special case and isn't
> allowed in TDX VMs anyways.

I don't think so.

Paolo
Sean Christopherson Sept. 12, 2024, 6:41 p.m. UTC | #24
On Thu, Sep 12, 2024, Paolo Bonzini wrote:
> On Thu, Sep 12, 2024 at 6:42 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Sep 12, 2024, Paolo Bonzini wrote:
> > > On Thu, Sep 12, 2024 at 4:45 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
> > > > > KVM is not going to have any checks, it's only going to pass the
> > > > > CPUID to the TDX module and return an error if the check fails
> > > > > in the TDX module.
> > > >
> > > > If so, new feature can be enabled for TDs out of KVM's control.
> > > >
> > > > Is it acceptable?
> > >
> > > It's the same as for non-TDX VMs, I think it's acceptable.
> >
> > No?  IIUC, it's not the same.
> >
> > E.g. KVM doesn't yet support CET, and while userspace can enumerate CET support
> > to VMs all it wants, guests will never be able to set CR4.CET and thus can't
> > actually enable CET.
> >
> > IIUC, the proposal here is to allow userspace to configure the features that are
> > exposed _and enabled_ for a TDX VM without any enforcement from KVM.
> 
> Yeah, that's correct, on the other hand a lot of features are just
> new instructions and no new registers.  Those pass under the radar
> and in fact you can even use them if the CPUID bit is 0 (of course).
> Others are just data, and again you can pass any crap you'd like.

Right, I don't care about those precisely because there's nothing KVM can or
_needs_ to do for features that don't have interception controls.

> And for SNP we had the case where we are forced to leave features
> enabled if their state is in the VMSA, because we cannot block
> writes to XCR0 and XSS that we'd like to be invalid.

Oh, I'm well aware :-)

> > CET might be a bad example because it looks like it's controlled by TDCS.XFAM, but
> > presumably there are other CPUID-based features that would actively enable some
> > feature for a TDX VM.
> 
> XFAM is controlled by userspace though, not KVM, so we've got no
> control on that either.

I assume it's plain text though?  I.e. whatever ioctl() sets TDCS.XFAM can be
rejected by KVM if it attempts to enable unsupported features?

I don't expect that we'll want KVM to gatekeep many, if any features, but I do
think we should require explicit enabling in KVM whenever possible, even if the
enabling is boring and largely ceremonial.
Rick Edgecombe Sept. 12, 2024, 6:42 p.m. UTC | #25
On Thu, 2024-09-12 at 20:29 +0200, Paolo Bonzini wrote:
> > IIUC, the proposal here is to allow userspace to configure the features that
> > are
> > exposed _and enabled_ for a TDX VM without any enforcement from KVM.
> 
> Yeah, that's correct, on the other hand a lot of features are just
> new instructions and no new registers.  Those pass under the radar
> and in fact you can even use them if the CPUID bit is 0 (of course).
> Others are just data, and again you can pass any crap you'd like.
> 
> And for SNP we had the case where we are forced to leave features
> enabled if their state is in the VMSA, because we cannot block
> writes to XCR0 and XSS that we'd like to be invalid.
> 
> > CET might be a bad example because it looks like it's controlled by
> > TDCS.XFAM, but
> > presumably there are other CPUID-based features that would actively enable
> > some
> > feature for a TDX VM.
> 
> XFAM is controlled by userspace though, not KVM, so we've got no
> control on that either.

There are some ATTRIBUTES (the non-xsave features like PKS get bucketed in
there), which can affect the host. So we have to filter this config in KVM. I'd
just assume not trust future XFAM bits because it's easy to implement.


> 
> > For HYPERVISOR and TSC_DEADLINE_TIMER, I would much prefer to fix those KVM
> > warts,
> > and have already posted patches[1][2] to do exactly that.
> > 
> > With those out of the way, are there any other CPUID-based features that KVM
> > supports, but doesn't advertise?  Ignore MWAIT, it's a special case and
> > isn't
> > allowed in TDX VMs anyways.
> 
> I don't think so.
Xiaoyao Li Sept. 13, 2024, 3:54 a.m. UTC | #26
On 9/13/2024 2:41 AM, Sean Christopherson wrote:
>>> CET might be a bad example because it looks like it's controlled by TDCS.XFAM, but
>>> presumably there are other CPUID-based features that would actively enable some
>>> feature for a TDX VM.
>> XFAM is controlled by userspace though, not KVM, so we've got no
>> control on that either.
> I assume it's plain text though?  I.e. whatever ioctl() sets TDCS.XFAM can be
> rejected by KVM if it attempts to enable unsupported features?

yes. XFAM is validated by KVM actually in this series.

KVM reports supported_xfam via KVM_TDX_CAPABILITIES and userspace sets 
XFAM via ioctl(KVM_TDX_VM_INIT). If userspace sets any bits beyond the 
supported_xfam, KVM returns -EINVAL.

The same for attributes.

> I don't expect that we'll want KVM to gatekeep many, if any features, but I do
> think we should require explicit enabling in KVM whenever possible, even if the
> enabling is boring and largely ceremonial.
+1
Xiaoyao Li Sept. 13, 2024, 3:57 a.m. UTC | #27
On 9/13/2024 12:42 AM, Sean Christopherson wrote:
> On Thu, Sep 12, 2024, Paolo Bonzini wrote:
>> On Thu, Sep 12, 2024 at 4:45 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
>>>> KVM is not going to have any checks, it's only going to pass the
>>>> CPUID to the TDX module and return an error if the check fails
>>>> in the TDX module.
>>>
>>> If so, new feature can be enabled for TDs out of KVM's control.
>>>
>>> Is it acceptable?
>>
>> It's the same as for non-TDX VMs, I think it's acceptable.
> 
> No?  IIUC, it's not the same.
> 
> E.g. KVM doesn't yet support CET, and while userspace can enumerate CET support
> to VMs all it wants, guests will never be able to set CR4.CET and thus can't
> actually enable CET.
> 
> IIUC, the proposal here is to allow userspace to configure the features that are
> exposed _and enabled_ for a TDX VM without any enforcement from KVM.
> 
> CET might be a bad example because it looks like it's controlled by TDCS.XFAM, but
> presumably there are other CPUID-based features that would actively enable some
> feature for a TDX VM.
> 
> For HYPERVISOR and TSC_DEADLINE_TIMER, I would much prefer to fix those KVM warts,
> and have already posted patches[1][2] to do exactly that.
> 
> With those out of the way, are there any other CPUID-based features that KVM
> supports, but doesn't advertise?  Ignore MWAIT, it's a special case and isn't
> allowed in TDX VMs anyways.

Actually MWAIT becoems allowed by TDX and it's configurable.

> [1] https://lore.kernel.org/all/20240517173926.965351-34-seanjc@google.com
> [2] https://lore.kernel.org/all/20240517173926.965351-35-seanjc@google.com
>
diff mbox series

Patch

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index d45b4f7b69ba..34e838d8f7fd 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1086,13 +1086,24 @@  static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
 	return ret;
 }
 
+/*
+ * This function is used in two cases:
+ * 1. mask KVM unsupported/unknown bits from the configurable CPUIDs reported
+ *    by TDX module. in setup_kvm_tdx_caps().
+ * 2. mask KVM unsupported/unknown bits from the actual CPUID value of TD that
+ *    read from TDX module. in tdx_vcpu_get_cpuid().
+ *
+ * For both cases, it needs fixup for the field that consists of multiple bits.
+ * For multi-bits field, we need a mask however what
+ * kvm_get_supported_cpuid_internal() returns is just a default value.
+ */
 static int tdx_get_kvm_supported_cpuid(struct kvm_cpuid2 **cpuid)
 {
-
 	int r;
 	static const u32 funcs[] = {
 		0, 0x80000000, KVM_CPUID_SIGNATURE,
 	};
+	struct kvm_cpuid_entry2 *entry;
 
 	*cpuid = kzalloc(sizeof(struct kvm_cpuid2) +
 			sizeof(struct kvm_cpuid_entry2) * KVM_MAX_CPUID_ENTRIES,
@@ -1104,6 +1115,89 @@  static int tdx_get_kvm_supported_cpuid(struct kvm_cpuid2 **cpuid)
 	if (r)
 		goto err;
 
+	entry = kvm_find_cpuid_entry2((*cpuid)->entries, (*cpuid)->nent, 0x0, 0);
+	if (WARN_ON(!entry))
+		goto err;
+	/* Fixup of maximum basic leaf */
+	entry->eax |= 0x000000FF;
+
+	entry = kvm_find_cpuid_entry2((*cpuid)->entries, (*cpuid)->nent, 0x1, 0);
+	if (WARN_ON(!entry))
+		goto err;
+	/* Fixup of FMS */
+	entry->eax |= 0x0fff3fff;
+	/* Fixup of maximum logical processors per package */
+	entry->ebx |= 0x00ff0000;
+
+	/*
+	 * Fixup of CPUID leaf 4, which enmerates cache info, all of the
+	 * non-reserved fields except EBX[11:0] (System Coherency Line Size)
+	 * are configurable for TDs.
+	 */
+	entry = kvm_find_cpuid_entry2((*cpuid)->entries, (*cpuid)->nent, 0x4, 0);
+	if (WARN_ON(!entry))
+		goto err;
+	entry->eax |= 0xffffc3ff;
+	entry->ebx |= 0xfffff000;
+	entry->ecx |= 0xffffffff;
+	entry->edx |= 0x00000007;
+
+	entry = kvm_find_cpuid_entry2((*cpuid)->entries, (*cpuid)->nent, 0x4, 1);
+	if (WARN_ON(!entry))
+		goto err;
+	entry->eax |= 0xffffc3ff;
+	entry->ebx |= 0xfffff000;
+	entry->ecx |= 0xffffffff;
+	entry->edx |= 0x00000007;
+
+	entry = kvm_find_cpuid_entry2((*cpuid)->entries, (*cpuid)->nent, 0x4, 2);
+	if (WARN_ON(!entry))
+		goto err;
+	entry->eax |= 0xffffc3ff;
+	entry->ebx |= 0xfffff000;
+	entry->ecx |= 0xffffffff;
+	entry->edx |= 0x00000007;
+
+	entry = kvm_find_cpuid_entry2((*cpuid)->entries, (*cpuid)->nent, 0x4, 3);
+	if (WARN_ON(!entry))
+		goto err;
+	entry->eax |= 0xffffc3ff;
+	entry->ebx |= 0xfffff000;
+	entry->ecx |= 0xffffffff;
+	entry->edx |= 0x00000007;
+
+	/* Fixup of CPUID leaf 0xB */
+	entry = kvm_find_cpuid_entry2((*cpuid)->entries, (*cpuid)->nent, 0xb, 0);
+	if (WARN_ON(!entry))
+		goto err;
+	entry->eax = 0x0000001f;
+	entry->ebx = 0x0000ffff;
+	entry->ecx = 0x0000ffff;
+
+	/*
+	 * Fixup of CPUID leaf 0x1f, which is totally configurable for TDs.
+	 */
+	entry = kvm_find_cpuid_entry2((*cpuid)->entries, (*cpuid)->nent, 0x1f, 0);
+	if (WARN_ON(!entry))
+		goto err;
+	entry->eax = 0x0000001f;
+	entry->ebx = 0x0000ffff;
+	entry->ecx = 0x0000ffff;
+
+	for (int i = 1; i <= 5; i++) {
+		entry = kvm_find_cpuid_entry2((*cpuid)->entries, (*cpuid)->nent, 0x1f, i);
+		if (!entry) {
+			entry = &(*cpuid)->entries[(*cpuid)->nent];
+			entry->function = 0x1f;
+			entry->index = i;
+			entry->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+			(*cpuid)->nent++;
+		}
+		entry->eax = 0x0000001f;
+		entry->ebx = 0x0000ffff;
+		entry->ecx = 0x0000ffff;
+	}
+
 	return 0;
 err:
 	kfree(*cpuid);