Message ID: 9bd868a287599eb2a854f6983f13b4500f47d2ae.1708933498.git.isaku.yamahata@intel.com (mailing list archive)
State: New, archived
Series: [v19,001/130] x86/virt/tdx: Rename _offset to _member for TD_SYSINFO_MAP() macro
On 26/02/2024 9:25 pm, Yamahata, Isaku wrote: > From: Isaku Yamahata <isaku.yamahata@intel.com> > > TDX has its own limitation on the maximum number of vcpus that the guest > can accommodate. "limitation" -> "control". "the guest" -> "a guest". Allow x86 kvm backend to implement its own KVM_ENABLE_CAP > handler and implement TDX backend for KVM_CAP_MAX_VCPUS. I am not sure we normally say "x86 KVM backend". Just say "Allow KVM x86 ...". user space VMM, > e.g. qemu, can specify its value instead of KVM_MAX_VCPUS. Grammar check. > > When creating TD (TDH.MNG.INIT), the maximum number of vcpu needs to be > specified as struct td_params_struct. 'struct td_params_struct'?? Anyway, I don't think you need to mention such details. and the value is a part of > measurement. The user space has to specify the value somehow. "and" -> "And" (grammar check please). And add an empty line to start below as a new paragraph. There are > two options for it. > option 1. API (Set KVM_CAP_MAX_VCPU) to specify the value (this patch) > option 2. Add max_vcpu as a parameter to initialize the guest. > (TDG.MNG.INIT) First of all, it seems to me that the two are not conflicting. Based on the uapi/kvm.h: #define KVM_CAP_MAX_VCPUS 66 /* returns max vcpus per vm */ Currently KVM x86 doesn't allow to configure MAX_VCPU on VM-basis, but always reports KVM_MAX_VCPUS for _ALL_ VMs. I.e., it doesn't support userspace to explicitly enable KVM_CAP_MAX_VCPUS for a given VM. Now, if we allow the userspace to configure the MAX_VCPU for TDX guest (this could be a separate discussion in fact) due to attestation whatever, we need to support allowing userspace to configure MAX_VCPUS on VM-basis. Therefore, option 1 isn't really an option to me, but is the thing that we _SHOULD_ do to support TDX. So this pach should really just add "per-VM max vcpus" support for TDX, starting from: struct kvm_tdx { /* or 'struct kvm_arch' ?? */ ... int max_vcpus; } And in TDH.MNG.INIT, we need to manually check the MAX_VCPU specified in TD_PARAMS structure to make sure it matches to the record that we specified via KVM_CAP_MAX_VCPUS. So how about: " TDX has its own mechanism to control the maximum number of VCPUs that the TDX guest can use. When creating a TDX guest, the maximum number of vcpus needs to be passed to the TDX module as part of the measurement of the guest. Because the value is part of the measurement, thus part of attestation, it better to allow the userspace to be able to configure it. E.g. the users may want to precisely control the maximum number of vcpus their precious VMs can use. The actual control itself must be done via the TDH.MNG.INIT SEAMCALL itself, where the number of maximum cpus is an input to the TDX module, but KVM needs to support the "per-VM number of maximum vcpus" and reflect that in the KVM_CAP_MAX_VCPUS. Currently, the KVM x86 always reports KVM_MAX_VCPUS for all VMs but doesn't allow to enable KVM_CAP_MAX_VCPUS to configure the number of maximum vcpus on VM-basis. Add "per-VM maximum vcpus" to KVM x86/TDX to accommodate TDX's needs. The userspace-configured value then can be verified when KVM is actually creating the TDX guest. "
On Fri, Mar 22, 2024 at 12:36:40PM +1300, "Huang, Kai" <kai.huang@intel.com> wrote: > So how about: Thanks for it. I'll update the commit message with some minor fixes. > " > TDX has its own mechanism to control the maximum number of VCPUs that the > TDX guest can use. When creating a TDX guest, the maximum number of vcpus > needs to be passed to the TDX module as part of the measurement of the > guest. > > Because the value is part of the measurement, thus part of attestation, it ^'s > better to allow the userspace to be able to configure it. E.g. the users the userspace to configure it ^, > may want to precisely control the maximum number of vcpus their precious VMs > can use. > > The actual control itself must be done via the TDH.MNG.INIT SEAMCALL itself, > where the number of maximum cpus is an input to the TDX module, but KVM > needs to support the "per-VM number of maximum vcpus" and reflect that in per-VM maximum number of vcpus > the KVM_CAP_MAX_VCPUS. > > Currently, the KVM x86 always reports KVM_MAX_VCPUS for all VMs but doesn't > allow to enable KVM_CAP_MAX_VCPUS to configure the number of maximum vcpus maximum number of vcpus > on VM-basis. > > Add "per-VM maximum vcpus" to KVM x86/TDX to accommodate TDX's needs. > > The userspace-configured value then can be verified when KVM is actually used > creating the TDX guest. > "
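For illustration, a minimal userspace sketch of the flow being discussed above: create a VM, then use the VM-scoped KVM_ENABLE_CAP ioctl to request a per-VM vCPU limit. The ioctls and struct kvm_enable_cap are existing KVM uAPI, but enabling KVM_CAP_MAX_VCPUS per VM only succeeds with the proposed patch applied; the limit of 8 is arbitrary and error handling on open()/KVM_CREATE_VM is omitted for brevity.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
	int kvm_fd = open("/dev/kvm", O_RDWR);
	int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);	/* 0: default VM type */
	struct kvm_enable_cap cap;

	memset(&cap, 0, sizeof(cap));
	cap.cap = KVM_CAP_MAX_VCPUS;
	cap.args[0] = 8;	/* desired per-VM vCPU limit (arbitrary here) */

	/*
	 * VM-scoped KVM_ENABLE_CAP: only succeeds once KVM supports
	 * configuring KVM_CAP_MAX_VCPUS on a per-VM basis, as proposed
	 * in this thread.
	 */
	if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap))
		perror("KVM_ENABLE_CAP(KVM_CAP_MAX_VCPUS)");

	return 0;
}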
On 3/23/2024 9:13 AM, Isaku Yamahata wrote: > On Fri, Mar 22, 2024 at 12:36:40PM +1300, > "Huang, Kai" <kai.huang@intel.com> wrote: > >> So how about: > Thanks for it. I'll update the commit message with some minor fixes. > >> " >> TDX has its own mechanism to control the maximum number of VCPUs that the >> TDX guest can use. When creating a TDX guest, the maximum number of vcpus >> needs to be passed to the TDX module as part of the measurement of the >> guest. >> >> Because the value is part of the measurement, thus part of attestation, it > ^'s >> better to allow the userspace to be able to configure it. E.g. the users > the userspace to configure it ^, >> may want to precisely control the maximum number of vcpus their precious VMs >> can use. >> >> The actual control itself must be done via the TDH.MNG.INIT SEAMCALL itself, >> where the number of maximum cpus is an input to the TDX module, but KVM >> needs to support the "per-VM number of maximum vcpus" and reflect that in > per-VM maximum number of vcpus >> the KVM_CAP_MAX_VCPUS. >> >> Currently, the KVM x86 always reports KVM_MAX_VCPUS for all VMs but doesn't >> allow to enable KVM_CAP_MAX_VCPUS to configure the number of maximum vcpus > maximum number of vcpus >> on VM-basis. >> >> Add "per-VM maximum vcpus" to KVM x86/TDX to accommodate TDX's needs. >> >> The userspace-configured value then can be verified when KVM is actually > used Here, "verified", I think Kai wanted to emphasize that the value of max_vcpus passed in via KVM_TDX_INIT_VM should be checked against the value configured via KVM_CAP_MAX_VCPUS? Maybe "verified and used" ? >> creating the TDX guest. >> " >
>> Currently, the KVM x86 always reports KVM_MAX_VCPUS for all VMs but doesn't >> allow to enable KVM_CAP_MAX_VCPUS to configure the number of maximum vcpus > maximum number of vcpus >> on VM-basis. >> >> Add "per-VM maximum vcpus" to KVM x86/TDX to accommodate TDX's needs. >> >> The userspace-configured value then can be verified when KVM is actually > used >> creating the TDX guest. >> " I think we still have two options regarding how 'max_vcpus' is handled in the ioctl() that does TDH.MNG.INIT: 1) Just use the 'max_vcpus' set via KVM_ENABLE_CAP(KVM_CAP_MAX_VCPUS), 2) Still pass the 'max_vcpus' as input, but KVM verifies it against the value that was saved in KVM_ENABLE_CAP(KVM_CAP_MAX_VCPUS). 2) seems unnecessary, so I have no objection to using 1). But it seems we could still mention it in the changelog of that patch?
On Mon, Mar 25, 2024 at 09:43:31PM +1300, "Huang, Kai" <kai.huang@intel.com> wrote: > > > > Currently, the KVM x86 always reports KVM_MAX_VCPUS for all VMs but doesn't > > > allow to enable KVM_CAP_MAX_VCPUS to configure the number of maximum vcpus > > maximum number of vcpus > > > on VM-basis. > > > > > > Add "per-VM maximum vcpus" to KVM x86/TDX to accommodate TDX's needs. > > > > > > The userspace-configured value then can be verified when KVM is actually > > used > > > creating the TDX guest. > > > " > > I think we still have two options regarding to how 'max_vcpus' is handled in > ioctl() to do TDH.MNG.INIT: > > 1) Just use the 'max_vcpus' done in KVM_ENABLE_CAP(KVM_CAP_MAX_VCPUS), > 2) Still pass the 'max_vcpus' as input, but KVM verifies it against the > value that is saved in KVM_ENABLE_CAP(KVM_CAP_MAX_VCPUS). > > 2) seems unnecessary, so I don't have objection to use 1). But it seems we > could still mention it in the changelog in that patch? Sure, let me update the commit log.
On Mon, Mar 25, 2024 at 04:42:36PM +0800, Binbin Wu <binbin.wu@linux.intel.com> wrote: > > > On 3/23/2024 9:13 AM, Isaku Yamahata wrote: > > On Fri, Mar 22, 2024 at 12:36:40PM +1300, > > "Huang, Kai" <kai.huang@intel.com> wrote: > > > > > So how about: > > Thanks for it. I'll update the commit message with some minor fixes. > > > > > " > > > TDX has its own mechanism to control the maximum number of VCPUs that the > > > TDX guest can use. When creating a TDX guest, the maximum number of vcpus > > > needs to be passed to the TDX module as part of the measurement of the > > > guest. > > > > > > Because the value is part of the measurement, thus part of attestation, it > > ^'s > > > better to allow the userspace to be able to configure it. E.g. the users > > the userspace to configure it ^, > > > may want to precisely control the maximum number of vcpus their precious VMs > > > can use. > > > > > > The actual control itself must be done via the TDH.MNG.INIT SEAMCALL itself, > > > where the number of maximum cpus is an input to the TDX module, but KVM > > > needs to support the "per-VM number of maximum vcpus" and reflect that in > > per-VM maximum number of vcpus > > > the KVM_CAP_MAX_VCPUS. > > > > > > Currently, the KVM x86 always reports KVM_MAX_VCPUS for all VMs but doesn't > > > allow to enable KVM_CAP_MAX_VCPUS to configure the number of maximum vcpus > > maximum number of vcpus > > > on VM-basis. > > > > > > Add "per-VM maximum vcpus" to KVM x86/TDX to accommodate TDX's needs. > > > > > > The userspace-configured value then can be verified when KVM is actually > > used > > Here, "verified", I think Kai wanted to emphasize that the value of > max_vcpus passed in via > KVM_TDX_INIT_VM should be checked against the value configured via > KVM_CAP_MAX_VCPUS? > > Maybe "verified and used" ? Ok. I don't have strong opinion here.
> > Here, "verified", I think Kai wanted to emphasize that the value of > > max_vcpus passed in via KVM_TDX_INIT_VM should be checked against the > > value configured via KVM_CAP_MAX_VCPUS? > > > > Maybe "verified and used" ? > > Ok. I don't have strong opinion here. It depends on how you implement that patch. If we don't pass 'max_vcpus' in that patch, there's nothing to verify really.
On Mon, Feb 26, 2024, isaku.yamahata@intel.com wrote: > From: Isaku Yamahata <isaku.yamahata@intel.com> > > TDX has its own limitation on the maximum number of vcpus that the guest > can accommodate. Allow x86 kvm backend to implement its own KVM_ENABLE_CAP > handler and implement TDX backend for KVM_CAP_MAX_VCPUS. user space VMM, > e.g. qemu, can specify its value instead of KVM_MAX_VCPUS. > > When creating TD (TDH.MNG.INIT), the maximum number of vcpu needs to be > specified as struct td_params_struct. and the value is a part of > measurement. The user space has to specify the value somehow. There are > two options for it. > option 1. API (Set KVM_CAP_MAX_VCPU) to specify the value (this patch) When I suggested adding a capability[*], the intent was for the capability to be generic, not buried in TDX code. I can't think of any reason why this can't be supported for all VMs on all architectures. The only wrinkle is that it'll require a separate capability since userspace needs to be able to detect that KVM supports restricting the number of vCPUs, but that'll still be _less_ code. [*] https://lore.kernel.org/all/YZVsnZ8e7cXls2P2@google.com > +static int vt_max_vcpus(struct kvm *kvm) > +{ > + if (!kvm) > + return KVM_MAX_VCPUS; > + > + if (is_td(kvm)) > + return min(kvm->max_vcpus, TDX_MAX_VCPUS); > + > + return kvm->max_vcpus; This is _completely_ orthogonal to allowing userspace to restrict the maximum number of vCPUs. And unless I'm missing something, it's also ridiculous and unnecessary at this time. KVM x86 limits KVM_MAX_VCPUS to 4096: config KVM_MAX_NR_VCPUS int "Maximum number of vCPUs per KVM guest" depends on KVM range 1024 4096 default 4096 if MAXSMP default 1024 help whereas the limitation from TDX is apprarently simply due to TD_PARAMS taking a 16-bit unsigned value: #define TDX_MAX_VCPUS (~(u16)0) i.e. it will likely be _years_ before TDX's limitation matters, if it ever does. And _if_ it becomes a problem, we don't necessarily need to have a different _runtime_ limit for TDX, e.g. TDX support could be conditioned on KVM_MAX_NR_VCPUS being <= 64k. So rather than add a bunch of pointless plumbing, just throw in diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c index 137d08da43c3..018d5b9eb93d 100644 --- a/arch/x86/kvm/vmx/tdx.c +++ b/arch/x86/kvm/vmx/tdx.c @@ -2488,6 +2488,9 @@ static int setup_tdparams(struct kvm *kvm, struct td_params *td_params, return -EOPNOTSUPP; } + BUILD_BUG_ON(CONFIG_KVM_MAX_NR_VCPUS < + sizeof(td_params->max_vcpus) * BITS_PER_BYTE); + td_params->max_vcpus = kvm->max_vcpus; td_params->attributes = init_vm->attributes; /* td_params->exec_controls = TDX_CONTROL_FLAG_NO_RBP_MOD; */
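The intent of the compile-time check suggested above can be modelled outside the kernel with a C11 _Static_assert (a sketch under assumptions: the 4096 value comes from the Kconfig snippet quoted in the email, and struct td_params_model is an illustrative stand-in for TD_PARAMS). Compiling this file is the check itself: it fails if KVM's configured vCPU ceiling could not be represented in the 16-bit max_vcpus field.

#include <stdint.h>

#define KVM_MAX_NR_VCPUS 4096	/* illustrative: KVM's current Kconfig upper bound */

/* Model of the 16-bit field TD_PARAMS uses for the vCPU limit. */
struct td_params_model {
	uint16_t max_vcpus;
};

/* Build fails if KVM's ceiling could overflow the 16-bit field. */
_Static_assert(KVM_MAX_NR_VCPUS <= UINT16_MAX,
	       "KVM vCPU ceiling must fit in the 16-bit TD_PARAMS max_vcpus field");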
On 10/05/2024 4:35 am, Sean Christopherson wrote: > On Mon, Feb 26, 2024, isaku.yamahata@intel.com wrote: >> From: Isaku Yamahata <isaku.yamahata@intel.com> >> >> TDX has its own limitation on the maximum number of vcpus that the guest >> can accommodate. Allow x86 kvm backend to implement its own KVM_ENABLE_CAP >> handler and implement TDX backend for KVM_CAP_MAX_VCPUS. user space VMM, >> e.g. qemu, can specify its value instead of KVM_MAX_VCPUS. >> >> When creating TD (TDH.MNG.INIT), the maximum number of vcpu needs to be >> specified as struct td_params_struct. and the value is a part of >> measurement. The user space has to specify the value somehow. There are >> two options for it. >> option 1. API (Set KVM_CAP_MAX_VCPU) to specify the value (this patch) > > When I suggested adding a capability[*], the intent was for the capability to > be generic, not buried in TDX code. I can't think of any reason why this can't > be supported for all VMs on all architectures. The only wrinkle is that it'll > require a separate capability since userspace needs to be able to detect that > KVM supports restricting the number of vCPUs, but that'll still be _less_ code. > > [*] https://lore.kernel.org/all/YZVsnZ8e7cXls2P2@google.com > >> +static int vt_max_vcpus(struct kvm *kvm) >> +{ >> + if (!kvm) >> + return KVM_MAX_VCPUS; >> + >> + if (is_td(kvm)) >> + return min(kvm->max_vcpus, TDX_MAX_VCPUS); >> + >> + return kvm->max_vcpus; > > This is _completely_ orthogonal to allowing userspace to restrict the maximum > number of vCPUs. And unless I'm missing something, it's also ridiculous and > unnecessary at this time. Right it's not necessary. I think it can be reported as: case KVM_CAP_MAX_VCPUS: r = KVM_MAX_VCPUS; + if (kvm) + r = kvm->max_vcpus; break; >KVM x86 limits KVM_MAX_VCPUS to 4096: > > config KVM_MAX_NR_VCPUS > int "Maximum number of vCPUs per KVM guest" > depends on KVM > range 1024 4096 > default 4096 if MAXSMP > default 1024 > help > > whereas the limitation from TDX is apprarently simply due to TD_PARAMS taking > a 16-bit unsigned value: > > #define TDX_MAX_VCPUS (~(u16)0) > > i.e. it will likely be _years_ before TDX's limitation matters, if it ever does. > And _if_ it becomes a problem, we don't necessarily need to have a different > _runtime_ limit for TDX, e.g. TDX support could be conditioned on KVM_MAX_NR_VCPUS > being <= 64k. Actually later versions of TDX module (starting from 1.5 AFAICT), the module has a metadata field to report the maximum vCPUs that the module can support for all TDX guests. > > So rather than add a bunch of pointless plumbing, just throw in > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c > index 137d08da43c3..018d5b9eb93d 100644 > --- a/arch/x86/kvm/vmx/tdx.c > +++ b/arch/x86/kvm/vmx/tdx.c > @@ -2488,6 +2488,9 @@ static int setup_tdparams(struct kvm *kvm, struct td_params *td_params, > return -EOPNOTSUPP; > } > > + BUILD_BUG_ON(CONFIG_KVM_MAX_NR_VCPUS < > + sizeof(td_params->max_vcpus) * BITS_PER_BYTE); > + > td_params->max_vcpus = kvm->max_vcpus; > td_params->attributes = init_vm->attributes; > /* td_params->exec_controls = TDX_CONTROL_FLAG_NO_RBP_MOD; */ > Yeah the above could be helpful, but might not be necessary. So the logic of updated patch is: 1) During module loading time, we grab the maximum vCPUs that the TDX module can support: /kvm_vm_ioctl_enable_cap * TDX module may not support MD_FIELD_ID_MAX_VCPUS_PER_TD * depending on its version. 
*/ tdx_info->max_vcpus_per_td = U16_MAX; if (!tdx_sys_metadata_field_read(MD_FIELD_ID_MAX_VCPUS_PER_TD, &tmp)) tdx_info->max_vcpus_per_td = (u16)tmp; 2) When TDX guest is created, the userspace needs to call IOCTL(KVM_ENABLE_CAP) to configure the maximum vCPUs of the guest. A new kvm_x86_ops::vm_enable_cap() is added because TDX has it's own limitation (metadata field) as mentioned above. @@ -6827,6 +6829,8 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm, } default: r = -EINVAL; + if (kvm_x86_ops.vm_enable_cap) + r = static_call(kvm_x86_vm_enable_cap)(kvm, + cap); And we only allow the kvm->max_vcpus to be updated if it's a TDX guest in the vt_vm_enable_cap(). The reason is we want to avoid unnecessary change for normal VMX guests. And the new kvm->max_vcpus cannot exceed the KVM_MAX_VCPUS and the tdx_info->max_vcpus_per_td: + case KVM_CAP_MAX_VCPUS: { + if (cap->flags || cap->args[0] == 0) + return -EINVAL; + if (cap->args[0] > KVM_MAX_VCPUS || + cap->args[0] > tdx_info->max_vcpus_per_td) + return -E2BIG; + + mutex_lock(&kvm->lock); + if (kvm->created_vcpus) + r = -EBUSY; + else { + kvm->max_vcpus = cap->args[0]; + r = 0; + } + mutex_unlock(&kvm->lock); + break; + } 3) We just report kvm->max_vcpus when the userspace wants to check the KVM_CAP_MAX_VCPUS as shown in the beginning of my reply. Does this make sense to you? I am also pasting the new updated patch for your review (there are line wrapper issues unfortunately due to the simple copy/paste): From 797e439634d106f744517c97c5ea7887e494fc44 Mon Sep 17 00:00:00 2001 From: Isaku Yamahata <isaku.yamahata@intel.com> Date: Thu, 16 Feb 2023 17:03:40 -0800 Subject: [PATCH] KVM: TDX: Allow userspace to configure maximum vCPUs for TDX guests TDX has its own mechanism to control the maximum number of vCPUs that the TDX guest can use. When creating a TDX guest, the maximum number of vCPUs of the guest needs to be passed to the TDX module as part of the measurement of the guest. Depending on TDX module's version, it may also report the maximum vCPUs it can support for all TDX guests. Because the maximum number of vCPUs is part of the measurement, thus part of attestation, it's better to allow the userspace to be able to configure it. E.g. the users may want to precisely control the maximum number of vCPUs their precious VMs can use. The actual control itself must be done via the TDH.MNG.INIT SEAMCALL, where the number of maximum cpus is part of the input to the TDX module, but KVM needs to support the "per-VM maximum number of vCPUs" and reflect that in the KVM_CAP_MAX_VCPUS. Currently, the KVM x86 always reports KVM_MAX_VCPUS for all VMs but doesn't allow to enable KVM_CAP_MAX_VCPUS to configure the number of maximum vCPUs on VM-basis. Add "per-VM maximum number of vCPUs" to KVM x86/TDX to accommodate TDX's needs. Specifically, use KVM's existing KVM_ENABLE_CAP IOCTL() to allow the userspace to configure the maximum vCPUs by making KVM x86 support enabling the KVM_CAP_MAX_VCPUS cap on VM-basis. For that, add a new 'kvm_x86_ops::vm_enable_cap()' callback and call it from kvm_vm_ioctl_enable_cap() as a placeholder to handle the KVM_CAP_MAX_VCPUS for TDX guests (and other KVM_CAP_xx for TDX and/or other VMs if needed in the future). Implement the callback for TDX guest to check whether the maximum vCPUs passed from usrspace can be supported by TDX, and if it can, override Accordingly, in the KVM_CHECK_EXTENSION IOCTL(), change to return the 'struct kvm::max_vcpus' for a given VM for the KVM_CAP_MAX_VCPUS. 
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> --- v20: - Drop max_vcpu ops to use kvm.max_vcpus - Remove TDX_MAX_VCPUS (Kai) - Use type cast (u16) instead of calling memcpy() when reading the 'max_vcpus_per_td' (Kai) - Improve change log and change patch title from "KVM: TDX: Make KVM_CAP_MAX_VCPUS backend specific" (Kai) --- arch/x86/include/asm/kvm-x86-ops.h | 1 + arch/x86/include/asm/kvm_host.h | 1 + arch/x86/kvm/vmx/main.c | 10 ++++++++ arch/x86/kvm/vmx/tdx.c | 40 ++++++++++++++++++++++++++++++ arch/x86/kvm/vmx/x86_ops.h | 5 ++++ arch/x86/kvm/x86.c | 4 +++ 6 files changed, 61 insertions(+) diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h index bcb8302561f2..022b9eace3a5 100644 --- a/arch/x86/include/asm/kvm-x86-ops.h +++ b/arch/x86/include/asm/kvm-x86-ops.h @@ -20,6 +20,7 @@ KVM_X86_OP(hardware_disable) KVM_X86_OP(hardware_unsetup) KVM_X86_OP(has_emulated_msr) KVM_X86_OP(vcpu_after_set_cpuid) +KVM_X86_OP_OPTIONAL(vm_enable_cap) KVM_X86_OP(vm_init) KVM_X86_OP_OPTIONAL(vm_destroy) KVM_X86_OP_OPTIONAL_RET0(vcpu_precreate) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index c461c2e57fcb..1d10e3d29533 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1639,6 +1639,7 @@ struct kvm_x86_ops { void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu); unsigned int vm_size; + int (*vm_enable_cap)(struct kvm *kvm, struct kvm_enable_cap *cap); int (*vm_init)(struct kvm *kvm); void (*vm_destroy)(struct kvm *kvm); diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c index 8e4aa8d15aec..686ca6348993 100644 --- a/arch/x86/kvm/vmx/main.c +++ b/arch/x86/kvm/vmx/main.c @@ -6,6 +6,7 @@ #include "nested.h" #include "pmu.h" #include "tdx.h" +#include "tdx_arch.h" static bool enable_tdx __ro_after_init; module_param_named(tdx, enable_tdx, bool, 0444); @@ -33,6 +34,14 @@ static void vt_hardware_unsetup(void) vmx_hardware_unsetup(); } +static int vt_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap) +{ + if (is_td(kvm)) + return tdx_vm_enable_cap(kvm, cap); + + return -EINVAL; +} + static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp) { if (!is_td(kvm)) @@ -63,6 +72,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = { .has_emulated_msr = vmx_has_emulated_msr, .vm_size = sizeof(struct kvm_vmx), + .vm_enable_cap = vt_vm_enable_cap, .vm_init = vmx_vm_init, .vm_destroy = vmx_vm_destroy, diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c index c7d849582d44..cdfc95904d6c 100644 --- a/arch/x86/kvm/vmx/tdx.c +++ b/arch/x86/kvm/vmx/tdx.c @@ -34,6 +34,8 @@ struct tdx_info { u64 xfam_fixed0; u64 xfam_fixed1; + u16 max_vcpus_per_td; + u16 num_cpuid_config; /* This must the last member. */ DECLARE_FLEX_ARRAY(struct kvm_tdx_cpuid_config, cpuid_configs); @@ -42,6 +44,35 @@ struct tdx_info { /* Info about the TDX module. 
*/ static struct tdx_info *tdx_info; +int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap) +{ + int r; + + switch (cap->cap) { + case KVM_CAP_MAX_VCPUS: { + if (cap->flags || cap->args[0] == 0) + return -EINVAL; + if (cap->args[0] > KVM_MAX_VCPUS || + cap->args[0] > tdx_info->max_vcpus_per_td) + return -E2BIG; + + mutex_lock(&kvm->lock); + if (kvm->created_vcpus) + r = -EBUSY; + else { + kvm->max_vcpus = cap->args[0]; + r = 0; + } + mutex_unlock(&kvm->lock); + break; + } + default: + r = -EINVAL; + break; + } + return r; +} + static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd) { struct kvm_tdx_capabilities __user *user_caps; @@ -129,6 +160,7 @@ static int __init tdx_module_setup(void) u16 num_cpuid_config; /* More member will come. */ } st; + u64 tmp; int ret; u32 i; @@ -167,6 +199,14 @@ static int __init tdx_module_setup(void) return -ENOMEM; tdx_info->num_cpuid_config = st.num_cpuid_config; + /* + * TDX module may not support MD_FIELD_ID_MAX_VCPUS_PER_TD depending + * on its version. + */ + tdx_info->max_vcpus_per_td = U16_MAX; + if (!tdx_sys_metadata_field_read(MD_FIELD_ID_MAX_VCPUS_PER_TD, &tmp)) + tdx_info->max_vcpus_per_td = (u16)tmp; + ret = tdx_sys_metadata_read(fields, ARRAY_SIZE(fields), tdx_info); ret = tdx_sys_metadata_read(fields, ARRAY_SIZE(fields), tdx_info); if (ret) goto error_out; diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h index 9bc287a7efac..7c768e360bc6 100644 --- a/arch/x86/kvm/vmx/x86_ops.h +++ b/arch/x86/kvm/vmx/x86_ops.h @@ -139,11 +139,16 @@ void vmx_setup_mce(struct kvm_vcpu *vcpu); int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops); void tdx_hardware_unsetup(void); +int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap); int tdx_vm_ioctl(struct kvm *kvm, void __user *argp); #else static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return -EOPNOTSUPP; } static inline void tdx_hardware_unsetup(void) {} +static inline int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap) +{ + return -EINVAL; +}; static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; } #endif diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index ee8288a46d30..97ed4fe25964 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4776,6 +4776,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) break; case KVM_CAP_MAX_VCPUS: r = KVM_MAX_VCPUS; + if (kvm) + r = kvm->max_vcpus; break; case KVM_CAP_MAX_VCPU_ID: r = KVM_MAX_VCPU_IDS; @@ -6827,6 +6829,8 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm, } default: r = -EINVAL; + if (kvm_x86_ops.vm_enable_cap) + r = static_call(kvm_x86_vm_enable_cap)(kvm, cap); break; } return r; -- 2.34.1
> > Implement the callback for TDX guest to check whether the maximum vCPUs > passed from usrspace can be supported by TDX, and if it can, override > Accordingly, in the KVM_CHECK_EXTENSION IOCTL(), change to return the > 'struct kvm::max_vcpus' for a given VM for the KVM_CAP_MAX_VCPUS. Sorry, some text got lost during copy/paste. The above should be: Implement the callback for TDX guests to check whether the maximum vCPUs passed from userspace can be supported by TDX, and if it can, override the 'struct kvm::max_vcpus'. Leave VMX guests and all AMD guests unsupported to avoid any side effects for those VMs. Accordingly, in the KVM_CHECK_EXTENSION IOCTL(), change to return the 'struct kvm::max_vcpus' for a given VM for the KVM_CAP_MAX_VCPUS.
On Fri, May 10, 2024, Kai Huang wrote: > On 10/05/2024 4:35 am, Sean Christopherson wrote: > > KVM x86 limits KVM_MAX_VCPUS to 4096: > > > > config KVM_MAX_NR_VCPUS > > int "Maximum number of vCPUs per KVM guest" > > depends on KVM > > range 1024 4096 > > default 4096 if MAXSMP > > default 1024 > > help > > > > whereas the limitation from TDX is apprarently simply due to TD_PARAMS taking > > a 16-bit unsigned value: > > > > #define TDX_MAX_VCPUS (~(u16)0) > > > > i.e. it will likely be _years_ before TDX's limitation matters, if it ever does. > > And _if_ it becomes a problem, we don't necessarily need to have a different > > _runtime_ limit for TDX, e.g. TDX support could be conditioned on KVM_MAX_NR_VCPUS > > being <= 64k. > > Actually later versions of TDX module (starting from 1.5 AFAICT), the module > has a metadata field to report the maximum vCPUs that the module can support > for all TDX guests. My quick glance at the 1.5 source shows that the limit is still effectively 0xffff, so again, who cares? Assert on 0xffff compile time, and on the reported max at runtime and simply refuse to use a TDX module that has dropped the minimum below 0xffff. > And we only allow the kvm->max_vcpus to be updated if it's a TDX guest in > the vt_vm_enable_cap(). The reason is we want to avoid unnecessary change > for normal VMX guests. That's a frankly ridiculous reason to bury code in TDX. Nothing is _forcing_ userspace to set KVM_CAP_MAX_VCPUS, i.e. there won't be any change to VMX VMs unless userspace _wants_ there to be a change.
On 10/05/2024 10:52 am, Sean Christopherson wrote: > On Fri, May 10, 2024, Kai Huang wrote: >> On 10/05/2024 4:35 am, Sean Christopherson wrote: >>> KVM x86 limits KVM_MAX_VCPUS to 4096: >>> >>> config KVM_MAX_NR_VCPUS >>> int "Maximum number of vCPUs per KVM guest" >>> depends on KVM >>> range 1024 4096 >>> default 4096 if MAXSMP >>> default 1024 >>> help >>> >>> whereas the limitation from TDX is apprarently simply due to TD_PARAMS taking >>> a 16-bit unsigned value: >>> >>> #define TDX_MAX_VCPUS (~(u16)0) >>> >>> i.e. it will likely be _years_ before TDX's limitation matters, if it ever does. >>> And _if_ it becomes a problem, we don't necessarily need to have a different >>> _runtime_ limit for TDX, e.g. TDX support could be conditioned on KVM_MAX_NR_VCPUS >>> being <= 64k. >> >> Actually later versions of TDX module (starting from 1.5 AFAICT), the module >> has a metadata field to report the maximum vCPUs that the module can support >> for all TDX guests. > > My quick glance at the 1.5 source shows that the limit is still effectively > 0xffff, so again, who cares? Assert on 0xffff compile time, and on the reported > max at runtime and simply refuse to use a TDX module that has dropped the minimum > below 0xffff. I need to double check why this metadata field was added. My concern is in future module versions they may just low down the value. But another way to handle is we can adjust code when that truly happens? Might not be ideal for stable kernel situation, though? > >> And we only allow the kvm->max_vcpus to be updated if it's a TDX guest in >> the vt_vm_enable_cap(). The reason is we want to avoid unnecessary change >> for normal VMX guests. > > That's a frankly ridiculous reason to bury code in TDX. Nothing is _forcing_ > userspace to set KVM_CAP_MAX_VCPUS, i.e. there won't be any change to VMX VMs > unless userspace _wants_ there to be a change. Right. Anyway allowing userspace to set KVM_CAP_MAX_VCPUS for non-TDX guests shouldn't have any issue. The main reason to bury code in TDX is it needs to additionally check tdx_info->max_vcpus_per_td. We can just do in common code if we avoid that TDX specific check.
On Fri, May 10, 2024 at 11:19:44AM +1200, "Huang, Kai" <kai.huang@intel.com> wrote: > > > On 10/05/2024 10:52 am, Sean Christopherson wrote: > > On Fri, May 10, 2024, Kai Huang wrote: > > > On 10/05/2024 4:35 am, Sean Christopherson wrote: > > > > KVM x86 limits KVM_MAX_VCPUS to 4096: > > > > > > > > config KVM_MAX_NR_VCPUS > > > > int "Maximum number of vCPUs per KVM guest" > > > > depends on KVM > > > > range 1024 4096 > > > > default 4096 if MAXSMP > > > > default 1024 > > > > help > > > > > > > > whereas the limitation from TDX is apprarently simply due to TD_PARAMS taking > > > > a 16-bit unsigned value: > > > > > > > > #define TDX_MAX_VCPUS (~(u16)0) > > > > > > > > i.e. it will likely be _years_ before TDX's limitation matters, if it ever does. > > > > And _if_ it becomes a problem, we don't necessarily need to have a different > > > > _runtime_ limit for TDX, e.g. TDX support could be conditioned on KVM_MAX_NR_VCPUS > > > > being <= 64k. > > > > > > Actually later versions of TDX module (starting from 1.5 AFAICT), the module > > > has a metadata field to report the maximum vCPUs that the module can support > > > for all TDX guests. > > > > My quick glance at the 1.5 source shows that the limit is still effectively > > 0xffff, so again, who cares? Assert on 0xffff compile time, and on the reported > > max at runtime and simply refuse to use a TDX module that has dropped the minimum > > below 0xffff. > > I need to double check why this metadata field was added. My concern is in > future module versions they may just low down the value. TD partitioning would reduce it much. > But another way to handle is we can adjust code when that truly happens? > Might not be ideal for stable kernel situation, though? > > > > And we only allow the kvm->max_vcpus to be updated if it's a TDX guest in > > > the vt_vm_enable_cap(). The reason is we want to avoid unnecessary change > > > for normal VMX guests. > > > > That's a frankly ridiculous reason to bury code in TDX. Nothing is _forcing_ > > userspace to set KVM_CAP_MAX_VCPUS, i.e. there won't be any change to VMX VMs > > unless userspace _wants_ there to be a change. > > Right. Anyway allowing userspace to set KVM_CAP_MAX_VCPUS for non-TDX > guests shouldn't have any issue. > > The main reason to bury code in TDX is it needs to additionally check > tdx_info->max_vcpus_per_td. We can just do in common code if we avoid that > TDX specific check. So we can make it arch-independent. When creating VM, we can set kvm->max_vcpu = tdx_info->max_vcpus_per_td by tdx_vm_init(). The check can be common like "if (new max_vcpu > kvm->max_vcpu) error". Or we can add kvm->hard_max_vcpu or something, arch-common code can have "if (kvm->hard_max_vcpu && new max_vcpu > kvm->hard_max_vcpu) error".
On Thu, May 09, 2024, Isaku Yamahata wrote: > On Fri, May 10, 2024 at 11:19:44AM +1200, Kai Huang <kai.huang@intel.com> wrote: > > On 10/05/2024 10:52 am, Sean Christopherson wrote: > > > On Fri, May 10, 2024, Kai Huang wrote: > > > > On 10/05/2024 4:35 am, Sean Christopherson wrote: > > > > > KVM x86 limits KVM_MAX_VCPUS to 4096: > > > > > > > > > > config KVM_MAX_NR_VCPUS > > > > > int "Maximum number of vCPUs per KVM guest" > > > > > depends on KVM > > > > > range 1024 4096 > > > > > default 4096 if MAXSMP > > > > > default 1024 > > > > > help > > > > > > > > > > whereas the limitation from TDX is apprarently simply due to TD_PARAMS taking > > > > > a 16-bit unsigned value: > > > > > > > > > > #define TDX_MAX_VCPUS (~(u16)0) > > > > > > > > > > i.e. it will likely be _years_ before TDX's limitation matters, if it ever does. > > > > > And _if_ it becomes a problem, we don't necessarily need to have a different > > > > > _runtime_ limit for TDX, e.g. TDX support could be conditioned on KVM_MAX_NR_VCPUS > > > > > being <= 64k. > > > > > > > > Actually later versions of TDX module (starting from 1.5 AFAICT), the module > > > > has a metadata field to report the maximum vCPUs that the module can support > > > > for all TDX guests. > > > > > > My quick glance at the 1.5 source shows that the limit is still effectively > > > 0xffff, so again, who cares? Assert on 0xffff compile time, and on the reported > > > max at runtime and simply refuse to use a TDX module that has dropped the minimum > > > below 0xffff. > > > > I need to double check why this metadata field was added. My concern is in > > future module versions they may just low down the value. > > TD partitioning would reduce it much. That's still not a reason to plumb in what is effectively dead code. Either partitioning is opt-in, at which I suspect KVM will need yet more uAPI to express the limitations to userspace, or the TDX-module is potentially breaking existing use cases.
On 11/05/2024 2:04 am, Sean Christopherson wrote: > On Thu, May 09, 2024, Isaku Yamahata wrote: >> On Fri, May 10, 2024 at 11:19:44AM +1200, Kai Huang <kai.huang@intel.com> wrote: >>> On 10/05/2024 10:52 am, Sean Christopherson wrote: >>>> On Fri, May 10, 2024, Kai Huang wrote: >>>>> On 10/05/2024 4:35 am, Sean Christopherson wrote: >>>>>> KVM x86 limits KVM_MAX_VCPUS to 4096: >>>>>> >>>>>> config KVM_MAX_NR_VCPUS >>>>>> int "Maximum number of vCPUs per KVM guest" >>>>>> depends on KVM >>>>>> range 1024 4096 >>>>>> default 4096 if MAXSMP >>>>>> default 1024 >>>>>> help >>>>>> >>>>>> whereas the limitation from TDX is apprarently simply due to TD_PARAMS taking >>>>>> a 16-bit unsigned value: >>>>>> >>>>>> #define TDX_MAX_VCPUS (~(u16)0) >>>>>> >>>>>> i.e. it will likely be _years_ before TDX's limitation matters, if it ever does. >>>>>> And _if_ it becomes a problem, we don't necessarily need to have a different >>>>>> _runtime_ limit for TDX, e.g. TDX support could be conditioned on KVM_MAX_NR_VCPUS >>>>>> being <= 64k. >>>>> >>>>> Actually later versions of TDX module (starting from 1.5 AFAICT), the module >>>>> has a metadata field to report the maximum vCPUs that the module can support >>>>> for all TDX guests. >>>> >>>> My quick glance at the 1.5 source shows that the limit is still effectively >>>> 0xffff, so again, who cares? Assert on 0xffff compile time, and on the reported >>>> max at runtime and simply refuse to use a TDX module that has dropped the minimum >>>> below 0xffff. >>> >>> I need to double check why this metadata field was added. My concern is in >>> future module versions they may just low down the value. >> >> TD partitioning would reduce it much. > > That's still not a reason to plumb in what is effectively dead code. Either > partitioning is opt-in, at which I suspect KVM will need yet more uAPI to express > the limitations to userspace, or the TDX-module is potentially breaking existing > use cases. The 'max_vcpus_per_td' global metadata fields is static for the TDX module. If the module supports the TD partitioning, it just reports some smaller value regardless whether we opt-in TDX partitioning or not. I think the point is this 'max_vcpus_per_td' is TDX architectural thing and kernel should not make any assumption of the value of it. The architectural behaviour is: If the module has this 'max_vcpus_per_td', software should read and use it; Otherwise software should treat it as U16_MAX. Thus I don't think we will need a new uAPI (TDX specific presumably) just for TD partitioning. And this doesn't break existing use cases. In fact, this doesn't prevent us from making the KVM_CAP_MAX_VCPUS code generic, e.g., we can do below: 1) In tdx_vm_init() (called via KVM_VM_CREATE -> vt_vm_init()), we do: kvm->max_vcpus = min(kvm->max_vcpus, tdx_info->max_vcpus_per_td); 2) In kvm_vm_ioctl_enable_cap_generic(), we add support to handle KVM_CAP_MAX_VCPUS to have the generic code to do: if (new_max_vcpus > kvm->max_vcpus) return -EINVAL; kvm->max_vcpus = new_max_vcpus; However this means we only allow "lowing down" the kvm->max_vcpus in the kvm_vm_ioctl_enable_cap_generic(KVM_CAP_MAX_VCPUS), but I think this is acceptable? If it is a concern, alternatively, we can add a new 'kvm->hard_max_vcpus' (or whatever makes sense), and set it in kvm_create_vm() right after kvm_arch_init_vm(): r = kvm_arch_init_vm(kvm, type); if (r) goto out_err_no_arch_destroy_vm; kvm->hard_max_vcpus = kvm->max_vcpus; So it always contains "the max_vcpus limited by the ARCH hardware/firmware etc". 
And in kvm_vm_ioctl_enable_cap_generic(), we check against kvm->hard_max_vcpus instead to get rid of the limitation of only allowing lowering the kvm->max_vcpus. But I don't think this is necessary at this stage.
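Roughly, the generic handling sketched in the last two messages could look like the following standalone model (a sketch only; struct kvm_model, vm_init() and enable_cap_max_vcpus() are illustrative names, and 576 is the TDX module limit mentioned later in the thread): the arch/TDX code seeds both the per-VM limit and a hard ceiling at VM creation, and the generic KVM_CAP_MAX_VCPUS handler only accepts values within that ceiling and before any vCPU exists.

#include <errno.h>
#include <stdio.h>

struct kvm_model {
	int max_vcpus;		/* current per-VM limit, reported via KVM_CAP_MAX_VCPUS */
	int hard_max_vcpus;	/* ceiling fixed at VM creation (arch/TDX limit) */
	int created_vcpus;
};

/* Arch/TDX init: seed both values, e.g. min(KVM_MAX_VCPUS, max_vcpus_per_td). */
static void vm_init(struct kvm_model *kvm, int arch_limit)
{
	kvm->max_vcpus = arch_limit;
	kvm->hard_max_vcpus = arch_limit;
	kvm->created_vcpus = 0;
}

/* Generic KVM_ENABLE_CAP(KVM_CAP_MAX_VCPUS) handler. */
static int enable_cap_max_vcpus(struct kvm_model *kvm, int new_max)
{
	if (new_max <= 0 || new_max > kvm->hard_max_vcpus)
		return -EINVAL;
	if (kvm->created_vcpus)
		return -EBUSY;	/* too late once vCPUs exist */
	kvm->max_vcpus = new_max;
	return 0;
}

int main(void)
{
	struct kvm_model kvm;

	vm_init(&kvm, 576);	/* e.g. a TDX module's max_vcpus_per_td */
	printf("lower to 8:    %d\n", enable_cap_max_vcpus(&kvm, 8));
	printf("raise to 1024: %d\n", enable_cap_max_vcpus(&kvm, 1024));
	return 0;
}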
On Tue, May 14, 2024, Kai Huang wrote: > > > On 11/05/2024 2:04 am, Sean Christopherson wrote: > > On Thu, May 09, 2024, Isaku Yamahata wrote: > > > On Fri, May 10, 2024 at 11:19:44AM +1200, Kai Huang <kai.huang@intel.com> wrote: > > > > On 10/05/2024 10:52 am, Sean Christopherson wrote: > > > > > On Fri, May 10, 2024, Kai Huang wrote: > > > > > > On 10/05/2024 4:35 am, Sean Christopherson wrote: > > > > > > > KVM x86 limits KVM_MAX_VCPUS to 4096: > > > > > > > > > > > > > > config KVM_MAX_NR_VCPUS > > > > > > > int "Maximum number of vCPUs per KVM guest" > > > > > > > depends on KVM > > > > > > > range 1024 4096 > > > > > > > default 4096 if MAXSMP > > > > > > > default 1024 > > > > > > > help > > > > > > > > > > > > > > whereas the limitation from TDX is apprarently simply due to TD_PARAMS taking > > > > > > > a 16-bit unsigned value: > > > > > > > > > > > > > > #define TDX_MAX_VCPUS (~(u16)0) > > > > > > > > > > > > > > i.e. it will likely be _years_ before TDX's limitation matters, if it ever does. > > > > > > > And _if_ it becomes a problem, we don't necessarily need to have a different > > > > > > > _runtime_ limit for TDX, e.g. TDX support could be conditioned on KVM_MAX_NR_VCPUS > > > > > > > being <= 64k. > > > > > > > > > > > > Actually later versions of TDX module (starting from 1.5 AFAICT), the module > > > > > > has a metadata field to report the maximum vCPUs that the module can support > > > > > > for all TDX guests. > > > > > > > > > > My quick glance at the 1.5 source shows that the limit is still effectively > > > > > 0xffff, so again, who cares? Assert on 0xffff compile time, and on the reported > > > > > max at runtime and simply refuse to use a TDX module that has dropped the minimum > > > > > below 0xffff. > > > > > > > > I need to double check why this metadata field was added. My concern is in > > > > future module versions they may just low down the value. > > > > > > TD partitioning would reduce it much. > > > > That's still not a reason to plumb in what is effectively dead code. Either > > partitioning is opt-in, at which I suspect KVM will need yet more uAPI to express > > the limitations to userspace, or the TDX-module is potentially breaking existing > > use cases. > > The 'max_vcpus_per_td' global metadata fields is static for the TDX module. > If the module supports the TD partitioning, it just reports some smaller > value regardless whether we opt-in TDX partitioning or not. > > I think the point is this 'max_vcpus_per_td' is TDX architectural thing and > kernel should not make any assumption of the value of it. It's not an assumption, it's a requirement. And KVM already places requirements on "hardware", e.g. kvm-intel.ko will refuse to load if the CPU doesn't support a bare mimimum VMX feature set. Refusing to enable TDX because max_vcpus_per_td is unexpectedly low isn't fundamentally different than refusing to enable VMX because IRQ window exiting is unsupported. In the unlikely event there is a legitimate reason for max_vcpus_per_td being less than KVM's minimum, then we can update KVM's minimum as needed. But AFAICT, that's purely theoretical at this point, i.e. this is all much ado about nothing.
On Wed, 2024-05-29 at 16:15 -0700, Sean Christopherson wrote: > On Tue, May 14, 2024, Kai Huang wrote: > > > > > > On 11/05/2024 2:04 am, Sean Christopherson wrote: > > > On Thu, May 09, 2024, Isaku Yamahata wrote: > > > > On Fri, May 10, 2024 at 11:19:44AM +1200, Kai Huang <kai.huang@intel.com> wrote: > > > > > On 10/05/2024 10:52 am, Sean Christopherson wrote: > > > > > > On Fri, May 10, 2024, Kai Huang wrote: > > > > > > > On 10/05/2024 4:35 am, Sean Christopherson wrote: > > > > > > > > KVM x86 limits KVM_MAX_VCPUS to 4096: > > > > > > > > > > > > > > > > config KVM_MAX_NR_VCPUS > > > > > > > > int "Maximum number of vCPUs per KVM guest" > > > > > > > > depends on KVM > > > > > > > > range 1024 4096 > > > > > > > > default 4096 if MAXSMP > > > > > > > > default 1024 > > > > > > > > help > > > > > > > > > > > > > > > > whereas the limitation from TDX is apprarently simply due to TD_PARAMS taking > > > > > > > > a 16-bit unsigned value: > > > > > > > > > > > > > > > > #define TDX_MAX_VCPUS (~(u16)0) > > > > > > > > > > > > > > > > i.e. it will likely be _years_ before TDX's limitation matters, if it ever does. > > > > > > > > And _if_ it becomes a problem, we don't necessarily need to have a different > > > > > > > > _runtime_ limit for TDX, e.g. TDX support could be conditioned on KVM_MAX_NR_VCPUS > > > > > > > > being <= 64k. > > > > > > > > > > > > > > Actually later versions of TDX module (starting from 1.5 AFAICT), the module > > > > > > > has a metadata field to report the maximum vCPUs that the module can support > > > > > > > for all TDX guests. > > > > > > > > > > > > My quick glance at the 1.5 source shows that the limit is still effectively > > > > > > 0xffff, so again, who cares? Assert on 0xffff compile time, and on the reported > > > > > > max at runtime and simply refuse to use a TDX module that has dropped the minimum > > > > > > below 0xffff. > > > > > > > > > > I need to double check why this metadata field was added. My concern is in > > > > > future module versions they may just low down the value. > > > > > > > > TD partitioning would reduce it much. > > > > > > That's still not a reason to plumb in what is effectively dead code. Either > > > partitioning is opt-in, at which I suspect KVM will need yet more uAPI to express > > > the limitations to userspace, or the TDX-module is potentially breaking existing > > > use cases. > > > > The 'max_vcpus_per_td' global metadata fields is static for the TDX module. > > If the module supports the TD partitioning, it just reports some smaller > > value regardless whether we opt-in TDX partitioning or not. > > > > I think the point is this 'max_vcpus_per_td' is TDX architectural thing and > > kernel should not make any assumption of the value of it. > > It's not an assumption, it's a requirement. And KVM already places requirements > on "hardware", e.g. kvm-intel.ko will refuse to load if the CPU doesn't support > a bare mimimum VMX feature set. Refusing to enable TDX because max_vcpus_per_td > is unexpectedly low isn't fundamentally different than refusing to enable VMX > because IRQ window exiting is unsupported. OK. I have no argument against this. But I am not sure why we need to have such requirement. See below. > > In the unlikely event there is a legitimate reason for max_vcpus_per_td being > less than KVM's minimum, then we can update KVM's minimum as needed. But AFAICT, > that's purely theoretical at this point, i.e. this is all much ado about nothing. I am afraid we already have a legitimate case: TD partitioning. 
Isaku told me the 'max_vcpus_per_td' is lowered to 512 for the modules with TD partitioning supported. And again this is static, i.e., it doesn't require TD partitioning to be opted in for the value to drop to 512. So AFAICT this isn't a theoretical thing now. Also, I want to say I was wrong about "MAX_VCPUS" in TD_PARAMS being part of attestation. It is not. TDREPORT doesn't include the "MAX_VCPUS", and it is not involved in the calculation of the measurement of the guest. Given "MAX_VCPUS" is not part of attestation, I think there's no need to allow the user to change kvm->max_vcpus by enabling the KVM_ENABLE_CAP ioctl() for KVM_CAP_MAX_VCPUS. So we could just once and for all adjust kvm->max_vcpus for TDX in the tdx_vm_init() for the TDX guest: kvm->max_vcpus = min(kvm->max_vcpus, tdx_info->max_vcpus_per_td); AFAICT no other change is needed. And in KVM_TDX_VM_INIT (where TDH.MNG.INIT is done) we can just use kvm->max_vcpus to fill the "MAX_VCPUS" in TD_PARAMS.
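The simplification proposed here boils down to two steps with no KVM_ENABLE_CAP involvement at all, which can be modelled standalone as below (a sketch only; tdx_vm_init_model(), td_params_fill() and the structs are illustrative, and 576 is the module limit quoted later in the thread): clamp the per-VM limit once at TDX VM creation, then copy it into the MAX_VCPUS field of TD_PARAMS at TDH.MNG.INIT time.

#include <stdint.h>
#include <stdio.h>

#define KVM_MAX_VCPUS 1024	/* illustrative default */

struct kvm_model { int max_vcpus; };
struct td_params_model { uint16_t max_vcpus; };

/* Step 1: at TDX VM creation, clamp to the module's limit. */
static void tdx_vm_init_model(struct kvm_model *kvm, uint16_t max_vcpus_per_td)
{
	kvm->max_vcpus = KVM_MAX_VCPUS;
	if (kvm->max_vcpus > max_vcpus_per_td)
		kvm->max_vcpus = max_vcpus_per_td;
}

/* Step 2: at TDH.MNG.INIT, the clamped value fills TD_PARAMS.MAX_VCPUS. */
static void td_params_fill(struct td_params_model *td, const struct kvm_model *kvm)
{
	td->max_vcpus = (uint16_t)kvm->max_vcpus;
}

int main(void)
{
	struct kvm_model kvm;
	struct td_params_model td;

	tdx_vm_init_model(&kvm, 576);
	td_params_fill(&td, &kvm);
	printf("TD_PARAMS.MAX_VCPUS = %u\n", (unsigned)td.max_vcpus);
	return 0;
}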
On Thu, May 30, 2024, Kai Huang wrote: > On Wed, 2024-05-29 at 16:15 -0700, Sean Christopherson wrote: > > In the unlikely event there is a legitimate reason for max_vcpus_per_td being > > less than KVM's minimum, then we can update KVM's minimum as needed. But AFAICT, > > that's purely theoretical at this point, i.e. this is all much ado about nothing. > > I am afraid we already have a legitimate case: TD partitioning. Isaku > told me the 'max_vcpus_per_td' is lowed to 512 for the modules with TD > partitioning supported. And again this is static, i.e., doesn't require > TD partitioning to be opt-in to low to 512. So what's Intel's plan for use cases that creates TDs with >512 vCPUs? > So AFAICT this isn't a theoretical thing now. > > Also, I want to say I was wrong about "MAX_VCPUS" in the TD_PARAMS is part > of attestation. It is not. TDREPORT dosen't include the "MAX_VCPUS", and > it is not involved in the calculation of the measurement of the guest. > > Given "MAX_VCPUS" is not part of attestation, I think there's no need to > allow user to change kvm->max_vcpus by enabling KVM_ENABLE_CAP ioctl() for > KVM_CAP_MAX_VCPUS. Sure, but KVM would still need to advertise the reduced value for KVM_CAP_MAX_VCPUS when queried via KVM_CHECK_EXTENSION. And userspace needs to be conditioned to do a VM-scoped check, not a system-scoped check. > So we could just once for all adjust kvm->max_vcpus for TDX in the > tdx_vm_init() for TDX guest: > > kvm->max_vcpus = min(kvm->max_vcpus, tdx_info->max_vcpus_per_td); > > AFAICT no other change is needed. > > And in KVM_TDX_VM_INIT (where TDH.MNG.INIT is done) we can just use kvm- > >max_vcpus to fill the "MAX_VCPUS" in TD_PARAMS.
On Thu, 2024-05-30 at 16:12 -0700, Sean Christopherson wrote: > On Thu, May 30, 2024, Kai Huang wrote: > > On Wed, 2024-05-29 at 16:15 -0700, Sean Christopherson wrote: > > > In the unlikely event there is a legitimate reason for max_vcpus_per_td being > > > less than KVM's minimum, then we can update KVM's minimum as needed. But AFAICT, > > > that's purely theoretical at this point, i.e. this is all much ado about nothing. > > > > I am afraid we already have a legitimate case: TD partitioning. Isaku > > told me the 'max_vcpus_per_td' is lowed to 512 for the modules with TD > > partitioning supported. And again this is static, i.e., doesn't require > > TD partitioning to be opt-in to low to 512. > > So what's Intel's plan for use cases that creates TDs with >512 vCPUs? I don't think we have such use cases. Let me double check with TDX module guys. > > > So AFAICT this isn't a theoretical thing now. > > > > Also, I want to say I was wrong about "MAX_VCPUS" in the TD_PARAMS is part > > of attestation. It is not. TDREPORT dosen't include the "MAX_VCPUS", and > > it is not involved in the calculation of the measurement of the guest. > > > > Given "MAX_VCPUS" is not part of attestation, I think there's no need to > > allow user to change kvm->max_vcpus by enabling KVM_ENABLE_CAP ioctl() for > > KVM_CAP_MAX_VCPUS. > > Sure, but KVM would still need to advertise the reduced value for KVM_CAP_MAX_VCPUS > when queried via KVM_CHECK_EXTENSION. And userspace needs to be conditioned to > do a VM-scoped check, not a system-scoped check. Oh yes.
On Thu, 2024-05-30 at 16:12 -0700, Sean Christopherson wrote: > On Thu, May 30, 2024, Kai Huang wrote: > > On Wed, 2024-05-29 at 16:15 -0700, Sean Christopherson wrote: > > > In the unlikely event there is a legitimate reason for max_vcpus_per_td being > > > less than KVM's minimum, then we can update KVM's minimum as needed. But AFAICT, > > > that's purely theoretical at this point, i.e. this is all much ado about nothing. > > > > I am afraid we already have a legitimate case: TD partitioning. Isaku > > told me the 'max_vcpus_per_td' is lowed to 512 for the modules with TD > > partitioning supported. And again this is static, i.e., doesn't require > > TD partitioning to be opt-in to low to 512. > > So what's Intel's plan for use cases that creates TDs with >512 vCPUs? I checked with TDX module guys. Turns out the 'max_vcpus_per_td' wasn't introduced because of TD partitioning, and they are not actually related. They introduced this to support "topology virtualization", which requires a table to record the X2APIC IDs for all vcpus for each TD. In practice, given a TDX module, the 'max_vcpus_per_td', a.k.a, the X2APIC ID table size reflects the physical logical cpus that *ALL* platforms that the module supports can possibly have. The reason of this design is TDX guys don't believe there's sense in supporting the case where the 'max_vcpus' for one single TD needs to exceed the physical logical cpus. So in short: - The "max_vcpus_per_td" can be different depending on module versions. In practice it reflects the maximum physical logical cpus that all the platforms (that the module supports) can possibly have. - Before CSPs deploy/migrate TD on a TDX machine, they must be aware of the "max_vcpus_per_td" the module supports, and only deploy/migrate TD to it when it can support. - For TDX 1.5.xx modules, the value is 576 (the previous number 512 isn't correct); For TDX 2.0.xx modules, the value is larger (>1000). For future module versions, it could have a smaller number, depending on what platforms that module needs to support. Also, if TDX ever gets supported on client platforms, we can image the number could be much smaller due to the "vcpus per td no need to exceed physical logical cpus". We may ask them to support the case where 'max_vcpus' for single TD exceeds the physical logical cpus, or at least not to low down the value any further for future modules (> 2.0.xx modules). We may also ask them to give promise to not low the number to below some certain value for any future modules. But I am not sure there's any concrete reason to do so? What's your thinking?
On Tue, 2024-06-04 at 10:48 +0000, Huang, Kai wrote: > On Thu, 2024-05-30 at 16:12 -0700, Sean Christopherson wrote: > > On Thu, May 30, 2024, Kai Huang wrote: > > > On Wed, 2024-05-29 at 16:15 -0700, Sean Christopherson wrote: > > > > In the unlikely event there is a legitimate reason for max_vcpus_per_td being > > > > less than KVM's minimum, then we can update KVM's minimum as needed. But AFAICT, > > > > that's purely theoretical at this point, i.e. this is all much ado about nothing. > > > > > > I am afraid we already have a legitimate case: TD partitioning. Isaku > > > told me the 'max_vcpus_per_td' is lowed to 512 for the modules with TD > > > partitioning supported. And again this is static, i.e., doesn't require > > > TD partitioning to be opt-in to low to 512. > > > > So what's Intel's plan for use cases that creates TDs with >512 vCPUs? > > I checked with TDX module guys. Turns out the 'max_vcpus_per_td' wasn't > introduced because of TD partitioning, and they are not actually related. > > They introduced this to support "topology virtualization", which requires > a table to record the X2APIC IDs for all vcpus for each TD. In practice, > given a TDX module, the 'max_vcpus_per_td', a.k.a, the X2APIC ID table > size reflects the physical logical cpus that *ALL* platforms that the > module supports can possibly have. > > The reason of this design is TDX guys don't believe there's sense in > supporting the case where the 'max_vcpus' for one single TD needs to > exceed the physical logical cpus. > > So in short: > > - The "max_vcpus_per_td" can be different depending on module versions. In > practice it reflects the maximum physical logical cpus that all the > platforms (that the module supports) can possibly have. > > - Before CSPs deploy/migrate TD on a TDX machine, they must be aware of > the "max_vcpus_per_td" the module supports, and only deploy/migrate TD to > it when it can support. > > - For TDX 1.5.xx modules, the value is 576 (the previous number 512 isn't > correct); For TDX 2.0.xx modules, the value is larger (>1000). For future > module versions, it could have a smaller number, depending on what > platforms that module needs to support. Also, if TDX ever gets supported > on client platforms, we can image the number could be much smaller due to > the "vcpus per td no need to exceed physical logical cpus". > > We may ask them to support the case where 'max_vcpus' for single TD > exceeds the physical logical cpus, or at least not to low down the value > any further for future modules (> 2.0.xx modules). We may also ask them > to give promise to not low the number to below some certain value for any > future modules. But I am not sure there's any concrete reason to do so? > > What's your thinking? > Hi Sean, Sorry to ping, but do you have any comments on this? Is there anything we should make TDX module guys do?
On Fri, Jun 14, 2024, Kai Huang wrote: > On Tue, 2024-06-04 at 10:48 +0000, Huang, Kai wrote: > > On Thu, 2024-05-30 at 16:12 -0700, Sean Christopherson wrote: > > > On Thu, May 30, 2024, Kai Huang wrote: > > > > On Wed, 2024-05-29 at 16:15 -0700, Sean Christopherson wrote: > > > > > In the unlikely event there is a legitimate reason for max_vcpus_per_td being > > > > > less than KVM's minimum, then we can update KVM's minimum as needed. But AFAICT, > > > > > that's purely theoretical at this point, i.e. this is all much ado about nothing. > > > > > > > > I am afraid we already have a legitimate case: TD partitioning. Isaku > > > > told me the 'max_vcpus_per_td' is lowed to 512 for the modules with TD > > > > partitioning supported. And again this is static, i.e., doesn't require > > > > TD partitioning to be opt-in to low to 512. > > > > > > So what's Intel's plan for use cases that creates TDs with >512 vCPUs? > > > > I checked with TDX module guys. Turns out the 'max_vcpus_per_td' wasn't > > introduced because of TD partitioning, and they are not actually related. > > > > They introduced this to support "topology virtualization", which requires > > a table to record the X2APIC IDs for all vcpus for each TD. In practice, > > given a TDX module, the 'max_vcpus_per_td', a.k.a, the X2APIC ID table > > size reflects the physical logical cpus that *ALL* platforms that the > > module supports can possibly have. > > > > The reason of this design is TDX guys don't believe there's sense in > > supporting the case where the 'max_vcpus' for one single TD needs to > > exceed the physical logical cpus. > > > > So in short: > > > > - The "max_vcpus_per_td" can be different depending on module versions. In > > practice it reflects the maximum physical logical cpus that all the > > platforms (that the module supports) can possibly have. > > > > - Before CSPs deploy/migrate TD on a TDX machine, they must be aware of > > the "max_vcpus_per_td" the module supports, and only deploy/migrate TD to > > it when it can support. > > > > - For TDX 1.5.xx modules, the value is 576 (the previous number 512 isn't > > correct); For TDX 2.0.xx modules, the value is larger (>1000). For future > > module versions, it could have a smaller number, depending on what > > platforms that module needs to support. Also, if TDX ever gets supported > > on client platforms, we can image the number could be much smaller due to > > the "vcpus per td no need to exceed physical logical cpus". > > > > We may ask them to support the case where 'max_vcpus' for single TD > > exceeds the physical logical cpus, or at least not to low down the value > > any further for future modules (> 2.0.xx modules). We may also ask them > > to give promise to not low the number to below some certain value for any > > future modules. But I am not sure there's any concrete reason to do so? > > > > What's your thinking? It's a reasonable restriction, e.g. KVM_CAP_NR_VCPUS is already capped at number of online CPUs, although userspace is obviously allowed to create oversubscribed VMs. I think the sane thing to do is document that TDX VMs are restricted to the number of logical CPUs in the system, have KVM_CAP_MAX_VCPUS enumerate exactly that, and then sanity check that max_vcpus_per_td is greater than or equal to what KVM reports for KVM_CAP_MAX_VCPUS. Stating that the maximum number of vCPUs depends on the whims TDX module doesn't provide a predictable ABI for KVM, i.e. I don't want to simply forward TDX's max_vcpus_per_td to userspace.
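The setup-time sanity check suggested above could be modelled as follows (a sketch only; tdx_check_max_vcpus() is an illustrative name and the numbers are made up; in the kernel the logical CPU count would come from something like num_possible_cpus(), as the next reply notes): refuse to enable TDX if the module's limit cannot cover what KVM will report via KVM_CAP_MAX_VCPUS for TDX VMs.

#include <errno.h>
#include <stdio.h>

/*
 * Refuse to enable TDX if the module's per-TD vCPU limit cannot cover
 * the number of logical CPUs KVM will report via KVM_CAP_MAX_VCPUS.
 */
static int tdx_check_max_vcpus(unsigned int max_vcpus_per_td,
			       unsigned int nr_logical_cpus)
{
	if (max_vcpus_per_td < nr_logical_cpus) {
		fprintf(stderr, "TDX module supports only %u vCPUs per TD, need %u\n",
			max_vcpus_per_td, nr_logical_cpus);
		return -EINVAL;
	}
	return 0;
}

int main(void)
{
	printf("576 limit vs 448 CPUs: %d\n", tdx_check_max_vcpus(576, 448));
	printf("576 limit vs 960 CPUs: %d\n", tdx_check_max_vcpus(576, 960));
	return 0;
}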
On 15/06/2024 12:04 pm, Sean Christopherson wrote: > On Fri, Jun 14, 2024, Kai Huang wrote: >> On Tue, 2024-06-04 at 10:48 +0000, Huang, Kai wrote: >>> On Thu, 2024-05-30 at 16:12 -0700, Sean Christopherson wrote: >>>> On Thu, May 30, 2024, Kai Huang wrote: >>>>> On Wed, 2024-05-29 at 16:15 -0700, Sean Christopherson wrote: >>>>>> In the unlikely event there is a legitimate reason for max_vcpus_per_td being >>>>>> less than KVM's minimum, then we can update KVM's minimum as needed. But AFAICT, >>>>>> that's purely theoretical at this point, i.e. this is all much ado about nothing. >>>>> >>>>> I am afraid we already have a legitimate case: TD partitioning. Isaku >>>>> told me the 'max_vcpus_per_td' is lowed to 512 for the modules with TD >>>>> partitioning supported. And again this is static, i.e., doesn't require >>>>> TD partitioning to be opt-in to low to 512. >>>> >>>> So what's Intel's plan for use cases that creates TDs with >512 vCPUs? >>> >>> I checked with TDX module guys. Turns out the 'max_vcpus_per_td' wasn't >>> introduced because of TD partitioning, and they are not actually related. >>> >>> They introduced this to support "topology virtualization", which requires >>> a table to record the X2APIC IDs for all vcpus for each TD. In practice, >>> given a TDX module, the 'max_vcpus_per_td', a.k.a, the X2APIC ID table >>> size reflects the physical logical cpus that *ALL* platforms that the >>> module supports can possibly have. >>> >>> The reason of this design is TDX guys don't believe there's sense in >>> supporting the case where the 'max_vcpus' for one single TD needs to >>> exceed the physical logical cpus. >>> >>> So in short: >>> >>> - The "max_vcpus_per_td" can be different depending on module versions. In >>> practice it reflects the maximum physical logical cpus that all the >>> platforms (that the module supports) can possibly have. >>> >>> - Before CSPs deploy/migrate TD on a TDX machine, they must be aware of >>> the "max_vcpus_per_td" the module supports, and only deploy/migrate TD to >>> it when it can support. >>> >>> - For TDX 1.5.xx modules, the value is 576 (the previous number 512 isn't >>> correct); For TDX 2.0.xx modules, the value is larger (>1000). For future >>> module versions, it could have a smaller number, depending on what >>> platforms that module needs to support. Also, if TDX ever gets supported >>> on client platforms, we can image the number could be much smaller due to >>> the "vcpus per td no need to exceed physical logical cpus". >>> >>> We may ask them to support the case where 'max_vcpus' for single TD >>> exceeds the physical logical cpus, or at least not to low down the value >>> any further for future modules (> 2.0.xx modules). We may also ask them >>> to give promise to not low the number to below some certain value for any >>> future modules. But I am not sure there's any concrete reason to do so? >>> >>> What's your thinking? > > It's a reasonable restriction, e.g. KVM_CAP_NR_VCPUS is already capped at number > of online CPUs, although userspace is obviously allowed to create oversubscribed > VMs. > > I think the sane thing to do is document that TDX VMs are restricted to the number > of logical CPUs in the system, have KVM_CAP_MAX_VCPUS enumerate exactly that, and > then sanity check that max_vcpus_per_td is greater than or equal to what KVM > reports for KVM_CAP_MAX_VCPUS. > > Stating that the maximum number of vCPUs depends on the whims TDX module doesn't > provide a predictable ABI for KVM, i.e. 
I don't want to simply forward TDX's > max_vcpus_per_td to userspace.

This sounds good to me. I think it should also be fine for client, if TDX ever gets supported on client platforms.

IIUC we can consult @nr_cpu_ids or num_possible_cpus() to get the "number of logical CPUs in the system", and refuse to use the TDX module if 'max_vcpus_per_td' turns out to be smaller.

I think the relevant question is whether we should still report the "number of logical CPUs in the system" via KVM_CAP_MAX_VCPUS. If we do, userspace will still need to check the KVM_CAP_MAX_VCPUS VM extension on a per-VM basis. And if it does, then from userspace's perspective it actually doesn't matter whether the per-VM KVM_CAP_MAX_VCPUS is limited underneath by TDX or by the system CPUs (also see below). Userspace cannot tell the difference anyway; it just needs to switch to querying KVM_CAP_MAX_VCPUS on a per-VM basis.

Or, we could limit this to TDX guests ONLY: KVM_CAP_MAX_VCPUS stays global, but for TDX specifically userspace uses some other way to query the number of LPs the system supports (I assume there is an existing ABI for this?). That doesn't look very nice, though.
On Tue, Jun 18, 2024, Kai Huang wrote: > On 15/06/2024 12:04 pm, Sean Christopherson wrote: > > On Fri, Jun 14, 2024, Kai Huang wrote: > > > > - The "max_vcpus_per_td" can be different depending on module versions. In > > > > practice it reflects the maximum physical logical cpus that all the > > > > platforms (that the module supports) can possibly have. > > > > It's a reasonable restriction, e.g. KVM_CAP_NR_VCPUS is already capped at number > > of online CPUs, although userspace is obviously allowed to create oversubscribed > > VMs. > > > > I think the sane thing to do is document that TDX VMs are restricted to the number > > of logical CPUs in the system, have KVM_CAP_MAX_VCPUS enumerate exactly that, and > > then sanity check that max_vcpus_per_td is greater than or equal to what KVM > > reports for KVM_CAP_MAX_VCPUS. > > > Stating that the maximum number of vCPUs depends on the whims TDX module doesn't > > provide a predictable ABI for KVM, i.e. I don't want to simply forward TDX's > > max_vcpus_per_td to userspace. > > This sounds good to me. I think it should be also OK for client too, if TDX > ever gets supported for client. > > IIUC we can consult the @nr_cpu_ids or num_possible_cpus() to get the > "number of logical CPUs in the system". And we can reject to use the TDX > module if 'max_vcpus_per_td' turns to be smaller. I assume TDX is incompatible with actual physical CPU hotplug? If so, we can and should use num_present_cpus(). If loading the TDX module completely disables onlining CPUs, then we can use num_online_cpus(). > I think the relevant question is is whether we should still report "number > of logical CPUs in the system" via KVM_CAP_MAX_VCPUS? Because if doing so, > this still means the userspace will need to check KVM_CAP_MAX_VCPUS vm > extention on per-vm basis. Yes. > And if it does, then from userspace's perspective, it actually doesn't > matter whether underneath the per-vm KVM_CAP_MAX_VCPUS is limited by TDX or > the system cpus (also see below). It matters because I don't want KVM's ABI to be tied to the whims of the TDX module. Today, there's no limitations on the max number of vCPUs. Tomorrow, it's limited by the number of pCPUs. Three days from now, I don't want to find out that the TDX module is limiting the number of vCPUs based on some other new criteria. > The userspace cannot tell the difference anyway. It just needs to change to > query KVM_CAP_MAX_VCPUS to per-vm basis. > > Or, we could limit this to TDX guest ONLY: > > The KVM_CAP_MAX_VCPUS is still global. However for TDX specifically, the > userspace should use other way to query the number of LPs the system > supports (I assume there should be existing ABI for this?). > > But looks this isn't something nice? What's wrong with querying KVM_CAP_MAX_VCPUS on the VM file descriptor?
On Tue, 2024-06-18 at 07:49 -0700, Sean Christopherson wrote: > On Tue, Jun 18, 2024, Kai Huang wrote: > > On 15/06/2024 12:04 pm, Sean Christopherson wrote: > > > On Fri, Jun 14, 2024, Kai Huang wrote: > > > > > - The "max_vcpus_per_td" can be different depending on module versions. In > > > > > practice it reflects the maximum physical logical cpus that all the > > > > > platforms (that the module supports) can possibly have. > > > > > > It's a reasonable restriction, e.g. KVM_CAP_NR_VCPUS is already capped at number > > > of online CPUs, although userspace is obviously allowed to create oversubscribed > > > VMs. > > > > > > I think the sane thing to do is document that TDX VMs are restricted to the number > > > of logical CPUs in the system, have KVM_CAP_MAX_VCPUS enumerate exactly that, and > > > then sanity check that max_vcpus_per_td is greater than or equal to what KVM > > > reports for KVM_CAP_MAX_VCPUS. > > > > Stating that the maximum number of vCPUs depends on the whims TDX module doesn't > > > provide a predictable ABI for KVM, i.e. I don't want to simply forward TDX's > > > max_vcpus_per_td to userspace. > > > > This sounds good to me. I think it should be also OK for client too, if TDX > > ever gets supported for client. > > > > IIUC we can consult the @nr_cpu_ids or num_possible_cpus() to get the > > "number of logical CPUs in the system". And we can reject to use the TDX > > module if 'max_vcpus_per_td' turns to be smaller. > > I assume TDX is incompatible with actual physical CPU hotplug? > Correct. > If so, we can and > should use num_present_cpus(). > On TDX platform num_present_cpus() and num_possible_cpus() should be just identical, because TDX requires BIOS to mark all all physical LPs the platform as enabled, and TDX doesn't support physical CPU hotplug. Using num_present_cpus() w/o holding CPU hotplug lock is a little bit annoying from code's perspective, but it's OK to me. We can add a comment saying TDX doesn't support physical CPU hotplug. > If loading the TDX module completely disables > onlining CPUs, then we can use num_online_cpus(). > > > I think the relevant question is is whether we should still report "number > > of logical CPUs in the system" via KVM_CAP_MAX_VCPUS? Because if doing so, > > this still means the userspace will need to check KVM_CAP_MAX_VCPUS vm > > extention on per-vm basis. > > Yes. > > > And if it does, then from userspace's perspective, it actually doesn't > > matter whether underneath the per-vm KVM_CAP_MAX_VCPUS is limited by TDX or > > the system cpus (also see below). > > It matters because I don't want KVM's ABI to be tied to the whims of the TDX module. > Today, there's no limitations on the max number of vCPUs. Tomorrow, it's limited > by the number of pCPUs. Three days from now, I don't want to find out that the > TDX module is limiting the number of vCPUs based on some other new criteria. Yeah understood. > > > The userspace cannot tell the difference anyway. It just needs to change to > > query KVM_CAP_MAX_VCPUS to per-vm basis. > > > > Or, we could limit this to TDX guest ONLY: > > > > The KVM_CAP_MAX_VCPUS is still global. However for TDX specifically, the > > userspace should use other way to query the number of LPs the system > > supports (I assume there should be existing ABI for this?). > > > > But looks this isn't something nice? > > What's wrong with querying KVM_CAP_MAX_VCPUS on the VM file descriptor? Nothing wrong. 
I just wanted to point out that if we require userspace to do so, then from userspace's perspective it cannot tell what limits the number underneath in KVM.

Will go with this route. Thanks!
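For reference, a minimal userspace sketch of the per-VM query discussed here, assuming the per-VM KVM_CAP_MAX_VCPUS semantics land as proposed; the default VM type is used and error handling is dropped for brevity:

/*
 * Sketch only: KVM_CHECK_EXTENSION is accepted on a VM file descriptor,
 * so userspace can query KVM_CAP_MAX_VCPUS after creating the VM and use
 * the per-VM answer instead of the global one.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
	int kvm_fd = open("/dev/kvm", O_RDWR);
	int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);	/* 0 = default VM type */

	int global_max = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_MAX_VCPUS);
	int vm_max = ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_MAX_VCPUS);

	printf("max vcpus: global %d, this VM %d\n", global_max, vm_max);
	return 0;
}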
On Tue, 2024-06-18 at 22:19 +0000, Huang, Kai wrote: > On Tue, 2024-06-18 at 07:49 -0700, Sean Christopherson wrote: > > On Tue, Jun 18, 2024, Kai Huang wrote: > > > On 15/06/2024 12:04 pm, Sean Christopherson wrote: > > > > On Fri, Jun 14, 2024, Kai Huang wrote: > > > > > > - The "max_vcpus_per_td" can be different depending on module versions. In > > > > > > practice it reflects the maximum physical logical cpus that all the > > > > > > platforms (that the module supports) can possibly have. > > > > > > > > It's a reasonable restriction, e.g. KVM_CAP_NR_VCPUS is already capped at number > > > > of online CPUs, although userspace is obviously allowed to create oversubscribed > > > > VMs. > > > > > > > > I think the sane thing to do is document that TDX VMs are restricted to the number > > > > of logical CPUs in the system, have KVM_CAP_MAX_VCPUS enumerate exactly that, and > > > > then sanity check that max_vcpus_per_td is greater than or equal to what KVM > > > > reports for KVM_CAP_MAX_VCPUS. > > > > > Stating that the maximum number of vCPUs depends on the whims TDX module doesn't > > > > provide a predictable ABI for KVM, i.e. I don't want to simply forward TDX's > > > > max_vcpus_per_td to userspace. > > > > > > This sounds good to me. I think it should be also OK for client too, if TDX > > > ever gets supported for client. > > > > > > IIUC we can consult the @nr_cpu_ids or num_possible_cpus() to get the > > > "number of logical CPUs in the system". And we can reject to use the TDX > > > module if 'max_vcpus_per_td' turns to be smaller. > > > > I assume TDX is incompatible with actual physical CPU hotplug? > > > > Correct. > > > If so, we can and > > should use num_present_cpus(). > > > > On TDX platform num_present_cpus() and num_possible_cpus() should be just > identical, because TDX requires BIOS to mark all all physical LPs the > platform as enabled, and TDX doesn't support physical CPU hotplug. > > Using num_present_cpus() w/o holding CPU hotplug lock is a little bit > annoying from code's perspective, but it's OK to me. We can add a comment > saying TDX doesn't support physical CPU hotplug. > > > If loading the TDX module completely disables > > onlining CPUs, then we can use num_online_cpus(). Missed this one. No loading/initializing TDX module doesn't disable onlining CPUs. TDX module can be initialized on a subset of physical logical CPUs. Also, after module initialization, logical CPU can be put to offline and then brought back to online again.
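To make the conclusion concrete, here is a rough sketch of the sanity check being proposed, run once at TDX module setup time. The function name and how 'max_vcpus_per_td' reaches it are illustrative assumptions, not the actual patch; the point is only the comparison against num_present_cpus(), which is stable because TDX rules out physical CPU hotplug as noted above:

/*
 * Illustrative sketch (names are assumptions): refuse to use the TDX module
 * if its per-TD vCPU limit cannot cover every present logical CPU.  TDX does
 * not support physical CPU hotplug, so num_present_cpus() will not grow.
 */
static int tdx_check_max_vcpus_per_td(u16 max_vcpus_per_td)
{
	unsigned int nr_lps = num_present_cpus();

	if (max_vcpus_per_td < nr_lps) {
		pr_err("TDX module supports only %u vCPUs per TD, but the system has %u LPs\n",
		       max_vcpus_per_td, nr_lps);
		return -EINVAL;
	}

	return 0;
}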
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 00b371d9a1ca..1b0dacc6b6c0 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -21,6 +21,8 @@ KVM_X86_OP(hardware_unsetup)
 KVM_X86_OP(has_emulated_msr)
 KVM_X86_OP(vcpu_after_set_cpuid)
 KVM_X86_OP(is_vm_type_supported)
+KVM_X86_OP_OPTIONAL(max_vcpus);
+KVM_X86_OP_OPTIONAL(vm_enable_cap)
 KVM_X86_OP(vm_init)
 KVM_X86_OP_OPTIONAL(vm_destroy)
 KVM_X86_OP_OPTIONAL_RET0(vcpu_precreate)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 37cda8aa07b6..cf8eb46b3a20 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1604,7 +1604,9 @@ struct kvm_x86_ops {
 	void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu);
 
 	bool (*is_vm_type_supported)(unsigned long vm_type);
+	int (*max_vcpus)(struct kvm *kvm);
 	unsigned int vm_size;
+	int (*vm_enable_cap)(struct kvm *kvm, struct kvm_enable_cap *cap);
 	int (*vm_init)(struct kvm *kvm);
 	void (*vm_destroy)(struct kvm *kvm);
 
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 082e82ce6580..e8a1a7533eea 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -6,6 +6,7 @@
 #include "nested.h"
 #include "pmu.h"
 #include "tdx.h"
+#include "tdx_arch.h"
 
 static bool enable_tdx __ro_after_init;
 module_param_named(tdx, enable_tdx, bool, 0444);
@@ -16,6 +17,17 @@ static bool vt_is_vm_type_supported(unsigned long type)
 		(enable_tdx && tdx_is_vm_type_supported(type));
 }
 
+static int vt_max_vcpus(struct kvm *kvm)
+{
+	if (!kvm)
+		return KVM_MAX_VCPUS;
+
+	if (is_td(kvm))
+		return min(kvm->max_vcpus, TDX_MAX_VCPUS);
+
+	return kvm->max_vcpus;
+}
+
 static __init int vt_hardware_setup(void)
 {
 	int ret;
@@ -39,6 +51,14 @@ static void vt_hardware_unsetup(void)
 	vmx_hardware_unsetup();
 }
 
+static int vt_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
+{
+	if (is_td(kvm))
+		return tdx_vm_enable_cap(kvm, cap);
+
+	return -EINVAL;
+}
+
 static int vt_vm_init(struct kvm *kvm)
 {
 	if (is_td(kvm))
@@ -77,7 +97,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.has_emulated_msr = vmx_has_emulated_msr,
 
 	.is_vm_type_supported = vt_is_vm_type_supported,
+	.max_vcpus = vt_max_vcpus,
 	.vm_size = sizeof(struct kvm_vmx),
+	.vm_enable_cap = vt_vm_enable_cap,
 	.vm_init = vt_vm_init,
 	.vm_destroy = vmx_vm_destroy,
 
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 816ccdb4bc41..ee015f3ce2c9 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -56,6 +56,35 @@ struct tdx_info {
 /* Info about the TDX module. */
 static struct tdx_info *tdx_info;
 
+int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
+{
+	int r;
+
+	switch (cap->cap) {
+	case KVM_CAP_MAX_VCPUS: {
+		if (cap->flags || cap->args[0] == 0)
+			return -EINVAL;
+		if (cap->args[0] > KVM_MAX_VCPUS ||
+		    cap->args[0] > TDX_MAX_VCPUS)
+			return -E2BIG;
+
+		mutex_lock(&kvm->lock);
+		if (kvm->created_vcpus)
+			r = -EBUSY;
+		else {
+			kvm->max_vcpus = cap->args[0];
+			r = 0;
+		}
+		mutex_unlock(&kvm->lock);
+		break;
+	}
+	default:
+		r = -EINVAL;
+		break;
+	}
+	return r;
+}
+
 static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
 {
 	struct kvm_tdx_capabilities __user *user_caps;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index f6c57ad44f80..0031a8d61589 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -139,12 +139,17 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
 void tdx_hardware_unsetup(void);
 bool tdx_is_vm_type_supported(unsigned long type);
 
+int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap);
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
 #else
 static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return -EOPNOTSUPP; }
 static inline void tdx_hardware_unsetup(void) {}
 static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
 
+static inline int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
+{
+	return -EINVAL;
+};
 static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
 #endif
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c459a5e9e520..3f16e9450d2f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4726,6 +4726,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		break;
 	case KVM_CAP_MAX_VCPUS:
 		r = KVM_MAX_VCPUS;
+		if (kvm_x86_ops.max_vcpus)
+			r = static_call(kvm_x86_max_vcpus)(kvm);
 		break;
 	case KVM_CAP_MAX_VCPU_ID:
 		r = KVM_MAX_VCPU_IDS;
@@ -6709,6 +6711,8 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		break;
 	default:
 		r = -EINVAL;
+		if (kvm_x86_ops.vm_enable_cap)
+			r = static_call(kvm_x86_vm_enable_cap)(kvm, cap);
 		break;
 	}
 	return r;
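For completeness, a sketch of the userspace side of the patch above: KVM_ENABLE_CAP with KVM_CAP_MAX_VCPUS is issued on the VM file descriptor before any vCPU is created, which for a TDX guest ends up in tdx_vm_enable_cap(). The helper name and the example value are illustrative only:

/*
 * Sketch of the userspace usage: shrink the per-VM vCPU limit via
 * KVM_ENABLE_CAP on the VM fd before creating any vCPU (the handler
 * returns -EBUSY once kvm->created_vcpus is non-zero).
 */
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int set_max_vcpus(int vm_fd, unsigned int max_vcpus)
{
	struct kvm_enable_cap cap;

	memset(&cap, 0, sizeof(cap));
	cap.cap = KVM_CAP_MAX_VCPUS;
	cap.args[0] = max_vcpus;	/* must be > 0 and within the TDX/KVM limits */

	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}

/* Example: set_max_vcpus(vm_fd, 8); */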