Message ID | 20230301105438.599196-1-kai.huang@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | KVM: VMX: Make setup_vmcs_config() preemption disabled | expand |
On Wed, Mar 01, 2023 at 11:54:38PM +1300, Kai Huang wrote: >Make setup_vmcs_config() preemption disabled so it always performs on >the same local cpu. > >During module loading time, KVM intends to call setup_vmcs_config() to >set up the global VMCS configurations on _one_ cpu in hardware_setup(), >and then calls setup_vmcs_config() on all other online cpus via sending *all other* is misleading. The compatibility check is actually done on *all* online cpus. for_each_online_cpu(cpu) { smp_call_function_single(cpu, kvm_x86_check_cpu_compat, &r, 1); if (r < 0) goto out_unwind_ops; } Given this, it probably is ok to not disable preemption because all CPUs are guaranteed to be compatible later in the flow in terms of VMCS capabilities. But we don't want to have such a subtle dependency. Do you see any real problem with preemption enabled? >IPI to perform VMX compatibility check. Further more, KVM has CPU >hotplug callback to call setup_vmcs_config() to do compatibility check >on the "new-online" cpu to make sure it is compatible too. > >setup_vmcs_config() is supposed to be done on the same cpu. This is >true in the compatibility check code path as setup_vmcs_config() is >called either via IPI or in per-cpu CPU hotplug thread. However, the >first call from hardware_setup() isn't as it is called when preemption >is enabled. > >Change the existing setup_vmcs_config() to __setup_vmcs_config() and >call the latter directly in the compatibility check code path. Change >setup_vmcs_config() to call __setup_vmcs_config() with preemption >disabled so __setup_vmcs_config() is always done on the same cpu. Maybe you can simply disable preemption in hardware_setup() although I don't have a strong preference. nested_vmx_setup_ctls_msrs() also reads some MSRs and sets up part of vmcs_conf, should it be called on the same CPU as setup_vmcs_config()?
On Thu, 2023-03-02 at 13:36 +0800, Gao, Chao wrote: > On Wed, Mar 01, 2023 at 11:54:38PM +1300, Kai Huang wrote: > > Make setup_vmcs_config() preemption disabled so it always performs on > > the same local cpu. > > > > During module loading time, KVM intends to call setup_vmcs_config() to > > set up the global VMCS configurations on _one_ cpu in hardware_setup(), > > and then calls setup_vmcs_config() on all other online cpus via sending > > *all other* is misleading. The compatibility check is actually done on > *all* online cpus. > > for_each_online_cpu(cpu) { > smp_call_function_single(cpu, kvm_x86_check_cpu_compat, &r, 1); > if (r < 0) > goto out_unwind_ops; > } > > Given this, it probably is ok to not disable preemption because all CPUs > are guaranteed to be compatible later in the flow in terms of VMCS > capabilities. But we don't want to have such a subtle dependency. > > Do you see any real problem with preemption enabled? No. Just thought it is the right thing to do. > > > IPI to perform VMX compatibility check. Further more, KVM has CPU > > hotplug callback to call setup_vmcs_config() to do compatibility check > > on the "new-online" cpu to make sure it is compatible too. > > > > setup_vmcs_config() is supposed to be done on the same cpu. This is > > true in the compatibility check code path as setup_vmcs_config() is > > called either via IPI or in per-cpu CPU hotplug thread. However, the > > first call from hardware_setup() isn't as it is called when preemption > > is enabled. > > > > Change the existing setup_vmcs_config() to __setup_vmcs_config() and > > call the latter directly in the compatibility check code path. Change > > setup_vmcs_config() to call __setup_vmcs_config() with preemption > > disabled so __setup_vmcs_config() is always done on the same cpu. > > Maybe you can simply disable preemption in hardware_setup() although I > don't have a strong preference. > > nested_vmx_setup_ctls_msrs() also reads some MSRs and sets up part of > vmcs_conf, should it be called on the same CPU as setup_vmcs_config()? Yes I think so. I missed this :) Not sure whether there are other similar places too even outside of hardware_setup(). But compatibility check only checks things calculated via setup_vmcs_config() and nested_vmx_setup_ctls_msrs(), so I think it's fair to only put hardware_setup() inside preemption disabled.
On Thu, Mar 02, 2023, Huang, Kai wrote: > On Thu, 2023-03-02 at 13:36 +0800, Gao, Chao wrote: > > On Wed, Mar 01, 2023 at 11:54:38PM +1300, Kai Huang wrote: > > > Make setup_vmcs_config() preemption disabled so it always performs on > > > the same local cpu. > > > > > > During module loading time, KVM intends to call setup_vmcs_config() to > > > set up the global VMCS configurations on _one_ cpu in hardware_setup(), That may have been the very original intention, but I don't think it has been the true intention for a very long time. > > > Change the existing setup_vmcs_config() to __setup_vmcs_config() and > > > call the latter directly in the compatibility check code path. Change > > > setup_vmcs_config() to call __setup_vmcs_config() with preemption > > > disabled so __setup_vmcs_config() is always done on the same cpu. > > > > Maybe you can simply disable preemption in hardware_setup() although I > > don't have a strong preference. > > > > nested_vmx_setup_ctls_msrs() also reads some MSRs and sets up part of > > vmcs_conf, should it be called on the same CPU as setup_vmcs_config()? > > Yes I think so. I missed this :) > > Not sure whether there are other similar places too even outside of > hardware_setup(). > > But compatibility check only checks things calculated via setup_vmcs_config() > and nested_vmx_setup_ctls_msrs(), so I think it's fair to only put > hardware_setup() inside preemption disabled. Disabling preemption across hardware_setup() isn't feasible as there are a number of allocations that might sleep. But disabling preemption isn't necessary to ensure setup runs on one CPU, that only requires disabling _migration_. So _if_ we want to handle this in the kernel, we could simply do: diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 541982de5762..9126fdf02649 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -9470,7 +9470,9 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops) int r; mutex_lock(&vendor_module_lock); + migrate_disable(); r = __kvm_x86_vendor_init(ops); + migrate_enable(); mutex_unlock(&vendor_module_lock); return r; But I'm not convinced we should handle this in the kernel. Many of the checks, especially in SVM, query boot_cpu_has(), not this_cpu_has(), i.e. to truly perform setup on a single CPU, all of those would need to be converted to this_cpu_has(). Some of those boot_cpu_has() calls should be changed regardless of whether or not migration is disabled, e.g. kvm_is_svm_supported() is arguably straight up buggy due to cpu_has_svm() checking the boot CPU (I'll fix that by adding a patch after open coding cpu_has_svm() into kvm_is_svm_supported()[*]). But things like kvm_timer_init() should NOT be blindlgly converted to this_cpu_has(), because the teardown path needs to mirror the setup path, e.g. if KVM ended up running on frankenstein hardware where not all CPUs have a constant TSC, KVM could leave a callback dangling and hose the kernel. Obviously such hardware wouldn't correctly run VMs, but crashing the kernel is a touch worse than KVM not working correctly. I'm not totally against converting to this_cpu_has() for the setup, as it would be more intuitive in a lot of ways. But, I don't think pinning the task actually hardens KVM in a meaningful way. If there are any divergences between CPUs, then either KVM will notice before running VMs, e.g. the VMCS sanity checks, or KVM will never notice, e.g. the myriad runtime paths that check boot_cpu_has() (or variants thereof) without sanity checking across CPUs. And if userspace _really_ wants to have guarantees about how setup is performed, e.g. for repeatable, deterministic behavior, then userspace should force loading of KVM to be done on CPU0. So my vote is to leave things as-is (modulo the cpu_has_svm() mess). But maybe add documentation to explain the caveats about loading KVM, and how userspace can mitigate those caveats? [*] https://lore.kernel.org/all/20221201232655.290720-14-seanjc@google.com
On Tue, 2023-03-07 at 09:17 -0800, Sean Christopherson wrote: > On Thu, Mar 02, 2023, Huang, Kai wrote: > > On Thu, 2023-03-02 at 13:36 +0800, Gao, Chao wrote: > > > On Wed, Mar 01, 2023 at 11:54:38PM +1300, Kai Huang wrote: > > > > Make setup_vmcs_config() preemption disabled so it always performs on > > > > the same local cpu. > > > > > > > > During module loading time, KVM intends to call setup_vmcs_config() to > > > > set up the global VMCS configurations on _one_ cpu in hardware_setup(), > > That may have been the very original intention, but I don't think it has been the > true intention for a very long time. Wondering what's the current intention? > > > > > Change the existing setup_vmcs_config() to __setup_vmcs_config() and > > > > call the latter directly in the compatibility check code path. Change > > > > setup_vmcs_config() to call __setup_vmcs_config() with preemption > > > > disabled so __setup_vmcs_config() is always done on the same cpu. > > > > > > Maybe you can simply disable preemption in hardware_setup() although I > > > don't have a strong preference. > > > > > > nested_vmx_setup_ctls_msrs() also reads some MSRs and sets up part of > > > vmcs_conf, should it be called on the same CPU as setup_vmcs_config()? > > > > Yes I think so. I missed this :) > > > > Not sure whether there are other similar places too even outside of > > hardware_setup(). > > > > But compatibility check only checks things calculated via setup_vmcs_config() > > and nested_vmx_setup_ctls_msrs(), so I think it's fair to only put > > hardware_setup() inside preemption disabled. > > Disabling preemption across hardware_setup() isn't feasible as there are a number > of allocations that might sleep. But disabling preemption isn't necessary to > ensure setup runs on one CPU, that only requires disabling _migration_. So _if_ > we want to handle this in the kernel, we could simply do: > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 541982de5762..9126fdf02649 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -9470,7 +9470,9 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops) > int r; > > mutex_lock(&vendor_module_lock); > + migrate_disable(); > r = __kvm_x86_vendor_init(ops); > + migrate_enable(); > mutex_unlock(&vendor_module_lock); > > return r; > > > But I'm not convinced we should handle this in the kernel. Many of the checks, > especially in SVM, query boot_cpu_has(), not this_cpu_has(), i.e. to truly perform > setup on a single CPU, all of those would need to be converted to this_cpu_has(). > > Some of those boot_cpu_has() calls should be changed regardless of whether or not > migration is disabled, e.g. kvm_is_svm_supported() is arguably straight up buggy > due to cpu_has_svm() checking the boot CPU (I'll fix that by adding a patch after > open coding cpu_has_svm() into kvm_is_svm_supported()[*]). > > But things like kvm_timer_init() should NOT be blindlgly converted to this_cpu_has(), > because the teardown path needs to mirror the setup path, e.g. if KVM ended up > running on frankenstein hardware where not all CPUs have a constant TSC, KVM could > leave a callback dangling and hose the kernel. Obviously such hardware wouldn't > correctly run VMs, but crashing the kernel is a touch worse than KVM not working > correctly. > > I'm not totally against converting to this_cpu_has() for the setup, as it would be > more intuitive in a lot of ways. But, I don't think pinning the task actually > hardens KVM in a meaningful way. If there are any divergences between CPUs, then > either KVM will notice before running VMs, e.g. the VMCS sanity checks, or KVM will > never notice, e.g. the myriad runtime paths that check boot_cpu_has() (or variants > thereof) without sanity checking across CPUs. And if userspace _really_ wants to > have guarantees about how setup is performed, e.g. for repeatable, deterministic > behavior, then userspace should force loading of KVM to be done on CPU0. My intention is never for userspace, but simply/purely from compatibility check's point of view (see below). Also, I don't think userspace wants to guarantee anything -- it just wants to load the KVM module. It's even arguable that it may be an acceptable behaviour to fail to run any VM even loading module was successful. > > So my vote is to leave things as-is (modulo the cpu_has_svm() mess). But maybe add > documentation to explain the caveats about loading KVM, and how userspace can > mitigate those caveats? I made this patch because I have some other patches to move VMXON support out of KVM in order to support TDX, but so far those patches are not included in that series (and I'd like to leave it out if we really don't need it). In the patch to move VMXON out of KVM, I changed to use per-cpu variable to cache the MSR_IA32_VMX_BASIC value and setup the VMXON region when one CPU is becoming online. And setup_vmcs_config() is changed to use __this_cpu_read() to read the per-cpu MSR value instead of reading from hardware. Obviously w/o preempt_disable() or similar __this_cpu_read() can report kernel bug: printk(KERN_ERR "BUG: using %s%s() in preemptible [%08x] code: %s/%d\n", what1, what2, preempt_count() - 1, current->comm, current->pid); That being said, I am fine to keep existing code, even w/o documenting. We can discuss more how to handle when we really want to move VMXON out of KVM (i.e. supporting TDX IO?). Or we can just fix compatibility check part? For instance, move setup_vmcs_config() and nested_vmx_setup_ctls_msrs() together in hardware_setup() and call preempt_disable() around them?
On Wed, Mar 08, 2023, Huang, Kai wrote: > On Tue, 2023-03-07 at 09:17 -0800, Sean Christopherson wrote: > > On Thu, Mar 02, 2023, Huang, Kai wrote: > > > On Thu, 2023-03-02 at 13:36 +0800, Gao, Chao wrote: > > > > On Wed, Mar 01, 2023 at 11:54:38PM +1300, Kai Huang wrote: > > > > > Make setup_vmcs_config() preemption disabled so it always performs on > > > > > the same local cpu. > > > > > > > > > > During module loading time, KVM intends to call setup_vmcs_config() to > > > > > set up the global VMCS configurations on _one_ cpu in hardware_setup(), > > > > That may have been the very original intention, but I don't think it has been the > > true intention for a very long time. > > Wondering what's the current intention? I don't think there's a deliberate "intention" beyond "does it work?". Like many of the historical bits of KVM x86, I think this is a case of the original authors _wanting_ to provide certain behavior, but not actually ensuring that behavior in code. > > > > > Change the existing setup_vmcs_config() to __setup_vmcs_config() and > > > > > call the latter directly in the compatibility check code path. Change > > > > > setup_vmcs_config() to call __setup_vmcs_config() with preemption > > > > > disabled so __setup_vmcs_config() is always done on the same cpu. > > > > > > > > Maybe you can simply disable preemption in hardware_setup() although I > > > > don't have a strong preference. > > > > > > > > nested_vmx_setup_ctls_msrs() also reads some MSRs and sets up part of > > > > vmcs_conf, should it be called on the same CPU as setup_vmcs_config()? > > > > > > Yes I think so. I missed this :) > > > > > > Not sure whether there are other similar places too even outside of > > > hardware_setup(). > > > > > > But compatibility check only checks things calculated via setup_vmcs_config() > > > and nested_vmx_setup_ctls_msrs(), so I think it's fair to only put > > > hardware_setup() inside preemption disabled. > > > > Disabling preemption across hardware_setup() isn't feasible as there are a number > > of allocations that might sleep. But disabling preemption isn't necessary to > > ensure setup runs on one CPU, that only requires disabling _migration_. So _if_ > > we want to handle this in the kernel, we could simply do: > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > index 541982de5762..9126fdf02649 100644 > > --- a/arch/x86/kvm/x86.c > > +++ b/arch/x86/kvm/x86.c > > @@ -9470,7 +9470,9 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops) > > int r; > > > > mutex_lock(&vendor_module_lock); > > + migrate_disable(); > > r = __kvm_x86_vendor_init(ops); > > + migrate_enable(); > > mutex_unlock(&vendor_module_lock); > > > > return r; > > > > > > But I'm not convinced we should handle this in the kernel. Many of the checks, > > especially in SVM, query boot_cpu_has(), not this_cpu_has(), i.e. to truly perform > > setup on a single CPU, all of those would need to be converted to this_cpu_has(). > > > > Some of those boot_cpu_has() calls should be changed regardless of whether or not > > migration is disabled, e.g. kvm_is_svm_supported() is arguably straight up buggy > > due to cpu_has_svm() checking the boot CPU (I'll fix that by adding a patch after > > open coding cpu_has_svm() into kvm_is_svm_supported()[*]). > > > > But things like kvm_timer_init() should NOT be blindlgly converted to this_cpu_has(), > > because the teardown path needs to mirror the setup path, e.g. if KVM ended up > > running on frankenstein hardware where not all CPUs have a constant TSC, KVM could > > leave a callback dangling and hose the kernel. Obviously such hardware wouldn't > > correctly run VMs, but crashing the kernel is a touch worse than KVM not working > > correctly. > > > > I'm not totally against converting to this_cpu_has() for the setup, as it would be > > more intuitive in a lot of ways. But, I don't think pinning the task actually > > hardens KVM in a meaningful way. If there are any divergences between CPUs, then > > either KVM will notice before running VMs, e.g. the VMCS sanity checks, or KVM will > > never notice, e.g. the myriad runtime paths that check boot_cpu_has() (or variants > > thereof) without sanity checking across CPUs. And if userspace _really_ wants to > > have guarantees about how setup is performed, e.g. for repeatable, deterministic > > behavior, then userspace should force loading of KVM to be done on CPU0. > > My intention is never for userspace, but simply/purely from compatibility > check's point of view (see below). Also, I don't think userspace wants to > guarantee anything -- it just wants to load the KVM module. That very much depends on the use case. For personal usage of KVM, it's extremely unlikely that userspace is doing anything remotely sophisticated. But for a more "formal" deployment, userspace absolutely has its hands all over the system, e.g. scheduling VMs across systems, monitoring the health of the system, etc. Whether or not userspaces actually do tightly control loading KVM is another matter... > It's even arguable that it may be an acceptable behaviour to fail to run any > VM even loading module was successful. > > > > > So my vote is to leave things as-is (modulo the cpu_has_svm() mess). But maybe add > > documentation to explain the caveats about loading KVM, and how userspace can > > mitigate those caveats? > > I made this patch because I have some other patches to move VMXON support out of > KVM in order to support TDX, but so far those patches are not included in that > series (and I'd like to leave it out if we really don't need it). Me too. :-) > In the patch to move VMXON out of KVM, I changed to use per-cpu variable to > cache the MSR_IA32_VMX_BASIC value and setup the VMXON region when one CPU is > becoming online. And setup_vmcs_config() is changed to use __this_cpu_read() to > read the per-cpu MSR value instead of reading from hardware. Obviously w/o > preempt_disable() or similar __this_cpu_read() can report kernel bug: > > printk(KERN_ERR "BUG: using %s%s() in preemptible [%08x] code: %s/%d\n", > what1, what2, preempt_count() - 1, current->comm, current->pid); > > That being said, I am fine to keep existing code, even w/o documenting. We can > discuss more how to handle when we really want to move VMXON out of KVM (i.e. > supporting TDX IO?). > > Or we can just fix compatibility check part? For instance, move > setup_vmcs_config() and nested_vmx_setup_ctls_msrs() together in > hardware_setup() and call preempt_disable() around them? Eh, the compatibility checks we really care about run in IRQ context, i.e. they're guaranteed to have a stable CPU. Splitting the _setup_ for the compatibility checks across multiple CPUs isn't a problem because KVM will still get the right "answer", i.e. any divergence will be detected (barring _very_ flaky hardware that might get false negatives anyways). Don't get me wrong, I agree it's ugly, but I don't want to go halfway. I either want to guard the whole thing, or nothing, and I can't convince myself that guarding everything is worthwhile since userspace can (and IMO should) do a better job.
On Wed, 2023-03-08 at 13:03 -0800, Sean Christopherson wrote: > On Wed, Mar 08, 2023, Huang, Kai wrote: > > On Tue, 2023-03-07 at 09:17 -0800, Sean Christopherson wrote: > > > On Thu, Mar 02, 2023, Huang, Kai wrote: > > > > On Thu, 2023-03-02 at 13:36 +0800, Gao, Chao wrote: > > > > > On Wed, Mar 01, 2023 at 11:54:38PM +1300, Kai Huang wrote: > > > > > > Make setup_vmcs_config() preemption disabled so it always performs on > > > > > > the same local cpu. > > > > > > > > > > > > During module loading time, KVM intends to call setup_vmcs_config() to > > > > > > set up the global VMCS configurations on _one_ cpu in hardware_setup(), > > > > > > That may have been the very original intention, but I don't think it has been the > > > true intention for a very long time. > > > > Wondering what's the current intention? > > I don't think there's a deliberate "intention" beyond "does it work?". Like many > of the historical bits of KVM x86, I think this is a case of the original authors > _wanting_ to provide certain behavior, but not actually ensuring that behavior in > code. Yep. > > > > > > > Change the existing setup_vmcs_config() to __setup_vmcs_config() and > > > > > > call the latter directly in the compatibility check code path. Change > > > > > > setup_vmcs_config() to call __setup_vmcs_config() with preemption > > > > > > disabled so __setup_vmcs_config() is always done on the same cpu. > > > > > > > > > > Maybe you can simply disable preemption in hardware_setup() although I > > > > > don't have a strong preference. > > > > > > > > > > nested_vmx_setup_ctls_msrs() also reads some MSRs and sets up part of > > > > > vmcs_conf, should it be called on the same CPU as setup_vmcs_config()? > > > > > > > > Yes I think so. I missed this :) > > > > > > > > Not sure whether there are other similar places too even outside of > > > > hardware_setup(). > > > > > > > > But compatibility check only checks things calculated via setup_vmcs_config() > > > > and nested_vmx_setup_ctls_msrs(), so I think it's fair to only put > > > > hardware_setup() inside preemption disabled. > > > > > > Disabling preemption across hardware_setup() isn't feasible as there are a number > > > of allocations that might sleep. But disabling preemption isn't necessary to > > > ensure setup runs on one CPU, that only requires disabling _migration_. So _if_ > > > we want to handle this in the kernel, we could simply do: > > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > > index 541982de5762..9126fdf02649 100644 > > > --- a/arch/x86/kvm/x86.c > > > +++ b/arch/x86/kvm/x86.c > > > @@ -9470,7 +9470,9 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops) > > > int r; > > > > > > mutex_lock(&vendor_module_lock); > > > + migrate_disable(); > > > r = __kvm_x86_vendor_init(ops); > > > + migrate_enable(); > > > mutex_unlock(&vendor_module_lock); > > > > > > return r; > > > > > > > > > But I'm not convinced we should handle this in the kernel. Many of the checks, > > > especially in SVM, query boot_cpu_has(), not this_cpu_has(), i.e. to truly perform > > > setup on a single CPU, all of those would need to be converted to this_cpu_has(). > > > > > > Some of those boot_cpu_has() calls should be changed regardless of whether or not > > > migration is disabled, e.g. kvm_is_svm_supported() is arguably straight up buggy > > > due to cpu_has_svm() checking the boot CPU (I'll fix that by adding a patch after > > > open coding cpu_has_svm() into kvm_is_svm_supported()[*]). > > > > > > But things like kvm_timer_init() should NOT be blindlgly converted to this_cpu_has(), > > > because the teardown path needs to mirror the setup path, e.g. if KVM ended up > > > running on frankenstein hardware where not all CPUs have a constant TSC, KVM could > > > leave a callback dangling and hose the kernel. Obviously such hardware wouldn't > > > correctly run VMs, but crashing the kernel is a touch worse than KVM not working > > > correctly. > > > > > > I'm not totally against converting to this_cpu_has() for the setup, as it would be > > > more intuitive in a lot of ways. But, I don't think pinning the task actually > > > hardens KVM in a meaningful way. If there are any divergences between CPUs, then > > > either KVM will notice before running VMs, e.g. the VMCS sanity checks, or KVM will > > > never notice, e.g. the myriad runtime paths that check boot_cpu_has() (or variants > > > thereof) without sanity checking across CPUs. And if userspace _really_ wants to > > > have guarantees about how setup is performed, e.g. for repeatable, deterministic > > > behavior, then userspace should force loading of KVM to be done on CPU0. > > > > My intention is never for userspace, but simply/purely from compatibility > > check's point of view (see below). Also, I don't think userspace wants to > > guarantee anything -- it just wants to load the KVM module. > > That very much depends on the use case. For personal usage of KVM, it's extremely > unlikely that userspace is doing anything remotely sophisticated. But for a more > "formal" deployment, userspace absolutely has its hands all over the system, e.g. > scheduling VMs across systems, monitoring the health of the system, etc. Whether > or not userspaces actually do tightly control loading KVM is another matter... Agreed. > > > It's even arguable that it may be an acceptable behaviour to fail to run any > > VM even loading module was successful. > > > > > > > > So my vote is to leave things as-is (modulo the cpu_has_svm() mess). But maybe add > > > documentation to explain the caveats about loading KVM, and how userspace can > > > mitigate those caveats? > > > > I made this patch because I have some other patches to move VMXON support out of > > KVM in order to support TDX, but so far those patches are not included in that > > series (and I'd like to leave it out if we really don't need it). > > Me too. :-) > > > In the patch to move VMXON out of KVM, I changed to use per-cpu variable to > > cache the MSR_IA32_VMX_BASIC value and setup the VMXON region when one CPU is > > becoming online. And setup_vmcs_config() is changed to use __this_cpu_read() to > > read the per-cpu MSR value instead of reading from hardware. Obviously w/o > > preempt_disable() or similar __this_cpu_read() can report kernel bug: > > > > printk(KERN_ERR "BUG: using %s%s() in preemptible [%08x] code: %s/%d\n", > > what1, what2, preempt_count() - 1, current->comm, current->pid); > > > > That being said, I am fine to keep existing code, even w/o documenting. We can > > discuss more how to handle when we really want to move VMXON out of KVM (i.e. > > supporting TDX IO?). > > > > Or we can just fix compatibility check part? For instance, move > > setup_vmcs_config() and nested_vmx_setup_ctls_msrs() together in > > hardware_setup() and call preempt_disable() around them? > > Eh, the compatibility checks we really care about run in IRQ context, i.e. they're > guaranteed to have a stable CPU. Splitting the _setup_ for the compatibility > checks across multiple CPUs isn't a problem because KVM will still get the right > "answer", i.e. any divergence will be detected (barring _very_ flaky hardware that > might get false negatives anyways). > > Don't get me wrong, I agree it's ugly, but I don't want to go halfway. I either > want to guard the whole thing, or nothing, and I can't convince myself that > guarding everything is worthwhile since userspace can (and IMO should) do a better > job. Agreed. Let's just leave the current code as is. Thanks for your time!
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 33614ee2cd67..dfbc0e9095ec 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -2578,8 +2578,8 @@ static u64 adjust_vmx_controls64(u64 ctl_opt, u32 msr) return ctl_opt & allowed; } -static int setup_vmcs_config(struct vmcs_config *vmcs_conf, - struct vmx_capability *vmx_cap) +static int __setup_vmcs_config(struct vmcs_config *vmcs_conf, + struct vmx_capability *vmx_cap) { u32 vmx_msr_low, vmx_msr_high; u32 _pin_based_exec_control = 0; @@ -2745,6 +2745,18 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf, return 0; } +static int setup_vmcs_config(struct vmcs_config *vmcs_conf, + struct vmx_capability *vmx_cap) +{ + int ret; + + preempt_disable(); + ret = __setup_vmcs_config(vmcs_conf, vmx_cap); + preempt_enable(); + + return ret; +} + static bool kvm_is_vmx_supported(void) { int cpu = raw_smp_processor_id(); @@ -2772,7 +2784,7 @@ static int vmx_check_processor_compat(void) if (!kvm_is_vmx_supported()) return -EIO; - if (setup_vmcs_config(&vmcs_conf, &vmx_cap) < 0) { + if (__setup_vmcs_config(&vmcs_conf, &vmx_cap) < 0) { pr_err("Failed to setup VMCS config on CPU %d\n", cpu); return -EIO; }
Make setup_vmcs_config() preemption disabled so it always performs on the same local cpu. During module loading time, KVM intends to call setup_vmcs_config() to set up the global VMCS configurations on _one_ cpu in hardware_setup(), and then calls setup_vmcs_config() on all other online cpus via sending IPI to perform VMX compatibility check. Further more, KVM has CPU hotplug callback to call setup_vmcs_config() to do compatibility check on the "new-online" cpu to make sure it is compatible too. setup_vmcs_config() is supposed to be done on the same cpu. This is true in the compatibility check code path as setup_vmcs_config() is called either via IPI or in per-cpu CPU hotplug thread. However, the first call from hardware_setup() isn't as it is called when preemption is enabled. Change the existing setup_vmcs_config() to __setup_vmcs_config() and call the latter directly in the compatibility check code path. Change setup_vmcs_config() to call __setup_vmcs_config() with preemption disabled so __setup_vmcs_config() is always done on the same cpu. Signed-off-by: Kai Huang <kai.huang@intel.com> --- arch/x86/kvm/vmx/vmx.c | 18 +++++++++++++++--- 1 file changed, 15 insertions(+), 3 deletions(-) base-commit: 93827a0a36396f2fd6368a54a020f420c8916e9b