KVM: VMX: Make setup_vmcs_config() preemption disabled

Message ID	20230301105438.599196-1-kai.huang@intel.com (mailing list archive)
State	New, archived
Headers
Return-Path: <kvm-owner@vger.kernel.org>
From: Kai Huang <kai.huang@intel.com>
To: kvm@vger.kernel.org
Cc: pbonzini@redhat.com, seanjc@google.com, linux-kernel@vger.kernel.org, Kai Huang <kai.huang@intel.com>
Subject: [PATCH] KVM: VMX: Make setup_vmcs_config() preemption disabled
Date: Wed, 1 Mar 2023 23:54:38 +1300
Message-Id: <20230301105438.599196-1-kai.huang@intel.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Huang, Kai March 1, 2023, 10:54 a.m. UTC

Make setup_vmcs_config() preemption disabled so it always performs on
the same local cpu.

During module loading time, KVM intends to call setup_vmcs_config() to
set up the global VMCS configurations on _one_ cpu in hardware_setup(),
and then calls setup_vmcs_config() on all other online cpus via sending
IPI to perform VMX compatibility check.  Furthermore, KVM has CPU
hotplug callback to call setup_vmcs_config() to do compatibility check
on the "new-online" cpu to make sure it is compatible too.

setup_vmcs_config() is supposed to be done on the same cpu.  This is
true in the compatibility check code path as setup_vmcs_config() is
called either via IPI or in per-cpu CPU hotplug thread.  However, the
first call from hardware_setup() isn't as it is called when preemption
is enabled.

Change the existing setup_vmcs_config() to __setup_vmcs_config() and
call the latter directly in the compatibility check code path.  Change
setup_vmcs_config() to call __setup_vmcs_config() with preemption
disabled so __setup_vmcs_config() is always done on the same cpu.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/kvm/vmx/vmx.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)


base-commit: 93827a0a36396f2fd6368a54a020f420c8916e9b

Chao Gao March 2, 2023, 5:36 a.m. UTC | #1

On Wed, Mar 01, 2023 at 11:54:38PM +1300, Kai Huang wrote:
>Make setup_vmcs_config() preemption disabled so it always performs on
>the same local cpu.
>
>During module loading time, KVM intends to call setup_vmcs_config() to
>set up the global VMCS configurations on _one_ cpu in hardware_setup(),
>and then calls setup_vmcs_config() on all other online cpus via sending

*all other* is misleading. The compatibility check is actually done on
*all* online cpus.

        for_each_online_cpu(cpu) {
                smp_call_function_single(cpu, kvm_x86_check_cpu_compat, &r, 1);
                if (r < 0)
                        goto out_unwind_ops;
        }

Given this, it probably is ok to not disable preemption because all CPUs
are guaranteed to be compatible later in the flow in terms of VMCS
capabilities. But we don't want to have such a subtle dependency.

Do you see any real problem with preemption enabled?

>IPI to perform VMX compatibility check.  Furthermore, KVM has CPU
>hotplug callback to call setup_vmcs_config() to do compatibility check
>on the "new-online" cpu to make sure it is compatible too.
>
>setup_vmcs_config() is supposed to be done on the same cpu.  This is
>true in the compatibility check code path as setup_vmcs_config() is
>called either via IPI or in per-cpu CPU hotplug thread.  However, the
>first call from hardware_setup() isn't as it is called when preemption
>is enabled.
>
>Change the existing setup_vmcs_config() to __setup_vmcs_config() and
>call the latter directly in the compatibility check code path.  Change
>setup_vmcs_config() to call __setup_vmcs_config() with preemption
>disabled so __setup_vmcs_config() is always done on the same cpu.

Maybe you can simply disable preemption in hardware_setup() although I
don't have a strong preference.

nested_vmx_setup_ctls_msrs() also reads some MSRs and sets up part of
vmcs_conf, should it be called on the same CPU as setup_vmcs_config()?

Huang, Kai March 2, 2023, 8:39 a.m. UTC | #2

On Thu, 2023-03-02 at 13:36 +0800, Gao, Chao wrote:
> On Wed, Mar 01, 2023 at 11:54:38PM +1300, Kai Huang wrote:
> > Make setup_vmcs_config() preemption disabled so it always performs on
> > the same local cpu.
> > 
> > During module loading time, KVM intends to call setup_vmcs_config() to
> > set up the global VMCS configurations on _one_ cpu in hardware_setup(),
> > and then calls setup_vmcs_config() on all other online cpus via sending
> 
> *all other* is misleading. The compatibility check is actually done on
> *all* online cpus.
> 
>         for_each_online_cpu(cpu) {
>                 smp_call_function_single(cpu, kvm_x86_check_cpu_compat, &r, 1);
>                 if (r < 0)
>                         goto out_unwind_ops;
>         }
> 
> Given this, it probably is ok to not disable preemption because all CPUs
> are guaranteed to be compatible later in the flow in terms of VMCS
> capabilities. But we don't want to have such a subtle dependency.
> 
> Do you see any real problem with preemption enabled?

No.  Just thought it is the right thing to do.

> 
> > IPI to perform VMX compatibility check.  Furthermore, KVM has CPU
> > hotplug callback to call setup_vmcs_config() to do compatibility check
> > on the "new-online" cpu to make sure it is compatible too.
> > 
> > setup_vmcs_config() is supposed to be done on the same cpu.  This is
> > true in the compatibility check code path as setup_vmcs_config() is
> > called either via IPI or in per-cpu CPU hotplug thread.  However, the
> > first call from hardware_setup() isn't as it is called when preemption
> > is enabled.
> > 
> > Change the existing setup_vmcs_config() to __setup_vmcs_config() and
> > call the latter directly in the compatibility check code path.  Change
> > setup_vmcs_config() to call __setup_vmcs_config() with preemption
> > disabled so __setup_vmcs_config() is always done on the same cpu.
> 
> Maybe you can simply disable preemption in hardware_setup() although I
> don't have a strong preference.
> 
> nested_vmx_setup_ctls_msrs() also reads some MSRs and sets up part of
> vmcs_conf, should it be called on the same CPU as setup_vmcs_config()?

Yes I think so.  I missed this :)

Not sure whether there are other similar places too even outside of
hardware_setup().

But compatibility check only checks things calculated via setup_vmcs_config()
and nested_vmx_setup_ctls_msrs(), so I think it's fair to only put
hardware_setup() inside preemption disabled.

Sean Christopherson March 7, 2023, 5:17 p.m. UTC | #3

On Thu, Mar 02, 2023, Huang, Kai wrote:
> On Thu, 2023-03-02 at 13:36 +0800, Gao, Chao wrote:
> > On Wed, Mar 01, 2023 at 11:54:38PM +1300, Kai Huang wrote:
> > > Make setup_vmcs_config() preemption disabled so it always performs on
> > > the same local cpu.
> > > 
> > > During module loading time, KVM intends to call setup_vmcs_config() to
> > > set up the global VMCS configurations on _one_ cpu in hardware_setup(),

That may have been the very original intention, but I don't think it has been the
true intention for a very long time.

> > > Change the existing setup_vmcs_config() to __setup_vmcs_config() and
> > > call the latter directly in the compatibility check code path.  Change
> > > setup_vmcs_config() to call __setup_vmcs_config() with preemption
> > > disabled so __setup_vmcs_config() is always done on the same cpu.
> > 
> > Maybe you can simply disable preemption in hardware_setup() although I
> > don't have a strong preference.
> > 
> > nested_vmx_setup_ctls_msrs() also reads some MSRs and sets up part of
> > vmcs_conf, should it be called on the same CPU as setup_vmcs_config()?
> 
> Yes I think so.  I missed this :)
> 
> Not sure whether there are other similar places too even outside of
> hardware_setup().
> 
> But compatibility check only checks things calculated via setup_vmcs_config()
> and nested_vmx_setup_ctls_msrs(), so I think it's fair to only put
> hardware_setup() inside preemption disabled.

Disabling preemption across hardware_setup() isn't feasible as there are a number
of allocations that might sleep.  But disabling preemption isn't necessary to
ensure setup runs on one CPU, that only requires disabling _migration_.  So _if_
we want to handle this in the kernel, we could simply do:

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 541982de5762..9126fdf02649 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9470,7 +9470,9 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
        int r;

        mutex_lock(&vendor_module_lock);
+       migrate_disable();
        r = __kvm_x86_vendor_init(ops);
+       migrate_enable();
        mutex_unlock(&vendor_module_lock);

        return r;

But I'm not convinced we should handle this in the kernel.  Many of the checks,
especially in SVM, query boot_cpu_has(), not this_cpu_has(), i.e. to truly perform
setup on a single CPU, all of those would need to be converted to this_cpu_has().

Some of those boot_cpu_has() calls should be changed regardless of whether or not
migration is disabled, e.g. kvm_is_svm_supported() is arguably straight up buggy
due to cpu_has_svm() checking the boot CPU (I'll fix that by adding a patch after
open coding cpu_has_svm() into kvm_is_svm_supported()[*]).

But things like kvm_timer_init() should NOT be blindly converted to this_cpu_has(),
because the teardown path needs to mirror the setup path, e.g. if KVM ended up
running on frankenstein hardware where not all CPUs have a constant TSC, KVM could
leave a callback dangling and hose the kernel.  Obviously such hardware wouldn't
correctly run VMs, but crashing the kernel is a touch worse than KVM not working
correctly.

I'm not totally against converting to this_cpu_has() for the setup, as it would be
more intuitive in a lot of ways.  But, I don't think pinning the task actually
hardens KVM in a meaningful way.  If there are any divergences between CPUs, then
either KVM will notice before running VMs, e.g. the VMCS sanity checks, or KVM will
never notice, e.g. the myriad runtime paths that check boot_cpu_has() (or variants
thereof) without sanity checking across CPUs.  And if userspace _really_ wants to
have guarantees about how setup is performed, e.g. for repeatable, deterministic
behavior, then userspace should force loading of KVM to be done on CPU0.

So my vote is to leave things as-is (modulo the cpu_has_svm() mess).  But maybe add
documentation to explain the caveats about loading KVM, and how userspace can
mitigate those caveats?

[*] https://lore.kernel.org/all/20221201232655.290720-14-seanjc@google.com
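
For reference, a minimal userspace sketch of the "force loading of KVM to be
done on CPU0" idea mentioned above, assuming Linux with glibc.  The helper
names pin_to_cpu() and load_module_on_cpu() are made up for illustration;
sched_setaffinity() and execlp() are the standard calls.  The sketch relies on
the affinity mask being preserved across execve() and on module init running in
the context of the task that loads the module.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Restrict the calling thread to one CPU.  Returns 0 on success, -1 on error. */
int pin_to_cpu(int cpu)
{
	cpu_set_t set;

	assert(cpu >= 0 && cpu < CPU_SETSIZE);
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(0, sizeof(set), &set);
}

/*
 * Hypothetical helper: pin to @cpu, then replace this process with
 * "modprobe @module".  Only returns (with -1) if something failed.
 */
int load_module_on_cpu(int cpu, const char *module)
{
	if (pin_to_cpu(cpu) != 0) {
		perror("sched_setaffinity");
		return -1;
	}
	execlp("modprobe", "modprobe", module, (char *)NULL);
	perror("execlp");	/* reached only if the exec itself failed */
	return -1;
}
```

Calling load_module_on_cpu(0, "kvm_intel") as root should be roughly what
"taskset -c 0 modprobe kvm_intel" does from a shell; kvm_intel is only an
example module name here.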

Huang, Kai March 8, 2023, 1:20 a.m. UTC | #4

On Tue, 2023-03-07 at 09:17 -0800, Sean Christopherson wrote:
> On Thu, Mar 02, 2023, Huang, Kai wrote:
> > On Thu, 2023-03-02 at 13:36 +0800, Gao, Chao wrote:
> > > On Wed, Mar 01, 2023 at 11:54:38PM +1300, Kai Huang wrote:
> > > > Make setup_vmcs_config() preemption disabled so it always performs on
> > > > the same local cpu.
> > > > 
> > > > During module loading time, KVM intends to call setup_vmcs_config() to
> > > > set up the global VMCS configurations on _one_ cpu in hardware_setup(),
> 
> That may have been the very original intention, but I don't think it has been the
> true intention for a very long time.

Wondering what's the current intention?

> 
> > > > Change the existing setup_vmcs_config() to __setup_vmcs_config() and
> > > > call the latter directly in the compatibility check code path.  Change
> > > > setup_vmcs_config() to call __setup_vmcs_config() with preemption
> > > > disabled so __setup_vmcs_config() is always done on the same cpu.
> > > 
> > > Maybe you can simply disable preemption in hardware_setup() although I
> > > don't have a strong preference.
> > > 
> > > nested_vmx_setup_ctls_msrs() also reads some MSRs and sets up part of
> > > vmcs_conf, should it be called on the same CPU as setup_vmcs_config()?
> > 
> > Yes I think so.  I missed this :)
> > 
> > Not sure whether there are other similar places too even outside of
> > hardware_setup().
> > 
> > But compatibility check only checks things calculated via setup_vmcs_config()
> > and nested_vmx_setup_ctls_msrs(), so I think it's fair to only put
> > hardware_setup() inside preemption disabled.
> 
> Disabling preemption across hardware_setup() isn't feasible as there are a number
> of allocations that might sleep.  But disabling preemption isn't necessary to
> ensure setup runs on one CPU, that only requires disabling _migration_.  So _if_
> we want to handle this in the kernel, we could simply do:
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 541982de5762..9126fdf02649 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9470,7 +9470,9 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
>         int r;
>  
>         mutex_lock(&vendor_module_lock);
> +       migrate_disable();
>         r = __kvm_x86_vendor_init(ops);
> +       migrate_enable();
>         mutex_unlock(&vendor_module_lock);
>  
>         return r;
> 
> 
> But I'm not convinced we should handle this in the kernel.  Many of the checks,
> especially in SVM, query boot_cpu_has(), not this_cpu_has(), i.e. to truly perform
> setup on a single CPU, all of those would need to be converted to this_cpu_has().
> 
> Some of those boot_cpu_has() calls should be changed regardless of whether or not
> migration is disabled, e.g. kvm_is_svm_supported() is arguably straight up buggy
> due to cpu_has_svm() checking the boot CPU (I'll fix that by adding a patch after
> open coding cpu_has_svm() into kvm_is_svm_supported()[*]).
> 
> But things like kvm_timer_init() should NOT be blindly converted to this_cpu_has(),
> because the teardown path needs to mirror the setup path, e.g. if KVM ended up
> running on frankenstein hardware where not all CPUs have a constant TSC, KVM could
> leave a callback dangling and hose the kernel.  Obviously such hardware wouldn't
> correctly run VMs, but crashing the kernel is a touch worse than KVM not working
> correctly.
> 
> I'm not totally against converting to this_cpu_has() for the setup, as it would be
> more intuitive in a lot of ways.  But, I don't think pinning the task actually
> hardens KVM in a meaningful way.  If there are any divergences between CPUs, then
> either KVM will notice before running VMs, e.g. the VMCS sanity checks, or KVM will
> never notice, e.g. the myriad runtime paths that check boot_cpu_has() (or variants
> thereof) without sanity checking across CPUs.  And if userspace _really_ wants to
> have guarantees about how setup is performed, e.g. for repeatable, deterministic
> behavior, then userspace should force loading of KVM to be done on CPU0.

My intention is never for userspace, but simply/purely from compatibility
check's point of view (see below).  Also, I don't think userspace wants to
guarantee anything  -- it just wants to load the KVM module.  It's even arguable
that it may be an acceptable behaviour to fail to run any VM even if loading the
module was successful.

> 
> So my vote is to leave things as-is (modulo the cpu_has_svm() mess).  But maybe add
> documentation to explain the caveats about loading KVM, and how userspace can
> mitigate those caveats?

I made this patch because I have some other patches to move VMXON support out of
KVM in order to support TDX, but so far those patches are not included in that
series (and I'd like to leave it out if we really don't need it).

In the patch to move VMXON out of KVM, I changed to use per-cpu variable to
cache the MSR_IA32_VMX_BASIC value and setup the VMXON region when one CPU is
becoming online.  And setup_vmcs_config() is changed to use __this_cpu_read() to
read the per-cpu MSR value instead of reading from hardware.  Obviously w/o
preempt_disable() or similar __this_cpu_read() can report kernel bug:

        printk(KERN_ERR "BUG: using %s%s() in preemptible [%08x] code: %s/%d\n",
                what1, what2, preempt_count() - 1, current->comm, current->pid);

That being said, I am fine to keep existing code, even w/o documenting.  We can
discuss more how to handle when we really want to move VMXON out of KVM (i.e.
supporting TDX IO?).

Or we can just fix compatibility check part?  For instance, move
setup_vmcs_config() and nested_vmx_setup_ctls_msrs() together in
hardware_setup() and call preempt_disable() around them?
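
To illustrate the __this_cpu_read() concern described above, here is a toy
userspace model, not kernel code.  Everything prefixed model_ is hypothetical,
and the cached value 0x1234 is a placeholder.  It assumes only what the message
states: a per-cpu read from preemptible context trips a debug check, and
wrapping the read in a preempt-disabled section avoids it.

```c
#include <assert.h>

static int model_preempt_count;		/* > 0 means "preemption disabled" */
static int model_bug_reports;		/* how often the debug check fired */
static unsigned long long model_percpu_vmx_basic = 0x1234;	/* placeholder */

void model_preempt_disable(void)
{
	model_preempt_count++;
}

void model_preempt_enable(void)
{
	assert(model_preempt_count > 0);
	model_preempt_count--;
}

/* Stand-in for a checked per-cpu read: complains if the caller is preemptible. */
unsigned long long model_this_cpu_read(void)
{
	if (model_preempt_count == 0)
		model_bug_reports++;	/* the kernel would print the BUG line here */
	return model_percpu_vmx_basic;
}

/* Setup called from preemptible context: the read trips the check. */
unsigned long long model_setup_unguarded(void)
{
	return model_this_cpu_read();
}

/* Setup wrapped in a preempt-disabled section: the read is accepted. */
unsigned long long model_setup_guarded(void)
{
	unsigned long long val;

	model_preempt_disable();
	val = model_this_cpu_read();
	model_preempt_enable();
	return val;
}
```

Both variants return the same value in this model; the only difference is
whether the check fires, which is the point of the wrapper in the patch.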

Sean Christopherson March 8, 2023, 9:03 p.m. UTC | #5

On Wed, Mar 08, 2023, Huang, Kai wrote:
> On Tue, 2023-03-07 at 09:17 -0800, Sean Christopherson wrote:
> > On Thu, Mar 02, 2023, Huang, Kai wrote:
> > > On Thu, 2023-03-02 at 13:36 +0800, Gao, Chao wrote:
> > > > On Wed, Mar 01, 2023 at 11:54:38PM +1300, Kai Huang wrote:
> > > > > Make setup_vmcs_config() preemption disabled so it always performs on
> > > > > the same local cpu.
> > > > > 
> > > > > During module loading time, KVM intends to call setup_vmcs_config() to
> > > > > set up the global VMCS configurations on _one_ cpu in hardware_setup(),
> > 
> > That may have been the very original intention, but I don't think it has been the
> > true intention for a very long time.
> 
> Wondering what's the current intention?

I don't think there's a deliberate "intention" beyond "does it work?".  Like many
of the historical bits of KVM x86, I think this is a case of the original authors
_wanting_ to provide certain behavior, but not actually ensuring that behavior in
code.

> > > > > Change the existing setup_vmcs_config() to __setup_vmcs_config() and
> > > > > call the latter directly in the compatibility check code path.  Change
> > > > > setup_vmcs_config() to call __setup_vmcs_config() with preemption
> > > > > disabled so __setup_vmcs_config() is always done on the same cpu.
> > > > 
> > > > Maybe you can simply disable preemption in hardware_setup() although I
> > > > don't have a strong preference.
> > > > 
> > > > nested_vmx_setup_ctls_msrs() also reads some MSRs and sets up part of
> > > > vmcs_conf, should it be called on the same CPU as setup_vmcs_config()?
> > > 
> > > Yes I think so.  I missed this :)
> > > 
> > > Not sure whether there are other similar places too even outside of
> > > hardware_setup().
> > > 
> > > But compatibility check only checks things calculated via setup_vmcs_config()
> > > and nested_vmx_setup_ctls_msrs(), so I think it's fair to only put
> > > hardware_setup() inside preemption disabled.
> > 
> > Disabling preemption across hardware_setup() isn't feasible as there are a number
> > of allocations that might sleep.  But disabling preemption isn't necessary to
> > ensure setup runs on one CPU, that only requires disabling _migration_.  So _if_
> > we want to handle this in the kernel, we could simply do:
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 541982de5762..9126fdf02649 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -9470,7 +9470,9 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
> >         int r;
> >  
> >         mutex_lock(&vendor_module_lock);
> > +       migrate_disable();
> >         r = __kvm_x86_vendor_init(ops);
> > +       migrate_enable();
> >         mutex_unlock(&vendor_module_lock);
> >  
> >         return r;
> > 
> > 
> > But I'm not convinced we should handle this in the kernel.  Many of the checks,
> > especially in SVM, query boot_cpu_has(), not this_cpu_has(), i.e. to truly perform
> > setup on a single CPU, all of those would need to be converted to this_cpu_has().
> > 
> > Some of those boot_cpu_has() calls should be changed regardless of whether or not
> > migration is disabled, e.g. kvm_is_svm_supported() is arguably straight up buggy
> > due to cpu_has_svm() checking the boot CPU (I'll fix that by adding a patch after
> > open coding cpu_has_svm() into kvm_is_svm_supported()[*]).
> > 
> > But things like kvm_timer_init() should NOT be blindly converted to this_cpu_has(),
> > because the teardown path needs to mirror the setup path, e.g. if KVM ended up
> > running on frankenstein hardware where not all CPUs have a constant TSC, KVM could
> > leave a callback dangling and hose the kernel.  Obviously such hardware wouldn't
> > correctly run VMs, but crashing the kernel is a touch worse than KVM not working
> > correctly.
> > 
> > I'm not totally against converting to this_cpu_has() for the setup, as it would be
> > more intuitive in a lot of ways.  But, I don't think pinning the task actually
> > hardens KVM in a meaningful way.  If there are any divergences between CPUs, then
> > either KVM will notice before running VMs, e.g. the VMCS sanity checks, or KVM will
> > never notice, e.g. the myriad runtime paths that check boot_cpu_has() (or variants
> > thereof) without sanity checking across CPUs.  And if userspace _really_ wants to
> > have guarantees about how setup is performed, e.g. for repeatable, deterministic
> > behavior, then userspace should force loading of KVM to be done on CPU0.
> 
> My intention is never for userspace, but simply/purely from compatibility
> check's point of view (see below).  Also, I don't think userspace wants to
> guarantee anything  -- it just wants to load the KVM module.

That very much depends on the use case.  For personal usage of KVM, it's extremely
unlikely that userspace is doing anything remotely sophisticated.  But for a more
"formal" deployment, userspace absolutely has its hands all over the system, e.g.
scheduling VMs across systems, monitoring the health of the system, etc.  Whether
or not userspaces actually do tightly control loading KVM is another matter...

> It's even arguable that it may be an acceptable behaviour to fail to run any
> VM even if loading the module was successful.
> 
> > 
> > So my vote is to leave things as-is (modulo the cpu_has_svm() mess).  But maybe add
> > documentation to explain the caveats about loading KVM, and how userspace can
> > mitigate those caveats?
> 
> I made this patch because I have some other patches to move VMXON support out of
> KVM in order to support TDX, but so far those patches are not included in that
> series (and I'd like to leave it out if we really don't need it).

Me too. :-)

> In the patch to move VMXON out of KVM, I changed to use per-cpu variable to
> cache the MSR_IA32_VMX_BASIC value and setup the VMXON region when one CPU is
> becoming online.  And setup_vmcs_config() is changed to use __this_cpu_read() to
> read the per-cpu MSR value instead of reading from hardware.  Obviously w/o
> preempt_disable() or similar __this_cpu_read() can report kernel bug:
> 
>         printk(KERN_ERR "BUG: using %s%s() in preemptible [%08x] code: %s/%d\n",
>                 what1, what2, preempt_count() - 1, current->comm, current->pid);
> 
> That being said, I am fine to keep existing code, even w/o documenting.  We can
> discuss more how to handle when we really want to move VMXON out of KVM (i.e.
> supporting TDX IO?).
> 
> Or we can just fix compatibility check part?  For instance, move
> setup_vmcs_config() and nested_vmx_setup_ctls_msrs() together in
> hardware_setup() and call preempt_disable() around them?

Eh, the compatibility checks we really care about run in IRQ context, i.e. they're
guaranteed to have a stable CPU.  Splitting the _setup_ for the compatibility
checks across multiple CPUs isn't a problem because KVM will still get the right
"answer", i.e. any divergence will be detected (barring _very_ flaky hardware that
might get false negatives anyways).

Don't get me wrong, I agree it's ugly, but I don't want to go halfway.  I either
want to guard the whole thing, or nothing, and I can't convince myself that
guarding everything is worthwhile since userspace can (and IMO should) do a better
job.
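
The "any divergence will be detected" argument above can be sketched with a
small userspace model.  This is not KVM code: struct caps, its two fields and
the function names are invented for illustration, and MODEL_NR_CPUS is an
arbitrary size.  The setup reads each field on a possibly different CPU, as a
migrated task might, and the check then compares every CPU against the result.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4

/* Toy stand-in for the capability values one CPU reports. */
struct caps {
	unsigned int pin_ctls;
	unsigned int exit_ctls;
};

/*
 * Build the global config the way a migrated setup task might: the first
 * field is read on cpu_a, the second on cpu_b.
 */
struct caps setup_global(const struct caps *cpus, int cpu_a, int cpu_b)
{
	struct caps global;

	assert(cpu_a >= 0 && cpu_a < MODEL_NR_CPUS);
	assert(cpu_b >= 0 && cpu_b < MODEL_NR_CPUS);
	global.pin_ctls = cpus[cpu_a].pin_ctls;
	global.exit_ctls = cpus[cpu_b].exit_ctls;
	return global;
}

/* Per-CPU check: every CPU compares its own values with the global config. */
bool all_cpus_compatible(const struct caps *cpus, const struct caps *global)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		if (cpus[cpu].pin_ctls != global->pin_ctls ||
		    cpus[cpu].exit_ctls != global->exit_ctls)
			return false;
	}
	return true;
}
```

If the check passes, every CPU equals the global config and therefore every
other CPU, so a real divergence always makes at least one CPU fail, whichever
CPUs the setup happened to read from.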

Huang, Kai March 9, 2023, 2:11 a.m. UTC | #6

On Wed, 2023-03-08 at 13:03 -0800, Sean Christopherson wrote:
> On Wed, Mar 08, 2023, Huang, Kai wrote:
> > On Tue, 2023-03-07 at 09:17 -0800, Sean Christopherson wrote:
> > > On Thu, Mar 02, 2023, Huang, Kai wrote:
> > > > On Thu, 2023-03-02 at 13:36 +0800, Gao, Chao wrote:
> > > > > On Wed, Mar 01, 2023 at 11:54:38PM +1300, Kai Huang wrote:
> > > > > > Make setup_vmcs_config() preemption disabled so it always performs on
> > > > > > the same local cpu.
> > > > > > 
> > > > > > During module loading time, KVM intends to call setup_vmcs_config() to
> > > > > > set up the global VMCS configurations on _one_ cpu in hardware_setup(),
> > > 
> > > That may have been the very original intention, but I don't think it has been the
> > > true intention for a very long time.
> > 
> > Wondering what's the current intention?
> 
> I don't think there's a deliberate "intention" beyond "does it work?".  Like many
> of the historical bits of KVM x86, I think this is a case of the original authors
> _wanting_ to provide certain behavior, but not actually ensuring that behavior in
> code.

Yep.

> 
> > > > > > Change the existing setup_vmcs_config() to __setup_vmcs_config() and
> > > > > > call the latter directly in the compatibility check code path.  Change
> > > > > > setup_vmcs_config() to call __setup_vmcs_config() with preemption
> > > > > > disabled so __setup_vmcs_config() is always done on the same cpu.
> > > > > 
> > > > > Maybe you can simply disable preemption in hardware_setup() although I
> > > > > don't have a strong preference.
> > > > > 
> > > > > nested_vmx_setup_ctls_msrs() also reads some MSRs and sets up part of
> > > > > vmcs_conf, should it be called on the same CPU as setup_vmcs_config()?
> > > > 
> > > > Yes I think so.  I missed this :)
> > > > 
> > > > Not sure whether there are other similar places too even outside of
> > > > hardware_setup().
> > > > 
> > > > But compatibility check only checks things calculated via setup_vmcs_config()
> > > > and nested_vmx_setup_ctls_msrs(), so I think it's fair to only put
> > > > hardware_setup() inside preemption disabled.
> > > 
> > > Disabling preemption across hardware_setup() isn't feasible as there are a number
> > > of allocations that might sleep.  But disabling preemption isn't necessary to
> > > ensure setup runs on one CPU, that only requires disabling _migration_.  So _if_
> > > we want to handle this in the kernel, we could simply do:
> > > 
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 541982de5762..9126fdf02649 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -9470,7 +9470,9 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
> > >         int r;
> > >  
> > >         mutex_lock(&vendor_module_lock);
> > > +       migrate_disable();
> > >         r = __kvm_x86_vendor_init(ops);
> > > +       migrate_enable();
> > >         mutex_unlock(&vendor_module_lock);
> > >  
> > >         return r;
> > > 
> > > 
> > > But I'm not convinced we should handle this in the kernel.  Many of the checks,
> > > especially in SVM, query boot_cpu_has(), not this_cpu_has(), i.e. to truly perform
> > > setup on a single CPU, all of those would need to be converted to this_cpu_has().
> > > 
> > > Some of those boot_cpu_has() calls should be changed regardless of whether or not
> > > migration is disabled, e.g. kvm_is_svm_supported() is arguably straight up buggy
> > > due to cpu_has_svm() checking the boot CPU (I'll fix that by adding a patch after
> > > open coding cpu_has_svm() into kvm_is_svm_supported()[*]).
> > > 
> > > But things like kvm_timer_init() should NOT be blindly converted to this_cpu_has(),
> > > because the teardown path needs to mirror the setup path, e.g. if KVM ended up
> > > running on frankenstein hardware where not all CPUs have a constant TSC, KVM could
> > > leave a callback dangling and hose the kernel.  Obviously such hardware wouldn't
> > > correctly run VMs, but crashing the kernel is a touch worse than KVM not working
> > > correctly.
> > > 
> > > I'm not totally against converting to this_cpu_has() for the setup, as it would be
> > > more intuitive in a lot of ways.  But, I don't think pinning the task actually
> > > hardens KVM in a meaningful way.  If there are any divergences between CPUs, then
> > > either KVM will notice before running VMs, e.g. the VMCS sanity checks, or KVM will
> > > never notice, e.g. the myriad runtime paths that check boot_cpu_has() (or variants
> > > thereof) without sanity checking across CPUs.  And if userspace _really_ wants to
> > > have guarantees about how setup is performed, e.g. for repeatable, deterministic
> > > behavior, then userspace should force loading of KVM to be done on CPU0.
> > 
> > My intention is never for userspace, but simply/purely from compatibility
> > check's point of view (see below).  Also, I don't think userspace wants to
> > guarantee anything  -- it just wants to load the KVM module.
> 
> That very much depends on the use case.  For personal usage of KVM, it's extremely
> unlikely that userspace is doing anything remotely sophisticated.  But for a more
> "formal" deployment, userspace absolutely has its hands all over the system, e.g.
> scheduling VMs across systems, monitoring the health of the system, etc.  Whether
> or not userspaces actually do tightly control loading KVM is another matter...

Agreed.

> 
> > It's even arguable that it may be an acceptable behaviour to fail to run any
> > VM even if loading the module was successful.
> > 
> > > 
> > > So my vote is to leave things as-is (modulo the cpu_has_svm() mess).  But maybe add
> > > documentation to explain the caveats about loading KVM, and how userspace can
> > > mitigate those caveats?
> > 
> > I made this patch because I have some other patches to move VMXON support out of
> > KVM in order to support TDX, but so far those patches are not included in that
> > series (and I'd like to leave it out if we really don't need it).
> 
> Me too. :-)
> 
> > In the patch to move VMXON out of KVM, I changed to use per-cpu variable to
> > cache the MSR_IA32_VMX_BASIC value and setup the VMXON region when one CPU is
> > becoming online.  And setup_vmcs_config() is changed to use __this_cpu_read() to
> > read the per-cpu MSR value instead of reading from hardware.  Obviously w/o
> > preempt_disable() or similar __this_cpu_read() can report kernel bug:
> > 
> >         printk(KERN_ERR "BUG: using %s%s() in preemptible [%08x] code: %s/%d\n",
> >                 what1, what2, preempt_count() - 1, current->comm, current->pid);
> > 
> > That being said, I am fine to keep existing code, even w/o documenting.  We can
> > discuss more how to handle when we really want to move VMXON out of KVM (i.e.
> > supporting TDX IO?).
> > 
> > Or we can just fix compatibility check part?  For instance, move
> > setup_vmcs_config() and nested_vmx_setup_ctls_msrs() together in
> > hardware_setup() and call preempt_disable() around them?
> 
> Eh, the compatibility checks we really care about run in IRQ context, i.e. they're
> guaranteed to have a stable CPU.  Splitting the _setup_ for the compatibility
> checks across multiple CPUs isn't a problem because KVM will still get the right
> "answer", i.e. any divergence will be detected (barring _very_ flaky hardware that
> might get false negatives anyways).
> 
> Don't get me wrong, I agree it's ugly, but I don't want to go halfway.  I either
> want to guard the whole thing, or nothing, and I can't convince myself that
> guarding everything is worthwhile since userspace can (and IMO should) do a better
> job.

Agreed.

Let's just leave the current code as is.

Thanks for your time!
