Message ID | b46ee4de968733a69117458e9f8f9d2a6682376f.camel@infradead.org (mailing list archive)
---|---
State | New, archived
Series | [RFC] KVM: x86: Don't wipe TDP MMU when guest sets %cr4
On Tue, Oct 10, 2023 at 2:37 PM David Woodhouse <dwmw2@infradead.org> wrote:
>
> If I understand things correctly, the point of the TDP MMU is to use page tables such as EPT for GPA → HPA translations, but let the virtualization support in the CPU handle all of the *virtual* addressing and page tables, including the non-root mode %cr3/%cr4.
>
> I have a guest which loves to flip the SMEP bit on and off in %cr4 all the time. The guest is actually Xen, in its 'PV shim' mode which enables it to support a single PV guest, while running in a true hardware virtual machine:
> https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg00497.html
>
> The performance is *awful*, since as far as I can tell, on every flip KVM flushes the entire EPT. I understand why that might be necessary for the mode where KVM is building up a set of shadow page tables to directly map GVA → HPA and be loaded into %cr3 of a CPU that doesn't support native EPT translations. But I don't understand why the TDP MMU would need to do it. Surely we don't have to change anything in the EPT just because the stuff in the non-root-mode %cr3/%cr4 changes?
>
> So I tried this, and it went faster and nothing appears to have blown up.
>
> Am I missing something? Is this stupidly wrong?

The TDP MMU is essentially a virtual TLB of guest-physical mappings, so it's oblivious to %cr4 changes. However, the hardware TLBs of the current logical processor contain combined mappings, which may have to be flushed, depending on how %cr4 has changed. Since KVM has emulated the instruction, it is responsible for flushing the hardware TLBs as required (see the SDM, volume 3, section 4.10.4: Operations that Invalidate TLBs and Paging-Structure Caches). Some of the logic in kvm_post_set_cr4() seems to deal with that.

Not your fault, of course, but can't we come up with a better name than "kvm_post_set_cr4"?
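For illustration, a minimal, self-contained sketch of the kind of check the SDM rule implies. This is not the actual kvm_post_set_cr4() logic; the helper name is invented and the bit list is a simplified subset of the cases in SDM 4.10.4:

#include <stdbool.h>

#define CR4_PGE   (1UL << 7)    /* global pages */
#define CR4_PCIDE (1UL << 17)   /* process-context identifiers */
#define CR4_SMEP  (1UL << 20)   /* supervisor-mode execution prevention */

/*
 * When KVM emulates a MOV to CR4, it must invalidate cached translations on
 * the guest's behalf if the architectural rules say the write would do so on
 * bare metal, even though the EPT/TDP tables themselves are unaffected.
 */
static bool cr4_write_needs_guest_tlb_flush(unsigned long old_cr4,
					    unsigned long new_cr4)
{
	/* Toggling CR4.PGE invalidates all TLB entries. */
	if ((old_cr4 ^ new_cr4) & CR4_PGE)
		return true;

	/* Clearing CR4.PCIDE invalidates all cached translations. */
	if ((old_cr4 & CR4_PCIDE) && !(new_cr4 & CR4_PCIDE))
		return true;

	/* Clearing CR4.SMEP invalidates entries for the current PCID. */
	if ((old_cr4 & CR4_SMEP) && !(new_cr4 & CR4_SMEP))
		return true;

	return false;
}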
On Tue, Oct 10, 2023, David Woodhouse wrote:
> If I understand things correctly, the point of the TDP MMU is to use page tables such as EPT for GPA → HPA translations, but let the virtualization support in the CPU handle all of the *virtual* addressing and page tables, including the non-root mode %cr3/%cr4.
>
> I have a guest which loves to flip the SMEP bit on and off in %cr4 all the time. The guest is actually Xen, in its 'PV shim' mode which enables it to support a single PV guest, while running in a true hardware virtual machine:
> https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg00497.html
>
> The performance is *awful*, since as far as I can tell, on every flip KVM flushes the entire EPT. I understand why that might be necessary for the mode where KVM is building up a set of shadow page tables to directly map GVA → HPA and be loaded into %cr3 of a CPU that doesn't support native EPT translations. But I don't understand why the TDP MMU would need to do it. Surely we don't have to change anything in the EPT just because the stuff in the non-root-mode %cr3/%cr4 changes?
>
> So I tried this, and it went faster and nothing appears to have blown up.
>
> Am I missing something? Is this stupidly wrong?

Heh, you're in luck, because regardless of what your darn pronoun "this" refers to, the answer is yes, "this" is stupidly wrong.

The below is stupidly wrong. KVM needs to at least reconfigure the guest's paging metadata that is used to translate GVAs to GPAs during emulation.

But the TDP MMU behavior *was* also stupidly wrong. The reason that two vCPUs suck less is because KVM would zap SPTEs (EPT roots) if and only if *both* vCPUs unloaded their roots at the same time.

Commit edbdb43fc96b ("KVM: x86: Preserve TDP MMU roots until they are explicitly invalidated") should fix the behavior you're seeing.

And if we want to try and make SMEP blazing fast on Intel, we can probably let the guest write it directly, i.e. give SMEP the same treatment as CR0.WP. See commits cf9f4c0eb169 ("KVM: x86/mmu: Refresh CR0.WP prior to checking for emulated permission faults") and fb509f76acc8 ("KVM: VMX: Make CR0.WP a guest owned bit").

Oh, and if your userspace is doing something silly like constantly creating and deleting memslots, see commit 0df9dab891ff ("KVM: x86/mmu: Stop zapping invalidated TDP MMU roots asynchronously").

> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1072,7 +1074,8 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>  	if (kvm_x86_ops.set_cr4(vcpu, cr4))
>  		return 1;
>
> -	kvm_post_set_cr4(vcpu, old_cr4, cr4);
> +	if (!vcpu->kvm->arch.tdp_mmu_enabled)
> +		kvm_post_set_cr4(vcpu, old_cr4, cr4);
>
>  	if ((cr4 ^ old_cr4) & (X86_CR4_OSXSAVE | X86_CR4_PKE))
>  		kvm_update_cpuid_runtime(vcpu);
>
> Also... if I have *two* vCPUs it doesn't go quite as slowly while Xen starts Grub and then Grub boots a Linux kernel. Until the kernel brings up its second vCPU and *then* it starts going really slowly again. Is that because the TDP roots are refcounted, and that idle vCPU holds onto the unused one and prevents it from being completely thrown away? Until the vCPU stops being idle and starts flipping SMEP on/off on Linux←→Xen transitions too?
>
> In practice, there's not a lot of point in Xen using SMEP when it's purely acting as a library for its *one* guest, living in an HVM container. The above patch speeds things up but telling Xen not to use SMEP at all makes things go *much* faster. But if I'm not being *entirely* stupid, there may be some generic improvements for KVM+TDPMMU here somewhere so it's worth making a fool of myself by asking...?
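As a rough, hypothetical sketch of what "give SMEP the same treatment as CR0.WP" would mean (conceptual only, not the actual VMX implementation; the helper names are invented): the bit is dropped from the intercept mask so guest writes no longer exit, and KVM refreshes its cached view of the bit on demand before an emulated permission check, as is done for CR0.WP since fb509f76acc8:

#include <stdbool.h>

#define CR4_SMEP (1UL << 20)

/*
 * With TDP, the EPT tables do not depend on CR4.SMEP, so the bit could be
 * left "guest owned", i.e. removed from the CR4 guest/host mask so that
 * toggling it does not cause a VM exit. Shadow paging still needs to see
 * every change.
 */
static unsigned long cr4_possible_guest_owned_bits(bool tdp_enabled)
{
	return tdp_enabled ? CR4_SMEP : 0;
}

/*
 * Because guest-owned writes don't exit, any cached copy of CR4 may be
 * stale; refresh the SMEP bit from the current hardware value before an
 * emulated permission check that depends on it.
 */
static void refresh_smep_before_emulation(unsigned long *cached_cr4,
					  unsigned long hw_cr4)
{
	*cached_cr4 = (*cached_cr4 & ~CR4_SMEP) | (hw_cr4 & CR4_SMEP);
}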
On Tue, 2023-10-10 at 16:25 -0700, Sean Christopherson wrote:
> On Tue, Oct 10, 2023, David Woodhouse wrote:
> > If I understand things correctly, the point of the TDP MMU is to use page tables such as EPT for GPA → HPA translations, but let the virtualization support in the CPU handle all of the *virtual* addressing and page tables, including the non-root mode %cr3/%cr4.
> >
> > I have a guest which loves to flip the SMEP bit on and off in %cr4 all the time. The guest is actually Xen, in its 'PV shim' mode which enables it to support a single PV guest, while running in a true hardware virtual machine:
> > https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg00497.html
> >
> > The performance is *awful*, since as far as I can tell, on every flip KVM flushes the entire EPT. I understand why that might be necessary for the mode where KVM is building up a set of shadow page tables to directly map GVA → HPA and be loaded into %cr3 of a CPU that doesn't support native EPT translations. But I don't understand why the TDP MMU would need to do it. Surely we don't have to change anything in the EPT just because the stuff in the non-root-mode %cr3/%cr4 changes?
> >
> > So I tried this, and it went faster and nothing appears to have blown up.
> >
> > Am I missing something? Is this stupidly wrong?
>
> Heh, you're in luck, because regardless of what your darn pronoun "this" refers to, the answer is yes, "this" is stupidly wrong.

Hehe. Thought that might be the case. Thank you for the coherent explanation and especially for the references.

(I hadn't got the PV shim working in qemu yet. I shall bump that up my TODO list by a few quanta, as it would have let me run this test on a newer kernel.)

> The below is stupidly wrong. KVM needs to at least reconfigure the guest's paging metadata that is used to translate GVAs to GPAs during emulation.
>
> But the TDP MMU behavior *was* also stupidly wrong. The reason that two vCPUs suck less is because KVM would zap SPTEs (EPT roots) if and only if *both* vCPUs unloaded their roots at the same time.
>
> Commit edbdb43fc96b ("KVM: x86: Preserve TDP MMU roots until they are explicitly invalidated") should fix the behavior you're seeing.
>
> And if we want to try and make SMEP blazing fast on Intel, we can probably let the guest write it directly, i.e. give SMEP the same treatment as CR0.WP. See commits cf9f4c0eb169 ("KVM: x86/mmu: Refresh CR0.WP prior to checking for emulated permission faults") and fb509f76acc8 ("KVM: VMX: Make CR0.WP a guest owned bit").

Thanks. In fact, looking at __kvm_mmu_refresh_passthrough_bits() in commit cf9f4c0eb169 ("KVM: x86/mmu: Refresh CR0.WP…"), shouldn't it be looking at the SMEP bit in %cr4 anyway?

In kvm_calc_cpu_role() the value of ____is_cr0_wp() is used *three* times, for setting role.base.{cr0_wp,smep_andnot_wp,smap_andnot_wp}. But __kvm_mmu_refresh_passthrough_bits() only refreshes role.base.cr0_wp and not the other two. Do we need this?

--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5159,6 +5159,8 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
 		return;

 	mmu->cpu_role.base.cr0_wp = cr0_wp;
+	mmu->cpu_role.base.smep_andnot_wp = mmu->cpu_role.ext.cr4_smep && !cr0_wp;
+	mmu->cpu_role.base.smap_andnot_wp = mmu->cpu_role.ext.cr4_smap && !cr0_wp;
 	reset_guest_paging_metadata(vcpu, mmu);
 }

But I'm confused here. Even if I don't go as far as actually making CR4.SMEP a guest-owned bit, and KVM still ends up handling it in kvm_post_set_cr4()... why does KVM need to completely unload and reinit the MMU? Would it not be sufficient just to refresh the role bits, much like __kvm_mmu_refresh_passthrough_bits() does for CR0.WP?

(And what about flushing the hardware TLB, as Jim mentioned. I guess if it's guest-owned we trust the CPU to do that, and if it's trapped then KVM is required to do so?)

> Oh, and if your userspace is doing something silly like constantly creating and deleting memslots, see commit 0df9dab891ff ("KVM: x86/mmu: Stop zapping invalidated TDP MMU roots asynchronously").

Oooh, I'll play with that. Thanks! Although I bristle at 'silly' :)

Userspace only wants to add and remove single overlay pages. That's not so silly; it's explicitly requested by the guest and required for PV drivers to work. AIUI it's the KVM API which means that userspace needs to stop all vCPUs, *remove* the original memslot, add back the two halves of the original memslot with the newly overlaid page in the middle, and then let all the vCPUs run again.

We don't need an atomic way to change *all* the memslots at once, but a way to atomically overlay a *single* range would make the whole enterprise feel less 'silly'.
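For reference, a minimal userspace sketch of the memslot dance described above, using the standard KVM_SET_USER_MEMORY_REGION ioctl. The slot numbers, page size and layout are made up for illustration, and the vCPUs are assumed to already be paused:

#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <stdint.h>

/*
 * Replace one guest page of an existing RAM slot with an overlay page.
 * KVM has no "punch a hole" operation, so userspace must delete the old
 * slot and re-add it as two halves around the overlay page.
 */
static int overlay_one_page(int vm_fd, uint64_t ram_gpa, uint64_t ram_size,
			    void *ram_hva, uint64_t overlay_gpa,
			    void *overlay_hva)
{
	struct kvm_userspace_memory_region r;

	/* 1. Delete the original slot (memory_size == 0 means delete). */
	r = (struct kvm_userspace_memory_region) {
		.slot = 0, .guest_phys_addr = ram_gpa, .memory_size = 0,
	};
	if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &r) < 0)
		return -1;

	/* 2. Re-add the RAM below the overlaid page. */
	r = (struct kvm_userspace_memory_region) {
		.slot = 0,
		.guest_phys_addr = ram_gpa,
		.memory_size = overlay_gpa - ram_gpa,
		.userspace_addr = (uintptr_t)ram_hva,
	};
	if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &r) < 0)
		return -1;

	/* 3. Re-add the RAM above it. */
	r = (struct kvm_userspace_memory_region) {
		.slot = 1,
		.guest_phys_addr = overlay_gpa + 4096,
		.memory_size = ram_gpa + ram_size - (overlay_gpa + 4096),
		.userspace_addr = (uintptr_t)ram_hva + (overlay_gpa + 4096 - ram_gpa),
	};
	if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &r) < 0)
		return -1;

	/* 4. Map the overlay page itself in its own slot. */
	r = (struct kvm_userspace_memory_region) {
		.slot = 2,
		.guest_phys_addr = overlay_gpa,
		.memory_size = 4096,
		.userspace_addr = (uintptr_t)overlay_hva,
	};
	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &r);
}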
On Wed, Oct 11, 2023 at 1:21 AM David Woodhouse <dwmw2@infradead.org> wrote:
>
> ...
> But I'm confused here. Even if I don't go as far as actually making CR4.SMEP a guest-owned bit, and KVM still ends up handling it in kvm_post_set_cr4()... why does KVM need to completely unload and reinit the MMU? Would it not be sufficient just to refresh the role bits, much like __kvm_mmu_refresh_passthrough_bits() does for CR0.WP?
>
> (And what about flushing the hardware TLB, as Jim mentioned. I guess if it's guest-owned we trust the CPU to do that, and if it's trapped then KVM is required to do so?)

Yes, guest-owned bits become the responsibility of the CPU.

With the TDP MMU, KVM probably should not intercept writes to %cr4 at all. I'd argue that it's better to construct guest paging attributes from first principles every time KVM needs to know, rather than intercepting any operation that may change guest paging attributes. If that's too expensive, perhaps KVM is emulating too many instructions. :)

Particularly aggravating to me is that too many of these cost/benefit decisions were made without consideration of nested virtualization. At one time, someone observed that Linux hardly ever writes %cr4, so it must be cheap to intercept. How ironic that they didn't consider the fact that KVM is part of Linux, and KVM writes %cr4 all the time!

If the cost of determining paging attributes from first principles is really prohibitive, maybe the solution is more flexibility. KVM can start by intercepting writes to %cr4, but if the rate of intercepts is egregious, it should do something else.
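A toy illustration of the "first principles" approach (the structure and helper name are invented here, not taken from KVM): derive the paging attributes the emulator cares about from the raw control register values at the moment they are needed, instead of caching a role that must be refreshed on every intercepted write:

#include <stdbool.h>

#define CR0_WP    (1UL << 16)
#define CR4_LA57  (1UL << 12)
#define CR4_SMEP  (1UL << 20)
#define CR4_SMAP  (1UL << 21)
#define EFER_NX   (1ULL << 11)

struct guest_paging_attrs {
	bool wp, smep, smap, nx, la57;
};

/* Recompute everything from the current register values on each use. */
static struct guest_paging_attrs derive_paging_attrs(unsigned long cr0,
						      unsigned long cr4,
						      unsigned long long efer)
{
	return (struct guest_paging_attrs) {
		.wp   = cr0 & CR0_WP,
		.smep = cr4 & CR4_SMEP,
		.smap = cr4 & CR4_SMAP,
		.la57 = cr4 & CR4_LA57,
		.nx   = efer & EFER_NX,
	};
}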
On 10/11/23 10:20, David Woodhouse wrote:
> But __kvm_mmu_refresh_passthrough_bits() only refreshes role.base.cr0_wp and not the other two. Do we need this?
>
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5159,6 +5159,8 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
>  		return;
>
>  	mmu->cpu_role.base.cr0_wp = cr0_wp;
> +	mmu->cpu_role.base.smep_andnot_wp = mmu->cpu_role.ext.cr4_smep && !cr0_wp;
> +	mmu->cpu_role.base.smap_andnot_wp = mmu->cpu_role.ext.cr4_smap && !cr0_wp;
>  	reset_guest_paging_metadata(vcpu, mmu);
>  }

{smep,smap}_andnot_wp only matter for shadow paging. You can remove them from this function, which is not called for shadow paging anyway, and instead set them in the root_role in kvm_init_shadow_mmu.

Paolo
On Wed, Oct 11, 2023, David Woodhouse wrote:
> On Tue, 2023-10-10 at 16:25 -0700, Sean Christopherson wrote:
> > On Tue, Oct 10, 2023, David Woodhouse wrote:
> But I'm confused here. Even if I don't go as far as actually making CR4.SMEP a guest-owned bit, and KVM still ends up handling it in kvm_post_set_cr4()... why does KVM need to completely unload and reinit the MMU? Would it not be sufficient just to refresh the role bits, much like __kvm_mmu_refresh_passthrough_bits() does for CR0.WP?

Maybe? It largely hasn't happened simply because no one (yet) cares enough about the performance of other bits to force the issue.

> (And what about flushing the hardware TLB, as Jim mentioned. I guess if it's guest-owned we trust the CPU to do that, and if it's trapped then KVM is required to do so?)

TBH, I forgot that clearing SMEP architecturally requires a TLB flush for the current PCID on Intel. I *think* it's actually fine so long as TDP is enabled. Ah, yes, it's fine. Well, either that or kvm_invalidate_pcid() is buggy :-)

Relying on the CPU to flush the hardware TLBs is definitely ok, I was just trying to think if there were any KVM artifacts that would need to be flushed/invalidated. I can't think of any, e.g. KVM already disables GVA-based MMIO caching when TDP is enabled, L1 is responsible for its virtual TLB if L1 is shadowing legacy paging for L2, and if L1 is using EPT, i.e. KVM is shadowing L1 EPT and thus has a virtual TLB for L2, then the PCID is irrelevant (it doesn't affect EPT translations).
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1072,7 +1074,8 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 	if (kvm_x86_ops.set_cr4(vcpu, cr4))
 		return 1;

-	kvm_post_set_cr4(vcpu, old_cr4, cr4);
+	if (!vcpu->kvm->arch.tdp_mmu_enabled)
+		kvm_post_set_cr4(vcpu, old_cr4, cr4);

 	if ((cr4 ^ old_cr4) & (X86_CR4_OSXSAVE | X86_CR4_PKE))