Message ID | b46ee4de968733a69117458e9f8f9d2a6682376f.camel@infradead.org (mailing list archive)
---|---
State | New, archived
Series | [RFC] KVM: x86: Don't wipe TDP MMU when guest sets %cr4
On Tue, Oct 10, 2023 at 2:37 PM David Woodhouse <dwmw2@infradead.org> wrote:
>
> If I understand things correctly, the point of the TDP MMU is to use page tables such as EPT for GPA → HPA translations, but let the virtualization support in the CPU handle all of the *virtual* addressing and page tables, including the non-root mode %cr3/%cr4.
>
> I have a guest which loves to flip the SMEP bit on and off in %cr4 all the time. The guest is actually Xen, in its 'PV shim' mode which enables it to support a single PV guest, while running in a true hardware virtual machine:
> https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg00497.html
>
> The performance is *awful*, since as far as I can tell, on every flip KVM flushes the entire EPT. I understand why that might be necessary for the mode where KVM is building up a set of shadow page tables to directly map GVA → HPA and be loaded into %cr3 of a CPU that doesn't support native EPT translations. But I don't understand why the TDP MMU would need to do it. Surely we don't have to change anything in the EPT just because the stuff in the non-root-mode %cr3/%cr4 changes?
>
> So I tried this, and it went faster and nothing appears to have blown up.
>
> Am I missing something? Is this stupidly wrong?

The TDP MMU is essentially a virtual TLB of guest-physical mappings, so it's oblivious to %cr4 changes. However, the hardware TLBs of the current logical processor contain combined mappings, which may have to be flushed, depending on how %cr4 has changed. Since KVM has emulated the instruction, it is responsible for flushing the hardware TLBs as required (see the SDM, volume 3, section 4.10.4: Operations that Invalidate TLBs and Paging-Structure Caches). Some of the logic in kvm_post_set_cr4() seems to deal with that.

Not your fault, of course, but can't we come up with a better name than "kvm_post_set_cr4"?
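For illustration, a minimal, self-contained sketch of the kind of check the SDM rule implies. This is not the actual kvm_post_set_cr4() logic; the helper name is invented and the bit list is a simplified subset of the cases in SDM 4.10.4:

#include <stdbool.h>

#define CR4_PGE   (1UL << 7)    /* global pages */
#define CR4_PCIDE (1UL << 17)   /* process-context identifiers */
#define CR4_SMEP  (1UL << 20)   /* supervisor-mode execution prevention */

/*
 * When KVM emulates a MOV to CR4, it must invalidate cached translations on
 * the guest's behalf if the architectural rules say the write would do so on
 * bare metal, even though the EPT/TDP tables themselves are unaffected.
 */
static bool cr4_write_needs_guest_tlb_flush(unsigned long old_cr4,
					    unsigned long new_cr4)
{
	/* Toggling CR4.PGE invalidates all TLB entries. */
	if ((old_cr4 ^ new_cr4) & CR4_PGE)
		return true;

	/* Clearing CR4.PCIDE invalidates all cached translations. */
	if ((old_cr4 & CR4_PCIDE) && !(new_cr4 & CR4_PCIDE))
		return true;

	/* Clearing CR4.SMEP invalidates entries for the current PCID. */
	if ((old_cr4 & CR4_SMEP) && !(new_cr4 & CR4_SMEP))
		return true;

	return false;
}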
On Tue, Oct 10, 2023, David Woodhouse wrote:
> If I understand things correctly, the point of the TDP MMU is to use page tables such as EPT for GPA → HPA translations, but let the virtualization support in the CPU handle all of the *virtual* addressing and page tables, including the non-root mode %cr3/%cr4.
>
> I have a guest which loves to flip the SMEP bit on and off in %cr4 all the time. The guest is actually Xen, in its 'PV shim' mode which enables it to support a single PV guest, while running in a true hardware virtual machine:
> https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg00497.html
>
> The performance is *awful*, since as far as I can tell, on every flip KVM flushes the entire EPT. I understand why that might be necessary for the mode where KVM is building up a set of shadow page tables to directly map GVA → HPA and be loaded into %cr3 of a CPU that doesn't support native EPT translations. But I don't understand why the TDP MMU would need to do it. Surely we don't have to change anything in the EPT just because the stuff in the non-root-mode %cr3/%cr4 changes?
>
> So I tried this, and it went faster and nothing appears to have blown up.
>
> Am I missing something? Is this stupidly wrong?

Heh, you're in luck, because regardless of what your darn pronoun "this" refers to, the answer is yes, "this" is stupidly wrong.

The below is stupidly wrong. KVM needs to at least reconfigure the guest's paging metadata that is used to translate GVAs to GPAs during emulation.

But the TDP MMU behavior *was* also stupidly wrong. The reason that two vCPUs suck less is because KVM would zap SPTEs (EPT roots) if and only if *both* vCPUs unloaded their roots at the same time.

Commit edbdb43fc96b ("KVM: x86: Preserve TDP MMU roots until they are explicitly invalidated") should fix the behavior you're seeing.

And if we want to try and make SMEP blazing fast on Intel, we can probably let the guest write it directly, i.e. give SMEP the same treatment as CR0.WP. See commits cf9f4c0eb169 ("KVM: x86/mmu: Refresh CR0.WP prior to checking for emulated permission faults") and fb509f76acc8 ("KVM: VMX: Make CR0.WP a guest owned bit").

Oh, and if your userspace is doing something silly like constantly creating and deleting memslots, see commit 0df9dab891ff ("KVM: x86/mmu: Stop zapping invalidated TDP MMU roots asynchronously").

> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1072,7 +1074,8 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>  	if (kvm_x86_ops.set_cr4(vcpu, cr4))
>  		return 1;
>
> -	kvm_post_set_cr4(vcpu, old_cr4, cr4);
> +	if (!vcpu->kvm->arch.tdp_mmu_enabled)
> +		kvm_post_set_cr4(vcpu, old_cr4, cr4);
>
>  	if ((cr4 ^ old_cr4) & (X86_CR4_OSXSAVE | X86_CR4_PKE))
>  		kvm_update_cpuid_runtime(vcpu);
>
> Also... if I have *two* vCPUs it doesn't go quite as slowly while Xen starts Grub and then Grub boots a Linux kernel. Until the kernel brings up its second vCPU and *then* it starts going really slowly again. Is that because the TDP roots are refcounted, and that idle vCPU holds onto the unused one and prevents it from being completely thrown away? Until the vCPU stops being idle and starts flipping SMEP on/off on Linux←→Xen transitions too?
>
> In practice, there's not a lot of point in Xen using SMEP when it's purely acting as a library for its *one* guest, living in an HVM container. The above patch speeds things up but telling Xen not to use SMEP at all makes things go *much* faster. But if I'm not being *entirely* stupid, there may be some generic improvements for KVM+TDPMMU here somewhere so it's worth making a fool of myself by asking...?
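As a rough, hypothetical sketch of what "give SMEP the same treatment as CR0.WP" would mean (conceptual only, not the actual VMX implementation; the helper names are invented): the bit is dropped from the intercept mask so guest writes no longer exit, and KVM refreshes its cached view of the bit on demand before an emulated permission check, as is done for CR0.WP since fb509f76acc8:

#include <stdbool.h>

#define CR4_SMEP (1UL << 20)

/*
 * With TDP, the EPT tables do not depend on CR4.SMEP, so the bit could be
 * left "guest owned", i.e. removed from the CR4 guest/host mask so that
 * toggling it does not cause a VM exit. Shadow paging still needs to see
 * every change.
 */
static unsigned long cr4_possible_guest_owned_bits(bool tdp_enabled)
{
	return tdp_enabled ? CR4_SMEP : 0;
}

/*
 * Because guest-owned writes don't exit, any cached copy of CR4 may be
 * stale; refresh the SMEP bit from the current hardware value before an
 * emulated permission check that depends on it.
 */
static void refresh_smep_before_emulation(unsigned long *cached_cr4,
					  unsigned long hw_cr4)
{
	*cached_cr4 = (*cached_cr4 & ~CR4_SMEP) | (hw_cr4 & CR4_SMEP);
}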
On Tue, 2023-10-10 at 16:25 -0700, Sean Christopherson wrote:
> On Tue, Oct 10, 2023, David Woodhouse wrote:
> > If I understand things correctly, the point of the TDP MMU is to use page tables such as EPT for GPA → HPA translations, but let the virtualization support in the CPU handle all of the *virtual* addressing and page tables, including the non-root mode %cr3/%cr4.
> >
> > I have a guest which loves to flip the SMEP bit on and off in %cr4 all the time. The guest is actually Xen, in its 'PV shim' mode which enables it to support a single PV guest, while running in a true hardware virtual machine:
> > https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg00497.html
> >
> > The performance is *awful*, since as far as I can tell, on every flip KVM flushes the entire EPT. I understand why that might be necessary for the mode where KVM is building up a set of shadow page tables to directly map GVA → HPA and be loaded into %cr3 of a CPU that doesn't support native EPT translations. But I don't understand why the TDP MMU would need to do it. Surely we don't have to change anything in the EPT just because the stuff in the non-root-mode %cr3/%cr4 changes?
> >
> > So I tried this, and it went faster and nothing appears to have blown up.
> >
> > Am I missing something? Is this stupidly wrong?
>
> Heh, you're in luck, because regardless of what your darn pronoun "this" refers to, the answer is yes, "this" is stupidly wrong.

Hehe. Thought that might be the case. Thank you for the coherent explanation and especially for the references.

(I hadn't got the PV shim working in qemu yet. I shall bump that up my TODO list by a few quanta, as it would have let me run this test on a newer kernel.)

> The below is stupidly wrong. KVM needs to at least reconfigure the guest's paging metadata that is used to translate GVAs to GPAs during emulation.
>
> But the TDP MMU behavior *was* also stupidly wrong. The reason that two vCPUs suck less is because KVM would zap SPTEs (EPT roots) if and only if *both* vCPUs unloaded their roots at the same time.
>
> Commit edbdb43fc96b ("KVM: x86: Preserve TDP MMU roots until they are explicitly invalidated") should fix the behavior you're seeing.
>
> And if we want to try and make SMEP blazing fast on Intel, we can probably let the guest write it directly, i.e. give SMEP the same treatment as CR0.WP. See commits cf9f4c0eb169 ("KVM: x86/mmu: Refresh CR0.WP prior to checking for emulated permission faults") and fb509f76acc8 ("KVM: VMX: Make CR0.WP a guest owned bit").

Thanks. In fact, looking at __kvm_mmu_refresh_passthrough_bits() in commit cf9f4c0eb169 ("KVM: x86/mmu: Refresh CR0.WP…"), shouldn't it be looking at the SMEP bit in %cr4 anyway?

In kvm_calc_cpu_role() the value of ____is_cr0_wp() is used *three* times, for setting role.base.{cr0_wp,smep_andnot_wp,smap_andnot_wp}. But __kvm_mmu_refresh_passthrough_bits() only refreshes role.base.cr0_wp and not the other two. Do we need this?

--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5159,6 +5159,8 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
 		return;

 	mmu->cpu_role.base.cr0_wp = cr0_wp;
+	mmu->cpu_role.base.smep_andnot_wp = mmu->cpu_role.ext.cr4_smep && !cr0_wp;
+	mmu->cpu_role.base.smap_andnot_wp = mmu->cpu_role.ext.cr4_smap && !cr0_wp;
 	reset_guest_paging_metadata(vcpu, mmu);
 }

But I'm confused here. Even if I don't go as far as actually making CR4.SMEP a guest-owned bit, and KVM still ends up handling it in kvm_post_set_cr4()... why does KVM need to completely unload and reinit the MMU? Would it not be sufficient just to refresh the role bits, much like __kvm_mmu_refresh_passthrough_bits() does for CR0.WP?

(And what about flushing the hardware TLB, as Jim mentioned. I guess if it's guest-owned we trust the CPU to do that, and if it's trapped then KVM is required to do so?)

> Oh, and if your userspace is doing something silly like constantly creating and deleting memslots, see commit 0df9dab891ff ("KVM: x86/mmu: Stop zapping invalidated TDP MMU roots asynchronously").

Oooh, I'll play with that. Thanks! Although I bristle at 'silly' :)

Userspace only wants to add and remove single overlay pages. That's not so silly; it's explicitly requested by the guest and required for PV drivers to work. AIUI it's the KVM API which means that userspace needs to stop all vCPUs, *remove* the original memslot, add back the two halves of the original memslot with the newly overlaid page in the middle, and then let all the vCPUs run again.

We don't need an atomic way to change *all* the memslots at once, but a way to atomically overlay a *single* range would make the whole enterprise feel less 'silly'.
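For reference, a minimal userspace sketch of the memslot dance described above, using the standard KVM_SET_USER_MEMORY_REGION ioctl. The slot numbers, page size and layout are made up for illustration, and the vCPUs are assumed to already be paused:

#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <stdint.h>

/*
 * Replace one guest page of an existing RAM slot with an overlay page.
 * KVM has no "punch a hole" operation, so userspace must delete the old
 * slot and re-add it as two halves around the overlay page.
 */
static int overlay_one_page(int vm_fd, uint64_t ram_gpa, uint64_t ram_size,
			    void *ram_hva, uint64_t overlay_gpa,
			    void *overlay_hva)
{
	struct kvm_userspace_memory_region r;

	/* 1. Delete the original slot (memory_size == 0 means delete). */
	r = (struct kvm_userspace_memory_region) {
		.slot = 0, .guest_phys_addr = ram_gpa, .memory_size = 0,
	};
	if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &r) < 0)
		return -1;

	/* 2. Re-add the RAM below the overlaid page. */
	r = (struct kvm_userspace_memory_region) {
		.slot = 0,
		.guest_phys_addr = ram_gpa,
		.memory_size = overlay_gpa - ram_gpa,
		.userspace_addr = (uintptr_t)ram_hva,
	};
	if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &r) < 0)
		return -1;

	/* 3. Re-add the RAM above it. */
	r = (struct kvm_userspace_memory_region) {
		.slot = 1,
		.guest_phys_addr = overlay_gpa + 4096,
		.memory_size = ram_gpa + ram_size - (overlay_gpa + 4096),
		.userspace_addr = (uintptr_t)ram_hva + (overlay_gpa + 4096 - ram_gpa),
	};
	if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &r) < 0)
		return -1;

	/* 4. Map the overlay page itself in its own slot. */
	r = (struct kvm_userspace_memory_region) {
		.slot = 2,
		.guest_phys_addr = overlay_gpa,
		.memory_size = 4096,
		.userspace_addr = (uintptr_t)overlay_hva,
	};
	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &r);
}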
On Wed, Oct 11, 2023 at 1:21 AM David Woodhouse <dwmw2@infradead.org> wrote:
>
> ...
> But I'm confused here. Even if I don't go as far as actually making CR4.SMEP a guest-owned bit, and KVM still ends up handling it in kvm_post_set_cr4()... why does KVM need to completely unload and reinit the MMU? Would it not be sufficient just to refresh the role bits, much like __kvm_mmu_refresh_passthrough_bits() does for CR0.WP?
>
> (And what about flushing the hardware TLB, as Jim mentioned. I guess if it's guest-owned we trust the CPU to do that, and if it's trapped then KVM is required to do so?)

Yes, guest-owned bits become the responsibility of the CPU.

With the TDP MMU, KVM probably should not intercept writes to %cr4 at all. I'd argue that it's better to construct guest paging attributes from first principles every time KVM needs to know, rather than intercepting any operation that may change guest paging attributes. If that's too expensive, perhaps KVM is emulating too many instructions. :)

Particularly aggravating to me is that too many of these cost/benefit decisions were made without consideration of nested virtualization. At one time, someone observed that Linux hardly ever writes %cr4, so it must be cheap to intercept. How ironic that they didn't consider the fact that KVM is part of Linux, and KVM writes %cr4 all the time!

If the cost of determining paging attributes from first principles is really prohibitive, maybe the solution is more flexibility. KVM can start by intercepting writes to %cr4, but if the rate of intercepts is egregious, it should do something else.
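A toy illustration of the "first principles" approach (the structure and helper name are invented here, not taken from KVM): derive the paging attributes the emulator cares about from the raw control register values at the moment they are needed, instead of caching a role that must be refreshed on every intercepted write:

#include <stdbool.h>

#define CR0_WP    (1UL << 16)
#define CR4_LA57  (1UL << 12)
#define CR4_SMEP  (1UL << 20)
#define CR4_SMAP  (1UL << 21)
#define EFER_NX   (1ULL << 11)

struct guest_paging_attrs {
	bool wp, smep, smap, nx, la57;
};

/* Recompute everything from the current register values on each use. */
static struct guest_paging_attrs derive_paging_attrs(unsigned long cr0,
						      unsigned long cr4,
						      unsigned long long efer)
{
	return (struct guest_paging_attrs) {
		.wp   = cr0 & CR0_WP,
		.smep = cr4 & CR4_SMEP,
		.smap = cr4 & CR4_SMAP,
		.la57 = cr4 & CR4_LA57,
		.nx   = efer & EFER_NX,
	};
}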
On 10/11/23 10:20, David Woodhouse wrote:
> But __kvm_mmu_refresh_passthrough_bits() only refreshes role.base.cr0_wp and not the other two. Do we need this?
>
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5159,6 +5159,8 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
>  		return;
>
>  	mmu->cpu_role.base.cr0_wp = cr0_wp;
> +	mmu->cpu_role.base.smep_andnot_wp = mmu->cpu_role.ext.cr4_smep && !cr0_wp;
> +	mmu->cpu_role.base.smap_andnot_wp = mmu->cpu_role.ext.cr4_smap && !cr0_wp;
>  	reset_guest_paging_metadata(vcpu, mmu);
>  }

{smep,smap}_andnot_wp only matter for shadow paging. You can remove them from this function, which is not called for shadow paging anyway, and instead set them in the root_role in kvm_init_shadow_mmu.

Paolo
On Wed, Oct 11, 2023, David Woodhouse wrote:
> On Tue, 2023-10-10 at 16:25 -0700, Sean Christopherson wrote:
> > On Tue, Oct 10, 2023, David Woodhouse wrote:
> But I'm confused here. Even if I don't go as far as actually making CR4.SMEP a guest-owned bit, and KVM still ends up handling it in kvm_post_set_cr4()... why does KVM need to completely unload and reinit the MMU? Would it not be sufficient just to refresh the role bits, much like __kvm_mmu_refresh_passthrough_bits() does for CR0.WP?

Maybe? It largely hasn't happened simply because no one (yet) cares enough about the performance of other bits to force the issue.

> (And what about flushing the hardware TLB, as Jim mentioned. I guess if it's guest-owned we trust the CPU to do that, and if it's trapped then KVM is required to do so?)

TBH, I forgot that clearing SMEP architecturally requires a TLB flush for the current PCID on Intel. I *think* it's actually fine so long as TDP is enabled. Ah, yes, it's fine. Well, either that or kvm_invalidate_pcid() is buggy :-)

Relying on the CPU to flush the hardware TLBs is definitely ok, I was just trying to think if there were any KVM artifacts that would need to be flushed/invalidated. I can't think of any, e.g. KVM already disables GVA-based MMIO caching when TDP is enabled, L1 is responsible for its virtual TLB if L1 is shadowing legacy paging for L2, and if L1 is using EPT, i.e. KVM is shadowing L1 EPT and thus has a virtual TLB for L2, then the PCID is irrelevant (it doesn't affect EPT translations).
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1072,7 +1074,8 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 	if (kvm_x86_ops.set_cr4(vcpu, cr4))
 		return 1;

-	kvm_post_set_cr4(vcpu, old_cr4, cr4);
+	if (!vcpu->kvm->arch.tdp_mmu_enabled)
+		kvm_post_set_cr4(vcpu, old_cr4, cr4);

 	if ((cr4 ^ old_cr4) & (X86_CR4_OSXSAVE | X86_CR4_PKE))