Message ID | 20241009192345.1148353-2-seanjc@google.com (mailing list archive) |
---|---|
State | New, archived |
Series | KVM: x86/mmu: Don't zap "direct" non-leaf SPTEs on memslot removal |
Tests of "normal VM + nested VM + 3 selftests" passed on the 3 configs:

1) modprobe kvm_intel ept=0
2) modprobe kvm tdp_mmu=0; modprobe kvm_intel ept=1
3) modprobe kvm tdp_mmu=1; modprobe kvm_intel ept=1

Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yan Zhao <yan.y.zhao@intel.com>

On Wed, Oct 09, 2024 at 12:23:43PM -0700, Sean Christopherson wrote:
> When performing a targeted zap on memslot removal, zap only MMU pages that
> shadow guest PTEs, as zapping all SPs that "match" the gfn is inexact and
> unnecessary. Furthermore, for_each_gfn_valid_sp() arguably shouldn't
> exist, because it doesn't do what most people would expect it to do.
> The "round gfn for level" adjustment that is done for direct SPs (no gPTE)
> means that the exact gfn comparison will not get a match, even when an SP
> does "cover" a gfn, or was even created specifically for a gfn.
>
> For memslot deletion specifically, KVM's behavior will vary significantly
> based on the size and alignment of a memslot, and in weird ways. E.g. for
> a 4KiB memslot, KVM will zap more SPs if the slot is 1GiB aligned than if
> it's only 4KiB aligned. And as described below, zapping SPs in the
> aligned case overzaps for direct MMUs, as odds are good the upper-level
> SPs are serving other memslots.
>
> To iterate over all potentially-relevant gfns, KVM would need to make a
> pass over the hash table for each level, with the gfn used for lookup
> rounded for said level. And then check that the SP is of the correct
> level, too, e.g. to avoid over-zapping.
>
> But even then, KVM would massively overzap, as processing every level is
> all but guaranteed to zap SPs that serve other memslots, especially if the
> memslot being removed is relatively small. KVM could mitigate that issue
> by processing only levels that can be possible guest huge pages, i.e. are
> less likely to be re-used for other memslots, but while somewhat logical,
> that's quite arbitrary and would be a bit of a mess to implement.
>
> So, zap only SPs with gPTEs, as the resulting behavior is easy to describe,
> is predictable, and is explicitly minimal, i.e. KVM only zaps SPs that
> absolutely must be zapped.
>
> Cc: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 16 ++++++----------
>  1 file changed, 6 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index a9a23e058555..09494d01c38e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1884,14 +1884,10 @@ static bool sp_has_gptes(struct kvm_mmu_page *sp)
>  		if (is_obsolete_sp((_kvm), (_sp))) {		\
>  		} else
>
> -#define for_each_gfn_valid_sp(_kvm, _sp, _gfn)				\
> +#define for_each_gfn_valid_sp_with_gptes(_kvm, _sp, _gfn)		\
>  	for_each_valid_sp(_kvm, _sp,					\
>  	  &(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)])	\
> -		if ((_sp)->gfn != (_gfn)) {} else
> -
> -#define for_each_gfn_valid_sp_with_gptes(_kvm, _sp, _gfn)		\
> -	for_each_gfn_valid_sp(_kvm, _sp, _gfn)				\
> -		if (!sp_has_gptes(_sp)) {} else
> +		if ((_sp)->gfn != (_gfn) || !sp_has_gptes(_sp)) {} else
>
>  static bool kvm_sync_page_check(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>  {
> @@ -7063,15 +7059,15 @@ static void kvm_mmu_zap_memslot_pages_and_flush(struct kvm *kvm,
>
>  	/*
>  	 * Since accounting information is stored in struct kvm_arch_memory_slot,
> -	 * shadow pages deletion (e.g. unaccount_shadowed()) requires that all
> -	 * gfns with a shadow page have a corresponding memslot. Do so before
> -	 * the memslot goes away.
> +	 * all MMU pages that are shadowing guest PTEs must be zapped before the
> +	 * memslot is deleted, as freeing such pages after the memslot is freed
> +	 * will result in use-after-free, e.g. in unaccount_shadowed().
>  	 */
>  	for (i = 0; i < slot->npages; i++) {
>  		struct kvm_mmu_page *sp;
>  		gfn_t gfn = slot->base_gfn + i;
>
> -		for_each_gfn_valid_sp(kvm, sp, gfn)
> +		for_each_gfn_valid_sp_with_gptes(kvm, sp, gfn)
>  			kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
>
>  		if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
> --
> 2.47.0.rc1.288.g06298d1525-goog
>