diff mbox series

[v3,1/1] KVM: x86/mmu: Remove KVM mmu shrinker

Message ID 20241101201437.1604321-2-vipinsh@google.com (mailing list archive)
State New
Headers show
Series Remove KVM MMU shrinker | expand

Commit Message

Vipin Sharma Nov. 1, 2024, 8:14 p.m. UTC
Remove KVM MMU shrinker and all its related code. Remove global
kvm_total_used_mmu_pages and page zapping flow from MMU shrinker.
Remove zapped_obsolete_pages list from struct kvm_arch{} and use local
list in kvm_zap_obsolete_pages() since MMU shrinker is not using it
anymore.

Current flow of KVM MMU shrinker is very disruptive to VMs. It picks the
first VM in the vm_list, zaps the oldest page which is most likely an
upper level SPTEs and most like to be reused. Prior to TDP MMU, this is
even more disruptive in nested VMs case, considering L1 SPTEs will be
the oldest even though most of the entries are for L2 SPTEs.

As discussed in [1] shrinker logic has not be very useful in actually
keeping VMs performant and reducing memory usage.

There was an alternative suggested [2] to repurpose shrinker for
shrinking vCPU caches. But considering that in all of the KVM MMU
shrinker history it hasn't been used/needed/complained, and there has
not been any conversation regarding KVM using lots of page tables, it
might be better to just not have shrinker. If the need arise [2] can be
revisited.

[1] https://lore.kernel.org/lkml/Y45dldZnI6OIf+a5@google.com/
[2] https://lore.kernel.org/kvm/20241004195540.210396-3-vipinsh@google.com/

Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: David Matlack <dmatlack@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Vipin Sharma <vipinsh@google.com>
---
 arch/x86/include/asm/kvm_host.h |   1 -
 arch/x86/kvm/mmu/mmu.c          | 111 ++------------------------------
 2 files changed, 5 insertions(+), 107 deletions(-)

Comments

Vipin Sharma Nov. 1, 2024, 8:21 p.m. UTC | #1
On Fri, Nov 1, 2024 at 1:14 PM Vipin Sharma <vipinsh@google.com> wrote:
>
> Remove KVM MMU shrinker and all its related code. Remove global
> kvm_total_used_mmu_pages and page zapping flow from MMU shrinker.
> Remove zapped_obsolete_pages list from struct kvm_arch{} and use local
> list in kvm_zap_obsolete_pages() since MMU shrinker is not using it
> anymore.
>
> Current flow of KVM MMU shrinker is very disruptive to VMs. It picks the
> first VM in the vm_list, zaps the oldest page which is most likely an
> upper level SPTEs and most like to be reused. Prior to TDP MMU, this is
> even more disruptive in nested VMs case, considering L1 SPTEs will be
> the oldest even though most of the entries are for L2 SPTEs.
>
> As discussed in [1] shrinker logic has not be very useful in actually
> keeping VMs performant and reducing memory usage.
>
> There was an alternative suggested [2] to repurpose shrinker for
> shrinking vCPU caches. But considering that in all of the KVM MMU
> shrinker history it hasn't been used/needed/complained, and there has
> not been any conversation regarding KVM using lots of page tables, it
> might be better to just not have shrinker. If the need arise [2] can be
> revisited.
>
> [1] https://lore.kernel.org/lkml/Y45dldZnI6OIf+a5@google.com/
> [2] https://lore.kernel.org/kvm/20241004195540.210396-3-vipinsh@google.com/
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Suggested-by: David Matlack <dmatlack@google.com>
> Reviewed-by: David Matlack <dmatlack@google.com>

FYI, I carried forward David's Reviewed-by from the previous versions.
Extra change from the previous version is removing registration of KVM
MMU shrinker in kvm_mmu_vendor_module_init() and mmu_shrinker object
along with its callback functions.
Sean Christopherson Nov. 5, 2024, 3:43 a.m. UTC | #2
On Fri, Nov 01, 2024, Vipin Sharma wrote:
> On Fri, Nov 1, 2024 at 1:14 PM Vipin Sharma <vipinsh@google.com> wrote:
> >
> > Remove KVM MMU shrinker and all its related code. Remove global
> > kvm_total_used_mmu_pages and page zapping flow from MMU shrinker.

Don't provide a play-by-play of the code changes.  Simply stating "all its
related code" is sufficient.  From that, readers know the *intent* of the patch,
and that allows reviewers to double check that all code is indeed removed.
Stating what the patch literally does verbatim adds a lot of noise and very little
value.

> > Remove zapped_obsolete_pages list from struct kvm_arch{} and use local
> > list in kvm_zap_obsolete_pages() since MMU shrinker is not using it
> > anymore.
> >
> > Current flow of KVM MMU shrinker is very disruptive to VMs.

Please use full sentences.

> > It picks the first VM in the vm_list, zaps the oldest page which is most
> > likely an upper level SPTEs and most like to be reused. Prior to TDP MMU,
> > this is even more disruptive in nested VMs case, considering L1 SPTEs will
> > be the oldest even though most of the entries are for L2 SPTEs.

This flaw isn't limited to the shrinker though, it's inherent to all of KVM's
force page table reclamation.

> > As discussed in [1] shrinker logic has not be very useful in actually
> > keeping VMs performant

I don't think anyone has ever claimed that the shrinker would be useful in
providing performance for VMs.  AFAICT, it's always been about memory usage, and
nothing more.

> > and reducing memory usage.

This one I definitely agree on :-)

> > There was an alternative suggested [2] to repurpose shrinker for
> > shrinking vCPU caches. But considering that in all of the KVM MMU
> > shrinker history it hasn't been used/needed/complained, and there has
> > not been any conversation regarding KVM using lots of page tables, it
> > might be better to just not have shrinker.

A complaint about KVM's page table usage would be an argument for keeping (and
improving) the current shrinker implementation, not for dropping the per-vCPU
caches.  And _that_ to me leads to the the real argument for not wiring up the
shrinker to the per-vCPU caches: it doesn't scale.  E.g. a VM with 4 vCPUs and
4 TiB of memory will, at most, reclaim a laugable 640KiB (4*40*4KiB) of memory.
That's obviously more than a bit contrived, but IMO it really shows that targeting
the per-vCPU caches is unlikely to be useful in practice.  At best, it would be
a premature memory optimization.

> > If the need arise [2] can be revisited.

Everything can be revisited, I think what's important here is to state why forcing
future developers to (re)start from scratch is a non-issue.

> >
> > [1] https://lore.kernel.org/lkml/Y45dldZnI6OIf+a5@google.com/
> > [2] https://lore.kernel.org/kvm/20241004195540.210396-3-vipinsh@google.com/
> >
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Suggested-by: David Matlack <dmatlack@google.com>
> > Reviewed-by: David Matlack <dmatlack@google.com>
> 
> FYI, I carried forward David's Reviewed-by from the previous versions.
> Extra change from the previous version is removing registration of KVM
> MMU shrinker in kvm_mmu_vendor_module_init() and mmu_shrinker object
> along with its callback functions.

Heh, I'm going to drop David's review, because I am going to split this into two
patches when applying.  One to yank out the shrinker, and one to use a local list
in kvm_zap_obsolete_pages() and drop zapped_obsolete_pages.

Somewhat subtly, the only reason dropping zapped_obsolete_pages doesn't introduce
a functional change is because kvm_zap_obsolete_pages() is called under slots_lock,
and doesn't drop said lock when yielding.  I.e. there can't be multiple writers
(or readers) of zapped_obsolete_pages, and so the list is guaranteed to be empty
on entry and exit.  That's worth a changelog and bisection point of its own.

With the patch split in two, this is what I ended up with for the main changelog.

Please speak up if you want to change anything!

    KVM: x86/mmu: Remove KVM's MMU shrinker
    
    Remove KVM's MMU shrinker and (almost) all of its related code, as the
    current implementation is very disruptive to VMs (if it ever runs),
    without providing any meaningful benefit[1].
    
    Alternatively, KVM could repurpose its shrinker, e.g. to reclaim pages
    from the per-vCPU caches[2], but given that no one has complained about
    lack of TDP MMU support for the shrinker in the 3+ years since the TDP MMU
    was enabled by default, it's safe to say that there is likely no real use
    case for initiating reclaim of KVM's page tables from the shrinker.
    
    And while clever/cute, reclaiming the per-vCPU caches doesn't scale the
    same way that reclaiming in-use page table pages does.  E.g. the amount of
    memory being used by a VM doesn't always directly correlate with the
    number vCPUs, and even when it does, reclaiming a few pages from per-vCPU
    caches likely won't make much of a dent in the VM's total memory usage,
    especially for VMs with huge amounts of memory.
    
    Lastly, if it turns out that there is a strong use case for dropping the
    per-vCPU caches, re-introducing the shrinker registration is trivial
    compared to the complexity of actually reclaiming pages from the caches.
    
    [1] https://lore.kernel.org/lkml/Y45dldZnI6OIf+a5@google.com
    [2] https://lore.kernel.org/kvm/20241004195540.210396-3-vipinsh@google.com
diff mbox series

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 70c7ed0ef1847..3e8afc82ae2fb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1306,7 +1306,6 @@  struct kvm_arch {
 	bool pre_fault_allowed;
 	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
 	struct list_head active_mmu_pages;
-	struct list_head zapped_obsolete_pages;
 	/*
 	 * A list of kvm_mmu_page structs that, if zapped, could possibly be
 	 * replaced by an NX huge page.  A shadow page is on this list if its
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 443845bb2e011..3116c4c31d87c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -179,7 +179,6 @@  struct kvm_shadow_walk_iterator {
 
 static struct kmem_cache *pte_list_desc_cache;
 struct kmem_cache *mmu_page_header_cache;
-static struct percpu_counter kvm_total_used_mmu_pages;
 
 static void mmu_spte_set(u64 *sptep, u64 spte);
 
@@ -1622,27 +1621,15 @@  static void kvm_mmu_check_sptes_at_free(struct kvm_mmu_page *sp)
 #endif
 }
 
-/*
- * This value is the sum of all of the kvm instances's
- * kvm->arch.n_used_mmu_pages values.  We need a global,
- * aggregate version in order to make the slab shrinker
- * faster
- */
-static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr)
-{
-	kvm->arch.n_used_mmu_pages += nr;
-	percpu_counter_add(&kvm_total_used_mmu_pages, nr);
-}
-
 static void kvm_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
-	kvm_mod_used_mmu_pages(kvm, +1);
+	kvm->arch.n_used_mmu_pages++;
 	kvm_account_pgtable_pages((void *)sp->spt, +1);
 }
 
 static void kvm_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
-	kvm_mod_used_mmu_pages(kvm, -1);
+	kvm->arch.n_used_mmu_pages--;
 	kvm_account_pgtable_pages((void *)sp->spt, -1);
 }
 
@@ -6380,6 +6367,7 @@  static void kvm_zap_obsolete_pages(struct kvm *kvm)
 {
 	struct kvm_mmu_page *sp, *node;
 	int nr_zapped, batch = 0;
+	LIST_HEAD(invalid_list);
 	bool unstable;
 
 restart:
@@ -6413,7 +6401,7 @@  static void kvm_zap_obsolete_pages(struct kvm *kvm)
 		}
 
 		unstable = __kvm_mmu_prepare_zap_page(kvm, sp,
-				&kvm->arch.zapped_obsolete_pages, &nr_zapped);
+				&invalid_list, &nr_zapped);
 		batch += nr_zapped;
 
 		if (unstable)
@@ -6429,7 +6417,7 @@  static void kvm_zap_obsolete_pages(struct kvm *kvm)
 	 * kvm_mmu_load()), and the reload in the caller ensure no vCPUs are
 	 * running with an obsolete MMU.
 	 */
-	kvm_mmu_commit_zap_page(kvm, &kvm->arch.zapped_obsolete_pages);
+	kvm_mmu_commit_zap_page(kvm, &invalid_list);
 }
 
 /*
@@ -6492,16 +6480,10 @@  static void kvm_mmu_zap_all_fast(struct kvm *kvm)
 		kvm_tdp_mmu_zap_invalidated_roots(kvm);
 }
 
-static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
-{
-	return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
-}
-
 void kvm_mmu_init_vm(struct kvm *kvm)
 {
 	kvm->arch.shadow_mmio_value = shadow_mmio_value;
 	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
-	INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
 	INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages);
 	spin_lock_init(&kvm->arch.mmu_unsync_pages_lock);
 
@@ -7112,72 +7094,6 @@  void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
 	}
 }
 
-static unsigned long mmu_shrink_scan(struct shrinker *shrink,
-				     struct shrink_control *sc)
-{
-	struct kvm *kvm;
-	int nr_to_scan = sc->nr_to_scan;
-	unsigned long freed = 0;
-
-	mutex_lock(&kvm_lock);
-
-	list_for_each_entry(kvm, &vm_list, vm_list) {
-		int idx;
-
-		/*
-		 * Never scan more than sc->nr_to_scan VM instances.
-		 * Will not hit this condition practically since we do not try
-		 * to shrink more than one VM and it is very unlikely to see
-		 * !n_used_mmu_pages so many times.
-		 */
-		if (!nr_to_scan--)
-			break;
-		/*
-		 * n_used_mmu_pages is accessed without holding kvm->mmu_lock
-		 * here. We may skip a VM instance errorneosly, but we do not
-		 * want to shrink a VM that only started to populate its MMU
-		 * anyway.
-		 */
-		if (!kvm->arch.n_used_mmu_pages &&
-		    !kvm_has_zapped_obsolete_pages(kvm))
-			continue;
-
-		idx = srcu_read_lock(&kvm->srcu);
-		write_lock(&kvm->mmu_lock);
-
-		if (kvm_has_zapped_obsolete_pages(kvm)) {
-			kvm_mmu_commit_zap_page(kvm,
-			      &kvm->arch.zapped_obsolete_pages);
-			goto unlock;
-		}
-
-		freed = kvm_mmu_zap_oldest_mmu_pages(kvm, sc->nr_to_scan);
-
-unlock:
-		write_unlock(&kvm->mmu_lock);
-		srcu_read_unlock(&kvm->srcu, idx);
-
-		/*
-		 * unfair on small ones
-		 * per-vm shrinkers cry out
-		 * sadness comes quickly
-		 */
-		list_move_tail(&kvm->vm_list, &vm_list);
-		break;
-	}
-
-	mutex_unlock(&kvm_lock);
-	return freed;
-}
-
-static unsigned long mmu_shrink_count(struct shrinker *shrink,
-				      struct shrink_control *sc)
-{
-	return percpu_counter_read_positive(&kvm_total_used_mmu_pages);
-}
-
-static struct shrinker *mmu_shrinker;
-
 static void mmu_destroy_caches(void)
 {
 	kmem_cache_destroy(pte_list_desc_cache);
@@ -7304,23 +7220,8 @@  int kvm_mmu_vendor_module_init(void)
 	if (!mmu_page_header_cache)
 		goto out;
 
-	if (percpu_counter_init(&kvm_total_used_mmu_pages, 0, GFP_KERNEL))
-		goto out;
-
-	mmu_shrinker = shrinker_alloc(0, "x86-mmu");
-	if (!mmu_shrinker)
-		goto out_shrinker;
-
-	mmu_shrinker->count_objects = mmu_shrink_count;
-	mmu_shrinker->scan_objects = mmu_shrink_scan;
-	mmu_shrinker->seeks = DEFAULT_SEEKS * 10;
-
-	shrinker_register(mmu_shrinker);
-
 	return 0;
 
-out_shrinker:
-	percpu_counter_destroy(&kvm_total_used_mmu_pages);
 out:
 	mmu_destroy_caches();
 	return ret;
@@ -7337,8 +7238,6 @@  void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
 void kvm_mmu_vendor_module_exit(void)
 {
 	mmu_destroy_caches();
-	percpu_counter_destroy(&kvm_total_used_mmu_pages);
-	shrinker_free(mmu_shrinker);
 }
 
 /*