diff mbox series

[v2,2/3] KVM: x86/mmu: Use MMU shrinker to shrink KVM MMU memory caches

Message ID 20241004195540.210396-3-vipinsh@google.com (mailing list archive)
State New, archived
Headers show
Series KVM: x86/mmu: Repurpose MMU shrinker into page cache shrinker | expand

Commit Message

Vipin Sharma Oct. 4, 2024, 7:55 p.m. UTC
Use MMU shrinker to iterate through all the vCPUs of all the VMs and
free pages allocated in MMU memory caches. Protect cache allocation in
page fault and MMU load path from MMU shrinker by using a per vCPU
mutex. In MMU shrinker, move the iterated VM to the end of the VMs list
so that the pain of emptying cache spread among other VMs too.

The specific caches to empty are mmu_shadow_page_cache and
mmu_shadowed_info_cache as these caches store whole pages. Emptying them
will give more impact to shrinker compared to other caches like
mmu_pte_list_desc_cache{} and mmu_page_header_cache{}

Holding per vCPU mutex lock ensures that a vCPU doesn't get surprised
by finding its cache emptied after filling them up for page table
allocations during page fault handling and MMU load operation. Per vCPU
mutex also makes sure there is only race between MMU shrinker and all
other vCPUs. This should result in very less contention.

Signed-off-by: Vipin Sharma <vipinsh@google.com>
---
 arch/x86/include/asm/kvm_host.h |  6 +++
 arch/x86/kvm/mmu/mmu.c          | 69 +++++++++++++++++++++++++++------
 arch/x86/kvm/mmu/paging_tmpl.h  | 14 ++++---
 include/linux/kvm_host.h        |  1 +
 virt/kvm/kvm_main.c             |  8 +++-
 5 files changed, 81 insertions(+), 17 deletions(-)

Comments

Vipin Sharma Oct. 4, 2024, 8:04 p.m. UTC | #1
On Fri, Oct 4, 2024 at 12:55 PM Vipin Sharma <vipinsh@google.com> wrote:
>
> Use MMU shrinker to iterate through all the vCPUs of all the VMs and
> free pages allocated in MMU memory caches. Protect cache allocation in
> page fault and MMU load path from MMU shrinker by using a per vCPU
> mutex. In MMU shrinker, move the iterated VM to the end of the VMs list
> so that the pain of emptying cache spread among other VMs too.
>
> The specific caches to empty are mmu_shadow_page_cache and
> mmu_shadowed_info_cache as these caches store whole pages. Emptying them
> will give more impact to shrinker compared to other caches like
> mmu_pte_list_desc_cache{} and mmu_page_header_cache{}
>
> Holding per vCPU mutex lock ensures that a vCPU doesn't get surprised
> by finding its cache emptied after filling them up for page table
> allocations during page fault handling and MMU load operation. Per vCPU
> mutex also makes sure there is only race between MMU shrinker and all
> other vCPUs. This should result in very less contention.
>
> Signed-off-by: Vipin Sharma <vipinsh@google.com>

I also meant to add
Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: David Matlack <dmatlack@google.com>

I can send v3 or please take it from v2.
Sean Christopherson Oct. 24, 2024, 11:25 p.m. UTC | #2
On Fri, Oct 04, 2024, Vipin Sharma wrote:
> Use MMU shrinker to iterate through all the vCPUs of all the VMs and
> free pages allocated in MMU memory caches. Protect cache allocation in
> page fault and MMU load path from MMU shrinker by using a per vCPU
> mutex. In MMU shrinker, move the iterated VM to the end of the VMs list
> so that the pain of emptying cache spread among other VMs too.
> 
> The specific caches to empty are mmu_shadow_page_cache and
> mmu_shadowed_info_cache as these caches store whole pages. Emptying them
> will give more impact to shrinker compared to other caches like
> mmu_pte_list_desc_cache{} and mmu_page_header_cache{}
> 
> Holding per vCPU mutex lock ensures that a vCPU doesn't get surprised
> by finding its cache emptied after filling them up for page table
> allocations during page fault handling and MMU load operation. Per vCPU
> mutex also makes sure there is only race between MMU shrinker and all
> other vCPUs. This should result in very less contention.
> 
> Signed-off-by: Vipin Sharma <vipinsh@google.com>
> ---

...

> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 213e46b55dda2..8e2935347615d 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4524,29 +4524,33 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  	if (r != RET_PF_INVALID)
>  		return r;
>  
> +	mutex_lock(&vcpu->arch.mmu_memory_cache_lock);
>  	r = mmu_topup_memory_caches(vcpu, false);
>  	if (r)
> -		return r;
> +		goto out_mmu_memory_cache_unlock;
>  
>  	r = kvm_faultin_pfn(vcpu, fault, ACC_ALL);
>  	if (r != RET_PF_CONTINUE)
> -		return r;
> +		goto out_mmu_memory_cache_unlock;
>  
>  	r = RET_PF_RETRY;
>  	write_lock(&vcpu->kvm->mmu_lock);
>  
>  	if (is_page_fault_stale(vcpu, fault))
> -		goto out_unlock;
> +		goto out_mmu_unlock;
>  
>  	r = make_mmu_pages_available(vcpu);
>  	if (r)
> -		goto out_unlock;
> +		goto out_mmu_unlock;
>  
>  	r = direct_map(vcpu, fault);
>  
> -out_unlock:
> +out_mmu_unlock:
>  	write_unlock(&vcpu->kvm->mmu_lock);
>  	kvm_release_pfn_clean(fault->pfn);
> +out_mmu_memory_cache_unlock:
> +	mutex_unlock(&vcpu->arch.mmu_memory_cache_lock);

I've been thinking about this patch on and off for the past few weeks, and every
time I come back to it I can't shake the feeling that we came up with a clever
solution for a problem that doesn't exist.  I can't recall a single complaint
about KVM consuming an unreasonable amount of memory for page tables.  In fact,
the only time I can think of where the code in question caused problems was when
I unintentionally inverted the iterator and zapped the newest SPs instead of the
oldest SPs.

So, I'm leaning more and more toward simply removing the shrinker integration.
Vipin Sharma Oct. 25, 2024, 5:36 p.m. UTC | #3
On Thu, Oct 24, 2024 at 4:25 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Oct 04, 2024, Vipin Sharma wrote:
> > +out_mmu_memory_cache_unlock:
> > +     mutex_unlock(&vcpu->arch.mmu_memory_cache_lock);
>
> I've been thinking about this patch on and off for the past few weeks, and every
> time I come back to it I can't shake the feeling that we came up with a clever
> solution for a problem that doesn't exist.  I can't recall a single complaint
> about KVM consuming an unreasonable amount of memory for page tables.  In fact,
> the only time I can think of where the code in question caused problems was when
> I unintentionally inverted the iterator and zapped the newest SPs instead of the
> oldest SPs.
>
> So, I'm leaning more and more toward simply removing the shrinker integration.

One thing we can agree on is that we don't need MMU shrinker in its
current form because it is removing pages which are very well being
used by VM instead of shrinking its cache.

Regarding the current series, the biggest VM in GCE we can have 416
vCPUs, considering each thread can have 40 pages in its cache, total
cost gonna be around 65 MiB, doesn't seem much to me considering these
VMs have memory in TiB. Since caches in VMs are not unbounded, I think
it is fine to not have a MMU shrinker as its impact is miniscule in
KVM.
David Matlack Oct. 28, 2024, 8:37 p.m. UTC | #4
On Fri, Oct 25, 2024 at 10:37 AM Vipin Sharma <vipinsh@google.com> wrote:
>
> On Thu, Oct 24, 2024 at 4:25 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Fri, Oct 04, 2024, Vipin Sharma wrote:
> > > +out_mmu_memory_cache_unlock:
> > > +     mutex_unlock(&vcpu->arch.mmu_memory_cache_lock);
> >
> > I've been thinking about this patch on and off for the past few weeks, and every
> > time I come back to it I can't shake the feeling that we came up with a clever
> > solution for a problem that doesn't exist.  I can't recall a single complaint
> > about KVM consuming an unreasonable amount of memory for page tables.  In fact,
> > the only time I can think of where the code in question caused problems was when
> > I unintentionally inverted the iterator and zapped the newest SPs instead of the
> > oldest SPs.
> >
> > So, I'm leaning more and more toward simply removing the shrinker integration.
>
> One thing we can agree on is that we don't need MMU shrinker in its
> current form because it is removing pages which are very well being
> used by VM instead of shrinking its cache.
>
> Regarding the current series, the biggest VM in GCE we can have 416
> vCPUs, considering each thread can have 40 pages in its cache, total
> cost gonna be around 65 MiB, doesn't seem much to me considering these
> VMs have memory in TiB. Since caches in VMs are not unbounded, I think
> it is fine to not have a MMU shrinker as its impact is miniscule in
> KVM.

I have no objection to removing the shrinker entirely.
Sean Christopherson Oct. 28, 2024, 8:49 p.m. UTC | #5
On Mon, Oct 28, 2024, David Matlack wrote:
> On Fri, Oct 25, 2024 at 10:37 AM Vipin Sharma <vipinsh@google.com> wrote:
> >
> > On Thu, Oct 24, 2024 at 4:25 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Fri, Oct 04, 2024, Vipin Sharma wrote:
> > > > +out_mmu_memory_cache_unlock:
> > > > +     mutex_unlock(&vcpu->arch.mmu_memory_cache_lock);
> > >
> > > I've been thinking about this patch on and off for the past few weeks, and every
> > > time I come back to it I can't shake the feeling that we came up with a clever
> > > solution for a problem that doesn't exist.  I can't recall a single complaint
> > > about KVM consuming an unreasonable amount of memory for page tables.  In fact,
> > > the only time I can think of where the code in question caused problems was when
> > > I unintentionally inverted the iterator and zapped the newest SPs instead of the
> > > oldest SPs.
> > >
> > > So, I'm leaning more and more toward simply removing the shrinker integration.
> >
> > One thing we can agree on is that we don't need MMU shrinker in its
> > current form because it is removing pages which are very well being
> > used by VM instead of shrinking its cache.
> >
> > Regarding the current series, the biggest VM in GCE we can have 416
> > vCPUs, considering each thread can have 40 pages in its cache, total
> > cost gonna be around 65 MiB, doesn't seem much to me considering these
> > VMs have memory in TiB. Since caches in VMs are not unbounded, I think
> > it is fine to not have a MMU shrinker as its impact is miniscule in
> > KVM.
> 
> I have no objection to removing the shrinker entirely.

Let's do that.  In the unlikely scenario someone comes along with a strong use
case for purging the vCPU caches, we can always resurrect this approach.

Vipin, can you send a v3?
Vipin Sharma Oct. 28, 2024, 10:32 p.m. UTC | #6
On Mon, Oct 28, 2024 at 1:49 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Oct 28, 2024, David Matlack wrote:
> > On Fri, Oct 25, 2024 at 10:37 AM Vipin Sharma <vipinsh@google.com> wrote:
> > >
> > > On Thu, Oct 24, 2024 at 4:25 PM Sean Christopherson <seanjc@google.com> wrote:
> > > >
> > > > On Fri, Oct 04, 2024, Vipin Sharma wrote:
> > > > > +out_mmu_memory_cache_unlock:
> > > > > +     mutex_unlock(&vcpu->arch.mmu_memory_cache_lock);
> > > >
> > > > I've been thinking about this patch on and off for the past few weeks, and every
> > > > time I come back to it I can't shake the feeling that we came up with a clever
> > > > solution for a problem that doesn't exist.  I can't recall a single complaint
> > > > about KVM consuming an unreasonable amount of memory for page tables.  In fact,
> > > > the only time I can think of where the code in question caused problems was when
> > > > I unintentionally inverted the iterator and zapped the newest SPs instead of the
> > > > oldest SPs.
> > > >
> > > > So, I'm leaning more and more toward simply removing the shrinker integration.
> > >
> > > One thing we can agree on is that we don't need MMU shrinker in its
> > > current form because it is removing pages which are very well being
> > > used by VM instead of shrinking its cache.
> > >
> > > Regarding the current series, the biggest VM in GCE we can have 416
> > > vCPUs, considering each thread can have 40 pages in its cache, total
> > > cost gonna be around 65 MiB, doesn't seem much to me considering these
> > > VMs have memory in TiB. Since caches in VMs are not unbounded, I think
> > > it is fine to not have a MMU shrinker as its impact is miniscule in
> > > KVM.
> >
> > I have no objection to removing the shrinker entirely.
>
> Let's do that.  In the unlikely scenario someone comes along with a strong use
> case for purging the vCPU caches, we can always resurrect this approach.
>
> Vipin, can you send a v3?

Okay, I will send a v3.

Thanks
diff mbox series

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index cbfe31bac6cf6..63eaf03111ebb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -811,6 +811,12 @@  struct kvm_vcpu_arch {
 	 */
 	struct kvm_mmu *walk_mmu;
 
+	/*
+	 * Protect cache from getting emptied in MMU shrinker while vCPU might
+	 * use cache for fault handling or loading MMU.  As this is a per vCPU
+	 * lock, only contention might happen when MMU shrinker runs.
+	 */
+	struct mutex mmu_memory_cache_lock;
 	struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
 	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
 	struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 213e46b55dda2..8e2935347615d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4524,29 +4524,33 @@  static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	if (r != RET_PF_INVALID)
 		return r;
 
+	mutex_lock(&vcpu->arch.mmu_memory_cache_lock);
 	r = mmu_topup_memory_caches(vcpu, false);
 	if (r)
-		return r;
+		goto out_mmu_memory_cache_unlock;
 
 	r = kvm_faultin_pfn(vcpu, fault, ACC_ALL);
 	if (r != RET_PF_CONTINUE)
-		return r;
+		goto out_mmu_memory_cache_unlock;
 
 	r = RET_PF_RETRY;
 	write_lock(&vcpu->kvm->mmu_lock);
 
 	if (is_page_fault_stale(vcpu, fault))
-		goto out_unlock;
+		goto out_mmu_unlock;
 
 	r = make_mmu_pages_available(vcpu);
 	if (r)
-		goto out_unlock;
+		goto out_mmu_unlock;
 
 	r = direct_map(vcpu, fault);
 
-out_unlock:
+out_mmu_unlock:
 	write_unlock(&vcpu->kvm->mmu_lock);
 	kvm_release_pfn_clean(fault->pfn);
+out_mmu_memory_cache_unlock:
+	mutex_unlock(&vcpu->arch.mmu_memory_cache_lock);
+
 	return r;
 }
 
@@ -4617,25 +4621,28 @@  static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
 	if (r != RET_PF_INVALID)
 		return r;
 
+	mutex_lock(&vcpu->arch.mmu_memory_cache_lock);
 	r = mmu_topup_memory_caches(vcpu, false);
 	if (r)
-		return r;
+		goto out_mmu_memory_cache_unlock;
 
 	r = kvm_faultin_pfn(vcpu, fault, ACC_ALL);
 	if (r != RET_PF_CONTINUE)
-		return r;
+		goto out_mmu_memory_cache_unlock;
 
 	r = RET_PF_RETRY;
 	read_lock(&vcpu->kvm->mmu_lock);
 
 	if (is_page_fault_stale(vcpu, fault))
-		goto out_unlock;
+		goto out_mmu_unlock;
 
 	r = kvm_tdp_mmu_map(vcpu, fault);
 
-out_unlock:
+out_mmu_unlock:
 	read_unlock(&vcpu->kvm->mmu_lock);
 	kvm_release_pfn_clean(fault->pfn);
+out_mmu_memory_cache_unlock:
+	mutex_unlock(&vcpu->arch.mmu_memory_cache_lock);
 	return r;
 }
 #endif
@@ -5691,6 +5698,7 @@  int kvm_mmu_load(struct kvm_vcpu *vcpu)
 {
 	int r;
 
+	mutex_lock(&vcpu->arch.mmu_memory_cache_lock);
 	r = mmu_topup_memory_caches(vcpu, !vcpu->arch.mmu->root_role.direct);
 	if (r)
 		goto out;
@@ -5717,6 +5725,7 @@  int kvm_mmu_load(struct kvm_vcpu *vcpu)
 	 */
 	kvm_x86_call(flush_tlb_current)(vcpu);
 out:
+	mutex_unlock(&vcpu->arch.mmu_memory_cache_lock);
 	return r;
 }
 
@@ -6303,6 +6312,7 @@  int kvm_mmu_create(struct kvm_vcpu *vcpu)
 	if (!vcpu->arch.mmu_shadow_page_cache.init_value)
 		vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
 
+	mutex_init(&vcpu->arch.mmu_memory_cache_lock);
 	vcpu->arch.mmu = &vcpu->arch.root_mmu;
 	vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
 
@@ -6997,13 +7007,50 @@  void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
 static unsigned long mmu_shrink_scan(struct shrinker *shrink,
 				     struct shrink_control *sc)
 {
-	return SHRINK_STOP;
+	struct kvm *kvm, *next_kvm, *first_kvm = NULL;
+	unsigned long i, freed = 0;
+	struct kvm_vcpu *vcpu;
+
+	mutex_lock(&kvm_lock);
+	list_for_each_entry_safe(kvm, next_kvm, &vm_list, vm_list) {
+		if (!first_kvm)
+			first_kvm = kvm;
+		else if (first_kvm == kvm)
+			break;
+
+		list_move_tail(&kvm->vm_list, &vm_list);
+
+		kvm_for_each_vcpu(i, vcpu, kvm) {
+			if (!mutex_trylock(&vcpu->arch.mmu_memory_cache_lock))
+				continue;
+			freed += kvm_mmu_empty_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
+			freed += kvm_mmu_empty_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
+			mutex_unlock(&vcpu->arch.mmu_memory_cache_lock);
+			if (freed >= sc->nr_to_scan)
+				goto out;
+		}
+	}
+out:
+	mutex_unlock(&kvm_lock);
+	return freed;
 }
 
 static unsigned long mmu_shrink_count(struct shrinker *shrink,
 				      struct shrink_control *sc)
 {
-	return SHRINK_EMPTY;
+	unsigned long i, count = 0;
+	struct kvm_vcpu *vcpu;
+	struct kvm *kvm;
+
+	mutex_lock(&kvm_lock);
+	list_for_each_entry(kvm, &vm_list, vm_list) {
+		kvm_for_each_vcpu(i, vcpu, kvm) {
+			count += READ_ONCE(vcpu->arch.mmu_shadow_page_cache.nobjs);
+			count += READ_ONCE(vcpu->arch.mmu_shadowed_info_cache.nobjs);
+		}
+	}
+	mutex_unlock(&kvm_lock);
+	return !count ? SHRINK_EMPTY : count;
 }
 
 static struct shrinker *mmu_shrinker;
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 405bd7ceee2a3..084a5c532078f 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -809,13 +809,14 @@  static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 		return RET_PF_EMULATE;
 	}
 
+	mutex_lock(&vcpu->arch.mmu_memory_cache_lock);
 	r = mmu_topup_memory_caches(vcpu, true);
 	if (r)
-		return r;
+		goto out_mmu_memory_cache_unlock;
 
 	r = kvm_faultin_pfn(vcpu, fault, walker.pte_access);
 	if (r != RET_PF_CONTINUE)
-		return r;
+		goto out_mmu_memory_cache_unlock;
 
 	/*
 	 * Do not change pte_access if the pfn is a mmio page, otherwise
@@ -840,16 +841,19 @@  static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	write_lock(&vcpu->kvm->mmu_lock);
 
 	if (is_page_fault_stale(vcpu, fault))
-		goto out_unlock;
+		goto out_mmu_unlock;
 
 	r = make_mmu_pages_available(vcpu);
 	if (r)
-		goto out_unlock;
+		goto out_mmu_unlock;
 	r = FNAME(fetch)(vcpu, fault, &walker);
 
-out_unlock:
+out_mmu_unlock:
 	write_unlock(&vcpu->kvm->mmu_lock);
 	kvm_release_pfn_clean(fault->pfn);
+out_mmu_memory_cache_unlock:
+	mutex_unlock(&vcpu->arch.mmu_memory_cache_lock);
+
 	return r;
 }
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b23c6d48392f7..288e503f14a0b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1446,6 +1446,7 @@  void kvm_flush_remote_tlbs_memslot(struct kvm *kvm,
 int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
 int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min);
 int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc);
+int kvm_mmu_empty_memory_cache(struct kvm_mmu_memory_cache *mc);
 void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
 void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 #endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index cb2b78e92910f..5d89ca218791b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -451,15 +451,21 @@  int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc)
 	return mc->nobjs;
 }
 
-void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
+int kvm_mmu_empty_memory_cache(struct kvm_mmu_memory_cache *mc)
 {
+	int freed = mc->nobjs;
 	while (mc->nobjs) {
 		if (mc->kmem_cache)
 			kmem_cache_free(mc->kmem_cache, mc->objects[--mc->nobjs]);
 		else
 			free_page((unsigned long)mc->objects[--mc->nobjs]);
 	}
+	return freed;
+}
 
+void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
+{
+	kvm_mmu_empty_memory_cache(mc);
 	kvfree(mc->objects);
 
 	mc->objects = NULL;