[v2] kvm: mmu: lazy collapse small sptes into large sptes

Message ID	1428041361-4741-1-git-send-email-wanpeng.li@linux.intel.com (mailing list archive)
State	New, archived
Headers
Return-Path: <kvm-owner@kernel.org>
From: Wanpeng Li <wanpeng.li@linux.intel.com>
To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	Xiao Guangrong <guangrong.xiao@linux.intel.com>,
	Wanpeng Li <wanpeng.li@linux.intel.com>
Subject: [PATCH v2] kvm: mmu: lazy collapse small sptes into large sptes
Date: Fri, 3 Apr 2015 14:09:21 +0800
Message-Id: <1428041361-4741-1-git-send-email-wanpeng.li@linux.intel.com>
Sender: kvm-owner@vger.kernel.org
Precedence: bulk


Commit Message

Wanpeng Li April 3, 2015, 6:09 a.m. UTC

There are two scenarios that call for collapsing small sptes into large
sptes.
- Dirty logging tracks sptes at 4k granularity, so large sptes are split.
  If live migration succeeds, the large sptes are reallocated on the
  destination machine and the guest on the source machine is destroyed.
  However, if live migration fails for some reason, the guest keeps running
  on the source machine and its sptes stay small, which leads to bad
  performance.
- Our customers write tools that track the dirty speed of guests via the
  EPT D bit/PML in order to determine the most appropriate guest to live
  migrate; however, the sptes stay small after dirty-speed tracking ends.

This patch introduces lazy collapsing of small sptes into large sptes. The
memory region is scanned in the ioctl context when dirty logging is stopped,
and the sptes that can be collapsed into large pages are dropped during the
scan. Later #PFs are then relied upon to reallocate the large sptes.

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
---
v1 -> v2:
 * use 'bool' instead of 'int'
 * add more comments
 * fix failure to get the next spte after dropping the current spte

 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/mmu.c              | 71 +++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c              | 19 +++++++++++
 3 files changed, 92 insertions(+)
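
The trigger described in the commit message (the scan runs only when dirty logging was enabled on the old slot and is disabled on the new one, and the slot is not being deleted) can be sketched in isolation as a small predicate. This is a userspace illustration only: `should_zap_collapsible`, `enum mr_change` and `MEM_LOG_DIRTY_PAGES` are hypothetical stand-ins for the kernel's definitions, and the flag value is illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the dirty-logging memslot flag; the value is illustrative. */
#define MEM_LOG_DIRTY_PAGES (1u << 0)

/* Stand-in for the kinds of memory region change. */
enum mr_change { MR_CREATE, MR_DELETE, MR_MOVE, MR_FLAGS_ONLY };

/*
 * Returns true when dirty logging has just been switched off for a slot
 * that still exists, i.e. when small mappings may be worth collapsing.
 */
static bool should_zap_collapsible(enum mr_change change,
				   uint32_t old_flags, uint32_t new_flags)
{
	return change != MR_DELETE &&
	       (old_flags & MEM_LOG_DIRTY_PAGES) &&
	       !(new_flags & MEM_LOG_DIRTY_PAGES);
}
```

Under these assumptions, only the on-to-off transition returns true; enabling logging, leaving it unchanged, or deleting the slot does not.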

Comments

Xiao Guangrong April 3, 2015, 7:27 a.m. UTC | #1

On 04/03/2015 02:09 PM, Wanpeng Li wrote:
> [...]
>
> +static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> +		unsigned long *rmapp)
> +{
> +	u64 *sptep;
> +	struct rmap_iterator iter;
> +	int need_tlb_flush = 0;
> +	pfn_t pfn;
> +	struct kvm_mmu_page *sp;
> +
> +	while ((sptep = rmap_get_first(*rmapp, &iter))) {
> +		BUG_ON(!(*sptep & PT_PRESENT_MASK));
> +
> +		sp = page_header(__pa(sptep));
> +		pfn = spte_to_pfn(*sptep);
> +
> +		/*
> +		 * Let support EPT only now, an efficient way need to be figure
> +		 * out to let these code be aware what mapping level used in
> +		 * guest.

This English seems strange... but I am not good at it. :)

> +		 */
> +		if (sp->role.direct &&
> +			!kvm_is_reserved_pfn(pfn) &&
> +			PageTransCompound(pfn_to_page(pfn))) {
> +			drop_spte(kvm, sptep);
> +			need_tlb_flush = 1;
> +		}

If the conditions are not satisfied, it will loop forever...

Otherwise, it looks good to me.

Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
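
The loop concern raised in comment #1 follows from the iteration pattern in the patch: `rmap_get_first()` is called on every pass, so an spte that is not dropped would simply be returned again. Below is a minimal userspace sketch of one way such a loop could be restructured: restart from the head after a removal, advance otherwise. `toy_rmap`, `toy_iter` and the `toy_*` helpers are hypothetical stand-ins rather than the kernel rmap API, the "odd value" test merely stands in for the collapsibility check, and this is not necessarily what v3 of the patch does.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_RMAP_MAX 8

/* Toy reverse map: a small array of entries (stand-in for sptes). */
struct toy_rmap {
	int entry[TOY_RMAP_MAX];
	size_t count;
};

struct toy_iter {
	size_t pos;
};

/* Return the first entry, or NULL if the map is empty. */
static int *toy_get_first(struct toy_rmap *r, struct toy_iter *it)
{
	it->pos = 0;
	return r->count ? &r->entry[0] : NULL;
}

/* Return the entry after the current one, or NULL at the end. */
static int *toy_get_next(struct toy_rmap *r, struct toy_iter *it)
{
	it->pos++;
	return it->pos < r->count ? &r->entry[it->pos] : NULL;
}

/* Remove *e by shifting later entries down; this invalidates iterators. */
static void toy_drop(struct toy_rmap *r, int *e)
{
	size_t i = (size_t)(e - r->entry);

	for (; i + 1 < r->count; i++)
		r->entry[i] = r->entry[i + 1];
	r->count--;
}

/* Drop every "collapsible" (here: odd) entry; report whether any was dropped. */
static bool toy_zap_collapsible(struct toy_rmap *r)
{
	struct toy_iter it;
	bool dropped = false;
	int *e;

	for (e = toy_get_first(r, &it); e; ) {
		if (*e % 2) {
			toy_drop(r, e);
			/* The iterator is stale after a removal: restart. */
			e = toy_get_first(r, &it);
			dropped = true;
		} else {
			/* Entry is kept: step past it so the loop terminates. */
			e = toy_get_next(r, &it);
		}
	}
	return dropped;
}
```

The point of the sketch is that every pass either removes an entry or moves the iterator forward, so the loop ends even when no entry satisfies the condition.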

Wanpeng Li April 3, 2015, 7:44 a.m. UTC | #2

On Fri, Apr 03, 2015 at 03:27:51PM +0800, Xiao Guangrong wrote:
>
>
>On 04/03/2015 02:09 PM, Wanpeng Li wrote:
>> [...]
>>
>>+static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>>+		unsigned long *rmapp)
>>+{
>>+	u64 *sptep;
>>+	struct rmap_iterator iter;
>>+	int need_tlb_flush = 0;
>>+	pfn_t pfn;
>>+	struct kvm_mmu_page *sp;
>>+
>>+	while ((sptep = rmap_get_first(*rmapp, &iter))) {
>>+		BUG_ON(!(*sptep & PT_PRESENT_MASK));
>>+
>>+		sp = page_header(__pa(sptep));
>>+		pfn = spte_to_pfn(*sptep);
>>+
>>+		/*
>>+		 * Let support EPT only now, an efficient way need to be figure
>>+		 * out to let these code be aware what mapping level used in
>>+		 * guest.
>
>This English seems strange... but I am not good at it. :)

I'm not good at English either; anyway, I'll update it in v3. :)

>
>>+		 */
>>+		if (sp->role.direct &&
>>+			!kvm_is_reserved_pfn(pfn) &&
>>+			PageTransCompound(pfn_to_page(pfn))) {
>>+			drop_spte(kvm, sptep);
>>+			need_tlb_flush = 1;
>>+		}
>
>If the conditions are not satisfied, it will loop forever...

I'll fix it in v3.

>
>Otherwise, it looks good to me.
>
>Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>

Thanks for your review. :)

Regards,
Wanpeng Li 


diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 30b28dc..91b5bdb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -854,6 +854,8 @@  void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
 void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 				      struct kvm_memory_slot *memslot);
+void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
+					struct kvm_memory_slot *memslot);
 void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
 				   struct kvm_memory_slot *memslot);
 void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index cee7592..df3f2e3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4465,6 +4465,77 @@  void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 		kvm_flush_remote_tlbs(kvm);
 }
 
+static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
+		unsigned long *rmapp)
+{
+	u64 *sptep;
+	struct rmap_iterator iter;
+	int need_tlb_flush = 0;
+	pfn_t pfn;
+	struct kvm_mmu_page *sp;
+
+	while ((sptep = rmap_get_first(*rmapp, &iter))) {
+		BUG_ON(!(*sptep & PT_PRESENT_MASK));
+
+		sp = page_header(__pa(sptep));
+		pfn = spte_to_pfn(*sptep);
+
+		/*
+		 * Let support EPT only now, an efficient way need to be figure
+		 * out to let these code be aware what mapping level used in
+		 * guest.
+		 */
+		if (sp->role.direct &&
+			!kvm_is_reserved_pfn(pfn) &&
+			PageTransCompound(pfn_to_page(pfn))) {
+			drop_spte(kvm, sptep);
+			need_tlb_flush = 1;
+		}
+	}
+
+	return need_tlb_flush;
+}
+
+void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
+			struct kvm_memory_slot *memslot)
+{
+	bool flush = false;
+	unsigned long *rmapp;
+	unsigned long last_index, index;
+	gfn_t gfn_start, gfn_end;
+
+	spin_lock(&kvm->mmu_lock);
+
+	gfn_start = memslot->base_gfn;
+	gfn_end = memslot->base_gfn + memslot->npages - 1;
+
+	if (gfn_start >= gfn_end)
+		goto out;
+
+	rmapp = memslot->arch.rmap[0];
+	last_index = gfn_to_index(gfn_end, memslot->base_gfn,
+					PT_PAGE_TABLE_LEVEL);
+
+	for (index = 0; index <= last_index; ++index, ++rmapp) {
+		if (*rmapp)
+			flush |= kvm_mmu_zap_collapsible_spte(kvm, rmapp);
+
+		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+			if (flush) {
+				kvm_flush_remote_tlbs(kvm);
+				flush = false;
+			}
+			cond_resched_lock(&kvm->mmu_lock);
+		}
+	}
+
+	if (flush)
+		kvm_flush_remote_tlbs(kvm);
+
+out:
+	spin_unlock(&kvm->mmu_lock);
+}
+
 void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
 				   struct kvm_memory_slot *memslot)
 {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 50861dd..650a552 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7647,6 +7647,25 @@  void kvm_arch_commit_memory_region(struct kvm *kvm,
 	new = id_to_memslot(kvm->memslots, mem->slot);
 
 	/*
+	 * Dirty logging tracks sptes in 4k granularity, so large sptes are
+	 * split, the large sptes will be reallocated in the destination
+	 * machine and the guest in the source machine will be destroyed
+	 * when live migration successfully. However, the guest in the source
+	 * machine will continue to run if live migration fail due to some
+	 * reasons, the sptes still keep small which lead to bad performance.
+	 *
+	 * Lazy collapse small sptes into large sptes is intended to handle
+	 * this, the memory region will be scanned on the ioctl context when
+	 * dirty log is stopped, the ones which can be collapsed into large
+	 * pages will be dropped during the scan, it depends the on later #PF
+	 * to reallocate all large sptes.
+	 */
+	if ((change != KVM_MR_DELETE) &&
+		(old->flags & KVM_MEM_LOG_DIRTY_PAGES) &&
+		!(new->flags & KVM_MEM_LOG_DIRTY_PAGES))
+		kvm_mmu_zap_collapsible_sptes(kvm, new);
+
+	/*
 	 * Set up write protection and/or dirty logging for the new slot.
 	 *
 	 * For KVM_MR_DELETE and KVM_MR_MOVE, the shadow pages of old slot have
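
The scan loop in kvm_mmu_zap_collapsible_sptes() above accumulates a pending-flush flag and, when it is about to yield the lock, flushes first and clears the flag; a final flush covers whatever is still pending at the end. Presumably this keeps stale translations from outliving the locked region. The batching pattern can be modelled in userspace as follows. `scan_slots`, `struct scan_stats` and the two input arrays are hypothetical and only count events; no locking or TLB operation is performed.

```c
#include <stdbool.h>
#include <stddef.h>

struct scan_stats {
	unsigned int flushes;	/* number of (modelled) remote TLB flushes */
	unsigned int yields;	/* number of times the lock was given up */
};

/*
 * zapped[i]      - true if processing slot i dropped at least one entry
 * yield_after[i] - true if the scanner has to yield after slot i
 */
static struct scan_stats scan_slots(const bool *zapped,
				    const bool *yield_after, size_t n)
{
	struct scan_stats st = { 0, 0 };
	bool flush = false;
	size_t i;

	for (i = 0; i < n; i++) {
		flush |= zapped[i];

		if (yield_after[i]) {
			/* Flush pending work before giving up the lock. */
			if (flush) {
				st.flushes++;
				flush = false;
			}
			st.yields++;
		}
	}

	/* Flush anything still pending at the end of the scan. */
	if (flush)
		st.flushes++;

	return st;
}
```

In this model, a scan that drops nothing never flushes, however often it yields, and consecutive drops between two yield points are covered by a single flush.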
