[v2,03/15] KVM: MMU: lazily drop large spte

Message ID 1378376958-27252-4-git-send-email-xiaoguangrong@linux.vnet.ibm.com (mailing list archive)
State New, archived

Commit Message

Xiao Guangrong Sept. 5, 2013, 10:29 a.m. UTC
Currently, kvm zaps the large spte if write protection is needed, so a
subsequent read access will fault on that spte. Instead, we can make the
large spte read-only rather than non-present, and the page fault caused
by a read access can be avoided.

The idea is from Avi:
| As I mentioned before, write-protecting a large spte is a good idea,
| since it moves some work from protect-time to fault-time, so it reduces
| jitter.  This removes the need for the return value.

This version fixes the issue reported in 6b73a9606: fast_page_fault()
directly set a read-only large spte to writable but dirtied only the
first page in the dirty-bitmap, which means the other pages covered by
the large spte were missed. Fix it by allowing only normal sptes (at
PT_PAGE_TABLE_LEVEL) to be fast fixed.
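
A minimal, self-contained sketch of the dirty-logging gap described above
(illustrative only, not the kernel code; it assumes a 2MB large spte
covering 512 4k pages and a byte-array dirty bitmap standing in for the
memslot's real one):

	#include <stdio.h>

	#define PAGES_PER_LARGE_SPTE 512	/* 2MB / 4KB, assumed */

	/* Toy dirty bitmap standing in for the memslot's dirty_bitmap. */
	static unsigned char dirty_bitmap[PAGES_PER_LARGE_SPTE / 8];

	static void mark_page_dirty(unsigned long gfn)
	{
		dirty_bitmap[gfn / 8] |= 1u << (gfn % 8);
	}

	int main(void)
	{
		/*
		 * fast_pf_fix_direct_spte() dirties only the faulting gfn.
		 * If the spte made writable is a large one, the guest can
		 * then write the other 511 pages with no further fault, so
		 * those writes never reach the dirty bitmap.
		 */
		mark_page_dirty(0);	/* only the faulting page is logged */

		int logged = 0;
		for (int i = 0; i < PAGES_PER_LARGE_SPTE; i++)
			if (dirty_bitmap[i / 8] & (1u << (i % 8)))
				logged++;

		printf("writable via large spte: %d pages, logged dirty: %d\n",
		       PAGES_PER_LARGE_SPTE, logged);	/* 512 vs 1 */
		return 0;
	}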

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c | 36 ++++++++++++++++++++----------------
 arch/x86/kvm/x86.c |  8 ++++++--
 2 files changed, 26 insertions(+), 18 deletions(-)

Comments

Marcelo Tosatti Sept. 30, 2013, 10:39 p.m. UTC | #1
On Thu, Sep 05, 2013 at 06:29:06PM +0800, Xiao Guangrong wrote:
> [...]
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 869f1db..88107ee 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1177,8 +1177,7 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
>  
>  /*
>   * Write-protect on the specified @sptep, @pt_protect indicates whether
> - * spte writ-protection is caused by protecting shadow page table.
> - * @flush indicates whether tlb need be flushed.
> + * spte write-protection is caused by protecting shadow page table.
>   *
>   * Note: write protection is difference between drity logging and spte
>   * protection:
> @@ -1187,10 +1186,9 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
>   * - for spte protection, the spte can be writable only after unsync-ing
>   *   shadow page.
>   *
> - * Return true if the spte is dropped.
> + * Return true if tlb need be flushed.
>   */
> -static bool
> -spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
> +static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
>  {
>  	u64 spte = *sptep;
>  
> @@ -1200,17 +1198,11 @@ spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
>  
>  	rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);
>  
> -	if (__drop_large_spte(kvm, sptep)) {
> -		*flush |= true;
> -		return true;
> -	}
> -
>  	if (pt_protect)
>  		spte &= ~SPTE_MMU_WRITEABLE;
>  	spte = spte & ~PT_WRITABLE_MASK;
>  
> -	*flush |= mmu_spte_update(sptep, spte);
> -	return false;
> +	return mmu_spte_update(sptep, spte);
>  }

Is it necessary for kvm_mmu_unprotect_page to search an entire large
page range now, instead of a single 4k page?
Xiao Guangrong Oct. 3, 2013, 6:29 a.m. UTC | #2
On Oct 1, 2013, at 6:39 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:

> On Thu, Sep 05, 2013 at 06:29:06PM +0800, Xiao Guangrong wrote:
>> [...]
>
> Is it necessary for kvm_mmu_unprotect_page to search an entire large
> page range now, instead of a single 4k page?

It is unnecessary. kvm_mmu_unprotect_page is used to delete the gfn's
shadow pages, after which the vcpu will re-fault. If any gfn in the large
range has a shadow page, kvm stops using the large mapping, so the
mapping will be split into small mappings when the vcpu faults again.
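
Roughly, that decision is made when the fault path picks a mapping
level. A simplified sketch (illustrative names and sizes; the real code
goes through mapping_level()/has_wrprotected_page() and the per-slot
lpage_info write_count, which this toy counter only approximates):

	#include <stdio.h>

	#define PAGES_PER_LARGE_FRAME	512	/* 2MB / 4KB, assumed */
	#define PT_PAGE_TABLE_LEVEL	1
	#define PT_DIRECTORY_LEVEL	2

	/*
	 * Toy per-slot state: nonzero if some gfn inside the large frame
	 * is backed by a shadow page (akin to lpage_info write_count).
	 */
	static int frame_has_shadow_page[1024];

	static int mapping_level(unsigned long gfn)
	{
		unsigned long frame = gfn / PAGES_PER_LARGE_FRAME;

		if (frame_has_shadow_page[frame])
			return PT_PAGE_TABLE_LEVEL;	/* split: map at 4k */

		return PT_DIRECTORY_LEVEL;		/* 2MB mapping is safe */
	}

	int main(void)
	{
		frame_has_shadow_page[0] = 1;	/* gfn 0..511 shadowed */
		printf("gfn 100 -> level %d\n", mapping_level(100)); /* 1 */
		printf("gfn 600 -> level %d\n", mapping_level(600)); /* 2 */
		return 0;
	}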



Patch

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 869f1db..88107ee 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1177,8 +1177,7 @@  static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
 
 /*
  * Write-protect on the specified @sptep, @pt_protect indicates whether
- * spte writ-protection is caused by protecting shadow page table.
- * @flush indicates whether tlb need be flushed.
+ * spte write-protection is caused by protecting shadow page table.
  *
  * Note: write protection is difference between drity logging and spte
  * protection:
@@ -1187,10 +1186,9 @@  static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
  * - for spte protection, the spte can be writable only after unsync-ing
  *   shadow page.
  *
- * Return true if the spte is dropped.
+ * Return true if tlb need be flushed.
  */
-static bool
-spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
+static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
 {
 	u64 spte = *sptep;
 
@@ -1200,17 +1198,11 @@  spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
 
 	rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);
 
-	if (__drop_large_spte(kvm, sptep)) {
-		*flush |= true;
-		return true;
-	}
-
 	if (pt_protect)
 		spte &= ~SPTE_MMU_WRITEABLE;
 	spte = spte & ~PT_WRITABLE_MASK;
 
-	*flush |= mmu_spte_update(sptep, spte);
-	return false;
+	return mmu_spte_update(sptep, spte);
 }
 
 static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
@@ -1222,11 +1214,8 @@  static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
 
 	for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
 		BUG_ON(!(*sptep & PT_PRESENT_MASK));
-		if (spte_write_protect(kvm, sptep, &flush, pt_protect)) {
-			sptep = rmap_get_first(*rmapp, &iter);
-			continue;
-		}
 
+		flush |= spte_write_protect(kvm, sptep, pt_protect);
 		sptep = rmap_get_next(&iter);
 	}
 
@@ -2675,6 +2664,8 @@  static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
 			break;
 		}
 
+		drop_large_spte(vcpu, iterator.sptep);
+
 		if (!is_shadow_present_pte(*iterator.sptep)) {
 			u64 base_addr = iterator.addr;
 
@@ -2876,6 +2867,19 @@  static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
 		goto exit;
 
 	/*
+	 * Do not fix write-permission on the large spte since we only dirty
+	 * the first page into the dirty-bitmap in fast_pf_fix_direct_spte()
+	 * that means other pages are missed if its slot is dirty-logged.
+	 *
+	 * Instead, we let the slow page fault path create a normal spte to
+	 * fix the access.
+	 *
+	 * See the comments in kvm_arch_commit_memory_region().
+	 */
+	if (sp->role.level > PT_PAGE_TABLE_LEVEL)
+		goto exit;
+
+	/*
 	 * Currently, fast page fault only works for direct mapping since
 	 * the gfn is not stable for indirect shadow page.
 	 * See Documentation/virtual/kvm/locking.txt to get more detail.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e5ca72a..6ad0c07 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7208,8 +7208,12 @@  void kvm_arch_commit_memory_region(struct kvm *kvm,
 		kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);
 	/*
 	 * Write protect all pages for dirty logging.
-	 * Existing largepage mappings are destroyed here and new ones will
-	 * not be created until the end of the logging.
+	 *
+	 * All the sptes including the large sptes which point to this
+	 * slot are set to readonly. We can not create any new large
+	 * spte on this slot until the end of the logging.
+	 *
+	 * See the comments in fast_page_fault().
 	 */
 	if ((change != KVM_MR_DELETE) && (mem->flags & KVM_MEM_LOG_DIRTY_PAGES))
 		kvm_mmu_slot_remove_write_access(kvm, mem->slot);