Message ID: 1375189330-24066-4-git-send-email-xiaoguangrong@linux.vnet.ibm.com (mailing list archive)
State: New, archived
On Tue, Jul 30, 2013 at 09:02:01PM +0800, Xiao Guangrong wrote:
> Currently, kvm zaps the large spte if write-protected is needed, the later
> read can fault on that spte. Actually, we can make the large spte readonly
> instead of making them un-present, the page fault caused by read access can
> be avoided
>
> The idea is from Avi:
> | As I mentioned before, write-protecting a large spte is a good idea,
> | since it moves some work from protect-time to fault-time, so it reduces
> | jitter. This removes the need for the return value.
>
> [
>   It has fixed the issue reported in 6b73a9606 by stopping fast page fault
>   marking the large spte to writable
> ]

Xiao,

Can you please write a comment explaining what the problems are with
shadow vs large read-only sptes (can't recall anymore), and then why it
is now safe to do it.

Comments below.

> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> ---
>  arch/x86/kvm/mmu.c | 36 +++++++++++++++++-------------------
>  1 file changed, 17 insertions(+), 19 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index cf163ca..35d4b50 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1181,8 +1181,7 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
>
>  /*
>   * Write-protect on the specified @sptep, @pt_protect indicates whether
> - * spte writ-protection is caused by protecting shadow page table.
> - * @flush indicates whether tlb need be flushed.
> + * spte write-protection is caused by protecting shadow page table.
>   *
>   * Note: write protection is difference between drity logging and spte
>   * protection:
> @@ -1191,10 +1190,9 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
>   * - for spte protection, the spte can be writable only after unsync-ing
>   *   shadow page.
>   *
> - * Return true if the spte is dropped.
> + * Return true if tlb need be flushed.
>   */
> -static bool
> -spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
> +static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
>  {
>  	u64 spte = *sptep;
>
> @@ -1204,17 +1202,11 @@ spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
>
>  	rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);
>
> -	if (__drop_large_spte(kvm, sptep)) {
> -		*flush |= true;
> -		return true;
> -	}
> -
>  	if (pt_protect)
>  		spte &= ~SPTE_MMU_WRITEABLE;
>  	spte = spte & ~PT_WRITABLE_MASK;
>
> -	*flush |= mmu_spte_update(sptep, spte);
> -	return false;
> +	return mmu_spte_update(sptep, spte);
>  }
>
>  static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
> @@ -1226,11 +1218,8 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
>
>  	for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
>  		BUG_ON(!(*sptep & PT_PRESENT_MASK));
> -		if (spte_write_protect(kvm, sptep, &flush, pt_protect)) {
> -			sptep = rmap_get_first(*rmapp, &iter);
> -			continue;
> -		}
>
> +		flush |= spte_write_protect(kvm, sptep, pt_protect);
>  		sptep = rmap_get_next(&iter);
>  	}
>
> @@ -2701,6 +2690,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
>  			break;
>  		}
>
> +		drop_large_spte(vcpu, iterator.sptep);
> +
>  		if (!is_shadow_present_pte(*iterator.sptep)) {
>  			u64 base_addr = iterator.addr;
>
> @@ -2855,7 +2846,7 @@ fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>   * - false: let the real page fault path to fix it.
>   */
>  static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
> -			    u32 error_code)
> +			    u32 error_code, bool force_pt_level)
>  {
>  	struct kvm_shadow_walk_iterator iterator;
>  	struct kvm_mmu_page *sp;
> @@ -2884,6 +2875,13 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
>  		goto exit;
>
>  	/*
> +	 * Can not map the large spte to writable if the page is dirty
> +	 * logged.
> +	 */
> +	if (sp->role.level > PT_PAGE_TABLE_LEVEL && force_pt_level)
> +		goto exit;
> +

It is not safe to derive slot->dirty_bitmap like this: since dirty log is
enabled via RCU update, "is dirty bitmap enabled" info could be stale by
the time you check it here via the parameter, so you can instantiate a
large spte (because force_pt_level == false), while you should not.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Aug 2, 2013, at 10:55 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:

> On Tue, Jul 30, 2013 at 09:02:01PM +0800, Xiao Guangrong wrote:
>> Currently, kvm zaps the large spte if write-protected is needed, the later
>> read can fault on that spte. Actually, we can make the large spte readonly
>> instead of making them un-present, the page fault caused by read access can
>> be avoided
>>
>> The idea is from Avi:
>> | As I mentioned before, write-protecting a large spte is a good idea,
>> | since it moves some work from protect-time to fault-time, so it reduces
>> | jitter. This removes the need for the return value.
>>
>> [
>>   It has fixed the issue reported in 6b73a9606 by stopping fast page fault
>>   marking the large spte to writable
>> ]
>
> Xiao,
>
> Can you please write a comment explaining why are the problems
> with shadow vs large read-only sptes (can't recall anymore),
> and then why it is now safe to do it.

Hi Marcelo,

Thanks for your review. Yes. The bug reported in 6b73a9606 is: in this patch,
we mark the large spte as readonly when the pages are dirty logged, and the
readonly spte can be set to writable by fast page fault. But on that path, it
failed to check dirty logging, so it would set the large spte to writable yet
only set the first page in the dirty bitmap.

For example:

1): KVM maps 0 ~ 2M memory to the guest, covered by one SPTE, and the SPTE
    is writable.

2): KVM dirty logs 0 ~ 2M, then sets the SPTE to readonly.

3): Fast page fault sets the SPTE to writable and sets only page 0 in the
    dirty bitmap.

Then 4K ~ 2M memory is not dirty logged.

In this version, we let fast page fault not mark a large spte writable if
its pages are dirty logged. But it is still not safe, as you pointed out.

>>
>>
>>  	/*
>> +	 * Can not map the large spte to writable if the page is dirty
>> +	 * logged.
>> +	 */
>> +	if (sp->role.level > PT_PAGE_TABLE_LEVEL && force_pt_level)
>> +		goto exit;
>> +
>
> It is not safe to derive slot->dirty_bitmap like this:
> since dirty log is enabled via RCU update, "is dirty bitmap enabled"
> info could be stale by the time you check it here via the parameter,
> so you can instantiate a large spte (because force_pt_level == false),
> while you should not.

Good catch! This is true even if we enable dirty log under the protection
of mmu lock.

How about letting fast page fault only fix the small spte, that is, changing
the code to:

	if (sp->role.level > PT_PAGE_TABLE_LEVEL)
		goto exit;

?
On Fri, Aug 02, 2013 at 11:42:19PM +0800, Xiao Guangrong wrote:
>
> On Aug 2, 2013, at 10:55 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>
> > On Tue, Jul 30, 2013 at 09:02:01PM +0800, Xiao Guangrong wrote:
> >> Currently, kvm zaps the large spte if write-protected is needed, the later
> >> read can fault on that spte. Actually, we can make the large spte readonly
> >> instead of making them un-present, the page fault caused by read access can
> >> be avoided
> >>
> >> The idea is from Avi:
> >> | As I mentioned before, write-protecting a large spte is a good idea,
> >> | since it moves some work from protect-time to fault-time, so it reduces
> >> | jitter. This removes the need for the return value.
> >>
> >> [
> >>   It has fixed the issue reported in 6b73a9606 by stopping fast page fault
> >>   marking the large spte to writable
> >> ]
> >
> > Xiao,
> >
> > Can you please write a comment explaining why are the problems
> > with shadow vs large read-only sptes (can't recall anymore),
> > and then why it is now safe to do it.
>
> Hi Marcelo,
>
> Thanks for your review. Yes. The bug reported in 6b73a9606 is, in this patch,
> we mark the large spte as readonly when the pages are dirty logged and the
> readonly spte can be set to writable by fast page fault, but on that path, it
> failed to check dirty logging, so it will set the large spte to writable but
> only set the first page to the dirty bitmap.
>
> For example:
>
> 1): KVM maps 0 ~ 2M memory to guest which is pointed by SPTE and SPTE
>     is writable.
>
> 2): KVM dirty log 0 ~ 2M, then set SPTE to readonly
>
> 3): fast page fault set SPTE to writable and set page 0 to the dirty bitmap.
>
> Then 4K ~ 2M memory is not dirty logged.

Ok can you write a self-contained summary of read-only large sptes (when
they are created, when destroyed, from which point they can't be created,
etc), and the interaction with shadow write protection and creation of
writeable sptes?

It's easy to get lost.

> In this version, we let fast page fault not mark a large spte writable if
> its pages are dirty logged. But it is still not safe as you pointed out.
>
> >>
> >>
> >>  	/*
> >> +	 * Can not map the large spte to writable if the page is dirty
> >> +	 * logged.
> >> +	 */
> >> +	if (sp->role.level > PT_PAGE_TABLE_LEVEL && force_pt_level)
> >> +		goto exit;
> >> +
> >
> > It is not safe to derive slot->dirty_bitmap like this:
> > since dirty log is enabled via RCU update, "is dirty bitmap enabled"
> > info could be stale by the time you check it here via the parameter,
> > so you can instantiate a large spte (because force_pt_level == false),
> > while you should not.
>
> Good catch! This is true even if we enable dirty log under the protection
> of mmu lock.
>
> How about letting fast page fault only fix the small spte, that is, changing
> the code to:
>
> 	if (sp->role.level > PT_PAGE_TABLE_LEVEL)
> 		goto exit;
> ?

Sure.
On Aug 3, 2013, at 4:27 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:

> On Fri, Aug 02, 2013 at 11:42:19PM +0800, Xiao Guangrong wrote:
>>
>> On Aug 2, 2013, at 10:55 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>>
>>> On Tue, Jul 30, 2013 at 09:02:01PM +0800, Xiao Guangrong wrote:
>>>> Currently, kvm zaps the large spte if write-protected is needed, the later
>>>> read can fault on that spte. Actually, we can make the large spte readonly
>>>> instead of making them un-present, the page fault caused by read access can
>>>> be avoided
>>>>
>>>> The idea is from Avi:
>>>> | As I mentioned before, write-protecting a large spte is a good idea,
>>>> | since it moves some work from protect-time to fault-time, so it reduces
>>>> | jitter. This removes the need for the return value.
>>>>
>>>> [
>>>>   It has fixed the issue reported in 6b73a9606 by stopping fast page fault
>>>>   marking the large spte to writable
>>>> ]
>>>
>>> Xiao,
>>>
>>> Can you please write a comment explaining why are the problems
>>> with shadow vs large read-only sptes (can't recall anymore),
>>> and then why it is now safe to do it.
>>
>> Hi Marcelo,
>>
>> Thanks for your review. Yes. The bug reported in 6b73a9606 is, in this patch,
>> we mark the large spte as readonly when the pages are dirty logged and the
>> readonly spte can be set to writable by fast page fault, but on that path, it
>> failed to check dirty logging, so it will set the large spte to writable but
>> only set the first page to the dirty bitmap.
>>
>> For example:
>>
>> 1): KVM maps 0 ~ 2M memory to guest which is pointed by SPTE and SPTE
>>     is writable.
>>
>> 2): KVM dirty log 0 ~ 2M, then set SPTE to readonly
>>
>> 3): fast page fault set SPTE to writable and set page 0 to the dirty bitmap.
>>
>> Then 4K ~ 2M memory is not dirty logged.
>
> Ok can you write a self-contained summary of read-only large sptes (when
> they are created, when destroyed, from which point they can't be created,
> etc), and the interaction with shadow write protection and creation of
> writeable sptes?
>
> It's easy to get lost.

Okay, will do.
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index cf163ca..35d4b50 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1181,8 +1181,7 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
 
 /*
  * Write-protect on the specified @sptep, @pt_protect indicates whether
- * spte writ-protection is caused by protecting shadow page table.
- * @flush indicates whether tlb need be flushed.
+ * spte write-protection is caused by protecting shadow page table.
  *
  * Note: write protection is difference between drity logging and spte
  * protection:
@@ -1191,10 +1190,9 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
  * - for spte protection, the spte can be writable only after unsync-ing
  *   shadow page.
  *
- * Return true if the spte is dropped.
+ * Return true if tlb need be flushed.
  */
-static bool
-spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
+static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
 {
 	u64 spte = *sptep;
 
@@ -1204,17 +1202,11 @@ spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
 
 	rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);
 
-	if (__drop_large_spte(kvm, sptep)) {
-		*flush |= true;
-		return true;
-	}
-
 	if (pt_protect)
 		spte &= ~SPTE_MMU_WRITEABLE;
 	spte = spte & ~PT_WRITABLE_MASK;
 
-	*flush |= mmu_spte_update(sptep, spte);
-	return false;
+	return mmu_spte_update(sptep, spte);
 }
 
 static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
@@ -1226,11 +1218,8 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
 
 	for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
 		BUG_ON(!(*sptep & PT_PRESENT_MASK));
-		if (spte_write_protect(kvm, sptep, &flush, pt_protect)) {
-			sptep = rmap_get_first(*rmapp, &iter);
-			continue;
-		}
 
+		flush |= spte_write_protect(kvm, sptep, pt_protect);
 		sptep = rmap_get_next(&iter);
 	}
 
@@ -2701,6 +2690,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
 			break;
 		}
 
+		drop_large_spte(vcpu, iterator.sptep);
+
 		if (!is_shadow_present_pte(*iterator.sptep)) {
 			u64 base_addr = iterator.addr;
 
@@ -2855,7 +2846,7 @@ fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
  * - false: let the real page fault path to fix it.
  */
 static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
-			    u32 error_code)
+			    u32 error_code, bool force_pt_level)
 {
 	struct kvm_shadow_walk_iterator iterator;
 	struct kvm_mmu_page *sp;
@@ -2884,6 +2875,13 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
 		goto exit;
 
 	/*
+	 * Can not map the large spte to writable if the page is dirty
+	 * logged.
+	 */
+	if (sp->role.level > PT_PAGE_TABLE_LEVEL && force_pt_level)
+		goto exit;
+
+	/*
 	 * Check if it is a spurious fault caused by TLB lazily flushed.
 	 *
 	 * Need not check the access of upper level table entries since
@@ -2944,7 +2942,7 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, u32 error_code,
 	} else
 		level = PT_PAGE_TABLE_LEVEL;
 
-	if (fast_page_fault(vcpu, v, level, error_code))
+	if (fast_page_fault(vcpu, v, level, error_code, force_pt_level))
 		return 0;
 
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
@@ -3422,7 +3420,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
 	} else
 		level = PT_PAGE_TABLE_LEVEL;
 
-	if (fast_page_fault(vcpu, gpa, level, error_code))
+	if (fast_page_fault(vcpu, gpa, level, error_code, force_pt_level))
 		return 0;
 
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
Currently, kvm zaps the large spte if write-protected is needed, the later
read can fault on that spte. Actually, we can make the large spte readonly
instead of making them un-present, the page fault caused by read access can
be avoided

The idea is from Avi:
| As I mentioned before, write-protecting a large spte is a good idea,
| since it moves some work from protect-time to fault-time, so it reduces
| jitter. This removes the need for the return value.

[
  It has fixed the issue reported in 6b73a9606 by stopping fast page fault
  marking the large spte to writable
]

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c | 36 +++++++++++++++++-------------------
 1 file changed, 17 insertions(+), 19 deletions(-)