KVM: mmu: spte_write_protect optimization

Message ID 20220525191229.2037431-1-venkateshs@chromium.org (mailing list archive)
State New, archived
Series KVM: mmu: spte_write_protect optimization

Commit Message

Venkatesh Srinivas May 25, 2022, 7:12 p.m. UTC
From: Junaid Shahid <junaids@google.com>

This change uses a lighter-weight function instead of mmu_spte_update()
in the common case in spte_write_protect(). This helps speed up the
get_dirty_log IOCTL.

Performance: dirty_log_perf_test with 32 GB VM size
             Avg IOCTL time over 10 passes
             Haswell: ~0.23s vs ~0.4s
             IvyBridge: ~0.8s vs 1s

Signed-off-by: Venkatesh Srinivas <venkateshs@chromium.org>
Signed-off-by: Junaid Shahid <junaids@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

Comments

Sean Christopherson May 25, 2022, 8:07 p.m. UTC | #1
+Ben and David

On Wed, May 25, 2022, Venkatesh Srinivas wrote:
> From: Junaid Shahid <junaids@google.com>
> 
> This change uses a lighter-weight function instead of mmu_spte_update()
> in the common case in spte_write_protect(). This helps speed up the
> get_dirty_log IOCTL.

When upstreaming our internal patches, please rewrite the changelogs to meet
upstream standards.  If that requires a lot of effort, then by all means take
credit with a Co-developed-by or even making yourself the sole author with a
Suggested-by for the original author.

There is subtly a _lot_ going on in this patch, and practically none of it is
explained.

> Performance: dirty_log_perf_test with 32 GB VM size
>              Avg IOCTL time over 10 passes
>              Haswell: ~0.23s vs ~0.4s
>              IvyBridge: ~0.8s vs 1s
> 
> Signed-off-by: Venkatesh Srinivas <venkateshs@chromium.org>
> Signed-off-by: Junaid Shahid <junaids@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 26 +++++++++++++++++++++-----
>  1 file changed, 21 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index efe5a3dca1e0..a6db9dfaf7c3 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1151,6 +1151,22 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
>  	}
>  }
>  
> +static bool spte_test_and_clear_writable(u64 *sptep)
> +{
> +	u64 spte = *sptep;
> +
> +	if (spte & PT_WRITABLE_MASK) {

This is redundant, the caller has already verified that the spte is writable.

> +		clear_bit(PT_WRITABLE_SHIFT, (unsigned long *)sptep);

Is an unconditional atomic op faster than checking spte_has_volatile_bits()?  If
so, that should be documented with a comment.

This also needs a comment explaining the mmu_lock is held for write and so
WRITABLE can't be cleared, i.e. using clear_bit() instead of test_and_clear_bit()
is acceptable.

> +		if (!spte_ad_enabled(spte))

This absolutely needs a comment explaining what's going on.  Specifically that if
A/D bits are disabled, the WRITABLE bit doubles as the dirty flag, whereas hardware's
DIRTY bit is untouched when A/D bits are in use.

> +			kvm_set_pfn_dirty(spte_to_pfn(spte));
> +
> +		return true;
> +	}
> +
> +	return false;
> +}
> +
>  /*
>   * Write-protect on the specified @sptep, @pt_protect indicates whether
>   * spte write-protection is caused by protecting shadow page table.
> @@ -1174,11 +1190,11 @@ static bool spte_write_protect(u64 *sptep, bool pt_protect)
>  
>  	rmap_printk("spte %p %llx\n", sptep, *sptep);
>  
> -	if (pt_protect)
> -		spte &= ~shadow_mmu_writable_mask;
> -	spte = spte & ~PT_WRITABLE_MASK;
> -
> -	return mmu_spte_update(sptep, spte);
> +	if (pt_protect) {
> +		spte &= ~(shadow_mmu_writable_mask | PT_WRITABLE_MASK);
> +		return mmu_spte_update(sptep, spte);
> +	}
> +	return spte_test_and_clear_writable(sptep);

I think rather open code the logic instead of adding a helper.  Without even more
comments, it's not at all obvious when it's safe to use spte_test_and_clear_writable().
And although this patch begs the question of whether or not we should do a similar
thing for the TDP MMU's clear_dirty_gfn_range(), the TDP MMU and legacy MMU can't
really share a helper at this time because the TDP MMU only holds mmu_lock for read.

So minus all the comments that need to be added, I think this can just be:

---
 arch/x86/kvm/mmu/mmu.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index efe5a3dca1e0..caf5db7f1dce 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1174,11 +1174,16 @@ static bool spte_write_protect(u64 *sptep, bool pt_protect)

 	rmap_printk("spte %p %llx\n", sptep, *sptep);

-	if (pt_protect)
-		spte &= ~shadow_mmu_writable_mask;
-	spte = spte & ~PT_WRITABLE_MASK;
+	if (pt_protect) {
+		spte &= ~(shadow_mmu_writable_mask | PT_WRITABLE_MASK);
+		return mmu_spte_update(sptep, spte);
+	}

-	return mmu_spte_update(sptep, spte);
+	clear_bit(PT_WRITABLE_SHIFT, (unsigned long *)sptep);
+	if (!spte_ad_enabled(spte))
+		kvm_set_pfn_dirty(spte_to_pfn(spte));
+
+	return true;
 }

 static bool rmap_write_protect(struct kvm_rmap_head *rmap_head,

base-commit: 9fd329c0c3309a87821817ca553b449936e318a9
--
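
For reference, a sketch of the tail of spte_write_protect() with the comments
requested above folded in.  The code matches the suggested hunk; only the
comments are new, and their wording is a paraphrase of the points raised
earlier in this reply, not final text.

	if (pt_protect) {
		spte &= ~(shadow_mmu_writable_mask | PT_WRITABLE_MASK);
		return mmu_spte_update(sptep, spte);
	}

	/*
	 * mmu_lock is held for write, so nothing else can clear WRITABLE
	 * underneath us, and the writability check at the top of the
	 * function has already confirmed the bit is set.  A plain
	 * clear_bit() therefore suffices; test_and_clear_bit() is not
	 * needed.
	 */
	clear_bit(PT_WRITABLE_SHIFT, (unsigned long *)sptep);

	/*
	 * When A/D bits are disabled, the WRITABLE bit doubles as the
	 * dirty flag, so mark the pfn dirty here.  When A/D bits are in
	 * use, hardware's DIRTY bit is untouched by this path.
	 */
	if (!spte_ad_enabled(spte))
		kvm_set_pfn_dirty(spte_to_pfn(spte));

	return true;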


And then on the TDP MMU side, I _think_ we can do the following (again, minus
comments).  The logic being that if something else races and clears the bit or
zaps the SPTEs, there's no point in "retrying" because either the !PRESENT check
and/or the WRITABLE/DIRTY bit check will fail.

The only hiccup would be clearing a REMOVED_SPTE bit, but by sheer dumb luck, past
me chose a semi-arbitrary value for REMOVED_SPTE that doesn't use either dirty bit.

This would need to be done over two patches, e.g. to make the EPT definitions
available to the mmu.

Ben, David, any holes in my idea?

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 841feaa48be5..c28a0b9e2af1 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1600,6 +1600,8 @@ void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
        }
 }

+#define VMX_EPT_DIRTY_BIT BIT_ULL(9)
+
 /*
  * Clear the dirty status of all the SPTEs mapping GFNs in the memslot. If
  * AD bits are enabled, this will involve clearing the dirty bit on each SPTE.
@@ -1610,35 +1612,32 @@ void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
 static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
                           gfn_t start, gfn_t end)
 {
-       struct tdp_iter iter;
-       u64 new_spte;
+       const int dirty_shift = ilog2(shadow_dirty_mask);
        bool spte_set = false;
+       struct tdp_iter iter;
+       int bit_nr;
+
+       BUILD_BUG_ON(REMOVED_SPTE & PT_DIRTY_MASK);
+       BUILD_BUG_ON(REMOVED_SPTE & VMX_EPT_DIRTY_BIT);

        rcu_read_lock();

        tdp_root_for_each_leaf_pte(iter, root, start, end) {
-retry:
                if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
                        continue;

                if (!is_shadow_present_pte(iter.old_spte))
                        continue;

-               if (spte_ad_need_write_protect(iter.old_spte)) {
-                       if (is_writable_pte(iter.old_spte))
-                               new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
-                       else
-                               continue;
-               } else {
-                       if (iter.old_spte & shadow_dirty_mask)
-                               new_spte = iter.old_spte & ~shadow_dirty_mask;
-                       else
-                               continue;
-               }
+               if (spte_ad_need_write_protect(iter.old_spte))
+                       bit_nr = PT_WRITABLE_SHIFT;
+               else
+                       bit_nr = dirty_shift;

-               if (tdp_mmu_set_spte_atomic(kvm, &iter, new_spte))
-                       goto retry;
+               if (!test_and_clear_bit(bit_nr, (unsigned long *)iter.sptep))
+                       continue;

+               kvm_set_pfn_dirty(spte_to_pfn(iter.old_spte));
                spte_set = true;
        }
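
For completeness, the same loop tail with the missing comments sketched in
(the code is unchanged from the diff above; the comment wording paraphrases
the reasoning about races and REMOVED_SPTE and is illustrative only):

		if (spte_ad_need_write_protect(iter.old_spte))
			bit_nr = PT_WRITABLE_SHIFT;
		else
			bit_nr = dirty_shift;

		/*
		 * If something else races and zaps or write-protects the
		 * SPTE, the bit is already clear and the walk simply moves
		 * on; no retry is needed.  The BUILD_BUG_ONs above ensure
		 * REMOVED_SPTE has neither dirty bit set, so the atomic
		 * clear can't mangle a removed SPTE either.
		 */
		if (!test_and_clear_bit(bit_nr, (unsigned long *)iter.sptep))
			continue;

		kvm_set_pfn_dirty(spte_to_pfn(iter.old_spte));
		spte_set = true;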
David Matlack May 26, 2022, 5:18 p.m. UTC | #2
On Wed, May 25, 2022 at 08:07:05PM +0000, Sean Christopherson wrote:
> +Ben and David
> 
> On Wed, May 25, 2022, Venkatesh Srinivas wrote:
> > From: Junaid Shahid <junaids@google.com>
> > 
> > This change uses a lighter-weight function instead of mmu_spte_update()
> > in the common case in spte_write_protect(). This helps speed up the
> > get_dirty_log IOCTL.
> 
> When upstreaming our internal patches, please rewrite the changelogs to meet
> upstream standards.  If that requires a lot of effort, then by all means take
> credit with a Co-developed-by or even making yourself the sole author with a
> Suggested-by for the original author.
> 
> There is subtly a _lot_ going on in this patch, and practically none of it is
> explained.
> 
> > Performance: dirty_log_perf_test with 32 GB VM size
> >              Avg IOCTL time over 10 passes
> >              Haswell: ~0.23s vs ~0.4s
> >              IvyBridge: ~0.8s vs 1s
> > 
> > Signed-off-by: Venkatesh Srinivas <venkateshs@chromium.org>
> > Signed-off-by: Junaid Shahid <junaids@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 26 +++++++++++++++++++++-----
> >  1 file changed, 21 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index efe5a3dca1e0..a6db9dfaf7c3 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1151,6 +1151,22 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
> >  	}
> >  }
> >  
> > +static bool spte_test_and_clear_writable(u64 *sptep)
> > +{
> > +	u64 spte = *sptep;
> > +
> > +	if (spte & PT_WRITABLE_MASK) {
> 
> This is redundant, the caller has already verified that the spte is writable.
> 
> > +		clear_bit(PT_WRITABLE_SHIFT, (unsigned long *)sptep);
> 
> Is an unconditional atomic op faster than checking spte_has_volatile_bits()?  If
> so, that should be documented with a comment.
> 
> This also needs a comment explaining the mmu_lock is held for write and so
> WRITABLE can't be cleared, i.e. using clear_bit() instead of test_and_clear_bit()
> is acceptable.
> 
> > +		if (!spte_ad_enabled(spte))
> 
> This absolutely needs a comment explaining what's going on.  Specifically that if
> A/D bits are disabled, the WRITABLE bit doubles as the dirty flag, whereas hardware's
> DIRTY bit is untouched when A/D bits are in use.
> 
> > +			kvm_set_pfn_dirty(spte_to_pfn(spte));
> > +
> > +		return true;
> > +	}
> > +
> > +	return false;
> > +}
> > +
> >  /*
> >   * Write-protect on the specified @sptep, @pt_protect indicates whether
> >   * spte write-protection is caused by protecting shadow page table.
> > @@ -1174,11 +1190,11 @@ static bool spte_write_protect(u64 *sptep, bool pt_protect)
> >  
> >  	rmap_printk("spte %p %llx\n", sptep, *sptep);
> >  
> > -	if (pt_protect)
> > -		spte &= ~shadow_mmu_writable_mask;
> > -	spte = spte & ~PT_WRITABLE_MASK;
> > -
> > -	return mmu_spte_update(sptep, spte);
> > +	if (pt_protect) {
> > +		spte &= ~(shadow_mmu_writable_mask | PT_WRITABLE_MASK);
> > +		return mmu_spte_update(sptep, spte);
> > +	}
> > +	return spte_test_and_clear_writable(sptep);
> 
> I think rather open code the logic instead of adding a helper.  Without even more
> comments, it's not at all obvious when it's safe to use spte_test_and_clear_writable().
> And although this patch begs the question of whether or not we should do a similar
> thing for the TDP MMU's clear_dirty_gfn_range(), the TDP MMU and legacy MMU can't
> really share a helper at this time because the TDP MMU only holds mmu_lock for read.
> 
> So minus all the comments that need to be added, I think this can just be:
> 
> ---
>  arch/x86/kvm/mmu/mmu.c | 13 +++++++++----
>  1 file changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index efe5a3dca1e0..caf5db7f1dce 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1174,11 +1174,16 @@ static bool spte_write_protect(u64 *sptep, bool pt_protect)
> 
>  	rmap_printk("spte %p %llx\n", sptep, *sptep);
> 
> -	if (pt_protect)
> -		spte &= ~shadow_mmu_writable_mask;
> -	spte = spte & ~PT_WRITABLE_MASK;
> +	if (pt_protect) {
> +		spte &= ~(shadow_mmu_writable_mask | PT_WRITABLE_MASK);
> +		return mmu_spte_update(sptep, spte);
> +	}
> 
> -	return mmu_spte_update(sptep, spte);
> +	clear_bit(PT_WRITABLE_SHIFT, (unsigned long *)sptep);
> +	if (!spte_ad_enabled(spte))
> +		kvm_set_pfn_dirty(spte_to_pfn(spte));
> +
> +	return true;
>  }
> 
>  static bool rmap_write_protect(struct kvm_rmap_head *rmap_head,
> 
> base-commit: 9fd329c0c3309a87821817ca553b449936e318a9
> --
> 
> 
> And then on the TDP MMU side, I _think_ we can do the following (again, minus
> comments).  The logic being that if something else races and clears the bit or
> zaps the SPTEs, there's no point in "retrying" because either the !PRESENT check
> and/or the WRITABLE/DIRTY bit check will fail.
> 
> The only hiccup would be clearing a REMOVED_SPTE bit, but by sheer dumb luck, past
> me chose a semi-arbitrary value for REMOVED_SPTE that doesn't use either dirty bit.
> 
> This would need to be done over two patches, e.g. to make the EPT definitions
> available to the mmu.
> 
> Ben, David, any holes in my idea?

I think the idea will work. But I also wonder if we should just change
clear_dirty_gfn_range() to run under the write lock [1]. Then we could
share the code with the shadow MMU and avoid the REMOVED_SPTE
complexity.

[1] As a general rule, it seems like large (e.g. slot-wide)
single-threaded batch operations should run under the write lock.
They will be more efficient by avoiding atomic instructions, and can
still avoid lock contention via tdp_mmu_iter_cond_resched(). e.g. Ben
and I have been talking about changing eager page splitting to always
run under the write lock.
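
For a rough sense of shape, a sketch of clear_dirty_gfn_range() under the
write lock (loop body only, declarations as in the existing function; the
logic is the pre-patch logic from the quoted diff below, and the lockdep
assertion, shared == false resched, and the non-atomic setter, called
tdp_mmu_set_spte() here with its exact name/signature assumed, are the only
changes):

	lockdep_assert_held_write(&kvm->mmu_lock);

	rcu_read_lock();

	tdp_root_for_each_leaf_pte(iter, root, start, end) {
		/* shared == false: mmu_lock is held for write. */
		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, false))
			continue;

		if (!is_shadow_present_pte(iter.old_spte))
			continue;

		if (spte_ad_need_write_protect(iter.old_spte)) {
			if (!is_writable_pte(iter.old_spte))
				continue;
			new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
		} else {
			if (!(iter.old_spte & shadow_dirty_mask))
				continue;
			new_spte = iter.old_spte & ~shadow_dirty_mask;
		}

		/*
		 * No cmpxchg retry and no REMOVED_SPTE special case: nothing
		 * else can modify SPTEs while the write lock is held.
		 * Setter name/signature assumed for illustration.
		 */
		tdp_mmu_set_spte(kvm, &iter, new_spte);
		spte_set = true;
	}

	rcu_read_unlock();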

> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 841feaa48be5..c28a0b9e2af1 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1600,6 +1600,8 @@ void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
>         }
>  }
> 
> +#define VMX_EPT_DIRTY_BIT BIT_ULL(9)
> +
>  /*
>   * Clear the dirty status of all the SPTEs mapping GFNs in the memslot. If
>   * AD bits are enabled, this will involve clearing the dirty bit on each SPTE.
> @@ -1610,35 +1612,32 @@ void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
>  static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>                            gfn_t start, gfn_t end)
>  {
> -       struct tdp_iter iter;
> -       u64 new_spte;
> +       const int dirty_shift = ilog2(shadow_dirty_mask);
>         bool spte_set = false;
> +       struct tdp_iter iter;
> +       int bit_nr;
> +
> +       BUILD_BUG_ON(REMOVED_SPTE & PT_DIRTY_MASK);
> +       BUILD_BUG_ON(REMOVED_SPTE & VMX_EPT_DIRTY_BIT);
> 
>         rcu_read_lock();
> 
>         tdp_root_for_each_leaf_pte(iter, root, start, end) {
> -retry:
>                 if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
>                         continue;
> 
>                 if (!is_shadow_present_pte(iter.old_spte))
>                         continue;
> 
> -               if (spte_ad_need_write_protect(iter.old_spte)) {
> -                       if (is_writable_pte(iter.old_spte))
> -                               new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
> -                       else
> -                               continue;
> -               } else {
> -                       if (iter.old_spte & shadow_dirty_mask)
> -                               new_spte = iter.old_spte & ~shadow_dirty_mask;
> -                       else
> -                               continue;
> -               }
> +               if (spte_ad_need_write_protect(iter.old_spte))
> +                       bit_nr = PT_WRITABLE_SHIFT;
> +               else
> +                       bit_nr = dirty_shift;
> 
> -               if (tdp_mmu_set_spte_atomic(kvm, &iter, new_spte))
> -                       goto retry;
> +               if (!test_and_clear_bit(bit_nr, (unsigned long *)iter.sptep))
> +                       continue;
> 
> +               kvm_set_pfn_dirty(spte_to_pfn(iter.old_spte));
>                 spte_set = true;
>         }
>
Ben Gardon May 26, 2022, 5:33 p.m. UTC | #3
On Thu, May 26, 2022 at 10:18 AM David Matlack <dmatlack@google.com> wrote:
>
> On Wed, May 25, 2022 at 08:07:05PM +0000, Sean Christopherson wrote:
> > +Ben and David
> >
> > On Wed, May 25, 2022, Venkatesh Srinivas wrote:
> > > From: Junaid Shahid <junaids@google.com>
> > >
> > > This change uses a lighter-weight function instead of mmu_spte_update()
> > > in the common case in spte_write_protect(). This helps speed up the
> > > get_dirty_log IOCTL.
> >
> > When upstreaming our internal patches, please rewrite the changelogs to meet
> > upstream standards.  If that requires a lot of effort, then by all means take
> > credit with a Co-developed-by or even making yourself the sole author with a
> > Suggested-by for the original author.
> >
> > There is subtly a _lot_ going on in this patch, and practically none of it is
> > explained.
> >
> > > Performance: dirty_log_perf_test with 32 GB VM size
> > >              Avg IOCTL time over 10 passes
> > >              Haswell: ~0.23s vs ~0.4s
> > >              IvyBridge: ~0.8s vs 1s
> > >
> > > Signed-off-by: Venkatesh Srinivas <venkateshs@chromium.org>
> > > Signed-off-by: Junaid Shahid <junaids@google.com>
> > > ---
> > >  arch/x86/kvm/mmu/mmu.c | 26 +++++++++++++++++++++-----
> > >  1 file changed, 21 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index efe5a3dca1e0..a6db9dfaf7c3 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -1151,6 +1151,22 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
> > >     }
> > >  }
> > >
> > > +static bool spte_test_and_clear_writable(u64 *sptep)
> > > +{
> > > +   u64 spte = *sptep;
> > > +
> > > +   if (spte & PT_WRITABLE_MASK) {
> >
> > This is redundant, the caller has already verified that the spte is writable.
> >
> > > +           clear_bit(PT_WRITABLE_SHIFT, (unsigned long *)sptep);
> >
> > Is an unconditional atomic op faster than checking spte_has_volatile_bits()?  If
> > so, that should be documented with a comment.
> >
> > This also needs a comment explaining the mmu_lock is held for write and so
> > WRITABLE can't be cleared, i.e. using clear_bit() instead of test_and_clear_bit()
> > is acceptable.
> >
> > > +           if (!spte_ad_enabled(spte))
> >
> > This absolutely needs a comment explaining what's going on.  Specifically that if
> > A/D bits are disabled, the WRITABLE bit doubles as the dirty flag, whereas hardware's
> > DIRTY bit is untouched when A/D bits are in use.
> >
> > > +                   kvm_set_pfn_dirty(spte_to_pfn(spte));
> > > +
> > > +           return true;
> > > +   }
> > > +
> > > +   return false;
> > > +}
> > > +
> > >  /*
> > >   * Write-protect on the specified @sptep, @pt_protect indicates whether
> > >   * spte write-protection is caused by protecting shadow page table.
> > > @@ -1174,11 +1190,11 @@ static bool spte_write_protect(u64 *sptep, bool pt_protect)
> > >
> > >     rmap_printk("spte %p %llx\n", sptep, *sptep);
> > >
> > > -   if (pt_protect)
> > > -           spte &= ~shadow_mmu_writable_mask;
> > > -   spte = spte & ~PT_WRITABLE_MASK;
> > > -
> > > -   return mmu_spte_update(sptep, spte);
> > > +   if (pt_protect) {
> > > +           spte &= ~(shadow_mmu_writable_mask | PT_WRITABLE_MASK);
> > > +           return mmu_spte_update(sptep, spte);
> > > +   }
> > > +   return spte_test_and_clear_writable(sptep);
> >
> > I think rather open code the logic instead of adding a helper.  Without even more
> > comments, it's not at all obvious when it's safe to use spte_test_and_clear_writable().
> > And although this patch begs the question of whether or not we should do a similar
> > thing for the TDP MMU's clear_dirty_gfn_range(), the TDP MMU and legacy MMU can't
> > really share a helper at this time because the TDP MMU only holds mmu_lock for read.
> >
> > So minus all the comments that need to be added, I think this can just be:
> >
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 13 +++++++++----
> >  1 file changed, 9 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index efe5a3dca1e0..caf5db7f1dce 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1174,11 +1174,16 @@ static bool spte_write_protect(u64 *sptep, bool pt_protect)
> >
> >       rmap_printk("spte %p %llx\n", sptep, *sptep);
> >
> > -     if (pt_protect)
> > -             spte &= ~shadow_mmu_writable_mask;
> > -     spte = spte & ~PT_WRITABLE_MASK;
> > +     if (pt_protect) {
> > +             spte &= ~(shadow_mmu_writable_mask | PT_WRITABLE_MASK);
> > +             return mmu_spte_update(sptep, spte);
> > +     }
> >
> > -     return mmu_spte_update(sptep, spte);
> > +     clear_bit(PT_WRITABLE_SHIFT, (unsigned long *)sptep);
> > +     if (!spte_ad_enabled(spte))
> > +             kvm_set_pfn_dirty(spte_to_pfn(spte));
> > +
> > +     return true;
> >  }
> >
> >  static bool rmap_write_protect(struct kvm_rmap_head *rmap_head,
> >
> > base-commit: 9fd329c0c3309a87821817ca553b449936e318a9
> > --
> >
> >
> > And then on the TDP MMU side, I _think_ we can do the following (again, minus
> > comments).  The logic being that if something else races and clears the bit or
> > zaps the SPTEs, there's no point in "retrying" because either the !PRESENT check
> > and/or the WRITABLE/DIRTY bit check will fail.
> >
> > The only hiccup would be clearing a REMOVED_SPTE bit, but by sheer dumb luck, past
> > me chose a semi-arbitrary value for REMOVED_SPTE that doesn't use either dirty bit.
> >
> > This would need to be done over two patches, e.g. to make the EPT definitions
> > available to the mmu.
> >
> > Ben, David, any holes in my idea?
>
> I think the idea will work. But I also wonder if we should just change
> clear_dirty_gfn_range() to run under the write lock [1]. Then we could
> share the code with the shadow MMU and avoid the REMOVED_SPTE
> complexity.
>
> [1] As a general rule, it seems like large (e.g. slot-wide)
> single-threaded batch operations should run under the write lock.
> They will be more efficient by avoiding atomic instructions, and can
> still avoid lock contention via tdp_mmu_iter_cond_resched(). e.g. Ben
> and I have been talking about changing eager page splitting to always
> run under the write lock.

I agree we probably want slot-wide operations to run under the write
lock for speed (assuming the operation actually benefits; some ops might
not), but even better would be to not do slot-wide operations at all.
Since the only user clears the entire slot, it makes sense to do it under
the write lock.
With the separated get / clear dirty logging ioctls, it would be nice
to leave the clear operation under the read lock so userspace can
execute many clears in parallel.
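
As a userspace-side illustration of that last point (a sketch only: vm_fd,
slot, and the range layout are placeholders, and it assumes
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 has been enabled so KVM_CLEAR_DIRTY_LOG is
available), each thread can clear a disjoint sub-range of a slot:

#include <linux/kvm.h>
#include <sys/ioctl.h>

static int clear_dirty_range(int vm_fd, __u32 slot, __u64 first_page,
			     __u32 num_pages, void *bitmap)
{
	struct kvm_clear_dirty_log clear = {
		.slot = slot,
		.num_pages = num_pages,
		.first_page = first_page,
		.dirty_bitmap = bitmap,	/* set bits => re-protect these pages */
	};

	/* Threads issuing this on disjoint ranges can run concurrently. */
	return ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clear);
}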

>
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 841feaa48be5..c28a0b9e2af1 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1600,6 +1600,8 @@ void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
> >         }
> >  }
> >
> > +#define VMX_EPT_DIRTY_BIT BIT_ULL(9)
> > +
> >  /*
> >   * Clear the dirty status of all the SPTEs mapping GFNs in the memslot. If
> >   * AD bits are enabled, this will involve clearing the dirty bit on each SPTE.
> > @@ -1610,35 +1612,32 @@ void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
> >  static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> >                            gfn_t start, gfn_t end)
> >  {
> > -       struct tdp_iter iter;
> > -       u64 new_spte;
> > +       const int dirty_shift = ilog2(shadow_dirty_mask);
> >         bool spte_set = false;
> > +       struct tdp_iter iter;
> > +       int bit_nr;
> > +
> > +       BUILD_BUG_ON(REMOVED_SPTE & PT_DIRTY_MASK);
> > +       BUILD_BUG_ON(REMOVED_SPTE & VMX_EPT_DIRTY_BIT);
> >
> >         rcu_read_lock();
> >
> >         tdp_root_for_each_leaf_pte(iter, root, start, end) {
> > -retry:
> >                 if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
> >                         continue;
> >
> >                 if (!is_shadow_present_pte(iter.old_spte))
> >                         continue;
> >
> > -               if (spte_ad_need_write_protect(iter.old_spte)) {
> > -                       if (is_writable_pte(iter.old_spte))
> > -                               new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
> > -                       else
> > -                               continue;
> > -               } else {
> > -                       if (iter.old_spte & shadow_dirty_mask)
> > -                               new_spte = iter.old_spte & ~shadow_dirty_mask;
> > -                       else
> > -                               continue;
> > -               }
> > +               if (spte_ad_need_write_protect(iter.old_spte))
> > +                       bit_nr = PT_WRITABLE_SHIFT;
> > +               else
> > +                       bit_nr = dirty_shift;
> >
> > -               if (tdp_mmu_set_spte_atomic(kvm, &iter, new_spte))
> > -                       goto retry;
> > +               if (!test_and_clear_bit(bit_nr, (unsigned long *)iter.sptep))
> > +                       continue;
> >
> > +               kvm_set_pfn_dirty(spte_to_pfn(iter.old_spte));
> >                 spte_set = true;
> >         }
> >

Patch

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index efe5a3dca1e0..a6db9dfaf7c3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1151,6 +1151,22 @@  static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
 	}
 }
 
+static bool spte_test_and_clear_writable(u64 *sptep)
+{
+	u64 spte = *sptep;
+
+	if (spte & PT_WRITABLE_MASK) {
+		clear_bit(PT_WRITABLE_SHIFT, (unsigned long *)sptep);
+
+		if (!spte_ad_enabled(spte))
+			kvm_set_pfn_dirty(spte_to_pfn(spte));
+
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * Write-protect on the specified @sptep, @pt_protect indicates whether
  * spte write-protection is caused by protecting shadow page table.
@@ -1174,11 +1190,11 @@  static bool spte_write_protect(u64 *sptep, bool pt_protect)
 
 	rmap_printk("spte %p %llx\n", sptep, *sptep);
 
-	if (pt_protect)
-		spte &= ~shadow_mmu_writable_mask;
-	spte = spte & ~PT_WRITABLE_MASK;
-
-	return mmu_spte_update(sptep, spte);
+	if (pt_protect) {
+		spte &= ~(shadow_mmu_writable_mask | PT_WRITABLE_MASK);
+		return mmu_spte_update(sptep, spte);
+	}
+	return spte_test_and_clear_writable(sptep);
 }
 
 static bool rmap_write_protect(struct kvm_rmap_head *rmap_head,