
[v4,6/7] KVM: arm64: Relax locking for kvm_test_age_gfn and kvm_age_gfn

Message ID 20240529180510.2295118-7-jthoughton@google.com (mailing list archive)
State Handled Elsewhere
Series mm: multi-gen LRU: Walk secondary MMU page tables while aging

Checks

Context Check Description
conchuod/vmtest-for-next-PR fail PR summary
conchuod/patch-6-test-1 success .github/scripts/patches/tests/build_rv32_defconfig.sh
conchuod/patch-6-test-2 success .github/scripts/patches/tests/build_rv64_clang_allmodconfig.sh
conchuod/patch-6-test-3 success .github/scripts/patches/tests/build_rv64_gcc_allmodconfig.sh
conchuod/patch-6-test-4 success .github/scripts/patches/tests/build_rv64_nommu_k210_defconfig.sh
conchuod/patch-6-test-5 success .github/scripts/patches/tests/build_rv64_nommu_virt_defconfig.sh
conchuod/patch-6-test-6 success .github/scripts/patches/tests/checkpatch.sh
conchuod/patch-6-test-7 success .github/scripts/patches/tests/dtb_warn_rv64.sh
conchuod/patch-6-test-8 success .github/scripts/patches/tests/header_inline.sh
conchuod/patch-6-test-9 success .github/scripts/patches/tests/kdoc.sh
conchuod/patch-6-test-10 success .github/scripts/patches/tests/module_param.sh
conchuod/patch-6-test-11 success .github/scripts/patches/tests/verify_fixes.sh
conchuod/patch-6-test-12 success .github/scripts/patches/tests/verify_signedoff.sh

Commit Message

James Houghton May 29, 2024, 6:05 p.m. UTC
Replace the MMU write locks with read locks.

Grabbing the read lock instead of the write lock is safe because the
only requirement we have is that the stage-2 page tables do not get
deallocated while we are walking them. The stage2_age_walker() callback
is safe to race with itself; update the comment to reflect the
synchronization change.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 arch/arm64/kvm/hyp/pgtable.c | 9 ++++-----
 arch/arm64/kvm/mmu.c         | 8 ++++----
 2 files changed, 8 insertions(+), 9 deletions(-)

Comments

Oliver Upton May 31, 2024, 7:11 p.m. UTC | #1
On Wed, May 29, 2024 at 06:05:09PM +0000, James Houghton wrote:

[...]

> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 9e2bbee77491..eabb07c66a07 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -1319,10 +1319,8 @@ static int stage2_age_walker(const struct kvm_pgtable_visit_ctx *ctx,
>  	data->young = true;
>  
>  	/*
> -	 * stage2_age_walker() is always called while holding the MMU lock for
> -	 * write, so this will always succeed. Nonetheless, this deliberately
> -	 * follows the race detection pattern of the other stage-2 walkers in
> -	 * case the locking mechanics of the MMU notifiers is ever changed.
> +	 * This walk may not be exclusive; the PTE is permitted to change
> +	 * from under us.
>  	 */
>  	if (data->mkold && !stage2_try_set_pte(ctx, new))
>  		return -EAGAIN;

It is probably worth mentioning that if there was a race to update the
PTE then the GFN is most likely young, so failing to clear AF probably
isn't even consequential.
Oliver Upton May 31, 2024, 7:18 p.m. UTC | #2
On Fri, May 31, 2024 at 12:11:33PM -0700, Oliver Upton wrote:
> On Wed, May 29, 2024 at 06:05:09PM +0000, James Houghton wrote:
> 
> [...]
> 
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index 9e2bbee77491..eabb07c66a07 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -1319,10 +1319,8 @@ static int stage2_age_walker(const struct kvm_pgtable_visit_ctx *ctx,
> >  	data->young = true;
> >  
> >  	/*
> > -	 * stage2_age_walker() is always called while holding the MMU lock for
> > -	 * write, so this will always succeed. Nonetheless, this deliberately
> > -	 * follows the race detection pattern of the other stage-2 walkers in
> > -	 * case the locking mechanics of the MMU notifiers is ever changed.
> > +	 * This walk may not be exclusive; the PTE is permitted to change
> > +	 * from under us.
> >  	 */
> >  	if (data->mkold && !stage2_try_set_pte(ctx, new))
> >  		return -EAGAIN;
> 
> It is probably worth mentioning that if there was a race to update the
> PTE then the GFN is most likely young, so failing to clear AF probably
> isn't even consequential.

Oh, and the WARN_ON() in kvm_pgtable_stage2_test_clear_young() is bogus
now. Maybe demote it to:

  r = kvm_pgtable_walk(...);
  WARN_ON_ONCE(r && r != -EAGAIN);
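
[Editor's note: combined with the shared-walk change in this patch, the demoted
warning would look roughly like the sketch below. This is only an illustration:
the function body is reconstructed from the hunk in this patch plus the
suggestion above, and the local variable `r` and surrounding lines are filled
in from context, so they may not match the tree exactly.]

  bool kvm_pgtable_stage2_test_clear_young(struct kvm_pgtable *pgt, u64 addr,
  					 u64 size, bool mkold)
  {
  	struct stage2_age_data data = {
  		.mkold		= mkold,
  	};
  	struct kvm_pgtable_walker walker = {
  		.cb		= stage2_age_walker,
  		.arg		= &data,
  		.flags		= KVM_PGTABLE_WALK_LEAF |
  				  KVM_PGTABLE_WALK_SHARED,
  	};
  	int r;
  
  	/* -EAGAIN now just means the walker lost a race on a PTE. */
  	r = kvm_pgtable_walk(pgt, addr, size, &walker);
  	WARN_ON_ONCE(r && r != -EAGAIN);
  
  	return data.young;
  }
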
James Houghton June 4, 2024, 10:20 p.m. UTC | #3
On Fri, May 31, 2024 at 12:18 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> On Fri, May 31, 2024 at 12:11:33PM -0700, Oliver Upton wrote:
> > On Wed, May 29, 2024 at 06:05:09PM +0000, James Houghton wrote:
> >
> > [...]
> >
> > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > index 9e2bbee77491..eabb07c66a07 100644
> > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > @@ -1319,10 +1319,8 @@ static int stage2_age_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > >     data->young = true;
> > >
> > >     /*
> > > -    * stage2_age_walker() is always called while holding the MMU lock for
> > > -    * write, so this will always succeed. Nonetheless, this deliberately
> > > -    * follows the race detection pattern of the other stage-2 walkers in
> > > -    * case the locking mechanics of the MMU notifiers is ever changed.
> > > +    * This walk may not be exclusive; the PTE is permitted to change
> > > +    * from under us.
> > >      */
> > >     if (data->mkold && !stage2_try_set_pte(ctx, new))
> > >             return -EAGAIN;
> >
> > It is probably worth mentioning that if there was a race to update the
> > PTE then the GFN is most likely young, so failing to clear AF probably
> > isn't even consequential.

Thanks Oliver.

>
> Oh, and the WARN_ON() in kvm_pgtable_stage2_test_clear_young() is bogus
> now. Maybe demote it to:
>
>   r = kvm_pgtable_walk(...);
>   WARN_ON_ONCE(r && r != -EAGAIN);

Oh, indeed, thank you. Just to make sure -- does it make sense to
retry the cmpxchg if it fails? For example, the way I have it now for
x86[1], we retry the cmpxchg if the spte is still a leaf, otherwise we
move on to the next one having done nothing. Does something like that
make sense for arm64?

[1]: https://lore.kernel.org/linux-mm/20240529180510.2295118-6-jthoughton@google.com/
Oliver Upton June 4, 2024, 11 p.m. UTC | #4
On Tue, Jun 04, 2024 at 03:20:20PM -0700, James Houghton wrote:
> On Fri, May 31, 2024 at 12:18 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> >
> > On Fri, May 31, 2024 at 12:11:33PM -0700, Oliver Upton wrote:
> > > On Wed, May 29, 2024 at 06:05:09PM +0000, James Houghton wrote:
> > >
> > > [...]
> > >
> > > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > > index 9e2bbee77491..eabb07c66a07 100644
> > > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > > @@ -1319,10 +1319,8 @@ static int stage2_age_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > >     data->young = true;
> > > >
> > > >     /*
> > > > -    * stage2_age_walker() is always called while holding the MMU lock for
> > > > -    * write, so this will always succeed. Nonetheless, this deliberately
> > > > -    * follows the race detection pattern of the other stage-2 walkers in
> > > > -    * case the locking mechanics of the MMU notifiers is ever changed.
> > > > +    * This walk may not be exclusive; the PTE is permitted to change
> > > > +    * from under us.
> > > >      */
> > > >     if (data->mkold && !stage2_try_set_pte(ctx, new))
> > > >             return -EAGAIN;
> > >
> > > It is probably worth mentioning that if there was a race to update the
> > > PTE then the GFN is most likely young, so failing to clear AF probably
> > > isn't even consequential.
> 
> Thanks Oliver.
> 
> >
> > Oh, and the WARN_ON() in kvm_pgtable_stage2_test_clear_young() is bogus
> > now. Maybe demote it to:
> >
> >   r = kvm_pgtable_walk(...);
> >   WARN_ON_ONCE(r && r != -EAGAIN);
> 
> Oh, indeed, thank you. Just to make sure -- does it make sense to
> retry the cmpxchg if it fails? For example, the way I have it now for
> x86[1], we retry the cmpxchg if the spte is still a leaf, otherwise we
> move on to the next one having done nothing. Does something like that
> make sense for arm64?

At least for arm64 I do not see a need for retry. The only possible
races are:

 - A stage-2 fault handler establishing / adjusting the mapping for the
   GFN. If the guest is directly accessing the GFN in question, what's
   the point of wiping out AF?

   Even when returning -EAGAIN we've already primed stage2_age_data::young,
   so we report the correct state back to the primary MMU.

 - Another kvm_age_gfn() trying to age the same GFN. I haven't even
   looked to see if this is possible from the primary MMU POV, but in
   theory one of the calls will win the race and clear AF.

Given Yu's concerns about making pending writers wait, we should take
every opportunity to bail on the walk.
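
[Editor's note: the ordering Oliver relies on is visible in the walker callback
this patch touches: `young` is recorded before the cmpxchg, so losing the race
and returning -EAGAIN still reports the GFN as young to the primary MMU. A
rough reconstruction of the callback follows; the lines above the quoted hunk
are filled in from context and are not guaranteed to match the file exactly.]

  static int stage2_age_walker(const struct kvm_pgtable_visit_ctx *ctx,
  			     enum kvm_pgtable_walk_flags visit)
  {
  	kvm_pte_t new = ctx->old & ~KVM_PTE_LEAF_ATTR_LO_S2_AF;
  	struct stage2_age_data *data = ctx->arg;
  
  	if (!kvm_pte_valid(ctx->old) || new == ctx->old)
  		return 0;
  
  	/* Set before the cmpxchg, so a lost race still reports young. */
  	data->young = true;
  
  	/*
  	 * This walk may not be exclusive; the PTE is permitted to change
  	 * from under us.
  	 */
  	if (data->mkold && !stage2_try_set_pte(ctx, new))
  		return -EAGAIN;
  
  	return 0;
  }
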
Sean Christopherson June 4, 2024, 11:36 p.m. UTC | #5
On Tue, Jun 04, 2024, Oliver Upton wrote:
> On Tue, Jun 04, 2024 at 03:20:20PM -0700, James Houghton wrote:
> > On Fri, May 31, 2024 at 12:18 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > >
> > > On Fri, May 31, 2024 at 12:11:33PM -0700, Oliver Upton wrote:
> > > > On Wed, May 29, 2024 at 06:05:09PM +0000, James Houghton wrote:
> > > Oh, and the WARN_ON() in kvm_pgtable_stage2_test_clear_young() is bogus
> > > now. Maybe demote it to:
> > >
> > >   r = kvm_pgtable_walk(...);
> > >   WARN_ON_ONCE(r && r != -EAGAIN);
> > 
> > Oh, indeed, thank you. Just to make sure -- does it make sense to
> > retry the cmpxchg if it fails? For example, the way I have it now for
> > x86[1], we retry the cmpxchg if the spte is still a leaf, otherwise we
> > move on to the next one having done nothing. Does something like that
> > make sense for arm64?
> 
> At least for arm64 I do not see a need for retry. The only possible
> races are:
> 
>  - A stage-2 fault handler establishing / adjusting the mapping for the
>    GFN. If the guest is directly accessing the GFN in question, what's
>    the point of wiping out AF?
> 
>    Even when returning -EAGAIN we've already primed stage2_age_data::young,
>    so we report the correct state back to the primary MMU.
> 
>  - Another kvm_age_gfn() trying to age the same GFN. I haven't even
>    looked to see if this is possible from the primary MMU POV, but in
>    theory one of the calls will win the race and clear AF.
> 
> Given Yu's concerns about making pending writers wait, we should take
> every opportunity to bail on the walk.

+1.  The x86 path that retries is, for all intents and purposes, limited to Intel
CPUs that don't support EPT A/D bits, i.e. to pre-HSW CPUs.  I wouldn't make any
decisions based on that code.

Patch

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 9e2bbee77491..eabb07c66a07 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1319,10 +1319,8 @@  static int stage2_age_walker(const struct kvm_pgtable_visit_ctx *ctx,
 	data->young = true;
 
 	/*
-	 * stage2_age_walker() is always called while holding the MMU lock for
-	 * write, so this will always succeed. Nonetheless, this deliberately
-	 * follows the race detection pattern of the other stage-2 walkers in
-	 * case the locking mechanics of the MMU notifiers is ever changed.
+	 * This walk may not be exclusive; the PTE is permitted to change
+	 * from under us.
 	 */
 	if (data->mkold && !stage2_try_set_pte(ctx, new))
 		return -EAGAIN;
@@ -1345,7 +1343,8 @@  bool kvm_pgtable_stage2_test_clear_young(struct kvm_pgtable *pgt, u64 addr,
 	struct kvm_pgtable_walker walker = {
 		.cb		= stage2_age_walker,
 		.arg		= &data,
-		.flags		= KVM_PGTABLE_WALK_LEAF,
+		.flags		= KVM_PGTABLE_WALK_LEAF |
+				  KVM_PGTABLE_WALK_SHARED,
 	};
 
 	WARN_ON(kvm_pgtable_walk(pgt, addr, size, &walker));
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 8337009dde77..40e7427462a7 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1775,7 +1775,7 @@  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	u64 size = (range->end - range->start) << PAGE_SHIFT;
 	bool young = false;
 
-	write_lock(&kvm->mmu_lock);
+	read_lock(&kvm->mmu_lock);
 
 	if (!kvm->arch.mmu.pgt)
 		goto out;
@@ -1785,7 +1785,7 @@  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 						    size, true);
 
 out:
-	write_unlock(&kvm->mmu_lock);
+	read_unlock(&kvm->mmu_lock);
 	return young;
 }
 
@@ -1794,7 +1794,7 @@  bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	u64 size = (range->end - range->start) << PAGE_SHIFT;
 	bool young = false;
 
-	write_lock(&kvm->mmu_lock);
+	read_lock(&kvm->mmu_lock);
 
 	if (!kvm->arch.mmu.pgt)
 		goto out;
@@ -1804,7 +1804,7 @@  bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 						    size, false);
 
 out:
-	write_unlock(&kvm->mmu_lock);
+	read_unlock(&kvm->mmu_lock);
 	return young;
 }