[5/6] KVM: x86/mmu: Protect kvm->memslots with a mutex

Message ID	20210427223635.2711774-6-bgardon@google.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <kvm-owner@kernel.org> Date: Tue, 27 Apr 2021 15:36:34 -0700 In-Reply-To: <20210427223635.2711774-1-bgardon@google.com> Message-Id: <20210427223635.2711774-6-bgardon@google.com> Mime-Version: 1.0 References: <20210427223635.2711774-1-bgardon@google.com> Subject: [PATCH 5/6] KVM: x86/mmu: Protect kvm->memslots with a mutex From: Ben Gardon <bgardon@google.com> To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: Paolo Bonzini <pbonzini@redhat.com>, Peter Xu <peterx@redhat.com>, Sean Christopherson <seanjc@google.com>, Peter Shier <pshier@google.com>, Junaid Shahid <junaids@google.com>, Jim Mattson <jmattson@google.com>, Yulei Zhang <yulei.kernel@gmail.com>, Wanpeng Li <kernellwp@gmail.com>, Vitaly Kuznetsov <vkuznets@redhat.com>, Xiao Guangrong <xiaoguangrong.eric@gmail.com>, Ben Gardon <bgardon@google.com> Content-Type: text/plain; charset="UTF-8" Precedence: bulk
Series	Lazily allocate memslot rmaps \| expand [0/6] Lazily allocate memslot rmaps [1/6] KVM: x86/mmu: Track if shadow MMU active [2/6] KVM: x86/mmu: Skip rmap operations if shadow MMU inactive [3/6] KVM: x86/mmu: Deduplicate rmap freeing in allocate_memslot_rmap [4/6] KVM: x86/mmu: Factor out allocating memslot rmap [5/6] KVM: x86/mmu: Protect kvm->memslots with a mutex [6/6] KVM: x86/mmu: Lazily allocate memslot rmaps

Ben Gardon April 27, 2021, 10:36 p.m. UTC

Adds a lock around memslots changes. Currently this lock does not have
any effect on the syncronization model, but it will be used in a future
commit to facilitate lazy rmap allocation.

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/include/asm/kvm_host.h |  5 +++++
 arch/x86/kvm/x86.c              | 11 +++++++++++
 include/linux/kvm_host.h        |  2 ++
 virt/kvm/kvm_main.c             |  9 ++++++++-
 4 files changed, 26 insertions(+), 1 deletion(-)

Paolo Bonzini April 28, 2021, 6:25 a.m. UTC | #1

On 28/04/21 00:36, Ben Gardon wrote:
> +void kvm_arch_assign_memslots(struct kvm *kvm, int as_id,
> +			     struct kvm_memslots *slots)
> +{
> +	mutex_lock(&kvm->arch.memslot_assignment_lock);
> +	rcu_assign_pointer(kvm->memslots[as_id], slots);
> +	mutex_unlock(&kvm->arch.memslot_assignment_lock);
> +}

Does the assignment also needs the lock, or only the rmap allocation?  I 
would prefer the hook to be something like kvm_arch_setup_new_memslots.

(Also it is useful to have a comment somewhere explaining why the 
slots_lock does not work.  IIUC there would be a deadlock because you'd 
be taking the slots_lock inside an SRCU critical region, while usually 
the slots_lock critical section is the one that includes a 
synchronize_srcu; I should dig that up and document that ordering in 
Documentation/virt/kvm too).

Paolo

Ben Gardon April 28, 2021, 4:40 p.m. UTC | #2

On Tue, Apr 27, 2021 at 11:25 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 28/04/21 00:36, Ben Gardon wrote:
> > +void kvm_arch_assign_memslots(struct kvm *kvm, int as_id,
> > +                          struct kvm_memslots *slots)
> > +{
> > +     mutex_lock(&kvm->arch.memslot_assignment_lock);
> > +     rcu_assign_pointer(kvm->memslots[as_id], slots);
> > +     mutex_unlock(&kvm->arch.memslot_assignment_lock);
> > +}
>
> Does the assignment also needs the lock, or only the rmap allocation?  I
> would prefer the hook to be something like kvm_arch_setup_new_memslots.

The assignment does need to be under the lock to prevent the following race:
1. Thread 1 (installing a new memslot): Acquires memslot assignment
lock (or perhaps in this case rmap_allocation_lock would be more apt.)
2. Thread 1: Check alloc_memslot_rmaps (it is false)
3. Thread 1: doesn't allocate memslot rmaps for new slot
4. Thread 1: Releases memslot assignment lock
5. Thread 2 (allocating a shadow root): Acquires memslot assignment lock
6. Thread 2: Sets alloc_memslot_rmaps = true
7. Thread 2: Allocates rmaps for all existing slots
8. Thread 2: Releases memslot assignment lock
9. Thread 2: Sets shadow_mmu_active = true
10. Thread 1: Installs the new memslots
11. Thread 3: Null pointer dereference when trying to access rmaps on
the new slot.

Putting the assignment under the lock prevents 5-8 from happening
between 2 and 10.

I'm open to other ideas as far as how to prevent this race though. I
admit this solution is not the most elegant looking.

>
> (Also it is useful to have a comment somewhere explaining why the
> slots_lock does not work.  IIUC there would be a deadlock because you'd
> be taking the slots_lock inside an SRCU critical region, while usually
> the slots_lock critical section is the one that includes a
> synchronize_srcu; I should dig that up and document that ordering in
> Documentation/virt/kvm too).

Yeah, sorry about that. I should have added a comment to that effect.
As you suspected, it's because of the slots lock / SRCU deadlock.
Using the slots lock was my original implementation, until the
deadlock issue came up.

I can add comments about that in a v2.

>
> Paolo
>

Paolo Bonzini April 28, 2021, 5:46 p.m. UTC | #3

On 28/04/21 18:40, Ben Gardon wrote:
> On Tue, Apr 27, 2021 at 11:25 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>> On 28/04/21 00:36, Ben Gardon wrote:
>>> +void kvm_arch_assign_memslots(struct kvm *kvm, int as_id,
>>> +                          struct kvm_memslots *slots)
>>> +{
>>> +     mutex_lock(&kvm->arch.memslot_assignment_lock);
>>> +     rcu_assign_pointer(kvm->memslots[as_id], slots);
>>> +     mutex_unlock(&kvm->arch.memslot_assignment_lock);
>>> +}
>>
>> Does the assignment also needs the lock, or only the rmap allocation?  I
>> would prefer the hook to be something like kvm_arch_setup_new_memslots.
> 
> The assignment does need to be under the lock to prevent the following race:
> 1. Thread 1 (installing a new memslot): Acquires memslot assignment
> lock (or perhaps in this case rmap_allocation_lock would be more apt.)
> 2. Thread 1: Check alloc_memslot_rmaps (it is false)
> 3. Thread 1: doesn't allocate memslot rmaps for new slot
> 4. Thread 1: Releases memslot assignment lock
> 5. Thread 2 (allocating a shadow root): Acquires memslot assignment lock
> 6. Thread 2: Sets alloc_memslot_rmaps = true
> 7. Thread 2: Allocates rmaps for all existing slots
> 8. Thread 2: Releases memslot assignment lock
> 9. Thread 2: Sets shadow_mmu_active = true
> 10. Thread 1: Installs the new memslots
> 11. Thread 3: Null pointer dereference when trying to access rmaps on
> the new slot.

... because thread 3 would be under mmu_lock and therefore cannot 
allocate the rmap itself (you have to do it in mmu_alloc_shadow_roots, 
as in patch 6).

Related to this, your solution does not have to protect kvm_dup_memslots 
with the new lock, because the first update of the memslots will not go 
through kvm_arch_prepare_memory_region but it _will_ go through 
install_new_memslots and therefore through the new hook.  But overall I 
think I'd prefer to have a kvm->slots_arch_lock mutex in generic code, 
and place the call to kvm_dup_memslots and 
kvm_arch_prepare_memory_region inside that mutex.

That makes the new lock decently intuitive, and easily documented as 
"Architecture code can use slots_arch_lock if the contents of struct 
kvm_arch_memory_slot needs to be written outside 
kvm_arch_prepare_memory_region.  Unlike slots_lock, slots_arch_lock can 
be taken inside a ``kvm->srcu`` read-side critical section".

I admit I haven't thought about it very thoroughly, but if something 
like this is enough, it is relatively pretty:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9b8e30dd5b9b..6e5106365597 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1333,6 +1333,7 @@ static struct kvm_memslots 
*install_new_memslots(struct kvm *kvm,

  	rcu_assign_pointer(kvm->memslots[as_id], slots);

+	mutex_unlock(&kvm->slots_arch_lock);
  	synchronize_srcu_expedited(&kvm->srcu);

  	/*
@@ -1399,6 +1398,7 @@ static int kvm_set_memslot(struct kvm *kvm,
  	struct kvm_memslots *slots;
  	int r;

+	mutex_lock(&kvm->slots_arch_lock);
  	slots = kvm_dup_memslots(__kvm_memslots(kvm, as_id), change);
  	if (!slots)
  		return -ENOMEM;
@@ -1427,6 +1427,7 @@ static int kvm_set_memslot(struct kvm *kvm,
  		 *	- kvm_is_visible_gfn (mmu_check_root)
  		 */
  		kvm_arch_flush_shadow_memslot(kvm, slot);
+		mutex_lock(&kvm->slots_arch_lock);
  	}

  	r = kvm_arch_prepare_memory_region(kvm, new, mem, change);

It does make the critical section a bit larger, so that the 
initialization of the shadow page (which is in KVM_RUN context) contends 
with slightly more code than necessary.  However it's all but a 
performance critical situation, as it will only happen just once per VM.

WDYT?

Paolo

> Putting the assignment under the lock prevents 5-8 from happening
> between 2 and 10.
> 
> I'm open to other ideas as far as how to prevent this race though. I
> admit this solution is not the most elegant looking.

Ben Gardon April 28, 2021, 8:40 p.m. UTC | #4

On Wed, Apr 28, 2021 at 10:46 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 28/04/21 18:40, Ben Gardon wrote:
> > On Tue, Apr 27, 2021 at 11:25 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
> >>
> >> On 28/04/21 00:36, Ben Gardon wrote:
> >>> +void kvm_arch_assign_memslots(struct kvm *kvm, int as_id,
> >>> +                          struct kvm_memslots *slots)
> >>> +{
> >>> +     mutex_lock(&kvm->arch.memslot_assignment_lock);
> >>> +     rcu_assign_pointer(kvm->memslots[as_id], slots);
> >>> +     mutex_unlock(&kvm->arch.memslot_assignment_lock);
> >>> +}
> >>
> >> Does the assignment also needs the lock, or only the rmap allocation?  I
> >> would prefer the hook to be something like kvm_arch_setup_new_memslots.
> >
> > The assignment does need to be under the lock to prevent the following race:
> > 1. Thread 1 (installing a new memslot): Acquires memslot assignment
> > lock (or perhaps in this case rmap_allocation_lock would be more apt.)
> > 2. Thread 1: Check alloc_memslot_rmaps (it is false)
> > 3. Thread 1: doesn't allocate memslot rmaps for new slot
> > 4. Thread 1: Releases memslot assignment lock
> > 5. Thread 2 (allocating a shadow root): Acquires memslot assignment lock
> > 6. Thread 2: Sets alloc_memslot_rmaps = true
> > 7. Thread 2: Allocates rmaps for all existing slots
> > 8. Thread 2: Releases memslot assignment lock
> > 9. Thread 2: Sets shadow_mmu_active = true
> > 10. Thread 1: Installs the new memslots
> > 11. Thread 3: Null pointer dereference when trying to access rmaps on
> > the new slot.
>
> ... because thread 3 would be under mmu_lock and therefore cannot
> allocate the rmap itself (you have to do it in mmu_alloc_shadow_roots,
> as in patch 6).
>
> Related to this, your solution does not have to protect kvm_dup_memslots
> with the new lock, because the first update of the memslots will not go
> through kvm_arch_prepare_memory_region but it _will_ go through
> install_new_memslots and therefore through the new hook.  But overall I
> think I'd prefer to have a kvm->slots_arch_lock mutex in generic code,
> and place the call to kvm_dup_memslots and
> kvm_arch_prepare_memory_region inside that mutex.

That makes sense, and I think it also avoids a bug in this series.
Currently, if the rmaps are allocated between kvm_dup_memslots and and
install_new_memslots, we run into a problem where the copied slots
will have new rmaps allocated for them before installation.
Potentially creating all sorts of problems. I had fixed that issue in
the past by allocating the array of per-level rmaps at memslot
creation, seperate from the memslot. That meant that the copy would
get the newly allocated rmaps as well, but I thought that was obsolete
after some memslot refactors went in.

I hoped we could get away without that change, but I was probably
wrong. However, with this enlarged critical section, that should not
be an issue for creating memslots since kvm_dup_memslots will either
copy memslots with the rmaps already allocated, or the whole thing
will happen before the rmaps are allocated.

... However with the locking you propose below, we might still run
into issues on a move or delete, which would mean we'd still need the
separate memory allocation for the rmaps array. Or we do some
shenanigans where we try to copy the rmap pointers from the other set
of memslots.

I can put together a v2 with the seperate rmap memory and more generic
locking and see how that looks.

>
> That makes the new lock decently intuitive, and easily documented as
> "Architecture code can use slots_arch_lock if the contents of struct
> kvm_arch_memory_slot needs to be written outside
> kvm_arch_prepare_memory_region.  Unlike slots_lock, slots_arch_lock can
> be taken inside a ``kvm->srcu`` read-side critical section".
>
> I admit I haven't thought about it very thoroughly, but if something
> like this is enough, it is relatively pretty:
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 9b8e30dd5b9b..6e5106365597 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1333,6 +1333,7 @@ static struct kvm_memslots
> *install_new_memslots(struct kvm *kvm,
>
>         rcu_assign_pointer(kvm->memslots[as_id], slots);
>
> +       mutex_unlock(&kvm->slots_arch_lock);
>         synchronize_srcu_expedited(&kvm->srcu);
>
>         /*
> @@ -1399,6 +1398,7 @@ static int kvm_set_memslot(struct kvm *kvm,
>         struct kvm_memslots *slots;
>         int r;
>
> +       mutex_lock(&kvm->slots_arch_lock);
>         slots = kvm_dup_memslots(__kvm_memslots(kvm, as_id), change);
>         if (!slots)
>                 return -ENOMEM;
> @@ -1427,6 +1427,7 @@ static int kvm_set_memslot(struct kvm *kvm,
>                  *      - kvm_is_visible_gfn (mmu_check_root)
>                  */
>                 kvm_arch_flush_shadow_memslot(kvm, slot);
> +               mutex_lock(&kvm->slots_arch_lock);
>         }
>
>         r = kvm_arch_prepare_memory_region(kvm, new, mem, change);
>
> It does make the critical section a bit larger, so that the
> initialization of the shadow page (which is in KVM_RUN context) contends
> with slightly more code than necessary.  However it's all but a
> performance critical situation, as it will only happen just once per VM.

I agree performance is not a huge concern here. Excluding
kvm_arch_flush_shadow_memslot from the critical section also helps a
lot because that's where most of the work could be if we're deleting /
moving a slot.
My only worry is the latency this could add to a nested VM launch, but
it seems pretty unlikely that that would be frequently coinciding with
a memslot change in practice.

>
> WDYT?
>
> Paolo
>
> > Putting the assignment under the lock prevents 5-8 from happening
> > between 2 and 10.
> >
> > I'm open to other ideas as far as how to prevent this race though. I
> > admit this solution is not the most elegant looking.
>
>

Paolo Bonzini April 28, 2021, 9:41 p.m. UTC | #5

On 28/04/21 22:40, Ben Gardon wrote:
> ... However with the locking you propose below, we might still run
> into issues on a move or delete, which would mean we'd still need the
> separate memory allocation for the rmaps array. Or we do some
> shenanigans where we try to copy the rmap pointers from the other set
> of memslots.

If that's (almost) as easy as passing old to 
kvm_arch_prepare_memory_region, that would be totally okay.

> My only worry is the latency this could add to a nested VM launch, but
> it seems pretty unlikely that that would be frequently coinciding with
> a memslot change in practice.

Right, memslot changes in practice occur only at boot and on hotplug. 
If that was a problem we could always make the allocation state 
off/in-progress/on, allowing to check the allocation state out of the 
lock.  This would only potentially slow down the first nested VM launch.

Paolo

Ben Gardon April 28, 2021, 9:46 p.m. UTC | #6

On Wed, Apr 28, 2021 at 2:41 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 28/04/21 22:40, Ben Gardon wrote:
> > ... However with the locking you propose below, we might still run
> > into issues on a move or delete, which would mean we'd still need the
> > separate memory allocation for the rmaps array. Or we do some
> > shenanigans where we try to copy the rmap pointers from the other set
> > of memslots.
>
> If that's (almost) as easy as passing old to
> kvm_arch_prepare_memory_region, that would be totally okay.

Unfortunately it's not quite that easy because it's all the slots
_besides_ the one being modified where we'd need to copy the rmaps.

>
> > My only worry is the latency this could add to a nested VM launch, but
> > it seems pretty unlikely that that would be frequently coinciding with
> > a memslot change in practice.
>
> Right, memslot changes in practice occur only at boot and on hotplug.
> If that was a problem we could always make the allocation state
> off/in-progress/on, allowing to check the allocation state out of the
> lock.  This would only potentially slow down the first nested VM launch.
>
> Paolo
>

Paolo Bonzini April 28, 2021, 11:42 p.m. UTC | #7

On 28/04/21 23:46, Ben Gardon wrote:
> On Wed, Apr 28, 2021 at 2:41 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>> On 28/04/21 22:40, Ben Gardon wrote:
>>> ... However with the locking you propose below, we might still run
>>> into issues on a move or delete, which would mean we'd still need the
>>> separate memory allocation for the rmaps array. Or we do some
>>> shenanigans where we try to copy the rmap pointers from the other set
>>> of memslots.
>>
>> If that's (almost) as easy as passing old to
>> kvm_arch_prepare_memory_region, that would be totally okay.
> 
> Unfortunately it's not quite that easy because it's all the slots
> _besides_ the one being modified where we'd need to copy the rmaps.

Ah, now I understand the whole race.  And it seems to me that if one
kvm_dup_memslots within the new lock fixed a bug, two kvm_dup_memslots
within the new lock are going to fix two bugs. :)

Seriously: unless I'm missing another case (it's late here...), it's
not ugly and it's still relatively easy to explain.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2799c6660cce..48929dd5fb29 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1270,7 +1270,7 @@ static int check_memory_region_flags(const struct kvm_userspace_memory_region *m
  	return 0;
  }
  
-static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
+static void install_new_memslots(struct kvm *kvm,
  		int as_id, struct kvm_memslots *slots)
  {
  	struct kvm_memslots *old_memslots = __kvm_memslots(kvm, as_id);
@@ -1280,7 +1280,9 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
  	slots->generation = gen | KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS;
  
  	rcu_assign_pointer(kvm->memslots[as_id], slots);
+	mutex_unlock(&kvm->slots_arch_lock);
  	synchronize_srcu_expedited(&kvm->srcu);
+	kvfree(old_memslots);
  
  	/*
  	 * Increment the new memslot generation a second time, dropping the
@@ -1302,8 +1304,6 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
  	kvm_arch_memslots_updated(kvm, gen);
  
  	slots->generation = gen;
-
-	return old_memslots;
  }
  
  /*
@@ -1342,6 +1342,7 @@ static int kvm_set_memslot(struct kvm *kvm,
  	struct kvm_memslots *slots;
  	int r;
  
+	mutex_lock(&kvm->slots_arch_lock);
  	slots = kvm_dup_memslots(__kvm_memslots(kvm, as_id), change);
  	if (!slots)
  		return -ENOMEM;
@@ -1353,14 +1354,7 @@ static int kvm_set_memslot(struct kvm *kvm,
  		 */
  		slot = id_to_memslot(slots, old->id);
  		slot->flags |= KVM_MEMSLOT_INVALID;
-
-		/*
-		 * We can re-use the old memslots, the only difference from the
-		 * newly installed memslots is the invalid flag, which will get
-		 * dropped by update_memslots anyway.  We'll also revert to the
-		 * old memslots if preparing the new memory region fails.
-		 */
-		slots = install_new_memslots(kvm, as_id, slots);
+		install_new_memslots(kvm, as_id, slots);
  
  		/* From this point no new shadow pages pointing to a deleted,
  		 * or moved, memslot will be created.
@@ -1370,6 +1364,9 @@ static int kvm_set_memslot(struct kvm *kvm,
  		 *	- kvm_is_visible_gfn (mmu_check_root)
  		 */
  		kvm_arch_flush_shadow_memslot(kvm, slot);
+
+		mutex_lock(&kvm->slots_arch_lock);
+		slots = kvm_dup_memslots(__kvm_memslots(kvm, as_id), change);
  	}
  
  	r = kvm_arch_prepare_memory_region(kvm, new, mem, change);
@@ -1377,16 +1374,17 @@ static int kvm_set_memslot(struct kvm *kvm,
  		goto out_slots;
  
  	update_memslots(slots, new, change);
-	slots = install_new_memslots(kvm, as_id, slots);
+	install_new_memslots(kvm, as_id, slots);
  
  	kvm_arch_commit_memory_region(kvm, mem, old, new, change);
-
-	kvfree(slots);
  	return 0;
  
  out_slots:
-	if (change == KVM_MR_DELETE || change == KVM_MR_MOVE)
+	if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
+		slot = id_to_memslot(slots, old->id);
+		slot->flags &= ~KVM_MEMSLOT_INVALID;
  		slots = install_new_memslots(kvm, as_id, slots);
+	}
  	kvfree(slots);
  	return r;
  }

One could optimize things a bit by reusing the allocation and only
doing a memcpy from the new memslots array to the old one under the
slots_arch_lock.  (Plus the above still lacks a mutex_init and
should be split in two patches, with the mutex going in the second;
but you get the idea and code sometimes is easier than words).

Paolo

Sean Christopherson April 29, 2021, 12:40 a.m. UTC | #8

On Thu, Apr 29, 2021, Paolo Bonzini wrote:
> it's not ugly and it's still relatively easy to explain.

LOL, that's debatable.

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 2799c6660cce..48929dd5fb29 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1377,16 +1374,17 @@ static int kvm_set_memslot(struct kvm *kvm,
>  		goto out_slots;
>  	update_memslots(slots, new, change);
> -	slots = install_new_memslots(kvm, as_id, slots);
> +	install_new_memslots(kvm, as_id, slots);
>  	kvm_arch_commit_memory_region(kvm, mem, old, new, change);
> -
> -	kvfree(slots);
>  	return 0;
>  out_slots:
> -	if (change == KVM_MR_DELETE || change == KVM_MR_MOVE)
> +	if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
> +		slot = id_to_memslot(slots, old->id);
> +		slot->flags &= ~KVM_MEMSLOT_INVALID;

Modifying flags on an SRCU-protect field outside of said protection is sketchy.
It's probably ok to do this prior to the generation update, emphasis on
"probably".  Of course, the VM is also likely about to be killed in this case...

>  		slots = install_new_memslots(kvm, as_id, slots);

This will explode if memory allocation for KVM_MR_MOVE fails.  In that case,
the rmaps for "slots" will have been cleared by kvm_alloc_memslot_metadata().

> +	}
>  	kvfree(slots);
>  	return r;
>  }

The SRCU index is already tracked in vcpu->srcu_idx, why not temporarily drop
the SRCU lock if activate_shadow_mmu() needs to do work so that it can take
slots_lock?  That seems simpler and I think would avoid modifying the common
memslot code.

kvm_arch_async_page_ready() is the only path for reaching kvm_mmu_reload() that
looks scary, but that should be impossible to reach with the correct MMU context.
We could always and an explicit sanity check on the rmaps being avaiable.

Sean Christopherson April 29, 2021, 1:42 a.m. UTC | #9

On Thu, Apr 29, 2021, Sean Christopherson wrote:
> On Thu, Apr 29, 2021, Paolo Bonzini wrote:
> > it's not ugly and it's still relatively easy to explain.
> 
> LOL, that's debatable.
> 
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 2799c6660cce..48929dd5fb29 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -1377,16 +1374,17 @@ static int kvm_set_memslot(struct kvm *kvm,
> >  		goto out_slots;
> >  	update_memslots(slots, new, change);
> > -	slots = install_new_memslots(kvm, as_id, slots);
> > +	install_new_memslots(kvm, as_id, slots);
> >  	kvm_arch_commit_memory_region(kvm, mem, old, new, change);
> > -
> > -	kvfree(slots);
> >  	return 0;
> >  out_slots:
> > -	if (change == KVM_MR_DELETE || change == KVM_MR_MOVE)
> > +	if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
> > +		slot = id_to_memslot(slots, old->id);
> > +		slot->flags &= ~KVM_MEMSLOT_INVALID;
> 
> Modifying flags on an SRCU-protect field outside of said protection is sketchy.
> It's probably ok to do this prior to the generation update, emphasis on
> "probably".  Of course, the VM is also likely about to be killed in this case...
> 
> >  		slots = install_new_memslots(kvm, as_id, slots);
> 
> This will explode if memory allocation for KVM_MR_MOVE fails.  In that case,
> the rmaps for "slots" will have been cleared by kvm_alloc_memslot_metadata().

Gah, that's all wrong, slots are the second duplicate and the clear happens on
the new slot, not the old slot with the same id.

Though I still think temporarily dropping the SRCU lock would be simpler.  If
performance is a concern, it could be mitigated by adding a capability to
preallocate the rmaps.

> > +	}
> >  	kvfree(slots);
> >  	return r;
> >  }
> 
> The SRCU index is already tracked in vcpu->srcu_idx, why not temporarily drop
> the SRCU lock if activate_shadow_mmu() needs to do work so that it can take
> slots_lock?  That seems simpler and I think would avoid modifying the common
> memslot code.
> 
> kvm_arch_async_page_ready() is the only path for reaching kvm_mmu_reload() that
> looks scary, but that should be impossible to reach with the correct MMU context.
> We could always and an explicit sanity check on the rmaps being avaiable.

Paolo Bonzini April 29, 2021, 7:02 a.m. UTC | #10

On 29/04/21 02:40, Sean Christopherson wrote:
> On Thu, Apr 29, 2021, Paolo Bonzini wrote:
>> it's not ugly and it's still relatively easy to explain.
> 
> LOL, that's debatable.

 From your remark below it looks like we have different priorities on 
what to avoid modifying.

I like the locks to be either very coarse or fine-grained enough for 
them to be leaves, as I find that to be the easiest way to avoid 
deadlocks and complex hierarchies.  For this reason, I treat unlocking 
in the middle of a large critical section as "scary by default"; you 
have to worry about which invariants might be required (in the case of 
RCU, which pointers might be stored somewhere and would be invalidated), 
and which locks are taken at that point so that the subsequent relocking 
would change the lock order from AB to BA.

This applies to every path leading to the unlock/relock.  So instead 
what matters IMO is shielding architecture code from the races that Ben 
had to point out to me, _and the possibility to apply easily explained 
rules_ outside more complex core code.

So, well, "relatively easy" because it's indeed subtle.  But if you 
consider what the locking rules are, "you can choose to protect 
slots->arch data with this mutex and it will have no problematic 
interactions with the memslot copy/update code" is as simple as it can get.

>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 2799c6660cce..48929dd5fb29 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -1377,16 +1374,17 @@ static int kvm_set_memslot(struct kvm *kvm,
>>   		goto out_slots;
>>   	update_memslots(slots, new, change);
>> -	slots = install_new_memslots(kvm, as_id, slots);
>> +	install_new_memslots(kvm, as_id, slots);
>>   	kvm_arch_commit_memory_region(kvm, mem, old, new, change);
>> -
>> -	kvfree(slots);
>>   	return 0;
>>   out_slots:
>> -	if (change == KVM_MR_DELETE || change == KVM_MR_MOVE)
>> +	if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
>> +		slot = id_to_memslot(slots, old->id);
>> +		slot->flags &= ~KVM_MEMSLOT_INVALID;
> 
> Modifying flags on an SRCU-protect field outside of said protection is sketchy.
> It's probably ok to do this prior to the generation update, emphasis on
> "probably".  Of course, the VM is also likely about to be killed in this case...
> 
>>   		slots = install_new_memslots(kvm, as_id, slots);
> 
> This will explode if memory allocation for KVM_MR_MOVE fails.  In that case,
> the rmaps for "slots" will have been cleared by kvm_alloc_memslot_metadata().

I take your subsequent reply as a sort-of-review that the above approach 
works, though we may disagree on its elegance and complexity.

Paolo

> The SRCU index is already tracked in vcpu->srcu_idx, why not temporarily drop
> the SRCU lock if activate_shadow_mmu() needs to do work so that it can take
> slots_lock?  That seems simpler and I think would avoid modifying the common
> memslot code.
>
> kvm_arch_async_page_ready() is the only path for reaching kvm_mmu_reload() that
> looks scary, but that should be impossible to reach with the correct MMU context.
> We could always and an explicit sanity check on the rmaps being avaiable.

Ben Gardon April 29, 2021, 5:45 p.m. UTC | #11

On Thu, Apr 29, 2021 at 12:02 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 29/04/21 02:40, Sean Christopherson wrote:
> > On Thu, Apr 29, 2021, Paolo Bonzini wrote:
> >> it's not ugly and it's still relatively easy to explain.
> >
> > LOL, that's debatable.
>
>  From your remark below it looks like we have different priorities on
> what to avoid modifying.
>
> I like the locks to be either very coarse or fine-grained enough for
> them to be leaves, as I find that to be the easiest way to avoid
> deadlocks and complex hierarchies.  For this reason, I treat unlocking
> in the middle of a large critical section as "scary by default"; you
> have to worry about which invariants might be required (in the case of
> RCU, which pointers might be stored somewhere and would be invalidated),
> and which locks are taken at that point so that the subsequent relocking
> would change the lock order from AB to BA.

The idea of dropping the srcu read lock to allocate the rmaps scares
me too. I have no idea what it protects, in addition to the memslots,
and where we might be making assumptions about things staying valid
because of it.
Is there anywhere that we enumerate the things protected by that SRCU?
I wonder if we have complete enough annotations for the SRCU / lockdep
checker to find a problem if we were dropping the SRCU read lock in a
bad place.

>
> This applies to every path leading to the unlock/relock.  So instead
> what matters IMO is shielding architecture code from the races that Ben
> had to point out to me, _and the possibility to apply easily explained
> rules_ outside more complex core code.
>
> So, well, "relatively easy" because it's indeed subtle.  But if you
> consider what the locking rules are, "you can choose to protect
> slots->arch data with this mutex and it will have no problematic
> interactions with the memslot copy/update code" is as simple as it can get.
>
> >> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> >> index 2799c6660cce..48929dd5fb29 100644
> >> --- a/virt/kvm/kvm_main.c
> >> +++ b/virt/kvm/kvm_main.c
> >> @@ -1377,16 +1374,17 @@ static int kvm_set_memslot(struct kvm *kvm,
> >>              goto out_slots;
> >>      update_memslots(slots, new, change);
> >> -    slots = install_new_memslots(kvm, as_id, slots);
> >> +    install_new_memslots(kvm, as_id, slots);
> >>      kvm_arch_commit_memory_region(kvm, mem, old, new, change);
> >> -
> >> -    kvfree(slots);
> >>      return 0;
> >>   out_slots:
> >> -    if (change == KVM_MR_DELETE || change == KVM_MR_MOVE)
> >> +    if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
> >> +            slot = id_to_memslot(slots, old->id);
> >> +            slot->flags &= ~KVM_MEMSLOT_INVALID;
> >
> > Modifying flags on an SRCU-protect field outside of said protection is sketchy.
> > It's probably ok to do this prior to the generation update, emphasis on
> > "probably".  Of course, the VM is also likely about to be killed in this case...
> >
> >>              slots = install_new_memslots(kvm, as_id, slots);
> >
> > This will explode if memory allocation for KVM_MR_MOVE fails.  In that case,
> > the rmaps for "slots" will have been cleared by kvm_alloc_memslot_metadata().
>
> I take your subsequent reply as a sort-of-review that the above approach
> works, though we may disagree on its elegance and complexity.

I'll try Paolo's suggestion about using a second dup_slots to avoid
backing the higher level rmap arrays with dynamic memory and send out
a V2.

>
> Paolo
>
> > The SRCU index is already tracked in vcpu->srcu_idx, why not temporarily drop
> > the SRCU lock if activate_shadow_mmu() needs to do work so that it can take
> > slots_lock?  That seems simpler and I think would avoid modifying the common
> > memslot code.
> >
> > kvm_arch_async_page_ready() is the only path for reaching kvm_mmu_reload() that
> > looks scary, but that should be impossible to reach with the correct MMU context.
> > We could always and an explicit sanity check on the rmaps being avaiable.
>

[5/6] KVM: x86/mmu: Protect kvm->memslots with a mutex

Commit Message

Comments

Patch