diff mbox series

[4/6] KVM: pfncache: simplify locking and make more self-contained

Message ID 20240217114017.11551-5-dwmw2@infradead.org (mailing list archive)
State New, archived
Headers show
Series KVM: x86/xen updates | expand

Commit Message

David Woodhouse Feb. 17, 2024, 11:27 a.m. UTC
From: David Woodhouse <dwmw@amazon.co.uk>

The locking on the gfn_to_pfn_cache is... interesting. And awful.

There is a rwlock in ->lock which readers take to ensure protection
against concurrent changes. But __kvm_gpc_refresh() makes assumptions
that certain fields will not change even while it drops the write lock
and performs MM operations to revalidate the target PFN and kernel
mapping.

Commit 93984f19e7bc ("KVM: Fully serialize gfn=>pfn cache refresh via
mutex") partly addressed that — not by fixing it, but by adding a new
mutex, ->refresh_lock. This prevented concurrent __kvm_gpc_refresh()
calls on a given gfn_to_pfn_cache, but is still only a partial solution.

There is still a theoretical race where __kvm_gpc_refresh() runs in
parallel with kvm_gpc_deactivate(). While __kvm_gpc_refresh() has
dropped the write lock, kvm_gpc_deactivate() clears the ->active flag
and unmaps ->khva. Then __kvm_gpc_refresh() determines that the previous
->pfn and ->khva are still valid, and reinstalls those values into the
structure. This leaves the gfn_to_pfn_cache with the ->valid bit set,
but ->active clear. And a ->khva which looks like a reasonable kernel
address but is actually unmapped.

All it takes is a subsequent reactivation to cause that ->khva to be
dereferenced. This would theoretically cause an oops which would look
something like this:

[1724749.564994] BUG: unable to handle page fault for address: ffffaa3540ace0e0
[1724749.565039] RIP: 0010:__kvm_xen_has_interrupt+0x8b/0xb0

I say "theoretically" because theoretically, that oops that was seen in
production cannot happen. The code which uses the gfn_to_pfn_cache is
supposed to have its *own* locking, to further paper over the fact that
the gfn_to_pfn_cache's own papering-over (->refresh_lock) of its own
rwlock abuse is not sufficient.

For the Xen vcpu_info that external lock is the vcpu->mutex, and for the
shared info it's kvm->arch.xen.xen_lock. Those locks ought to protect
the gfn_to_pfn_cache against concurrent deactivation vs. refresh in all
but the cases where the vcpu or kvm object is being *destroyed*, in
which case the subsequent reactivation should never happen.

Theoretically.

Nevertheless, this locking abuse is awful and should be fixed, even if
no clear explanation can be found for how the oops happened. So expand
the use of the ->refresh_lock mutex to ensure serialization of
activate/deactivate vs. refresh and make the pfncache locking entirely
self-sufficient.

This means that a future commit can simplify the locking in the callers,
such as the Xen emulation code which has an outstanding problem with
recursive locking of kvm->arch.xen.xen_lock, which will no longer be
necessary.

The rwlock abuse described above is still not best practice, although
it's harmless now that the ->refresh_lock is held for the entire duration
while the offending code drops the write lock, does some other stuff,
then takes the write lock again and assumes nothing changed. That can
also be fixed^W cleaned up in a subsequent commit, but this commit is
a simpler basis for the Xen deadlock fix mentioned above.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 virt/kvm/pfncache.c | 32 +++++++++++++++++++++-----------
 1 file changed, 21 insertions(+), 11 deletions(-)

Comments

Paul Durrant Feb. 19, 2024, 9 a.m. UTC | #1
On 17/02/2024 11:27, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> The locking on the gfn_to_pfn_cache is... interesting. And awful.
> 
> There is a rwlock in ->lock which readers take to ensure protection
> against concurrent changes. But __kvm_gpc_refresh() makes assumptions
> that certain fields will not change even while it drops the write lock
> and performs MM operations to revalidate the target PFN and kernel
> mapping.
> 
> Commit 93984f19e7bc ("KVM: Fully serialize gfn=>pfn cache refresh via
> mutex") partly addressed that — not by fixing it, but by adding a new
> mutex, ->refresh_lock. This prevented concurrent __kvm_gpc_refresh()
> calls on a given gfn_to_pfn_cache, but is still only a partial solution.
> 
> There is still a theoretical race where __kvm_gpc_refresh() runs in
> parallel with kvm_gpc_deactivate(). While __kvm_gpc_refresh() has
> dropped the write lock, kvm_gpc_deactivate() clears the ->active flag
> and unmaps ->khva. Then __kvm_gpc_refresh() determines that the previous
> ->pfn and ->khva are still valid, and reinstalls those values into the
> structure. This leaves the gfn_to_pfn_cache with the ->valid bit set,
> but ->active clear. And a ->khva which looks like a reasonable kernel
> address but is actually unmapped.
> 
> All it takes is a subsequent reactivation to cause that ->khva to be
> dereferenced. This would theoretically cause an oops which would look
> something like this:
> 
> [1724749.564994] BUG: unable to handle page fault for address: ffffaa3540ace0e0
> [1724749.565039] RIP: 0010:__kvm_xen_has_interrupt+0x8b/0xb0
> 
> I say "theoretically" because theoretically, that oops that was seen in
> production cannot happen. The code which uses the gfn_to_pfn_cache is
> supposed to have its *own* locking, to further paper over the fact that
> the gfn_to_pfn_cache's own papering-over (->refresh_lock) of its own
> rwlock abuse is not sufficient.
> 
> For the Xen vcpu_info that external lock is the vcpu->mutex, and for the
> shared info it's kvm->arch.xen.xen_lock. Those locks ought to protect
> the gfn_to_pfn_cache against concurrent deactivation vs. refresh in all
> but the cases where the vcpu or kvm object is being *destroyed*, in
> which case the subsequent reactivation should never happen.
> 
> Theoretically.
> 
> Nevertheless, this locking abuse is awful and should be fixed, even if
> no clear explanation can be found for how the oops happened. So expand
> the use of the ->refresh_lock mutex to ensure serialization of
> activate/deactivate vs. refresh and make the pfncache locking entirely
> self-sufficient.
> 
> This means that a future commit can simplify the locking in the callers,
> such as the Xen emulation code which has an outstanding problem with
> recursive locking of kvm->arch.xen.xen_lock, which will no longer be
> necessary.
> 
> The rwlock abuse described above is still not best practice, although
> it's harmless now that the ->refresh_lock is held for the entire duration
> while the offending code drops the write lock, does some other stuff,
> then takes the write lock again and assumes nothing changed. That can
> also be fixed^W cleaned up in a subsequent commit, but this commit is
> a simpler basis for the Xen deadlock fix mentioned above.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   virt/kvm/pfncache.c | 32 +++++++++++++++++++++-----------
>   1 file changed, 21 insertions(+), 11 deletions(-)
> 
> diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
> index a60d8f906896..79a3ef7c6d04 100644
> --- a/virt/kvm/pfncache.c
> +++ b/virt/kvm/pfncache.c
> @@ -255,12 +255,7 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
>   	if (page_offset + len > PAGE_SIZE)
>   		return -EINVAL;
>   
> -	/*
> -	 * If another task is refreshing the cache, wait for it to complete.
> -	 * There is no guarantee that concurrent refreshes will see the same
> -	 * gpa, memslots generation, etc..., so they must be fully serialized.
> -	 */
> -	mutex_lock(&gpc->refresh_lock);
> +	lockdep_assert_held(&gpc->refresh_lock);
>   
>   	write_lock_irq(&gpc->lock);
>   
> @@ -346,8 +341,6 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
>   out_unlock:
>   	write_unlock_irq(&gpc->lock);
>   
> -	mutex_unlock(&gpc->refresh_lock);
> -
>   	if (unmap_old)
>   		gpc_unmap(old_pfn, old_khva);
>   
> @@ -356,16 +349,24 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
>   
>   int kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, unsigned long len)
>   {
> -	unsigned long uhva = gpc->uhva;
> +	unsigned long uhva;
> +	int ret;
> +
> +	mutex_lock(&gpc->refresh_lock);
>   
>   	/*
>   	 * If the GPA is valid then invalidate the HVA, otherwise
>   	 * __kvm_gpc_refresh() will fail its strict either/or address check.
>   	 */
> +	uhva = gpc->uhva;
>   	if (!kvm_is_error_gpa(gpc->gpa))
>   		uhva = KVM_HVA_ERR_BAD;
>   
> -	return __kvm_gpc_refresh(gpc, gpc->gpa, uhva, len);
> +	ret = __kvm_gpc_refresh(gpc, gpc->gpa, uhva, len);
> +
> +	mutex_unlock(&gpc->refresh_lock);
> +
> +	return ret;
>   }
>   
>   void kvm_gpc_init(struct gfn_to_pfn_cache *gpc, struct kvm *kvm)
> @@ -377,12 +378,16 @@ void kvm_gpc_init(struct gfn_to_pfn_cache *gpc, struct kvm *kvm)
>   	gpc->pfn = KVM_PFN_ERR_FAULT;
>   	gpc->gpa = INVALID_GPA;
>   	gpc->uhva = KVM_HVA_ERR_BAD;
> +	gpc->active = gpc->valid = false;

Not strictly necessary if the comment above the function prototype in 
kvm_host.h is to be believed, but no harm.

Reviewed-by: Paul Durrant <paul@xen.org>

>   }
>
diff mbox series

Patch

diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index a60d8f906896..79a3ef7c6d04 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -255,12 +255,7 @@  static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
 	if (page_offset + len > PAGE_SIZE)
 		return -EINVAL;
 
-	/*
-	 * If another task is refreshing the cache, wait for it to complete.
-	 * There is no guarantee that concurrent refreshes will see the same
-	 * gpa, memslots generation, etc..., so they must be fully serialized.
-	 */
-	mutex_lock(&gpc->refresh_lock);
+	lockdep_assert_held(&gpc->refresh_lock);
 
 	write_lock_irq(&gpc->lock);
 
@@ -346,8 +341,6 @@  static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
 out_unlock:
 	write_unlock_irq(&gpc->lock);
 
-	mutex_unlock(&gpc->refresh_lock);
-
 	if (unmap_old)
 		gpc_unmap(old_pfn, old_khva);
 
@@ -356,16 +349,24 @@  static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
 
 int kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, unsigned long len)
 {
-	unsigned long uhva = gpc->uhva;
+	unsigned long uhva;
+	int ret;
+
+	mutex_lock(&gpc->refresh_lock);
 
 	/*
 	 * If the GPA is valid then invalidate the HVA, otherwise
 	 * __kvm_gpc_refresh() will fail its strict either/or address check.
 	 */
+	uhva = gpc->uhva;
 	if (!kvm_is_error_gpa(gpc->gpa))
 		uhva = KVM_HVA_ERR_BAD;
 
-	return __kvm_gpc_refresh(gpc, gpc->gpa, uhva, len);
+	ret = __kvm_gpc_refresh(gpc, gpc->gpa, uhva, len);
+
+	mutex_unlock(&gpc->refresh_lock);
+
+	return ret;
 }
 
 void kvm_gpc_init(struct gfn_to_pfn_cache *gpc, struct kvm *kvm)
@@ -377,12 +378,16 @@  void kvm_gpc_init(struct gfn_to_pfn_cache *gpc, struct kvm *kvm)
 	gpc->pfn = KVM_PFN_ERR_FAULT;
 	gpc->gpa = INVALID_GPA;
 	gpc->uhva = KVM_HVA_ERR_BAD;
+	gpc->active = gpc->valid = false;
 }
 
 static int __kvm_gpc_activate(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned long uhva,
 			      unsigned long len)
 {
 	struct kvm *kvm = gpc->kvm;
+	int ret;
+
+	mutex_lock(&gpc->refresh_lock);
 
 	if (!gpc->active) {
 		if (KVM_BUG_ON(gpc->valid, kvm))
@@ -401,7 +406,10 @@  static int __kvm_gpc_activate(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned
 		gpc->active = true;
 		write_unlock_irq(&gpc->lock);
 	}
-	return __kvm_gpc_refresh(gpc, gpa, uhva, len);
+	ret = __kvm_gpc_refresh(gpc, gpa, uhva, len);
+	mutex_unlock(&gpc->refresh_lock);
+
+	return ret;
 }
 
 int kvm_gpc_activate(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned long len)
@@ -420,6 +428,7 @@  void kvm_gpc_deactivate(struct gfn_to_pfn_cache *gpc)
 	kvm_pfn_t old_pfn;
 	void *old_khva;
 
+	mutex_lock(&gpc->refresh_lock);
 	if (gpc->active) {
 		/*
 		 * Deactivate the cache before removing it from the list, KVM
@@ -449,4 +458,5 @@  void kvm_gpc_deactivate(struct gfn_to_pfn_cache *gpc)
 
 		gpc_unmap(old_pfn, old_khva);
 	}
+	mutex_unlock(&gpc->refresh_lock);
 }