Message ID | 20230109201037.33051-2-mjrosato@linux.ibm.com (mailing list archive) |
---|---|
State | New, archived |
Series | kvm/vfio: fix potential deadlock on vfio group lock |
On Mon, Jan 09, 2023 at 03:10:36PM -0500, Matthew Rosato wrote:
> Currently it is possible that the final put of a KVM reference comes from
> vfio during its device close operation. This occurs while the vfio group
> lock is held; however, if the vfio device is still in the kvm device list,
> then the following call chain could result in a deadlock:
>
> kvm_put_kvm
>  -> kvm_destroy_vm
>   -> kvm_destroy_devices
>    -> kvm_vfio_destroy
>     -> kvm_vfio_file_set_kvm
>      -> vfio_file_set_kvm
>       -> group->group_lock/group_rwsem
>
> Avoid this scenario by adding kvm_put_kvm_async which will perform the
> kvm_destroy_vm asynchronously if the refcount reaches 0.
>
> Fixes: 421cfe6596f6 ("vfio: remove VFIO_GROUP_NOTIFY_SET_KVM")
> Reported-by: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
> ---
>  drivers/gpu/drm/i915/gvt/kvmgt.c  |  6 +++++-
>  drivers/s390/crypto/vfio_ap_ops.c |  7 ++++++-
>  include/linux/kvm_host.h          |  3 +++
>  virt/kvm/kvm_main.c               | 22 ++++++++++++++++++++++
>  4 files changed, 36 insertions(+), 2 deletions(-)

Why two patches?

It looks OK to me

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason
On 1/9/23 3:13 PM, Jason Gunthorpe wrote:
> On Mon, Jan 09, 2023 at 03:10:36PM -0500, Matthew Rosato wrote:
>> Currently it is possible that the final put of a KVM reference comes from
>> vfio during its device close operation. This occurs while the vfio group
>> lock is held; however, if the vfio device is still in the kvm device list,
>> then the following call chain could result in a deadlock:
>>
>> kvm_put_kvm
>>  -> kvm_destroy_vm
>>   -> kvm_destroy_devices
>>    -> kvm_vfio_destroy
>>     -> kvm_vfio_file_set_kvm
>>      -> vfio_file_set_kvm
>>       -> group->group_lock/group_rwsem
>>
>> Avoid this scenario by adding kvm_put_kvm_async which will perform the
>> kvm_destroy_vm asynchronously if the refcount reaches 0.
>>
>> Fixes: 421cfe6596f6 ("vfio: remove VFIO_GROUP_NOTIFY_SET_KVM")
>> Reported-by: Alex Williamson <alex.williamson@redhat.com>
>> Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
>> ---
>>  drivers/gpu/drm/i915/gvt/kvmgt.c  |  6 +++++-
>>  drivers/s390/crypto/vfio_ap_ops.c |  7 ++++++-
>>  include/linux/kvm_host.h          |  3 +++
>>  virt/kvm/kvm_main.c               | 22 ++++++++++++++++++++++
>>  4 files changed, 36 insertions(+), 2 deletions(-)
>
> Why two patches?
>
> It looks OK to me

Mentioned in the cover: the Fixes: tag is different on the 2nd patch, as the
s390 PCI passthrough kvm_puts were added later, sometime after 421cfe6596f6
("vfio: remove VFIO_GROUP_NOTIFY_SET_KVM"). The blamed commit for those
changes also landed in a different release (6.0 vs 5.19). But, now that you
mention it, neither is an LTS so it probably doesn't matter all that much and
could be squashed if preferred.

>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Thanks!

>
> Jason
LGTM

Reviewed-by: Tony Krowiak <akrowiak@linux.ibm.com>

On 1/9/23 3:10 PM, Matthew Rosato wrote:
> Currently it is possible that the final put of a KVM reference comes from
> vfio during its device close operation. This occurs while the vfio group
> lock is held; however, if the vfio device is still in the kvm device list,
> then the following call chain could result in a deadlock:
>
> kvm_put_kvm
>  -> kvm_destroy_vm
>   -> kvm_destroy_devices
>    -> kvm_vfio_destroy
>     -> kvm_vfio_file_set_kvm
>      -> vfio_file_set_kvm
>       -> group->group_lock/group_rwsem
>
> Avoid this scenario by adding kvm_put_kvm_async which will perform the
> kvm_destroy_vm asynchronously if the refcount reaches 0.
>
> Fixes: 421cfe6596f6 ("vfio: remove VFIO_GROUP_NOTIFY_SET_KVM")
> Reported-by: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
> ---
>  drivers/gpu/drm/i915/gvt/kvmgt.c  |  6 +++++-
>  drivers/s390/crypto/vfio_ap_ops.c |  7 ++++++-
>  include/linux/kvm_host.h          |  3 +++
>  virt/kvm/kvm_main.c               | 22 ++++++++++++++++++++++
>  4 files changed, 36 insertions(+), 2 deletions(-)
On Mon, Jan 09, 2023, Matthew Rosato wrote:
> Currently it is possible that the final put of a KVM reference comes from
> vfio during its device close operation. This occurs while the vfio group
> lock is held; however, if the vfio device is still in the kvm device list,
> then the following call chain could result in a deadlock:
>
> kvm_put_kvm
>  -> kvm_destroy_vm
>   -> kvm_destroy_devices
>    -> kvm_vfio_destroy
>     -> kvm_vfio_file_set_kvm
>      -> vfio_file_set_kvm
>       -> group->group_lock/group_rwsem
>
> Avoid this scenario by adding kvm_put_kvm_async which will perform the
> kvm_destroy_vm asynchronously if the refcount reaches 0.

Something feels off.  If KVM's refcount is 0, then accessing device->group->kvm
in vfio_device_open() can't happen unless there's a refcounting bug somewhere.
E.g. if this snippet holds group_lock

	mutex_lock(&device->group->group_lock);
	device->kvm = device->group->kvm;

	if (device->ops->open_device) {
		ret = device->ops->open_device(device);
		if (ret)
			goto err_undo_count;
	}
	vfio_device_container_register(device);
	mutex_unlock(&device->group->group_lock);

and kvm_vfio_destroy() has already been invoked and is waiting on group_lock in
vfio_file_set_kvm(), then device->ops->open_device() is being called with a
non-NULL device->kvm that has kvm->users_count==0.  And intel_vgpu_open_device()
at least uses kvm_get_kvm(), i.e. assumes kvm->users_count > 0.

If there's no refcounting bug then kvm_vfio_destroy() doesn't need to call
kvm_vfio_file_set_kvm() since the only way there isn't a refcounting bug is if
group->kvm is unreachable.  This seems unlikely.

Assuming there is indeed a refcounting issue, one solution would be to harden
all ->open_device() implementations to use kvm_get_kvm_safe().  I'd prefer not
to have to do that since it will complicate those implementations and also
requires KVM to support an async put.

Rather than force devices to get KVM references, why not handle that in common
VFIO code and drop KVM refcounting from devices?  Worst case scenario KVM is
pinned by a device that doesn't need KVM but is in a group associated with KVM.
If that's a concern, it seems easy enough to add a flag to vfio_device_ops to
enumerate whether or not the device depends on KVM.

---
 drivers/vfio/vfio_main.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 6e8804fe0095..fb43212d77a0 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -772,6 +772,13 @@ static struct file *vfio_device_open(struct vfio_device *device)
 	 * reference and release it during close_device.
 	 */
 	mutex_lock(&device->group->group_lock);
+
+	if (device->group->kvm &&
+	    !kvm_get_kvm_safe(device->group->kvm)) {
+		ret = -ENODEV;
+		goto err_undo_count;
+	}
+
 	device->kvm = device->group->kvm;
 
 	if (device->ops->open_device) {
@@ -823,8 +830,10 @@ static struct file *vfio_device_open(struct vfio_device *device)
 err_undo_count:
 	mutex_unlock(&device->group->group_lock);
 	device->open_count--;
-	if (device->open_count == 0 && device->kvm)
+	if (device->open_count == 0 && device->kvm) {
+		kvm_put_kvm(device->kvm);
 		device->kvm = NULL;
+	}
 	mutex_unlock(&device->dev_set->lock);
 	module_put(device->dev->driver->owner);
 err_unassign_container:
@@ -1039,8 +1048,10 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
 	}
 	mutex_unlock(&device->group->group_lock);
 	device->open_count--;
-	if (device->open_count == 0)
+	if (device->open_count == 0 && device->kvm) {
+		kvm_put_kvm(device->kvm);
 		device->kvm = NULL;
+	}
 	mutex_unlock(&device->dev_set->lock);
 	module_put(device->dev->driver->owner);

base-commit: d52444c7a90fc551b4c3b0bda7d3f0b2ca9fc84d
--
On Wed, Jan 11, 2023 at 07:54:51PM +0000, Sean Christopherson wrote:

> Something feels off.  If KVM's refcount is 0, then accessing device->group->kvm
> in vfio_device_open() can't happen unless there's a refcounting bug somewhere.

The problem is in close, not open. Specifically, it would be very hard to
avoid holding the group_lock during close, which is when the put is done.

> Rather than force devices to get KVM references, why not handle that in common
> VFIO code and drop KVM refcounting from devices?  Worst case scenario KVM is
> pinned by a device that doesn't need KVM but is in a group associated with KVM.
> If that's a concern, it seems easy enough to add a flag to vfio_device_ops to
> enumerate whether or not the device depends on KVM.

We can't make cross-dependencies between kvm and core VFIO - it is why
so much of this is so ugly.

The few device drivers that unavoidably have KVM involvement already
have a KVM module dependency, so they can safely do the get/put

Jason
On Wed, Jan 11, 2023, Jason Gunthorpe wrote:
> On Wed, Jan 11, 2023 at 07:54:51PM +0000, Sean Christopherson wrote:
>
>> Something feels off.  If KVM's refcount is 0, then accessing device->group->kvm
>> in vfio_device_open() can't happen unless there's a refcounting bug somewhere.
>
> The problem is in close, not open.

The deadlock problem is, yes.  My point is that if group_lock needs to be taken
when nullifying group->kvm during kvm_vfio_destroy(), then there is also a
refcounting problem with respect to open().  If there is no refcounting problem,
then nullifying group->kvm during kvm_vfio_destroy() is unnecessary (but again,
I doubt this is the case).

The two things aren't directly related, but it seems possible to solve both
while making this all slightly less ugly.  Well, at least from KVM's
perspective; whether or not it'd be an improvement on the VFIO side is
definitely debatable.

> Specifically it would be very hard to avoid holding the group_lock
> during close which is when the put is done.
>
>> Rather than force devices to get KVM references, why not handle that in common
>> VFIO code and drop KVM refcounting from devices?  Worst case scenario KVM is
>> pinned by a device that doesn't need KVM but is in a group associated with KVM.
>> If that's a concern, it seems easy enough to add a flag to vfio_device_ops to
>> enumerate whether or not the device depends on KVM.
>
> We can't make cross-dependencies between kvm and core VFIO - it is why
> so much of this is so ugly.

Ugh, right, modules for everyone.

> The few device drivers that unavoidably have KVM involvement already
> have a KVM module dependency, so they can safely do the get/put

Rather than store a "struct kvm *" in vfio_device, what about adding a new set
of optional ops to get/put KVM references?  Having dedicated KVM ops is gross,
but IMO it's less gross than backdooring the KVM pointer into open_device() by
stashing KVM into the device, e.g. it formalizes the VFIO API for devices that
depend on KVM instead of making devices pinky-swear to grab a reference during
open_device().

To further harden things, KVM could export only kvm_get_kvm_safe() if there are
no vendor modules.  I.e. make kvm_get_kvm() an internal-only helper when
possible and effectively force VFIO devices to use the safe variant.  That
would work even on x86, as kvm_get_kvm() wouldn't be exported if neither
KVM_AMD nor KVM_INTEL is built as a module.

---
 drivers/vfio/vfio_main.c | 20 +++++++++++++-------
 include/linux/vfio.h     |  9 +++++++--
 2 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 6e8804fe0095..b3a84d65baa6 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -772,7 +772,12 @@ static struct file *vfio_device_open(struct vfio_device *device)
 	 * reference and release it during close_device.
 	 */
 	mutex_lock(&device->group->group_lock);
-	device->kvm = device->group->kvm;
+
+	if (device->kvm_ops && device->group->kvm) {
+		ret = device->kvm_ops->get_kvm(device->group->kvm);
+		if (ret)
+			goto err_undo_count;
+	}
 
 	if (device->ops->open_device) {
 		ret = device->ops->open_device(device);
@@ -823,8 +828,9 @@ static struct file *vfio_device_open(struct vfio_device *device)
 err_undo_count:
 	mutex_unlock(&device->group->group_lock);
 	device->open_count--;
-	if (device->open_count == 0 && device->kvm)
-		device->kvm = NULL;
+	if (device->open_count == 0 && device->kvm_ops)
+		device->kvm_ops->put_kvm();
+
 	mutex_unlock(&device->dev_set->lock);
 	module_put(device->dev->driver->owner);
 err_unassign_container:
@@ -1039,8 +1045,8 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
 	}
 	mutex_unlock(&device->group->group_lock);
 	device->open_count--;
-	if (device->open_count == 0)
-		device->kvm = NULL;
+	if (device->open_count == 0 && device->kvm_ops)
+		device->kvm_ops->put_kvm();
 	mutex_unlock(&device->dev_set->lock);
 	module_put(device->dev->driver->owner);
@@ -1656,8 +1662,8 @@ EXPORT_SYMBOL_GPL(vfio_file_enforced_coherent);
  * @file: VFIO group file
  * @kvm: KVM to link
  *
- * When a VFIO device is first opened the KVM will be available in
- * device->kvm if one was associated with the group.
+ * When a VFIO device is first opened, the device's kvm_ops->get_kvm() will be
+ * invoked with the KVM instance associated with the group (if applicable).
  */
 void vfio_file_set_kvm(struct file *file, struct kvm *kvm)
 {
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index fdd393f70b19..d6dcbe0546bf 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -18,6 +18,11 @@
 
 struct kvm;
 
+struct vfio_device_kvm_ops {
+	int (*get_kvm)(struct kvm *kvm);
+	void (*put_kvm)(void);
+};
+
 /*
  * VFIO devices can be placed in a set, this allows all devices to share this
  * structure and the VFIO core will provide a lock that is held around
@@ -43,8 +48,8 @@ struct vfio_device {
 	struct vfio_device_set *dev_set;
 	struct list_head dev_set_list;
 	unsigned int migration_flags;
-	/* Driver must reference the kvm during open_device or never touch it */
-	struct kvm *kvm;
+
+	const struct vfio_device_kvm_ops *kvm_ops;
 
 	/* Members below here are private, not for driver use */
 	unsigned int index;

base-commit: d52444c7a90fc551b4c3b0bda7d3f0b2ca9fc84d
--
On Wed, Jan 11, 2023 at 08:53:34PM +0000, Sean Christopherson wrote:
> On Wed, Jan 11, 2023, Jason Gunthorpe wrote:
>> On Wed, Jan 11, 2023 at 07:54:51PM +0000, Sean Christopherson wrote:
>>
>>> Something feels off.  If KVM's refcount is 0, then accessing device->group->kvm
>>> in vfio_device_open() can't happen unless there's a refcounting bug somewhere.
>>
>> The problem is in close, not open.
>
> The deadlock problem is, yes.  My point is that if group_lock needs to be taken
> when nullifying group->kvm during kvm_vfio_destroy(), then there is also a
> refcounting problem with respect to open().  If there is no refcounting problem,
> then nullifying group->kvm during kvm_vfio_destroy() is unnecessary (but again,
> I doubt this is the case).

IIRC the drivers are supposed to use one of the refcount not zero
incrs to counteract this, but I never checked they do..

Yi is working on a patch to change things so vfio drops the kvm
pointer when the kvm file closes, not when the reference goes to 0,
to avoid a refcount cycle problem, which should also solve that.

> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index 6e8804fe0095..b3a84d65baa6 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -772,7 +772,12 @@ static struct file *vfio_device_open(struct vfio_device *device)
>  	 * reference and release it during close_device.
>  	 */
>  	mutex_lock(&device->group->group_lock);
> -	device->kvm = device->group->kvm;
> +
> +	if (device->kvm_ops && device->group->kvm) {
> +		ret = device->kvm_ops->get_kvm(device->group->kvm);

At this point I'd rather just use the symbol get stuff like kvm does
and call the proper functions.

Jason
On 1/12/23 7:45 AM, Jason Gunthorpe wrote:
> On Wed, Jan 11, 2023 at 08:53:34PM +0000, Sean Christopherson wrote:
>> On Wed, Jan 11, 2023, Jason Gunthorpe wrote:
>>> On Wed, Jan 11, 2023 at 07:54:51PM +0000, Sean Christopherson wrote:
>>>
>>>> Something feels off.  If KVM's refcount is 0, then accessing device->group->kvm
>>>> in vfio_device_open() can't happen unless there's a refcounting bug somewhere.
>>>
>>> The problem is in close, not open.
>>
>> The deadlock problem is, yes.  My point is that if group_lock needs to be taken
>> when nullifying group->kvm during kvm_vfio_destroy(), then there is also a
>> refcounting problem with respect to open().  If there is no refcounting problem,
>> then nullifying group->kvm during kvm_vfio_destroy() is unnecessary (but again,
>> I doubt this is the case).
>
> IIRC the drivers are supposed to use one of the refcount not zero
> incrs to counteract this, but I never checked they do..
>
> Yi is working on a patch to change things so vfio drops the kvm
> pointer when the kvm file closes, not when the reference goes to 0,
> to avoid a refcount cycle problem, which should also solve that.
>
>> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
>> index 6e8804fe0095..b3a84d65baa6 100644
>> --- a/drivers/vfio/vfio_main.c
>> +++ b/drivers/vfio/vfio_main.c
>> @@ -772,7 +772,12 @@ static struct file *vfio_device_open(struct vfio_device *device)
>>  	 * reference and release it during close_device.
>>  	 */
>>  	mutex_lock(&device->group->group_lock);
>> -	device->kvm = device->group->kvm;
>> +
>> +	if (device->kvm_ops && device->group->kvm) {
>> +		ret = device->kvm_ops->get_kvm(device->group->kvm);
>
> At this point I'd rather just use the symbol get stuff like kvm does
> and call the proper functions.
>

So should I work up a v2 that does symbol gets for kvm_get_kvm_safe and
kvm_put_kvm from vfio_main and drop kvm_put_kvm_async?  Or is the patch Yi is
working on changing things in a way that will also address the deadlock issue?
If so, something like the following (where vfio_kvm_get symbol gets
kvm_get_kvm_safe and vfio_kvm_put symbol gets kvm_put_kvm):

diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 5177bb061b17..a49bf1080f0a 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -361,16 +361,22 @@ static int vfio_device_first_open(struct vfio_device *device,
 	if (ret)
 		goto err_module_put;
 
+	if (kvm && !vfio_kvm_get(kvm)) {
+		ret = -ENOENT;
+		goto err_unuse_iommu;
+	}
 	device->kvm = kvm;
 	if (device->ops->open_device) {
 		ret = device->ops->open_device(device);
 		if (ret)
-			goto err_unuse_iommu;
+			goto err_put_kvm;
 	}
 	return 0;
 
-err_unuse_iommu:
+err_put_kvm:
+	vfio_kvm_put(kvm);
 	device->kvm = NULL;
+err_unuse_iommu:
 	if (iommufd)
 		vfio_iommufd_unbind(device);
 	else
@@ -465,6 +471,9 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
 
 	vfio_device_group_close(device);
 
+	if (device->open_count == 0 && device->group->kvm)
+		vfio_kvm_put(device->group->kvm);
+
 	vfio_device_put_registration(device);
 
 	return 0;
On Thu, Jan 12, 2023 at 12:21:17PM -0500, Matthew Rosato wrote:

> So should I work up a v2 that does symbol gets for kvm_get_kvm_safe
> and kvm_put_kvm from vfio_main and drop kvm_put_kvm_async?  Or is
> the patch Yi is working on changing things such that will also
> address the deadlock issue?

I don't think Yi's part will help

> @@ -361,16 +361,22 @@ static int vfio_device_first_open(struct vfio_device *device,
>  	if (ret)
>  		goto err_module_put;
>
> +	if (kvm && !vfio_kvm_get(kvm)) {

Do call it kvm_get_safe though

> +		ret = -ENOENT;
> +		goto err_unuse_iommu;
> +	}
>  	device->kvm = kvm;
>  	if (device->ops->open_device) {
>  		ret = device->ops->open_device(device);
>  		if (ret)
> -			goto err_unuse_iommu;
> +			goto err_put_kvm;
>  	}
>  	return 0;
>
> -err_unuse_iommu:
> +err_put_kvm:
> +	vfio_kvm_put(kvm);
>  	device->kvm = NULL;
> +err_unuse_iommu:
>  	if (iommufd)
>  		vfio_iommufd_unbind(device);
>  	else
> @@ -465,6 +471,9 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
>
>  	vfio_device_group_close(device);
>
> +	if (device->open_count == 0 && device->group->kvm)
> +		vfio_kvm_put(device->group->kvm);
> +

No, you can't touch group->kvm without holding the group lock, that is
the whole point of the problem.. This has to be device->kvm

Jason
Currently it is possible that the final put of a KVM reference comes from
vfio during its device close operation. This occurs while the vfio group
lock is held; however, if the vfio device is still in the kvm device list,
then the following call chain could result in a deadlock:

kvm_put_kvm
 -> kvm_destroy_vm
  -> kvm_destroy_devices
   -> kvm_vfio_destroy
    -> kvm_vfio_file_set_kvm
     -> vfio_file_set_kvm
      -> group->group_lock/group_rwsem

Avoid this scenario by adding kvm_put_kvm_async which will perform the
kvm_destroy_vm asynchronously if the refcount reaches 0.

Fixes: 421cfe6596f6 ("vfio: remove VFIO_GROUP_NOTIFY_SET_KVM")
Reported-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
---
 drivers/gpu/drm/i915/gvt/kvmgt.c  |  6 +++++-
 drivers/s390/crypto/vfio_ap_ops.c |  7 ++++++-
 include/linux/kvm_host.h          |  3 +++
 virt/kvm/kvm_main.c               | 22 ++++++++++++++++++++++
 4 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c b/drivers/gpu/drm/i915/gvt/kvmgt.c
index 8ae7039b3683..24511c877572 100644
--- a/drivers/gpu/drm/i915/gvt/kvmgt.c
+++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
@@ -703,7 +703,11 @@ static void intel_vgpu_close_device(struct vfio_device *vfio_dev)
 
 	kvm_page_track_unregister_notifier(vgpu->vfio_device.kvm,
 					   &vgpu->track_node);
-	kvm_put_kvm(vgpu->vfio_device.kvm);
+	/*
+	 * Avoid possible deadlock on any currently-held vfio lock by
+	 * ensuring the potential kvm_destroy_vm call is done asynchronously
+	 */
+	kvm_put_kvm_async(vgpu->vfio_device.kvm);
 
 	kvmgt_protect_table_destroy(vgpu);
 	gvt_cache_destroy(vgpu);
diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index e93bb9c468ce..a37b2baefb36 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -1574,7 +1574,12 @@ static void vfio_ap_mdev_unset_kvm(struct ap_matrix_mdev *matrix_mdev)
 
 	kvm_arch_crypto_clear_masks(kvm);
 	vfio_ap_mdev_reset_queues(&matrix_mdev->qtable);
-	kvm_put_kvm(kvm);
+	/*
+	 * Avoid possible deadlock on any currently-held vfio lock by
+	 * ensuring the potential kvm_destroy_vm call is done
+	 * asynchronously
+	 */
+	kvm_put_kvm_async(kvm);
 	matrix_mdev->kvm = NULL;
 
 	release_update_locks_for_kvm(kvm);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4f26b244f6d0..2ef6a5102265 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -34,6 +34,7 @@
 #include <linux/instrumentation.h>
 #include <linux/interval_tree.h>
 #include <linux/rbtree.h>
+#include <linux/workqueue.h>
 #include <linux/xarray.h>
 #include <asm/signal.h>
 
@@ -793,6 +794,7 @@ struct kvm {
 	struct kvm_stat_data **debugfs_stat_data;
 	struct srcu_struct srcu;
 	struct srcu_struct irq_srcu;
+	struct work_struct async_work;
 	pid_t userspace_pid;
 	bool override_halt_poll_ns;
 	unsigned int max_halt_poll_ns;
@@ -963,6 +965,7 @@ void kvm_exit(void);
 void kvm_get_kvm(struct kvm *kvm);
 bool kvm_get_kvm_safe(struct kvm *kvm);
 void kvm_put_kvm(struct kvm *kvm);
+void kvm_put_kvm_async(struct kvm *kvm);
 bool file_is_kvm(struct file *file);
 void kvm_put_kvm_no_destroy(struct kvm *kvm);
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 13e88297f999..fbe8d127028b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1353,6 +1353,28 @@ void kvm_put_kvm(struct kvm *kvm)
 }
 EXPORT_SYMBOL_GPL(kvm_put_kvm);
 
+static void kvm_put_async_fn(struct work_struct *work)
+{
+	struct kvm *kvm = container_of(work, struct kvm,
+				       async_work);
+
+	kvm_destroy_vm(kvm);
+}
+
+/*
+ * Put a reference but only destroy the vm asynchronously.  Can be used in
+ * cases where the caller holds a mutex that could cause deadlock if
+ * kvm_destroy_vm is triggered
+ */
+void kvm_put_kvm_async(struct kvm *kvm)
+{
+	if (refcount_dec_and_test(&kvm->users_count)) {
+		INIT_WORK(&kvm->async_work, kvm_put_async_fn);
+		schedule_work(&kvm->async_work);
+	}
+}
+EXPORT_SYMBOL_GPL(kvm_put_kvm_async);
+
 /*
  * Used to put a reference that was taken on behalf of an object associated
  * with a user-visible file descriptor, e.g. a vcpu or device, if installation