Message ID | 20221025193820.4412-1-ajderossi@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Series | vfio: Decrement open_count before close_device() |
On Tue, Oct 25, 2022 at 12:38:20PM -0700, Anthony DeRossi wrote:
> The implementation of close_device() for vfio-pci inspects the
> open_count of every device in the device set to determine whether a
> reset is needed. Unless open_count is decremented before invoking
> close_device(), the device set will always contain a device with
> open_count > 0, effectively disabling the reset logic.

This seems to miss the reason why this was done:

    Eliminate the calls to vfio_group_add_container_user() and add
    vfio_assert_device_open() to detect driver mis-use. This causes the
    close_device() op to check device->open_count so always leave it
    elevated while calling the op.

If we let it be zero then vfio_assert_device_open() will trigger on
other drivers.

I think the best approach is to change vfio_pci to understand that
open_count == 1 means it is the last close.

Thanks,
Jason
On Wed, Oct 26, 2022 at 11:24 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> I think the best approach is to change vfio_pci to understand that
> open_count == 1 means it is the last close.

Thanks for the feedback. I sent an updated patch.

v2: https://lore.kernel.org/kvm/20221026194245.1769-1-ajderossi@gmail.com/

Anthony
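The suggestion in the reply above can be illustrated with a small user-space model. This is only a sketch of one possible reading of "open_count == 1 means it is the last close", not the code from the v2 patch. All names (`sketch_device`, `sketch_assert_open`, `sketch_needs_reset_last_close`, `sketch_release`) are hypothetical and merely stand in for the kernel structures and helpers mentioned in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct vfio_device; only open_count is modeled. */
struct sketch_device {
	unsigned int open_count;
};

/* Models the idea of vfio_assert_device_open(): callbacks expect an open device. */
static void sketch_assert_open(const struct sketch_device *dev)
{
	assert(dev->open_count > 0);
}

/*
 * Reset check that tolerates the elevated count: the device being closed
 * may still hold its final reference, but every other device in the set
 * must already be fully closed.
 */
static bool sketch_needs_reset_last_close(const struct sketch_device *set,
					  size_t n, size_t closing)
{
	sketch_assert_open(&set[closing]);
	for (size_t i = 0; i < n; i++) {
		unsigned int allowed = (i == closing) ? 1 : 0;

		if (set[i].open_count > allowed)
			return false;
	}
	return true;
}

/* Release path that leaves open_count elevated while the callback runs. */
static bool sketch_release(struct sketch_device *set, size_t n, size_t idx)
{
	bool reset = false;

	assert(idx < n);
	sketch_assert_open(&set[idx]);
	if (set[idx].open_count == 1)
		reset = sketch_needs_reset_last_close(set, n, idx);
	set[idx].open_count--;
	return reset;
}
```

In this model the assertion never fires during the callback, because the count is only dropped afterwards, and the reset decision is still reached when the closing device is the last user of the set.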
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 2d168793d4e1..7c3f1734fb35 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -763,8 +763,10 @@ static struct file *vfio_device_open(struct vfio_device *device)
 
 		if (device->ops->open_device) {
 			ret = device->ops->open_device(device);
-			if (ret)
-				goto err_undo_count;
+			if (ret) {
+				device->open_count--;
+				goto err_unlock;
+			}
 		}
 		vfio_device_container_register(device);
 		mutex_unlock(&device->group->group_lock);
@@ -801,14 +803,13 @@ static struct file *vfio_device_open(struct vfio_device *device)
 err_close_device:
 	mutex_lock(&device->dev_set->lock);
 	mutex_lock(&device->group->group_lock);
-	if (device->open_count == 1 && device->ops->close_device) {
+	if (!--device->open_count && device->ops->close_device) {
 		device->ops->close_device(device);
 
 		vfio_device_container_unregister(device);
 	}
-err_undo_count:
+err_unlock:
 	mutex_unlock(&device->group->group_lock);
-	device->open_count--;
 	if (device->open_count == 0 && device->kvm)
 		device->kvm = NULL;
 	mutex_unlock(&device->dev_set->lock);
@@ -1017,12 +1018,11 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
 	mutex_lock(&device->dev_set->lock);
 	vfio_assert_device_open(device);
 	mutex_lock(&device->group->group_lock);
-	if (device->open_count == 1 && device->ops->close_device)
+	if (!--device->open_count && device->ops->close_device)
 		device->ops->close_device(device);
 
 	vfio_device_container_unregister(device);
 	mutex_unlock(&device->group->group_lock);
-	device->open_count--;
 	if (device->open_count == 0)
 		device->kvm = NULL;
 	mutex_unlock(&device->dev_set->lock);
The implementation of close_device() for vfio-pci inspects the
open_count of every device in the device set to determine whether a
reset is needed. Unless open_count is decremented before invoking
close_device(), the device set will always contain a device with
open_count > 0, effectively disabling the reset logic.

After commit 2cd8b14aaa66 ("vfio/pci: Move to the device set
infrastructure"), failure to create a new file for a device would cause
the reset to be skipped when closing the device in the error path.

After commit eadd86f835c6 ("vfio: Remove calls to
vfio_group_add_container_user()"), releasing a device would always skip
the reset.

Failing to reset the device leaves it in an unknown state, potentially
causing errors when it is bound to a different driver.

This issue was observed with a Radeon RX Vega 56 [1002:687f] (rev c3)
assigned to a Windows guest. After shutting down the guest, unbinding
the device from vfio-pci, and binding the device to amdgpu:

[ 548.007102] [drm:psp_hw_start [amdgpu]] *ERROR* PSP create ring failed!
[ 548.027174] [drm:psp_hw_init [amdgpu]] *ERROR* PSP firmware loading failed
[ 548.027242] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block <psp> failed -22
[ 548.027306] amdgpu 0000:0a:00.0: amdgpu: amdgpu_device_ip_init failed
[ 548.027308] amdgpu 0000:0a:00.0: amdgpu: Fatal error during GPU init

Fixes: 2cd8b14aaa66 ("vfio/pci: Move to the device set infrastructure")
Fixes: eadd86f835c6 ("vfio: Remove calls to vfio_group_add_container_user()")
Signed-off-by: Anthony DeRossi <ajderossi@gmail.com>
---
 drivers/vfio/vfio_main.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)
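The ordering problem described in the commit message can be reproduced with a minimal user-space model. This is a simplified sketch, not the vfio-pci implementation: `model_device`, `model_needs_reset`, `model_close_count_after`, and `model_close_count_before` are hypothetical names, and the reset check is reduced to "no device in the set has open_count > 0".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct vfio_device; only open_count is modeled. */
struct model_device {
	unsigned int open_count;
};

/* Simplified reset check: allowed only when no device in the set is open. */
static bool model_needs_reset(const struct model_device *set, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (set[i].open_count > 0)
			return false;
	return true;
}

/*
 * Ordering before the patch: the callback runs while open_count is still 1,
 * so the check always finds an open device and never requests a reset.
 */
static bool model_close_count_after(struct model_device *set, size_t n,
				    size_t idx)
{
	bool reset = false;

	assert(idx < n && set[idx].open_count > 0);
	if (set[idx].open_count == 1)
		reset = model_needs_reset(set, n); /* stands in for close_device() */
	set[idx].open_count--;
	return reset;
}

/* Ordering proposed by the patch: decrement first, then run the callback. */
static bool model_close_count_before(struct model_device *set, size_t n,
				     size_t idx)
{
	bool reset = false;

	assert(idx < n && set[idx].open_count > 0);
	if (!--set[idx].open_count)
		reset = model_needs_reset(set, n); /* stands in for close_device() */
	return reset;
}
```

With a two-device set in which only the first device is open, `model_close_count_after()` reports no reset while `model_close_count_before()` reports one, which mirrors the behavior change the diff is aiming for.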