Message ID | 20221105224458.8180-4-ajderossi@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | vfio/pci: Check the device set open count on reset | expand |
On Sat, Nov 05, 2022 at 03:44:58PM -0700, Anthony DeRossi wrote: > vfio_pci_dev_set_needs_reset() inspects the open_count of every device > in the set to determine whether a reset is allowed. The current device > always has open_count == 1 within vfio_pci_core_disable(), effectively > disabling the reset logic. This field is also documented as private in > vfio_device, so it should not be used to determine whether other devices > in the set are open. > > Checking for vfio_device_set_open_count() > 1 on the device set fixes > both issues. > > After commit 2cd8b14aaa66 ("vfio/pci: Move to the device set > infrastructure"), failure to create a new file for a device would cause > the reset to be skipped due to open_count being decremented after > calling close_device() in the error path. > > After commit eadd86f835c6 ("vfio: Remove calls to > vfio_group_add_container_user()"), releasing a device would always skip > the reset due to an ordering change in vfio_device_fops_release(). > > Failing to reset the device leaves it in an unknown state, potentially > causing errors when it is accessed later or bound to a different driver. > > This issue was observed with a Radeon RX Vega 56 [1002:687f] (rev c3) > assigned to a Windows guest. After shutting down the guest, unbinding > the device from vfio-pci, and binding the device to amdgpu: > > [ 548.007102] [drm:psp_hw_start [amdgpu]] *ERROR* PSP create ring failed! > [ 548.027174] [drm:psp_hw_init [amdgpu]] *ERROR* PSP firmware loading failed > [ 548.027242] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block <psp> failed -22 > [ 548.027306] amdgpu 0000:0a:00.0: amdgpu: amdgpu_device_ip_init failed > [ 548.027308] amdgpu 0000:0a:00.0: amdgpu: Fatal error during GPU init > > Fixes: 2cd8b14aaa66 ("vfio/pci: Move to the device set infrastructure") > Fixes: eadd86f835c6 ("vfio: Remove calls to vfio_group_add_container_user()") > Signed-off-by: Anthony DeRossi <ajderossi@gmail.com> > --- > drivers/vfio/pci/vfio_pci_core.c | 10 +++++----- > 1 file changed, 5 insertions(+), 5 deletions(-) Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Jason
> From: Anthony DeRossi <ajderossi@gmail.com> > Sent: Sunday, November 6, 2022 6:45 AM > > vfio_pci_dev_set_needs_reset() inspects the open_count of every device > in the set to determine whether a reset is allowed. The current device > always has open_count == 1 within vfio_pci_core_disable(), effectively > disabling the reset logic. This field is also documented as private in > vfio_device, so it should not be used to determine whether other devices > in the set are open. > > Checking for vfio_device_set_open_count() > 1 on the device set fixes > both issues. > > After commit 2cd8b14aaa66 ("vfio/pci: Move to the device set > infrastructure"), failure to create a new file for a device would cause > the reset to be skipped due to open_count being decremented after > calling close_device() in the error path. > > After commit eadd86f835c6 ("vfio: Remove calls to > vfio_group_add_container_user()"), releasing a device would always skip > the reset due to an ordering change in vfio_device_fops_release(). > > Failing to reset the device leaves it in an unknown state, potentially > causing errors when it is accessed later or bound to a different driver. > > This issue was observed with a Radeon RX Vega 56 [1002:687f] (rev c3) > assigned to a Windows guest. After shutting down the guest, unbinding > the device from vfio-pci, and binding the device to amdgpu: > > [ 548.007102] [drm:psp_hw_start [amdgpu]] *ERROR* PSP create ring failed! > [ 548.027174] [drm:psp_hw_init [amdgpu]] *ERROR* PSP firmware loading > failed > [ 548.027242] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init > of IP block <psp> failed -22 > [ 548.027306] amdgpu 0000:0a:00.0: amdgpu: amdgpu_device_ip_init failed > [ 548.027308] amdgpu 0000:0a:00.0: amdgpu: Fatal error during GPU init > > Fixes: 2cd8b14aaa66 ("vfio/pci: Move to the device set infrastructure") > Fixes: eadd86f835c6 ("vfio: Remove calls to > vfio_group_add_container_user()") > Signed-off-by: Anthony DeRossi <ajderossi@gmail.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index badc9d828cac..e030c2120183 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -2488,12 +2488,12 @@ static bool vfio_pci_dev_set_needs_reset(struct vfio_device_set *dev_set) struct vfio_pci_core_device *cur; bool needs_reset = false; - list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) { - /* No VFIO device in the set can have an open device FD */ - if (cur->vdev.open_count) - return false; + /* No other VFIO device in the set can be open. */ + if (vfio_device_set_open_count(dev_set) > 1) + return false; + + list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) needs_reset |= cur->needs_reset; - } return needs_reset; }
vfio_pci_dev_set_needs_reset() inspects the open_count of every device in the set to determine whether a reset is allowed. The current device always has open_count == 1 within vfio_pci_core_disable(), effectively disabling the reset logic. This field is also documented as private in vfio_device, so it should not be used to determine whether other devices in the set are open. Checking for vfio_device_set_open_count() > 1 on the device set fixes both issues. After commit 2cd8b14aaa66 ("vfio/pci: Move to the device set infrastructure"), failure to create a new file for a device would cause the reset to be skipped due to open_count being decremented after calling close_device() in the error path. After commit eadd86f835c6 ("vfio: Remove calls to vfio_group_add_container_user()"), releasing a device would always skip the reset due to an ordering change in vfio_device_fops_release(). Failing to reset the device leaves it in an unknown state, potentially causing errors when it is accessed later or bound to a different driver. This issue was observed with a Radeon RX Vega 56 [1002:687f] (rev c3) assigned to a Windows guest. After shutting down the guest, unbinding the device from vfio-pci, and binding the device to amdgpu: [ 548.007102] [drm:psp_hw_start [amdgpu]] *ERROR* PSP create ring failed! [ 548.027174] [drm:psp_hw_init [amdgpu]] *ERROR* PSP firmware loading failed [ 548.027242] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block <psp> failed -22 [ 548.027306] amdgpu 0000:0a:00.0: amdgpu: amdgpu_device_ip_init failed [ 548.027308] amdgpu 0000:0a:00.0: amdgpu: Fatal error during GPU init Fixes: 2cd8b14aaa66 ("vfio/pci: Move to the device set infrastructure") Fixes: eadd86f835c6 ("vfio: Remove calls to vfio_group_add_container_user()") Signed-off-by: Anthony DeRossi <ajderossi@gmail.com> --- drivers/vfio/pci/vfio_pci_core.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)