
[v6,12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET

Message ID 20230308132903.465159-13-yi.l.liu@intel.com (mailing list archive)
State New, archived
Series cover-letter: Add vfio_device cdev for iommufd support

Commit Message

Yi Liu March 8, 2023, 1:28 p.m. UTC
This is another method to issue a PCI hot reset for users that have
bound the device to a positive iommufd value. In that case the iommufd
is proof of device ownership. By passing a zero-length fd array, the
user asks the kernel to perform the ownership check against the bound
iommufd instead. All opened devices within the affected dev_set must
have been bound to the same iommufd. This is simpler and faster: the
user does not need to pass a set of fds, and the kernel does not need
to search for each device within the given fds.

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/device.c   |  6 +++
 drivers/vfio/iommufd.c           |  9 ++++
 drivers/vfio/pci/vfio_pci_core.c | 92 ++++++++++++++++++++++----------
 include/linux/iommufd.h          |  1 +
 include/linux/vfio.h             |  3 ++
 include/uapi/linux/vfio.h        |  5 ++
 6 files changed, 89 insertions(+), 27 deletions(-)
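
For illustration, a minimal userspace sketch of the zero-length
variant described above. It assumes a device fd opened through the
vfio cdev and already bound to an iommufd (as introduced earlier in
this series); apart from the count == 0 convention added by this
patch it only relies on the existing vfio uAPI, and the function name
is made up for the example.

#include <sys/ioctl.h>
#include <linux/vfio.h>

/*
 * Issue a PCI hot reset using the zero-length fd array.  Ownership is
 * proven by the iommufd_ctx the device fd is already bound to, so no
 * group or device fds need to be marshalled; the kernel checks that
 * every opened device in the affected dev_set is bound to the same
 * iommufd.
 */
static int hot_reset_via_iommufd(int device_fd)
{
	struct vfio_pci_hot_reset reset = {
		.argsz = sizeof(reset),
		.count = 0,	/* zero-length fd array */
	};

	return ioctl(device_fd, VFIO_DEVICE_PCI_HOT_RESET, &reset);
}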

Comments

Tian, Kevin March 10, 2023, 5:31 a.m. UTC | #1
> From: Yi Liu
> Sent: Wednesday, March 8, 2023 9:29 PM
> 
> This is another method to issue PCI hot reset for the users that bounds
> device to a positive iommufd value. In such case, iommufd is a proof of
> device ownership. By passing a zero-length fd array, user indicates kernel
> to do ownership check with the bound iommufd. All the opened devices
> within
> the affected dev_set should have been bound to the same iommufd. This is
> simpler and faster as user does not need to pass a set of fds and kernel
> no need to search the device within the given fds.
> 
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>

I think you also need a s-o-b from Jason since he wrote most of the
code here.

> +struct iommufd_ctx *vfio_iommufd_physical_ctx(struct vfio_device *vdev)
> +{
> +	/* Only serve for physical device */
> +	if (!vdev->iommufd_device)
> +		return NULL;

pointless comment

> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -682,6 +682,11 @@ struct vfio_pci_hot_reset_info {
>   * The ownership can be proved by:
>   *   - An array of group fds
>   *   - An array of device fds
> + *   - A zero-length array
> + *
> + * In the last case all affected devices which are opened by this user
> + * must have been bound to a same iommufd_ctx.  This approach is only
> + * available for devices bound to positive iommufd.

As we chatted before I still think the last sentence is pointless. If
a device is bound to a negative iommufd value (i.e. noiommu) it
doesn't have a valid iommufd_ctx, so it won't meet "bound to a same
iommufd_ctx".
Yi Liu March 10, 2023, 6:04 a.m. UTC | #2
> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Friday, March 10, 2023 1:31 PM
> 
> > From: Yi Liu
> > Sent: Wednesday, March 8, 2023 9:29 PM
> >
> > This is another method to issue PCI hot reset for the users that bounds
> > device to a positive iommufd value. In such case, iommufd is a proof of
> > device ownership. By passing a zero-length fd array, user indicates kernel
> > to do ownership check with the bound iommufd. All the opened devices
> > within
> > the affected dev_set should have been bound to the same iommufd. This
> is
> > simpler and faster as user does not need to pass a set of fds and kernel
> > no need to search the device within the given fds.
> >
> > Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> 
> I think you also need a s-o-b from Jason since he wrote most of the
> code here.

Yes, it is. I'll add it if no objection from Jason.

> > +struct iommufd_ctx *vfio_iommufd_physical_ctx(struct vfio_device
> *vdev)
> > +{
> > +	/* Only serve for physical device */
> > +	if (!vdev->iommufd_device)
> > +		return NULL;
> 
> pointless comment

Will remove it.

> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -682,6 +682,11 @@ struct vfio_pci_hot_reset_info {
> >   * The ownership can be proved by:
> >   *   - An array of group fds
> >   *   - An array of device fds
> > + *   - A zero-length array
> > + *
> > + * In the last case all affected devices which are opened by this user
> > + * must have been bound to a same iommufd_ctx.  This approach is only
> > + * available for devices bound to positive iommufd.
> 
> As we chatted before I still think the last sentence is pointless. If a device
> is bound to negative iommufd value (i.e. noiommu) it doesn't have a
> valid iommufd_ctx so won't meet "bound to a same iommufd_ctx".

Yes, it is. But iommufd_ctx is more of a kernel thing; userspace may
just know whether it has bound a positive or a negative iommufd to the
device. So "positive iommufd" may be more straightforward to userspace
programmers.
Tian, Kevin March 10, 2023, 9:08 a.m. UTC | #3
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Friday, March 10, 2023 2:04 PM
> > > + *
> > > + * In the last case all affected devices which are opened by this user
> > > + * must have been bound to a same iommufd_ctx.  This approach is only
> > > + * available for devices bound to positive iommufd.
> >
> > As we chatted before I still think the last sentence is pointless. If a device
> > is bound to negative iommufd value (i.e. noiommu) it doesn't have a
> > valid iommufd_ctx so won't meet "bound to a same iommufd_ctx".
> 
> Yes, it is. But iommufd_ctx is more a kernel thing, userspace may just
> know whether it has bound a positive iommufd or a negative iommufd
> to the device. So positive iommufd may be more straightforward to
> userspace programmers.
Jason Gunthorpe March 10, 2023, 5:42 p.m. UTC | #4
On Fri, Mar 10, 2023 at 06:04:02AM +0000, Liu, Yi L wrote:
> > From: Tian, Kevin <kevin.tian@intel.com>
> > Sent: Friday, March 10, 2023 1:31 PM
> > 
> > > From: Yi Liu
> > > Sent: Wednesday, March 8, 2023 9:29 PM
> > >
> > > This is another method to issue PCI hot reset for the users that bounds
> > > device to a positive iommufd value. In such case, iommufd is a proof of
> > > device ownership. By passing a zero-length fd array, user indicates kernel
> > > to do ownership check with the bound iommufd. All the opened devices
> > > within
> > > the affected dev_set should have been bound to the same iommufd. This
> > is
> > > simpler and faster as user does not need to pass a set of fds and kernel
> > > no need to search the device within the given fds.
> > >
> > > Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> > > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> > 
> > I think you also need a s-o-b from Jason since he wrote most of the
> > code here.
> 
> Yes, it is. I'll add it if no objection from Jason.

Go ahead

Jason
Alex Williamson March 15, 2023, 10:53 p.m. UTC | #5
On Wed,  8 Mar 2023 05:28:51 -0800
Yi Liu <yi.l.liu@intel.com> wrote:

> This is another method to issue PCI hot reset for the users that bounds
> device to a positive iommufd value. In such case, iommufd is a proof of
> device ownership. By passing a zero-length fd array, user indicates kernel
> to do ownership check with the bound iommufd. All the opened devices within
> the affected dev_set should have been bound to the same iommufd. This is
> simpler and faster as user does not need to pass a set of fds and kernel
> no need to search the device within the given fds.

Couldn't this same idea apply to containers?

I'm afraid this proposal reduces or eliminates the handshake we have
with userspace between VFIO_DEVICE_GET_PCI_HOT_RESET_INFO and
VFIO_DEVICE_PCI_HOT_RESET, which could promote userspace to ignore the
_INFO ioctl altogether, resulting in drivers that don't understand the
scope of the reset.  Is it worth it?  What do we really gain?

> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index d80141969cd1..382d95455f89 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -682,6 +682,11 @@ struct vfio_pci_hot_reset_info {
>   * The ownership can be proved by:
>   *   - An array of group fds
>   *   - An array of device fds
> + *   - A zero-length array
> + *
> + * In the last case all affected devices which are opened by this user
> + * must have been bound to a same iommufd_ctx.  This approach is only
> + * available for devices bound to positive iommufd.
>   *
>   * Return: 0 on success, -errno on failure.
>   */

There's no introspection that this feature is supported, is that why
containers are not considered?  ie. if the host supports vfio cdevs, it
necessarily must support vfio-pci hot reset w/ a zero-length array?
Thanks,

Alex
Tian, Kevin March 15, 2023, 11:31 p.m. UTC | #6
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Thursday, March 16, 2023 6:53 AM
> 
> On Wed,  8 Mar 2023 05:28:51 -0800
> Yi Liu <yi.l.liu@intel.com> wrote:
> 
> > This is another method to issue PCI hot reset for the users that bounds
> > device to a positive iommufd value. In such case, iommufd is a proof of
> > device ownership. By passing a zero-length fd array, user indicates kernel
> > to do ownership check with the bound iommufd. All the opened devices
> within
> > the affected dev_set should have been bound to the same iommufd. This is
> > simpler and faster as user does not need to pass a set of fds and kernel
> > no need to search the device within the given fds.
> 
> Couldn't this same idea apply to containers?

User is allowed to create multiple containers, and it looks like we
don't have a way to check whether multiple containers belong to the
same user today.

> 
> I'm afraid this proposal reduces or eliminates the handshake we have
> with userspace between VFIO_DEVICE_GET_PCI_HOT_RESET_INFO and
> VFIO_DEVICE_PCI_HOT_RESET, which could promote userspace to ignore the
> _INFO ioctl altogether, resulting in drivers that don't understand the
> scope of the reset.  Is it worth it?  What do we really gain?

Jason raised the concern of whether GET_PCI_HOT_RESET_INFO is actually
useful today.

It's an interface on an opened device, so the tiny difference is
whether the user learns that the device is resettable when calling
GET_INFO or later when actually calling PCI_HOT_RESET.

And with this series we also allow the reset when affected devices are
not opened. Such dynamics cannot be reflected in a static GET_INFO; it
is more suited to a try-and-fail style.


> 
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index d80141969cd1..382d95455f89 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -682,6 +682,11 @@ struct vfio_pci_hot_reset_info {
> >   * The ownership can be proved by:
> >   *   - An array of group fds
> >   *   - An array of device fds
> > + *   - A zero-length array
> > + *
> > + * In the last case all affected devices which are opened by this user
> > + * must have been bound to a same iommufd_ctx.  This approach is only
> > + * available for devices bound to positive iommufd.
> >   *
> >   * Return: 0 on success, -errno on failure.
> >   */
> 
> There's no introspection that this feature is supported, is that why
> containers are not considered?  ie. if the host supports vfio cdevs, it
> necessarily must support vfio-pci hot reset w/ a zero-length array?
> Thanks,
> 

Yes. It's more for users who know that iommufd is used.
Yi Liu March 16, 2023, 3:54 a.m. UTC | #7
> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Thursday, March 16, 2023 7:31 AM
> 
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Thursday, March 16, 2023 6:53 AM
> >
> > On Wed,  8 Mar 2023 05:28:51 -0800
> > Yi Liu <yi.l.liu@intel.com> wrote:
> >
> > > This is another method to issue PCI hot reset for the users that bounds
> > > device to a positive iommufd value. In such case, iommufd is a proof of
> > > device ownership. By passing a zero-length fd array, user indicates
> kernel
> > > to do ownership check with the bound iommufd. All the opened devices
> > within
> > > the affected dev_set should have been bound to the same iommufd.
> This is
> > > simpler and faster as user does not need to pass a set of fds and kernel
> > > no need to search the device within the given fds.
> >
> > Couldn't this same idea apply to containers?
> 
> User is allowed to create multiple containers. Looks we don't have a way
> to check whether multiple containers belong to the same user today.

Hi Kevin,

This reminds me: in compat mode the container fd is actually an
iommufd. If compat mode passes a zero-length array to do the reset, it
is possible that the opened devices in the affected dev_set are set to
different containers (a.k.a. iommufd_ctxs). This would break what we
defined in the uapi. So a better description is that users that use
cdev can use this zero-length approach. Also, in the kernel, we need
to check whether this approach can be abused by the compat mode.

> >
> > I'm afraid this proposal reduces or eliminates the handshake we have
> > with userspace between VFIO_DEVICE_GET_PCI_HOT_RESET_INFO and
> > VFIO_DEVICE_PCI_HOT_RESET, which could promote userspace to ignore
> the
> > _INFO ioctl altogether, resulting in drivers that don't understand the
> > scope of the reset.  Is it worth it?  What do we really gain?
> 
> Jason raised the concern whether GET_PCI_HOT_RESET_INFO is actually
> useful today.
> 
> It's an interface on opened device. So the tiny difference is whether the
> user knows the device is resettable when calling GET_INFO or later when
> actually calling PCI_HOT_RESET.
> 
> and with this series we also allow reset on affected devices which are not
> opened. Such dynamic cannot be reflected in static GET_INFO. More
> suitable a try-and-fail style.

Got the usage of zero-length.
> 
> >
> > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > index d80141969cd1..382d95455f89 100644
> > > --- a/include/uapi/linux/vfio.h
> > > +++ b/include/uapi/linux/vfio.h
> > > @@ -682,6 +682,11 @@ struct vfio_pci_hot_reset_info {
> > >   * The ownership can be proved by:
> > >   *   - An array of group fds
> > >   *   - An array of device fds
> > > + *   - A zero-length array
> > > + *
> > > + * In the last case all affected devices which are opened by this user
> > > + * must have been bound to a same iommufd_ctx.  This approach is
> only
> > > + * available for devices bound to positive iommufd.
> > >   *
> > >   * Return: 0 on success, -errno on failure.
> > >   */
> >
> > There's no introspection that this feature is supported, is that why
> > containers are not considered?  ie. if the host supports vfio cdevs, it
> > necessarily must support vfio-pci hot reset w/ a zero-length array?
> > Thanks,
> >
> 
> yes. It's more for users who knows that iommufd is used.

Needs to be more accurate: only users that use cdev.
Tian, Kevin March 16, 2023, 6:09 a.m. UTC | #8
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Thursday, March 16, 2023 11:55 AM
> 
> > From: Tian, Kevin <kevin.tian@intel.com>
> > Sent: Thursday, March 16, 2023 7:31 AM
> >
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Thursday, March 16, 2023 6:53 AM
> > >
> > > On Wed,  8 Mar 2023 05:28:51 -0800
> > > Yi Liu <yi.l.liu@intel.com> wrote:
> > >
> > > > This is another method to issue PCI hot reset for the users that bounds
> > > > device to a positive iommufd value. In such case, iommufd is a proof of
> > > > device ownership. By passing a zero-length fd array, user indicates
> > kernel
> > > > to do ownership check with the bound iommufd. All the opened devices
> > > within
> > > > the affected dev_set should have been bound to the same iommufd.
> > This is
> > > > simpler and faster as user does not need to pass a set of fds and kernel
> > > > no need to search the device within the given fds.
> > >
> > > Couldn't this same idea apply to containers?
> >
> > User is allowed to create multiple containers. Looks we don't have a way
> > to check whether multiple containers belong to the same user today.
> 
> Hi Kevin,
> 
> This reminds me. In the compat mode, container fd is actually iommufd.
> If the compat mode passes a zeror-length array to do reset, it is possible
> that the opened devices in this affected dev_set may be set to different
> containers (a.k.a. iommufd_ctx). This would break what we defined in
> uapi. So a better description is users that use cdev can use this zero-length
> approach. And also, in kernel, we need to check if this approach is abused
> by the compat mode.
> 

In the normal case a legacy application uses the group fd array and a
new application with cdev uses the zero-length approach.

In a rare case, an application which opens /dev/iommu but opts to use
it as a container in compat mode can also use the zero-length array to
reset, if all devices are attached to a single container (internally,
to the same iommufd_ctx). It still more or less matches the uAPI
description.

I'm not sure whether we want to add an explicit check to prevent it.

Of course, if the affected devices span multiple compat iommufds then
it will fail.

The open question Alex raised is whether we want to further extend
this to the legacy container if all affected devices are in one
container. But I hesitate to do so since iommufd is the future, and if
an application can be rewritten to utilize the zero-length reset then
it probably should explicitly embrace iommufd instead.

Anyway let's not wait here. Send your v7 and we can have more
focused discussion in your split series about hot reset.

Thanks
Kevin
Yi Liu March 16, 2023, 6:28 a.m. UTC | #9
> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Thursday, March 16, 2023 2:10 PM
> 
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Thursday, March 16, 2023 11:55 AM
> >
> > > From: Tian, Kevin <kevin.tian@intel.com>
> > > Sent: Thursday, March 16, 2023 7:31 AM
> > >
> > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > Sent: Thursday, March 16, 2023 6:53 AM
> > > >
> > > > On Wed,  8 Mar 2023 05:28:51 -0800
> > > > Yi Liu <yi.l.liu@intel.com> wrote:
> > > >
> > > > > This is another method to issue PCI hot reset for the users that
> bounds
> > > > > device to a positive iommufd value. In such case, iommufd is a proof
> of
> > > > > device ownership. By passing a zero-length fd array, user indicates
> > > kernel
> > > > > to do ownership check with the bound iommufd. All the opened
> devices
> > > > within
> > > > > the affected dev_set should have been bound to the same iommufd.
> > > This is
> > > > > simpler and faster as user does not need to pass a set of fds and
> kernel
> > > > > no need to search the device within the given fds.
> > > >
> > > > Couldn't this same idea apply to containers?
> > >
> > > User is allowed to create multiple containers. Looks we don't have a way
> > > to check whether multiple containers belong to the same user today.
> >
> > Hi Kevin,
> >
> > This reminds me. In the compat mode, container fd is actually iommufd.
> > If the compat mode passes a zeror-length array to do reset, it is possible
> > that the opened devices in this affected dev_set may be set to different
> > containers (a.k.a. iommufd_ctx). This would break what we defined in
> > uapi. So a better description is users that use cdev can use this zero-length
> > approach. And also, in kernel, we need to check if this approach is abused
> > by the compat mode.
> >
> 
> In normal case legacy application uses group fd array and new application
> with cdev uses zero-length approach.
> 
> In rare case an application which opens /dev/iommu but opts to use it
> as a container in compat mode can also use zero-length array to reset
> if all devices are attached to a single container (internally to a same
> iommufd_ctx). It's still kind of matching uAPI description.
>
> I'm not sure whether we want to add explicit check to prevent it.
>
> Of course if affected devices span multiple compat iommufd's then
> it will fail.

Yes, this failure matches the uapi description. And it is a rare case;
most likely applications should be explicitly updated to use cdev and
then use this zero-length approach. Before that, legacy applications
do not know about it at all. Even if one uses this approach by
mistake, it will fail in the multiple compat iommufd case.

So maybe there is no need to limit it.

> The open Alex raised is whether we want to further extend it to
> legacy container if all affected devices are in one container. But
> I hesitate to do so since iommufd is the future and if an application
> can be rewritten to utilize zero-length reset then it probably
> should explicitly embrace iommufd instead.

For this, I agree with you.

> 
> Anyway let's not wait here. Send your v7 and we can have more
> focused discussion in your split series about hot reset.

Sure. Once Nicolin's patch is updated, I can send v7 with the hot
reset series as well.

Regards,
Yi Liu
Nicolin Chen March 16, 2023, 6:49 a.m. UTC | #10
On Thu, Mar 16, 2023 at 06:28:28AM +0000, Liu, Yi L wrote:

> > Anyway let's not wait here. Send your v7 and we can have more
> > focused discussion in your split series about hot reset.
> 
> Sure. Once Nicolin's patch is updated, I can send v7 with the hot
> reset series as well.

I've updated three commits and pushed here:
https://github.com/nicolinc/iommufd/commits/wip/iommufd_nesting-03152023

Please pull the following commit to the emulated series:
  "iommufd: Create access in vfio_iommufd_emulated_bind()"
  https://github.com/nicolinc/iommufd/commit/6467e332584de62d1c4d5daab404a8c8d5a90a2d

Please pull the following commit to the cdev series or a place
that you feel it'd be better -- it's required by the change of
adding vfio_iommufd_emulated_detach_ioas():
  "iommufd/device: Add iommufd_access_detach() API"
  https://github.com/nicolinc/iommufd/commit/86346b5d06100640037cbb4a14bd249476072dec

The other one adding replace() will go with the replace series.

And regarding the new baseline for the replace series and the
nesting series, it'd be nicer to have another one git-merging
your cdev v7 branch on top of Jason's iommufd_hwpt branch. We
could wait for him to update to 6.3-rc2, if that's necessary.

Thanks
Nic
Yi Liu March 16, 2023, 1:22 p.m. UTC | #11
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Thursday, March 16, 2023 2:49 PM
> 
> On Thu, Mar 16, 2023 at 06:28:28AM +0000, Liu, Yi L wrote:
> 
> > > Anyway let's not wait here. Send your v7 and we can have more
> > > focused discussion in your split series about hot reset.
> >
> > Sure. Once Nicolin's patch is updated, I can send v7 with the hot
> > reset series as well.
> 
> I've updated three commits and pushed here:
> https://github.com/nicolinc/iommufd/commits/wip/iommufd_nesting-
> 03152023
> 
> Please pull the following commit to the emulated series:
>   "iommufd: Create access in vfio_iommufd_emulated_bind()"
> 
> https://github.com/nicolinc/iommufd/commit/6467e332584de62d1c4d5daa
> b404a8c8d5a90a2d
> 
> Please pull the following commit to the cdev series or a place
> that you feel it'd be better -- it's required by the change of
> adding vfio_iommufd_emulated_detach_ioas():
>   "iommufd/device: Add iommufd_access_detach() API"
> 
> https://github.com/nicolinc/iommufd/commit/86346b5d06100640037cbb4a
> 14bd249476072dec

Thanks, I've taken them. v7 was sent out.

> The other one adding replace() will go with the replace series.
> 
> And regarding the new baseline for the replace series and the
> nesting series, it'd be nicer to have another one git-merging
> your cdev v7 branch on top of Jason's iommufd_hwpt branch. We
> could wait for him updating to 6.3-rc2, if that's necessary.

Yes. I cherry-picked his iommufd_hwpt onto 6.3-rc2, then did a merge,
and then cherry-picked the replace and nesting series from your branch
above. Though the order between cdev and iommufd_hwpt is not perfect,
we may use it as a wip baseline while we try to address the comments
w.r.t. the nesting and replace series.

https://github.com/yiliu1765/iommufd/tree/wip/iommufd_nesting-03162023

Regards,
Yi Liu
Alex Williamson March 16, 2023, 6:45 p.m. UTC | #12
On Wed, 15 Mar 2023 23:31:23 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Thursday, March 16, 2023 6:53 AM
> > 
> > On Wed,  8 Mar 2023 05:28:51 -0800
> > Yi Liu <yi.l.liu@intel.com> wrote:
> >   
> > > This is another method to issue PCI hot reset for the users that bounds
> > > device to a positive iommufd value. In such case, iommufd is a proof of
> > > device ownership. By passing a zero-length fd array, user indicates kernel
> > > to do ownership check with the bound iommufd. All the opened devices  
> > within  
> > > the affected dev_set should have been bound to the same iommufd. This is
> > > simpler and faster as user does not need to pass a set of fds and kernel
> > > no need to search the device within the given fds.  
> > 
> > Couldn't this same idea apply to containers?  
> 
> User is allowed to create multiple containers. Looks we don't have a way
> to check whether multiple containers belong to the same user today.

No, but a common configuration is that all devices are in the same
container, ie. no-vIOMMU, and it's also rather common that when we have
multi-function devices, all functions are within the same IOMMU group
and therefore necessarily require the same address space and therefore
container.

> > I'm afraid this proposal reduces or eliminates the handshake we have
> > with userspace between VFIO_DEVICE_GET_PCI_HOT_RESET_INFO and
> > VFIO_DEVICE_PCI_HOT_RESET, which could promote userspace to ignore the
> > _INFO ioctl altogether, resulting in drivers that don't understand the
> > scope of the reset.  Is it worth it?  What do we really gain?  
> 
> Jason raised the concern whether GET_PCI_HOT_RESET_INFO is actually
> useful today.
> 
> It's an interface on opened device. So the tiny difference is whether the
> user knows the device is resettable when calling GET_INFO or later when
> actually calling PCI_HOT_RESET.

No, GET_PCI_HOT_RESET_INFO conveys not only whether a PCI_HOT_RESET can
be performed, but equally important the scope of the reset, ie. which
devices are affected by the reset.  If we de-emphasize the INFO
portion, then this easily gets confused as just a variant of
VFIO_DEVICE_RESET, which is explicitly a device-level scope reset.  In
fact, I'd say the interface is not only trying to validate that the
user has sufficient privileges for the reset, but that they explicitly
acknowledge the scope of the reset.

> and with this series we also allow reset on affected devices which are not
> opened. Such dynamic cannot be reflected in static GET_INFO. More
> suitable a try-and-fail style.

Resets have side-effects, obviously, so this isn't the sort of thing we
can simply ask the user to probe for.  I agree that dynamics can
change; GET_PCI_HOT_RESET_INFO is a point in time, and isolated
functions on the same bus can change ownership.  However, in practice,
and in its primary use case with GPUs without isolation, it's
sufficiently static.  So I think this is a mischaracterization.  Thanks,

Alex
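
As a point of reference for the handshake described above, a rough
sketch of the existing two-step flow: query the affected set with
VFIO_DEVICE_GET_PCI_HOT_RESET_INFO, then acknowledge it by passing the
owning group fds to VFIO_DEVICE_PCI_HOT_RESET.  Error handling is
trimmed and resolve_group_fd() is a hypothetical helper that maps a
reported group id to a group fd the caller already owns.

#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

extern int resolve_group_fd(__u32 group_id);	/* hypothetical helper */

static int hot_reset_with_scope(int device_fd)
{
	struct vfio_pci_hot_reset_info hdr = { .argsz = sizeof(hdr) };
	struct vfio_pci_hot_reset_info *info;
	struct vfio_pci_hot_reset *reset;
	size_t info_sz, reset_sz;
	__u32 i;
	int ret;

	/* First call fills .count (and fails with ENOSPC), telling the
	 * caller how many devices share the reset scope. */
	ioctl(device_fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO, &hdr);

	info_sz = sizeof(*info) +
		  hdr.count * sizeof(struct vfio_pci_dependent_device);
	info = calloc(1, info_sz);
	info->argsz = info_sz;
	ret = ioctl(device_fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO, info);
	if (ret)
		goto out_info;

	/* Acknowledge the scope: one owned group fd per affected device
	 * (a real implementation would deduplicate group ids). */
	reset_sz = sizeof(*reset) + info->count * sizeof(__s32);
	reset = calloc(1, reset_sz);
	reset->argsz = reset_sz;
	reset->count = info->count;
	for (i = 0; i < info->count; i++)
		reset->group_fds[i] =
			resolve_group_fd(info->devices[i].group_id);

	ret = ioctl(device_fd, VFIO_DEVICE_PCI_HOT_RESET, reset);
	free(reset);
out_info:
	free(info);
	return ret;
}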
Nicolin Chen March 16, 2023, 9:27 p.m. UTC | #13
On Thu, Mar 16, 2023 at 01:22:58PM +0000, Liu, Yi L wrote:

> > And regarding the new baseline for the replace series and the
> > nesting series, it'd be nicer to have another one git-merging
> > your cdev v7 branch on top of Jason's iommufd_hwpt branch. We
> > could wait for him updating to 6.3-rc2, if that's necessary.
> 
> Yes. I cherry-pick his iommufd_hwpt to 6.3-rc2 and then try a
> merge and then cherry-pick the replace and nesting series from
> your above branch. Though the order between cdev and
> iommufd_hwpt not perfect, we may use it as a wip baseline
> when we try to address the comments w.r.t. nesting and
> replace series.
> 
> https://github.com/yiliu1765/iommufd/tree/wip/iommufd_nesting-03162023

Nice. It looks like you integrated everything from my tree so
it saves me some effort :)

Regarding the order between cdev and iommufd_hwpt, I think it
would be Jason's decision whether to merge his changes prior
to the PR from the VFIO tree, or the other way around.

One way or another, I think the replace v5 and the nesting v2
will be less impacted, unless Jason makes some huge changes
to his branch. Let's use this tree this week to rework both
series, and rebase after he comes back and updates his tree.

Lemme know if you need help with the nesting series or so.

Thanks
Nic
Tian, Kevin March 16, 2023, 11:29 p.m. UTC | #14
> From: Alex Williamson
> Sent: Friday, March 17, 2023 2:46 AM
> 
> On Wed, 15 Mar 2023 23:31:23 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Thursday, March 16, 2023 6:53 AM
> > > I'm afraid this proposal reduces or eliminates the handshake we have
> > > with userspace between VFIO_DEVICE_GET_PCI_HOT_RESET_INFO and
> > > VFIO_DEVICE_PCI_HOT_RESET, which could promote userspace to ignore
> the
> > > _INFO ioctl altogether, resulting in drivers that don't understand the
> > > scope of the reset.  Is it worth it?  What do we really gain?
> >
> > Jason raised the concern whether GET_PCI_HOT_RESET_INFO is actually
> > useful today.
> >
> > It's an interface on opened device. So the tiny difference is whether the
> > user knows the device is resettable when calling GET_INFO or later when
> > actually calling PCI_HOT_RESET.
> 
> No, GET_PCI_HOT_RESET_INFO conveys not only whether a PCI_HOT_RESET
> can
> be performed, but equally important the scope of the reset, ie. which
> devices are affected by the reset.  If we de-emphasize the INFO
> portion, then this easily gets confused as just a variant of
> VFIO_DEVICE_RESET, which is explicitly a device-level cscope reset.  In
> fact, I'd say the interface is not only trying to validate that the
> user has sufficient privileges for the reset, but that they explicitly
> acknowledge the scope of the reset.
> 

IMHO the usefulness of the scope is if it's discoverable by the
management stack, which can then try to assign devices with an
affected reset to the same user.

But this info is only available after the device is opened. So the
mgmt. stack just assigns devices w/o awareness of the reset scope, and
nothing can be changed by the user no matter whether it knows the
scope or not.

From this angle I don't see the value of probe-and-reset vs. direct
reset when the reset itself also takes the scope into consideration.
Probably the slight difference is that with the probe a more
informative error message can be printed out?
Alex Williamson March 17, 2023, 12:22 a.m. UTC | #15
On Thu, 16 Mar 2023 23:29:21 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson
> > Sent: Friday, March 17, 2023 2:46 AM
> > 
> > On Wed, 15 Mar 2023 23:31:23 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > Sent: Thursday, March 16, 2023 6:53 AM
> > > > I'm afraid this proposal reduces or eliminates the handshake we have
> > > > with userspace between VFIO_DEVICE_GET_PCI_HOT_RESET_INFO and
> > > > VFIO_DEVICE_PCI_HOT_RESET, which could promote userspace to ignore  
> > the  
> > > > _INFO ioctl altogether, resulting in drivers that don't understand the
> > > > scope of the reset.  Is it worth it?  What do we really gain?  
> > >
> > > Jason raised the concern whether GET_PCI_HOT_RESET_INFO is actually
> > > useful today.
> > >
> > > It's an interface on opened device. So the tiny difference is whether the
> > > user knows the device is resettable when calling GET_INFO or later when
> > > actually calling PCI_HOT_RESET.  
> > 
> > No, GET_PCI_HOT_RESET_INFO conveys not only whether a PCI_HOT_RESET
> > can
> > be performed, but equally important the scope of the reset, ie. which
> > devices are affected by the reset.  If we de-emphasize the INFO
> > portion, then this easily gets confused as just a variant of
> > VFIO_DEVICE_RESET, which is explicitly a device-level cscope reset.  In
> > fact, I'd say the interface is not only trying to validate that the
> > user has sufficient privileges for the reset, but that they explicitly
> > acknowledge the scope of the reset.
> >   
> 
> IMHO the usefulness of scope is if it's discoverable by the management
> stack which then can try to assign devices with affected reset to a same
> user.

Disagree, the user needs to know the scope of reset.  Take for
instance two functions of a device configured onto separate buses
within a VM.
The VMM needs to know that a hot-reset of one will reset the other.
That's not obvious to the VMM without some understanding of PCI/e
topology and analysis of the host system.  The info ioctl simplifies
that discovery for the VMM and the handshake of passing the affected
groups makes sure that the info ioctl remains relevant.

OTOH, I really haven't seen any evidence that the null-array concept
provides significant simplification for userspace, especially without
compromising the user's understanding of the scope of the provided
reset.  Thanks,

Alex
Tian, Kevin March 17, 2023, 12:57 a.m. UTC | #16
> From: Alex Williamson
> Sent: Friday, March 17, 2023 8:23 AM
> 
> On Thu, 16 Mar 2023 23:29:21 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson
> > > Sent: Friday, March 17, 2023 2:46 AM
> > >
> > > On Wed, 15 Mar 2023 23:31:23 +0000
> > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > >
> > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > Sent: Thursday, March 16, 2023 6:53 AM
> > > > > I'm afraid this proposal reduces or eliminates the handshake we have
> > > > > with userspace between VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
> and
> > > > > VFIO_DEVICE_PCI_HOT_RESET, which could promote userspace to
> ignore
> > > the
> > > > > _INFO ioctl altogether, resulting in drivers that don't understand the
> > > > > scope of the reset.  Is it worth it?  What do we really gain?
> > > >
> > > > Jason raised the concern whether GET_PCI_HOT_RESET_INFO is actually
> > > > useful today.
> > > >
> > > > It's an interface on opened device. So the tiny difference is whether the
> > > > user knows the device is resettable when calling GET_INFO or later
> when
> > > > actually calling PCI_HOT_RESET.
> > >
> > > No, GET_PCI_HOT_RESET_INFO conveys not only whether a
> PCI_HOT_RESET
> > > can
> > > be performed, but equally important the scope of the reset, ie. which
> > > devices are affected by the reset.  If we de-emphasize the INFO
> > > portion, then this easily gets confused as just a variant of
> > > VFIO_DEVICE_RESET, which is explicitly a device-level cscope reset.  In
> > > fact, I'd say the interface is not only trying to validate that the
> > > user has sufficient privileges for the reset, but that they explicitly
> > > acknowledge the scope of the reset.
> > >
> >
> > IMHO the usefulness of scope is if it's discoverable by the management
> > stack which then can try to assign devices with affected reset to a same
> > user.
> 
> Disagree, the user needs to know the scope of reset.  Take for instance
> two function of a device configured onto separate buses within a VM.
> The VMM needs to know that a hot-reset of one will reset the other.
> That's not obvious to the VMM without some understanding of PCI/e
> topology and analysis of the host system.  The info ioctl simplifies
> that discovery for the VMM and the handshake of passing the affected
> groups makes sure that the info ioctl remains relevant.

If that is the intended usage then I don't see why this proposal would
promote userspace to ignore the _INFO ioctl. It should always be
queried no matter how the reset ioctl itself is designed. The
motivation for calling _INFO does not come from the reset ioctl asking
for an array of fds.

> 
> OTOH, I really haven't seen any evidence that the null-array concept
> provides significant simplification for userspace, especially without
> compromising the user's understanding of the scope of the provided
> reset.  Thanks,
> 

I'll let Jason comment further after he is back.

The bottom line, if this cannot be converged on in a short time, is to
move it out of the preparatory series for cdev. There is no reason to
block cdev just for this open issue. Anyway we'll allow using the
device fd array for cdev, so there is no functional gap.
Alex Williamson March 17, 2023, 3:15 p.m. UTC | #17
On Fri, 17 Mar 2023 00:57:23 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson
> > Sent: Friday, March 17, 2023 8:23 AM
> > 
> > On Thu, 16 Mar 2023 23:29:21 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Alex Williamson
> > > > Sent: Friday, March 17, 2023 2:46 AM
> > > >
> > > > On Wed, 15 Mar 2023 23:31:23 +0000
> > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > >  
> > > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > > Sent: Thursday, March 16, 2023 6:53 AM
> > > > > > I'm afraid this proposal reduces or eliminates the handshake we have
> > > > > > with userspace between VFIO_DEVICE_GET_PCI_HOT_RESET_INFO  
> > and  
> > > > > > VFIO_DEVICE_PCI_HOT_RESET, which could promote userspace to  
> > ignore  
> > > > the  
> > > > > > _INFO ioctl altogether, resulting in drivers that don't understand the
> > > > > > scope of the reset.  Is it worth it?  What do we really gain?  
> > > > >
> > > > > Jason raised the concern whether GET_PCI_HOT_RESET_INFO is actually
> > > > > useful today.
> > > > >
> > > > > It's an interface on opened device. So the tiny difference is whether the
> > > > > user knows the device is resettable when calling GET_INFO or later  
> > when  
> > > > > actually calling PCI_HOT_RESET.  
> > > >
> > > > No, GET_PCI_HOT_RESET_INFO conveys not only whether a  
> > PCI_HOT_RESET  
> > > > can
> > > > be performed, but equally important the scope of the reset, ie. which
> > > > devices are affected by the reset.  If we de-emphasize the INFO
> > > > portion, then this easily gets confused as just a variant of
> > > > VFIO_DEVICE_RESET, which is explicitly a device-level cscope reset.  In
> > > > fact, I'd say the interface is not only trying to validate that the
> > > > user has sufficient privileges for the reset, but that they explicitly
> > > > acknowledge the scope of the reset.
> > > >  
> > >
> > > IMHO the usefulness of scope is if it's discoverable by the management
> > > stack which then can try to assign devices with affected reset to a same
> > > user.  
> > 
> > Disagree, the user needs to know the scope of reset.  Take for instance
> > two function of a device configured onto separate buses within a VM.
> > The VMM needs to know that a hot-reset of one will reset the other.
> > That's not obvious to the VMM without some understanding of PCI/e
> > topology and analysis of the host system.  The info ioctl simplifies
> > that discovery for the VMM and the handshake of passing the affected
> > groups makes sure that the info ioctl remains relevant.  
> 
> If that is the intended usage then I don't see why this proposal will
> promote userspace to ignore the _INFO ioctl. It should be always
> queried no matter how the reset ioctl itself is designed. The motivation
> of calling _INFO is not from the reset ioctl asking for an array of fds.

The VFIO_DEVICE_PCI_HOT_RESET ioctl requires a set of group (or cdev)
fds that encompass the set of affected devices reported by the
VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl, so I don't agree with the
last sentence above.

This proposal seems to be based on some confusion about the difference
between VFIO_DEVICE_RESET and VFIO_DEVICE_PCI_HOT_RESET, and therefore
IMO, proliferates that confusion by making the scope argument optional,
which I see as a key difference.  This therefore makes the behavior of
the ioctl less intuitive, easier to get wrong, and I expect we'll see
users unintentionally resetting devices beyond the desired scope if the
group/device fd array is allowed to be empty.  Thanks,

Alex
Jason Gunthorpe March 20, 2023, 5:14 p.m. UTC | #18
On Fri, Mar 17, 2023 at 09:15:57AM -0600, Alex Williamson wrote:
> > If that is the intended usage then I don't see why this proposal will
> > promote userspace to ignore the _INFO ioctl. It should be always
> > queried no matter how the reset ioctl itself is designed. The motivation
> > of calling _INFO is not from the reset ioctl asking for an array of fds.
> 
> The VFIO_DEVICE_PCI_HOT_RESET ioctl requires a set of group (or cdev)
> fds that encompass the set of affected devices reported by the
> VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl, so I don't agree with the
> last sentence above.

There are two things going on - VFIO_DEVICE_PCI_HOT_RESET requires
proof, for security, that userspace is not attempting to reset
something it does not have ownership over, eg a reset group that spans
multiple iommu groups.

The second is for userspace to discover the reset group so it can
understand what is happening.

IMHO it is perfectly fine for each API to be only concerned with its
own purpose.

VFIO_DEVICE_PCI_HOT_RESET needs to check security, which the
iommufd_ctx check does just fine

VFIO_DEVICE_GET_PCI_HOT_RESET_INFO needs to convey the reset group
span so userspace can do something with this.

I think confusing security and scope and "acknowledgment" is not a
good idea.

The APIs are well defined and userspace can always use them wrong. It
doesn't need to call RESET_INFO even today, it can just trivially pass
every group FD it owns to meet the security check.

It is much simpler if VFIO_DEVICE_PCI_HOT_RESET can pass the security
check without code marshalling fds, which is why we went this
direction.

Jason
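
To make the first point concrete, here is a condensed, illustrative
sketch of the ownership check the zero-length path relies on - not the
exact hunk from the patch - built around the
vfio_iommufd_physical_ctx() helper this patch adds: every opened
device in the affected dev_set must resolve to the same iommufd_ctx as
the caller.

#include <linux/vfio.h>
#include <linux/vfio_pci_core.h>
#include <linux/iommufd.h>

static bool dev_set_same_iommufd(struct vfio_device_set *dev_set,
				 struct iommufd_ctx *ictx)
{
	struct vfio_pci_core_device *cur;

	lockdep_assert_held(&dev_set->lock);

	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
		if (!cur->vdev.open_count)
			continue;	/* unopened devices are not checked */
		if (vfio_iommufd_physical_ctx(&cur->vdev) != ictx)
			return false;
	}
	return true;
}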
Alex Williamson March 20, 2023, 10:52 p.m. UTC | #19
On Mon, 20 Mar 2023 14:14:48 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Fri, Mar 17, 2023 at 09:15:57AM -0600, Alex Williamson wrote:
> > > If that is the intended usage then I don't see why this proposal will
> > > promote userspace to ignore the _INFO ioctl. It should be always
> > > queried no matter how the reset ioctl itself is designed. The motivation
> > > of calling _INFO is not from the reset ioctl asking for an array of fds.  
> > 
> > The VFIO_DEVICE_PCI_HOT_RESET ioctl requires a set of group (or cdev)
> > fds that encompass the set of affected devices reported by the
> > VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl, so I don't agree with the
> > last sentence above.  
> 
> There are two things going on - VFIO_DEVICE_PCI_HOT_RESET requires to
> prove security that the userspace is not attempting to reset something
> that it does not have ownership over. Eg a reset group that spans
> multiple iommu groups.
> 
> The second is for userspace to discover the reset group so it can
> understand what is happening.
> 
> IMHO it is perfectly fine for each API to be only concerned with its
> own purpose.
> 
> VFIO_DEVICE_PCI_HOT_RESET needs to check security, which the
> iommufd_ctx check does just fine
> 
> VFIO_DEVICE_GET_PCI_HOT_RESET_INFO needs to convey the reset group
> span so userspace can do something with this.
> 
> I think confusing security and scope and "acknowledgment" is not a
> good idea.
> 
> The APIs are well defined and userspace can always use them wrong. It
> doesn't need to call RESET_INFO even today, it can just trivially pass
> every group FD it owns to meet the security check.

That's not actually true: in order to avoid arbitrarily large buffers
from the user, the ioctl won't accept an array greater than the number
of devices affected by the reset.

> It is much simpler if VFIO_DEVICE_PCI_HOT_RESET can pass the security
> check without code marshalling fds, which is why we went this
> direction.

I agree that nullifying the arg makes the ioctl easier to use, but my
hesitation is whether it makes it more difficult to use correctly,
which includes resetting devices unexpectedly.

We're talking about something that's a relatively rare event, so I
don't see that time overhead is a factor, nor has the complexity
overhead in the QEMU implementation ever been raised as an issue
previously.

We can always blame the developer for using an interface incorrectly,
but if we make it easier to use incorrectly in order to optimize
something that doesn't need to be optimized, does that make it a good
choice for the uAPI?  Thanks,

Alex
Jason Gunthorpe March 20, 2023, 11:39 p.m. UTC | #20
On Mon, Mar 20, 2023 at 04:52:17PM -0600, Alex Williamson wrote:

> > The APIs are well defined and userspace can always use them wrong. It
> > doesn't need to call RESET_INFO even today, it can just trivially pass
> > every group FD it owns to meet the security check.
> 
> That's not actually true, in order to avoid arbitrarily large buffers
> from the user, the ioctl won't accept an array greater than the number
> of devices affected by the reset.

Oh yuk!

> > It is much simpler if VFIO_DEVICE_PCI_HOT_RESET can pass the security
> > check without code marshalling fds, which is why we went this
> > direction.
> 
> I agree that nullifying the arg makes the ioctl easier to use, but my
> hesitation is whether it makes it more difficult to use correctly,
> which includes resetting devices unexpectedly.

I don't think it makes it harder to use correctly. It maybe makes it
easier to misuse, but IMHO not too much.

If the desire was to have an API that explicitly acknowledged the
reset scope then it should have taken in a list of device FDs and
optimally reset all of them or fail EPERM.

What is going to make this hard to use is the _INFO ioctl: it returns
basically the BDF string, but I think we effectively get rid of this
in the new model. libvirt will know the BDF and open the cdev, then
fd-pass the cdev to qemu. Qemu shouldn't also have to know the sysfs
path.

So we really want a new _INFO ioctl to make this easier to use.

> We can always blame the developer for using an interface incorrectly,
> but if we make it easier to use incorrectly in order to optimize
> something that doesn't need to be optimized, does that make it a good
> choice for the uAPI?

IMHO the API is designed around a security proof. Present some groups
and a subset of devices in those groups will be reset. You can't know
the subset unless you do the _INFO thing.

If we wanted it to be clearly linked to scope it should have taken in
a list of device FDs, and reset those device FDs optimally or
returned -EPERM. Then the reset scope is very clearly connected to the
API.

Jason
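
For context on the BDF matching being discussed, a small sketch of how
today's userspace turns a vfio_pci_dependent_device entry returned by
the _INFO ioctl back into a host address string it can compare against
what it learned from sysfs; format_bdf() is just an illustrative name.
This is the step that has no obvious equivalent once only a passed-in
cdev fd is known.

#include <stdio.h>
#include <linux/pci.h>		/* PCI_SLOT(), PCI_FUNC() */
#include <linux/vfio.h>

/* Format the segment/bus/devfn reported by
 * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO as a conventional BDF string. */
static void format_bdf(const struct vfio_pci_dependent_device *dev,
		       char *buf, size_t len)
{
	snprintf(buf, len, "%04x:%02x:%02x.%x",
		 dev->segment, dev->bus,
		 PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn));
}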
Alex Williamson March 21, 2023, 8:31 p.m. UTC | #21
On Mon, 20 Mar 2023 20:39:07 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Mar 20, 2023 at 04:52:17PM -0600, Alex Williamson wrote:
> 
> > > The APIs are well defined and userspace can always use them wrong. It
> > > doesn't need to call RESET_INFO even today, it can just trivially pass
> > > every group FD it owns to meet the security check.  
> > 
> > That's not actually true, in order to avoid arbitrarily large buffers
> > from the user, the ioctl won't accept an array greater than the number
> > of devices affected by the reset.  
> 
> Oh yuk!
> 
> > > It is much simpler if VFIO_DEVICE_PCI_HOT_RESET can pass the security
> > > check without code marshalling fds, which is why we went this
> > > direction.  
> > 
> > I agree that nullifying the arg makes the ioctl easier to use, but my
> > hesitation is whether it makes it more difficult to use correctly,
> > which includes resetting devices unexpectedly.  
> 
> I don't think it makes it harder to use correctly. It maybe makes it
> easier to misuse, but IMHO not too much.
> 
> If the desire was to have an API that explicitly acknowledged the
> reset scope then it should have taken in a list of device FDs and
> optimally reset all of them or fail EPERM.
> 
> What is going to make this hard to use is the _INFO IOCTL, it returns
> basically the BDF string, but I think we effectively get rid of this
> in the new model. libvirt will know the BDF and open the cdev, then fd
> pass the cdev to qemu. Qemu shouldn't also have to know the sysfs
> path..
> 
> So we really want a new _INFO ioctl to make this easier to use..

I think this makes it even worse.  If userspace cannot match BDFs from
the _INFO ioctl to device files, for proof of ownership or scope, then
the answer is clearly not "get rid of the device files".

> > We can always blame the developer for using an interface incorrectly,
> > but if we make it easier to use incorrectly in order to optimize
> > something that doesn't need to be optimized, does that make it a good
> > choice for the uAPI?  
> 
> IMHO the API is designed around a security proof. Present some groups
> and a subset of devices in those groups will be reset. You can't know
> the subset unless you do the _INFO thing.
> 
> If we wanted it to be clearly linked to scope it should have taken in
> a list of device FDs, and reset those devices FDs optimally or
> returned -EPERM. Then the reset scope is very clearly connected to the
> API.

This just seems like nit-picking that the API could have accomplished
this more concisely.  Probably that's true, but I think you've
identified a gap above that amplifies the issue.  If the user cannot
map BDFs to cdevs because the cdevs are passed as open fds to the user
driver, the _INFO results become meaningless and by removing the fds
array, that becomes the obvious choice that a user presented with this
dilemma would take.  We're skipping past easier to misuse, difficult to
use correctly, and circling around no obvious way to use correctly.

Unfortunately the _INFO ioctl does presume that userspace knows the BDF
to device mappings today, so if we are attempting to pre-enable a case
with cdev support where that is not the case, then there must be
something done with the _INFO ioctl to provide scope.  Thanks,

Alex
Jason Gunthorpe March 21, 2023, 8:50 p.m. UTC | #22
On Tue, Mar 21, 2023 at 02:31:22PM -0600, Alex Williamson wrote:

> This just seems like nit-picking that the API could have accomplished
> this more concisely.  Probably that's true, but I think you've
> identified a gap above that amplifies the issue.  If the user cannot
> map BDFs to cdevs because the cdevs are passed as open fds to the user
> driver, the _INFO results become meaningless and by removing the fds
> array, that becomes the obvious choice that a user presented with this
> dilemma would take.  We're skipping past easier to misuse, difficult to
> use correctly, and circling around no obvious way to use correctly.

No - this just isn't finished yet is all it means :(

I just noticed it now; presumably Eric would have discovered this
when he tried to implement the FD pass and we would have made a new
_INFO at that point (or more ugly, have libvirt pass the BDF along
with the FD).

> Unfortunately the _INFO ioctl does presume that userspace knows the BDF
> to device mappings today, so if we are attempting to pre-enable a case
> with cdev support where that is not the case, then there must be
> something done with the _INFO ioctl to provide scope.

Yes, something is required with _INFO before libvirt can use a FD
pass. I'm thinking of a new _INFO query that returns the iommufd
dev_ids for the reset group. Then qemu can match the dev_ids back to
cdev FDs and thus vPCI devices and do what it needs to do.

But for the current qemu setup it will open cdev directly and it will
know the BDF so it can still use the current _INFO.

Though it would be nice if qemu didn't need two implementations so Yi
I'd rather see a new info in this series as well and qemu can just
consistently use dev_id and never bdf in iommufd mode.

Anyhow, I don't see the two topics as really related; the intention is
not to discourage people from calling _INFO, it is just to make the
security proof simpler and more logical.

Jason
Alex Williamson March 21, 2023, 9:01 p.m. UTC | #23
On Tue, 21 Mar 2023 17:50:08 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Mar 21, 2023 at 02:31:22PM -0600, Alex Williamson wrote:
> 
> > This just seems like nit-picking that the API could have accomplished
> > this more concisely.  Probably that's true, but I think you've
> > identified a gap above that amplifies the issue.  If the user cannot
> > map BDFs to cdevs because the cdevs are passed as open fds to the user
> > driver, the _INFO results become meaningless and by removing the fds
> > array, that becomes the obvious choice that a user presented with this
> > dilemma would take.  We're skipping past easier to misuse, difficult to
> > use correctly, and circling around no obvious way to use correctly.  
> 
> No - this just isn't finished yet is all it means :(
> 
> I just noticed it just now, presumably Eric would have discovered this
> when he tried to implement the FD pass and we would have made a new
> _INFO at that point (or more ugly, have libvirt pass the BDF along
> with the FD).
> 
> > Unfortunately the _INFO ioctl does presume that userspace knows the BDF
> > to device mappings today, so if we are attempting to pre-enable a case
> > with cdev support where that is not the case, then there must be
> > something done with the _INFO ioctl to provide scope.  
> 
> Yes, something is required with _INFO before libvirt can use a FD
> pass. I'm thinking of a new _INFO query that returns the iommufd
> dev_ids for the reset group. Then qemu can match the dev_ids back to
> cdev FDs and thus vPCI devices and do what it needs to do.
> 
> But for the current qemu setup it will open cdev directly and it will
> know the BDF so it can still use the current _INFO.
> 
> Though it would be nice if qemu didn't need two implementations so Yi
> I'd rather see a new info in this series as well and qemu can just
> consistently use dev_id and never bdf in iommufd mode.

We also need to consider how libvirt determines if QEMU has the kernel
support it needs to pass file descriptors.  It'd be a lot cleaner if
this aligned with the introduction of vfio cdevs.
 
> Anyhow, I don't see the two topics as really related, the intention is
> not to discourage people from calling _INFO, it just to make the
> security proof simpler and more logical.

At a minimum, we need a new _INFO ioctl to get back to the point where
it's only a discussion of whether we're checking the user on scope.  We
can't remove the array when doing so opens up an obviously incorrect
solution to an impossible-to-use API.  Thanks,

Alex
Jason Gunthorpe March 21, 2023, 10:20 p.m. UTC | #24
On Tue, Mar 21, 2023 at 03:01:12PM -0600, Alex Williamson wrote:

> > Though it would be nice if qemu didn't need two implementations so Yi
> > I'd rather see a new info in this series as well and qemu can just
> > consistently use dev_id and never bdf in iommufd mode.
> 
> We also need to consider how libvirt determines if QEMU has the kernel
> support it needs to pass file descriptors.  It'd be a lot cleaner if
> this aligned with the introduction of vfio cdevs.

Yes, that would be much better if it was one package.

But this is starting to grow and we have so many threads that need to
progress blocked on this cdev enablement :(

Could we go forward with the cdev main patches and kconfig it to
experimental or something while the rest of the parts are completed
and tested through qemu? ie move the vfio-pci reset enablement to after
the cdev patches?

Jason
Alex Williamson March 21, 2023, 10:47 p.m. UTC | #25
On Tue, 21 Mar 2023 19:20:37 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Mar 21, 2023 at 03:01:12PM -0600, Alex Williamson wrote:
> 
> > > Though it would be nice if qemu didn't need two implementations so Yi
> > > I'd rather see a new info in this series as well and qemu can just
> > > consistently use dev_id and never bdf in iommufd mode.  
> > 
> > We also need to consider how libvirt determines if QEMU has the kernel
> > support it needs to pass file descriptors.  It'd be a lot cleaner if
> > this aligned with the introduction of vfio cdevs.  
> 
> Yes, that would be much better if it was one package.
> 
> But this is starting to grow and we have so many threads that need to
> progress blocked on this cdev enablement :(
> 
> Could we go forward with the cdev main patches and kconfig it to
> experimental or something while the rest of the parts are completed
> and tested through qemu? ie move the vfio-pci reset enablment to after
> the cdev patches?

We need to be able to guarantee that there cannot be any significant
builds of the kernel with vfio cdev support if our intention is to stage
it for libvirt.  We don't have a global EXPERIMENTAL config option any
more.  Adding new code under BROKEN seems wrong.  Fedora ships with
STAGING enabled.  A sternly worded Kconfig entry is toothless.  What is
the proposed mechanism to make this not look like a big uncompiled code
dump?  Thanks,

Alex
Yi Liu March 22, 2023, 4:42 a.m. UTC | #26
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, March 22, 2023 6:48 AM
> 
> On Tue, 21 Mar 2023 19:20:37 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, Mar 21, 2023 at 03:01:12PM -0600, Alex Williamson wrote:
> >
> > > > Though it would be nice if qemu didn't need two implementations so Yi
> > > > I'd rather see a new info in this series as well and qemu can just
> > > > consistently use dev_id and never bdf in iommufd mode.
> > >
> > > We also need to consider how libvirt determines if QEMU has the kernel
> > > support it needs to pass file descriptors.  It'd be a lot cleaner if
> > > this aligned with the introduction of vfio cdevs.
> >
> > Yes, that would be much better if it was one package.
> >
> > But this is starting to grow and we have so many threads that need to
> > progress blocked on this cdev enablement :(
> >
> > Could we go forward with the cdev main patches and kconfig it to
> > experimental or something while the rest of the parts are completed
> > and tested through qemu? ie move the vfio-pci reset enablment to after
> > the cdev patches?
> 
> We need to be able to guarantee that there cannot be any significant
> builds of the kernel with vfio cdev support if our intention is to stage
> it for libvirt.  We don't have a global EXPERIMENTAL config option any
> more.  Adding new code under BROKEN seems wrong.  Fedora ships with
> STAGING enabled.  A sternly worded Kconfig entry is toothless.  What is
> the proposed mechanism to make this not look like a big uncompiled code
> dump?  Thanks,

Just out of curiosity, is the BDF mapping gap only for cdev, or does it
also exist in the traditional group path? IMHO, if it is only a gap for cdev,
maybe we can use CONFIG_VFIO_DEVICE_CDEV to stage it. This kconfig is N by
default, and I think it won't change until one day the whole ecosystem is
updated.

Anyhow, I'll also look into the complexity of adding a new _INFO ioctl. It
should return a set of dev_ids to the user rather than the BDF info in the
existing _INFO ioctl, right?

Regards,
Yi Liu
Yi Liu March 22, 2023, 8:17 a.m. UTC | #27
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, March 22, 2023 4:50 AM
> 
> On Tue, Mar 21, 2023 at 02:31:22PM -0600, Alex Williamson wrote:
> 
> > This just seems like nit-picking that the API could have accomplished
> > this more concisely.  Probably that's true, but I think you've
> > identified a gap above that amplifies the issue.  If the user cannot
> > map BDFs to cdevs because the cdevs are passed as open fds to the user
> > driver, the _INFO results become meaningless and by removing the fds
> > array, that becomes the obvious choice that a user presented with this
> > dilemma would take.  We're skipping past easier to misuse, difficult to
> > use correctly, and circling around no obvious way to use correctly.
> 
> No - this just isn't finished yet is all it means :(
> 
> I just noticed it just now, presumably Eric would have discovered this
> when he tried to implement the FD pass and we would have made a new
> _INFO at that point (or more ugly, have libvirt pass the BDF along
> with the FD).
> 
> > Unfortunately the _INFO ioctl does presume that userspace knows the BDF
> > to device mappings today, so if we are attempting to pre-enable a case
> > with cdev support where that is not the case, then there must be
> > something done with the _INFO ioctl to provide scope.
> 
> Yes, something is required with _INFO before libvirt can use a FD
> pass. I'm thinking of a new _INFO query that returns the iommufd
> dev_ids for the reset group. Then qemu can match the dev_ids back to
> cdev FDs and thus vPCI devices and do what it needs to do.

Could you elaborate what is required with _INFO before libvirt can
use a FD pass?

> But for the current qemu setup it will open cdev directly and it will
> know the BDF so it can still use the current _INFO.
> 
> Though it would be nice if qemu didn't need two implementations so Yi
> I'd rather see a new info in this series as well and qemu can just
> consistently use dev_id and never bdf in iommufd mode.

I have one concern here. An iommufd dev_id is not static info the way a
BDF is; it is generated when the device is bound to iommufd. So if there
are affected devices that are not yet bound to iommufd at the time _INFO
is invoked, then the _INFO ioctl just gets a subset of the affected
devices. Is that enough?

Regards,
Yi Liu
Jason Gunthorpe March 22, 2023, 12:17 p.m. UTC | #28
On Wed, Mar 22, 2023 at 08:17:54AM +0000, Liu, Yi L wrote:

> Could you elaborate what is required with _INFO before libvirt can
> use a FD pass?

Make a new _INFO that returns an array of dev_ids within the cdev's
iommufd_ctx that are part of the reset group, eg the devset.

qemu will call this for each dev_id after it opens the cdev to
generate the groupings.
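
Roughly, the userspace side could look like the sketch below (the struct
layout and ioctl number are only a strawman here, not settled uapi, and
the error handling is illustrative):

/* Strawman uapi, not in <linux/vfio.h> */
#include <errno.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/types.h>
#include <linux/vfio.h>

struct vfio_pci_hot_reset_group_info {
	__u32 argsz;
	__u32 flags;
	__u32 count;
	__u32 devices[];
};
#define VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO _IO(VFIO_TYPE, VFIO_BASE + 18)

/*
 * Query the dev_ids of the opened devices in device_fd's reset group.
 * Returns a malloc()ed info struct on success, NULL on failure (e.g.
 * EPERM when part of the group is owned by another user).
 */
static struct vfio_pci_hot_reset_group_info *query_reset_group(int device_fd)
{
	struct vfio_pci_hot_reset_group_info *info;
	__u32 count = 0;

	do {
		size_t sz = sizeof(*info) + count * sizeof(__u32);

		info = calloc(1, sz);
		if (!info)
			return NULL;
		info->argsz = sz;

		if (!ioctl(device_fd, VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO, info))
			return info;	/* info->count dev_ids in info->devices[] */

		count = info->count;	/* kernel reported how many are needed */
		free(info);
	} while (errno == ENOSPC);

	return NULL;
}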

> > But for the current qemu setup it will open cdev directly and it will
> > know the BDF so it can still use the current _INFO.
> > 
> > Though it would be nice if qemu didn't need two implementations so Yi
> > I'd rather see a new info in this series as well and qemu can just
> > consistently use dev_id and never bdf in iommufd mode.
> 
> I have one concern here. iommufd dev_id is not a static info as much as
> bdf. It is generated when bound to iommufd. So if there are devices that
> are affected but not bound to iommufd yet at the time of invoking _INFO,
> then the _INFO ioctl just gets a subset of the affected devices. Is it enough?

I'd probably use similar logic to the reset path: if one of the reset
group devices is open and on a different iommufd_ctx then fail the
IOCTL with EPERM.
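
ie something like the sketch below, assuming the dev_set lock is held and
a helper (named vfio_iommufd_physical_ictx() here just for the sketch)
that returns the iommufd_ctx a device was bound to:

/*
 * Sketch only: every opened device in the dev_set must be bound to the
 * caller's iommufd_ctx, otherwise fail with -EPERM.
 */
static int vfio_pci_dev_set_check_owner(struct vfio_device_set *dev_set,
					struct iommufd_ctx *ictx)
{
	struct vfio_pci_core_device *cur;

	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
		if (cur->vdev.open_count &&
		    vfio_iommufd_physical_ictx(&cur->vdev) != ictx)
			return -EPERM;
	}
	return 0;
}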

Jason
Alex Williamson March 22, 2023, 12:23 p.m. UTC | #29
On Wed, 22 Mar 2023 04:42:16 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, March 22, 2023 6:48 AM
> > 
> > On Tue, 21 Mar 2023 19:20:37 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Tue, Mar 21, 2023 at 03:01:12PM -0600, Alex Williamson wrote:
> > >  
> > > > > Though it would be nice if qemu didn't need two implementations so Yi
> > > > > I'd rather see a new info in this series as well and qemu can just
> > > > > consistently use dev_id and never bdf in iommufd mode.  
> > > >
> > > > We also need to consider how libvirt determines if QEMU has the kernel
> > > > support it needs to pass file descriptors.  It'd be a lot cleaner if
> > > > this aligned with the introduction of vfio cdevs.  
> > >
> > > Yes, that would be much better if it was one package.
> > >
> > > But this is starting to grow and we have so many threads that need to
> > > progress blocked on this cdev enablement :(
> > >
> > > Could we go forward with the cdev main patches and kconfig it to
> > > experimental or something while the rest of the parts are completed
> > > and tested through qemu? ie move the vfio-pci reset enablment to after
> > > the cdev patches?  
> > 
> > We need to be able to guarantee that there cannot be any significant
> > builds of the kernel with vfio cdev support if our intention is to stage
> > it for libvirt.  We don't have a global EXPERIMENTAL config option any
> > more.  Adding new code under BROKEN seems wrong.  Fedora ships with
> > STAGING enabled.  A sternly worded Kconfig entry is toothless.  What is
> > the proposed mechanism to make this not look like a big uncompiled code
> > dump?  Thanks,  
> 
> Just out of curious, is the BDF mapping gap only for cdev or it also
> exists in the traditional group path?

The group path doesn't support passing file descriptors, getting access
to the device files requires a full container configuration, which
implies significant policy decisions in libvirt.  Even if groups were
passed, QEMU would need to know the device name, ie. BDF in string
format, to get the device from the group.

> IMHO, if it is only a gap for cdev, maybe
> we can use CONFIG_VFIO_DEVICE_CDEV to stage it. This kconfig is N by
> default. I think it won't change until one day the whole ecosystem is
> updated.

See the "toothless" comment above, disabling vfio cdev support by
default because we don't have feature parity in reset support does not
provide any guarantees to libvirt that it can effectively take
advantage of passing cdev fds to QEMU.  Thanks,

Alex
Jason Gunthorpe March 22, 2023, 12:27 p.m. UTC | #30
On Tue, Mar 21, 2023 at 04:47:37PM -0600, Alex Williamson wrote:
> On Tue, 21 Mar 2023 19:20:37 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, Mar 21, 2023 at 03:01:12PM -0600, Alex Williamson wrote:
> > 
> > > > Though it would be nice if qemu didn't need two implementations so Yi
> > > > I'd rather see a new info in this series as well and qemu can just
> > > > consistently use dev_id and never bdf in iommufd mode.  
> > > 
> > > We also need to consider how libvirt determines if QEMU has the kernel
> > > support it needs to pass file descriptors.  It'd be a lot cleaner if
> > > this aligned with the introduction of vfio cdevs.  
> > 
> > Yes, that would be much better if it was one package.
> > 
> > But this is starting to grow and we have so many threads that need to
> > progress blocked on this cdev enablement :(
> > 
> > Could we go forward with the cdev main patches and kconfig it to
> > experimental or something while the rest of the parts are completed
> > and tested through qemu? ie move the vfio-pci reset enablment to after
> > the cdev patches?
> 
> We need to be able to guarantee that there cannot be any significant
> builds of the kernel with vfio cdev support if our intention is to stage
> it for libvirt.  We don't have a global EXPERIMENTAL config option any
> more.  Adding new code under BROKEN seems wrong.  Fedora ships with
> STAGING enabled.  A sternly worded Kconfig entry is toothless.  What is
> the proposed mechanism to make this not look like a big uncompiled code
> dump?  Thanks,

I would suggest a sternly worded kconfig and STAGING.

This isn't such a big issue, we are trying to say that a future
released qemu is not required to work on older kernels with a STAGING
kconfig mark.

IOW we are saying that qemu release X.0 with production iommufd
requires kernel version > x.y and just lightly reflecting this into
the kconfig.

qemu should simply not support iommufd if it finds itself on a old
kernel.

Jason
Alex Williamson March 22, 2023, 12:36 p.m. UTC | #31
On Wed, 22 Mar 2023 09:27:16 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Mar 21, 2023 at 04:47:37PM -0600, Alex Williamson wrote:
> > On Tue, 21 Mar 2023 19:20:37 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Tue, Mar 21, 2023 at 03:01:12PM -0600, Alex Williamson wrote:
> > >   
> > > > > Though it would be nice if qemu didn't need two implementations so Yi
> > > > > I'd rather see a new info in this series as well and qemu can just
> > > > > consistently use dev_id and never bdf in iommufd mode.    
> > > > 
> > > > We also need to consider how libvirt determines if QEMU has the kernel
> > > > support it needs to pass file descriptors.  It'd be a lot cleaner if
> > > > this aligned with the introduction of vfio cdevs.    
> > > 
> > > Yes, that would be much better if it was one package.
> > > 
> > > But this is starting to grow and we have so many threads that need to
> > > progress blocked on this cdev enablement :(
> > > 
> > > Could we go forward with the cdev main patches and kconfig it to
> > > experimental or something while the rest of the parts are completed
> > > and tested through qemu? ie move the vfio-pci reset enablment to after
> > > the cdev patches?  
> > 
> > We need to be able to guarantee that there cannot be any significant
> > builds of the kernel with vfio cdev support if our intention is to stage
> > it for libvirt.  We don't have a global EXPERIMENTAL config option any
> > more.  Adding new code under BROKEN seems wrong.  Fedora ships with
> > STAGING enabled.  A sternly worded Kconfig entry is toothless.  What is
> > the proposed mechanism to make this not look like a big uncompiled code
> > dump?  Thanks,  
> 
> I would suggest a sternly worded kconfig and STAGING.
> 
> This isn't such a big issue, we are trying to say that a future
> released qemu is not required to work on older kernels with a STAGING
> kconfig mark.
> 
> IOW we are saying that qemu release X.0 with production iommufd
> requires kernel version > x.y and just lightly reflecting this into
> the kconfig.
> 
> qemu should simply not support iommufd if it finds itself on a old
> kernel.

Inferring features based on kernel versions doesn't work in a world
where downstreams backport features to older kernels.  Thanks,

Alex
Jason Gunthorpe March 22, 2023, 12:47 p.m. UTC | #32
On Wed, Mar 22, 2023 at 06:36:14AM -0600, Alex Williamson wrote:
> On Wed, 22 Mar 2023 09:27:16 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, Mar 21, 2023 at 04:47:37PM -0600, Alex Williamson wrote:
> > > On Tue, 21 Mar 2023 19:20:37 -0300
> > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >   
> > > > On Tue, Mar 21, 2023 at 03:01:12PM -0600, Alex Williamson wrote:
> > > >   
> > > > > > Though it would be nice if qemu didn't need two implementations so Yi
> > > > > > I'd rather see a new info in this series as well and qemu can just
> > > > > > consistently use dev_id and never bdf in iommufd mode.    
> > > > > 
> > > > > We also need to consider how libvirt determines if QEMU has the kernel
> > > > > support it needs to pass file descriptors.  It'd be a lot cleaner if
> > > > > this aligned with the introduction of vfio cdevs.    
> > > > 
> > > > Yes, that would be much better if it was one package.
> > > > 
> > > > But this is starting to grow and we have so many threads that need to
> > > > progress blocked on this cdev enablement :(
> > > > 
> > > > Could we go forward with the cdev main patches and kconfig it to
> > > > experimental or something while the rest of the parts are completed
> > > > and tested through qemu? ie move the vfio-pci reset enablment to after
> > > > the cdev patches?  
> > > 
> > > We need to be able to guarantee that there cannot be any significant
> > > builds of the kernel with vfio cdev support if our intention is to stage
> > > it for libvirt.  We don't have a global EXPERIMENTAL config option any
> > > more.  Adding new code under BROKEN seems wrong.  Fedora ships with
> > > STAGING enabled.  A sternly worded Kconfig entry is toothless.  What is
> > > the proposed mechanism to make this not look like a big uncompiled code
> > > dump?  Thanks,  
> > 
> > I would suggest a sternly worded kconfig and STAGING.
> > 
> > This isn't such a big issue, we are trying to say that a future
> > released qemu is not required to work on older kernels with a STAGING
> > kconfig mark.
> > 
> > IOW we are saying that qemu release X.0 with production iommufd
> > requires kernel version > x.y and just lightly reflecting this into
> > the kconfig.
> > 
> > qemu should simply not support iommufd if it finds itself on a old
> > kernel.
> 
> Inferring features based on kernel versions doesn't work in a world
> where downstreams backport features to older kernels. 

I don't mean actual kernel versions as a compatibility test. I mean it
as documentation and an expected "support" window.

ie we are disclaiming support for STAGING kernels as a matter of
documentation, not code.

Jason
Yi Liu March 22, 2023, 1:33 p.m. UTC | #33
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, March 22, 2023 8:18 PM
>
> On Wed, Mar 22, 2023 at 08:17:54AM +0000, Liu, Yi L wrote:
> 
> > Could you elaborate what is required with _INFO before libvirt can
> > use a FD pass?
> 
> Make a new _INFO that returns an array of dev_ids within the cdev's
> iommufd_ctx that are part of the reset group, eg the devset.
>
> qemu will call this for each dev_id after it opens the cdev to
> generate the groupings.

Thanks. So this new _INFO only reports a limited scope instead of
the full list of affected devices. Also, it is not a static scope, since a
device may be opened just after the _INFO returns.

> > > But for the current qemu setup it will open cdev directly and it will
> > > know the BDF so it can still use the current _INFO.
> > >
> > > Though it would be nice if qemu didn't need two implementations so Yi
> > > I'd rather see a new info in this series as well and qemu can just
> > > consistently use dev_id and never bdf in iommufd mode.
> >
> > I have one concern here. iommufd dev_id is not a static info as much as
> > bdf. It is generated when bound to iommufd. So if there are devices that
> > are affected but not bound to iommufd yet at the time of invoking _INFO,
> > then the _INFO ioctl just gets a subset of the affected devices. Is it enough?
> 
> I'd probably use similar logic as the reset path, if one of reset
> group devices is open and on a different iommufd_ctx then fail the
> IOCTL with EPERM.

Say there are three devices in the dev_set. When the first device is
opened by the current qemu, this new _INFO would return one dev_id
to the user. When the second device is opened, this new _INFO will return
two dev_ids to the user. If the third device is opened by another qemu, then
the new _INFO would fail since the former two devices were opened and
have a different iommufd_ctx from the third device.

Regards,
Yi Liu
Jason Gunthorpe March 22, 2023, 1:43 p.m. UTC | #34
On Wed, Mar 22, 2023 at 01:33:09PM +0000, Liu, Yi L wrote:

> Thanks. So this new _INFO only reports a limited scope instead of
> the full list of affected devices. Also, it is not static scope since device
> may be opened just after the _INFO returns.

Yes, it would be simplest for qemu to do the query after it gains a
new dev_id and then it can add the new dev_id with the correct reset
group.
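
Something like the below on the qemu side, purely to illustrate the
bookkeeping (VFIOResetGroup and the list handling are invented for this
sketch, not existing qemu code):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct VFIOResetGroup {
    uint32_t *dev_ids;          /* dev_ids known to share this reset group */
    size_t nr_dev_ids;
    struct VFIOResetGroup *next;
} VFIOResetGroup;

static bool group_has_dev_id(VFIOResetGroup *g, uint32_t dev_id)
{
    for (size_t i = 0; i < g->nr_dev_ids; i++) {
        if (g->dev_ids[i] == dev_id) {
            return true;
        }
    }
    return false;
}

/*
 * 'affected' is the dev_id array returned by the new _INFO right after a
 * new device was opened and bound.  If it overlaps an existing group the
 * new dev_id belongs there, otherwise the caller starts a new group.
 */
static VFIOResetGroup *find_reset_group(VFIOResetGroup *groups,
                                        const uint32_t *affected,
                                        size_t nr_affected)
{
    for (VFIOResetGroup *g = groups; g; g = g->next) {
        for (size_t i = 0; i < nr_affected; i++) {
            if (group_has_dev_id(g, affected[i])) {
                return g;
            }
        }
    }
    return NULL;
}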

> > I'd probably use similar logic as the reset path, if one of reset
> > group devices is open and on a different iommufd_ctx then fail the
> > IOCTL with EPERM.
> 
> Say there are three devices in the dev_set. When the first device is
> opened by the current qemu, this new _INFO would return one dev_id
> to user. When the second device is opened, this new _INFO will return
> two dev_ids to user.

Yes

> If the third device is opened by another qemu, then
> the new _INFO would fail since the former two devices were opened and
> have different iommufd_ctx with the third device.

Yes

qemu should refuse to use the device at this moment.

Jason
Yi Liu March 23, 2023, 3:15 a.m. UTC | #35
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, March 22, 2023 9:43 PM
> 
> On Wed, Mar 22, 2023 at 01:33:09PM +0000, Liu, Yi L wrote:
> 
> > Thanks. So this new _INFO only reports a limited scope instead of
> > the full list of affected devices. Also, it is not static scope since device
> > may be opened just after the _INFO returns.
> 
> Yes, it would be simplest for qemu to do the query after it gains a
> new dev_id and then it can add the new dev_id with the correct reset
> group.

I see. QEMU can decide. For now, it seems like QEMU doesn't store
the info returned by the existing _INFO ioctl; it is used in the hot
reset helper and freed before it returns. Though, I'm not sure whether
QEMU will store the dev_id info returned by the new _INFO. Perhaps
Alex can give some guidance.

> > > I'd probably use similar logic as the reset path, if one of reset
> > > group devices is open and on a different iommufd_ctx then fail the
> > > IOCTL with EPERM.
> >
> > Say there are three devices in the dev_set. When the first device is
> > opened by the current qemu, this new _INFO would return one dev_id
> > to user. When the second device is opened, this new _INFO will return
> > two dev_ids to user.
> 
> Yes
> 
> > If the third device is opened by another qemu, then
> > the new _INFO would fail since the former two devices were opened and
> > have different iommufd_ctx with the third device.
> 
> Yes
> 
> qemu should refuse to use the device at this moment.

Yes. it is.

Btw, another question about this new _INFO ioctl. If there are affected
devices that have not been bound to a vfio driver yet, should this new _INFO
ioctl fail all the same with EPERM? Or should it still just return the dev_ids
of the devices that are already bound and opened? The latter seems to make
sense for two reasons:
 - for one, the new _INFO is not designed to give userspace an exact
   affected device list;
 - for two, the reset shall fail anyway when checking vfio_pci_dev_set_resettable().
Please feel free to correct me if this is wrong.

Regards,
Yi Liu
Jason Gunthorpe March 23, 2023, 12:02 p.m. UTC | #36
On Thu, Mar 23, 2023 at 03:15:20AM +0000, Liu, Yi L wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, March 22, 2023 9:43 PM
> > 
> > On Wed, Mar 22, 2023 at 01:33:09PM +0000, Liu, Yi L wrote:
> > 
> > > Thanks. So this new _INFO only reports a limited scope instead of
> > > the full list of affected devices. Also, it is not static scope since device
> > > may be opened just after the _INFO returns.
> > 
> > Yes, it would be simplest for qemu to do the query after it gains a
> > new dev_id and then it can add the new dev_id with the correct reset
> > group.
> 
> I see. QEMU can decide. For now, it seems like QEMU doesn't store
> such the info return by the existing _INFO ioctl. It is used in the hot
> reset helper and freed before it returns. Though, I'm not sure whether
> QEMU will store the dev_id info returned by the new _INFO. Perhaps
> Alex can give some guidance.

That seems a bit confusing, qemu should take the reset group
information and encode it so that the guest can know it as well.

If all it does is blindly invoke the hot_reset with the right
parameters then what was the point of all this discussion?
 
> btw.  Another question about this new _INFO ioctl. If there are affected
> devices that have not been bound to vfio driver yet, should this new _INFO
> ioctl fail all the same with EPERM? 

Yeah, it should EPERM the same as the normal hot reset if it can't
marshal the device list.

Jason
Tian, Kevin March 24, 2023, 9:09 a.m. UTC | #37
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, March 22, 2023 5:01 AM
> 
> On Tue, 21 Mar 2023 17:50:08 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> >
> > Though it would be nice if qemu didn't need two implementations so Yi
> > I'd rather see a new info in this series as well and qemu can just
> > consistently use dev_id and never bdf in iommufd mode.
> 
> We also need to consider how libvirt determines if QEMU has the kernel
> support it needs to pass file descriptors.  It'd be a lot cleaner if
> this aligned with the introduction of vfio cdevs.
> 

Libvirt can check whether the kernel creates a cdev for a given device
via sysfs.
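
e.g. something like below (the exact "vfio-dev" directory name under the
PCI device's sysfs node is an assumption here and needs to be checked
against the final cdev series):

#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static bool vfio_cdev_supported(const char *bdf)	/* e.g. "0000:65:00.0" */
{
	char path[PATH_MAX];

	snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/vfio-dev", bdf);
	return access(path, F_OK) == 0;
}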

But I'm not sure how Libvirt determines whether QEMU supports a
feature that it wants to use. It sounds like a general handshake
problem, though: Libvirt needs to support many versions of QEMU, so
there must already be a way for such negotiation?
Yi Liu March 24, 2023, 9:25 a.m. UTC | #38
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, March 23, 2023 8:02 PM
> 
> On Thu, Mar 23, 2023 at 03:15:20AM +0000, Liu, Yi L wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, March 22, 2023 9:43 PM
> > >
> > > On Wed, Mar 22, 2023 at 01:33:09PM +0000, Liu, Yi L wrote:
> > >
> > > > Thanks. So this new _INFO only reports a limited scope instead of
> > > > the full list of affected devices. Also, it is not static scope since device
> > > > may be opened just after the _INFO returns.
> > >
> > > Yes, it would be simplest for qemu to do the query after it gains a
> > > new dev_id and then it can add the new dev_id with the correct reset
> > > group.
> >
> > I see. QEMU can decide. For now, it seems like QEMU doesn't store
> > such the info return by the existing _INFO ioctl. It is used in the hot
> > reset helper and freed before it returns. Though, I'm not sure whether
> > QEMU will store the dev_id info returned by the new _INFO. Perhaps
> > Alex can give some guidance.
> 
> That seems a bit confusing, qemu should take the reset group
> information and encode it so that the guest can know it as well.
> 
> If all it does is blindly invoke the hot_reset with the right
> parameters then what was the point of all this discussion?
> 
> > btw.  Another question about this new _INFO ioctl. If there are affected
> > devices that have not been bound to vfio driver yet, should this new _INFO
> > ioctl fail all the same with EPERM?
> 
> Yeah, it should EPERM the same as the normal hot reset if it can't
> marshal the device list.

Hi Jason, Alex,

I got a draft patch to add the new _INFO. It checks if all the affected devices
are in the dev_set, and then gets the dev_id of all the opened devices within
the dev_set. There is still one thing that is not quite clear: the noiommu mode.
In this mode, there is no iommufd bind, so no dev_id. For now, I just fail this
new _INFO ioctl if there is no iommufd_device. Hence, this new _INFO is not
available for users operating in noiommu mode. Is this acceptable?

From e763474e255ff9805b1fb76d6b6b9ccedb61058f Mon Sep 17 00:00:00 2001
From: Yi Liu <yi.l.liu@intel.com>
Date: Fri, 24 Mar 2023 00:54:08 -0700
Subject: [PATCH 06/10] vfio/pci: Add VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO

to report the affected devices for a given device's hot reset. It is a
list of the iommufd dev_ids of the devices opened by the user. If there
is a device that is not bound to a vfio driver, or is opened by another
user, this shall fail with -EPERM. For the noiommu mode in the vfio
device cdev path, this shall fail as no dev_id would be generated.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 98 ++++++++++++++++++++++++++++++++
 include/uapi/linux/vfio.h        | 28 +++++++++
 2 files changed, 126 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index b68fcba67a4b..5789933a64ae 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1181,6 +1181,102 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev,
 	return ret;
 }
 
+static struct pci_dev *
+vfio_pci_dev_set_resettable(struct vfio_device_set *dev_set);
+
+static int vfio_pci_ioctl_get_pci_hot_reset_group_info(
+	struct vfio_pci_core_device *vdev,
+	struct vfio_pci_hot_reset_group_info __user *arg)
+{
+	unsigned long minsz =
+		offsetofend(struct vfio_pci_hot_reset_group_info, count);
+	struct vfio_pci_hot_reset_group_info hdr;
+	struct iommufd_ctx *iommufd, *cur_iommufd;
+	u32 count = 0, index = 0, *devices = NULL;
+	struct vfio_pci_core_device *cur;
+	bool slot = false;
+	int ret = 0;
+
+	if (copy_from_user(&hdr, arg, minsz))
+		return -EFAULT;
+
+	if (hdr.argsz < minsz)
+		return -EINVAL;
+
+	hdr.flags = 0;
+
+	/* Can we do a slot or bus reset or neither? */
+	if (!pci_probe_reset_slot(vdev->pdev->slot))
+		slot = true;
+	else if (pci_probe_reset_bus(vdev->pdev->bus))
+		return -ENODEV;
+
+	mutex_lock(&vdev->vdev.dev_set->lock);
+	if (!vfio_pci_dev_set_resettable(vdev->vdev.dev_set)) {
+		ret = -EPERM;
+		goto out_unlock;
+	}
+
+	iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
+	if (!iommufd) {
+		ret = -EPERM;
+		goto out_unlock;
+	}
+
+	/* How many devices are affected? */
+	ret = vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_count_devs,
+					    &count, slot);
+	if (ret)
+		goto out_unlock;
+
+	WARN_ON(!count); /* Should always be at least one */
+
+	/*
+	 * If there's enough space, fill it now, otherwise return -ENOSPC and
+	 * the number of devices affected.
+	 */
+	if (hdr.argsz < sizeof(hdr) + (count * sizeof(*devices))) {
+		ret = -ENOSPC;
+		hdr.count = count;
+		goto reset_info_exit;
+	}
+
+	devices = kcalloc(count, sizeof(*devices), GFP_KERNEL);
+	if (!devices) {
+		ret = -ENOMEM;
+		goto reset_info_exit;
+	}
+
+	list_for_each_entry(cur, &vdev->vdev.dev_set->device_list, vdev.dev_set_list) {
+		cur_iommufd = vfio_iommufd_physical_ictx(&cur->vdev);
+		if (cur->vdev.open_count) {
+			if (cur_iommufd != iommufd) {
+				ret = -EPERM;
+				break;
+			}
+			ret = vfio_iommufd_physical_devid(&cur->vdev, &devices[index]);
+			if (ret)
+				break;
+			index++;
+		}
+	}
+
+	/* Report only the dev_ids actually collected above */
+	if (!ret)
+		hdr.count = index;
+
+reset_info_exit:
+	if (copy_to_user(arg, &hdr, minsz))
+		ret = -EFAULT;
+
+	if (!ret) {
+		if (copy_to_user(&arg->devices, devices,
+				 hdr.count * sizeof(*devices)))
+			ret = -EFAULT;
+	}
+
+	kfree(devices);
+out_unlock:
+	mutex_unlock(&vdev->vdev.dev_set->lock);
+	return ret;
+}
+
 static int vfio_pci_ioctl_get_pci_hot_reset_info(
 	struct vfio_pci_core_device *vdev,
 	struct vfio_pci_hot_reset_info __user *arg)
@@ -1404,6 +1500,8 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
 		return vfio_pci_ioctl_get_irq_info(vdev, uarg);
 	case VFIO_DEVICE_GET_PCI_HOT_RESET_INFO:
 		return vfio_pci_ioctl_get_pci_hot_reset_info(vdev, uarg);
+	case VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO:
+		return vfio_pci_ioctl_get_pci_hot_reset_group_info(vdev, uarg);
 	case VFIO_DEVICE_GET_REGION_INFO:
 		return vfio_pci_ioctl_get_region_info(vdev, uarg);
 	case VFIO_DEVICE_IOEVENTFD:
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 17aa5d09db41..572497cda3ca 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -669,6 +669,43 @@ struct vfio_pci_hot_reset_info {
 
 #define VFIO_DEVICE_GET_PCI_HOT_RESET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
 
+/**
+ * VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 18,
+ *						    struct vfio_pci_hot_reset_group_info)
+ *
+ * This is used in vfio device cdev mode.  It returns the list of
+ * affected devices (represented by iommufd dev_id) when a hot reset is
+ * issued on the device with which this ioctl is invoked.  The list
+ * only includes the devices that are opened by the current user at the
+ * time this command is invoked, so the user should re-invoke it after
+ * opening new devices.  This information gives the user the scope of
+ * affected devices that it has opened and helps it decide whether
+ * VFIO_DEVICE_PCI_HOT_RESET can be issued.  However, even if this
+ * info indicates a hot reset is possible, the reset can still fail
+ * since a not-yet-opened affected device may be opened by another user
+ * in the window between the two ioctls.  See the description of
+ * VFIO_DEVICE_PCI_HOT_RESET for details.
+ *
+ * Return: 0 on success, -errno on failure:
+ *	-enospc = insufficient buffer;
+ *	-enodev = unsupported for device;
+ *	-eperm = no permission for device, returned when:
+ *		 - there are affected devices that are opened but not
+ *		   bound to the same iommufd as the device with which
+ *		   this ioctl is invoked,
+ *		 - there are affected devices that are not bound to a
+ *		   vfio driver yet,
+ *		 - no valid iommufd is bound (e.g. noiommu mode).
+ */
+struct vfio_pci_hot_reset_group_info {
+	__u32	argsz;
+	__u32	flags;
+	__u32	count;
+	__u32	devices[];
+};
+
+#define VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO	_IO(VFIO_TYPE, VFIO_BASE + 18)
+
 /**
  * VFIO_DEVICE_PCI_HOT_RESET - _IOW(VFIO_TYPE, VFIO_BASE + 13,
  *				    struct vfio_pci_hot_reset)
Jason Gunthorpe March 24, 2023, 1:14 p.m. UTC | #39
On Fri, Mar 24, 2023 at 09:09:59AM +0000, Tian, Kevin wrote:
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, March 22, 2023 5:01 AM
> > 
> > On Tue, 21 Mar 2023 17:50:08 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > 
> > >
> > > Though it would be nice if qemu didn't need two implementations so Yi
> > > I'd rather see a new info in this series as well and qemu can just
> > > consistently use dev_id and never bdf in iommufd mode.
> > 
> > We also need to consider how libvirt determines if QEMU has the kernel
> > support it needs to pass file descriptors.  It'd be a lot cleaner if
> > this aligned with the introduction of vfio cdevs.
> > 
> 
> Libvirt can check whether the kernel creates cdev for a given device
> via sysfs.
> 
> but I'm not sure how Libvirt determines whether QEMU supports a
> feature that it wants to use. But sounds this is a general handshake
> problem as Libvirt needs to support many versions of QEMU then
> there must be a way for such negotiation?

Ideally libvirt would be able to learn what ioctls are supported on
the cdev fd after it opens it, but before binding. I don't think we
have something here yet, but I could imagine having something.

Clearly it is easier if cdev alone is enough proof to know that it
should go ahead in cdev mode.

Jason
Yi Liu March 27, 2023, 11:57 a.m. UTC | #40
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Friday, March 24, 2023 5:25 PM
> 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Thursday, March 23, 2023 8:02 PM
> >
> > On Thu, Mar 23, 2023 at 03:15:20AM +0000, Liu, Yi L wrote:
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Wednesday, March 22, 2023 9:43 PM
> > > >
> > > > On Wed, Mar 22, 2023 at 01:33:09PM +0000, Liu, Yi L wrote:
> > > >
> > > > > Thanks. So this new _INFO only reports a limited scope instead of
> > > > > the full list of affected devices. Also, it is not static scope since device
> > > > > may be opened just after the _INFO returns.
> > > >
> > > > Yes, it would be simplest for qemu to do the query after it gains a
> > > > new dev_id and then it can add the new dev_id with the correct reset
> > > > group.
> > >
> > > I see. QEMU can decide. For now, it seems like QEMU doesn't store
> > > such the info return by the existing _INFO ioctl. It is used in the hot
> > > reset helper and freed before it returns. Though, I'm not sure whether
> > > QEMU will store the dev_id info returned by the new _INFO. Perhaps
> > > Alex can give some guidance.
> >
> > That seems a bit confusing, qemu should take the reset group
> > information and encode it so that the guest can know it as well.
> >
> > If all it does is blindly invoke the hot_reset with the right
> > parameters then what was the point of all this discussion?
> >
> > > btw.  Another question about this new _INFO ioctl. If there are affected
> > > devices that have not been bound to vfio driver yet, should this new
> _INFO
> > > ioctl fail all the same with EPERM?
> >
> > Yeah, it should EPERM the same as the normal hot reset if it can't
> > marshal the device list.
> 
> Hi Jason, Alex,
> 
> I got a draft patch to add the new _INFO? It checks if all the affected devices
> are in the dev_set, and then gets the dev_id of all the opened devices within
> the dev_set. There is still one thing not quite clear. It is the noiommu mode.
> In this mode, there is no iommufd bind, so no dev_id. For now, I just fail this
> new _INFO ioctl if there is no iommufd_device. Hence, this new _INFO is not
> available for users that operating in noiommu mode. Is this acceptable?

The latest _INFO ioctl is in the link below.

https://lore.kernel.org/kvm/20230327093458.44939-11-yi.l.liu@intel.com/

Patch

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 9087cd8ed3ea..dbcee0d38a48 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -131,6 +131,12 @@  void iommufd_device_unbind(struct iommufd_device *idev)
 }
 EXPORT_SYMBOL_NS_GPL(iommufd_device_unbind, IOMMUFD);
 
+struct iommufd_ctx *iommufd_device_to_ictx(struct iommufd_device *idev)
+{
+	return idev->ictx;
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_device_to_ictx, IOMMUFD);
+
 static int iommufd_device_setup_msi(struct iommufd_device *idev,
 				    struct iommufd_hw_pagetable *hwpt,
 				    phys_addr_t sw_msi_start)
diff --git a/drivers/vfio/iommufd.c b/drivers/vfio/iommufd.c
index 768d353cb6fa..30c0da2e11f9 100644
--- a/drivers/vfio/iommufd.c
+++ b/drivers/vfio/iommufd.c
@@ -69,6 +69,15 @@  void vfio_iommufd_unbind(struct vfio_device *vdev)
 		vdev->ops->unbind_iommufd(vdev);
 }
 
+struct iommufd_ctx *vfio_iommufd_physical_ctx(struct vfio_device *vdev)
+{
+	/* Only serve for physical device */
+	if (!vdev->iommufd_device)
+		return NULL;
+	return iommufd_device_to_ictx(vdev->iommufd_device);
+}
+EXPORT_SYMBOL_GPL(vfio_iommufd_physical_ctx);
+
 /*
  * The physical standard ops mean that the iommufd_device is bound to the
  * physical device vdev->dev that was provided to vfio_init_group_dev(). Drivers
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 123b468ead73..b039fbd5c656 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -180,7 +180,8 @@  static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
 struct vfio_pci_user_file_info;
 static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
 static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
-				      struct vfio_pci_user_file_info *user_info);
+				      struct vfio_pci_user_file_info *user_info,
+				      struct iommufd_ctx *iommufd_ctx);
 
 /*
  * INTx masking requires the ability to disable INTx signaling via PCI_COMMAND
@@ -1255,29 +1256,17 @@  static int vfio_pci_ioctl_get_pci_hot_reset_info(
 	return ret;
 }
 
-static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
+static int
+vfio_pci_ioctl_pci_hot_reset_user_files(struct vfio_pci_core_device *vdev,
+					struct vfio_pci_hot_reset *hdr,
+					bool slot,
 					struct vfio_pci_hot_reset __user *arg)
 {
-	unsigned long minsz = offsetofend(struct vfio_pci_hot_reset, count);
-	struct vfio_pci_hot_reset hdr;
 	int32_t *user_fds;
 	struct file **files;
 	struct vfio_pci_user_file_info info;
-	bool slot = false;
 	int file_idx, count = 0, ret = 0;
 
-	if (copy_from_user(&hdr, arg, minsz))
-		return -EFAULT;
-
-	if (hdr.argsz < minsz || hdr.flags)
-		return -EINVAL;
-
-	/* Can we do a slot or bus reset or neither? */
-	if (!pci_probe_reset_slot(vdev->pdev->slot))
-		slot = true;
-	else if (pci_probe_reset_bus(vdev->pdev->bus))
-		return -ENODEV;
-
 	/*
 	 * We can't let userspace give us an arbitrarily large buffer to copy,
 	 * so verify how many we think there could be.  Note groups can have
@@ -1289,11 +1278,11 @@  static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 		return ret;
 
 	/* Somewhere between 1 and count is OK */
-	if (!hdr.count || hdr.count > count)
+	if (hdr->count > count)
 		return -EINVAL;
 
-	user_fds = kcalloc(hdr.count, sizeof(*user_fds), GFP_KERNEL);
-	files = kcalloc(hdr.count, sizeof(*files), GFP_KERNEL);
+	user_fds = kcalloc(hdr->count, sizeof(*user_fds), GFP_KERNEL);
+	files = kcalloc(hdr->count, sizeof(*files), GFP_KERNEL);
 	if (!user_fds || !files) {
 		kfree(user_fds);
 		kfree(files);
@@ -1301,7 +1290,7 @@  static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 	}
 
 	if (copy_from_user(user_fds, arg->fds,
-			   hdr.count * sizeof(*user_fds))) {
+			   hdr->count * sizeof(*user_fds))) {
 		kfree(user_fds);
 		kfree(files);
 		return -EFAULT;
@@ -1311,7 +1300,7 @@  static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 	 * Get the file for each fd to ensure the group/device file
 	 * is held across the reset
 	 */
-	for (file_idx = 0; file_idx < hdr.count; file_idx++) {
+	for (file_idx = 0; file_idx < hdr->count; file_idx++) {
 		struct file *file = fget(user_fds[file_idx]);
 
 		if (!file) {
@@ -1341,10 +1330,10 @@  static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 	if (ret)
 		goto hot_reset_release;
 
-	info.count = hdr.count;
+	info.count = hdr->count;
 	info.files = files;
 
-	ret = vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, &info);
+	ret = vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, &info, NULL);
 
 hot_reset_release:
 	for (file_idx--; file_idx >= 0; file_idx--)
@@ -1354,6 +1343,36 @@  static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 	return ret;
 }
 
+static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
+					struct vfio_pci_hot_reset __user *arg)
+{
+	unsigned long minsz = offsetofend(struct vfio_pci_hot_reset, count);
+	struct vfio_pci_hot_reset hdr;
+	struct iommufd_ctx *iommufd;
+	bool slot = false;
+
+	if (copy_from_user(&hdr, arg, minsz))
+		return -EFAULT;
+
+	if (hdr.argsz < minsz || hdr.flags)
+		return -EINVAL;
+
+	/* Can we do a slot or bus reset or neither? */
+	if (!pci_probe_reset_slot(vdev->pdev->slot))
+		slot = true;
+	else if (pci_probe_reset_bus(vdev->pdev->bus))
+		return -ENODEV;
+
+	if (hdr.count)
+		return vfio_pci_ioctl_pci_hot_reset_user_files(vdev, &hdr, slot, arg);
+
+	iommufd = vfio_iommufd_physical_ctx(&vdev->vdev);
+	if (!iommufd)
+		return -EINVAL;
+
+	return vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, NULL, iommufd);
+}
+
 static int vfio_pci_ioctl_ioeventfd(struct vfio_pci_core_device *vdev,
 				    struct vfio_device_ioeventfd __user *arg)
 {
@@ -2323,6 +2342,9 @@  static bool vfio_dev_in_user_fds(struct vfio_pci_core_device *vdev,
 {
 	unsigned int i;
 
+	if (!user_info)
+		return false;
+
 	for (i = 0; i < user_info->count; i++)
 		if (vfio_file_has_dev(user_info->files[i], &vdev->vdev))
 			return true;
@@ -2398,13 +2420,25 @@  static int vfio_pci_dev_set_pm_runtime_get(struct vfio_device_set *dev_set)
 	return ret;
 }
 
+static bool vfio_dev_in_iommufd_ctx(struct vfio_pci_core_device *vdev,
+				    struct iommufd_ctx *iommufd_ctx)
+{
+	struct iommufd_ctx *iommufd = vfio_iommufd_physical_ctx(&vdev->vdev);
+
+	if (!iommufd)
+		return false;
+
+	return iommufd == iommufd_ctx;
+}
+
 /*
  * We need to get memory_lock for each device, but devices can share mmap_lock,
  * therefore we need to zap and hold the vma_lock for each device, and only then
  * get each memory_lock.
  */
 static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
-				      struct vfio_pci_user_file_info *user_info)
+				      struct vfio_pci_user_file_info *user_info,
+				      struct iommufd_ctx *iommufd_ctx)
 {
 	struct vfio_pci_core_device *cur_mem;
 	struct vfio_pci_core_device *cur_vma;
@@ -2448,10 +2482,14 @@  static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 		 * For the devices that have been opened, needs to check the
 		 * ownership.  If the user provides a set of group/device
 		 * fds, test whether all the opened devices are contained
-		 * by the set of groups/devices provided by the user.
+		 * by the set of groups/devices provided by the user.  If the
+		 * user provides a zero-length array, the ownership check
+		 * is done by checking if all the opened devices are bound
+		 * to the same iommufd_ctx.
 		 */
 		if (cur_vma->vdev.open_count &&
-		    !vfio_dev_in_user_fds(cur_vma, user_info)) {
+		    !vfio_dev_in_user_fds(cur_vma, user_info) &&
+		    !vfio_dev_in_iommufd_ctx(cur_vma, iommufd_ctx)) {
 			ret = -EINVAL;
 			goto err_undo;
 		}
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 365f11e8e615..7a0d7f2c4237 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -20,6 +20,7 @@  struct file;
 struct iommufd_device *iommufd_device_bind(struct iommufd_ctx *ictx,
 					   struct device *dev, u32 *id);
 void iommufd_device_unbind(struct iommufd_device *idev);
+struct iommufd_ctx *iommufd_device_to_ictx(struct iommufd_device *idev);
 
 int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id);
 void iommufd_device_detach(struct iommufd_device *idev);
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 1c69be2d687e..fc14f8430a10 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -116,6 +116,7 @@  struct vfio_device_ops {
 int vfio_iommufd_physical_bind(struct vfio_device *vdev,
 			       struct iommufd_ctx *ictx, u32 *out_device_id);
 void vfio_iommufd_physical_unbind(struct vfio_device *vdev);
+struct iommufd_ctx *vfio_iommufd_physical_ctx(struct vfio_device *vdev);
 int vfio_iommufd_physical_attach_ioas(struct vfio_device *vdev, u32 *pt_id);
 int vfio_iommufd_emulated_bind(struct vfio_device *vdev,
 			       struct iommufd_ctx *ictx, u32 *out_device_id);
@@ -127,6 +128,8 @@  int vfio_iommufd_emulated_attach_ioas(struct vfio_device *vdev, u32 *pt_id);
 		  u32 *out_device_id)) NULL)
 #define vfio_iommufd_physical_unbind \
 	((void (*)(struct vfio_device *vdev)) NULL)
+#define vfio_iommufd_physical_ctx \
+	((struct iommufd_ctx * (*)(struct vfio_device *vdev)) NULL)
 #define vfio_iommufd_physical_attach_ioas \
 	((int (*)(struct vfio_device *vdev, u32 *pt_id)) NULL)
 #define vfio_iommufd_emulated_bind                                      \
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index d80141969cd1..382d95455f89 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -682,6 +682,11 @@  struct vfio_pci_hot_reset_info {
  * The ownership can be proved by:
  *   - An array of group fds
  *   - An array of device fds
+ *   - A zero-length array
+ *
+ * In the last case all affected devices which are opened by this user
+ * must have been bound to a same iommufd_ctx.  This approach is only
+ * available for devices bound to positive iommufd.
  *
  * Return: 0 on success, -errno on failure.
  */
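
For reference, a minimal userspace sketch of the zero-length-array case
described in the comment above; the caller has bound every opened device
in the dev_set to the same iommufd, so no group/device fds are passed
(error handling is illustrative only):

#include <sys/ioctl.h>
#include <linux/vfio.h>

static int vfio_pci_hot_reset_via_iommufd(int device_fd)
{
	struct vfio_pci_hot_reset reset = {
		.argsz = sizeof(reset),
		.flags = 0,
		.count = 0,	/* zero-length fd array */
	};

	return ioctl(device_fd, VFIO_DEVICE_PCI_HOT_RESET, &reset);
}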