
[v3,12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO

Message ID 20230401144429.88673-13-yi.l.liu@intel.com (mailing list archive)
State New, archived
Series Introduce new methods for verifying ownership in vfio PCI hot reset

Commit Message

Yi Liu April 1, 2023, 2:44 p.m. UTC
Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO so that users that
accept device fds passed from management stacks can figure out which of
the devices they have opened are affected by a hot reset. This is needed
because such users have no BDF (bus, devfn) knowledge about the devices
they have opened, and hence cannot use the information reported by the
existing VFIO_DEVICE_GET_PCI_HOT_RESET_INFO to identify the affected
devices.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 58 ++++++++++++++++++++++++++++----
 include/uapi/linux/vfio.h        | 24 ++++++++++++-
 2 files changed, 74 insertions(+), 8 deletions(-)

Comments

Yi Liu April 3, 2023, 9:25 a.m. UTC | #1
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Saturday, April 1, 2023 10:44 PM

> @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
>  	if (!iommu_group)
>  		return -EPERM; /* Cannot reset non-isolated devices */

Hi Alex,

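Is disabling iommu a sane way to test vfio noiommu mode? If not, just skip
the contents below.

I added intel_iommu=off to disable the Intel IOMMU and bound a device to
vfio-pci. I can see /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0.
Binding with iommufd==-1 succeeds, but getting the hot reset info fails due
to the above group check. The reason is that this device happens to have
some affected devices, and these devices have no valid iommu_group (they are
not bound to vfio-pci, hence nobody allocates a noiommu group for them). So
when the hot reset info loop reaches such devices, it fails with -EPERM. Is
this expected?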
Alex Williamson April 3, 2023, 3:01 p.m. UTC | #2
On Mon, 3 Apr 2023 09:25:06 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Saturday, April 1, 2023 10:44 PM  
> 
> > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
> >  	if (!iommu_group)
> >  		return -EPERM; /* Cannot reset non-isolated devices */  
> 
> Hi Alex,
> 
> Is disabling iommu a sane way to test vfio noiommu mode?

Yes

> I added intel_iommu=off to disable intel iommu and bind a device to vfio-pci.
> I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0. Bind
> iommufd==-1 can succeed, but failed to get hot reset info due to the above
> group check. Reason is that this happens to have some affected devices, and
> these devices have no valid iommu_group (because they are not bound to vfio-pci
> hence nobody allocates noiommu group for them). So when hot reset info loops
> such devices, it failed with -EPERM. Is this expected?

Hmm, I didn't recall that we put in such a limitation, but given the
minimally intrusive approach to no-iommu and the fact that we never
defined an invalid group ID to return to the user, it makes sense that
we just blocked the ioctl for no-iommu use.  I guess we can do the same
for no-iommu cdev.

BTW, what does this series apply on?  I'm assuming [1], but I don't see
a branch from Jason yet.  Thanks,

Alex

[1] https://lore.kernel.org/all/20230327093351.44505-1-yi.l.liu@intel.com/
Yi Liu April 3, 2023, 3:22 p.m. UTC | #3
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Monday, April 3, 2023 11:02 PM
> 
> On Mon, 3 Apr 2023 09:25:06 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Saturday, April 1, 2023 10:44 PM
> >
> > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void
> *data)
> > >  	if (!iommu_group)
> > >  		return -EPERM; /* Cannot reset non-isolated devices */
> >
> > Hi Alex,
> >
> > Is disabling iommu a sane way to test vfio noiommu mode?
> 
> Yes
> 
> > I added intel_iommu=off to disable intel iommu and bind a device to vfio-pci.
> > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0. Bind
> > iommufd==-1 can succeed, but failed to get hot reset info due to the above
> > group check. Reason is that this happens to have some affected devices, and
> > these devices have no valid iommu_group (because they are not bound to vfio-pci
> > hence nobody allocates noiommu group for them). So when hot reset info loops
> > such devices, it failed with -EPERM. Is this expected?
> 
> Hmm, I didn't recall that we put in such a limitation, but given the
> minimally intrusive approach to no-iommu and the fact that we never
> defined an invalid group ID to return to the user, it makes sense that
> we just blocked the ioctl for no-iommu use.  I guess we can do the same
> for no-iommu cdev.

sure.

> 
> BTW, what does this series apply on?  I'm assuming[1], but I don't see
> a branch from Jason yet.  Thanks,

Yes, this series applies on top of [1]. I put [1], this series, and the
cdev series in https://github.com/yiliu1765/iommufd/commits/vfio_device_cdev_v9.

Jason has taken [1] in the below branch. It is based on rc1, so I hesitated
to apply this series and the cdev series on top of it. Maybe I should have
done so to make life easier.
Alex Williamson April 3, 2023, 3:32 p.m. UTC | #4
On Mon, 3 Apr 2023 15:22:03 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > BTW, what does this series apply on?  I'm assuming[1], but I don't see
> > a branch from Jason yet.  Thanks,  
> 
> yes, this series is applied on [1]. I put the [1], this series and cdev series
> in https://github.com/yiliu1765/iommufd/commits/vfio_device_cdev_v9.
> 
> Jason has taken [1] in the below branch. It is based on rc1. So I hesitated
> to apply this series and cdev series on top of it. Maybe I should have done
> it to make life easier. 
Jason Gunthorpe April 3, 2023, 4:12 p.m. UTC | #5
On Mon, Apr 03, 2023 at 09:32:18AM -0600, Alex Williamson wrote:
> > yes, this series is applied on [1]. I put the [1], this series and cdev series
> > in https://github.com/yiliu1765/iommufd/commits/vfio_device_cdev_v9.
> > 
> > Jason has taken [1] in the below branch. It is based on rc1. So I hesitated
> > to apply this series and cdev series on top of it. Maybe I should have done
> > it to make life easier. 
Alex Williamson April 4, 2023, 10:20 p.m. UTC | #6
On Sat,  1 Apr 2023 07:44:29 -0700
Yi Liu <yi.l.liu@intel.com> wrote:

> for the users that accept device fds passed from management stacks to be
> able to figure out the host reset affected devices among the devices
> opened by the user. This is needed as such users do not have BDF (bus,
> devfn) knowledge about the devices it has opened, hence unable to use
> the information reported by existing VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
> to figure out the affected devices.
> 
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 58 ++++++++++++++++++++++++++++----
>  include/uapi/linux/vfio.h        | 24 ++++++++++++-
>  2 files changed, 74 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 19f5b075d70a..a5a7e148dce1 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -30,6 +30,7 @@
>  #if IS_ENABLED(CONFIG_EEH)
>  #include <asm/eeh.h>
>  #endif
> +#include <uapi/linux/iommufd.h>
>  
>  #include "vfio_pci_priv.h"
>  
> @@ -767,6 +768,20 @@ static int vfio_pci_get_irq_count(struct vfio_pci_core_device *vdev, int irq_typ
>  	return 0;
>  }
>  
> +static struct vfio_device *
> +vfio_pci_find_device_in_devset(struct vfio_device_set *dev_set,
> +			       struct pci_dev *pdev)
> +{
> +	struct vfio_device *cur;
> +
> +	lockdep_assert_held(&dev_set->lock);
> +
> +	list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
> +		if (cur->dev == &pdev->dev)
> +			return cur;
> +	return NULL;
> +}
> +
>  static int vfio_pci_count_devs(struct pci_dev *pdev, void *data)
>  {
>  	(*(int *)data)++;
> @@ -776,13 +791,20 @@ static int vfio_pci_count_devs(struct pci_dev *pdev, void *data)
>  struct vfio_pci_fill_info {
>  	int max;
>  	int cur;
> +	bool require_devid;
> +	struct iommufd_ctx *iommufd;
> +	struct vfio_device_set *dev_set;
>  	struct vfio_pci_dependent_device *devices;

Poor structure packing, move the bool to the end.

Nit, maybe just name it @devid.
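
The packing remark can be illustrated with a small user-space sketch. The
struct names below are hypothetical stand-ins, void * replaces the kernel
pointer types, and a typical ABI is assumed where pointers need stricter
alignment than bool. On such ABIs the total size may well come out the same
either way; what changes is that the padding moves from a hole in the middle
of the struct to its tail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Stand-in for the layout as posted: the bool sits before three pointers. */
struct fill_info_as_posted {
	int max;
	int cur;
	bool require_devid;
	void *iommufd;	/* stands in for struct iommufd_ctx * */
	void *dev_set;	/* stands in for struct vfio_device_set * */
	void *devices;	/* stands in for struct vfio_pci_dependent_device * */
};

/* Stand-in for the suggested layout: pointers first, bool last. */
struct fill_info_reordered {
	int max;
	int cur;
	void *iommufd;
	void *dev_set;
	void *devices;
	bool devid;
};

/* Padding bytes between the bool and the pointer that follows it. */
static size_t hole_after_bool_as_posted(void)
{
	size_t bool_end = offsetof(struct fill_info_as_posted, require_devid) +
			  sizeof(bool);

	assert(offsetof(struct fill_info_as_posted, iommufd) >= bool_end);
	return offsetof(struct fill_info_as_posted, iommufd) - bool_end;
}

/* Print both layouts so the hole and the tail padding can be compared. */
static void describe_layouts(void)
{
	printf("as posted: size %zu, hole after bool %zu\n",
	       sizeof(struct fill_info_as_posted),
	       hole_after_bool_as_posted());
	printf("reordered: size %zu, bool at offset %zu\n",
	       sizeof(struct fill_info_reordered),
	       offsetof(struct fill_info_reordered, devid));
}
```

Calling describe_layouts() from a test program prints the numbers for the
local ABI; pahole on the real kernel object would be the authoritative check.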

>  };
>  
>  static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
>  {
>  	struct vfio_pci_fill_info *fill = data;
> +	struct vfio_device_set *dev_set = fill->dev_set;
>  	struct iommu_group *iommu_group;
> +	struct vfio_device *vdev;
> +
> +	lockdep_assert_held(&dev_set->lock);
>  
>  	if (fill->cur == fill->max)
>  		return -EAGAIN; /* Something changed, try again */
> @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
>  	if (!iommu_group)
>  		return -EPERM; /* Cannot reset non-isolated devices */
>  
> -	fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
> +	if (fill->require_devid) {

Nit, @vdev could be scoped here.

> +		/*
> +		 * Report dev_id of the devices that are opened as cdev
> +		 * and have the same iommufd with the fill->iommufd.
> +		 * Otherwise, just fill IOMMUFD_INVALID_ID.
> +		 */
> +		vdev = vfio_pci_find_device_in_devset(dev_set, pdev);

I wish I had a better solution to this, but I don't.

> +		if (vdev && vfio_device_cdev_opened(vdev) &&
> +		    fill->iommufd == vfio_iommufd_physical_ictx(vdev))
> +			vfio_iommufd_physical_devid(vdev, &fill->devices[fill->cur].dev_id);

Long line, maybe a pointer to &fill->devices[fill->cur] would help.
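
One way the local-pointer suggestion could look is sketched below as a
user-space mock. Everything here is a stand-in: struct mock_device replaces
struct vfio_device, its fields replace vfio_device_cdev_opened(),
vfio_iommufd_physical_ictx() and vfio_iommufd_physical_devid(), and
EXAMPLE_INVALID_ID is a placeholder whose value is assumed. Only the shape
of the fill logic follows the patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder for IOMMUFD_INVALID_ID; the value 0 is an assumption. */
#define EXAMPLE_INVALID_ID	0u

/* Same layout as struct vfio_pci_dependent_device in the patch. */
struct dependent_device {
	union {
		uint32_t group_id;
		uint32_t dev_id;
	};
	uint16_t segment;
	uint8_t bus;
	uint8_t devfn;
};

/* Hypothetical stand-in for struct vfio_device. */
struct mock_device {
	bool cdev_opened;
	const void *ictx;	/* iommufd context the device is bound to */
	uint32_t devid;
};

struct fill_info {
	int max;
	int cur;
	const void *iommufd;
	struct dependent_device *devices;
	bool devid;
};

static int fill_one(struct fill_info *fill, const struct mock_device *vdev,
		    uint32_t group_id, uint16_t segment, uint8_t bus,
		    uint8_t devfn)
{
	struct dependent_device *info;

	if (fill->cur == fill->max)
		return -EAGAIN;	/* Something changed, try again */

	/* One pointer to the current slot keeps the lines below short. */
	info = &fill->devices[fill->cur];

	if (fill->devid) {
		if (vdev && vdev->cdev_opened && fill->iommufd == vdev->ictx)
			info->dev_id = vdev->devid;
		else
			info->dev_id = EXAMPLE_INVALID_ID;
	} else {
		info->group_id = group_id;
	}
	info->segment = segment;
	info->bus = bus;
	info->devfn = devfn;
	fill->cur++;
	return 0;
}
```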

> +		else
> +			fill->devices[fill->cur].dev_id = IOMMUFD_INVALID_ID;
> +	} else {
> +		fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
> +	}
>  	fill->devices[fill->cur].segment = pci_domain_nr(pdev->bus);
>  	fill->devices[fill->cur].bus = pdev->bus->number;
>  	fill->devices[fill->cur].devfn = pdev->devfn;
> @@ -1230,17 +1266,27 @@ static int vfio_pci_ioctl_get_pci_hot_reset_info(
>  		return -ENOMEM;
>  
>  	fill.devices = devices;
> +	fill.dev_set = vdev->vdev.dev_set;
>  
> +	mutex_lock(&vdev->vdev.dev_set->lock);
> +	if (vfio_device_cdev_opened(&vdev->vdev)) {
> +		fill.require_devid = true;
> +		fill.iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
> +	}

We can do this unconditionally:

	fill.devid = vfio_device_cdev_opened(&vdev->vdev);
	fill.iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);

Thanks,
Alex
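
A rough user-space sketch of why the unconditional form should behave the
same is given below. It assumes the fill struct starts out zeroed and that
the fill path consults the iommufd pointer only when the devid flag is set;
struct mock_device and struct fill_setup are hypothetical stand-ins, not
kernel types.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the queried or affected vfio device. */
struct mock_device {
	bool cdev_opened;
	const void *ictx;	/* may be non-NULL even without cdev */
};

struct fill_setup {
	const void *iommufd;
	bool devid;
};

/* Conditional setup, as in the posted patch. */
static struct fill_setup setup_as_posted(const struct mock_device *vdev)
{
	struct fill_setup fill = { NULL, false };

	if (vdev->cdev_opened) {
		fill.devid = true;
		fill.iommufd = vdev->ictx;
	}
	return fill;
}

/* Unconditional setup, as suggested in the review. */
static struct fill_setup setup_unconditional(const struct mock_device *vdev)
{
	struct fill_setup fill = { NULL, false };

	fill.devid = vdev->cdev_opened;
	fill.iommufd = vdev->ictx;
	return fill;
}

/* The fill path looks at iommufd only when devid is set. */
static bool reports_dev_id(const struct fill_setup *fill,
			   const struct mock_device *other)
{
	return fill->devid && other->cdev_opened &&
	       fill->iommufd == other->ictx;
}
```

The two setups can differ in the stored iommufd pointer for a device that
was not opened as a cdev, but in this sketch that difference is never
observed because devid is false in both.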

>  	ret = vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_fill_devs,
>  					    &fill, slot);
> +	mutex_unlock(&vdev->vdev.dev_set->lock);
>  
>  	/*
>  	 * If a device was removed between counting and filling, we may come up
>  	 * short of fill.max.  If a device was added, we'll have a return of
>  	 * -EAGAIN above.
>  	 */
> -	if (!ret)
> +	if (!ret) {
>  		hdr.count = fill.cur;
> +		if (fill.require_devid)
> +			hdr.flags = VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID;
> +	}
>  
>  reset_info_exit:
>  	if (copy_to_user(arg, &hdr, minsz))
> @@ -2346,12 +2392,10 @@ static bool vfio_dev_in_files(struct vfio_pci_core_device *vdev,
>  static int vfio_pci_is_device_in_set(struct pci_dev *pdev, void *data)
>  {
>  	struct vfio_device_set *dev_set = data;
> -	struct vfio_device *cur;
>  
> -	list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
> -		if (cur->dev == &pdev->dev)
> -			return 0;
> -	return -EBUSY;
> +	lockdep_assert_held(&dev_set->lock);
> +
> +	return vfio_pci_find_device_in_devset(dev_set, pdev) ? 0 : -EBUSY;
>  }
>  
>  /*
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 25432ef213ee..5a34364e3b94 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -650,11 +650,32 @@ enum {
>   * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 12,
>   *					      struct vfio_pci_hot_reset_info)
>   *
> + * This command is used to query the affected devices in the hot reset for
> + * a given device.  User could use the information reported by this command
> + * to figure out the affected devices among the devices it has opened.
> + * This command always reports the segment, bus and devfn information for
> + * each affected device, and selectively report the group_id or the dev_id
> + * per the way how the device being queried is opened.
> + *	- If the device is opened via the traditional group/container manner,
> + *	  this command reports the group_id for each affected device.
> + *
> + *	- If the device is opened as a cdev, this command needs to report
> + *	  dev_id for each affected device and set the
> + *	  VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID flag.  For the affected
> + *	  devices that are not opened as cdev or bound to different iommufds
> + *	  with the device that is queried, report an invalid dev_id to avoid
> + *	  potential dev_id conflict as dev_id is local to iommufd.  For such
> + *	  affected devices, user shall fall back to use the segment, bus and
> + *	  devfn info to map it to opened device.
> + *
>   * Return: 0 on success, -errno on failure:
>   *	-enospc = insufficient buffer, -enodev = unsupported for device.
>   */
>  struct vfio_pci_dependent_device {
> -	__u32	group_id;
> +	union {
> +		__u32   group_id;
> +		__u32	dev_id;
> +	};
>  	__u16	segment;
>  	__u8	bus;
>  	__u8	devfn; /* Use PCI_SLOT/PCI_FUNC */
> @@ -663,6 +684,7 @@ struct vfio_pci_dependent_device {
>  struct vfio_pci_hot_reset_info {
>  	__u32	argsz;
>  	__u32	flags;
> +#define VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID	(1 << 0)
>  	__u32	count;
>  	struct vfio_pci_dependent_device	devices[];
>  };
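
For illustration, the user-side matching described in the uapi comment above
might look roughly like the sketch below. It is not taken from the series:
struct opened_device and match_affected() are hypothetical, the structs are
local copies rather than the real uapi header, and EXAMPLE_INVALID_ID is a
placeholder whose value is assumed. Matching by group_id for the
group/container path is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID from the patch. */
#define EXAMPLE_FLAG_IOMMUFD_DEV_ID	(1u << 0)
/* Placeholder for IOMMUFD_INVALID_ID; the value 0 is an assumption. */
#define EXAMPLE_INVALID_ID		0u

/* Same layout as struct vfio_pci_dependent_device in the patch. */
struct dependent_device {
	union {
		uint32_t group_id;
		uint32_t dev_id;
	};
	uint16_t segment;
	uint8_t bus;
	uint8_t devfn;
};

/* Hypothetical per-device bookkeeping kept by the user. */
struct opened_device {
	uint32_t dev_id;	/* ID obtained when binding to the iommufd */
	bool have_bdf;		/* false if only an fd was handed over */
	uint16_t segment;
	uint8_t bus;
	uint8_t devfn;
};

/* Equivalent to the kernel's PCI_SLOT() and PCI_FUNC(). */
static unsigned int devfn_slot(uint8_t devfn) { return (devfn >> 3) & 0x1f; }
static unsigned int devfn_func(uint8_t devfn) { return devfn & 0x07; }

/* Return the index of the opened device matching dep, or -1 if none. */
static int match_affected(uint32_t flags, const struct dependent_device *dep,
			  const struct opened_device *opened, size_t n)
{
	/* dev_id is usable only if the flag is set and the ID is valid. */
	bool by_id = (flags & EXAMPLE_FLAG_IOMMUFD_DEV_ID) &&
		     dep->dev_id != EXAMPLE_INVALID_ID;
	size_t i;

	for (i = 0; i < n; i++) {
		const struct opened_device *o = &opened[i];

		if (by_id) {
			if (o->dev_id == dep->dev_id)
				return (int)i;
		} else if (o->have_bdf && o->segment == dep->segment &&
			   o->bus == dep->bus && o->devfn == dep->devfn) {
			/* Fall back to segment, bus and devfn. */
			return (int)i;
		}
	}
	return -1;
}
```

An affected device for which this returns -1 is one the user cannot map to
any device it has opened.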
Eric Auger April 5, 2023, 12:19 p.m. UTC | #7
Hi Yi,
On 4/1/23 16:44, Yi Liu wrote:
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 25432ef213ee..5a34364e3b94 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -650,11 +650,32 @@ enum {
>   * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 12,
>   *					      struct vfio_pci_hot_reset_info)
>   *
> + * This command is used to query the affected devices in the hot reset for
> + * a given device.  User could use the information reported by this command
> + * to figure out the affected devices among the devices it has opened.
> + * This command always reports the segment, bus and devfn information for
> + * each affected device, and selectively report the group_id or the dev_id
> + * per the way how the device being queried is opened.
> + *	- If the device is opened via the traditional group/container manner,
> + *	  this command reports the group_id for each affected device.
> + *
> + *	- If the device is opened as a cdev, this command needs to report
s/needs to report/reports
> + *	  dev_id for each affected device and set the
> + *	  VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID flag.  For the affected
> + *	  devices that are not opened as cdev or bound to different iommufds
> + *	  with the device that is queried, report an invalid dev_id to avoid
s/bound to different iommufds with the device that is queried/bound to
iommufds different from the reset device one?
> + *	  potential dev_id conflict as dev_id is local to iommufd.  For such
> + *	  affected devices, user shall fall back to use the segment, bus and
> + *	  devfn info to map it to opened device.
> + *
Eric
Yi Liu April 5, 2023, 2:04 p.m. UTC | #8
Hi Eric,

> From: Eric Auger <eric.auger@redhat.com>
> Sent: Wednesday, April 5, 2023 8:20 PM
> 
> Hi Yi,
> On 4/1/23 16:44, Yi Liu wrote:
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 25432ef213ee..5a34364e3b94 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -650,11 +650,32 @@ enum {
> >   * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 12,
> >   *					      struct vfio_pci_hot_reset_info)
> >   *
> > + * This command is used to query the affected devices in the hot reset for
> > + * a given device.  User could use the information reported by this command
> > + * to figure out the affected devices among the devices it has opened.
> > + * This command always reports the segment, bus and devfn information for
> > + * each affected device, and selectively report the group_id or the dev_id
> > + * per the way how the device being queried is opened.
> > + *	- If the device is opened via the traditional group/container manner,
> > + *	  this command reports the group_id for each affected device.
> > + *
> > + *	- If the device is opened as a cdev, this command needs to report
> s/needs to report/reports

got it.

> > + *	  dev_id for each affected device and set the
> > + *	  VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID flag.  For the affected
> > + *	  devices that are not opened as cdev or bound to different iommufds
> > + *	  with the device that is queried, report an invalid dev_id to avoid
> s/bound to different iommufds with the device that is queried/bound to
> iommufds different from the reset device one?

Hmm, I'm not a native speaker. This _INFO ioctl queries which devices
would be affected if the user wants to hot reset a given device, so "the
queried device" seems better to me. But I admit "the queried device" is
also "the reset device". Maybe Alex can help pick one.
Alex Williamson April 5, 2023, 4:25 p.m. UTC | #9
On Wed, 5 Apr 2023 14:04:51 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> Hi Eric,
> 
> > From: Eric Auger <eric.auger@redhat.com>
> > Sent: Wednesday, April 5, 2023 8:20 PM
> > 
> > Hi Yi,
> > On 4/1/23 16:44, Yi Liu wrote:  
> > > for the users that accept device fds passed from management stacks to be
> > > able to figure out the host reset affected devices among the devices
> > > opened by the user. This is needed as such users do not have BDF (bus,
> > > devfn) knowledge about the devices it has opened, hence unable to use
> > > the information reported by existing VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
> > > to figure out the affected devices.
> > >
> > > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> > > ---
> > >  drivers/vfio/pci/vfio_pci_core.c | 58 ++++++++++++++++++++++++++++----
> > >  include/uapi/linux/vfio.h        | 24 ++++++++++++-
> > >  2 files changed, 74 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> > > index 19f5b075d70a..a5a7e148dce1 100644
> > > --- a/drivers/vfio/pci/vfio_pci_core.c
> > > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > > @@ -30,6 +30,7 @@
> > >  #if IS_ENABLED(CONFIG_EEH)
> > >  #include <asm/eeh.h>
> > >  #endif
> > > +#include <uapi/linux/iommufd.h>
> > >
> > >  #include "vfio_pci_priv.h"
> > >
> > > @@ -767,6 +768,20 @@ static int vfio_pci_get_irq_count(struct  
> > vfio_pci_core_device *vdev, int irq_typ  
> > >  	return 0;
> > >  }
> > >
> > > +static struct vfio_device *
> > > +vfio_pci_find_device_in_devset(struct vfio_device_set *dev_set,
> > > +			       struct pci_dev *pdev)
> > > +{
> > > +	struct vfio_device *cur;
> > > +
> > > +	lockdep_assert_held(&dev_set->lock);
> > > +
> > > +	list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
> > > +		if (cur->dev == &pdev->dev)
> > > +			return cur;
> > > +	return NULL;
> > > +}
> > > +
> > >  static int vfio_pci_count_devs(struct pci_dev *pdev, void *data)
> > >  {
> > >  	(*(int *)data)++;
> > > @@ -776,13 +791,20 @@ static int vfio_pci_count_devs(struct pci_dev *pdev, void  
> > *data)  
> > >  struct vfio_pci_fill_info {
> > >  	int max;
> > >  	int cur;
> > > +	bool require_devid;
> > > +	struct iommufd_ctx *iommufd;
> > > +	struct vfio_device_set *dev_set;
> > >  	struct vfio_pci_dependent_device *devices;
> > >  };
> > >
> > >  static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
> > >  {
> > >  	struct vfio_pci_fill_info *fill = data;
> > > +	struct vfio_device_set *dev_set = fill->dev_set;
> > >  	struct iommu_group *iommu_group;
> > > +	struct vfio_device *vdev;
> > > +
> > > +	lockdep_assert_held(&dev_set->lock);
> > >
> > >  	if (fill->cur == fill->max)
> > >  		return -EAGAIN; /* Something changed, try again */
> > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void  
> > *data)  
> > >  	if (!iommu_group)
> > >  		return -EPERM; /* Cannot reset non-isolated devices */
> > >
> > > -	fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
> > > +	if (fill->require_devid) {
> > > +		/*
> > > +		 * Report dev_id of the devices that are opened as cdev
> > > +		 * and have the same iommufd with the fill->iommufd.
> > > +		 * Otherwise, just fill IOMMUFD_INVALID_ID.
> > > +		 */
> > > +		vdev = vfio_pci_find_device_in_devset(dev_set, pdev);
> > > +		if (vdev && vfio_device_cdev_opened(vdev) &&
> > > +		    fill->iommufd == vfio_iommufd_physical_ictx(vdev))
> > > +			vfio_iommufd_physical_devid(vdev, &fill->devices[fill-
> > >cur].dev_id);
> > > +		else
> > > +			fill->devices[fill->cur].dev_id = IOMMUFD_INVALID_ID;
> > > +	} else {
> > > +		fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
> > > +	}
> > >  	fill->devices[fill->cur].segment = pci_domain_nr(pdev->bus);
> > >  	fill->devices[fill->cur].bus = pdev->bus->number;
> > >  	fill->devices[fill->cur].devfn = pdev->devfn;
> > > @@ -1230,17 +1266,27 @@ static int vfio_pci_ioctl_get_pci_hot_reset_info(
> > >  		return -ENOMEM;
> > >
> > >  	fill.devices = devices;
> > > +	fill.dev_set = vdev->vdev.dev_set;
> > >
> > > +	mutex_lock(&vdev->vdev.dev_set->lock);
> > > +	if (vfio_device_cdev_opened(&vdev->vdev)) {
> > > +		fill.require_devid = true;
> > > +		fill.iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
> > > +	}
> > >  	ret = vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_fill_devs,
> > >  					    &fill, slot);
> > > +	mutex_unlock(&vdev->vdev.dev_set->lock);
> > >
> > >  	/*
> > >  	 * If a device was removed between counting and filling, we may come up
> > >  	 * short of fill.max.  If a device was added, we'll have a return of
> > >  	 * -EAGAIN above.
> > >  	 */
> > > -	if (!ret)
> > > +	if (!ret) {
> > >  		hdr.count = fill.cur;
> > > +		if (fill.require_devid)
> > > +			hdr.flags = VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID;
> > > +	}
> > >
> > >  reset_info_exit:
> > >  	if (copy_to_user(arg, &hdr, minsz))
> > > @@ -2346,12 +2392,10 @@ static bool vfio_dev_in_files(struct  
> > vfio_pci_core_device *vdev,  
> > >  static int vfio_pci_is_device_in_set(struct pci_dev *pdev, void *data)
> > >  {
> > >  	struct vfio_device_set *dev_set = data;
> > > -	struct vfio_device *cur;
> > >
> > > -	list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
> > > -		if (cur->dev == &pdev->dev)
> > > -			return 0;
> > > -	return -EBUSY;
> > > +	lockdep_assert_held(&dev_set->lock);
> > > +
> > > +	return vfio_pci_find_device_in_devset(dev_set, pdev) ? 0 : -EBUSY;
> > >  }
> > >
> > >  /*
> > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > index 25432ef213ee..5a34364e3b94 100644
> > > --- a/include/uapi/linux/vfio.h
> > > +++ b/include/uapi/linux/vfio.h
> > > @@ -650,11 +650,32 @@ enum {
> > >   * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 12,
> > >   *					      struct vfio_pci_hot_reset_info)
> > >   *
> > > + * This command is used to query the affected devices in the hot reset for
> > > + * a given device.  User could use the information reported by this command
> > > + * to figure out the affected devices among the devices it has opened.
> > > + * This command always reports the segment, bus and devfn information for
> > > + * each affected device, and selectively report the group_id or the dev_id
> > > + * per the way how the device being queried is opened.
> > > + *	- If the device is opened via the traditional group/container manner,
> > > + *	  this command reports the group_id for each affected device.
> > > + *
> > > + *	- If the device is opened as a cdev, this command needs to report  
> > s/needs to report/reports  
> 
> got it.
> 
> > > + *	  dev_id for each affected device and set the
> > > + *	  VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID flag.  For the affected
> > > + *	  devices that are not opened as cdev or bound to different iommufds
> > > + *	  with the device that is queried, report an invalid dev_id to avoid  
> > s/bound to different iommufds with the device that is queried/bound to
> > iommufds different from the reset device one?  
> 
> Hmm, I'm not a native speaker. This _INFO ioctl queries which devices
> would be affected if a given device were hot reset, so "the queried
> device" seems better. But I admit "the queried device" is also "the
> reset device". Maybe Alex can help pick one.
Jason Gunthorpe April 5, 2023, 4:37 p.m. UTC | #10
On Wed, Apr 05, 2023 at 10:25:45AM -0600, Alex Williamson wrote:

> But that kind of brings to light the question of what does the user do
> when they encounter this situation.

What does it do now when it encounters a group_id it doesn't
understand? Userspace already doesn't know if the foreign group is
open or not, right?

> reset can complete.  If the device is opened by a different user, the
> reset is blocked.  The only logical conclusion is that the user should
> try the reset regardless of the result of the info ioctl, which the

IMHO my suggested version is still the overall saner uAPI.

An info that basically returns success/fail if reset is security
authorized and information about the reset groupings.

Actual reset follows the returned groupings automatically.

Easy for qemu. Call the info at startup to confirm reset can be
emulated, use the returned information to propagate the reset groups
to the guest. Trigger the reset with no fuss when the guest asks for
it.

Less weird corner cases.

Jason
Alex Williamson April 5, 2023, 4:52 p.m. UTC | #11
On Wed, 5 Apr 2023 13:37:05 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Apr 05, 2023 at 10:25:45AM -0600, Alex Williamson wrote:
> 
> > But that kind of brings to light the question of what does the user do
> > when they encounter this situation.  
> 
> What does it do now when it encounters a group_id it doesn't
> understand? Userspace already doesn't know if the foreign group is
> open or not, right?

It's simple, there is currently no screwiness around opened devices.
If the caller doesn't own all the groups mapping to the affected
devices, hot-reset is not available.

> > reset can complete.  If the device is opened by a different user, the
> > reset is blocked.  The only logical conclusion is that the user should
> > try the reset regardless of the result of the info ioctl, which the  
> 
> IMHO my suggested version is still the overall saner uAPI.
> 
> An info that basically returns success/fail if reset is security
> authorized and information about the reset groupings.
> 
> Actual reset follows the returned groupings automatically.
> 
> Easy for qemu. Call the info at startup to confirm reset can be
> emulated, use the returned information to propagate the reset groups
> to the guest. Trigger the reset with no fuss when the guest asks for
> it.
> 
> Less weird corner cases.

This leads to scenarios where the info ioctl indicates a hot-reset is
initially available, perhaps only because one of the affected devices
was not opened at the time, and now it fails when QEMU actually tries
to use it.  In the group model, QEMU can know the set of affected
devices and the required groups, confirm it owns those, and for all
practical purposes guarantee that a hot-reset is available (yes, there
might be some exceptionally rare topology changes).

This goofiness around unopened devices and null-arrays is killing this
API.  Thanks,

Alex
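
The group-model check described in the message above (QEMU learns the affected devices and their groups from the info ioctl, then confirms it owns every one of those groups) can be sketched roughly as below. This is only an illustration: `struct dep_dev` is a hypothetical local mirror of one entry returned by the info ioctl, and `hot_reset_available()` and the `owned` array are names assumed for this sketch, not anything from QEMU or the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of one affected-device entry in group mode. */
struct dep_dev {
	uint32_t group_id;
	uint16_t segment;
	uint8_t bus;
	uint8_t devfn;
};

/* True if group_id appears in the caller's list of owned groups. */
static bool group_is_owned(uint32_t group_id,
			   const uint32_t *owned, size_t n_owned)
{
	for (size_t i = 0; i < n_owned; i++)
		if (owned[i] == group_id)
			return true;
	return false;
}

/*
 * Policy check: hot reset is treated as available only if every affected
 * device sits in a group the caller owns.
 */
static bool hot_reset_available(const struct dep_dev *devs, size_t n_devs,
				const uint32_t *owned, size_t n_owned)
{
	assert(devs != NULL || n_devs == 0);

	for (size_t i = 0; i < n_devs; i++)
		if (!group_is_owned(devs[i].group_id, owned, n_owned))
			return false;
	return true;
}
```

The point of the sketch is that the answer depends only on group ownership, which the caller controls, and not on whether some other affected device happens to be opened at the time of the query.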
Jason Gunthorpe April 5, 2023, 5:23 p.m. UTC | #12
On Wed, Apr 05, 2023 at 10:52:15AM -0600, Alex Williamson wrote:
> On Wed, 5 Apr 2023 13:37:05 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Apr 05, 2023 at 10:25:45AM -0600, Alex Williamson wrote:
> > 
> > > But that kind of brings to light the question of what does the user do
> > > when they encounter this situation.  
> > 
> > What does it do now when it encounters a group_id it doesn't
> > understand? Userspace already doesn't know if the foreign group is
> > open or not, right?
> 
> It's simple, there is currently no screwiness around opened devices.
> If the caller doesn't own all the groups mapping to the affected
> devices, hot-reset is not available.

That still has nasty edge cases. If the reset group spans beyond a
single iommu group you end up with qemu being unable to operate reset
at all, and it is unfixable from an API perspective as we can't pass
in groups that VFIO isn't going to use.

I think you are right, the fact we'd have to return -1 dev_ids to this
modified API is pretty damaging, it doesn't seem like a good
direction.

> This leads to scenarios where the info ioctl indicates a hot-reset is
> initially available, perhaps only because one of the affected devices
> was not opened at the time, and now it fails when QEMU actually tries
> to use it.

I would like it if the APIs toward the kernel were only about the
kernel's security apparatus. It makes it easier to reason about the
kernel side and gives nice simple well defined APIs.

This is a good point that qemu needs to make a policy decision if it
is happy about the VFIO configuration - but that is a policy decision
that should not become entangled with the kernel's security checks.

Today qemu can make this policy choice the same way it does right now
- call _INFO and check the group_ids. It gets the exact same outcome
as today. We already discussed that we need to expose the group ID
through an ioctl someplace.

If this is too awkward we could add a query to the kernel if the cdev
is "reset exclusive" - eg the iommufd covers all the groups that span
the reset set.

Jason
Eric Auger April 5, 2023, 5:58 p.m. UTC | #13
On 4/5/23 18:25, Alex Williamson wrote:
> On Wed, 5 Apr 2023 14:04:51 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
>
>> Hi Eric,
>>
>>> From: Eric Auger <eric.auger@redhat.com>
>>> Sent: Wednesday, April 5, 2023 8:20 PM
>>>
>>> Hi Yi,
>>> On 4/1/23 16:44, Yi Liu wrote:  
>>>> for the users that accept device fds passed from management stacks to be
>>>> able to figure out the host reset affected devices among the devices
>>>> opened by the user. This is needed as such users do not have BDF (bus,
>>>> devfn) knowledge about the devices it has opened, hence unable to use
>>>> the information reported by existing VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
>>>> to figure out the affected devices.
>>>>
>>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>>> ---
>>>>  drivers/vfio/pci/vfio_pci_core.c | 58 ++++++++++++++++++++++++++++----
>>>>  include/uapi/linux/vfio.h        | 24 ++++++++++++-
>>>>  2 files changed, 74 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>>>> index 19f5b075d70a..a5a7e148dce1 100644
>>>> --- a/drivers/vfio/pci/vfio_pci_core.c
>>>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>>>> @@ -30,6 +30,7 @@
>>>>  #if IS_ENABLED(CONFIG_EEH)
>>>>  #include <asm/eeh.h>
>>>>  #endif
>>>> +#include <uapi/linux/iommufd.h>
>>>>
>>>>  #include "vfio_pci_priv.h"
>>>>
>>>> @@ -767,6 +768,20 @@ static int vfio_pci_get_irq_count(struct vfio_pci_core_device *vdev, int irq_typ
>>>>  	return 0;
>>>>  }
>>>>
>>>> +static struct vfio_device *
>>>> +vfio_pci_find_device_in_devset(struct vfio_device_set *dev_set,
>>>> +			       struct pci_dev *pdev)
>>>> +{
>>>> +	struct vfio_device *cur;
>>>> +
>>>> +	lockdep_assert_held(&dev_set->lock);
>>>> +
>>>> +	list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
>>>> +		if (cur->dev == &pdev->dev)
>>>> +			return cur;
>>>> +	return NULL;
>>>> +}
>>>> +
>>>>  static int vfio_pci_count_devs(struct pci_dev *pdev, void *data)
>>>>  {
>>>>  	(*(int *)data)++;
>>>> @@ -776,13 +791,20 @@ static int vfio_pci_count_devs(struct pci_dev *pdev, void *data)
>>>>  struct vfio_pci_fill_info {
>>>>  	int max;
>>>>  	int cur;
>>>> +	bool require_devid;
>>>> +	struct iommufd_ctx *iommufd;
>>>> +	struct vfio_device_set *dev_set;
>>>>  	struct vfio_pci_dependent_device *devices;
>>>>  };
>>>>
>>>>  static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
>>>>  {
>>>>  	struct vfio_pci_fill_info *fill = data;
>>>> +	struct vfio_device_set *dev_set = fill->dev_set;
>>>>  	struct iommu_group *iommu_group;
>>>> +	struct vfio_device *vdev;
>>>> +
>>>> +	lockdep_assert_held(&dev_set->lock);
>>>>
>>>>  	if (fill->cur == fill->max)
>>>>  		return -EAGAIN; /* Something changed, try again */
>>>> @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
>>>>  	if (!iommu_group)
>>>>  		return -EPERM; /* Cannot reset non-isolated devices */
>>>>
>>>> -	fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
>>>> +	if (fill->require_devid) {
>>>> +		/*
>>>> +		 * Report dev_id of the devices that are opened as cdev
>>>> +		 * and have the same iommufd with the fill->iommufd.
>>>> +		 * Otherwise, just fill IOMMUFD_INVALID_ID.
>>>> +		 */
>>>> +		vdev = vfio_pci_find_device_in_devset(dev_set, pdev);
>>>> +		if (vdev && vfio_device_cdev_opened(vdev) &&
>>>> +		    fill->iommufd == vfio_iommufd_physical_ictx(vdev))
>>>> +			vfio_iommufd_physical_devid(vdev, &fill->devices[fill->cur].dev_id);
>>>> +		else
>>>> +			fill->devices[fill->cur].dev_id = IOMMUFD_INVALID_ID;
>>>> +	} else {
>>>> +		fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
>>>> +	}
>>>>  	fill->devices[fill->cur].segment = pci_domain_nr(pdev->bus);
>>>>  	fill->devices[fill->cur].bus = pdev->bus->number;
>>>>  	fill->devices[fill->cur].devfn = pdev->devfn;
>>>> @@ -1230,17 +1266,27 @@ static int vfio_pci_ioctl_get_pci_hot_reset_info(
>>>>  		return -ENOMEM;
>>>>
>>>>  	fill.devices = devices;
>>>> +	fill.dev_set = vdev->vdev.dev_set;
>>>>
>>>> +	mutex_lock(&vdev->vdev.dev_set->lock);
>>>> +	if (vfio_device_cdev_opened(&vdev->vdev)) {
>>>> +		fill.require_devid = true;
>>>> +		fill.iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
>>>> +	}
>>>>  	ret = vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_fill_devs,
>>>>  					    &fill, slot);
>>>> +	mutex_unlock(&vdev->vdev.dev_set->lock);
>>>>
>>>>  	/*
>>>>  	 * If a device was removed between counting and filling, we may come up
>>>>  	 * short of fill.max.  If a device was added, we'll have a return of
>>>>  	 * -EAGAIN above.
>>>>  	 */
>>>> -	if (!ret)
>>>> +	if (!ret) {
>>>>  		hdr.count = fill.cur;
>>>> +		if (fill.require_devid)
>>>> +			hdr.flags = VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID;
>>>> +	}
>>>>
>>>>  reset_info_exit:
>>>>  	if (copy_to_user(arg, &hdr, minsz))
>>>> @@ -2346,12 +2392,10 @@ static bool vfio_dev_in_files(struct vfio_pci_core_device *vdev,
>>>>  static int vfio_pci_is_device_in_set(struct pci_dev *pdev, void *data)
>>>>  {
>>>>  	struct vfio_device_set *dev_set = data;
>>>> -	struct vfio_device *cur;
>>>>
>>>> -	list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
>>>> -		if (cur->dev == &pdev->dev)
>>>> -			return 0;
>>>> -	return -EBUSY;
>>>> +	lockdep_assert_held(&dev_set->lock);
>>>> +
>>>> +	return vfio_pci_find_device_in_devset(dev_set, pdev) ? 0 : -EBUSY;
>>>>  }
>>>>
>>>>  /*
>>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>>> index 25432ef213ee..5a34364e3b94 100644
>>>> --- a/include/uapi/linux/vfio.h
>>>> +++ b/include/uapi/linux/vfio.h
>>>> @@ -650,11 +650,32 @@ enum {
>>>>   * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 12,
>>>>   *					      struct vfio_pci_hot_reset_info)
>>>>   *
>>>> + * This command is used to query the affected devices in the hot reset for
>>>> + * a given device.  User could use the information reported by this command
>>>> + * to figure out the affected devices among the devices it has opened.
The 'opened' terminology does not look sufficient here, because it is not
only a matter of the device being opened using cdev: it also needs to
have been bound to an iommufd, dev_id being the output of the
dev-iommufd binding.

By the way, I am now confused. What happens if the reset impacts some
devices which are not bound to an iommu ctx? Previously we returned the
iommu group, which always pre-exists, but now you will report an invalid id?
>>>> + * This command always reports the segment, bus and devfn information for
>>>> + * each affected device, and selectively report the group_id or the dev_id
>>>> + * per the way how the device being queried is opened.
>>>> + *	- If the device is opened via the traditional group/container manner,
>>>> + *	  this command reports the group_id for each affected device.
>>>> + *
>>>> + *	- If the device is opened as a cdev, this command needs to report  
>>> s/needs to report/reports  
>> got it.
>>
>>>> + *	  dev_id for each affected device and set the
>>>> + *	  VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID flag.  For the affected
>>>> + *	  devices that are not opened as cdev or bound to different iommufds
>>>> + *	  with the device that is queried, report an invalid dev_id to avoid  
or not bound at all
>>> s/bound to different iommufds with the device that is queried/bound to
>>> iommufds different from the reset device one?  
>> Hmm, I'm not a native speaker. This _INFO ioctl queries which devices
>> would be affected if a given device were hot reset, so "the queried
>> device" seems better. But I admit "the queried device" is also "the
>> reset device". Maybe Alex can help pick one.
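
To make the question about unbound devices concrete, here is a rough sketch of how a user might scan the dev_id-mode result for entries the kernel marked as invalid. It is only an illustration under stated assumptions: `struct dep_dev_id` is a hypothetical local mirror of one reported entry, `count_unidentified()` is a name made up for this sketch, and the numeric value of the invalid ID is passed in by the caller because this part of the thread does not show its definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of one affected-device entry in dev_id mode. */
struct dep_dev_id {
	uint32_t dev_id;
	uint16_t segment;
	uint8_t bus;
	uint8_t devfn;
};

/*
 * Count the affected devices reported with an invalid dev_id, i.e. devices
 * that are not opened as cdev, or are bound to a different iommufd (or to
 * none) relative to the queried device.
 */
static size_t count_unidentified(const struct dep_dev_id *devs, size_t n_devs,
				 uint32_t invalid_id)
{
	size_t n = 0;

	assert(devs != NULL || n_devs == 0);

	for (size_t i = 0; i < n_devs; i++)
		if (devs[i].dev_id == invalid_id)
			n++;
	return n;
}
```

A nonzero count tells the user only that some affected devices are outside its iommufd; it does not say whether they are unopened or owned by someone else, which is the ambiguity debated in the rest of the thread.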
Alex Williamson April 5, 2023, 6:56 p.m. UTC | #14
On Wed, 5 Apr 2023 14:23:43 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Apr 05, 2023 at 10:52:15AM -0600, Alex Williamson wrote:
> > On Wed, 5 Apr 2023 13:37:05 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Wed, Apr 05, 2023 at 10:25:45AM -0600, Alex Williamson wrote:
> > >   
> > > > But that kind of brings to light the question of what does the user do
> > > > when they encounter this situation.    
> > > 
> > > What does it do now when it encounters a group_id it doesn't
> > > understand? Userspace already doesn't know if the foreign group is
> > > open or not, right?  
> > 
> > It's simple, there is currently no screwiness around opened devices.
> > If the caller doesn't own all the groups mapping to the affected
> > devices, hot-reset is not available.  
> 
> That still has nasty edge cases. If the reset group spans beyond a
> single iommu group you end up with qemu being unable to operate reset
> at all, and it is unfixable from an API perspective as we can't pass
> in groups that VFIO isn't going to use.

Hmm, s/nasty/niche/?  Yes, QEMU currently has no way to own a group
without assigning a device from the group, but technically that could
be fixed within QEMU.  If QEMU doesn't own that affected group, then it
can't very well count on that group to not be used in some other way
when it comes time to actually do a hot-reset.
 
> I think you are right, the fact we'd have to return -1 dev_ids to this
> modified API is pretty damaging, it doesn't seem like a good
> direction.
> 
> > This leads to scenarios where the info ioctl indicates a hot-reset is
> > initially available, perhaps only because one of the affected devices
> > was not opened at the time, and now it fails when QEMU actually tries
> > to use it.  
> 
> I would like it if the APIs toward the kernel were only about the
> kernel's security apparatus. It makes it easier to reason about the
> kernel side and gives nice simple well defined APIs.

Usability needs to be a consideration as well.  An interface where the
result is effectively arbitrary from a user perspective because the
kernel is solely focused on whether the operation is allowed,
evaluating constraints that the user is unaware of and cannot control,
is unusable.

> This is a good point that qemu needs to make a policy decision if it
> is happy about the VFIO configuration - but that is a policy decision
> that should not become entangled with the kernel's security checks.
> 
> Today qemu can make this policy choice the same way it does right now
> - call _INFO and check the group_ids. It gets the exact same outcome
> as today. We already discussed that we need to expose the group ID
> through an ioctl someplace.

QEMU can make a policy decision today because the kernel provides a
sufficiently reliable interface, ie. based on the set of owned groups, a
hot-reset is all but guaranteed to work.  If we focus only on whether a
given reset is allowed from a kernel perspective and ignore that
userspace needs some predictability of the kernel behavior, then QEMU
cannot reasonably make that policy decision.

> If this is too awkward we could add a query to the kernel if the cdev
> is "reset exclusive" - eg the iommufd covers all the groups that span
> the reset set.

That's essentially what we have if there are valid dev-ids for each
affected device in the info ioctl.  I don't think it helps the user
experience to create loopholes where the hot-reset ioctl can still work
in spite of those missing devices.  The group interface uses the fact
that ownership of the group implies ownership of all devices within the
group such that the user only needs to prove group ownership.

But we still have underlying groups even with the cdev model, with the
same ownership principles, so don't we just need to prove group
ownership based on a device fd rather than a group fd?

For example, we have a VFIO_DEVICE_GET_INFO ioctl that supports
capability chains, we could add a capability that reports the group ID
for the device.  The hot-reset info ioctl remains as it is today,
reporting group-ids and bdfs.  The hot-reset ioctl itself is modified to
transparently support either group fds or device fds.  The user can now
map cdevs to group-ids and therefore follow the same rules as groups,
providing at least one representative device fd for each group.  We've
essentially already enabled this by allowing the limit of user provided
fds equal to the number of affected devices.

Does that work?  Thanks,

Alex
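
The userspace side of the proposal above (learn each cdev's group ID, then hand the hot-reset ioctl at least one representative device fd per affected group) could look roughly like this. It is a sketch under assumptions: the capability that reports a group ID for a device fd does not exist in the code under review, so `struct cdev_group` simply stands in for whatever it would return, and `pick_reset_fds()` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical pairing of an opened cdev fd with the group ID that the
 * proposed device-info capability would report for it.
 */
struct cdev_group {
	int fd;
	uint32_t group_id;
};

/*
 * For each distinct group ID in groups[] (as reported per affected device
 * by the info ioctl), pick one opened device fd from that group.
 * Returns the number of fds written to out_fds, or -1 if some affected
 * group has no opened device or out_fds is too small.
 */
static int pick_reset_fds(const uint32_t *groups, size_t n_groups,
			  const struct cdev_group *opened, size_t n_opened,
			  int *out_fds, size_t out_cap)
{
	size_t n_out = 0;

	assert(out_fds != NULL || out_cap == 0);

	for (size_t g = 0; g < n_groups; g++) {
		int seen = 0;
		int fd = -1;

		/* Skip group IDs already handled for an earlier device. */
		for (size_t p = 0; p < g; p++)
			if (groups[p] == groups[g])
				seen = 1;
		if (seen)
			continue;

		for (size_t i = 0; i < n_opened; i++) {
			if (opened[i].group_id == groups[g]) {
				fd = opened[i].fd;
				break;
			}
		}
		if (fd < 0 || n_out == out_cap)
			return -1;
		out_fds[n_out++] = fd;
	}
	return (int)n_out;
}
```

A failure return here corresponds to the case where the user cannot prove ownership of every affected group, so under the same rules as the group model it should not expect hot reset to be available.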
Alex Williamson April 5, 2023, 7:18 p.m. UTC | #15
On Wed, 5 Apr 2023 12:56:21 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Wed, 5 Apr 2023 14:23:43 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Apr 05, 2023 at 10:52:15AM -0600, Alex Williamson wrote:  
> > > On Wed, 5 Apr 2023 13:37:05 -0300
> > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >     
> > > > On Wed, Apr 05, 2023 at 10:25:45AM -0600, Alex Williamson wrote:
> > > >     
> > > > > But that kind of brings to light the question of what does the user do
> > > > > when they encounter this situation.      
> > > > 
> > > > What does it do now when it encounters a group_id it doesn't
> > > > understand? Userspace already doesn't know if the foreign group is
> > > > open or not, right?    
> > > 
> > > It's simple, there is currently no screwiness around opened devices.
> > > If the caller doesn't own all the groups mapping to the affected
> > > devices, hot-reset is not available.    
> > 
> > That still has nasty edge cases. If the reset group spans beyond a
> > single iommu group you end up with qemu being unable to operate reset
> > at all, and it is unfixable from an API perspective as we can't pass
> > in groups that VFIO isn't going to use.  
> 
> Hmm, s/nasty/niche/?  Yes, QEMU currently has no way to own a group
> without assigning a device from the group, but technically that could
> be fixed within QEMU.  If QEMU doesn't own that affected group, then it
> can't very well count on that group to not be used in some other way
> when it comes time to actually do a hot-reset.
>  
> > I think you are right, the fact we'd have to return -1 dev_ids to this
> > modified API is pretty damaging, it doesn't seem like a good
> > direction.
> >   
> > > This leads to scenarios where the info ioctl indicates a hot-reset is
> > > initially available, perhaps only because one of the affected devices
> > > was not opened at the time, and now it fails when QEMU actually tries
> > > to use it.    
> > 
> > I would like it if the APIs toward the kernel were only about the
> > kernel's security apparatus. It makes it easier to reason about the
> > kernel side and gives nice simple well defined APIs.  
> 
> Usability needs to be a consideration as well.  An interface where the
> result is effectively arbitrary from a user perspective because the
> kernel is solely focused on whether the operation is allowed,
> evaluating constraints that the user is unaware of and cannot control,
> is unusable.
> 
> > This is a good point that qemu needs to make a policy decision if it
> > is happy about the VFIO configuration - but that is a policy decision
> > that should not become entangled with the kernel's security checks.
> > 
> > Today qemu can make this policy choice the same way it does right now
> > - call _INFO and check the group_ids. It gets the exact same outcome
> > as today. We already discussed that we need to expose the group ID
> > through an ioctl someplace.  
> 
> QEMU can make a policy decision today because the kernel provides a
> sufficiently reliable interface, ie. based on the set of owned groups, a
> hot-reset is all but guaranteed to work.  If we focus only on whether a
> given reset is allowed from a kernel perspective and ignore that
> userspace needs some predictability of the kernel behavior, then QEMU
> cannot reasonably make that policy decision.
> 
> > If this is too awkward we could add a query to the kernel if the cdev
> > is "reset exclusive" - eg the iommufd covers all the groups that span
> > the reset set.  
> 
> That's essentially what we have if there are valid dev-ids for each
> affected device in the info ioctl.  I don't think it helps the user
> experience to create loopholes where the hot-reset ioctl can still work
> in spite of those missing devices.  The group interface uses the fact
> that ownership of the group implies ownership of all devices within the
> group such that the user only needs to prove group ownership.
> 
> But we still have underlying groups even with the cdev model, with the
> same ownership principles, so don't we just need to prove group
> ownership based on a device fd rather than a group fd?
> 
> For example, we have a VFIO_DEVICE_GET_INFO ioctl that supports
> capability chains, we could add a capability that reports the group ID
> for the device.  The hot-reset info ioctl remains as it is today,
> reporting group-ids and bdfs.  The hot-reset ioctl itself is modified to
> transparently support either group fds or device fds.  The user can now
> map cdevs to group-ids and therefore follow the same rules as groups,
> providing at least one representative device fd for each group.  We've
> essentially already enabled this by allowing the limit of user provided
> fds equal to the number of affected devices.

If I'm not mistaken, I think this resolves cdev no-iommu to work
equivalently to groups as well.  Thanks,

Alex
Jason Gunthorpe April 5, 2023, 7:21 p.m. UTC | #16
On Wed, Apr 05, 2023 at 12:56:21PM -0600, Alex Williamson wrote:
> Usability needs to be a consideration as well.  An interface where the
> result is effectively arbitrary from a user perspective because the
> kernel is solely focused on whether the operation is allowed,
> evaluating constraints that the user is unaware of and cannot control,
> is unusable.

Considering this API is only invoked by qemu we might be overdoing
this usability and 'no shoot in foot' view.

> > This is a good point that qemu needs to make a policy decision if it
> > is happy about the VFIO configuration - but that is a policy decision
> > that should not become entangled with the kernel's security checks.
> > 
> > Today qemu can make this policy choice the same way it does right now
> > - call _INFO and check the group_ids. It gets the exact same outcome
> > as today. We already discussed that we need to expose the group ID
> > through an ioctl someplace.
> 
> QEMU can make a policy decision today because the kernel provides a
> sufficiently reliable interface, ie. based on the set of owned groups, a
> hot-reset is all but guaranteed to work.  

And we don't change that with cdev. If qemu wants to make the policy
decision it keeps using the exact same _INFO interface to make that
decision the same way it always has.

We weaken the actual reset action to only consider the security side.

Applications that want this exclusive reset group policy simply must
check it on their own. It is a reasonable API design.

> > If this is too awkward we could add a query to the kernel if the cdev
> > is "reset exclusive" - eg the iommufd covers all the groups that span
> > the reset set.
> 
> That's essentially what we have if there are valid dev-ids for each
> affected device in the info ioctl.

If you have dev-ids for everything, yes. If you don't, then you can't
make the same policy choice using a dev-id interface.

> I don't think it helps the user experience to create loopholes where
> the hot-reset ioctl can still work in spite of those missing
> devices.

I disagree. The easy straightforward design is that the reset ioctl
works if the process has security permissions. Mixing a policy check
into the kernel on this path is creating complexity we don't really
need.

I don't view it as a loophole, it is flexibility to use the API in a
way that is different from what qemu wants - eg an app like dpdk may
be willing to tolerate a reset group that becomes unavailable after
startup. Who knows, why should we force this in the kernel?

> For example, we have a VFIO_DEVICE_GET_INFO ioctl that supports
> capability chains, we could add a capability that reports the group ID
> for the device.  

I was going to put that in an iommufd ioctl so it works with VDPA too,
but sure, let's assume we can get the group ID from a cdev fd.

> The hot-reset info ioctl remains as it is today, reporting group-ids
> and bdfs.

Sure, but userspace still needs to know how to map the reset sets into
dev-ids. Remember the reason we started doing this is because we don't
have easy access to the BDF anymore.

I like leaving this ioctl alone, let's go back to a dedicated ioctl to
return the dev_ids.

> The hot-reset ioctl itself is modified to transparently
> support either group fds or device fds.  The user can now map cdevs
> to group-ids and therefore follow the same rules as groups,
> providing at least one representative device fd for each group.

This looks like a very complex uapi compared to the empty list option,
but it seems like it would work.

Jason
Alex Williamson April 5, 2023, 7:49 p.m. UTC | #17
On Wed, 5 Apr 2023 16:21:09 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Apr 05, 2023 at 12:56:21PM -0600, Alex Williamson wrote:
> > Usability needs to be a consideration as well.  An interface where the
> > result is effectively arbitrary from a user perspective because the
> > kernel is solely focused on whether the operation is allowed,
> > evaluating constraints that the user is unaware of and cannot control,
> > is unusable.  
> 
> Considering this API is only invoked by qemu we might be overdoing
> this usability and 'no shoot in foot' view.

Ok, I'm not sure why we're diminishing the de facto vfio userspace...

> > > This is a good point that qemu needs to make a policy decision if it
> > > is happy about the VFIO configuration - but that is a policy decision
> > > that should not become entangled with the kernel's security checks.
> > > 
> > > Today qemu can make this policy choice the same way it does right now
> > > - call _INFO and check the group_ids. It gets the exact same outcome
> > > as today. We already discussed that we need to expose the group ID
> > > through an ioctl someplace.  
> > 
> > QEMU can make a policy decision today because the kernel provides a
> > sufficiently reliable interface, ie. based on the set of owned groups, a
> > hot-reset is all but guaranteed to work.    
> 
> And we don't change that with cdev. If qemu wants to make the policy
> decision it keeps using the exact same _INFO interface to make that
> decision the same way it always has.
> 
> We weaken the actual reset action to only consider the security side.
> 
> Applications that want this exclusive reset group policy simply must
> check it on their own. It is a reasonable API design.

I disagree.  As I've argued before, the info ioctl becomes so weak, and
so effectively arbitrary from a user perspective, at predicting whether
the hot-reset ioctl will work that it becomes useless, diminishing the
entire hot-reset info/execute API.

> > > If this is too awkward we could add a query to the kernel if the cdev
> > > is "reset exclusive" - eg the iommufd covers all the groups that span
> > > the reset set.  
> > 
> > That's essentially what we have if there are valid dev-ids for each
> > affected device in the info ioctl.  
> 
> If you have dev-ids for everything, yes. If you don't, then you can't
> make the same policy choice using a dev-id interface.

Exactly, you can't make any policy choice because the success or
failure of the hot-reset ioctl can't be known.

> > I don't think it helps the user experience to create loopholes where
> > the hot-reset ioctl can still work in spite of those missing
> > devices.  
> 
> I disagree. The easy straightforward design is that the reset ioctl
> works if the process has security permissions. Mixing a policy check
> into the kernel on this path is creating complexity we don't really
> need.
> 
> I don't view it as a loophole, it is flexibility to use the API in a
> way that is different from what qemu wants - eg an app like dpdk may
> be willing to tolerate a reset group that becomes unavailable after
> startup. Who knows, why should we force this in the kernel?

Because look at all the problems it's causing to try to introduce these
loopholes without also introducing subtle bugs.  There's an argument
that we're overly strict, which is better than the alternative, which
seems to be what we're dabbling with.  It is a straightforward
interface for the hot-reset ioctl to mirror the information provided
via the hot-reset info ioctl.

> > For example, we have a VFIO_DEVICE_GET_INFO ioctl that supports
> > capability chains, we could add a capability that reports the group ID
> > for the device.    
> 
> I was going to put that in an iommufd ioctl so it works with VDPA too,
> but sure, lets assume we can get the group ID from a cdev fd.
> 
> > The hot-reset info ioctl remains as it is today, reporting group-ids
> > and bdfs.  
> 
> Sure, but userspace still needs to know how to map the reset sets into
> dev-ids.

No, it doesn't. 

> Remember the reason we started doing this is because we don't
> have easy access to the BDF anymore.

We don't need it, the info ioctl provides the groups, the group
association can be learned from the DEVICE_GET_INFO ioctl, the
hot-reset ioctl only requires a single representative fd per affected
group.  dev-ids not required.

> I like leaving this ioctl alone, lets go back to a dedicated ioctl to
> return the dev_ids.

I don't see any justification for this.  We could add another PCI
specific DEVICE_GET_INFO capability to report the bdf if we really need
it, but reporting the group seems sufficient for this use case.

> > The hot-reset ioctl itself is modified to transparently
> > support either group fds or device fds.  The user can now map cdevs
> > to group-ids and therefore follow the same rules as groups,
> > providing at least one representative device fd for each group.  
> 
> This looks like a very complex uapi compared to the empty list option,
> but it seems like it would work.

It's the same API that we have now.  What's complex is trying to figure
out all the subtle side-effects from the loopholes that are being
proposed in this series.  Thanks,

Alex
Jason Gunthorpe April 5, 2023, 11:22 p.m. UTC | #18
On Wed, Apr 05, 2023 at 01:49:45PM -0600, Alex Williamson wrote:

> > > QEMU can make a policy decision today because the kernel provides a
> > > sufficiently reliable interface, ie. based on the set of owned groups, a
> > > hot-reset is all but guaranteed to work.    
> > 
> > And we don't change that with cdev. If qemu wants to make the policy
> > decision it keeps using the exact same _INFO interface to make that
> > decision same it has always made.
> > 
> > We weaken the actual reset action to only consider the security side.
> > 
> > Applications that want this exclusive reset group policy simply must
> > check it on their own. It is a reasonable API design.
> 
> I disagree, as I've argued before, the info ioctl becomes so weak and
> effectively arbitrary from a user perspective at being able to predict
> whether the hot-reset ioctl works that it becomes useless, diminishing
> the entire hot-reset info/execute API.

reset should be strictly more permissive than INFO. If INFO predicts
reset is permitted then reset should succeed.

We don't change INFO so it cannot "becomes so weak"  ??

We don't care about the cases where INFO says it will not succeed but
reset does (temporarily) succeed.

I don't get what argument you are trying to make or what you think is
diminished..

Again, userspace calls INFO, if info says yes then reset *always
works*, exactly just like today.

Userspace will call reset with a 0 length FD list and it uses a
security only check that is strictly more permissive than what
get_info will return. So the new check is simple in the kernel and
always works in the cases we need it to work.
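
As an illustration only (this is not code from the series), the claim that
the security-only check is strictly more permissive than the ownership rule
predicted by INFO can be sketched with a toy model. The struct and function
names below are hypothetical, and the model assumes group ownership is
exclusive, so a device in a group the caller owns cannot be opened by
anyone else.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one device in a reset set; not a kernel structure. */
struct model_dev {
	int group_id; /* iommu group the device belongs to */
	int opener;   /* 0 = not opened, otherwise id of the opening context */
};

/* Strict rule: the caller owns every group that has a device in the set. */
static bool strict_reset_allowed(const struct model_dev *devs, size_t n,
				 const int *owned, size_t n_owned)
{
	assert(devs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		bool found = false;

		for (size_t j = 0; j < n_owned; j++)
			if (owned[j] == devs[i].group_id)
				found = true;
		if (!found)
			return false;
	}
	return true;
}

/* Security-only rule: no opened device in the set belongs to someone else. */
static bool relaxed_reset_allowed(const struct model_dev *devs, size_t n,
				  int caller)
{
	assert(devs != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (devs[i].opener != 0 && devs[i].opener != caller)
			return false;
	return true;
}
```

With a set holding one device opened by the caller and one unopened device
in a group the caller does not own, the strict rule refuses while the
relaxed rule allows the reset. That is the case the two sides of this
thread disagree about.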

What is getting things into trouble is insisting that RESET have
additional restrictions beyond the minimum checks required for
security.

> > I don't view it as a loophole, it is flexibility to use the API in a
> > way that is different from what qemu wants - eg an app like dpdk may
> > be willing to tolerate a reset group that becomes unavailable after
> > startup. Who knows, why should we force this in the kernel?
> 
> Because look at all the problems it's causing to try to introduce these
> loopholes without also introducing subtle bugs.

These problems are coming from trying to do this integrated version,
not from my approach!

AFAICT there was nothing wrong with my original plan of using the
empty fd list for reset. What Yi has here is some mashup of what you
and I both suggested.

> > Remember the reason we started doing this is because we don't
> > have easy access to the BDF anymore.
> 
> We don't need it, the info ioctl provides the groups, the group
> association can be learned from the DEVICE_GET_INFO ioctl, the
> hot-reset ioctl only requires a single representative fd per affected
> group.  dev-ids not required.

I'm not talking about triggering the ioctl.

I'm talking about whatever else qemu needs to do so that the VM is
aware of the reset groups device-by-device on its side so nested VFIO
in the VM reflects the same data as the hypervisor. Maybe it doesn't
do this right now, but the kernel API should continue to provide the
data.

> > I like leaving this ioctl alone, lets go back to a dedicated ioctl to
> > return the dev_ids.
> 
> I don't see any justification for this.  We could add another PCI
> specific DEVICE_GET_INFO capability to report the bdf if we really need
> it, but reporting the group seems sufficient for this use case.

What I imagine is a single new ioctl 'get reset group 2' or something.
It returns a list of dev_ids in the reset group. It has an output flag
if the reset is reliable. This is the only ioctl user space needs to
call.

The reliable test is done by simply calling the ioctl and throwing
away the dev ids. The mapping of the VM's reset groups is done by
processing the dev_ids to vRIDs and flowing that into the VM somehow.

We don't expose group_ids, and we don't expose BDF. It is much simpler
and cleaner to use.

A BDF DEVICE_GET_INFO and the existing reset INFO will encode the same
data too, it is just not as elegant and requires userspace to do a lot
more work to keep track of the 3 different identifiers.
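
To make the idea concrete, here is a rough userspace-side sketch of how
such a result might be consumed. Everything in it is hypothetical: the
thread defines no structure, flag name, or vRID table, so
hyp_reset_group, HYP_RESET_GROUP_RELIABLE and vrid_entry are placeholders
for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Everything here is hypothetical; the thread defines no such uAPI. */
#define HYP_RESET_GROUP_RELIABLE (1u << 0)

struct hyp_reset_group {
	uint32_t flags;          /* output: HYP_RESET_GROUP_RELIABLE or 0 */
	uint32_t count;          /* output: number of entries in dev_ids */
	const uint32_t *dev_ids; /* output: dev_ids in the reset group */
};

struct vrid_entry {
	uint32_t dev_id; /* iommufd dev_id known to the VMM */
	uint16_t vrid;   /* virtual requester ID exposed to the guest */
};

/* The "reliable" test: look at the flag, ignore the dev_ids. */
static bool reset_is_reliable(const struct hyp_reset_group *rg)
{
	assert(rg != NULL);
	return (rg->flags & HYP_RESET_GROUP_RELIABLE) != 0;
}

/*
 * Translate each dev_id to a vRID. Returns the number of vRIDs written,
 * or -1 if a dev_id is unknown or the output buffer is too small.
 */
static int reset_group_to_vrids(const struct hyp_reset_group *rg,
				const struct vrid_entry *map, size_t n_map,
				uint16_t *out, size_t out_len)
{
	assert(rg != NULL && out != NULL);
	if (rg->count > out_len)
		return -1;
	for (uint32_t i = 0; i < rg->count; i++) {
		bool found = false;

		for (size_t j = 0; j < n_map; j++) {
			if (map[j].dev_id == rg->dev_ids[i]) {
				out[i] = map[j].vrid;
				found = true;
				break;
			}
		}
		if (!found)
			return -1;
	}
	return (int)rg->count;
}
```

The point of the sketch is that a single identifier (dev_id) serves both
purposes: the policy test never looks at the list, and the VM-facing
mapping never needs group_ids or BDFs.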

> > This looks like a very complex uapi compared to the empty list option,
> > but it seems like it would work.
>
> It's the same API that we have now.  What's complex is trying to figure
> out all the subtle side-effects from the loopholes that are being
> proposed in this series.  Thanks,

I might agree with you if we weren't now going backwards - 
ideas didn't work out and Yi has to throw stuff away. :(

Jason
Yi Liu April 6, 2023, 5:31 a.m. UTC | #19
Hi Eric,

> From: Eric Auger <eric.auger@redhat.com>
> Sent: Thursday, April 6, 2023 1:58 AM
[...]
> >>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >>>> index 25432ef213ee..5a34364e3b94 100644
> >>>> --- a/include/uapi/linux/vfio.h
> >>>> +++ b/include/uapi/linux/vfio.h
> >>>> @@ -650,11 +650,32 @@ enum {
> >>>>   * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 12,
> >>>>   *					      struct vfio_pci_hot_reset_info)
> >>>>   *
> >>>> + * This command is used to query the affected devices in the hot reset for
> >>>> + * a given device.  User could use the information reported by this command
> >>>> + * to figure out the affected devices among the devices it has opened.
> the 'opened' terminology does not look sufficient here because it is not
> only a matter of the device being opened using cdev but it also needs to
> have been bound to an iommufd, dev_id being the output of the
> dev-iommufd binding.
> 
> By the way I am now confused. What does happen if the reset impact some
> devices which are not bound to an iommu ctx. Previously we returned the
> iommu group which always pre-exists but now you will report invalid id?

For such devices, the user could use the bdf information to check whether
an affected device is opened by the user. If so, the user should do the
necessary preparation on the device before issuing the hot reset.

Regards,
Yi Liu

> >>>> + * This command always reports the segment, bus and devfn information for
> >>>> + * each affected device, and selectively report the group_id or the dev_id
> >>>> + * per the way how the device being queried is opened.
> >>>> + *	- If the device is opened via the traditional group/container manner,
> >>>> + *	  this command reports the group_id for each affected device.
> >>>> + *
> >>>> + *	- If the device is opened as a cdev, this command needs to report
> >>> s/needs to report/reports
> >> got it.
> >>
> >>>> + *	  dev_id for each affected device and set the
> >>>> + *	  VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID flag.  For the affected
> >>>> + *	  devices that are not opened as cdev or bound to different iommufds
> >>>> + *	  with the device that is queried, report an invalid dev_id to avoid
> or not bound at all
> >>> s/bound to different iommufds with the device that is queried/bound to
> >>> iommufds different from the reset device one?
> >> hmmm, I'm not a native speaker here. This _INFO is to query if want
> >> hot reset a given device, what devices would be affected. So it appears
> >> the queried device is better. But I'd admit "the queried device" is also
> >> "the reset device". may Alex help pick one. 
Yi Liu April 6, 2023, 6:34 a.m. UTC | #20
Hi Alex,

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Thursday, April 6, 2023 3:50 AM
> 
> On Wed, 5 Apr 2023 16:21:09 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Apr 05, 2023 at 12:56:21PM -0600, Alex Williamson wrote:
> > > Usability needs to be a consideration as well.  An interface where the
> > > result is effectively arbitrary from a user perspective because the
> > > kernel is solely focused on whether the operation is allowed,
> > > evaluating constraints that the user is unaware of and cannot control,
> > > is unusable.
> >
> > Considering this API is only invoked by qemu we might be overdoing
> > this usability and 'no shoot in foot' view.
> 
> Ok, I'm not sure why we're diminishing the de facto vfio userspace...
> 
> > > > This is a good point that qemu needs to make a policy decision if it
> > > > is happy about the VFIO configuration - but that is a policy decision
> > > > that should not become entangled with the kernel's security checks.
> > > >
> > > > Today qemu can make this policy choice the same way it does right now
> > > > - call _INFO and check the group_ids. It gets the exact same outcome
> > > > as today. We already discussed that we need to expose the group ID
> > > > through an ioctl someplace.
> > >
> > > QEMU can make a policy decision today because the kernel provides a
> > > sufficiently reliable interface, ie. based on the set of owned groups, a
> > > hot-reset is all but guaranteed to work.
> >
> > And we don't change that with cdev. If qemu wants to make the policy
> > decision it keeps using the exact same _INFO interface to make that
> > decision same it has always made.
> >
> > We weaken the actual reset action to only consider the security side.
> >
> > Applications that want this exclusive reset group policy simply must
> > check it on their own. It is a reasonable API design.
> 
> I disagree, as I've argued before, the info ioctl becomes so weak and
> effectively arbitrary from a user perspective at being able to predict
> whether the hot-reset ioctl works that it becomes useless, diminishing
> the entire hot-reset info/execute API.
> 
> > > > If this is too awkward we could add a query to the kernel if the cdev
> > > > is "reset exclusive" - eg the iommufd covers all the groups that span
> > > > the reset set.
> > >
> > > That's essentially what we have if there are valid dev-ids for each
> > > affected device in the info ioctl.
> >
> > If you have dev-ids for everything, yes. If you don't, then you can't
> > make the same policy choice using a dev-id interface.
> 
> Exactly, you can't make any policy choice because the success or
> failure of the hot-reset ioctl can't be known.

Could you elaborate a bit on what the policy is here? As far as I know,
QEMU makes use of the information reported by _INFO to check:
- if all the affected groups are owned by the current QEMU[1]
- if the affected devices are opened by the current QEMU; if yes, QEMU
  needs to use vfio_pci_pre_reset() to do preparation before issuing
  hot reset[1]

[1] vfio_pci_hot_reset() in https://github.com/qemu/qemu/blob/master/hw/vfio/pci.c

> > > I don't think it helps the user experience to create loopholes where
> > > the hot-reset ioctl can still work in spite of those missing
> > > devices.
> >
> > I disagree. The easy straightforward design is that the reset ioctl
> > works if the process has security permissions. Mixing a policy check
> > into the kernel on this path is creating complexity we don't really
> > need.
> >
> > > I don't view it as a loophole, it is flexibility to use the API in a
> > way that is different from what qemu wants - eg an app like dpdk may
> > be willing to tolerate a reset group that becomes unavailable after
> > startup. Who knows, why should we force this in the kernel?
> 
> Because look at all the problems it's causing to try to introduce these
> loopholes without also introducing subtle bugs.  There's an argument
> that we're overly strict, which is better than the alternative, which
> seems to be what we're dabbling with.  It is a straightforward
> interface for the hot-reset ioctl to mirror the information provided
> via the hot-reset info ioctl.
> 
> > > For example, we have a VFIO_DEVICE_GET_INFO ioctl that supports
> > > capability chains, we could add a capability that reports the group ID
> > > for the device.
> >
> > I was going to put that in an iommufd ioctl so it works with VDPA too,
> > but sure, lets assume we can get the group ID from a cdev fd.
> >
> > > The hot-reset info ioctl remains as it is today, reporting group-ids
> > > and bdfs.
> >
> > Sure, but userspace still needs to know how to map the reset sets into
> > dev-ids.
> 
> No, it doesn't.
> 
> > Remember the reason we started doing this is because we don't
> > have easy access to the BDF anymore.
> 
> We don't need it, the info ioctl provides the groups, the group
> association can be learned from the DEVICE_GET_INFO ioctl, the
> hot-reset ioctl only requires a single representative fd per affected
> group.  dev-ids not required.
> 
> > I like leaving this ioctl alone, lets go back to a dedicated ioctl to
> > return the dev_ids.
> 
> I don't see any justification for this.  We could add another PCI
> specific DEVICE_GET_INFO capability to report the bdf if we really need
> it, but reporting the group seems sufficient for this use case.

IMHO, knowing the group may not be enough. Take QEMU as an example.
QEMU not only needs to ensure the group is owned by it, it also needs to
do preparation on the devices that are already in use and affected by
the hot reset of a newly opened device. If there is only group knowledge,
QEMU may blindly prepare all the devices that are already opened and
belong to the same iommu group. But as I understood from the discussion,
an iommu group is not equal to the hot reset scope (a.k.a. dev_set), is it?
It is possible that devices in an iommu_group span multiple hot
reset scopes. In such a case, getting the bdf info from the cdev fd is
necessary.

Regards,
Yi Liu
Yi Liu April 6, 2023, 10:02 a.m. UTC | #21
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, April 6, 2023 7:23 AM
> 
> On Wed, Apr 05, 2023 at 01:49:45PM -0600, Alex Williamson wrote:
> 
> > > > QEMU can make a policy decision today because the kernel provides a
> > > > sufficiently reliable interface, ie. based on the set of owned groups, a
> > > > hot-reset is all but guaranteed to work.
> > >
> > > And we don't change that with cdev. If qemu wants to make the policy
> > > decision it keeps using the exact same _INFO interface to make that
> > > decision same it has always made.
> > >
> > > We weaken the actual reset action to only consider the security side.
> > >
> > > Applications that want this exclusive reset group policy simply must
> > > check it on their own. It is a reasonable API design.
> >
> > I disagree, as I've argued before, the info ioctl becomes so weak and
> > effectively arbitrary from a user perspective at being able to predict
> > whether the hot-reset ioctl works that it becomes useless, diminishing
> > the entire hot-reset info/execute API.
> 
> reset should be strictly more permissive than INFO. If INFO predicts
> reset is permitted then reset should succeed.
> 
> We don't change INFO so it cannot "becomes so weak"  ??
> 
> We don't care about the cases where INFO says it will not succeed but
> reset does (temporarily) succeed.
> 
> I don't get what argument you are trying to make or what you think is
> diminished..
> 
> Again, userspace calls INFO, if info says yes then reset *always
> works*, exactly just like today.
>
> Userspace will call reset with a 0 length FD list and it uses a
> security only check that is strictly more permissive than what
> get_info will return. So the new check is simple in the kernel and
> always works in the cases we need it to work.
> 
> What is getting things into trouble is insisting that RESET have
> additional restrictions beyond the minimum checks required for
> security.
> 
> > > > I don't view it as a loophole, it is flexibility to use the API in a
> > > way that is different from what qemu wants - eg an app like dpdk may
> > > be willing to tolerate a reset group that becomes unavailable after
> > > startup. Who knows, why should we force this in the kernel?
> >
> > Because look at all the problems it's causing to try to introduce these
> > loopholes without also introducing subtle bugs.
> 
> These problems are coming from trying to do this integrated version,
> not from my approach!
> 
> AFAICT there was nothing wrong with my original plan of using the
> empty fd list for reset. What Yi has here is some mashup of what you
> and I both suggested.

Hi Alex, Jason,

That could be the reason. Let me try to summarize the changes this series
makes and their impact, as far as I know.

1) Only check the ownership of opened devices in the dev_set
     in the HOT_RESET ioctl.
     - Impact: it changes the relationship between _INFO and HOT_RESET.
       Per "Each group must have IOMMU protection established for the
       ioctl to succeed." in [1], the existing design means userspace
       should own all the affected groups before attempting HOT_RESET.
       With the change here, the user does not need to ensure all affected
       groups are opened; hot reset can succeed as long as the remaining
       devices in the affected groups are simply unopened and can be reset.
    
       [1] https://patchwork.kernel.org/project/linux-pci/patch/20130814200845.21923.64284.stgit@bling.home/

2) Allow passing a zero-length fd array to do hot reset
    - Impact: this uses the iommufd for the ownership check on the kernel
      side. It is only meant for users that open cdevs, not for users that
      open groups. The drawback is that it cannot cover noiommu devices, as
      noiommu does not use iommufd at all. But it works well for
      most cases.

3) Allow hot reset to succeed when the dev_set is a singleton
     - Impact: this makes sense, but it seems to blur the boundary between
     the group path and the cdev path w.r.t. the zero-length fd approach.
     The group path can succeed at hot reset even when passing an empty
     fd array, if the dev_set happens to be a singleton.

4) Allow passing device fds to do hot reset
    - Impact: this is a new way to do hot reset. It should have no impact.

5) Extend _INFO to report the devid
    - Impact: this changes how the user decodes the reported info.
    Either devid or groupid is returned, depending on how the queried device
    is opened. It was suggested that we support the scenario in which some
    devices are opened via cdev while others are opened via group. This makes
    us return invalid_devid for a device that is opened via group if
    it is affected by the hot reset of a device that is opened via cdev.

    This was proposed to support the future device fd passing usage, which is
    only available in the cdev path.

To me the major confusion is from 1) and 3). 1) changes the meaning of
_INFO and HOT_RESET, while 3) messes up the boundary.

Here is my thought:

For 1), it was proposed for the reason in [2]. We'd like a scenario
that works in the group path to work in the cdev path as well. But IMHO, we
may just accept that the cdev path cannot support such a scenario, to avoid
a subtle change to the uapi. Otherwise, we need another HOT_RESET ioctl or a
hint in the HOT_RESET ioctl to tell the kernel whether a relaxed ownership
check is expected. Maybe this is awkward. But if we want to keep it, we
should do it with the user's awareness.

[2] https://lore.kernel.org/kvm/Y%2FdobS6gdSkxnPH7@nvidia.com/

For 3), it was proposed when discussing hot reset for noiommu[3]. But
it does not make hot reset always work for noiommu in cdev, only when
the dev_set is a singleton. So it is more of a general optimization that
lets the kernel skip the ownership check. But to make use of it, we may
need to test for it before sanitizing the group fds from the user or doing
the iommufd check. Maybe the dev_set singleton test in this series is not
well placed. If so, I can modify it further.

[3] https://lore.kernel.org/kvm/ZACX+Np%2FIY7ygqL5@nvidia.com/
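
A minimal sketch of the ordering suggested for 3), assuming the singleton
test would run before either ownership check. The enum and function names
are hypothetical and do not come from the series; the sketch only shows
which path would be selected, not the checks themselves.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for the three ownership paths discussed above. */
enum ownership_path {
	PATH_SINGLETON_SKIP, /* dev_set has one device: nothing else is affected */
	PATH_IOMMUFD_CHECK,  /* zero-length fd array: check via iommufd */
	PATH_FD_ARRAY_CHECK, /* non-empty fd array: validate the passed fds */
};

static enum ownership_path pick_ownership_path(size_t dev_set_count,
						size_t fd_count)
{
	assert(dev_set_count >= 1); /* the reset device itself is in the set */
	if (dev_set_count == 1)
		return PATH_SINGLETON_SKIP;
	if (fd_count == 0)
		return PATH_IOMMUFD_CHECK;
	return PATH_FD_ARRAY_CHECK;
}
```

With this ordering a singleton dev_set takes the shortcut whether or not
fds were passed, which is also why a group-path user passing an empty array
would succeed in that case.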

Regards,
Yi Liu

> 
> > > Remember the reason we started doing this is because we don't
> > > have easy access to the BDF anymore.
> >
> > We don't need it, the info ioctl provides the groups, the group
> > association can be learned from the DEVICE_GET_INFO ioctl, the
> > hot-reset ioctl only requires a single representative fd per affected
> > group.  dev-ids not required.
> 
> I'm not talking about triggering the ioctl.
> 
> I'm talking about whatever else qemu needs to do so that the VM is
> aware of the reset groups device-by-device on its side so nested VFIO
> in the VM reflects the same data as the hypervisor. Maybe it doesn't
> do this right now, but the kernel API should continue to provide the
> data.
> 
> > > I like leaving this ioctl alone, lets go back to a dedicated ioctl to
> > > return the dev_ids.
> >
> > I don't see any justification for this.  We could add another PCI
> > specific DEVICE_GET_INFO capability to report the bdf if we really need
> > it, but reporting the group seems sufficient for this use case.
> 
> What I imagine is a single new ioctl 'get reset group 2' or something.
> It returns a list of dev_ids in the reset group. It has an output flag
> if the reset is reliable. This is the only ioctl user space needs to
> call.
> 
> The reliable test is done by simply calling the ioctl and throwing
> away the dev ids. The mapping of the VM's reset groups is done by
> processing the dev_ids to vRIDs and flowing that into the VM somehow.
> 
> We don't expose group_ids, and we don't expose BDF. It is much simpler
> and cleaner to use.
> 
> A BDF DEVICE_GET_INFO and the existing reset INFO will encode the same
> data too, it is just not as elegant and requires userspace to do a lot
> more work to keep track of the 3 different identifiers.
> 
> > > This looks like a very complex uapi compared to the empty list option,
> > > but it seems like it would work.
> >
> > It's the same API that we have now.  What's complex is trying to figure
> > out all the subtle side-effects from the loopholes that are being
> > proposed in this series.  Thanks,
> 
> I might agree with you if we weren't now going backwards -
> ideas didn't work out and Yi has to throw stuff away. :(
> 
> Jason
Alex Williamson April 6, 2023, 5:07 p.m. UTC | #22
On Thu, 6 Apr 2023 06:34:08 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> Hi Alex,
> 
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Thursday, April 6, 2023 3:50 AM
> > 
> > On Wed, 5 Apr 2023 16:21:09 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Wed, Apr 05, 2023 at 12:56:21PM -0600, Alex Williamson wrote:  
> > > > Usability needs to be a consideration as well.  An interface where the
> > > > result is effectively arbitrary from a user perspective because the
> > > > kernel is solely focused on whether the operation is allowed,
> > > > evaluating constraints that the user is unaware of and cannot control,
> > > > is unusable.  
> > >
> > > Considering this API is only invoked by qemu we might be overdoing
> > > this usability and 'no shoot in foot' view.  
> > 
> > Ok, I'm not sure why we're diminishing the de facto vfio userspace...
> >   
> > > > > This is a good point that qemu needs to make a policy decision if it
> > > > > is happy about the VFIO configuration - but that is a policy decision
> > > > > that should not become entangled with the kernel's security checks.
> > > > >
> > > > > Today qemu can make this policy choice the same way it does right now
> > > > > - call _INFO and check the group_ids. It gets the exact same outcome
> > > > > as today. We already discussed that we need to expose the group ID
> > > > > through an ioctl someplace.  
> > > >
> > > > QEMU can make a policy decision today because the kernel provides a
> > > > sufficiently reliable interface, ie. based on the set of owned groups, a
> > > > hot-reset is all but guaranteed to work.  
> > >
> > > And we don't change that with cdev. If qemu wants to make the policy
> > > decision it keeps using the exact same _INFO interface to make that
> > > decision same it has always made.
> > >
> > > We weaken the actual reset action to only consider the security side.
> > >
> > > Applications that want this exclusive reset group policy simply must
> > > check it on their own. It is a reasonable API design.  
> > 
> > I disagree, as I've argued before, the info ioctl becomes so weak and
> > effectively arbitrary from a user perspective at being able to predict
> > whether the hot-reset ioctl works that it becomes useless, diminishing
> > the entire hot-reset info/execute API.
> >   
> > > > > If this is too awkward we could add a query to the kernel if the cdev
> > > > > is "reset exclusive" - eg the iommufd covers all the groups that span
> > > > > the reset set.  
> > > >
> > > > That's essentially what we have if there are valid dev-ids for each
> > > > affected device in the info ioctl.  
> > >
> > > If you have dev-ids for everything, yes. If you don't, then you can't
> > > make the same policy choice using a dev-id interface.  
> > 
> > Exactly, you can't make any policy choice because the success or
> > failure of the hot-reset ioctl can't be known.  
> 
> could you elaborate a bit about what the policy is here. As far as I know,
> QEMU makes use of the information reported by _INFO to check:
> - if all the affected groups are owned by the current QEMU[1]
> - if the affected devices are opened by the current QEMU, if yes, QEMU
>   needs to use vfio_pci_pre_reset() to do preparation before issuing
>   hot reset[1]
> 
> [1] vfio_pci_hot_reset() in https://github.com/qemu/qemu/blob/master/hw/vfio/pci.c

Regarding the policy decisions, look for instance at the distinction
between vfio_pci_hot_reset_one() vs vfio_pci_hot_reset_multi(), or the
way QEMU will opt for a bus reset if it believes only a PM reset is
available.

In my proposal, I did miss that having _INFO report the group and bdf
allows QEMU to associate fd-passed devices with a group affected by the
reset, but not to determine specifically whether the device is affected
by the reset.  I think that would be justification for capabilities on the
DEVICE_GET_INFO ioctl to report both the group and PCI address as
separate capabilities.
 
> > > > I don't think it helps the user experience to create loopholes where
> > > > the hot-reset ioctl can still work in spite of those missing
> > > > devices.  
> > >
> > > I disagree. The easy straightforward design is that the reset ioctl
> > > works if the process has security permissions. Mixing a policy check
> > > into the kernel on this path is creating complexity we don't really
> > > need.
> > >
> > > > I don't view it as a loophole, it is flexibility to use the API in a
> > > way that is different from what qemu wants - eg an app like dpdk may
> > > be willing to tolerate a reset group that becomes unavailable after
> > > startup. Who knows, why should we force this in the kernel?  
> > 
> > Because look at all the problems it's causing to try to introduce these
> > loopholes without also introducing subtle bugs.  There's an argument
> > that we're overly strict, which is better than the alternative, which
> > seems to be what we're dabbling with.  It is a straightforward
> > interface for the hot-reset ioctl to mirror the information provided
> > via the hot-reset info ioctl.
> >   
> > > > For example, we have a VFIO_DEVICE_GET_INFO ioctl that supports
> > > > capability chains, we could add a capability that reports the group ID
> > > > for the device.  
> > >
> > > I was going to put that in an iommufd ioctl so it works with VDPA too,
> > > but sure, lets assume we can get the group ID from a cdev fd.
> > >  
> > > > The hot-reset info ioctl remains as it is today, reporting group-ids
> > > > and bdfs.  
> > >
> > > Sure, but userspace still needs to know how to map the reset sets into
> > > dev-ids.  
> > 
> > No, it doesn't.
> >   
> > > Remember the reason we started doing this is because we don't
> > > have easy access to the BDF anymore.  
> > 
> > We don't need it, the info ioctl provides the groups, the group
> > association can be learned from the DEVICE_GET_INFO ioctl, the
> > hot-reset ioctl only requires a single representative fd per affected
> > group.  dev-ids not required.
> >   
> > > I like leaving this ioctl alone, lets go back to a dedicated ioctl to
> > > return the dev_ids.  
> > 
> > I don't see any justification for this.  We could add another PCI
> > specific DEVICE_GET_INFO capability to report the bdf if we really need
> > it, but reporting the group seems sufficient for this use case.  
> 
> IMHO, the knowledge of group may be not enough. Take QEMU as an example.
> QEMU not only needs to ensure the group is owned by it, it also needs to
> do preparation on the devices that are already in use and affected by
> the hot reset on a new opened device. If there is only group knowledge,
> QEMU may blindly prepares all the devices that are already opened and
> belong to the same iommu group. But as I got in the discussion iommu
> group is not equal to hot reset scope (a.k.a. dev_set). is it? It is
> possible that devices in an iommu_group may span into multiple hot
> reset scope. For such case, get bdf info from cdev fd is necessary.

Yes, you're correct, group and reset scope are not equivalent, so we'd
require a means to get both the group and the bdf for the device.
Knowing the bdf allows the user to know which opened devices are
directly affected by the reset, knowing the group allows the user to
know if ancillary affected devices are within the set of groups the
user owns and therefore effectively under their purview.  Thanks,

Alex
Alex Williamson April 6, 2023, 5:53 p.m. UTC | #23
On Thu, 6 Apr 2023 10:02:10 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Thursday, April 6, 2023 7:23 AM
> > 
> > On Wed, Apr 05, 2023 at 01:49:45PM -0600, Alex Williamson wrote:
> >   
> > > > > QEMU can make a policy decision today because the kernel provides a
> > > > > sufficiently reliable interface, ie. based on the set of owned groups, a
> > > > > hot-reset is all but guaranteed to work.  
> > > >
> > > > And we don't change that with cdev. If qemu wants to make the policy
> > > > decision it keeps using the exact same _INFO interface to make that
> > > > decision same it has always made.
> > > >
> > > > We weaken the actual reset action to only consider the security side.
> > > >
> > > > Applications that want this exclusive reset group policy simply must
> > > > check it on their own. It is a reasonable API design.  
> > >
> > > I disagree, as I've argued before, the info ioctl becomes so weak and
> > > effectively arbitrary from a user perspective at being able to predict
> > > whether the hot-reset ioctl works that it becomes useless, diminishing
> > > the entire hot-reset info/execute API.  
> > 
> > reset should be strictly more permissive than INFO. If INFO predicts
> > reset is permitted then reset should succeed.
> > 
> > We don't change INFO so it cannot "becomes so weak"  ??
> > 
> > We don't care about the cases where INFO says it will not succeed but
> > reset does (temporarily) succeed.
> > 
> > I don't get what argument you are trying to make or what you think is
> > diminished..
> > 
> > Again, userspace calls INFO, if info says yes then reset *always
> > works*, exactly just like today.
> >
> > Userspace will call reset with a 0 length FD list and it uses a
> > security only check that is strictly more permissive than what
> > get_info will return. So the new check is simple in the kernel and
> > always works in the cases we need it to work.
> > 
> > What is getting things into trouble is insisting that RESET have
> > additional restrictions beyond the minimum checks required for
> > security.
> >   
> > > > I don't view it as a loophole, it is flexibility to use the API in a
> > > > way that is different from what qemu wants - eg an app like dpdk may
> > > > be willing to tolerate a reset group that becomes unavailable after
> > > > startup. Who knows, why should we force this in the kernel?  
> > >
> > > Because look at all the problems it's causing to try to introduce these
> > > loopholes without also introducing subtle bugs.  
> > 
> > These problems are coming from trying to do this integrated version,
> > not from my approach!
> > 
> > AFAICT there was nothing wrong with my original plan of using the
> > empty fd list for reset. What Yi has here is some mashup of what you
> > and I both suggested.  
> 
> Hi Alex, Jason,
> 
> That could be the reason. So let me try to summarize the changes this
> series makes and their impact, as far as I know.
> 
> 1) only check the ownership of opened devices in the dev_set
>      in HOT_RESET ioctl.
>      - Impact: it changes the relationship between _INFO  and HOT_RESET.
>        As " Each group must have IOMMU protection established for the
>        ioctl to succeed." in [1], existing design actually means userspace
>        should own all the affected groups before heading to do HOT_RESET.
>        With the change here, the user does not need to ensure all affected
>        groups are opened and it can do hot-reset successfully as long as the
>        devices in the affected group are just un-opened and can be reset.
>     
>        [1] https://patchwork.kernel.org/project/linux-pci/patch/20130814200845.21923.64284.stgit@bling.home/

Where whether a device is opened is subject to change outside of the
user's control.  This essentially allows the user to perform hot-resets
of devices outside of their ownership so long as the device is not
used elsewhere, versus the current requirement that the user own all the
affected groups, which implies device ownership.  It's not been
justified why this feature needs to exist, imo.
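
As a toy model of the two policies being compared (a sketch only, not the
kernel code): struct affected_dev and both function names are hypothetical,
and ownership is reduced to a single per-device flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view of one device affected by the hot reset. */
struct affected_dev {
	bool group_owned; /* the caller owns this device's group */
	bool opened;      /* some user currently has the device open */
};

/* Current requirement: the caller owns every affected group. */
bool strict_reset_allowed(const struct affected_dev *devs, size_t n)
{
	assert(devs != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (!devs[i].group_owned)
			return false;
	return true;
}

/* Change 1): only opened devices are checked; unopened ones are ignored. */
bool relaxed_reset_allowed(const struct affected_dev *devs, size_t n)
{
	assert(devs != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (devs[i].opened && !devs[i].group_owned)
			return false;
	return true;
}
```

In this model the relaxed result flips as soon as someone else opens the
unowned device, while the strict result depends only on what the caller owns.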
 
> 2) Allow passing zero-length fd array to do hot reset
>     - Impact: this uses the iommufd as ownership check in the kernel side.
>       It is only supposed to be used by the users that open cdev instead of
>       users that open group. The drawback is that it cannot cover the noiommu
>       devices as noiommu does not use iommufd at all. But it works well for
>       most cases.

The "only supposed to be used" is problematic here, we're extending all
the interfaces to transparently accept group and device fds, but here
we need to make a distinction because the ioctl needs to perform one
way for groups and another way for devices, which it currently doesn't
do.  As above, I've not seen sufficient justification for this other
than references to reducing complexity, but the only userspace expected
to make use of this interface already has equivalent complexity.
 
> 3) Allow hot reset to succeed when the dev_set is a singleton
>      - Impact: this makes sense but it seems to mess up the boundary between
>      the group path and cdev path w.r.t. the usage of zero-length fd approach.
>      The group path can succeed in doing a hot reset even if it passes an
>      empty fd array, if the dev_set happens to be a singleton.

Again, what is the justification for requiring this, it seems to be
only a hack towards no-iommu support with cdev, which we can achieve by
other means.  Why have we not needed this in the group model?  It
introduces subtle loopholes, so while maybe we could, I don't see why we
should, therefore I cannot agree with "this makes sense".

> 4) Allow passing device fd to do hot reset
>     - Impact: this is a new way for hot reset. should have no impact.
> 
> 5) Extend the _INFO to report devid
>     - Impact: this changes how the user decodes the info reported back.
>     devid or groupid is returned depending on how the queried device is
>     opened. It was suggested to support the scenario in which some devices
>     are opened via cdev while others are opened via group. As a result, we
>     return invalid_devid for a device that is opened via group if it is
>     affected by the hot reset of a device that is opened via cdev.
>     
>     This was proposed to support the future device fd passing usage, which
>     is only available in the cdev path.

I think this is fundamentally flawed because of the scope of the
dev-id.  We can only provide dev-ids for devices which belong to the
same iommufd of the calling device, thus there are multiple instances
where no dev-id can be provided.  The group-id and bdf are static
properties of the devices, regardless of their ownership.  The bdf
provides the specific device level association while the group-id
indicates implied, static ownership.

> To me the major confusion is from 1) and 3). 1) changes the meaning of
> _INFO and HOT_RESET, while 3) messes up the boundary.

As above, I think 2) is also an issue.

> Here is my thought:
> 
> For 1), it was proposed due to below reason[2]. We'd like to make a scenario
> that works in the group path be workable in cdev path as well. But IMHO, we
> may just accept that cdev path cannot work for such scenario to avoid subtle
> change to uapi. Otherwise, we need to have another HOT_RESET ioctl or a
> hint in HOT_RESET ioctl to tell the kernel whether relaxed ownership check
> is expected. Maybe this is awkward. But if we want to keep it, we'd do it
> with the awareness by user.
> 
> [2] https://lore.kernel.org/kvm/Y%2FdobS6gdSkxnPH7@nvidia.com/

The group association is that relaxed ownership test.  Yes, there are
corner cases where we have a dual function card with separate IOMMU
groups, where a user owning function 0 could do a bus reset because
function 1 is temporarily unused, but so what, what good is that, have
we ever had an issue raised because of this?  The user can't rely on
the unopened state of the other function.  It's an entirely
opportunistic optimization.

The much more typical scenario is that a multi-function device does not
provide isolation, all the functions are in the same group and because
of the association of the group the user has implied ownership of the
other devices for the purpose of a reset.

> For 3), it was proposed when discussing the hot reset for noiommu[3]. But
> it does not make hot reset always workable for noiommu in cdev, just in
> case dev_set is singleton. So it is more of a general optimization that can
> make the kernel skip the ownership check. But to make use of it, we may
> need to test it before sanitizing the group fds from user or the iommufd
> check. Maybe the dev_set singleton test in this series is not well placed.
> If so, I can further modify it.
> 
> [3] https://lore.kernel.org/kvm/ZACX+Np%2FIY7ygqL5@nvidia.com/

As above, this seems to be some optimization related to no-iommu for
cdev because we don't have an iommufd association for the device in
no-iommu mode.  Note however that the current group interface doesn't
care about the IOMMU context of the devices.  We only need proof that
the user owns the affected groups.  So why are we bringing iommufd
context anywhere into this interface, here or the null-array interface?

It seems like the minor difference with cdev is that a) we're passing
device fds rather than group fds, and b) those device fds need to be
validated as having device access to complete the proof of ownership
relative to the group.  Otherwise we add capabilities to
DEVICE_GET_INFO to support the device fd passing model where the user
doesn't know the device group or bdf and allow the reset ioctl itself
to accept device fds (extracting the group relationship for those which
the user has configured for access).  Thanks,

Alex
Yi Liu April 7, 2023, 10:09 a.m. UTC | #24
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, April 7, 2023 1:54 AM
> 
> On Thu, 6 Apr 2023 10:02:10 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Thursday, April 6, 2023 7:23 AM
> > >
> > > On Wed, Apr 05, 2023 at 01:49:45PM -0600, Alex Williamson wrote:
> > >
> > > > > > QEMU can make a policy decision today because the kernel provides a
> > > > > > sufficiently reliable interface, ie. based on the set of owned groups, a
> > > > > > hot-reset is all but guaranteed to work.
> > > > >
> > > > > And we don't change that with cdev. If qemu wants to make the policy
> > > > > decision it keeps using the exact same _INFO interface to make that
> > > > > decision same it has always made.
> > > > >
> > > > > We weaken the actual reset action to only consider the security side.
> > > > >
> > > > > Applications that want this exclusive reset group policy simply must
> > > > > check it on their own. It is a reasonable API design.
> > > >
> > > > I disagree, as I've argued before, the info ioctl becomes so weak and
> > > > effectively arbitrary from a user perspective at being able to predict
> > > > whether the hot-reset ioctl works that it becomes useless, diminishing
> > > > the entire hot-reset info/execute API.
> > >
> > > reset should be strictly more permissive than INFO. If INFO predicts
> > > reset is permitted then reset should succeed.
> > >
> > > We don't change INFO so it cannot "becomes so weak"  ??
> > >
> > > We don't care about the cases where INFO says it will not succeed but
> > > reset does (temporarily) succeed.
> > >
> > > I don't get what argument you are trying to make or what you think is
> > > diminished..
> > >
> > > Again, userspace calls INFO, if info says yes then reset *always
> > > works*, exactly just like today.
> > >
> > > Userspace will call reset with a 0 length FD list and it uses a
> > > security only check that is strictly more permissive than what
> > > get_info will return. So the new check is simple in the kernel and
> > > always works in the cases we need it to work.
> > >
> > > What is getting things into trouble is insisting that RESET have
> > > additional restrictions beyond the minimum checks required for
> > > security.
> > >
> > > > > > I don't view it as a loophole, it is flexibility to use the API in a
> > > > > way that is different from what qemu wants - eg an app like dpdk may
> > > > > be willing to tolerate a reset group that becomes unavailable after
> > > > > startup. Who knows, why should we force this in the kernel?
> > > >
> > > > Because look at all the problems it's causing to try to introduce these
> > > > loopholes without also introducing subtle bugs.
> > >
> > > > These problems are coming from trying to do this integrated version,
> > > not from my approach!
> > >
> > > AFAICT there was nothing wrong with my original plan of using the
> > > empty fd list for reset. What Yi has here is some mashup of what you
> > > and I both suggested.
> >
> > Hi Alex, Jason,
> >
> > That could be the reason. So let me try to summarize the changes this
> > series makes and their impact, as far as I know.
> >
> > 1) only check the ownership of opened devices in the dev_set
> >      in HOT_RESET ioctl.
> >      - Impact: it changes the relationship between _INFO  and HOT_RESET.
> >        As " Each group must have IOMMU protection established for the
> >        ioctl to succeed." in [1], existing design actually means userspace
> >        should own all the affected groups before heading to do HOT_RESET.
> >        With the change here, the user does not need to ensure all affected
> >        groups are opened and it can do hot-reset successfully as long as the
> >        devices in the affected group are just un-opened and can be reset.
> >
> > >        [1] https://patchwork.kernel.org/project/linux-pci/patch/20130814200845.21923.64284.stgit@bling.home/
> 
> Where whether a device is opened is subject to change outside of the
> user's control.  This essentially allows the user to perform hot-resets
> of devices outside of their ownership so long as the device is not
> used elsewhere, versus the current requirement that the user own all the
> affected groups, which implies device ownership.  It's not been
> justified why this feature needs to exist, imo.
> 
> > 2) Allow passing zero-length fd array to do hot reset
> >     - Impact: this uses the iommufd as ownership check in the kernel side.
> >       It is only supposed to be used by the users that open cdev instead of
> >       users that open group. The drawback is that it cannot cover the noiommu
> >       devices as noiommu does not use iommufd at all. But it works well for
> >       most cases.
> 
> The "only supposed to be used" is problematic here, we're extending all
> the interfaces to transparently accept group and device fds, but here
> we need to make a distinction because the ioctl needs to perform one
> way for groups and another way for devices, which it currently doesn't
> do.  As above, I've not seen sufficient justification for this other
> than references to reducing complexity, but the only userspace expected
> to make use of this interface already has equivalent complexity.
> 
> > 3) Allow hot reset be successful when the dev_set is singleton
> >      - Impact: this makes sense but it seems to mess up the boundary between
> >      the group path and cdev path w.r.t. the usage of zero-length fd approach.
> >      The group path can succeed to do hot reset even if it is passing an empty
> >      fd array if the dev_set happens to be singleton.
> 
> Again, what is the justification for requiring this, it seems to be
> only a hack towards no-iommu support with cdev, which we can achieve by
> other means.  Why have we not needed this in the group model?  It
> introduces subtle loopholes, so while maybe we could, I don't see why we
> should, therefore I cannot agree with "this makes sense".
> 
> > 4) Allow passing device fd to do hot reset
> >     - Impact: this is a new way for hot reset. should have no impact.
> >
> > 5) Extend the _INFO to report devid
> >     - Impact: this changes the way user to decode the info reported back.
> >     devid and groupid are returned per the way the queried device is opened.
> >     Since it was suggested to support the scenario in which some devices
> >     are opened via cdev while some devices are opened via group. This makes
> >     us to return invalid_devid for the device that is opened via group if
> >     it is affected by the hot reset of a device that is opened via cdev.
> >
> >     This was proposed to support the future device fd passing usage which is
> >     only available in cdev path.
> 
> I think this is fundamentally flawed because of the scope of the
> dev-id.  We can only provide dev-ids for devices which belong to the
> same iommufd of the calling device, thus there are multiple instances
> where no dev-id can be provided.  The group-id and bdf are static
> properties of the devices, regardless of their ownership.  The bdf
> provides the specific device level association while the group-id
> indicates implied, static ownership.
> 
> > To me the major confusion is from 1) and 3). 1) changes the meaning of
> > _INFO and HOT_RESET, while 3) messes up the boundary.
> 
> As above, I think 2) is also an issue.
> 
> > Here is my thought:
> >
> > For 1), it was proposed due to below reason[2]. We'd like to make a scenario
> > that works in the group path be workable in cdev path as well. But IMHO, we
> > > may just accept that cdev path cannot work for such scenario to avoid subtle
> > > change to uapi. Otherwise, we need to have another HOT_RESET ioctl or a
> > > hint in HOT_RESET ioctl to tell the kernel whether relaxed ownership check
> > is expected. Maybe this is awkward. But if we want to keep it, we'd do it
> > with the awareness by user.
> >
> > [2] https://lore.kernel.org/kvm/Y%2FdobS6gdSkxnPH7@nvidia.com/
> 
> The group association is that relaxed ownership test.  Yes, there are
> corner cases where we have a dual function card with separate IOMMU
> groups, where a user owning function 0 could do a bus reset because
> function 1 is temporarily unused, but so what, what good is that, have
> we ever had an issue raised because of this?  The user can't rely on
> the unopened state of the other function.  It's an entirely
> opportunistic optimization.
> 
> The much more typical scenario is that a multi-function device does not
> provide isolation, all the functions are in the same group and because
> of the association of the group the user has implied ownership of the
> other devices for the purpose of a reset.
> 
> > For 3), it was proposed when discussing the hot reset for noiommu[3]. But
> > it does not make hot reset always workable for noiommu in cdev, just in
> > case dev_set is singleton. So it is more of a general optimization that can
> > make the kernel skip the ownership check. But to make use of it, we may
> > need to test it before sanitizing the group fds from user or the iommufd
> > check. Maybe the dev_set singleton test in this series is not well placed.
> > If so, I can further modify it.
> >
> > [3] https://lore.kernel.org/kvm/ZACX+Np%2FIY7ygqL5@nvidia.com/
> 
> As above, this seems to be some optimization related to no-iommu for
> cdev because we don't have an iommufd association for the device in
> no-iommu mode.  Note however that the current group interface doesn't
> care about the IOMMU context of the devices.  We only need proof that
> the user owns the affected groups.  So why are we bringing iommufd
> context anywhere into this interface, here or the null-array interface?
> 
> It seems like the minor difference with cdev is that a) we're passing
> device fds rather than group fds, and b) those device fds need to be
> validated as having device access to complete the proof of ownership
> relative to the group.  Otherwise we add capabilities to
> DEVICE_GET_INFO to support the device fd passing model where the user
> doesn't know the device group or bdf and allow the reset ioctl itself
> to accept device fds (extracting the group relationship for those which
> the user has configured for access).  Thanks,

So your suggestion is to drop 1), 2), 3) and 5), keep 4), and add a new
bdf/group capability to DEVICE_GET_INFO to retrieve the group_id and bdf.
That way, the existing _INFO ioctl can be reused without any change. Is
that right?

Regards,
Yi Liu
Yi Liu April 7, 2023, 10:09 a.m. UTC | #25
Hi Alex,

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Monday, April 3, 2023 11:02 PM
> 
> On Mon, 3 Apr 2023 09:25:06 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Saturday, April 1, 2023 10:44 PM
> >
> > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void
> *data)
> > >  	if (!iommu_group)
> > >  		return -EPERM; /* Cannot reset non-isolated devices */
> >
> > Hi Alex,
> >
> > Is disabling iommu a sane way to test vfio noiommu mode?
> 
> Yes
> 
> > I added intel_iommu=off to disable intel iommu and bind a device to vfio-pci.
> > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0. Bind
> > iommufd==-1 can succeed, but failed to get hot reset info due to the above
> > group check. Reason is that this happens to have some affected devices, and
> > these devices have no valid iommu_group (because they are not bound to vfio-pci
> > hence nobody allocates noiommu group for them). So when hot reset info loops
> > such devices, it failed with -EPERM. Is this expected?
> 
> Hmm, I didn't recall that we put in such a limitation, but given the
> minimally intrusive approach to no-iommu and the fact that we never
> defined an invalid group ID to return to the user, it makes sense that
> we just blocked the ioctl for no-iommu use.  I guess we can do the same
> for no-iommu cdev.

I just realized a further issue related to this limitation. Remember that we
may eventually compile out the vfio group infrastructure. Say I want to test
noiommu, so I boot such a kernel with the iommu disabled. I think the _INFO
ioctl would fail because there is no iommu_group. Does that mean we will not
support hot reset for noiommu in the future if the vfio group infrastructure
is compiled out?

As discussed in another thread, we are going to add a new bdf/group
capability to DEVICE_GET_INFO. If the above kernel is booted, shall we
exclude the new bdf/group capability, or add a flag in the capability to
mark the group_id as invalid?

Regards,
Yi Liu
Alex Williamson April 7, 2023, 12:03 p.m. UTC | #26
On Fri, 7 Apr 2023 10:09:58 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> Hi Alex,
> 
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Monday, April 3, 2023 11:02 PM
> > 
> > On Mon, 3 Apr 2023 09:25:06 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > Sent: Saturday, April 1, 2023 10:44 PM  
> > >  
> > > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void  
> > *data)  
> > > >  	if (!iommu_group)
> > > >  		return -EPERM; /* Cannot reset non-isolated devices */  
> > >
> > > Hi Alex,
> > >
> > > Is disabling iommu a sane way to test vfio noiommu mode?  
> > 
> > Yes
> >   
> > > I added intel_iommu=off to disable intel iommu and bind a device to vfio-pci.
> > > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0. Bind
> > > iommufd==-1 can succeed, but failed to get hot reset info due to the above
> > > group check. Reason is that this happens to have some affected devices, and
> > > these devices have no valid iommu_group (because they are not bound to vfio-pci
> > > hence nobody allocates noiommu group for them). So when hot reset info loops
> > > such devices, it failed with -EPERM. Is this expected?  
> > 
> > Hmm, I didn't recall that we put in such a limitation, but given the
> > minimally intrusive approach to no-iommu and the fact that we never
> > defined an invalid group ID to return to the user, it makes sense that
> > we just blocked the ioctl for no-iommu use.  I guess we can do the same
> > for no-iommu cdev.  
> 
> I just realize a further issue related to this limitation. Remember that we
> may finally compile out the vfio group infrastructure in the future. Say I
> want to test noiommu, I may boot such a kernel with iommu disabled. I think
> the _INFO ioctl would fail as there is no iommu_group. Does it mean we will
> not support hot reset for noiommu in future if vfio group infrastructure is
> compiled out?

We're talking about IOMMU groups, IOMMU groups are always present
regardless of whether we expose a vfio group interface to userspace.
Remember, we create IOMMU groups even in the no-iommu case.  Even with
pure cdev, there are underlying IOMMU groups that maintain the DMA
ownership.

> As another thread, we are going to add a new bdf/group capability to
> DEVICE_GET_INFO. If the above kernel is booted, shall we exclude the new
> bdf/group capability or add a flag in the capability to mark the group_id
> is invalid?

As above, there's always an IOMMU group, it's never invalid.  Thanks,

Alex
Yi Liu April 7, 2023, 1:24 p.m. UTC | #27
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, April 7, 2023 8:04 PM
> 
> > > > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void
> > > *data)
> > > > >  	if (!iommu_group)
> > > > >  		return -EPERM; /* Cannot reset non-isolated devices */

[1]

> > > >
> > > > Hi Alex,
> > > >
> > > > Is disabling iommu a sane way to test vfio noiommu mode?
> > >
> > > Yes
> > >
> > > > I added intel_iommu=off to disable intel iommu and bind a device to vfio-pci.
> > > > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0. Bind
> > > > iommufd==-1 can succeed, but failed to get hot reset info due to the above
> > > > group check. Reason is that this happens to have some affected devices, and
> > > > > these devices have no valid iommu_group (because they are not bound to vfio-pci
> > > > hence nobody allocates noiommu group for them). So when hot reset info loops
> > > > such devices, it failed with -EPERM. Is this expected?
> > >
> > > Hmm, I didn't recall that we put in such a limitation, but given the
> > > minimally intrusive approach to no-iommu and the fact that we never
> > > defined an invalid group ID to return to the user, it makes sense that
> > > we just blocked the ioctl for no-iommu use.  I guess we can do the same
> > > for no-iommu cdev.
> >
> > I just realize a further issue related to this limitation. Remember that we
> > may finally compile out the vfio group infrastructure in the future. Say I
> > want to test noiommu, I may boot such a kernel with iommu disabled. I think
> > the _INFO ioctl would fail as there is no iommu_group. Does it mean we will
> > not support hot reset for noiommu in future if vfio group infrastructure is
> > compiled out?
> 
> We're talking about IOMMU groups, IOMMU groups are always present
> regardless of whether we expose a vfio group interface to userspace.
> Remember, we create IOMMU groups even in the no-iommu case.  Even with
> pure cdev, there are underlying IOMMU groups that maintain the DMA
> ownership.

Hmm. Per [1], when the iommu is disabled there is no iommu_group for a
given device unless it is registered to VFIO, in which case a fake group is
created. That's why I hit the limitation at [1]. When vfio_group is compiled
out, even the fake group goes away.

>
> > As another thread, we are going to add a new bdf/group capability to
> > DEVICE_GET_INFO. If the above kernel is booted, shall we exclude the new
> > bdf/group capability or add a flag in the capability to mark the group_id
> > is invalid?
> 
> As above, there's always an IOMMU group, it's never invalid.  Thanks,

Regards,
Yi Liu
Alex Williamson April 7, 2023, 1:51 p.m. UTC | #28
On Fri, 7 Apr 2023 13:24:25 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Friday, April 7, 2023 8:04 PM
> >   
> > > > > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void  
> > > > *data)  
> > > > > >  	if (!iommu_group)
> > > > > >  		return -EPERM; /* Cannot reset non-isolated devices */  
> 
> [1]
> 
> > > > >
> > > > > Hi Alex,
> > > > >
> > > > > Is disabling iommu a sane way to test vfio noiommu mode?  
> > > >
> > > > Yes
> > > >  
> > > > > I added intel_iommu=off to disable intel iommu and bind a device to vfio-pci.
> > > > > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0. Bind
> > > > > iommufd==-1 can succeed, but failed to get hot reset info due to the above
> > > > > group check. Reason is that this happens to have some affected devices, and
> > > > > > these devices have no valid iommu_group (because they are not bound to vfio-pci
> > > > > hence nobody allocates noiommu group for them). So when hot reset info loops
> > > > > such devices, it failed with -EPERM. Is this expected?  
> > > >
> > > > Hmm, I didn't recall that we put in such a limitation, but given the
> > > > minimally intrusive approach to no-iommu and the fact that we never
> > > > defined an invalid group ID to return to the user, it makes sense that
> > > > we just blocked the ioctl for no-iommu use.  I guess we can do the same
> > > > for no-iommu cdev.  
> > >
> > > I just realize a further issue related to this limitation. Remember that we
> > > may finally compile out the vfio group infrastructure in the future. Say I
> > > want to test noiommu, I may boot such a kernel with iommu disabled. I think
> > > the _INFO ioctl would fail as there is no iommu_group. Does it mean we will
> > > not support hot reset for noiommu in future if vfio group infrastructure is
> > > compiled out?  
> > 
> > We're talking about IOMMU groups, IOMMU groups are always present
> > regardless of whether we expose a vfio group interface to userspace.
> > Remember, we create IOMMU groups even in the no-iommu case.  Even with
> > pure cdev, there are underlying IOMMU groups that maintain the DMA
> > ownership.  
> 
> hmmm. As [1], when iommu is disabled, there will be no iommu_group for a
> given device unless it is registered to VFIO, which a fake group is created.
> That's why I hit the limitation [1]. When vfio_group is compiled out, then
> even fake group goes away.

In the vfio group case, [1] can be hit with no-iommu only when there
are affected devices which are not bound to vfio.  Why are we not
allocating an IOMMU group to no-iommu devices when vfio group is
disabled?  Thanks,

Alex
Yi Liu April 7, 2023, 2:04 p.m. UTC | #29
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, April 7, 2023 9:52 PM
> 
> On Fri, 7 Apr 2023 13:24:25 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Friday, April 7, 2023 8:04 PM
> > >
> > > > > > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev,
> void
> > > > > *data)
> > > > > > >  	if (!iommu_group)
> > > > > > >  		return -EPERM; /* Cannot reset non-isolated devices */
> >
> > [1]
> >
> > > > > >
> > > > > > Hi Alex,
> > > > > >
> > > > > > Is disabling iommu a sane way to test vfio noiommu mode?
> > > > >
> > > > > Yes
> > > > >
> > > > > > I added intel_iommu=off to disable intel iommu and bind a device to vfio-pci.
> > > > > > > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0. Bind
> > > > > > iommufd==-1 can succeed, but failed to get hot reset info due to the above
> > > > > > group check. Reason is that this happens to have some affected devices, and
> > > > > > > these devices have no valid iommu_group (because they are not bound to vfio-pci
> > > > > > > hence nobody allocates noiommu group for them). So when hot reset info loops
> > > > > > such devices, it failed with -EPERM. Is this expected?
> > > > >
> > > > > Hmm, I didn't recall that we put in such a limitation, but given the
> > > > > minimally intrusive approach to no-iommu and the fact that we never
> > > > > defined an invalid group ID to return to the user, it makes sense that
> > > > > we just blocked the ioctl for no-iommu use.  I guess we can do the same
> > > > > for no-iommu cdev.
> > > >
> > > > I just realize a further issue related to this limitation. Remember that we
> > > > may finally compile out the vfio group infrastructure in the future. Say I
> > > > want to test noiommu, I may boot such a kernel with iommu disabled. I think
> > > > the _INFO ioctl would fail as there is no iommu_group. Does it mean we will
> > > > not support hot reset for noiommu in future if vfio group infrastructure is
> > > > compiled out?
> > >
> > > We're talking about IOMMU groups, IOMMU groups are always present
> > > regardless of whether we expose a vfio group interface to userspace.
> > > Remember, we create IOMMU groups even in the no-iommu case.  Even with
> > > pure cdev, there are underlying IOMMU groups that maintain the DMA
> > > ownership.
> >
> > hmmm. As [1], when iommu is disabled, there will be no iommu_group for a
> > given device unless it is registered to VFIO, which a fake group is created.
> > That's why I hit the limitation [1]. When vfio_group is compiled out, then
> > even fake group goes away.
> 
> In the vfio group case, [1] can be hit with no-iommu only when there
> are affected devices which are not bound to vfio.

Yes, because vfio allocates a fake group when a device is registered to
it.

> Why are we not
> allocating an IOMMU group to no-iommu devices when vfio group is
> disabled?  Thanks,

Hmm, that happens when the vfio group code is configured out. With the
patch below applied and CONFIG_VFIO_GROUP=n, vfio_device_set_group()
just returns 0. So when there is no vfio group, the fake group also
goes away.

https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/

Regards,
Yi Liu
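
The effect of compiling the group code out can be illustrated with a
small stub pattern. This is only a sketch of the idea described above,
not the code from the linked patch: MODEL_VFIO_GROUP,
struct model_device and model_device_set_group() are hypothetical names
standing in for CONFIG_VFIO_GROUP, the vfio device and
vfio_device_set_group().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; not the kernel's structures. */
struct model_group {
	int id;
};

struct model_device {
	struct model_group *group;	/* NULL until a group is assigned */
	struct model_group storage;	/* backing store for the fake group */
};

#ifdef MODEL_VFIO_GROUP
/* Group support built in: registration also creates the fake group. */
static int model_device_set_group(struct model_device *dev)
{
	dev->storage.id = 0;
	dev->group = &dev->storage;
	return 0;
}
#else
/* Group support compiled out: the stub succeeds but allocates nothing. */
static inline int model_device_set_group(struct model_device *dev)
{
	(void)dev;
	return 0;
}
#endif

static bool model_device_has_group(const struct model_device *dev)
{
	assert(dev != NULL);
	return dev->group != NULL;
}
```

Built without MODEL_VFIO_GROUP, registration still returns 0 but the
device ends up with no group, so any later code that requires a group
has nothing to report.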
Alex Williamson April 7, 2023, 3:14 p.m. UTC | #30
On Fri, 7 Apr 2023 14:04:02 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Friday, April 7, 2023 9:52 PM
> > 
> > On Fri, 7 Apr 2023 13:24:25 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > Sent: Friday, April 7, 2023 8:04 PM
> > > >  
> > > > > > > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev,  
> > void  
> > > > > > *data)  
> > > > > > > >  	if (!iommu_group)
> > > > > > > >  		return -EPERM; /* Cannot reset non-isolated devices */  
> > >
> > > [1]
> > >  
> > > > > > >
> > > > > > > Hi Alex,
> > > > > > >
> > > > > > > Is disabling iommu a sane way to test vfio noiommu mode?  
> > > > > >
> > > > > > Yes
> > > > > >  
> > > > > > > I added intel_iommu=off to disable intel iommu and bind a device to vfio-pci.
> > > > > > > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0.  
> > Bind  
> > > > > > > iommufd==-1 can succeed, but failed to get hot reset info due to the above
> > > > > > > group check. Reason is that this happens to have some affected devices, and
> > > > > > > these devices have no valid iommu_group (because they are not bound to  
> > vfio-  
> > > > pci  
> > > > > > > hence nobody allocates noiommu group for them). So when hot reset info  
> > loops  
> > > > > > > such devices, it failed with -EPERM. Is this expected?  
> > > > > >
> > > > > > Hmm, I didn't recall that we put in such a limitation, but given the
> > > > > > minimally intrusive approach to no-iommu and the fact that we never
> > > > > > defined an invalid group ID to return to the user, it makes sense that
> > > > > > we just blocked the ioctl for no-iommu use.  I guess we can do the same
> > > > > > for no-iommu cdev.  
> > > > >
> > > > > I just realize a further issue related to this limitation. Remember that we
> > > > > may finally compile out the vfio group infrastructure in the future. Say I
> > > > > want to test noiommu, I may boot such a kernel with iommu disabled. I think
> > > > > the _INFO ioctl would fail as there is no iommu_group. Does it mean we will
> > > > > not support hot reset for noiommu in future if vfio group infrastructure is
> > > > > compiled out?  
> > > >
> > > > We're talking about IOMMU groups, IOMMU groups are always present
> > > > regardless of whether we expose a vfio group interface to userspace.
> > > > Remember, we create IOMMU groups even in the no-iommu case.  Even with
> > > > pure cdev, there are underlying IOMMU groups that maintain the DMA
> > > > ownership.  
> > >
> > > hmmm. As [1], when iommu is disabled, there will be no iommu_group for a
> > > given device unless it is registered to VFIO, which a fake group is created.
> > > That's why I hit the limitation [1]. When vfio_group is compiled out, then
> > > even fake group goes away.  
> > 
> > In the vfio group case, [1] can be hit with no-iommu only when there
> > are affected devices which are not bound to vfio.  
> 
> yes. because vfio would allocate fake group when device is registered to
> it.
> 
> > Why are we not
> > allocating an IOMMU group to no-iommu devices when vfio group is
> > disabled?  Thanks,  
> 
> hmmm. when the vfio group code is configured out. The
> vfio_device_set_group() just returns 0 after below patch is
> applied and CONFIG_VFIO_GROUP=n. So when there is no
> vfio group, the fake group also goes away.
> 
> https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/

Is this a fundamental issue or just a problem with the current
implementation proposal?  It seems like the latter.  FWIW, I also don't
see a taint happening in the cdev path for no-iommu use.  Thanks,

Alex
Yi Liu April 7, 2023, 3:47 p.m. UTC | #31
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, April 7, 2023 11:14 PM
> 
> On Fri, 7 Apr 2023 14:04:02 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Friday, April 7, 2023 9:52 PM
> > >
> > > On Fri, 7 Apr 2023 13:24:25 +0000
> > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > >
> > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > Sent: Friday, April 7, 2023 8:04 PM
> > > > >
> > > > > > > > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev
> *pdev,
> > > void
> > > > > > > *data)
> > > > > > > > >  	if (!iommu_group)
> > > > > > > > >  		return -EPERM; /* Cannot reset non-isolated devices */
> > > >
> > > > [1]
> > > >
> > > > > > > >
> > > > > > > > Hi Alex,
> > > > > > > >
> > > > > > > > Is disabling iommu a sane way to test vfio noiommu mode?
> > > > > > >
> > > > > > > Yes
> > > > > > >
> > > > > > > > I added intel_iommu=off to disable intel iommu and bind a device to vfio-
> pci.
> > > > > > > > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0.
> > > Bind
> > > > > > > > iommufd==-1 can succeed, but failed to get hot reset info due to the
> above
> > > > > > > > group check. Reason is that this happens to have some affected devices,
> and
> > > > > > > > these devices have no valid iommu_group (because they are not bound to
> > > vfio-
> > > > > pci
> > > > > > > > hence nobody allocates noiommu group for them). So when hot reset info
> > > loops
> > > > > > > > such devices, it failed with -EPERM. Is this expected?
> > > > > > >
> > > > > > > Hmm, I didn't recall that we put in such a limitation, but given the
> > > > > > > minimally intrusive approach to no-iommu and the fact that we never
> > > > > > > defined an invalid group ID to return to the user, it makes sense that
> > > > > > > we just blocked the ioctl for no-iommu use.  I guess we can do the same
> > > > > > > for no-iommu cdev.
> > > > > >
> > > > > > I just realize a further issue related to this limitation. Remember that we
> > > > > > may finally compile out the vfio group infrastructure in the future. Say I
> > > > > > want to test noiommu, I may boot such a kernel with iommu disabled. I think
> > > > > > the _INFO ioctl would fail as there is no iommu_group. Does it mean we will
> > > > > > not support hot reset for noiommu in future if vfio group infrastructure is
> > > > > > compiled out?
> > > > >
> > > > > We're talking about IOMMU groups, IOMMU groups are always present
> > > > > regardless of whether we expose a vfio group interface to userspace.
> > > > > Remember, we create IOMMU groups even in the no-iommu case.  Even with
> > > > > pure cdev, there are underlying IOMMU groups that maintain the DMA
> > > > > ownership.
> > > >
> > > > hmmm. As [1], when iommu is disabled, there will be no iommu_group for a
> > > > given device unless it is registered to VFIO, which a fake group is created.
> > > > That's why I hit the limitation [1]. When vfio_group is compiled out, then
> > > > even fake group goes away.
> > >
> > > In the vfio group case, [1] can be hit with no-iommu only when there
> > > are affected devices which are not bound to vfio.
> >
> > yes. because vfio would allocate fake group when device is registered to
> > it.
> >
> > > Why are we not
> > > allocating an IOMMU group to no-iommu devices when vfio group is
> > > disabled?  Thanks,
> >
> > hmmm. when the vfio group code is configured out. The
> > vfio_device_set_group() just returns 0 after below patch is
> > applied and CONFIG_VFIO_GROUP=n. So when there is no
> > vfio group, the fake group also goes away.
> >
> > https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/
> 
> Is this a fundamental issue or just a problem with the current
> implementation proposal?  It seems like the latter.  FWIW, I also don't
> see a taint happening in the cdev path for no-iommu use.  Thanks,

Yes, the latter. I raised it here to confirm the policy for the new
group/bdf capability in DEVICE_GET_INFO. If there is no iommu group,
perhaps I only need to exclude the new group/bdf capability from the
cap chain of DEVICE_GET_INFO. Is that right?

Regards,
Yi Liu
Alex Williamson April 7, 2023, 9:07 p.m. UTC | #32
On Fri, 7 Apr 2023 15:47:10 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Friday, April 7, 2023 11:14 PM
> > 
> > On Fri, 7 Apr 2023 14:04:02 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > Sent: Friday, April 7, 2023 9:52 PM
> > > >
> > > > On Fri, 7 Apr 2023 13:24:25 +0000
> > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > > >  
> > > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > > Sent: Friday, April 7, 2023 8:04 PM
> > > > > >  
> > > > > > > > > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev  
> > *pdev,  
> > > > void  
> > > > > > > > *data)  
> > > > > > > > > >  	if (!iommu_group)
> > > > > > > > > >  		return -EPERM; /* Cannot reset non-isolated devices */  
> > > > >
> > > > > [1]
> > > > >  
> > > > > > > > >
> > > > > > > > > Hi Alex,
> > > > > > > > >
> > > > > > > > > Is disabling iommu a sane way to test vfio noiommu mode?  
> > > > > > > >
> > > > > > > > Yes
> > > > > > > >  
> > > > > > > > > I added intel_iommu=off to disable intel iommu and bind a device to vfio-  
> > pci.  
> > > > > > > > > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0.  
> > > > Bind  
> > > > > > > > > iommufd==-1 can succeed, but failed to get hot reset info due to the  
> > above  
> > > > > > > > > group check. Reason is that this happens to have some affected devices,  
> > and  
> > > > > > > > > these devices have no valid iommu_group (because they are not bound to  
> > > > vfio-  
> > > > > > pci  
> > > > > > > > > hence nobody allocates noiommu group for them). So when hot reset info  
> > > > loops  
> > > > > > > > > such devices, it failed with -EPERM. Is this expected?  
> > > > > > > >
> > > > > > > > Hmm, I didn't recall that we put in such a limitation, but given the
> > > > > > > > minimally intrusive approach to no-iommu and the fact that we never
> > > > > > > > defined an invalid group ID to return to the user, it makes sense that
> > > > > > > > we just blocked the ioctl for no-iommu use.  I guess we can do the same
> > > > > > > > for no-iommu cdev.  
> > > > > > >
> > > > > > > I just realize a further issue related to this limitation. Remember that we
> > > > > > > may finally compile out the vfio group infrastructure in the future. Say I
> > > > > > > want to test noiommu, I may boot such a kernel with iommu disabled. I think
> > > > > > > the _INFO ioctl would fail as there is no iommu_group. Does it mean we will
> > > > > > > not support hot reset for noiommu in future if vfio group infrastructure is
> > > > > > > compiled out?  
> > > > > >
> > > > > > We're talking about IOMMU groups, IOMMU groups are always present
> > > > > > regardless of whether we expose a vfio group interface to userspace.
> > > > > > Remember, we create IOMMU groups even in the no-iommu case.  Even with
> > > > > > pure cdev, there are underlying IOMMU groups that maintain the DMA
> > > > > > ownership.  
> > > > >
> > > > > hmmm. As [1], when iommu is disabled, there will be no iommu_group for a
> > > > > given device unless it is registered to VFIO, which a fake group is created.
> > > > > That's why I hit the limitation [1]. When vfio_group is compiled out, then
> > > > > even fake group goes away.  
> > > >
> > > > In the vfio group case, [1] can be hit with no-iommu only when there
> > > > are affected devices which are not bound to vfio.  
> > >
> > > yes. because vfio would allocate fake group when device is registered to
> > > it.
> > >  
> > > > Why are we not
> > > > allocating an IOMMU group to no-iommu devices when vfio group is
> > > > disabled?  Thanks,  
> > >
> > > hmmm. when the vfio group code is configured out. The
> > > vfio_device_set_group() just returns 0 after below patch is
> > > applied and CONFIG_VFIO_GROUP=n. So when there is no
> > > vfio group, the fake group also goes away.
> > >
> > > https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/  
> > 
> > Is this a fundamental issue or just a problem with the current
> > implementation proposal?  It seems like the latter.  FWIW, I also don't
> > see a taint happening in the cdev path for no-iommu use.  Thanks,  
> 
> yes. the latter case. The reason I raised it here is to confirm the
> policy on the new group/bdf capability in the DEVICE_GET_INFO. If
> there is no iommu group, perhaps I only need to exclude the new
> group/bdf capability from the cap chain of DEVICE_GET_INFO. is it?

I think we need to revisit the question of why allocating an IOMMU
group for a no-iommu device is exclusive to the vfio group support.
We've already been down the path of trying to report a field that only
exists for devices with certain properties with dev-id.  It doesn't
work well.  I think we've said all along that while the cdev interface
is device based, there are still going to be underlying IOMMU groups
for the user to be aware of, they're just not as much a fundamental
part of the interface.  There should not be a case where a device
doesn't have a group to report.  Thanks,

Alex
Yi Liu April 8, 2023, 5:07 a.m. UTC | #33
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Saturday, April 8, 2023 5:07 AM
> 
> On Fri, 7 Apr 2023 15:47:10 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Friday, April 7, 2023 11:14 PM
> > >
> > > On Fri, 7 Apr 2023 14:04:02 +0000
> > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > >
> > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > Sent: Friday, April 7, 2023 9:52 PM
> > > > >
> > > > > On Fri, 7 Apr 2023 13:24:25 +0000
> > > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > > > >
> > > > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > > > Sent: Friday, April 7, 2023 8:04 PM
> > > > > > >
> > > > > > > > > > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev
> > > *pdev,
> > > > > void
> > > > > > > > > *data)
> > > > > > > > > > >  	if (!iommu_group)
> > > > > > > > > > >  		return -EPERM; /* Cannot reset non-isolated devices
> */
> > > > > >
> > > > > > [1]
> > > > > >
> > > > > > > > > >
> > > > > > > > > > Hi Alex,
> > > > > > > > > >
> > > > > > > > > > Is disabling iommu a sane way to test vfio noiommu mode?
> > > > > > > > >
> > > > > > > > > Yes
> > > > > > > > >
> > > > > > > > > > I added intel_iommu=off to disable intel iommu and bind a device to
> vfio-
> > > pci.
> > > > > > > > > > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-
> vfio0.
> > > > > Bind
> > > > > > > > > > iommufd==-1 can succeed, but failed to get hot reset info due to the
> > > above
> > > > > > > > > > group check. Reason is that this happens to have some affected
> devices,
> > > and
> > > > > > > > > > these devices have no valid iommu_group (because they are not
> bound to
> > > > > vfio-
> > > > > > > pci
> > > > > > > > > > hence nobody allocates noiommu group for them). So when hot reset
> info
> > > > > loops
> > > > > > > > > > such devices, it failed with -EPERM. Is this expected?
> > > > > > > > >
> > > > > > > > > Hmm, I didn't recall that we put in such a limitation, but given the
> > > > > > > > > minimally intrusive approach to no-iommu and the fact that we never
> > > > > > > > > defined an invalid group ID to return to the user, it makes sense that
> > > > > > > > > we just blocked the ioctl for no-iommu use.  I guess we can do the same
> > > > > > > > > for no-iommu cdev.
> > > > > > > >
> > > > > > > > I just realize a further issue related to this limitation. Remember that we
> > > > > > > > may finally compile out the vfio group infrastructure in the future. Say I
> > > > > > > > want to test noiommu, I may boot such a kernel with iommu disabled. I
> think
> > > > > > > > the _INFO ioctl would fail as there is no iommu_group. Does it mean we
> will
> > > > > > > > not support hot reset for noiommu in future if vfio group infrastructure is
> > > > > > > > compiled out?
> > > > > > >
> > > > > > > We're talking about IOMMU groups, IOMMU groups are always present
> > > > > > > regardless of whether we expose a vfio group interface to userspace.
> > > > > > > Remember, we create IOMMU groups even in the no-iommu case.  Even
> with
> > > > > > > pure cdev, there are underlying IOMMU groups that maintain the DMA
> > > > > > > ownership.
> > > > > >
> > > > > > hmmm. As [1], when iommu is disabled, there will be no iommu_group for a
> > > > > > given device unless it is registered to VFIO, which a fake group is created.
> > > > > > That's why I hit the limitation [1]. When vfio_group is compiled out, then
> > > > > > even fake group goes away.
> > > > >
> > > > > In the vfio group case, [1] can be hit with no-iommu only when there
> > > > > are affected devices which are not bound to vfio.
> > > >
> > > > yes. because vfio would allocate fake group when device is registered to
> > > > it.
> > > >
> > > > > Why are we not
> > > > > allocating an IOMMU group to no-iommu devices when vfio group is
> > > > > disabled?  Thanks,
> > > >
> > > > hmmm. when the vfio group code is configured out. The
> > > > vfio_device_set_group() just returns 0 after below patch is
> > > > applied and CONFIG_VFIO_GROUP=n. So when there is no
> > > > vfio group, the fake group also goes away.
> > > >
> > > > https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/
> > >
> > > Is this a fundamental issue or just a problem with the current
> > > implementation proposal?  It seems like the latter.  FWIW, I also don't
> > > see a taint happening in the cdev path for no-iommu use.  Thanks,
> >
> > yes. the latter case. The reason I raised it here is to confirm the
> > policy on the new group/bdf capability in the DEVICE_GET_INFO. If
> > there is no iommu group, perhaps I only need to exclude the new
> > group/bdf capability from the cap chain of DEVICE_GET_INFO. is it?
> 
> I think we need to revisit the question of why allocating an IOMMU
> group for a no-iommu device is exclusive to the vfio group support.

For a no-iommu device, the iommu group is a fake group allocated by vfio,
isn't it? The fake group allocation is part of the vfio group code, in
vfio_device_set_group() in group.c. If the vfio group code is not
compiled in, vfio does not allocate fake groups. Details of this
compile-time option can be found in link [1].

> We've already been down the path of trying to report a field that only
> exists for devices with certain properties with dev-id.  It doesn't
> work well.  I think we've said all along that while the cdev interface
> is device based, there are still going to be underlying IOMMU groups
> for the user to be aware of, they're just not as much a fundamental
> part of the interface.  There should not be a case where a device
> doesn't have a group to report.  Thanks,

The patch in link [1] makes the vfio group optional, so if a kernel is
compiled with CONFIG_VFIO_GROUP=n and booted with the iommu disabled,
there is no group to report. This may not be a typical usage, but it is
still a sane one for noiommu mode, as I confirmed with you in this
thread. So we need to consider what to report for the group field in
that case.

Perhaps I messed up the discussion by referring to a patch that is part of
another series. But I think it should be considered when talking about the
group to be reported.

[1] https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/

Regards,
Yi Liu
Alex Williamson April 8, 2023, 2:20 p.m. UTC | #34
On Sat, 8 Apr 2023 05:07:16 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Saturday, April 8, 2023 5:07 AM
> > 
> > On Fri, 7 Apr 2023 15:47:10 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > Sent: Friday, April 7, 2023 11:14 PM
> > > >
> > > > On Fri, 7 Apr 2023 14:04:02 +0000
> > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > > >  
> > > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > > Sent: Friday, April 7, 2023 9:52 PM
> > > > > >
> > > > > > On Fri, 7 Apr 2023 13:24:25 +0000
> > > > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > > > > >  
> > > > > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > > > > Sent: Friday, April 7, 2023 8:04 PM
> > > > > > > >  
> > > > > > > > > > > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev  
> > > > *pdev,  
> > > > > > void  
> > > > > > > > > > *data)  
> > > > > > > > > > > >  	if (!iommu_group)
> > > > > > > > > > > >  		return -EPERM; /* Cannot reset non-isolated devices  
> > */  
> > > > > > >
> > > > > > > [1]
> > > > > > >  
> > > > > > > > > > >
> > > > > > > > > > > Hi Alex,
> > > > > > > > > > >
> > > > > > > > > > > Is disabling iommu a sane way to test vfio noiommu mode?  
> > > > > > > > > >
> > > > > > > > > > Yes
> > > > > > > > > >  
> > > > > > > > > > > I added intel_iommu=off to disable intel iommu and bind a device to  
> > vfio-  
> > > > pci.  
> > > > > > > > > > > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-  
> > vfio0.  
> > > > > > Bind  
> > > > > > > > > > > iommufd==-1 can succeed, but failed to get hot reset info due to the  
> > > > above  
> > > > > > > > > > > group check. Reason is that this happens to have some affected  
> > devices,  
> > > > and  
> > > > > > > > > > > these devices have no valid iommu_group (because they are not  
> > bound to  
> > > > > > vfio-  
> > > > > > > > pci  
> > > > > > > > > > > hence nobody allocates noiommu group for them). So when hot reset  
> > info  
> > > > > > loops  
> > > > > > > > > > > such devices, it failed with -EPERM. Is this expected?  
> > > > > > > > > >
> > > > > > > > > > Hmm, I didn't recall that we put in such a limitation, but given the
> > > > > > > > > > minimally intrusive approach to no-iommu and the fact that we never
> > > > > > > > > > defined an invalid group ID to return to the user, it makes sense that
> > > > > > > > > > we just blocked the ioctl for no-iommu use.  I guess we can do the same
> > > > > > > > > > for no-iommu cdev.  
> > > > > > > > >
> > > > > > > > > I just realize a further issue related to this limitation. Remember that we
> > > > > > > > > may finally compile out the vfio group infrastructure in the future. Say I
> > > > > > > > > want to test noiommu, I may boot such a kernel with iommu disabled. I  
> > think  
> > > > > > > > > the _INFO ioctl would fail as there is no iommu_group. Does it mean we  
> > will  
> > > > > > > > > not support hot reset for noiommu in future if vfio group infrastructure is
> > > > > > > > > compiled out?  
> > > > > > > >
> > > > > > > > We're talking about IOMMU groups, IOMMU groups are always present
> > > > > > > > regardless of whether we expose a vfio group interface to userspace.
> > > > > > > > Remember, we create IOMMU groups even in the no-iommu case.  Even  
> > with  
> > > > > > > > pure cdev, there are underlying IOMMU groups that maintain the DMA
> > > > > > > > ownership.  
> > > > > > >
> > > > > > > hmmm. As [1], when iommu is disabled, there will be no iommu_group for a
> > > > > > > given device unless it is registered to VFIO, which a fake group is created.
> > > > > > > That's why I hit the limitation [1]. When vfio_group is compiled out, then
> > > > > > > even fake group goes away.  
> > > > > >
> > > > > > In the vfio group case, [1] can be hit with no-iommu only when there
> > > > > > are affected devices which are not bound to vfio.  
> > > > >
> > > > > yes. because vfio would allocate fake group when device is registered to
> > > > > it.
> > > > >  
> > > > > > Why are we not
> > > > > > allocating an IOMMU group to no-iommu devices when vfio group is
> > > > > > disabled?  Thanks,  
> > > > >
> > > > > hmmm. when the vfio group code is configured out. The
> > > > > vfio_device_set_group() just returns 0 after below patch is
> > > > > applied and CONFIG_VFIO_GROUP=n. So when there is no
> > > > > vfio group, the fake group also goes away.
> > > > >
> > > > > https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/  
> > > >
> > > > Is this a fundamental issue or just a problem with the current
> > > > implementation proposal?  It seems like the latter.  FWIW, I also don't
> > > > see a taint happening in the cdev path for no-iommu use.  Thanks,  
> > >
> > > yes. the latter case. The reason I raised it here is to confirm the
> > > policy on the new group/bdf capability in the DEVICE_GET_INFO. If
> > > there is no iommu group, perhaps I only need to exclude the new
> > > group/bdf capability from the cap chain of DEVICE_GET_INFO. is it?  
> > 
> > I think we need to revisit the question of why allocating an IOMMU
> > group for a no-iommu device is exclusive to the vfio group support.  
> 
> For no-iommu device, the iommu group is a fake group allocated by vfio.
> is it? And the fake group allocation is part of the vfio group code.
> It is the vfio_device_set_group() in group.c. If vfio group code is not
> compiled in, vfio does not allocate fake groups. Detail for this compiling
> can be found in link [1].
> 
> > We've already been down the path of trying to report a field that only
> > exists for devices with certain properties with dev-id.  It doesn't
> > work well.  I think we've said all along that while the cdev interface
> > is device based, there are still going to be underlying IOMMU groups
> > for the user to be aware of, they're just not as much a fundamental
> > part of the interface.  There should not be a case where a device
> > doesn't have a group to report.  Thanks,  
> 
> As the patch in link [1] makes vfio group optional, so if compile a kernel
> with CONFIG_VFIO_GROUP=n, and boot it with iommu disabled, then there is no
> group to report. Perhaps this is not a typical usage but still a sane usage
> for noiommu mode as I confirmed with you in this thread. So when it comes,
> needs to consider what to report for the group field.
> 
> Perhaps I messed up the discussion by referring to a patch that is part of
> another series. But I think it should be considered when talking about the
> group to be reported.

The question is whether having group.c handle both the vfio group AND
creation of the IOMMU group in such cases is the correct split.  I'm not
disputing that, as the code is currently laid out, the fake IOMMU group
for no-iommu devices is created in vfio group specific code, but we have
a common interface that makes use of IOMMU group information for which
we don't have an equivalent alternative data field to report.

We've shown that dev-id doesn't work here because dev-ids only exist
for devices within the user's IOMMU context.  Also reporting an invalid
ID of any sort fails to indicate the potential implied ownership.
Therefore I recognize that if this interface is to report an IOMMU
group, then the creation of fake IOMMU groups existing only in vfio
group code would need to be refactored.  Thanks,

Alex
Yi Liu April 9, 2023, 11:58 a.m. UTC | #35
On 2023/4/8 22:20, Alex Williamson wrote:
> On Sat, 8 Apr 2023 05:07:16 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
>>> From: Alex Williamson <alex.williamson@redhat.com>
>>> Sent: Saturday, April 8, 2023 5:07 AM
>>>
>>> On Fri, 7 Apr 2023 15:47:10 +0000
>>> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
>>>    
>>>>> From: Alex Williamson <alex.williamson@redhat.com>
>>>>> Sent: Friday, April 7, 2023 11:14 PM
>>>>>
>>>>> On Fri, 7 Apr 2023 14:04:02 +0000
>>>>> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
>>>>>   
>>>>>>> From: Alex Williamson <alex.williamson@redhat.com>
>>>>>>> Sent: Friday, April 7, 2023 9:52 PM
>>>>>>>
>>>>>>> On Fri, 7 Apr 2023 13:24:25 +0000
>>>>>>> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
>>>>>>>   
>>>>>>>>> From: Alex Williamson <alex.williamson@redhat.com>
>>>>>>>>> Sent: Friday, April 7, 2023 8:04 PM
>>>>>>>>>   
>>>>>>>>>>>>> @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev
>>>>> *pdev,
>>>>>>> void
>>>>>>>>>>> *data)
>>>>>>>>>>>>>   	if (!iommu_group)
>>>>>>>>>>>>>   		return -EPERM; /* Cannot reset non-isolated devices
>>> */
>>>>>>>>
>>>>>>>> [1]
>>>>>>>>   
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Alex,
>>>>>>>>>>>>
>>>>>>>>>>>> Is disabling iommu a sane way to test vfio noiommu mode?
>>>>>>>>>>>
>>>>>>>>>>> Yes
>>>>>>>>>>>   
>>>>>>>>>>>> I added intel_iommu=off to disable intel iommu and bind a device to
>>> vfio-
>>>>> pci.
>>>>>>>>>>>> I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-
>>> vfio0.
>>>>>>> Bind
>>>>>>>>>>>> iommufd==-1 can succeed, but failed to get hot reset info due to the
>>>>> above
>>>>>>>>>>>> group check. Reason is that this happens to have some affected
>>> devices,
>>>>> and
>>>>>>>>>>>> these devices have no valid iommu_group (because they are not
>>> bound to
>>>>>>> vfio-
>>>>>>>>> pci
>>>>>>>>>>>> hence nobody allocates noiommu group for them). So when hot reset
>>> info
>>>>>>> loops
>>>>>>>>>>>> such devices, it failed with -EPERM. Is this expected?
>>>>>>>>>>>
>>>>>>>>>>> Hmm, I didn't recall that we put in such a limitation, but given the
>>>>>>>>>>> minimally intrusive approach to no-iommu and the fact that we never
>>>>>>>>>>> defined an invalid group ID to return to the user, it makes sense that
>>>>>>>>>>> we just blocked the ioctl for no-iommu use.  I guess we can do the same
>>>>>>>>>>> for no-iommu cdev.
>>>>>>>>>>
>>>>>>>>>> I just realize a further issue related to this limitation. Remember that we
>>>>>>>>>> may finally compile out the vfio group infrastructure in the future. Say I
>>>>>>>>>> want to test noiommu, I may boot such a kernel with iommu disabled. I think
>>>>>>>>>> the _INFO ioctl would fail as there is no iommu_group. Does it mean we will
>>>>>>>>>> not support hot reset for noiommu in future if vfio group infrastructure is
>>>>>>>>>> compiled out?
>>>>>>>>>
>>>>>>>>> We're talking about IOMMU groups, IOMMU groups are always present
>>>>>>>>> regardless of whether we expose a vfio group interface to userspace.
>>>>>>>>> Remember, we create IOMMU groups even in the no-iommu case.  Even with
>>>>>>>>> pure cdev, there are underlying IOMMU groups that maintain the DMA
>>>>>>>>> ownership.
>>>>>>>>
>>>>>>>> hmmm. As [1], when iommu is disabled, there will be no iommu_group for a
>>>>>>>> given device unless it is registered to VFIO, which a fake group is created.
>>>>>>>> That's why I hit the limitation [1]. When vfio_group is compiled out, then
>>>>>>>> even fake group goes away.
>>>>>>>
>>>>>>> In the vfio group case, [1] can be hit with no-iommu only when there
>>>>>>> are affected devices which are not bound to vfio.
>>>>>>
>>>>>> yes. because vfio would allocate fake group when device is registered to
>>>>>> it.
>>>>>>   
>>>>>>> Why are we not
>>>>>>> allocating an IOMMU group to no-iommu devices when vfio group is
>>>>>>> disabled?  Thanks,
>>>>>>
>>>>>> hmmm. when the vfio group code is configured out. The
>>>>>> vfio_device_set_group() just returns 0 after below patch is
>>>>>> applied and CONFIG_VFIO_GROUP=n. So when there is no
>>>>>> vfio group, the fake group also goes away.
>>>>>>
>>>>>> https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/
>>>>>
>>>>> Is this a fundamental issue or just a problem with the current
>>>>> implementation proposal?  It seems like the latter.  FWIW, I also don't
>>>>> see a taint happening in the cdev path for no-iommu use.  Thanks,
>>>>
>>>> yes. the latter case. The reason I raised it here is to confirm the
>>>> policy on the new group/bdf capability in the DEVICE_GET_INFO. If
>>>> there is no iommu group, perhaps I only need to exclude the new
>>>> group/bdf capability from the cap chain of DEVICE_GET_INFO. is it?
>>>
>>> I think we need to revisit the question of why allocating an IOMMU
>>> group for a no-iommu device is exclusive to the vfio group support.
>>
>> For no-iommu device, the iommu group is a fake group allocated by vfio.
>> is it? And the fake group allocation is part of the vfio group code.
>> It is the vfio_device_set_group() in group.c. If vfio group code is not
>> compiled in, vfio does not allocate fake groups. Detail for this compiling
>> can be found in link [1].
>>
>>> We've already been down the path of trying to report a field that only
>>> exists for devices with certain properties with dev-id.  It doesn't
>>> work well.  I think we've said all along that while the cdev interface
>>> is device based, there are still going to be underlying IOMMU groups
>>> for the user to be aware of, they're just not as much a fundamental
>>> part of the interface.  There should not be a case where a device
>>> doesn't have a group to report.  Thanks,
>>
>> As the patch in link [1] makes vfio group optional, so if compile a kernel
>> with CONFIG_VFIO_GROUP=n, and boot it with iommu disabled, then there is no
>> group to report. Perhaps this is not a typical usage but still a sane usage
>> for noiommu mode as I confirmed with you in this thread. So when it comes,
>> needs to consider what to report for the group field.
>>
>> Perhaps I messed up the discussion by referring to a patch that is part of
>> another series. But I think it should be considered when talking about the
>> group to be reported.
> 
> The question is whether the split that group.c code handles both the
> vfio group AND creation of the IOMMU group in such cases is the correct
> split.  I'm not arguing that the way the code is currently laid out has
> the fake IOMMU group for no-iommu devices created in vfio group
> specific code, but we have a common interface that makes use of IOMMU
> group information for which we don't have an equivalent alternative
> data field to report.

yes. It is needed to ensure _HOT_RESET_INFO workable for noiommu devices.

> We've shown that dev-id doesn't work here because dev-ids only exist
> for devices within the user's IOMMU context.  Also reporting an invalid
> ID of any sort fails to indicate the potential implied ownership.
> Therefore I recognize that if this interface is to report an IOMMU
> group, then the creation of fake IOMMU groups existing only in vfio
> group code would need to be refactored.  Thanks,

yeah, needs to move the iommu group creation back to vfio_main.c. This
would be a prerequisite for [1]

[1] https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/

I'll also try out your suggestion to add a capability like below and link
it in the vfio_device_info cap chain.

#define VFIO_DEVICE_INFO_CAP_PCI_BDF          5

struct vfio_device_info_cap_pci_bdf {
         struct vfio_info_cap_header header;
         __u32   group_id;
         __u16   segment;
         __u8    bus;
         __u8    devfn; /* Use PCI_SLOT/PCI_FUNC */
};
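
[Illustration] The devfn field in the proposed capability packs the PCI
slot and function into one byte, which is why its comment points at
PCI_SLOT/PCI_FUNC. A minimal userspace sketch of the decoding is below. The
BDF_SLOT/BDF_FUNC macros and format_bdf() are hypothetical names used only
for this example; they follow the same bit layout as the kernel macros
(slot in bits 7:3, function in bits 2:0).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Same bit layout as the kernel's PCI_SLOT()/PCI_FUNC() macros. */
#define BDF_SLOT(devfn) (((unsigned int)(devfn) >> 3) & 0x1fu)
#define BDF_FUNC(devfn) ((unsigned int)(devfn) & 0x07u)

/*
 * Format segment/bus/devfn in the usual "ssss:bb:dd.f" notation.
 * Returns whatever snprintf() returns (length of the formatted string).
 */
static int format_bdf(char *out, size_t len, uint16_t segment,
                      uint8_t bus, uint8_t devfn)
{
        return snprintf(out, len, "%04x:%02x:%02x.%x",
                        (unsigned int)segment, (unsigned int)bus,
                        BDF_SLOT(devfn), BDF_FUNC(devfn));
}
```

A user that receives device fds without BDF knowledge could use a string
like this to match its own devices against the entries that
VFIO_DEVICE_GET_PCI_HOT_RESET_INFO reports.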
Alex Williamson April 9, 2023, 1:29 p.m. UTC | #36
On Sun, 9 Apr 2023 19:58:47 +0800
Yi Liu <yi.l.liu@intel.com> wrote:

> On 2023/4/8 22:20, Alex Williamson wrote:
> > On Sat, 8 Apr 2023 05:07:16 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> >>> From: Alex Williamson <alex.williamson@redhat.com>
> >>> Sent: Saturday, April 8, 2023 5:07 AM
> >>>
> >>> On Fri, 7 Apr 2023 15:47:10 +0000
> >>> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >>>      
> >>>>> From: Alex Williamson <alex.williamson@redhat.com>
> >>>>> Sent: Friday, April 7, 2023 11:14 PM
> >>>>>
> >>>>> On Fri, 7 Apr 2023 14:04:02 +0000
> >>>>> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >>>>>     
> >>>>>>> From: Alex Williamson <alex.williamson@redhat.com>
> >>>>>>> Sent: Friday, April 7, 2023 9:52 PM
> >>>>>>>
> >>>>>>> On Fri, 7 Apr 2023 13:24:25 +0000
> >>>>>>> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >>>>>>>     
> >>>>>>>>> From: Alex Williamson <alex.williamson@redhat.com>
> >>>>>>>>> Sent: Friday, April 7, 2023 8:04 PM
> >>>>>>>>>     
> >>>>>>>>>>>>> @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
> >>>>>>>>>>>>>   	if (!iommu_group)
> >>>>>>>>>>>>>   		return -EPERM; /* Cannot reset non-isolated devices */
> >>>>>>>>
> >>>>>>>> [1]
> >>>>>>>>     
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi Alex,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Is disabling iommu a sane way to test vfio noiommu mode?  
> >>>>>>>>>>>
> >>>>>>>>>>> Yes
> >>>>>>>>>>>     
> >>>>>>>>>>>> I added intel_iommu=off to disable intel iommu and bind a device to vfio-pci.
> >>>>>>>>>>>> I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0. Bind
> >>>>>>>>>>>> iommufd==-1 can succeed, but failed to get hot reset info due to the above
> >>>>>>>>>>>> group check. Reason is that this happens to have some affected devices, and
> >>>>>>>>>>>> these devices have no valid iommu_group (because they are not bound to vfio-pci
> >>>>>>>>>>>> hence nobody allocates noiommu group for them). So when hot reset info loops
> >>>>>>>>>>>> such devices, it failed with -EPERM. Is this expected?
> >>>>>>>>>>>
> >>>>>>>>>>> Hmm, I didn't recall that we put in such a limitation, but given the
> >>>>>>>>>>> minimally intrusive approach to no-iommu and the fact that we never
> >>>>>>>>>>> defined an invalid group ID to return to the user, it makes sense that
> >>>>>>>>>>> we just blocked the ioctl for no-iommu use.  I guess we can do the same
> >>>>>>>>>>> for no-iommu cdev.  
> >>>>>>>>>>
> >>>>>>>>>> I just realize a further issue related to this limitation. Remember that we
> >>>>>>>>>> may finally compile out the vfio group infrastructure in the future. Say I
> >>>>>>>>>> want to test noiommu, I may boot such a kernel with iommu disabled. I think
> >>>>>>>>>> the _INFO ioctl would fail as there is no iommu_group. Does it mean we will
> >>>>>>>>>> not support hot reset for noiommu in future if vfio group infrastructure is
> >>>>>>>>>> compiled out?
> >>>>>>>>>
> >>>>>>>>> We're talking about IOMMU groups, IOMMU groups are always present
> >>>>>>>>> regardless of whether we expose a vfio group interface to userspace.
> >>>>>>>>> Remember, we create IOMMU groups even in the no-iommu case.  Even with
> >>>>>>>>> pure cdev, there are underlying IOMMU groups that maintain the DMA
> >>>>>>>>> ownership.  
> >>>>>>>>
> >>>>>>>> hmmm. As [1], when iommu is disabled, there will be no iommu_group for a
> >>>>>>>> given device unless it is registered to VFIO, which a fake group is created.
> >>>>>>>> That's why I hit the limitation [1]. When vfio_group is compiled out, then
> >>>>>>>> even fake group goes away.  
> >>>>>>>
> >>>>>>> In the vfio group case, [1] can be hit with no-iommu only when there
> >>>>>>> are affected devices which are not bound to vfio.  
> >>>>>>
> >>>>>> yes. because vfio would allocate fake group when device is registered to
> >>>>>> it.
> >>>>>>     
> >>>>>>> Why are we not
> >>>>>>> allocating an IOMMU group to no-iommu devices when vfio group is
> >>>>>>> disabled?  Thanks,  
> >>>>>>
> >>>>>> hmmm. when the vfio group code is configured out. The
> >>>>>> vfio_device_set_group() just returns 0 after below patch is
> >>>>>> applied and CONFIG_VFIO_GROUP=n. So when there is no
> >>>>>> vfio group, the fake group also goes away.
> >>>>>>
> >>>>>> https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/  
> >>>>>
> >>>>> Is this a fundamental issue or just a problem with the current
> >>>>> implementation proposal?  It seems like the latter.  FWIW, I also don't
> >>>>> see a taint happening in the cdev path for no-iommu use.  Thanks,  
> >>>>
> >>>> yes. the latter case. The reason I raised it here is to confirm the
> >>>> policy on the new group/bdf capability in the DEVICE_GET_INFO. If
> >>>> there is no iommu group, perhaps I only need to exclude the new
> >>>> group/bdf capability from the cap chain of DEVICE_GET_INFO. is it?  
> >>>
> >>> I think we need to revisit the question of why allocating an IOMMU
> >>> group for a no-iommu device is exclusive to the vfio group support.  
> >>
> >> For no-iommu device, the iommu group is a fake group allocated by vfio.
> >> is it? And the fake group allocation is part of the vfio group code.
> >> It is the vfio_device_set_group() in group.c. If vfio group code is not
> >> compiled in, vfio does not allocate fake groups. Detail for this compiling
> >> can be found in link [1].
> >>  
> >>> We've already been down the path of trying to report a field that only
> >>> exists for devices with certain properties with dev-id.  It doesn't
> >>> work well.  I think we've said all along that while the cdev interface
> >>> is device based, there are still going to be underlying IOMMU groups
> >>> for the user to be aware of, they're just not as much a fundamental
> >>> part of the interface.  There should not be a case where a device
> >>> doesn't have a group to report.  Thanks,  
> >>
> >> As the patch in link [1] makes vfio group optional, so if compile a kernel
> >> with CONFIG_VFIO_GROUP=n, and boot it with iommu disabled, then there is no
> >> group to report. Perhaps this is not a typical usage but still a sane usage
> >> for noiommu mode as I confirmed with you in this thread. So when it comes,
> >> needs to consider what to report for the group field.
> >>
> >> Perhaps I messed up the discussion by referring to a patch that is part of
> >> another series. But I think it should be considered when talking about the
> >> group to be reported.  
> > 
> > The question is whether the split that group.c code handles both the
> > vfio group AND creation of the IOMMU group in such cases is the correct
> > split.  I'm not arguing that the way the code is currently laid out has
> > the fake IOMMU group for no-iommu devices created in vfio group
> > specific code, but we have a common interface that makes use of IOMMU
> > group information for which we don't have an equivalent alternative
> > data field to report.  
> 
> yes. It is needed to ensure _HOT_RESET_INFO workable for noiommu devices.
> 
> > We've shown that dev-id doesn't work here because dev-ids only exist
> > for devices within the user's IOMMU context.  Also reporting an invalid
> > ID of any sort fails to indicate the potential implied ownership.
> > Therefore I recognize that if this interface is to report an IOMMU
> > group, then the creation of fake IOMMU groups existing only in vfio
> > group code would need to be refactored.  Thanks,  
> 
> yeah, needs to move the iommu group creation back to vfio_main.c. This
> would be a prerequisite for [1]
> 
> [1] https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/
> 
> I'll also try out your suggestion to add a capability like below and link
> it in the vfio_device_info cap chain.
> 
> #define VFIO_DEVICE_INFO_CAP_PCI_BDF          5
> 
> struct vfio_device_info_cap_pci_bdf {
>          struct vfio_info_cap_header header;
>          __u32   group_id;
>          __u16   segment;
>          __u8    bus;
>          __u8    devfn; /* Use PCI_SLOT/PCI_FUNC */
> };
> 

Group-id and bdf should be separate capabilities: all devices should
report a group-id capability, and only PCI devices a bdf capability.
Thanks,

Alex
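
[Illustration] Splitting group-id and bdf into separate capabilities means
userspace looks each one up independently in the vfio_device_info cap
chain, and a non-PCI device simply lacks the bdf entry. A rough,
self-contained sketch of that lookup follows. struct cap_header mirrors the
id/version/next layout of struct vfio_info_cap_header; the two capability
IDs, the struct names and find_cap() are assumptions made up for this
example, not a proposed or released uAPI.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mirrors the layout of struct vfio_info_cap_header. */
struct cap_header {
        uint16_t id;
        uint16_t version;
        uint32_t next;  /* offset of the next capability, 0 ends the chain */
};

/* Hypothetical IDs and layouts, for illustration only. */
#define CAP_GROUP_ID 5
#define CAP_PCI_BDF  6

struct cap_group_id {
        struct cap_header header;
        uint32_t group_id;
};

struct cap_pci_bdf {
        struct cap_header header;
        uint16_t segment;
        uint8_t bus;
        uint8_t devfn;
};

/*
 * Walk the chain that starts at first_off inside buf and return the offset
 * of the capability with the given id, or 0 if it is absent. Offsets are
 * bounds-checked and the hop count is capped so a malformed chain cannot
 * loop forever.
 */
static uint32_t find_cap(const uint8_t *buf, size_t len,
                         uint32_t first_off, uint16_t id)
{
        uint32_t off = first_off;
        size_t hops = 0;

        while (off != 0 && hops++ < len) {
                struct cap_header hdr;

                if (off > len || len - off < sizeof(hdr))
                        return 0;
                memcpy(&hdr, buf + off, sizeof(hdr));
                if (hdr.id == id)
                        return off;
                off = hdr.next;
        }
        return 0;
}
```

With this shape, a caller would treat a zero result for CAP_PCI_BDF as "not
a PCI device" while still expecting CAP_GROUP_ID to be present.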
Yi Liu April 10, 2023, 8:48 a.m. UTC | #37
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Sunday, April 9, 2023 9:30 PM
[...]
> > yeah, needs to move the iommu group creation back to vfio_main.c. This
> > would be a prerequisite for [1]
> >
> > [1] https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/
> >
> > I'll also try out your suggestion to add a capability like below and link
> > it in the vfio_device_info cap chain.
> >
> > #define VFIO_DEVICE_INFO_CAP_PCI_BDF          5
> >
> > struct vfio_device_info_cap_pci_bdf {
> >          struct vfio_info_cap_header header;
> >          __u32   group_id;
> >          __u16   segment;
> >          __u8    bus;
> >          __u8    devfn; /* Use PCI_SLOT/PCI_FUNC */
> > };
> >
> 
> Group-id and bdf should be separate capabilities, all device should
> report a group-id capability and only PCI devices a bdf capability.

OK. Since this is to support the device fd passing usage, all the vfio
device drivers need to report the group-id capability, is that right? If
so, we may add a helper like the one below in vfio_main.c. How about the
sample drivers? It does not seem necessary for them, right?

int vfio_pci_info_add_group_cap(struct device *dev,
                                struct vfio_info_cap *caps)
{
        struct vfio_pci_device_info_cap_group cap = {
                .header.id = VFIO_DEVICE_INFO_CAP_GROUP_ID,
                .header.version = 1,
        };
        struct iommu_group *iommu_group;

        /* Takes a reference on the group; NULL means the device has none. */
        iommu_group = iommu_group_get(dev);
        if (!iommu_group) {
                /* Drop the capabilities built so far and clear the stale pointer. */
                kfree(caps->buf);
                caps->buf = NULL;
                caps->size = 0;
                return -EPERM;
        }

        cap.group_id = iommu_group_id(iommu_group);

        /* Release the reference taken by iommu_group_get(). */
        iommu_group_put(iommu_group);

        return vfio_info_add_capability(caps, &cap.header, sizeof(cap));
}

Regards,
Yi Liu
Alex Williamson April 10, 2023, 2:41 p.m. UTC | #38
On Mon, 10 Apr 2023 08:48:54 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Sunday, April 9, 2023 9:30 PM  
> [...]
> > > yeah, needs to move the iommu group creation back to vfio_main.c. This
> > > would be a prerequisite for [1]
> > >
> > > [1] https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/
> > >
> > > I'll also try out your suggestion to add a capability like below and link
> > > it in the vfio_device_info cap chain.
> > >
> > > #define VFIO_DEVICE_INFO_CAP_PCI_BDF          5
> > >
> > > struct vfio_device_info_cap_pci_bdf {
> > >          struct vfio_info_cap_header header;
> > >          __u32   group_id;
> > >          __u16   segment;
> > >          __u8    bus;
> > >          __u8    devfn; /* Use PCI_SLOT/PCI_FUNC */
> > > };
> > >  
> > 
> > Group-id and bdf should be separate capabilities, all device should
> > report a group-id capability and only PCI devices a bdf capability.  
> 
> ok. Since this is to support the device fd passing usage, so we need to
> let all the vfio device drivers report group-id capability. is it? So may
> have a below helper in vfio_main.c. How about the sample drivers?
> seems not necessary for them. right?

The more common we can make it, the better, but if it ends up that the
individual drivers need to initialize the capability then it would
probably be limited to those drivers with a need to expose the group:
sample drivers, for the purpose of illustrating the interface, and of
course anything based on vfio-pci-core, which exposes hot-reset.  Thanks

Alex
Yi Liu April 10, 2023, 3:18 p.m. UTC | #39
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Monday, April 10, 2023 10:41 PM
> 
> On Mon, 10 Apr 2023 08:48:54 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Sunday, April 9, 2023 9:30 PM
> > [...]
> > > > yeah, needs to move the iommu group creation back to vfio_main.c. This
> > > > would be a prerequisite for [1]
> > > >
> > > > [1] https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/
> > > >
> > > > I'll also try out your suggestion to add a capability like below and link
> > > > it in the vfio_device_info cap chain.
> > > >
> > > > #define VFIO_DEVICE_INFO_CAP_PCI_BDF          5
> > > >
> > > > struct vfio_device_info_cap_pci_bdf {
> > > >          struct vfio_info_cap_header header;
> > > >          __u32   group_id;
> > > >          __u16   segment;
> > > >          __u8    bus;
> > > >          __u8    devfn; /* Use PCI_SLOT/PCI_FUNC */
> > > > };
> > > >
> > >
> > > Group-id and bdf should be separate capabilities, all device should
> > > report a group-id capability and only PCI devices a bdf capability.
> >
> > ok. Since this is to support the device fd passing usage, so we need to
> > let all the vfio device drivers report group-id capability. is it? So may
> > have a below helper in vfio_main.c. How about the sample drivers?
> > seems not necessary for them. right?
> 
> The more common we can make it, the better, but if it ends up that the
> individual drivers need to initialize the capability then it would
> probably be limited to those driver with a need to expose the group.

It looks to be such a case. vfio_device_info is assembled by the individual
drivers, so reporting the group_id capability as a common behavior requires
changing all of them. I have a quick draft of it in the commit below:

https://github.com/yiliu1765/iommufd/commit/ff4b8bee90761961041126305183a9a7e0f0542d

https://github.com/yiliu1765/iommufd/commits/report_group_id

> Sample drivers for the purpose of illustrating the interface and of
> course anything based on vfio-pci-core which exposes hot-reset.  Thanks

Do you see any sample drivers that need to report the group_id cap? IMHO,
it seems not.

Regards,
Yi Liu
Alex Williamson April 10, 2023, 3:23 p.m. UTC | #40
On Mon, 10 Apr 2023 15:18:27 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Monday, April 10, 2023 10:41 PM
> > 
> > On Mon, 10 Apr 2023 08:48:54 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > Sent: Sunday, April 9, 2023 9:30 PM  
> > > [...]  
> > > > > yeah, needs to move the iommu group creation back to vfio_main.c. This
> > > > > would be a prerequisite for [1]
> > > > >
> > > > > [1] https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/
> > > > >
> > > > > I'll also try out your suggestion to add a capability like below and link
> > > > > it in the vfio_device_info cap chain.
> > > > >
> > > > > #define VFIO_DEVICE_INFO_CAP_PCI_BDF          5
> > > > >
> > > > > struct vfio_device_info_cap_pci_bdf {
> > > > >          struct vfio_info_cap_header header;
> > > > >          __u32   group_id;
> > > > >          __u16   segment;
> > > > >          __u8    bus;
> > > > >          __u8    devfn; /* Use PCI_SLOT/PCI_FUNC */
> > > > > };
> > > > >  
> > > >
> > > > Group-id and bdf should be separate capabilities, all device should
> > > > report a group-id capability and only PCI devices a bdf capability.  
> > >
> > > ok. Since this is to support the device fd passing usage, so we need to
> > > let all the vfio device drivers report group-id capability. is it? So may
> > > have a below helper in vfio_main.c. How about the sample drivers?
> > > seems not necessary for them. right?  
> > 
> > The more common we can make it, the better, but if it ends up that the
> > individual drivers need to initialize the capability then it would
> > probably be limited to those driver with a need to expose the group.  
> 
> looks to be such a case. vfio_device_info is assembled by the individual
> drivers. If want to report group_id capability as a common behavior, needs
> to change all of them. Had a quick draft for it as below commit:
> 
> https://github.com/yiliu1765/iommufd/commit/ff4b8bee90761961041126305183a9a7e0f0542d
> 
> https://github.com/yiliu1765/iommufd/commits/report_group_id
> 
> > Sample drivers for the purpose of illustrating the interface and of
> > course anything based on vfio-pci-core which exposes hot-reset.  Thanks  
> 
> do you see any sample drivers need to report group_id cap? IMHO, seems
> no.

As in the quoted text, part of the purpose of the sample drivers is to
act both as a proof-of-concept and illustration of the API, therefore
gratuitous exposure of such capabilities should be encouraged.  They
would also provide a proof point of an mdev device, ie. emulated IOMMU
device, exposing the capability.  Thanks,

Alex
Yi Liu April 11, 2023, 6:16 a.m. UTC | #41
Hi Alex,

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, April 7, 2023 8:04 PM
> 
> On Fri, 7 Apr 2023 10:09:58 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > Hi Alex,
> >
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Monday, April 3, 2023 11:02 PM
> > >
> > > On Mon, 3 Apr 2023 09:25:06 +0000
> > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > >
> > > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > > Sent: Saturday, April 1, 2023 10:44 PM
> > > >
> > > > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void
> > > *data)
> > > > >  	if (!iommu_group)
> > > > >  		return -EPERM; /* Cannot reset non-isolated devices */
> > > >
> > > > Hi Alex,
> > > >
> > > > Is disabling iommu a sane way to test vfio noiommu mode?
> > >
> > > Yes
> > >
> > > > I added intel_iommu=off to disable intel iommu and bind a device to vfio-pci.
> > > > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0. Bind
> > > > iommufd==-1 can succeed, but failed to get hot reset info due to the above
> > > > group check. Reason is that this happens to have some affected devices, and
> > > > > these devices have no valid iommu_group (because they are not bound to vfio-pci
> > > > > hence nobody allocates noiommu group for them). So when hot reset info loops
> > > > such devices, it failed with -EPERM. Is this expected?
> > >
> > > Hmm, I didn't recall that we put in such a limitation, but given the
> > > minimally intrusive approach to no-iommu and the fact that we never
> > > defined an invalid group ID to return to the user, it makes sense that
> > > we just blocked the ioctl for no-iommu use.  I guess we can do the same
> > > for no-iommu cdev.
> >
> > I just realize a further issue related to this limitation. Remember that we
> > may finally compile out the vfio group infrastructure in the future. Say I
> > want to test noiommu, I may boot such a kernel with iommu disabled. I think
> > the _INFO ioctl would fail as there is no iommu_group. Does it mean we will
> > not support hot reset for noiommu in future if vfio group infrastructure is
> > compiled out?
> 
> We're talking about IOMMU groups, IOMMU groups are always present
> regardless of whether we expose a vfio group interface to userspace.
> Remember, we create IOMMU groups even in the no-iommu case.  Even with
> pure cdev, there are underlying IOMMU groups that maintain the DMA
> ownership.

I just realized that there is one case that does not have an iommu group,
although it is not implemented yet. There was a discussion on SIOV support.
IIRC, it was agreed that there is no need to allocate an iommu_group for the
SIOV case. Kevin or Jason can keep me honest here. I failed to find the link
to this discussion.

> > As another thread, we are going to add a new bdf/group capability to
> > DEVICE_GET_INFO. If the above kernel is booted, shall we exclude the new
> > bdf/group capability or add a flag in the capability to mark the group_id
> > is invalid?
> 
> As above, there's always an IOMMU group, it's never invalid.  Thanks,

Regards,
Yi Liu
Jason Gunthorpe April 11, 2023, 1:24 p.m. UTC | #42
On Thu, Apr 06, 2023 at 11:53:47AM -0600, Alex Williamson wrote:

> Where whether a device is opened is subject to change outside of the
> user's control.  This essentially allows the user to perform hot-resets
> of devices outside of their ownership so long as the device is not
> used elsewhere, versus the current requirement that the user own all the
> affected groups, which implies device ownership.  It's not been
> justified why this feature needs to exist, imo.

The cdev API doesn't have the notion that owning a group means you
"own" some collection of devices. It still happens as a side effect,
but it isn't obviously part of the API. I'm really loath to
re-introduce that group-based concept just for this. We are trying to
reduce the group API surface.

How about a different direction.

We add a new uAPI for cdev mode that is "take ownership of the reset
group". Maybe it can be a flag during bind.

When requested vfio will ensure that every device in the reset group
is only bound to this iommufd_ctx or left closed. Now and in the
future. Since no-iommu has no iommufd_ctx this means we can open only
one device in the reset group.

With this flag RESET is guaranteed to always work by definition.

We continue with the zero-length FD, but we can just replace the
security checks with a check if we are in reset group ownership mode.

_INFO is unchanged.

We decide if we add a new IOCTL to return the BDF so the existing
_INFO can get back to the dev_id or a new IOCTL that returns the
dev_id list of the reset group.

Userspace is required to figure out the extent of the reset, but we
don't require that userspace prove to the kernel it did this when
requesting the reset.

Jason
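
[Illustration] The "reset group ownership" rule proposed above can be
stated as a predicate: every device in the reset group is either left
closed or bound to the requesting iommufd_ctx, and in the no-iommu case
(no ctx at all) at most one device in the group may be open. The sketch
below models only that rule in plain C. struct reset_dev and
reset_group_owned_by() are hypothetical names for this example and do not
correspond to existing kernel code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one device in a reset group (hypothetical). */
struct reset_dev {
        bool opened;       /* device fd is currently open */
        const void *ictx;  /* iommufd_ctx it is bound to, NULL for no-iommu */
};

/*
 * Return true when the caller identified by owner may claim the whole
 * reset group: each device is closed or bound to owner. With owner == NULL
 * (no-iommu) there is no ctx to compare, so at most one device may be open.
 */
static bool reset_group_owned_by(const struct reset_dev *devs, size_t n,
                                 const void *owner)
{
        size_t i, open = 0;

        if (!owner) {
                for (i = 0; i < n; i++)
                        if (devs[i].opened)
                                open++;
                return open <= 1;
        }

        for (i = 0; i < n; i++)
                if (devs[i].opened && devs[i].ictx != owner)
                        return false;
        return true;
}
```

Under the proposal the same condition would also have to hold for later
opens ("now and in the future"), which this one-shot check does not model.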
Jason Gunthorpe April 11, 2023, 1:33 p.m. UTC | #43
On Fri, Apr 07, 2023 at 03:07:21PM -0600, Alex Williamson wrote:

> I think we need to revisit the question of why allocating an IOMMU
> group for a no-iommu device is exclusive to the vfio group support.

One of the points of this effort is to remove the co-mingling of iommu
and VFIO so much. We should not create the fake iommu groups for
no-iommu.

The _INFO API reporting the group is not a good reason to wreck this
clean separation.

Jason
Jason Gunthorpe April 11, 2023, 1:34 p.m. UTC | #44
On Sun, Apr 09, 2023 at 07:29:51AM -0600, Alex Williamson wrote:

> > struct vfio_device_info_cap_pci_bdf {
> >          struct vfio_info_cap_header header;
> >          __u32   group_id;
> >          __u16   segment;
> >          __u8    bus;
> >          __u8    devfn; /* Use PCI_SLOT/PCI_FUNC */
> > };
> > 
> 
> Group-id and bdf should be separate capabilities, all device should
> report a group-id capability and only PCI devices a bdf capability.

Group should be reported by iommufd using a generic ioctl, and not be
part of VFIO.

This should report BDF only and only work for PCI.

Jason
Alex Williamson April 11, 2023, 3:54 p.m. UTC | #45
On Tue, 11 Apr 2023 10:24:58 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Thu, Apr 06, 2023 at 11:53:47AM -0600, Alex Williamson wrote:
> 
> > Where whether a device is opened is subject to change outside of the
> > user's control.  This essentially allows the user to perform hot-resets
> > of devices outside of their ownership so long as the device is not
> > used elsewhere, versus the current requirement that the user own all the
> > affected groups, which implies device ownership.  It's not been
> > justified why this feature needs to exist, imo.  
> 
> The cdev API doesn't have the notion that owning a group means you
> "own" some collection of devices. It still happens as a side effect,
> but it isn't obviously part of the API. I'm really loath to
> re-introduce that group-based concept just for this. We are trying
> reduce the group API surface.
> 
> How about a different direction.
> 
> We add a new uAPI for cdev mode that is "take ownership of the reset
> group". Maybe it can be a flag in during bind.
> 
> When requested vfio will ensure that every device in the reset group
> is only bound to this iommufd_ctx or left closed. Now and in the
> future. Since no-iommu has no iommufd_ctx this means we can open only
> one device in the reset group.
> 
> With this flag RESET is guaranteed to always work by definition.
> 
> We continue with the zero-length FD, but we can just replace the
> security checks with a check if we are in reset group ownership mode.
> 
> _INFO is unchanged.
> 
> We decide if we add a new IOCTL to return the BDF so the existing
> _INFO can get back to the dev_id or a new IOCTL that returns the
> dev_id list of the reset group.
> 
> Userspace is required to figure out the extent of the reset, but we
> don't require that userspace prove to the kernel it did this when
> requesting the reset.

Take for example a multi-function PCIe device with ACS isolation between
functions, are you going to allow a user who has only been granted
ownership of a subset of functions control of the entire dev_set?  It
seems this proposal essentially extends the ownership model to the
greater of the dev_set or iommu group, apparently neither of which are
explicitly exposed to the user in the cdev API.  How does a user
determine when devices cannot be used independently in the cdev API?
Thanks,

Alex
Alex Williamson April 11, 2023, 5:11 p.m. UTC | #46
[Appears the list got dropped, replying to my previous message to re-add]

On Tue, 11 Apr 2023 13:32:16 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Apr 11, 2023 at 09:54:17AM -0600, Alex Williamson wrote:
> > On Tue, 11 Apr 2023 10:24:58 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Thu, Apr 06, 2023 at 11:53:47AM -0600, Alex Williamson wrote:
> > >   
> > > > Where whether a device is opened is subject to change outside of the
> > > > user's control.  This essentially allows the user to perform hot-resets
> > > > of devices outside of their ownership so long as the device is not
> > > > used elsewhere, versus the current requirement that the user own all the
> > > > affected groups, which implies device ownership.  It's not been
> > > > justified why this feature needs to exist, imo.    
> > > 
> > > The cdev API doesn't have the notion that owning a group means you
> > > "own" some collection of devices. It still happens as a side effect,
> > > but it isn't obviously part of the API. I'm really loath to
> > > re-introduce that group-based concept just for this. We are trying
> > > reduce the group API surface.
> > > 
> > > How about a different direction.
> > > 
> > > We add a new uAPI for cdev mode that is "take ownership of the reset
> > > group". Maybe it can be a flag in during bind.
> > > 
> > > When requested vfio will ensure that every device in the reset group
> > > is only bound to this iommufd_ctx or left closed. Now and in the
> > > future. Since no-iommu has no iommufd_ctx this means we can open only
> > > one device in the reset group.
> > > 
> > > With this flag RESET is guaranteed to always work by definition.
> > > 
> > > We continue with the zero-length FD, but we can just replace the
> > > security checks with a check if we are in reset group ownership mode.
> > > 
> > > _INFO is unchanged.
> > > 
> > > We decide if we add a new IOCTL to return the BDF so the existing
> > > _INFO can get back to the dev_id or a new IOCTL that returns the
> > > dev_id list of the reset group.
> > > 
> > > Userspace is required to figure out the extent of the reset, but we
> > > don't require that userspace prove to the kernel it did this when
> > > requesting the reset.  
> > 
> > Take for example a multi-function PCIe device with ACS isolation between
> > functions, are you going to allow a user who has only been granted
> > ownership of a subset of functions control of the entire dev_set?  
> 
> Our cdev model says that opening a cdev locks out other cdevs from
> independent use, eg because of the group sharing. Extending this to
> include the reset group as well seems consistent.

The DMA ownership model based on the IOMMU group is consistent with
legacy vfio, but now you're proposing a new ownership model that
optionally allows a user to extend their ownership, opportunistically
lock out other users, and wreak havoc for management utilities that
also have no insight into dev_sets or userspace driver behavior.

> There is some security concern here, but that goes both ways, a 3rd
> party should not be able to break an application that needs to use
> this RESET and had sufficient privileges to assert an ownership.

There are clearly scenarios we have now that could break.  For example,
today if QEMU doesn't own all the IOMMU groups for a multi-function
device, it can't do a reset, the remaining functions are available for
other users.  As I understand the proposal, QEMU now gets to attempt to
claim ownership of the dev_set, so it opportunistically extends its
ownership and may block other users from the affected devices.
Ordering makes this effectively unpredictable: if a userspace like DPDK
that doesn't assert dev_set ownership is started first, QEMU can start
and be denied hot-reset support.  In the reverse ordering, the DPDK
application can be locked out by QEMU.
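
The ordering dependence described above can be sketched with a toy model.
This is not kernel code: model_dev_set, model_open() and model_claim() are
hypothetical names, and the claim rule shown (a claim succeeds only if no
other user has a device in the set open, and a claimed set refuses opens
from other users) is only one reading of the proposal as understood here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NO_USER   (-1)	/* hypothetical marker: unopened / unclaimed */
#define MAX_DEVS  8

/* Toy model of a dev_set; not a kernel structure. */
struct model_dev_set {
	int owner;		/* user holding reset ownership, or NO_USER */
	int opener[MAX_DEVS];	/* user holding each device open, or NO_USER */
	size_t n_devs;
};

void model_init(struct model_dev_set *s, size_t n_devs)
{
	assert(n_devs <= MAX_DEVS);
	s->owner = NO_USER;
	s->n_devs = n_devs;
	for (size_t i = 0; i < n_devs; i++)
		s->opener[i] = NO_USER;
}

/* Open device idx for user; fails if it is already open or another user owns the set. */
bool model_open(struct model_dev_set *s, size_t idx, int user)
{
	assert(idx < s->n_devs);
	if (s->owner != NO_USER && s->owner != user)
		return false;
	if (s->opener[idx] != NO_USER)
		return false;
	s->opener[idx] = user;
	return true;
}

/* Claim reset ownership; fails if any device is held open by a different user. */
bool model_claim(struct model_dev_set *s, int user)
{
	if (s->owner != NO_USER && s->owner != user)
		return false;
	for (size_t i = 0; i < s->n_devs; i++)
		if (s->opener[i] != NO_USER && s->opener[i] != user)
			return false;
	s->owner = user;
	return true;
}
```

With a two-device set, letting the DPDK-like user open its device first
makes the QEMU-like user's model_claim() fail; letting QEMU open and claim
first makes the later model_open() from the DPDK-like user fail.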

> I'd say anyone should be able to assert RESET ownership if, like
> today, the iommufd_ctx has all the groups of the dev_set inside
> it. Once asserted it becomes safe against all forms of hotplug, and
> continues to be safe even if some of the devices are closed. eg hot
> unplugging from the VM doesn't change the availability of RESET.
> 
> This comes from your ask that qemu know clearly if RESET works, and it
> doesn't change while qemu is running. This seems stronger and clearer
> than the current implicit scheme. It also doesn't require userspace to
> do any calculations with groups or BDFs to figure out if RESET is
> available, kernel confirms it directly.

As above, clarity and predictability seem lacking in this proposal.
With the current scheme, the ownership of the affected devices is
implied if they exist within an owned group, but the strength of that
ownership is clear.  An affected device outside the set of owned groups
means that hot-reset is unavailable, without any of these "but QEMU might
be able to request it" or "unless the affected device is currently
unopened" variables.

> > seems this proposal essentially extends the ownership model to the
> > greater of the dev_set or iommu group, apparently neither of which
> > are explicitly exposed to the user in the cdev API.  
> 
> IIRC the group id can be learned from sysfs before opening the cdev
> file. Something like /sys/class/vfio/XX/../../iommu_group

And in the passed cdev fd model... ?

> We should also have an iommufd ioctl to report the "same ioas"
> groupings of dev_ids to make it easy on userspace. I haven't checked
> to see what the current qemu patches are doing with this..

Seems we're ignoring that no-iommu doesn't have a valid iommufd.

> > How does a user determine when devices cannot be used independently
> > in the cdev API?   
> 
> We have this problem right now. The only way to learn the reset group
> is to call the _INFO ioctl. We could add a sysfs "pci_reset_group"
> under /sys/class/vfio/XX/ if something needs it earlier.

For all the complaints about complexity, now we're asking management
tools to not only take into account IOMMU groups, but also reset
groups, and some inferred knowledge about the application and devices
to speculate whether reset group ownership is taken by a given
userspace??  Thanks,

Alex
Jason Gunthorpe April 11, 2023, 6:40 p.m. UTC | #47
On Tue, Apr 11, 2023 at 11:11:17AM -0600, Alex Williamson wrote:
> [Appears the list got dropped, replying to my previous message to re-add]

Wow, this got messed up a lot; mutt drops the cc when replying for some
reason. I think it is fixed up now.

> > Our cdev model says that opening a cdev locks out other cdevs from
> > independent use, eg because of the group sharing. Extending this to
> > include the reset group as well seems consistent.
> 
> The DMA ownership model based on the IOMMU group is consistent with
> legacy vfio, but now you're proposing a new ownership model that
> optionally allows a user to extend their ownership, opportunistically
> lock out other users, and wreak havoc for management utilities that
> also have no insight into dev_sets or userspace driver behavior.

I suggested below that the ownership require enough open devices - so
it doesn't "extend ownership opportunistically", and there is no
havoc.

Management tools already need to understand dev_set if they want to
offer reliable reset support to the VMs. Same as today.
 
> > There is some security concern here, but that goes both ways, a 3rd
> > party should not be able to break an application that needs to use
> > this RESET and had sufficient privileges to assert an ownership.
> 
> There are clearly scenarios we have now that could break.  For example,
> today if QEMU doesn't own all the IOMMU groups for a multi-function
> device, it can't do a reset, the remaining functions are available for
> other users. 

Sure, and we can keep that with this approach.

> As I understand the proposal, QEMU now gets to attempt to
> claim ownership of the dev_set, so it opportunistically extends its
> ownership and may block other users from the affected devices.

We can decide the policy for the kernel to accept a claim. I suggested
below "same as today" - it must hold all the groups within the
iommufd_ctx.

The main point is to make this claiming operation qemu needs to do
clearer and more explicit. I view this as better than trying to guess
if it successfully made the claim by inspecting the _INFO output.

> > I'd say anyone should be able to assert RESET ownership if, like
> > today, the iommufd_ctx has all the groups of the dev_set inside
> > it. Once asserted it becomes safe against all forms of hotplug, and
> > continues to be safe even if some of the devices are closed. eg hot
> > unplugging from the VM doesn't change the availability of RESET.
> > 
> > This comes from your ask that qemu know clearly if RESET works, and it
> > doesn't change while qemu is running. This seems stronger and clearer
> > than the current implicit scheme. It also doesn't require userspace to
> > do any calculations with groups or BDFs to figure out if RESET is
> > available, kernel confirms it directly.
> 
> As above, clarity and predictability seem lacking in this proposal.
> With the current scheme, the ownership of the affected devices is
> implied if they exist within an owned group, but the strength of that
> ownership is clear.  

Same logic holds here.

Ownership is claimed same as today by having all groups represented
in the iommufd_ctx. This seems just as clear as today.

> > > seems this proposal essentially extends the ownership model to the
> > > greater of the dev_set or iommu group, apparently neither of which
> > > are explicitly exposed to the user in the cdev API.  
> > 
> > IIRC the group id can be learned from sysfs before opening the cdev
> > file. Something like /sys/class/vfio/XX/../../iommu_group
> 
> And in the passed cdev fd model... ?

IMHO we should try to avoid needing to expose group_id specifically to
userspace. We are missing a way to learn the "same ioas" restriction
in iommufd, and it should provide that directly based on dev_ids.

Otherwise if we really really need group_id then iommufd should
provide an ioctl to get it. Let's find a good reason first.

> > We should also have an iommufd ioctl to report the "same ioas"
> > groupings of dev_ids to make it easy on userspace. I haven't checked
> > to see what the current qemu patches are doing with this..
> 
> Seems we're ignoring that no-iommu doesn't have a valid iommufd.

no-iommu doesn't and shouldn't have iommu_groups either. It also
doesn't have an IOAS so querying for same-IOAS is not necessary.

The simplest option for no-iommu is to require it to pass in every
device fd to the reset ioctl.

> > > How does a user determine when devices cannot be used independently
> > > in the cdev API?   
> > 
> > We have this problem right now. The only way to learn the reset group
> > is to call the _INFO ioctl. We could add a sysfs "pci_reset_group"
> > under /sys/class/vfio/XX/ if something needs it earlier.
> 
> For all the complaints about complexity, now we're asking management
> tools to not only take into account IOMMU groups, but also reset
> groups, and some inferred knowledge about the application and devices
> to speculate whether reset group ownership is taken by a given
> userspace??

No, we are trying to keep things pretty much the same as today without
resorting to exposing a lot of group related concepts.

The reset group is a clear concept that already exists and isn't
exposed. If we really need to know about it then it should be exposed
on its own, as a separate discussion from this cdev stuff.

I want to re-focus on the basics of what cdev is supposed to be doing,
because several of the ideas you suggested seem to go against this direction:

 - cdev does not have, and cannot rely on vfio_groups. We enforce this
   by compiling all the vfio_group infrastructure out. iommu_groups
   continue to exist.
   
   So converting a cdev to a vfio_group is not an allowed operation.

 - no-iommu should not have iommu_groups. We enforce this by compiling
   out all the no-iommu vfio_group infrastructure.

 - cdev APIs should ideally not require the user to know the group_id,
   we should try hard to design APIs to avoid this.

We have solved every other problem but reset like this, I would like
to get past reset without compromising the above.

Jason
Alex Williamson April 11, 2023, 9:58 p.m. UTC | #48
On Tue, 11 Apr 2023 15:40:07 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Apr 11, 2023 at 11:11:17AM -0600, Alex Williamson wrote:
> > [Appears the list got dropped, replying to my previous message to re-add]  
> 
> Wow, this got messed up a lot; mutt drops the cc when replying for some
> reason. I think it is fixed up now.
> 
> > > Our cdev model says that opening a cdev locks out other cdevs from
> > > independent use, eg because of the group sharing. Extending this to
> > > include the reset group as well seems consistent.  
> > 
> > The DMA ownership model based on the IOMMU group is consistent with
> > legacy vfio, but now you're proposing a new ownership model that
> > optionally allows a user to extend their ownership, opportunistically
> > lock out other users, and wreak havoc for management utilities that
> > also have no insight into dev_sets or userspace driver behavior.  
> 
> I suggested below that the ownership require enough open devices - so
> it doesn't "extend ownership opportunistically", and there is no
> havoc.
> 
> Management tools already need to understand dev_set if they want to
> offer reliable reset support to the VMs. Same as today.

I don't think that's true.  Our primary hot-reset use case is GPUs and
subordinate functions, where the isolation and reset scope are often
sufficiently similar to make hot-reset possible, regardless whether
all the functions are assigned to a VM.  I don't think you'll find any
management tools that take reset scope into account otherwise.

> > > There is some security concern here, but that goes both ways, a 3rd
> > > party should not be able to break an application that needs to use
> > > this RESET and had sufficient privileges to assert an ownership.  
> > 
> > There are clearly scenarios we have now that could break.  For example,
> > today if QEMU doesn't own all the IOMMU groups for a multi-function
> > device, it can't do a reset, the remaining functions are available for
> > other users.   
> 
> Sure, and we can keep that with this approach.
> 
> > As I understand the proposal, QEMU now gets to attempt to
> > claim ownership of the dev_set, so it opportunistically extends its
> > ownership and may block other users from the affected devices.  
> 
> We can decide the policy for the kernel to accept a claim. I suggested
> below "same as today" - it must hold all the groups within the
> iommufd_ctx.

It must hold all the groups [that the user doesn't know about because
it's not a formal part of the cdev API] within the iommufd_ctx?
 
> The main point is to make this claiming operation qemu needs to do
> clearer and more explicit. I view this as better than trying to guess
> if it successfully made the claim by inspecting the _INFO output.

There is no guessing in the current API.  Guessing is what happens
when hot-reset magically works because one of the devices wasn't opened
at the time, or the iommufd_ctx happens to hold all the affected groups
that the user doesn't have an API to understand.  The current API has a
very concise requirement: the user must own all of the groups affected
by the hot-reset in order to effect a hot-reset.
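
As a minimal sketch of that requirement (a model only, not the vfio-pci
implementation; hot_reset_permitted() and group_is_owned() are hypothetical
names and groups are represented as plain integer IDs), the check amounts
to a subset test of the affected groups against the owned groups:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if group id appears in owned[0..n_owned). */
static bool group_is_owned(int id, const int *owned, size_t n_owned)
{
	for (size_t i = 0; i < n_owned; i++)
		if (owned[i] == id)
			return true;
	return false;
}

/*
 * Model of the rule: a hot-reset is permitted only if every group
 * affected by the reset is owned by the caller.
 */
bool hot_reset_permitted(const int *affected, size_t n_affected,
			 const int *owned, size_t n_owned)
{
	assert(affected || n_affected == 0);
	assert(owned || n_owned == 0);

	for (size_t i = 0; i < n_affected; i++)
		if (!group_is_owned(affected[i], owned, n_owned))
			return false;
	return true;
}
```

Owning extra groups beyond the affected set does not matter in this model;
missing a single affected group is enough to refuse the reset.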

> > > I'd say anyone should be able to assert RESET ownership if, like
> > > today, the iommufd_ctx has all the groups of the dev_set inside
> > > it. Once asserted it becomes safe against all forms of hotplug, and
> > > continues to be safe even if some of the devices are closed. eg hot
> > > unplugging from the VM doesn't change the availability of RESET.
> > > 
> > > This comes from your ask that qemu know clearly if RESET works, and it
> > > doesn't change while qemu is running. This seems stronger and clearer
> > > than the current implicit scheme. It also doesn't require userspace to
> > > do any calculations with groups or BDFs to figure out if RESET is
> > > available, kernel confirms it directly.  
> > 
> > As above, clarity and predictability seem lacking in this proposal.
> > With the current scheme, the ownership of the affected devices is
> > implied if they exist within an owned group, but the strength of that
> > ownership is clear.    
> 
> Same logic holds here
> 
> Ownership is claimed same as today by having all groups represented
> in the iommufd_ctx. This seems just as clear as today.

I don't know if anyone else is having this trouble, but I'm seeing
conflicting requirements.  The cdev API is not to expose groups unless
a requirement is found to need them, of which this is apparently not
one, but all the groups need to be represented in the iommufd_ctx in
order to make use of this interface.  How is that clear?

> > > > seems this proposal essentially extends the ownership model to the
> > > > greater of the dev_set or iommu group, apparently neither of which
> > > > are explicitly exposed to the user in the cdev API.    
> > > 
> > > IIRC the group id can be learned from sysfs before opening the cdev
> > > file. Something like /sys/class/vfio/XX/../../iommu_group  
> > 
> > And in the passed cdev fd model... ?  
> 
> IMHO we should try to avoid needing to expose group_id specifically to
> userspace. We are missing a way to learn the "same ioas" restriction
> in iommufd, and it should provide that directly based on dev_ids.

Is this yet another "we need to expose groups to understand the ioas
restriction but we're not going to because reasons" argument?

> Otherwise if we really really need group_id then iommufd should
> provide an ioctl to get it. Let's find a good reason first

If needing to have all of the groups represented in an iommufd_ctx in
order to effect a reset without allowing the user to know the set of
affected groups and device to group relationship isn't a reason... well
I'm just lost.

> > > We should also have an iommufd ioctl to report the "same ioas"
> > > groupings of dev_ids to make it easy on userspace. I haven't checked
> > > to see what the current qemu patches are doing with this..  
> > 
> > Seems we're ignoring that no-iommu doesn't have a valid iommufd.  
> 
> no-iommu doesn't and shouldn't have iommu_groups either. It also
> doesn't have an IOAS so querying for same-IOAS is not necessary.
> 
> The simplest option for no-iommu is to require it to pass in every
> device fd to the reset ioctl.

Which ironically is exactly how it ends up working today, each no-iommu
device has a fake IOMMU group, so every affected device (group) needs
to be provided.

> > > > How does a user determine when devices cannot be used independently
> > > > in the cdev API?     
> > > 
> > > We have this problem right now. The only way to learn the reset group
> > > is to call the _INFO ioctl. We could add a sysfs "pci_reset_group"
> > > under /sys/class/vfio/XX/ if something needs it earlier.  
> > 
> > For all the complaints about complexity, now we're asking management
> > tools to not only take into account IOMMU groups, but also reset
> > groups, and some inferred knowledge about the application and devices
> > to speculate whether reset group ownership is taken by a given
> > userspace??  
> 
> No, we are trying to keep things pretty much the same as today without
> resorting to exposing a lot of group related concepts.
> 
> The reset group is a clear concept that already exists and isn't
> exposed. If we really need to know about it then it should be exposed
> on its own, as a separate discussion from this cdev stuff.

"[A]nd isn't exposed"... what exactly is the hot-reset INFO ioctl
exposing if not that?

> I want to re-focus on the basics of what cdev is supposed to be doing,
> because several of the ideas you suggested seem to go against this direction:
> 
>  - cdev does not have, and cannot rely on vfio_groups. We enforce this
>    by compiling all the vfio_group infrastructure out. iommu_groups
>    continue to exist.
>    
>    So converting a cdev to a vfio_group is not an allowed operation.

My only statements in this respect were towards the notion that IOMMU
groups continue to exist.  I'm well aware of the desire to deprecate
and remove vfio groups.
 
>  - no-iommu should not have iommu_groups. We enforce this by compiling
>    out all the no-iommu vfio_group infrastructure.

This is not logically inferred from the above if IOMMU groups continue
to exist and continue to be a basis for describing DMA ownership as
well as "reset groups".

>  - cdev APIs should ideally not require the user to know the group_id,
>    we should try hard to design APIs to avoid this.

This is a nuance, group_id vs group, where it's been previously
discussed that users will need to continue to know the boundaries of a
group for the purpose of DMA isolation and potentially IOAS
independence should cdev/iommufd choose to tackle those topics.
 
> We have solved every other problem but reset like this, I would like
> to get past reset without compromising the above.

"These aren't the droids we're looking for."

What is the actual proposal here?  You've said that hot-reset works if
the iommufd_ctx has representation from each affected group, the INFO
ioctl remains as it is, which suggests that it's reporting group ID and
BDF, yet only sysfs tells the user the relation between a vfio cdev and
a group and we're trying to enable a pass-by-fd model for cdev where
the user has no reference to a sysfs node for the device.  Show me how
these pieces fit together.

OTOH, if we say IOMMU groups continue to exist [agreed], every vfio
device has an IOMMU group, and there's an API to learn the group ID, the
solution becomes much more clear and no-iommu devices require no
special cases or restrictions.  Not only does the INFO ioctl remain the
same, but the hot-reset ioctl itself remains effectively the same
accepting either vfio cdevs or groups.  Thanks,

Alex
Jason Gunthorpe April 12, 2023, 12:01 a.m. UTC | #49
On Tue, Apr 11, 2023 at 03:58:27PM -0600, Alex Williamson wrote:

> > Management tools already need to understand dev_set if they want to
> > offer reliable reset support to the VMs. Same as today.
> 
> I don't think that's true. Our primary hot-reset use case is GPUs and
> subordinate functions, where the isolation and reset scope are often
> sufficiently similar to make hot-reset possible, regardless whether
> all the functions are assigned to a VM.  I don't think you'll find any
> management tools that take reset scope into account otherwise.

When I think of "reliable reset support" I think of the management
tool offering a checkbox that says "ensure PCI function reset
availability" and if checked it will not launch the VM without a
working reset.

If the user configures a set of VFIO devices and then hopes they get
working reset, that is fine, but doesn't require any reporting of
reset groups, or iommu groups to the management layer to work.

> > > As I understand the proposal, QEMU now gets to attempt to
> > > claim ownership of the dev_set, so it opportunistically extends its
> > > ownership and may block other users from the affected devices.  
> > 
> > We can decide the policy for the kernel to accept a claim. I suggested
> > below "same as today" - it must hold all the groups within the
> > iommufd_ctx.
> 
> It must hold all the groups [that the user doesn't know about because
> it's not a formal part of the cdev API] within the iommufd_ctx?

You keep going back to this, but I maintain userspace doesn't
care. qemu is given a list of VFIO devices to use, all it wants to
know is if it is allowed to use reset or not. Why should it need to
know groups and group_ids to get that binary signal out of the kernel?

> > The simplest option for no-iommu is to require it to pass in every
> > device fd to the reset ioctl.
> 
> Which ironically is exactly how it ends up working today, each no-iommu
> device has a fake IOMMU group, so every affected device (group) needs
> to be provided.

Sure, that is probably the way forward for no-iommu. Not that anyone
uses it..

The kicker is we don't force the user to generate a de-duplicated list
of devices FDs, one per group, just because.

> > I want to re-focus on the basics of what cdev is supposed to be doing,
> > because several of the ideas you suggested seem to go against this direction:
> > 
> >  - cdev does not have, and cannot rely on vfio_groups. We enforce this
> >    by compiling all the vfio_group infrastructure out. iommu_groups
> >    continue to exist.
> >    
> >    So converting a cdev to a vfio_group is not an allowed operation.
> 
> My only statements in this respect were towards the notion that IOMMU
> groups continue to exist.  I'm well aware of the desire to deprecate
> and remove vfio groups.

Yes

> >  - no-iommu should not have iommu_groups. We enforce this by compiling
> >    out all the no-iommu vfio_group infrastructure.
> 
> This is not logically inferred from the above if IOMMU groups continue
> to exist and continue to be a basis for describing DMA ownership as
> well as "reset groups"

It is not meant to flow out of the above; it is a separate statement. I
want the iommu_group mechanism to stop being abused outside the iommu
core code. The only thing that should be creating groups is an
attached iommu driver operating under ops->device_group().

VFIO needed this to support mdev and no-iommu. We already have mdev
free of iommu_groups; I would like no-iommu to be free of them too,
we are very close.

That would leave POWER as the only abuser of the
iommu_group_add_device() API, and it is only doing it because it
hasn't got a proper iommu driver implementation yet. It turns out
their abuse is mislocked and maybe racy to boot :(

> >  - cdev APIs should ideally not require the user to know the group_id,
> >    we should try hard to design APIs to avoid this.
> 
> This is a nuance, group_id vs group, where it's been previously
> discussed that users will need to continue to know the boundaries of a
> group for the purpose of DMA isolation and potentially IOAS
> independence should cdev/iommufd choose to tackle those topics.

Yes, group_id is a value we have no specific use for and would require
userspace to keep separate track of. I'd prefer to rely on dev_id as
much as possible instead.

> What is the actual proposal here?

I don't know anymore, you don't seem to like this direction either...

> You've said that hot-reset works if the iommufd_ctx has
> representation from each affected group, the INFO ioctl remains as
> it is, which suggests that it's reporting group ID and BDF, yet only
> sysfs tells the user the relation between a vfio cdev and a group
> and we're trying to enable a pass-by-fd model for cdev where the
> user has no reference to a sysfs node for the device.  Show me how
> these pieces fit together.

I prefer the version where INFO2 returns the dev_id, but INFO can work
if we do the BDF cap like you suggested to Yi.

> OTOH, if we say IOMMU groups continue to exist [agreed], every vfio
> device has an IOMMU group

I don't desire every VFIO device to have an iommu_group. I want VFIO
devices with real IOMMU drivers to have an iommu_group. mdev and
no-iommu should not. I don't want to add them back into the design
just so INFO has a value to return.

I'd rather give no-iommu a dummy dev_id in iommufdctx than give it an
iommu_group...

I see this problem as a few basic requirements from a qemu-like
application:

 1) Does the configuration I was given support reset right now?
 2) Will the configuration I was given support reset for the duration
    of my execution?
 3) What groups of the devices I already have open does the reset
    affect?
 4) For debugging, report to the user the full list of devices in the
    reset group, in a way that relates back to sysfs.
 5) A way to trigger a reset on a group of devices.

#1/#2 is the API I suggested here. Ask the kernel if the current
configuration works, and ask it to keep it working.

#3 is either INFO and a CAP for BDF or INFO2 reporting dev_id

#4 is either INFO and print the BDFs or INFO2 reporting the struct
vfio_device IDR # (eg /sys/class/vfio/vfioXXX/).

#5 is adjusting the FD list in the existing RESET ioctl. Removing the need
for userspace to specify a minimal exact list of FDs means userspace
doesn't need the information to figure out what that list actually
is. Pass a 0 length list and use iommufdctx.
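
One possible reading of the zero-length-list case in #5, combined with the
earlier "bound to this iommufd_ctx or left closed" rule, can be modelled as
below. This is a sketch under stated assumptions, not a proposed
implementation: struct model_dev, CTX_NONE and reset_permitted_for_ctx()
are hypothetical names, and contexts are represented as plain integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CTX_NONE (-1)	/* hypothetical marker: device is not opened */

/* Toy model of one device in the reset group; not a kernel structure. */
struct model_dev {
	int ctx;	/* context the device is bound to, or CTX_NONE */
};

/*
 * Model of a reset requested with an empty FD list: permitted only if
 * every device in the reset group is either unopened or bound to the
 * caller's context.
 */
bool reset_permitted_for_ctx(const struct model_dev *devs, size_t n,
			     int caller_ctx)
{
	assert(devs || n == 0);
	assert(caller_ctx != CTX_NONE);

	for (size_t i = 0; i < n; i++)
		if (devs[i].ctx != CTX_NONE && devs[i].ctx != caller_ctx)
			return false;
	return true;
}
```

In this model the caller never names the affected devices; the decision is
derived entirely from which context each device in the reset group is
bound to.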

None of these requirements suggests to me that qemu needs to know the
group_id, or that it needs to have enough information to know how to
fix an unavailable reset.

Did I miss a requirement here?

Regards,
Jason
Tian, Kevin April 12, 2023, 7:14 a.m. UTC | #50
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, April 12, 2023 5:58 AM
> 
> On Tue, 11 Apr 2023 15:40:07 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, Apr 11, 2023 at 11:11:17AM -0600, Alex Williamson wrote:
> > > [Appears the list got dropped, replying to my previous message to re-add]
> >
> > Wow, this got messed up a lot; mutt drops the cc when replying for some
> > reason. I think it is fixed up now.
> >
> > > > Our cdev model says that opening a cdev locks out other cdevs from
> > > > independent use, eg because of the group sharing. Extending this to
> > > > include the reset group as well seems consistent.
> > >
> > > The DMA ownership model based on the IOMMU group is consistent with
> > > legacy vfio, but now you're proposing a new ownership model that
> > > optionally allows a user to extend their ownership, opportunistically
> > > lock out other users, and wreak havoc for management utilities that
> > > also have no insight into dev_sets or userspace driver behavior.
> >
> > I suggested below that the ownership require enough open devices - so
> > it doesn't "extend ownership opportunistically", and there is no
> > havoc.
> >
> > Management tools already need to understand dev_set if they want to
> > offer reliable reset support to the VMs. Same as today.
> 
> I don't think that's true.  Our primary hot-reset use case is GPUs and
> subordinate functions, where the isolation and reset scope are often
> sufficiently similar to make hot-reset possible, regardless whether
> all the functions are assigned to a VM.  I don't think you'll find any
> management tools that take reset scope into account otherwise.

If we only care about the primary case where iommu group and reset
scope match, then why would the new claim model in Jason's proposal
urge the management tools to understand the reset scope now?

Btw, in your earlier replies you pointed out the issue of unpredictable
ordering on a multi-function device, e.g. whichever of dpdk or qemu runs
first will block the other. But I wonder what the actual use is of
allowing both to run when neither can do a reset due to the affected reset
scope in the current model.

If a vfio user cannot do a reset, doesn't that imply it hasn't acquired
full permission on the device, in which case Jason's proposal of explicitly
failing it is actually a cleaner model?

Thanks
Kevin
Tian, Kevin April 12, 2023, 7:27 a.m. UTC | #51
> From: Jason Gunthorpe
> Sent: Wednesday, April 12, 2023 8:01 AM
> 
> I see this problem as a few basic requirements from a qemu-like
> application:
> 
>  1) Does the configuration I was given support reset right now?
>  2) Will the configuration I was given support reset for the duration
>     of my execution?
>  3) What groups of the devices I already have open does the reset
>     affect?
>  4) For debugging, report to the user the full list of devices in the
>     reset group, in a way that relates back to sysfs.
>  5) A way to trigger a reset on a group of devices.
> 
> #1/#2 is the API I suggested here. Ask the kernel if the current
> configuration works, and ask it to keep it working.
> 
> #3 is either INFO and a CAP for BDF or INFO2 reporting dev_id
> 
> #4 is either INFO and print the BDFs or INFO2 reporting the struct
> vfio_device IDR # (eg /sys/class/vfio/vfioXXX/).

mdev doesn't have a BDF. Of course it doesn't support hot_reset either.

But it's presented to userspace as a PCI device. Is it weird for a PCI
device not to provide a BDF cap?

From this point of view the vfio_device IDR# sounds more generic.

> 
> #5 is adjusting the FD list in the existing RESET ioctl. Removing the need
> for userspace to specify a minimal exact list of FDs means userspace
> doesn't need the information to figure out what that list actually
> is. Pass a 0 length list and use iommufdctx.
> 
> None of these requirements suggests to me that qemu needs to know the
> group_id, or that it needs to have enough information to know how to
> fix an unavailable reset.
>
Yi Liu April 12, 2023, 10:09 a.m. UTC | #52
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, April 12, 2023 8:01 AM
> 
> On Tue, Apr 11, 2023 at 03:58:27PM -0600, Alex Williamson wrote:
> 
> > > Management tools already need to understand dev_set if they want to
> > > offer reliable reset support to the VMs. Same as today.
> >
> > I don't think that's true. Our primary hot-reset use case is GPUs and
> > subordinate functions, where the isolation and reset scope are often
> > sufficiently similar to make hot-reset possible, regardless whether
> > all the functions are assigned to a VM.  I don't think you'll find any
> > management tools that take reset scope into account otherwise.
> 
> When I think of "reliable reset support" I think of the management
> tool offering a checkbox that says "ensure PCI function reset
> availability" and if checked it will not launch the VM without a
> working reset.
> 
> If the user configures a set of VFIO devices and then hopes they get
> working reset, that is fine, but doesn't require any reporting of
> reset groups, or iommu groups to the management layer to work.
> 
> > > > As I understand the proposal, QEMU now gets to attempt to
> > > > claim ownership of the dev_set, so it opportunistically extends its
> > > > ownership and may block other users from the affected devices.
> > >
> > > We can decide the policy for the kernel to accept a claim. I suggested
> > > below "same as today" - it must hold all the groups within the
> > > iommufd_ctx.
> >
> > It must hold all the groups [that the user doesn't know about because
> > it's not a formal part of the cdev API] within the iommufd_ctx?
> 
> You keep going back to this, but I maintain userspace doesn't
> care. qemu is given a list of VFIO devices to use, all it wants to
> know is if it is allowed to use reset or not. Why should it need to
> know groups and group_ids to get that binary signal out of the kernel?
> 
> > > The simplest option for no-iommu is to require it to pass in every
> > > device fd to the reset ioctl.
> >
> > Which ironically is exactly how it ends up working today, each no-iommu
> > device has a fake IOMMU group, so every affected device (group) needs
> > to be provided.
> 
> Sure, that is probably the way forward for no-iommu. Not that anyone
> uses it..
> 
> The kicker is we don't force the user to generate a de-duplicated list
> of devices FDs, one per group, just because.
> 
> > > I want to re-focus on the basics of what cdev is supposed to be doing,
> > > because several of the ideas you suggested seem against this direction:
> > >
> > >  - cdev does not have, and cannot rely on vfio_groups. We enforce this
> > >    by compiling all the vfio_group infrastructure out. iommu_groups
> > >    continue to exist.
> > >
> > >    So converting a cdev to a vfio_group is not an allowed operation.
> >
> > My only statements in this respect were towards the notion that IOMMU
> > groups continue to exist.  I'm well aware of the desire to deprecate
> > and remove vfio groups.
> 
> Yes
> 
> > >  - no-iommu should not have iommu_groups. We enforce this by compiling
> > >    out all the no-iommu vfio_group infrastructure.
> >
> > This is not logically inferred from the above if IOMMU groups continue
> > to exist and continue to be a basis for describing DMA ownership as
> > well as "reset groups"
> 
> It is not meant to flow out of the above, it is a separate statement. I
> want the iommu_group mechanism to stop being abused outside the iommu
> core code. The only thing that should be creating groups is an
> attached iommu driver operating under ops->device_group().
> 
> VFIO needed this to support mdev and no-iommu. We already have mdev
> free of iommu_groups, I would like no-iommu to also be free of it too,
> we are very close.
> 
> That would leave POWER as the only abuser of the
> iommu_group_add_device() API, and it is only doing it because it
> hasn't got a proper iommu driver implementation yet. It turns out
> their abuse is mislocked and maybe racy to boot :(
> 
> > >  - cdev APIs should ideally not require the user to know the group_id,
> > >    we should try hard to design APIs to avoid this.
> >
> > This is a nuance, group_id vs group, where it's been previously
> > discussed that users will need to continue to know the boundaries of a
> > group for the purpose of DMA isolation and potentially IOAS
> > independence should cdev/iommufd choose to tackle those topics.
> 
> Yes, group_id is a value we have no specific use for and would require
> userspace to keep separate track of. I'd prefer to rely on dev_id as
> much as possible instead.
> 
> > What is the actual proposal here?
> 
> I don't know anymore, you don't seem to like this direction either...
> 
> > You've said that hot-reset works if the iommufd_ctx has
> > representation from each affected group, the INFO ioctl remains as
> > it is, which suggests that it's reporting group ID and BDF, yet only
> > sysfs tells the user the relation between a vfio cdev and a group
> > and we're trying to enable a pass-by-fd model for cdev where the
> > user has no reference to a sysfs node for the device.  Show me how
> > these pieces fit together.
> 
> I prefer the version where INFO2 returns the dev_id, but info can work
> if we do the BDF cap like you suggested to Yi
> 
> > OTOH, if we say IOMMU groups continue to exist [agreed], every vfio
> > device has an IOMMU group
> 
> I don't desire every VFIO device to have an iommu_group. I want VFIO
> devices with real IOMMU drivers to have an iommu_group. mdev and
> no-iommu should not. I don't want to add them back into the design
> just so INFO has a value to return.
> 
> I'd rather give no-iommu a dummy dev_id in iommufdctx than give it an
> iommu_group...
> 
> I see this problem as a few basic requirements from a qemu-like
> application:
> 
>  1) Does the configuration I was given support reset right now?
>  2) Will the configuration I was given support reset for the duration
>     of my execution?
>  3) What groups of the devices I already have open does the reset
>     affect?
>  4) For debugging, report to the user the full list of devices in the
>     reset group, in a way that relates back to sysfs.
>  5) A way to trigger a reset on a group of devices
> 
> #1/#2 is the API I suggested here. Ask the kernel if the current
> configuration works, and ask it to keep it working.
> 
> #3 is either INFO and a CAP for BDF or INFO2 reporting dev_id
> 
> #4 is either INFO and print the BDFs or INFO2 reporting the struct
> vfio_device IDR # (eg /sys/class/vfio/vfioXXX/).

I hope we can have a clear statement on the _INFO or INFO2 usage.
Today, per QEMU's implementation, the output of _INFO is used to:

1) do a self-check to see if all the affected groups are opened by the
    current user before it can invoke hot-reset.
2) figure out the devices that are already opened by the user. QEMU
    needs to save the state of such devices as the device may already
    be in use. If so, its state should be saved and restored prior/post
    the hot-reset.
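
The two uses of the _INFO output listed above can be sketched in plain C. This is only an illustrative model, not QEMU's code: the struct layouts and the function names `reset_allowed()` and `devices_to_save()` are hypothetical, and the ioctl itself is not called.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one entry reported by _INFO. */
struct affected_dev {
    uint32_t group_id;  /* IOMMU group of the affected device */
    uint16_t segment;   /* PCI segment (domain) */
    uint8_t  bus;
    uint8_t  devfn;
};

/* Hypothetical record of a device this user has already opened. */
struct opened_dev {
    uint16_t segment;
    uint8_t  bus;
    uint8_t  devfn;
};

/* Use 1, the self-check: reset is attempted only if every affected
 * group is among the groups the user holds. */
bool reset_allowed(const struct affected_dev *devs, size_t n_devs,
                   const uint32_t *owned_groups, size_t n_owned)
{
    assert(devs != NULL || n_devs == 0);
    for (size_t i = 0; i < n_devs; i++) {
        bool owned = false;
        for (size_t j = 0; j < n_owned; j++) {
            if (owned_groups[j] == devs[i].group_id) {
                owned = true;
                break;
            }
        }
        if (!owned)
            return false;   /* ownership gap: do not reset */
    }
    return true;
}

/* Use 2: find the affected devices that the user has open, since their
 * state has to be saved before and restored after the reset.
 * Stores indexes into devs[] in out[] and returns how many were found. */
size_t devices_to_save(const struct affected_dev *devs, size_t n_devs,
                       const struct opened_dev *opened, size_t n_opened,
                       size_t *out)
{
    size_t found = 0;

    for (size_t i = 0; i < n_devs; i++) {
        for (size_t j = 0; j < n_opened; j++) {
            if (devs[i].segment == opened[j].segment &&
                devs[i].bus == opened[j].bus &&
                devs[i].devfn == opened[j].devfn) {
                out[found++] = i;
                break;
            }
        }
    }
    return found;
}
```

Note that the first check only needs group ids, while the second only needs BDFs, which is why both are consumed from the same reply today.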

Seems like we are relaxing the self-check as it may be done by locking
the reset group. Is that right?

> #5 is adjusting the FD list in existing RESET ioctl. Remove the need
> for userspace to specify a minimal exact list of FDs means userspace
> doesn't need the information to figure out what that list actually
> is. Pass a 0 length list and use iommufdctx.

If the reset group is locked, there seems to be no need to check iommufdctx.

Thanks,
Yi Liu
Jason Gunthorpe April 12, 2023, 3:05 p.m. UTC | #53
On Wed, Apr 12, 2023 at 07:27:43AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe
> > Sent: Wednesday, April 12, 2023 8:01 AM
> > 
> > I see this problem as a few basic requirements from a qemu-like
> > application:
> > 
> >  1) Does the configuration I was given support reset right now?
> >  2) Will the configuration I was given support reset for the duration
> >     of my execution?
> >  3) What groups of the devices I already have open does the reset
> >     affect?
> >  4) For debugging, report to the user the full list of devices in the
> >     reset group, in a way that relates back to sysfs.
> >  5) A way to trigger a reset on a group of devices
> > 
> > #1/#2 is the API I suggested here. Ask the kernel if the current
> > configuration works, and ask it to keep it working.
> > 
> > #3 is either INFO and a CAP for BDF or INFO2 reporting dev_id
> > 
> > #4 is either INFO and print the BDFs or INFO2 reporting the struct
> > vfio_device IDR # (eg /sys/class/vfio/vfioXXX/).
> 
> mdev doesn't have BDF. Of course it doesn't support hot_reset either.

It should support a reset.. Maybe idxd doesn't, but it should be part
of the SIOV model. Our SIOV devices would need it for instance.

> but it's presented to userspace as a pci device. Is it weird for a pci
> device which doesn't provide a BDF cap?

It is weird for a PCI device, but it is not weird for a VFIO
device. Leaking the physical labels out of the uAPI is not clean,
IMHO.

> from this point the vfio_device IDR# sounds more generic.

Yes, I was thinking about this for the SIOV model.

Jason
Alex Williamson April 12, 2023, 4:50 p.m. UTC | #54
On Tue, 11 Apr 2023 21:01:06 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Apr 11, 2023 at 03:58:27PM -0600, Alex Williamson wrote:
> 
> > > Management tools already need to understand dev_set if they want to
> > > offer reliable reset support to the VMs. Same as today.  
> > 
> > I don't think that's true. Our primary hot-reset use case is GPUs and
> > subordinate functions, where the isolation and reset scope are often
> > sufficiently similar to make hot-reset possible, regardless whether
> > all the functions are assigned to a VM.  I don't think you'll find any
> > management tools that take reset scope into account otherwise.
> 
> When I think of "reliable reset support" I think of the management
> tool offering a checkbox that says "ensure PCI function reset
> availability" and if checked it will not launch the VM without a
> working reset.

This doesn't exist.

> If the user configures a set of VFIO devices and then hopes they get
> working reset, that is fine, but doesn't require any reporting of
> reset groups, or iommu groups to the management layer to work.

I think there's more than hope involved here, there are recipes to
create working hot-reset configurations because it is well specified
and predictable currently.  QEMU can indicate whether hot-reset is
available thanks to the information provided in the INFO ioctl and a VM
that owns the necessary set of groups may consistently and repeatedly
perform hot-resets.

> > > > As I understand the proposal, QEMU now gets to attempt to
> > > > claim ownership of the dev_set, so it opportunistically extends its
> > > > ownership and may block other users from the affected devices.    
> > > 
> > > We can decide the policy for the kernel to accept a claim. I suggested
> > > below "same as today" - it must hold all the groups within the
> > > iommufd_ctx.  
> > 
> > It must hold all the groups [that the user doesn't know about because
> > it's not a formal part of the cdev API] within the iommufd_ctx?  
> 
> You keep going back to this, but I maintain userspace doesn't
> care. qemu is given a list of VFIO devices to use, all it wants to
> know is if it is allowed to use reset or not. Why should it need to
> know groups and group_ids to get that binary signal out of the kernel?

hw/vfio/pci.c:2320
        error_report("vfio: Cannot reset device %s, "
                     "depends on group %d which is not owned.",
                     vdev->vbasedev.name, devices[i].group_id);

That creates a feedback loop where a user can take corrective action
with actual information in hand to resolve the issue.

> > > The simplest option for no-iommu is to require it to pass in every
> > > device fd to the reset ioctl.  
> > 
> > Which ironically is exactly how it ends up working today, each no-iommu
> > device has a fake IOMMU group, so every affected device (group) needs
> > to be provided.  
> 
> Sure, that is probably the way forward for no-iommu. Not that anyone
> uses it..
> 
> The kicker is we don't force the user to generate a de-duplicated list
> of devices FDs, one per group, just because.

So on one hand you're asking for simplicity, but on the other you're
criticizing a trivial simplification that we chose to allow the user to
pass a number of group fds equal to the number of devices affected so
that the user doesn't need to take that step to de-duplicate the list.
We can't win.
 
> > > I want to re-focus on the basics of what cdev is supposed to be doing,
> > > because several of the ideas you suggested seem against this direction:
> > > 
> > >  - cdev does not have, and cannot rely on vfio_groups. We enforce this
> > >    by compiling all the vfio_group infrastructure out. iommu_groups
> > >    continue to exist.
> > >    
> > >    So converting a cdev to a vfio_group is not an allowed operation.  
> > 
> > My only statements in this respect were towards the notion that IOMMU
> > groups continue to exist.  I'm well aware of the desire to deprecate
> > and remove vfio groups.  
> 
> Yes
> 
> > >  - no-iommu should not have iommu_groups. We enforce this by compiling
> > >    out all the no-iommu vfio_group infrastructure.  
> > 
> > This is not logically inferred from the above if IOMMU groups continue
> > to exist and continue to be a basis for describing DMA ownership as
> > well as "reset groups"  
> 
> It is not meant to flow out of the above, it is a separate statement. I
> want the iommu_group mechanism to stop being abused outside the iommu
> core code. The only thing that should be creating groups is an
> attached iommu driver operating under ops->device_group().
> 
> VFIO needed this to support mdev and no-iommu. We already have mdev
> free of iommu_groups, I would like no-iommu to also be free of it too,
> we are very close.
> 
> That would leave POWER as the only abuser of the
> iommu_group_add_device() API, and it is only doing it because it
> hasn't got a proper iommu driver implementation yet. It turns out
> their abuse is mislocked and maybe racy to boot :(
> 
> > >  - cdev APIs should ideally not require the user to know the group_id,
> > >    we should try hard to design APIs to avoid this.  
> > 
> > This is a nuance, group_id vs group, where it's been previously
> > discussed that users will need to continue to know the boundaries of a
> > group for the purpose of DMA isolation and potentially IOAS
> > independence should cdev/iommufd choose to tackle those topics.  
> 
> Yes, group_id is a value we have no specific use for and would require
> userspace to keep separate track of. I'd prefer to rely on dev_id as
> much as possible instead.

But dev-id only has meaning in relation to an iommufd_ctx, so it fails
to be useful in the context of implied ownership.

> > What is the actual proposal here?  
> 
> I don't know anymore, you don't seem to like this direction either...
> 
> > You've said that hot-reset works if the iommufd_ctx has
> > representation from each affected group, the INFO ioctl remains as
> > it is, which suggests that it's reporting group ID and BDF, yet only
> > sysfs tells the user the relation between a vfio cdev and a group
> > and we're trying to enable a pass-by-fd model for cdev where the
> > user has no reference to a sysfs node for the device.  Show me how
> > these pieces fit together.  
> 
> I prefer the version where INFO2 returns the dev_id, but info can work
> if we do the BDF cap like you suggested to Yi

As discussed ad nauseam, dev-id is useless if an affected device is not
already within the iommufd ctx.  BDF provides a mapping to specific
affected devices, but can't express implied ownership.  Group id
provides the implied ownership, but can't express specific devices.  As
Yi has pointed out, QEMU needs to know both if it has ownership of all
the affected devices, both direct and implied, and which specific
devices that it owns are affected.
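
To make the distinction above concrete, here is a minimal sketch of how a user could classify each affected device when it has both a BDF and a group id for it. The type and function names (`struct dev_ref`, `classify()`) are hypothetical and exist only for illustration. BDF equality identifies a specific opened device, while group id equality only shows that the device shares a group with something the user holds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pairing of the two identifiers discussed above. */
struct dev_ref {
    uint32_t group_id;  /* expresses implied ownership */
    uint16_t segment;   /* segment/bus/devfn name one specific device */
    uint8_t  bus;
    uint8_t  devfn;
};

enum ownership {
    OWNED_DIRECT,   /* the user has this exact device open */
    OWNED_IMPLIED,  /* not opened, but in a group the user holds */
    NOT_OWNED       /* a gap that blocks the reset */
};

static int same_bdf(const struct dev_ref *a, const struct dev_ref *b)
{
    return a->segment == b->segment && a->bus == b->bus &&
           a->devfn == b->devfn;
}

/* Classify one affected device against the devices the user has open. */
enum ownership classify(const struct dev_ref *affected,
                        const struct dev_ref *opened, size_t n_opened)
{
    enum ownership result = NOT_OWNED;

    assert(affected != NULL);
    for (size_t i = 0; i < n_opened; i++) {
        if (same_bdf(affected, &opened[i]))
            return OWNED_DIRECT;
        if (opened[i].group_id == affected->group_id)
            result = OWNED_IMPLIED;
    }
    return result;
}
```

With only BDFs the `OWNED_IMPLIED` case cannot be told apart from `NOT_OWNED`, and with only group ids the `OWNED_DIRECT` case cannot be tied to a particular device.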

> > OTOH, if we say IOMMU groups continue to exist [agreed], every vfio
> > device has an IOMMU group  
> 
> I don't desire every VFIO device to have an iommu_group. I want VFIO
> devices with real IOMMU drivers to have an iommu_group. mdev and
> no-iommu should not. I don't want to add them back into the design
> just so INFO has a value to return.
> 
> I'd rather give no-iommu a dummy dev_id in iommufdctx than give it an
> iommu_group...

It's not been shown to me that dev-id is a useful replacement for
anything here.

> I see this problem as a few basic requirements from a qemu-like
> application:
> 
>  1) Does the configuration I was given support reset right now?
>  2) Will the configuration I was given support reset for the duration
>     of my execution?
>  3) What groups of the devices I already have open does the reset
>     affect?
>  4) For debugging, report to the user the full list of devices in the
>     reset group, in a way that relates back to sysfs.
>  5) A way to trigger a reset on a group of devices
> 
> #1/#2 is the API I suggested here. Ask the kernel if the current
> configuration works, and ask it to keep it working.

That is super sketchy because you're also advocating for
opportunistically supporting reset if the instantaneous conditions
allow it (ex. unopened devices), and going back and forth whether "ask
it to keep working" suggests that a user is able to extend their
granted ownership themselves.  I think both need to be based on some
form of granted, not requested, ownership and not opportunism.

> #3 is either INFO and a CAP for BDF or INFO2 reporting dev_id

Where dev-id is useful for... ?  I think there's a misuse of "groups"
in 3) above, userspace needs to know specific devices affected, thus
BDF.

> #4 is either INFO and print the BDFs or INFO2 reporting the struct
> vfio_device IDR # (eg /sys/class/vfio/vfioXXX/).

We can't assume that all the affected devices are bound to vfio,
therefore we cannot assume a vfio_device IDR exists.

> #5 is adjusting the FD list in existing RESET ioctl. Remove the need
> for userspace to specify a minimal exact list of FDs means userspace
> doesn't need the information to figure out what that list actually
> is. Pass a 0 length list and use iommufdctx.

"...doesn't need the information to figure out what the list actually
is."  That's false, userspace needs the information whether it uses it
to make a list or not, ex. pre- and post-reset processing of specific
affected devices.  Furthermore, supporting a zero length array removes
context from the existing ioctl, which has been shown to make it prone
to creating gaps in legacy group use cases, so I don't understand why
this optimization is so pervasive or important.
 
> None of these requirements suggests to me that qemu needs to know the
> group_id, or that it needs to have enough information to know how to
> fix an unavailable reset.
> 
> Did I miss a requirement here?

So what is the exact proposal?  We can't have an INFO ioctl that simply
returns error if the ownership requirements are not met as that doesn't
support 4).  So we need one or more ioctls that a) indicates whether
the ownership requirements are met and b) indicates the set of affected
devices.  Is b) only the set of affected devices within the calling
device's iommufd_ctx (ie. dev-ids), in which case we need c) a way to
report the overall set of affected devices regardless of ownership in
support of 4), BDF?

Are we back to replacing group-ids with dev-ids in the INFO structure,
where an invalid dev-id either indicates an affected device with
implied ownership (ok) or a gap in ownership (bad) and a flag somewhere
is meant to indicate the overall disposition based on the availability
of reset?  I'm not sure how that fully supports 4) since the user can't
determine if a given invalid dev-id is in fact a blocker, so do we end
up with multiple invalid IDs, perhaps one to indicate unknown but ok
and another to indicate an ownership gap?  Are devices outside of the
iommufd_ctx, but with implied ownership via group, omitted entirely from
the lists?  I think we need an actual proposal here.  Thanks,

Alex
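
The question above about multiple invalid dev-ids can be pictured with a small sketch. Everything here is an assumption made for illustration: the two sentinel names and values are hypothetical, and the sketch assumes that a real dev-id is never 0 or UINT32_MAX. It is not a proposed uAPI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinels, one per kind of "invalid" dev-id. */
#define DEVID_OWNED_UNKNOWN 0u          /* implied ownership, no dev-id */
#define DEVID_NOT_OWNED     UINT32_MAX  /* ownership gap, blocks reset */

/* True if the entry is a dev-id inside the caller's iommufd_ctx. */
bool devid_is_real(uint32_t id)
{
    return id != DEVID_OWNED_UNKNOWN && id != DEVID_NOT_OWNED;
}

/* The overall disposition: one unowned entry makes reset unavailable. */
bool reset_blocked(const uint32_t *ids, size_t n)
{
    assert(ids != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (ids[i] == DEVID_NOT_OWNED)
            return true;
    }
    return false;
}

/* Entries the user must handle around the reset (save and restore). */
size_t count_opened(const uint32_t *ids, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (devid_is_real(ids[i]))
            count++;
    }
    return count;
}
```

Even in this form the user learns that a gap exists but not which device causes it, so a BDF (or similar) would still be needed for the debugging case in 4).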
Alex Williamson April 12, 2023, 4:54 p.m. UTC | #55
On Wed, 12 Apr 2023 10:09:32 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, April 12, 2023 8:01 AM
> > 
> > On Tue, Apr 11, 2023 at 03:58:27PM -0600, Alex Williamson wrote:
> >   
> > > > Management tools already need to understand dev_set if they want to
> > > > offer reliable reset support to the VMs. Same as today.  
> > >
> > > I don't think that's true. Our primary hot-reset use case is GPUs and
> > > subordinate functions, where the isolation and reset scope are often
> > > sufficiently similar to make hot-reset possible, regardless whether
> > > all the functions are assigned to a VM.  I don't think you'll find any
> > > > management tools that take reset scope into account otherwise.
> > 
> > When I think of "reliable reset support" I think of the management
> > tool offering a checkbox that says "ensure PCI function reset
> > availability" and if checked it will not launch the VM without a
> > working reset.
> > 
> > If the user configures a set of VFIO devices and then hopes they get
> > working reset, that is fine, but doesn't require any reporting of
> > reset groups, or iommu groups to the management layer to work.
> >   
> > > > > As I understand the proposal, QEMU now gets to attempt to
> > > > > claim ownership of the dev_set, so it opportunistically extends its
> > > > > ownership and may block other users from the affected devices.  
> > > >
> > > > We can decide the policy for the kernel to accept a claim. I suggested
> > > > below "same as today" - it must hold all the groups within the
> > > > iommufd_ctx.  
> > >
> > > It must hold all the groups [that the user doesn't know about because
> > > it's not a formal part of the cdev API] within the iommufd_ctx?  
> > 
> > You keep going back to this, but I maintain userspace doesn't
> > care. qemu is given a list of VFIO devices to use, all it wants to
> > know is if it is allowed to use reset or not. Why should it need to
> > know groups and group_ids to get that binary signal out of the kernel?
> >   
> > > > The simplest option for no-iommu is to require it to pass in every
> > > > device fd to the reset ioctl.  
> > >
> > > Which ironically is exactly how it ends up working today, each no-iommu
> > > device has a fake IOMMU group, so every affected device (group) needs
> > > to be provided.  
> > 
> > Sure, that is probably the way forward for no-iommu. Not that anyone
> > uses it..
> > 
> > The kicker is we don't force the user to generate a de-duplicated list
> > of devices FDs, one per group, just because.
> >   
> > > > I want to re-focus on the basics of what cdev is supposed to be doing,
> > > > because several of the ideas you suggested seem against this direction:
> > > >
> > > >  - cdev does not have, and cannot rely on vfio_groups. We enforce this
> > > >    by compiling all the vfio_group infrastructure out. iommu_groups
> > > >    continue to exist.
> > > >
> > > >    So converting a cdev to a vfio_group is not an allowed operation.  
> > >
> > > My only statements in this respect were towards the notion that IOMMU
> > > groups continue to exist.  I'm well aware of the desire to deprecate
> > > and remove vfio groups.  
> > 
> > Yes
> >   
> > > >  - no-iommu should not have iommu_groups. We enforce this by compiling
> > > >    out all the no-iommu vfio_group infrastructure.  
> > >
> > > This is not logically inferred from the above if IOMMU groups continue
> > > to exist and continue to be a basis for describing DMA ownership as
> > > well as "reset groups"  
> > 
> > It is not meant to flow out of the above, it is a separate statement. I
> > want the iommu_group mechanism to stop being abused outside the iommu
> > core code. The only thing that should be creating groups is an
> > attached iommu driver operating under ops->device_group().
> > 
> > VFIO needed this to support mdev and no-iommu. We already have mdev
> > free of iommu_groups, I would like no-iommu to also be free of it too,
> > we are very close.
> > 
> > That would leave POWER as the only abuser of the
> > iommu_group_add_device() API, and it is only doing it because it
> > hasn't got a proper iommu driver implementation yet. It turns out
> > their abuse is mislocked and maybe racy to boot :(
> >   
> > > >  - cdev APIs should ideally not require the user to know the group_id,
> > > >    we should try hard to design APIs to avoid this.  
> > >
> > > This is a nuance, group_id vs group, where it's been previously
> > > discussed that users will need to continue to know the boundaries of a
> > > group for the purpose of DMA isolation and potentially IOAS
> > > independence should cdev/iommufd choose to tackle those topics.  
> > 
> > Yes, group_id is a value we have no specific use for and would require
> > userspace to keep separate track of. I'd prefer to rely on dev_id as
> > much as possible instead.
> >   
> > > What is the actual proposal here?  
> > 
> > I don't know anymore, you don't seem to like this direction either...
> >   
> > > You've said that hot-reset works if the iommufd_ctx has
> > > representation from each affected group, the INFO ioctl remains as
> > > it is, which suggests that it's reporting group ID and BDF, yet only
> > > sysfs tells the user the relation between a vfio cdev and a group
> > > and we're trying to enable a pass-by-fd model for cdev where the
> > > user has no reference to a sysfs node for the device.  Show me how
> > > these pieces fit together.  
> > 
> > I prefer the version where INFO2 returns the dev_id, but info can work
> > if we do the BDF cap like you suggested to Yi
> >   
> > > OTOH, if we say IOMMU groups continue to exist [agreed], every vfio
> > > device has an IOMMU group  
> > 
> > I don't desire every VFIO device to have an iommu_group. I want VFIO
> > devices with real IOMMU drivers to have an iommu_group. mdev and
> > no-iommu should not. I don't want to add them back into the design
> > just so INFO has a value to return.
> > 
> > I'd rather give no-iommu a dummy dev_id in iommufdctx than give it an
> > iommu_group...
> > 
> > I see this problem as a few basic requirements from a qemu-like
> > application:
> > 
> >  1) Does the configuration I was given support reset right now?
> >  2) Will the configuration I was given support reset for the duration
> >     of my execution?
> >  3) What groups of the devices I already have open does the reset
> >     affect?
> >  4) For debugging, report to the user the full list of devices in the
> >     reset group, in a way that relates back to sysfs.
> >  5) A way to trigger a reset on a group of devices
> > 
> > #1/#2 is the API I suggested here. Ask the kernel if the current
> > configuration works, and ask it to keep it working.
> > 
> > #3 is either INFO and a CAP for BDF or INFO2 reporting dev_id
> > 
> > #4 is either INFO and print the BDFs or INFO2 reporting the struct
> > vfio_device IDR # (eg /sys/class/vfio/vfioXXX/).  
> 
> I hope we can have a clear statement on the _INFO or INFO2 usage.
> Today, per QEMU's implementation, the output of _INFO is used to:
> 
> 1) do a self-check to see if all the affected groups are opened by the
>     current user before it can invoke hot-reset.
> 2) figure out the devices that are already opened by the user. QEMU
>     needs to save the state of such devices as the device may already
>     be in use. If so, its state should be saved and restored prior/post
>     the hot-reset.
> 
> Seems like we are relaxing the self-check as it may be done by locking
> the reset group. Is that right?

I hope not.  Locking the reset group suggests the user is able to
extend their ownership.  IMO we should not allow that.  Thanks,

Alex
Alex Williamson April 12, 2023, 5:01 p.m. UTC | #56
On Wed, 12 Apr 2023 12:05:50 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Apr 12, 2023 at 07:27:43AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe
> > > Sent: Wednesday, April 12, 2023 8:01 AM
> > > 
> > > I see this problem as a few basic requirements from a qemu-like
> > > application:
> > > 
> > >  1) Does the configuration I was given support reset right now?
> > >  2) Will the configuration I was given support reset for the duration
> > >     of my execution?
> > >  3) What groups of the devices I already have open does the reset
> > >     affect?
> > >  4) For debugging, report to the user the full list of devices in the
> > >     reset group, in a way that relates back to sysfs.
> > >  5) A way to trigger a reset on a group of devices
> > > 
> > > #1/#2 is the API I suggested here. Ask the kernel if the current
> > > configuration works, and ask it to keep it working.
> > > 
> > > #3 is either INFO and a CAP for BDF or INFO2 reporting dev_id
> > > 
> > > #4 is either INFO and print the BDFs or INFO2 reporting the struct
> > > vfio_device IDR # (eg /sys/class/vfio/vfioXXX/).  
> > 
> > mdev doesn't have BDF. Of course it doesn't support hot_reset either.  
> 
> It should support a reset.. Maybe idxd doesn't, but it should be part
> of the SIOV model. Our SIOV devices would need it for instance.

IIRC we require mdev devices to support VFIO_DEVICE_RESET, hot-reset is
a different beast.  I assume SIOV device support would also require
VFIO_DEVICE_RESET support and hot-reset would also be irrelevant to
them.

> > but it's presented to userspace as a pci device. Is it weird for a pci
> > device which doesn't provide a BDF cap?  
> 
> It is weird for a PCI device, but it is not weird for a VFIO
> device. Leaking the physical labels out of the uAPI is not clean,
> IMHO.
> 
> > from this point the vfio_device IDR# sounds more generic.  
> 
> Yes, I was thinking about this for the SIOV model.

Seems like we're off on a tangent. The hot-reset ioctl is not relevant
to devices simply because they expose a vfio-pci API; there is an
underlying hardware aspect that anything that is only virtualizing a
vfio-pci API shouldn't be concerned with.  Thanks,

Alex
Jason Gunthorpe April 12, 2023, 8:06 p.m. UTC | #57
On Wed, Apr 12, 2023 at 10:50:45AM -0600, Alex Williamson wrote:

> > You keep going back to this, but I maintain userspace doesn't
> > care. qemu is given a list of VFIO devices to use, all it wants to
> > know is if it is allowed to use reset or not. Why should it need to
> > know groups and group_ids to get that binary signal out of the kernel?
> 
> hw/vfio/pci.c:2320
>         error_report("vfio: Cannot reset device %s, "
>                      "depends on group %d which is not owned.",
>                      vdev->vbasedev.name, devices[i].group_id);
> 
> That creates a feedback loop where a user can take corrective action
> with actual information in hand to resolve the issue.

Which is why I listed debugging as requirement #4, and solve
requirement #4 by using the existing INFO and printing the BDF list it
returns.
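
For the debug print of requirement #4, a minimal sketch of turning the segment, bus and devfn values that INFO reports into the name used under /sys/bus/pci/devices/ could look like this. The helper names are hypothetical; the bit layout of devfn (slot in bits 7:3, function in bits 2:0) is the standard PCI encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* devfn packs the 5-bit slot in bits 7:3 and the 3-bit function in 2:0. */
static unsigned int bdf_slot(uint8_t devfn)
{
    return (devfn >> 3) & 0x1f;
}

static unsigned int bdf_func(uint8_t devfn)
{
    return devfn & 0x07;
}

/* Format "ssss:bb:dd.f", matching the sysfs PCI device directory name.
 * Returns the snprintf() result, i.e. the length that was needed. */
int format_bdf(char *buf, size_t len, uint16_t segment, uint8_t bus,
               uint8_t devfn)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%04x:%02x:%02x.%u",
                    (unsigned int)segment, (unsigned int)bus,
                    bdf_slot(devfn), bdf_func(devfn));
}
```

A 13-byte buffer is enough for one name, including the terminating NUL.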

> > The kicker is we don't force the user to generate a de-duplicated list
> > of devices FDs, one per group, just because.
> 
> So on one hand you're asking for simplicity, but on the other you're
> criticizing a trivial simplification that we chose to allow the user to
> pass a number of group fds equal to the number of devices affected so
> that the user doesn't need to take that step to de-duplicate the list.
> We can't win.

It is not a simplification because the kernel is wired to accept only
a list of exactly that group length, no more no less. It turns into a
pointless puzzle that userspace has to solve, and it can only solve it
by knowing about groups.

If we get rid of groups we have to do something about this so
userspace doesn't need to do the calculation. That is the point of
this change.

> > > You've said that hot-reset works if the iommufd_ctx has
> > > representation from each affected group, the INFO ioctl remains as
> > > it is, which suggests that it's reporting group ID and BDF, yet only
> > > sysfs tells the user the relation between a vfio cdev and a group
> > > and we're trying to enable a pass-by-fd model for cdev where the
> > > user has no reference to a sysfs node for the device.  Show me how
> > > these pieces fit together.  
> > 
> > I prefer the version where INFO2 returns the dev_id, but info can work
> > if we do the BDF cap like you suggested to Yi
> 
> As discussed ad nauseam, dev-id is useless if an affected device is not
> already within the iommufd ctx.  

The purpose of INFO2 is to satisfy requirement #3 - which is to report
the affected devices *that are already opened*. For this dev_id is
fine. There is nothing qemu can do with devices that are outside its
iommufdctx, so it is pointless to tell it about them. It will generate
the debug print of #4 using INFO. I don't think we need a single API here.

> > I see this problem as a few basic requirements from a qemu-like
> > application:
> > 
> >  1) Does the configuration I was given support reset right now?
> >  2) Will the configuration I was given support reset for the duration
> >     of my execution?
> >  3) What groups of the devices I already have open does the reset
> >     effect?
> >  4) For debugging, report to the user the full list of devices in the
> >     reset group, in a way that relates back to sysfs.
> >  5) Away to trigger a reset on a group of devices
> > 
> > #1/#2 is the API I suggested here. Ask the kernel if the current
> > configuration works, and ask it to keep it working.
> 
> That is super sketchy because you're also advocating for
> opportunistically supporting reset if the instantaneous conditions
> allow is (ex. unopened devices), and going back and forth whether "ask
> it to keep working" suggests that a user is able to extend their
> granted ownership themselves.  I think both needs to be based on some
> form of granted, not requested, ownership and not opportunism.

Ok, let's give up on ownership then

> > #3 is either INFO and a CAP for BDF or INFO2 reporting dev_id
> 
> Where dev-id is useful for... ?  I think there's a misuse of "groups"
> in 3) above, userspace needs to know specific devices affected, thus
> BDF.

I did not mean "group of devices" to mean iommu_group, I mean "the set of
devices affected by the reset"

> > #4 is either INFO and print the BDFs or INFO2 reporting the struct
> > vfio_device IDR # (eg /sys/class/vfio/vfioXXX/).
> 
> We can't assume that all the affected devices are bound to vfio,
> therefore we cannot assume a vfio_device IDR exists.

So BDF is better for the debugging print.

> > #5 is adjusting the FD list in existing RESET ioctl. Remove the need
> > for userspace to specify a minimal exact list of FDs means userspace
> > doesn't need the information to figure out what that list actually
> > is. Pass a 0 length list and use iommufdctx.
> 
> "...doesn't need the information to figure out what the list actually
> is."  That's false, userspace needs the information whether it uses it
> to make a list or not,

#3 is the need of affected devices, it is already covered.

I mean that #5 should not need this, #5 is only about triggering the
reset.

What I want is a #5 action that does not require doing a calculation on
group IDs.

At the core, without any notion of groups, #5 requires userspace to
pass in every opened device FD and kernel checks that every opened
device is in the passed FD list. Closed devices are ignored. Devices
with unattached drivers are ignored.
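
As an illustration only, the group-free check described above could be
sketched like this. It is a userspace model, not kernel code: struct
reset_dev, fd_in_list() and reset_allowed() are hypothetical names made
up for this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one device in the reset set; not a kernel struct. */
struct reset_dev {
	int  fd;            /* device FD held by userspace, -1 if not opened */
	bool opened;        /* the device is currently opened */
	bool driver_bound;  /* a driver is attached to the device */
};

/* Return true if fd appears in the FD list passed by userspace. */
static bool fd_in_list(int fd, const int *fds, size_t nfds)
{
	for (size_t i = 0; i < nfds; i++)
		if (fds[i] == fd)
			return true;
	return false;
}

/*
 * The reset is allowed only if every opened device in the reset set is
 * present in the passed FD list. Closed devices and devices without an
 * attached driver are ignored.
 */
static bool reset_allowed(const struct reset_dev *devs, size_t ndevs,
			  const int *fds, size_t nfds)
{
	for (size_t i = 0; i < ndevs; i++) {
		if (!devs[i].opened || !devs[i].driver_bound)
			continue;
		if (!fd_in_list(devs[i].fd, fds, nfds))
			return false;
	}
	return true;
}
```

Note that no group ID appears anywhere in the check, which is the point
of the requirement.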

#5 does not need the answer to requirement #2.

> So we need one or more ioctls that a) indicates whether
> the ownership requirements are met 

If we reject the ownership direction, then I go back to suggesting
that INFO2 should do this.

> b) indicates the set of affected
> devices.  

INFO2 will return the dev_id which is sufficient to satisfy
requirement #3

> Is b) only the set of affected devices within the calling
> devices iommufd_ctx (ie. dev-ids),

I vote yes

> in which case we need c) a way to
> report the overall set of affected devices regardless of ownership in
> support of 4), BDF?

Yes, continue to use INFO unmodified.
 
> Are we back to replacing group-ids with dev-ids in the INFO structure,
> where an invalid dev-id either indicates an affected device with
> implied ownership (ok) or a gap in ownership (bad) and a flag somewhere
> is meant to indicate the overall disposition based on the availability
> of reset?  

As you explore in the following this gets ugly. I prefer to keep INFO
unchanged and add INFO2.

So maybe we should make patches that look something like this, try to
come up with a workable INFO2 and squeeze no-iommu into it somehow.

Jason
Tian, Kevin April 13, 2023, 2:57 a.m. UTC | #58
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, April 12, 2023 11:06 PM
> 
> On Wed, Apr 12, 2023 at 07:27:43AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe
> > > Sent: Wednesday, April 12, 2023 8:01 AM
> > >
> > > I see this problem as a few basic requirements from a qemu-like
> > > application:
> > >
> > >  1) Does the configuration I was given support reset right now?
> > >  2) Will the configuration I was given support reset for the duration
> > >     of my execution?
> > >  3) What groups of the devices I already have open does the reset
> > >     effect?
> > >  4) For debugging, report to the user the full list of devices in the
> > >     reset group, in a way that relates back to sysfs.
> > >  5) Away to trigger a reset on a group of devices
> > >
> > > #1/#2 is the API I suggested here. Ask the kernel if the current
> > > configuration works, and ask it to keep it working.
> > >
> > > #3 is either INFO and a CAP for BDF or INFO2 reporting dev_id
> > >
> > > #4 is either INFO and print the BDFs or INFO2 reporting the struct
> > > vfio_device IDR # (eg /sys/class/vfio/vfioXXX/).
> >
> > mdev doesn't have BDF. Of course it doesn't support hot_reset either.
> 
> It should support a reset.. Maybe idxd doesn't, but it should be part
> of the SIOV model. Our SIOV devices would need it for instance.

yes, supporting VFIO_DEVICE_RESET is assumed. That is required by
the siov spec.

Then no need to support hot_reset.

> 
> > but it's presented to userspace as a pci device. Is it weird for a pci
> > device which doesn't provide a BDF cap?
> 
> It is weird for a PCI device, but it is not weird for a VFIO
> device. Leaking the physical labels out of the uAPI is not clean,
> IMHO.

yes. Reporting pasid is also incorrect since it's invisible to user.

> 
> > from this point the vfio_device IDR# sounds more generic.
> 
> Yes, I was thinking about this for the SIOV model.
> 
> Jason
Tian, Kevin April 13, 2023, 8:25 a.m. UTC | #59
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, April 13, 2023 4:07 AM
> 
> 
> > in which case we need c) a way to
> > report the overall set of affected devices regardless of ownership in
> > support of 4), BDF?
> 
> Yes, continue to use INFO unmodified.
> 
> > Are we back to replacing group-ids with dev-ids in the INFO structure,
> > where an invalid dev-id either indicates an affected device with
> > implied ownership (ok) or a gap in ownership (bad) and a flag somewhere
> > is meant to indicate the overall disposition based on the availability
> > of reset?
> 
> As you explore in the following this gets ugly. I prefer to keep INFO
> unchanged and add INFO2.
> 

INFO needs a change when VFIO_GROUP is disabled. Now it assumes
a valid iommu group always exists:

vfio_pci_fill_devs()
{
	...
	iommu_group = iommu_group_get(&pdev->dev);
	if (!iommu_group)
		return -EPERM; /* Cannot reset non-isolated devices */
	...
}

Probably we need a special value e.g. -1 to represent noiommu case
given valid group ids are positive.

With that plus the BDF cap, I'm curious what the actual purpose of
INFO2 is, or why requirement #3 cannot reuse the information collected
via the existing INFO?

For each opened device Qemu can find the related group id via
sysfs (if group exists) or an optional GROUP cap and use that id to
match the group id in INFO.

For noiommu with VFIO_GROUP=y the device has a group id, so it is the
same case.

For noiommu with VFIO_GROUP=n, just do an exact match based on BDF.

Either way the information returned by INFO is a superset of what is
needed to know the reset scope among the opened devices.

Thanks
Kevin
Jason Gunthorpe April 13, 2023, 11:50 a.m. UTC | #60
On Thu, Apr 13, 2023 at 08:25:52AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Thursday, April 13, 2023 4:07 AM
> > 
> > 
> > > in which case we need c) a way to
> > > report the overall set of affected devices regardless of ownership in
> > > support of 4), BDF?
> > 
> > Yes, continue to use INFO unmodified.
> > 
> > > Are we back to replacing group-ids with dev-ids in the INFO structure,
> > > where an invalid dev-id either indicates an affected device with
> > > implied ownership (ok) or a gap in ownership (bad) and a flag somewhere
> > > is meant to indicate the overall disposition based on the availability
> > > of reset?
> > 
> > As you explore in the following this gets ugly. I prefer to keep INFO
> > unchanged and add INFO2.
> > 
> 
> INFO needs a change when VFIO_GROUP is disabled. Now it assumes
> a valid iommu group always exists:
> 
> vfio_pci_fill_devs()
> {
> 	...
> 	iommu_group = iommu_group_get(&pdev->dev);
> 	if (!iommu_group)
> 		return -EPERM; /* Cannot reset non-isolated devices */
> 	...
> }

This can still work in an ugly way. With an INFO2 the only purpose of
INFO would be debugging, so if someone uses no-iommu with hotreset
and misconfigures it, then the only downside is they don't get the
debugging print. But we know of nothing that uses this combination
anyhow..

> with that plus BDF cap, I'm curious what is the actual purpose of
> INFO2 or why cannot requirement#3 reuse the information collected
> via existing INFO?

It can - it is just more complicated for userspace to do it, it has to
extract and match the BDFs and then run some algorithm to determine if
the opened devices cover the right set of devices in the reset group,
and it has to have some special code for no-iommu.
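
To make the userspace burden concrete, here is a rough sketch of that
matching step. Everything in it is an assumption for illustration:
struct dev_ident is a local stand-in carrying the group id and BDF
fields that INFO reports, NO_GROUP is the "-1 means no group" marker
floated earlier in this thread, and the function names are made up.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NO_GROUP (-1)	/* assumed marker for "no iommu group" */

/* Local stand-in for one INFO entry or one opened device. */
struct dev_ident {
	int32_t  group_id;	/* NO_GROUP when the device has no group */
	uint16_t segment;
	uint8_t  bus;
	uint8_t  devfn;
};

static bool same_bdf(const struct dev_ident *a, const struct dev_ident *b)
{
	return a->segment == b->segment && a->bus == b->bus &&
	       a->devfn == b->devfn;
}

/* An affected device is covered by a group match or an exact BDF match. */
static bool covered(const struct dev_ident *aff,
		    const struct dev_ident *opened, size_t nopened)
{
	for (size_t i = 0; i < nopened; i++) {
		if (aff->group_id != NO_GROUP &&
		    aff->group_id == opened[i].group_id)
			return true;
		if (same_bdf(aff, &opened[i]))
			return true;
	}
	return false;
}

/* True if the opened devices cover every device affected by the reset. */
static bool opened_set_covers_reset(const struct dev_ident *affected,
				    size_t naffected,
				    const struct dev_ident *opened,
				    size_t nopened)
{
	for (size_t i = 0; i < naffected; i++)
		if (!covered(&affected[i], opened, nopened))
			return false;
	return true;
}
```

The BDF-only branch is the no-iommu special case; with INFO2 the kernel
would do the equivalent work and hand back a single yes/no.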

VS info2 would return the dev_id's and a single yes/no if the right
set is present. Kernel runs the algorithm instead of userspace, it
seems more abstract this way.

Also, if we make iommufd return a 'ioas dev_id group' as well it
composes nicely that userspace just needs one translation from dev_id.

Jason
Yi Liu April 13, 2023, 2:35 p.m. UTC | #61
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, April 13, 2023 7:51 PM
> 
> On Thu, Apr 13, 2023 at 08:25:52AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Thursday, April 13, 2023 4:07 AM
> > >
> > >
> > > > in which case we need c) a way to
> > > > report the overall set of affected devices regardless of ownership in
> > > > support of 4), BDF?
> > >
> > > Yes, continue to use INFO unmodified.
> > >
> > > > Are we back to replacing group-ids with dev-ids in the INFO structure,
> > > > where an invalid dev-id either indicates an affected device with
> > > > implied ownership (ok) or a gap in ownership (bad) and a flag somewhere
> > > > is meant to indicate the overall disposition based on the availability
> > > > of reset?
> > >
> > > As you explore in the following this gets ugly. I prefer to keep INFO
> > > unchanged and add INFO2.
> > >
> >
> > INFO needs a change when VFIO_GROUP is disabled. Now it assumes
> > a valid iommu group always exists:
> >
> > vfio_pci_fill_devs()
> > {
> > 	...
> > 	iommu_group = iommu_group_get(&pdev->dev);
> > 	if (!iommu_group)
> > 		return -EPERM; /* Cannot reset non-isolated devices */
> > 	...
> > }
> 
> This can still work in a ugly way. With a INFO2 the only purpose of
> INFO would be debugging, so if someone uses no-iommu, with hotreset
> and misconfigures it then the only downside is they don't get the
> debugging print. But we know of nothing that uses this combination
> anyhow..

Today, at least QEMU will not attempt a hot-reset if _INFO fails. I think
this check may need to be relaxed if we want _INFO to work when there is
no VFIO_GROUP (and also no fake iommu_group).

Regards,
Yi Liu
Jason Gunthorpe April 13, 2023, 2:41 p.m. UTC | #62
On Thu, Apr 13, 2023 at 02:35:57PM +0000, Liu, Yi L wrote:

> Today, at least QEMU will not go to do hot-reset if _INFO fails. I think
> this check may need to be relaxed if want _INFO work when there is
> no VFIO_GROUP (also no fake iommu_group).

Current qemu does not work if there is no VFIO_GROUP, so it doesn't
matter.

In cdev mode qemu should work differently, we can make the kernel
return -1 for group_id and qemu can ignore group_id for the debug
print, or we can just make it fail.

Given qemu doesn't, and can't, support no-iommu this is pretty fringe
stuff.

Jason
Alex Williamson April 13, 2023, 6:07 p.m. UTC | #63
On Thu, 13 Apr 2023 08:50:45 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Thu, Apr 13, 2023 at 08:25:52AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Thursday, April 13, 2023 4:07 AM
> > > 
> > >   
> > > > in which case we need c) a way to
> > > > report the overall set of affected devices regardless of ownership in
> > > > support of 4), BDF?  
> > > 
> > > Yes, continue to use INFO unmodified.
> > >   
> > > > Are we back to replacing group-ids with dev-ids in the INFO structure,
> > > > where an invalid dev-id either indicates an affected device with
> > > > implied ownership (ok) or a gap in ownership (bad) and a flag somewhere
> > > > is meant to indicate the overall disposition based on the availability
> > > > of reset?  
> > > 
> > > As you explore in the following this gets ugly. I prefer to keep INFO
> > > unchanged and add INFO2.
> > >   
> > 
> > INFO needs a change when VFIO_GROUP is disabled. Now it assumes
> > a valid iommu group always exists:
> > 
> > vfio_pci_fill_devs()
> > {
> > 	...
> > 	iommu_group = iommu_group_get(&pdev->dev);
> > 	if (!iommu_group)
> > 		return -EPERM; /* Cannot reset non-isolated devices */
> > 	...
> > }  
> 
> This can still work in a ugly way. With a INFO2 the only purpose of
> INFO would be debugging, so if someone uses no-iommu, with hotreset
> and misconfigures it then the only downside is they don't get the
> debugging print. But we know of nothing that uses this combination
> anyhow..
> 
> > with that plus BDF cap, I'm curious what is the actual purpose of
> > INFO2 or why cannot requirement#3 reuse the information collected
> > via existing INFO?  
> 
> It can - it is just more complicated for userspace to do it, it has to
> extract and match the BDFs and then run some algorithm to determine if
> the opened devices cover the right set of devices in the reset group,
> and it has to have some special code for no-iommu.
> 
> VS info2 would return the dev_id's and a single yes/no if the right
> set is present. Kernel runs the algorithm instead of userspace, it
> seems more abstract this way.
> 
> Also, if we make iommufd return a 'ioas dev_id group' as well it
> composes nicely that userspace just needs one translation from dev_id.


IIUC, the semantics we're proposing is that an INFO2 ioctl would return
success or failure indicating whether the user has sufficient ownership
of the affected devices, and in the success case returns an array of
affected dev-ids within the user's iommufd_ctx.  Unopened, affected
devices, are not reported via INFO2, and unopened, affected devices
outside the user's scope of ownership (ie. outside the owned IOMMU
group) will generate a failure condition.

As for the INFO ioctl, it's described as unchanged, which does raise
the question of what is reported for IOMMU groups and how does the
value there coherently relate to anything else in the cdev-exclusive
vfio API...

We had already iterated a proposal where the group-id is replaced with
a dev-id in the existing ioctl and a flag indicates when the return
value is a dev-id vs group-id.  This had a gap that userspace cannot
determine if a reset is available given this information since un-owned
devices report an invalid dev-id and userspace can't know if it has
implicit ownership.

It seems cleaner to me though that we could still re-use INFO in
a similar way, simply defining a new flag bit which is valid only in
the case of returning dev-ids and indicates if the reset is available.
Therefore in one ioctl, userspace knows if hot-reset is available
(based on a kernel determination) and can pull valid dev-ids from the
array to associate affected, owned devices, and still has the
equivalent information to know that one or more of the devices listed
with an invalid dev-id are preventing the hot-reset from being
available.
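
A rough sketch of how userspace might consume such a return, purely to
illustrate the idea. None of these names exist in the uAPI: the HYP_*
flags, the invalid dev-id value and struct hyp_dep_dev are placeholders
invented here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits and sentinel; not defined by any kernel header. */
#define HYP_INFO_FLAG_DEV_ID          (1u << 0) /* entries carry dev-ids */
#define HYP_INFO_FLAG_RESET_AVAILABLE (1u << 1) /* kernel says reset is ok */
#define HYP_INVALID_DEV_ID            0u

/* Hypothetical per-device entry: an id plus the BDF for debug prints. */
struct hyp_dep_dev {
	uint32_t id;		/* dev-id or group-id, depending on flags */
	uint16_t segment;
	uint8_t  bus;
	uint8_t  devfn;
};

/*
 * Copy the valid dev-ids (affected devices within the user's
 * iommufd_ctx) into out and return how many were found. Returns 0 when
 * the entries carry group ids instead of dev-ids.
 */
static size_t collect_owned_dev_ids(uint32_t flags,
				    const struct hyp_dep_dev *devs,
				    size_t ndevs, uint32_t *out)
{
	size_t n = 0;

	if (!(flags & HYP_INFO_FLAG_DEV_ID))
		return 0;
	for (size_t i = 0; i < ndevs; i++)
		if (devs[i].id != HYP_INVALID_DEV_ID)
			out[n++] = devs[i].id;
	return n;
}

/* The availability bit is only meaningful when dev-ids are returned. */
static bool reset_available(uint32_t flags)
{
	return (flags & HYP_INFO_FLAG_DEV_ID) &&
	       (flags & HYP_INFO_FLAG_RESET_AVAILABLE);
}
```

Entries left with the invalid dev-id still carry a BDF, so they can be
printed as the candidates blocking the reset.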

Is that an option?  Thanks,

Alex
Tian, Kevin April 14, 2023, 9:11 a.m. UTC | #64
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, April 14, 2023 2:07 AM
> 
> We had already iterated a proposal where the group-id is replaced with
> a dev-id in the existing ioctl and a flag indicates when the return
> value is a dev-id vs group-id.  This had a gap that userspace cannot
> determine if a reset is available given this information since un-owned
> devices report an invalid dev-id and userspace can't know if it has
> implicit ownership.
> 
> It seems cleaner to me though that we would could still re-use INFO in
> a similar way, simply defining a new flag bit which is valid only in
> the case of returning dev-ids and indicates if the reset is available.
> Therefore in one ioctl, userspace knows if hot-reset is available
> (based on a kernel determination) and can pull valid dev-ids from the

So the kernel needs to compare the group id between devices with
valid dev-ids and devices with invalid dev-ids to decide the implicit
ownership. A noiommu device, which has no group_id when
VFIO_GROUP is off, is resettable only if it has a valid dev_id.

The only corner case with this option is when a user mixes group
and cdev usages. iirc you mentioned it's a valid usage to be supported.
In that case the kernel doesn't have sufficient knowledge to judge
'resettable' as it doesn't know which groups are opened by this user.

Not sure whether we can leave it in an ugly way so INFO may not tell
'resettable' accurately in that weird scenario.

> array to associate affected, owned devices, and still has the
> equivalent information to know that one or more of the devices listed
> with an invalid dev-id are preventing the hot-reset from being
> available.
> 
> Is that an option?  Thanks,
> 

This works for me if the above corner case can be waived.
Yi Liu April 14, 2023, 11:38 a.m. UTC | #65
> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Friday, April 14, 2023 5:12 PM
> 
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Friday, April 14, 2023 2:07 AM
> >
> > We had already iterated a proposal where the group-id is replaced with
> > a dev-id in the existing ioctl and a flag indicates when the return
> > value is a dev-id vs group-id.  This had a gap that userspace cannot
> > determine if a reset is available given this information since un-owned
> > devices report an invalid dev-id and userspace can't know if it has
> > implicit ownership.
>
> >
> > It seems cleaner to me though that we would could still re-use INFO in
> > a similar way, simply defining a new flag bit which is valid only in
> > the case of returning dev-ids and indicates if the reset is available.
> > Therefore in one ioctl, userspace knows if hot-reset is available
> > (based on a kernel determination) and can pull valid dev-ids from the

I need to confirm the meaning of the hot-reset available flag. I think
at least the two conditions below should be met to set this flag,
although it may not mean hot-reset is sure to succeed (but there
should be a high chance).

1) dev_set is resettable (all affected devices are in dev_set)
2) affected devices are owned by the current user

Also, we need to assume that the two cases below are rare; if a user
encounters them, it is just bad luck. I think the existing
_INFO and hot-reset already make this assumption, so cdev mode
can adopt it as well.

a) physical topology change (e.g. new devices plugged to affected slot)
b) an affected device is unbound from vfio

> So the kernel needs to compare the group id between devices with
> valid dev-ids and devices with invalid dev-ids to decide the implicit
> ownership. For noiommu device which has no group_id when
> VFIO_GROUP is off then it's resettable only if having a valid dev_id.

In cdev mode, a noiommu device doesn't have a dev_id as it is not
bound to a valid iommufd. So if VFIO_GROUP is off, we may never
allow hot-reset for noiommu devices. But we don't want to have a
regression with noiommu devices. Perhaps we may define the usage
of the resettable flag like this:
1) if it is set, the user does not need to own all the affected devices
    as some of them may be owned implicitly. The kernel should have
    checked it.
2) if the flag is not set, the user needs to check ownership
    itself. It needs to own all the affected devices. If not, don't
    do hot-reset.

This way noiommu devices can still support hot-reset
just as when VFIO_GROUP is on. Noiommu devices have fake
groups, and such groups are all singletons, so checking that all
affected devices are opened by the user is the same as checking all
affected groups.

> The only corner case with this option is when a user mixes group
> and cdev usages. iirc you mentioned it's a valid usage to be supported.
> In that case the kernel doesn't have sufficient knowledge to judge
> 'resettable' as it doesn't know which groups are opened by this user.
>
> Not sure whether we can leave it in a ugly way so INFO may not tell
> 'resettable' accurately in that weird scenario.

This does not seem easy to support. If the above scenario is allowed,
there are three cases that return an invalid dev_id.
1) devices not opened by user but owned implicitly
2) devices not owned by user
3) devices opened via group but owned by user

The user would need more info to tell the above cases from each other.

> > array to associate affected, owned devices, and still has the
> > equivalent information to know that one or more of the devices listed
> > with an invalid dev-id are preventing the hot-reset from being
> > available.
> >
> > Is that an option?  Thanks,
> >
> 
> This works for me if above corner case can be waived.

One side check, perhaps already confirmed in a prior email. @Alex, so
the reason for predicting hot-reset is to avoid a possible
vfio_pci_pre_reset(), which does heavy operations like stopping DMA and
copying config space. Is that right? Any other special reason? Anyhow,
this reason is enough for this prediction per my understanding.

Regards,
Yi Liu
Alex Williamson April 14, 2023, 4:34 p.m. UTC | #66
On Fri, 14 Apr 2023 09:11:30 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Friday, April 14, 2023 2:07 AM
> > 
> > We had already iterated a proposal where the group-id is replaced with
> > a dev-id in the existing ioctl and a flag indicates when the return
> > value is a dev-id vs group-id.  This had a gap that userspace cannot
> > determine if a reset is available given this information since un-owned
> > devices report an invalid dev-id and userspace can't know if it has
> > implicit ownership.
> > 
> > It seems cleaner to me though that we would could still re-use INFO in
> > a similar way, simply defining a new flag bit which is valid only in
> > the case of returning dev-ids and indicates if the reset is available.
> > Therefore in one ioctl, userspace knows if hot-reset is available
> > (based on a kernel determination) and can pull valid dev-ids from the  
> 
> So the kernel needs to compare the group id between devices with
> valid dev-ids and devices with invalid dev-ids to decide the implicit
> ownership. For noiommu device which has no group_id when
> VFIO_GROUP is off then it's resettable only if having a valid dev_id.

With no-iommu and VFIO_GROUP on, each no-iommu device gets its own
group and the user must have ownership of each affected group, so
there's really no difference here.  Every affected no-iommu device must
be owned in either case.
 
> The only corner case with this option is when a user mixes group
> and cdev usages. iirc you mentioned it's a valid usage to be supported.
> In that case the kernel doesn't have sufficient knowledge to judge
> 'resettable' as it doesn't know which groups are opened by this user.

So for example we might have a 2-function device, fn0 is opened via
cdev and part of an iommufd ctx and fn1 is opened via the group
interface and potentially bound to a type1 container context.

In the INFO/INFO2 proposal, the INFO ioctl would return an array
reporting the group and BDF for each function.  The INFO ioctl is
callable from either device (aiui).  The INFO2 ioctl would fail on the
group opened device because it doesn't have an iommufd_ctx.  When
called on the cdev opened device, INFO2 would fail because the dev-set
is not represented within the iommufd_ctx.  Is this right?

In my proposal, the INFO ioctl can also be called on either device.
When called on the cdev opened device, the return structure provides
dev-ids with a flag indicating such in the return structure.  The cdev
device has a valid dev-id, the group device invalid.  The
reset-available flag is clear because the kernel cannot infer ownership
of the group opened device.  When called on the group opened device, the
IOMMU group and BDF are returned for each device.

So both approaches have similar issues here, but I think there's an
advantage to the approach of extending INFO.  In that case, the user
still gets the dev-id of the affected cdev device and therefore could
build a hot-reset ioctl call using a combination of groupfds and
devicefds, even if the cdev opened devices are passed by fd.  Perhaps
it's obvious that the hot-reset device is itself affected by the reset,
but I think the example scenario could be extended to one where there
are multiple cdev opened devices and one or more group opened devices.
AIUI, the INFO2 proposal essentially only returns success if the
null-array approach is supported, ie. the kernel can infer the full
ownership of the dev-set.  However, I think we could still support a
proof-of-ownership based hot-reset with devicefds and groupfds provided
by the user.

I think what this means is that the flag we're exposing is not
"hot-reset available", but really whether the kernel can infer
ownership and the ownership conditions are satisfied.  Therefore it
essentially only flags the availability of the null-array interface
while the proof-of-ownership approach is always available.

> Not sure whether we can leave it in a ugly way so INFO may not tell
> 'resettable' accurately in that weird scenario.

Is it still ugly with the above design?  Thanks,

Alex
Alex Williamson April 14, 2023, 5:10 p.m. UTC | #67
On Fri, 14 Apr 2023 11:38:24 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Tian, Kevin <kevin.tian@intel.com>
> > Sent: Friday, April 14, 2023 5:12 PM
> >   
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Friday, April 14, 2023 2:07 AM
> > >
> > > We had already iterated a proposal where the group-id is replaced with
> > > a dev-id in the existing ioctl and a flag indicates when the return
> > > value is a dev-id vs group-id.  This had a gap that userspace cannot
> > > determine if a reset is available given this information since un-owned
> > > devices report an invalid dev-id and userspace can't know if it has
> > > implicit ownership.  
> >  
> > >
> > > It seems cleaner to me though that we would could still re-use INFO in
> > > a similar way, simply defining a new flag bit which is valid only in
> > > the case of returning dev-ids and indicates if the reset is available.
> > > Therefore in one ioctl, userspace knows if hot-reset is available
> > > (based on a kernel determination) and can pull valid dev-ids from the  
> 
> Need to confirm the meaning of hot-reset available flag. I think it
> should at least meet below two conditions to set this flag. Although
> it may not mean hot-reset is for sure to succeed. (but should be
> a high chance).
> 
> 1) dev_set is resettable (all affected device are in dev_set)
> 2) affected device are owned by the current user

Per thread with Kevin, ownership can't always be known by the kernel.
Beyond the group vs cdev discussion there, isn't it also possible
(though perhaps not recommended) that a user can have multiple iommufd
ctxs?  So I think 2) becomes "ownership of the affected dev-set can be
inferred from the iommufd_ctx of the calling device", iow, the
null-array calling model is available and the flag is redefined to
match.  Reset may still be available via the proof-of-ownership model.
 
> Also, we need to has assumption that below two cases are rare
> if user encounters it, it just bad luck for them. I think the existing
> _INFO and hot-reset already has such assumption. So cdev mode
> can adopt it as well.
> 
> a) physical topology change (e.g. new devices plugged to affected slot)
> b) an affected device is unbound from vfio

Yes, these are sufficiently rare that we can't do much about them.

> > So the kernel needs to compare the group id between devices with
> > valid dev-ids and devices with invalid dev-ids to decide the implicit
> > ownership. For noiommu device which has no group_id when
> > VFIO_GROUP is off then it's resettable only if having a valid dev_id.  
> 
> In cdev mode, noiommu device doesn't have dev_id as it is not
> bound to valid iommufd. So if VFIO_GROUP is off, we may never
> allow hot-reset for noiommu devices. But we don't want to have
> regression with noiommu devices. Perhaps we may define the usage
> of the resettable flag like this:
> 1) if it is set, user does not need to own all the affected devices as
>     some of them may have been owned implicitly. Kernel should have
>     checked it.
> 2) if the flag is not set, that means user needs to check ownership
>     by itself. It needs to own all the affected devices. If not, don't
>    do hot-reset.

Exactly, the flag essentially indicates that the null-array approach is
available, lack of the flag indicates proof-of-ownership is required.
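
Sketched from the user's side, and only as an illustration of the
semantics being discussed: the enum and pick_reset_method() below are
hypothetical names, and the inputs stand for whatever the INFO return
and the user's own bookkeeping provide.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcomes for a user deciding how to call hot-reset. */
enum reset_method {
	RESET_NULL_ARRAY,		/* kernel inferred ownership */
	RESET_PROOF_OF_OWNERSHIP,	/* user passes fds as proof */
	RESET_UNAVAILABLE,		/* user cannot prove ownership */
};

/*
 * null_array_flag: the flag reported by the kernel.
 * naffected:       number of devices affected by the reset.
 * nowned:          number of those for which the user holds an fd.
 */
static enum reset_method pick_reset_method(bool null_array_flag,
					   size_t naffected, size_t nowned)
{
	if (null_array_flag)
		return RESET_NULL_ARRAY;
	if (naffected > 0 && nowned == naffected)
		return RESET_PROOF_OF_OWNERSHIP;
	return RESET_UNAVAILABLE;
}
```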
 
> This way we can still make noiommu devices support hot-reset
> just like VFIO_GROUP is on. Because noiommu devices have fake
> groups, such groups are all singleton. So checking all affected
> devices are opened by user is just same as check all affected
> groups.

Yep.

> > The only corner case with this option is when a user mixes group
> > and cdev usages. iirc you mentioned it's a valid usage to be supported.
> > In that case the kernel doesn't have sufficient knowledge to judge
> > 'resettable' as it doesn't know which groups are opened by this user.
> >
> > Not sure whether we can leave it in a ugly way so INFO may not tell
> > 'resettable' accurately in that weird scenario.  
> 
> This seems not easy to support. If above scenario is allowed there can be
> three cases that returns invalid dev_id.
> 1) devices not opened by user but owned implicitly

The cdev approach has a hard time with this in general, it has no way to
represent unopened devices, so any case where the nature of an unopened
device blocks reset on the dev-set is rather opaque to the user.

> 2) devices not owned by user

(and presumably not owned)  We still provide BDF.  Not much difference
from the group case here, being able to point to a BDF or group is
about all we can do.

> 3) devices opened via group but owned by user

I think this still works in the proof-of-ownership, passing fds to
hot-reset model.

> User would require more info to tell the above cases from each other.

Obviously we could be equivalent to the group model if IOMMU groups
were exposed for a device and all devices had IOMMU groups, but
reasons...

> > > array to associate affected, owned devices, and still has the
> > > equivalent information to know that one or more of the devices listed
> > > with an invalid dev-id are preventing the hot-reset from being
> > > available.
> > >
> > > Is that an option?  Thanks,
> > >  
> > 
> > This works for me if above corner case can be waived.  
> 
> One side check, perhaps already confirmed in prior email. @Alex, So
> the reason for the prediction of hot-reset is to avoid the possible
> vfio_pci_pre_reset() which does heavy operations like stop DMA and
> copy config space. Is it? Any other special reason? Anyhow, this reason
> is enough for this prediction per my understanding.

It's not clear to me what "prediction" is referring to.  As above, I
think we can redefine the reset-available flag I proposed to more
restrictively indicate that the null-array approach is available based
on the dev-set group in relation to the iommufd_ctx of the calling
device.  Prediction of the affected devices seems like basic
functionality to me, we can't assume the user's usage model, they must
be able to make a well informed decision regarding affected devices.
Thanks,

Alex
Yi Liu April 17, 2023, 4:20 a.m. UTC | #68
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Saturday, April 15, 2023 1:11 AM
> 
> On Fri, 14 Apr 2023 11:38:24 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > > From: Tian, Kevin <kevin.tian@intel.com>
> > > Sent: Friday, April 14, 2023 5:12 PM
> > >
> > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > Sent: Friday, April 14, 2023 2:07 AM
> > > >
> > > > We had already iterated a proposal where the group-id is replaced with
> > > > a dev-id in the existing ioctl and a flag indicates when the return
> > > > value is a dev-id vs group-id.  This had a gap that userspace cannot
> > > > determine if a reset is available given this information since un-owned
> > > > devices report an invalid dev-id and userspace can't know if it has
> > > > implicit ownership.
> > >
> > > >
> > > > It seems cleaner to me though that we would could still re-use INFO in
> > > > a similar way, simply defining a new flag bit which is valid only in
> > > > the case of returning dev-ids and indicates if the reset is available.
> > > > Therefore in one ioctl, userspace knows if hot-reset is available
> > > > (based on a kernel determination) and can pull valid dev-ids from the
> >
> > Need to confirm the meaning of hot-reset available flag. I think it
> > should at least meet below two conditions to set this flag. Although
> > it may not mean hot-reset is for sure to succeed. (but should be
> > a high chance).
> >
> > 1) dev_set is resettable (all affected device are in dev_set)
> > 2) affected device are owned by the current user
> 
> Per thread with Kevin, ownership can't always be known by the kernel.
> Beyond the group vs cdev discussion there, isn't it also possible
> (though perhaps not recommended) that a user can have multiple iommufd
> ctxs?  So I think 2) becomes "ownership of the affected dev-set can be
> inferred from the iommufd_ctx of the calling device", iow, the
> null-array calling model is available and the flag is redefined to
> match.  Reset may still be available via the proof-of-ownership model.

Yes, if there are multiple iommufd ctxs, this shall fall back to the
proof-of-ownership model.

> 
> > Also, we need to has assumption that below two cases are rare
> > if user encounters it, it just bad luck for them. I think the existing
> > _INFO and hot-reset already has such assumption. So cdev mode
> > can adopt it as well.
> >
> > a) physical topology change (e.g. new devices plugged to affected slot)
> > b) an affected device is unbound from vfio
> 
> Yes, these are sufficiently rare that we can't do much about them.
> 
> > > So the kernel needs to compare the group id between devices with
> > > valid dev-ids and devices with invalid dev-ids to decide the implicit
> > > ownership. For noiommu device which has no group_id when
> > > VFIO_GROUP is off then it's resettable only if having a valid dev_id.
> >
> > In cdev mode, noiommu device doesn't have dev_id as it is not
> > bound to valid iommufd. So if VFIO_GROUP is off, we may never
> > allow hot-reset for noiommu devices. But we don't want to have
> > regression with noiommu devices. Perhaps we may define the usage
> > of the resettable flag like this:
> > 1) if it is set, user does not need to own all the affected devices as
> >     some of them may have been owned implicitly. Kernel should have
> >     checked it.
> > 2) if the flag is not set, that means user needs to check ownership
> >     by itself. It needs to own all the affected devices. If not, don't
> >    do hot-reset.
> 
> Exactly, the flag essentially indicates that the null-array approach is
> available, lack of the flag indicates proof-of-ownership is required.
> 
> > This way we can still make noiommu devices support hot-reset
> > just like VFIO_GROUP is on. Because noiommu devices have fake
> > groups, such groups are all singleton. So checking all affected
> > devices are opened by user is just same as check all affected
> > groups.
> 
> Yep.
> 
> > > The only corner case with this option is when a user mixes group
> > > and cdev usages. iirc you mentioned it's a valid usage to be supported.
> > > In that case the kernel doesn't have sufficient knowledge to judge
> > > 'resettable' as it doesn't know which groups are opened by this user.
> > >
> > > Not sure whether we can leave it in a ugly way so INFO may not tell
> > > 'resettable' accurately in that weird scenario.
> >
> > This seems not easy to support. If above scenario is allowed there can be
> > three cases that returns invalid dev_id.
> > 1) devices not opened by user but owned implicitly
> 
> The cdev approach has a hard time with this in general, it has no way to
> represent unopened devices. so any case where the nature of an unopened
> device block reset on the dev-set is rather opaque to the user.
> 
> > 2) devices not owned by user
> 
> (and presumable not owned)  We still provide BDF.  Not much difference
> from the group case here, being able to point to a BDF or group is
> about all we can do.
> 
> > 3) devices opened via group but owned by user
> 
> I think this still works in the proof-of-ownership, passing fds to
> hot-reset model.

OK. Let's check whether the scenario below and the user's handling of it
make sense.

Say there are five devices (devA, devB, devC, devD, devE) in the same reset
group. devA and devB are in the same iommu group. devC, devD, and devE have
separate iommu groups. Say devA is opened via cdev, devB is not opened, devC
is opened via group, and devD is opened via cdev but bound to a different
iommufd_ctx than devA. devE is not opened by any user.

If this INFO is called on devA, the user should get a valid dev_id for devA
and four invalid dev_ids. The resettable flag should be clear. Below is how
the user would handle the returned info.

- For devB, the user gets the group_id of devA and the group_id of devB,
  and can hence check ownership of devB by comparing the groups.
- For devC, the user can check ownership by the group_id and BDF returned.
- For devD, if it is opened by the user, the user should be able to find it
  by BDF.
- For devE, the user fails to find it and hence considers it not owned.

To finish the above check, the user needs to get the group_id via dev_id and
also via device fd. Is that right?

The above example may be the trickiest scenario. Is that right? The user shall
not do hot-reset as not all affected devices are owned by the user. But if
devE is also opened by the user, it could do hot-reset.
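
The ownership check walked through above can be sketched as a small userspace
model. This is only an illustration under stated assumptions: struct
affected_dev, dev_owned(), user_may_hot_reset() and scenario_resettable() are
hypothetical names rather than VFIO uAPI, the group numbers are made up, and
devD is assumed to be opened by the same user through a second iommufd_ctx.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one affected device; not a kernel structure. */
struct affected_dev {
	const char *name;	/* label only, e.g. "devA" */
	int iommu_group;	/* made-up IOMMU group number */
	bool opened;		/* opened by this user, via cdev or group */
};

/*
 * A device counts as owned if the user opened it, or if it shares an
 * IOMMU group with a device the user opened (implicit ownership).
 */
static bool dev_owned(const struct affected_dev *devs, size_t n, size_t i)
{
	if (devs[i].opened)
		return true;
	for (size_t j = 0; j < n; j++)
		if (devs[j].opened &&
		    devs[j].iommu_group == devs[i].iommu_group)
			return true;
	return false;
}

/* Only attempt hot-reset when every affected device is owned. */
static bool user_may_hot_reset(const struct affected_dev *devs, size_t n)
{
	assert(devs != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (!dev_owned(devs, n, i))
			return false;
	return true;
}

/* The five-device scenario; dev_e_opened toggles the final case. */
static bool scenario_resettable(bool dev_e_opened)
{
	const struct affected_dev devs[] = {
		{ "devA", 1, true },		/* opened via cdev */
		{ "devB", 1, false },		/* unopened, shares devA's group */
		{ "devC", 2, true },		/* opened via group */
		{ "devD", 3, true },		/* opened via cdev, other ctx */
		{ "devE", 4, dev_e_opened },
	};

	return user_may_hot_reset(devs, sizeof(devs) / sizeof(devs[0]));
}
```

With devE unopened the model refuses the reset; once devE is opened by the
same user every affected device is owned and the reset may be attempted.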

> > User would require more info to tell the above cases from each other.
> 
> Obviously we could be equivalent to the group model if IOMMU groups
> were exposed for a device and all devices had IOMMU groups, but
> reasons...
> 
> > > > array to associate affected, owned devices, and still has the
> > > > equivalent information to know that one or more of the devices listed
> > > > with an invalid dev-id are preventing the hot-reset from being
> > > > available.
> > > >
> > > > Is that an option?  Thanks,
> > > >
> > >
> > > This works for me if above corner case can be waived.
> >
> > One side check, perhaps already confirmed in prior email. @Alex, So
> > the reason for the prediction of hot-reset is to avoid the possible
> > vfio_pci_pre_reset() which does heavy operations like stop DMA and
> > copy config space. Is it? Any other special reason? Anyhow, this reason
> > is enough for this prediction per my understanding.
> 
> It's not clear to me what "prediction" is referring to.

It is predicting whether the hot-reset ioctl can work or not, as you mentioned
in a prior discussion [1].

"I disagree, as I've argued before, the info ioctl becomes so weak and
effectively arbitrary from a user perspective at being able to predict
whether the hot-reset ioctl works that it becomes useless, diminishing
the entire hot-reset info/execute API."

[1] https://lore.kernel.org/kvm/20230405134945.29e967be.alex.williamson@redhat.com/

> As above, I
> think we can redefine the reset-available flag I proposed to more
> restrictively indicate that the null-array approach is available based
> on the dev-set group in relation to the iommufd_ctx of the calling
> device.  Prediction of the affected devices seems like basic
> functionality to me, we can't assume the user's usage model, they must
> be able to make a well informed decision regarding affected devices.
> Thanks,

As in my reply above with the five-device scenario, the user still needs to
get the group_id to check implicit ownership when devices share the same
iommu_group.

Regards,
Yi Liu
Jason Gunthorpe April 17, 2023, 1:39 p.m. UTC | #69
On Fri, Apr 14, 2023 at 09:11:30AM +0000, Tian, Kevin wrote:

> The only corner case with this option is when a user mixes group
> and cdev usages. iirc you mentioned it's a valid usage to be supported.
> In that case the kernel doesn't have sufficient knowledge to judge
> 'resettable' as it doesn't know which groups are opened by this user.

IMHO we don't need to support this combination.

We can say that to use the hot reset API the user must put all their
devices into the same iommufd_ctx and cover 100% of the known use
cases for this.

There are already other situations, like nesting, that do force users
to put everything into one iommufd_ctx.

No reason to make things harder and more complicated.

I'm coming to the feeling that we should put no-iommu devices in
iommufd_ctx's as well. They would be an iommufd_access like
mdevs. That would clean up the complications they cause here.

I suppose we should have done that from the beginning - no-iommu is an
IOMMUFD access, it just uses a crazy /proc based way to learn the
PFNs. Making it a proper access and making a real VFIO ioctl that
calls iommufd_access_pin_pages() and returns the DMA mapped addresses
to userspace would go a long way to making no-iommu work in a logical,
usable, way.

Jason
Jason Gunthorpe April 17, 2023, 2:05 p.m. UTC | #70
On Thu, Apr 13, 2023 at 12:07:12PM -0600, Alex Williamson wrote:

> IIUC, the semantics we're proposing is that an INFO2 ioctl would return
> success or failure indicating whether the user has sufficient ownership
> of the affected devices, 

Or a flag, but yes

> and in the success case returns an array of
> affected dev-ids within the user's iommufd_ctx.  Unopened, affected
> devices, are not reported via INFO2, and unopened, affected devices
> outside the user's scope of ownership (ie. outside the owned IOMMU
> group) will generate a failure condition.

Yes

> As for the INFO ioctl, it's described as unchanged, which does raise
> the question of what is reported for IOMMU groups and how does the
> value there coherently relate to anything else in the cdev-exclusive
> vfio API...

For cdev mode the value of the group_id has no functional
purpose. INFO has no functional purpose beyond debugging. The cdev
enabled userspace should print the BDFs from the INFO in a debug
message and ignore the group_id.

Kernel will still fill the group_id using the iommu_get_group() stuff,
and set -1 for no-iommu.
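
A minimal sketch of that debug-only use, assuming the entry layout of
struct vfio_pci_dependent_device (group_id, segment, bus, devfn). To stay
self-contained it uses a local mirror, struct dep_dev, and the helper names
format_bdf() and debug_print_affected() are hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Local mirror of the assumed uAPI entry layout, for illustration only. */
struct dep_dev {
	uint32_t group_id;	/* ignored by a cdev-enabled userspace */
	uint16_t segment;
	uint8_t bus;
	uint8_t devfn;		/* slot in bits 7:3, function in bits 2:0 */
};

/* Format one entry as "ssss:bb:dd.f"; returns what snprintf() returns. */
static int format_bdf(const struct dep_dev *d, char *buf, size_t len)
{
	assert(d != NULL && buf != NULL && len > 0);
	return snprintf(buf, len, "%04x:%02x:%02x.%x",
			(unsigned int)d->segment,
			(unsigned int)d->bus,
			(unsigned int)((d->devfn >> 3) & 0x1f),
			(unsigned int)(d->devfn & 0x7));
}

/* Print the BDF of every affected device; group_id is never consulted. */
static void debug_print_affected(const struct dep_dev *devs, size_t n)
{
	char bdf[16];

	for (size_t i = 0; i < n; i++) {
		format_bdf(&devs[i], bdf, sizeof(bdf));
		fprintf(stderr, "hot reset affects %s\n", bdf);
	}
}
```

The group_id field is carried along but deliberately unused, including the
-1 value for no-iommu.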

> We had already iterated a proposal where the group-id is replaced with
> a dev-id in the existing ioctl and a flag indicates when the return
> value is a dev-id vs group-id.  This had a gap that userspace cannot
> determine if a reset is available given this information since un-owned
> devices report an invalid dev-id and userspace can't know if it has
> implicit ownership.

IIRC, yes.

> It seems cleaner to me though that we would could still re-use INFO in
> a similar way, simply defining a new flag bit which is valid only in
> the case of returning dev-ids and indicates if the reset is
> available.

Yes, it could be done like this as well. INFO2 is more a discussion
object, how we encode it in the uAPI matters a lot less. The point is
that INFO2, as an idea, returns information that no other existing API
returns: the "ownership passed flag" and "dev_id list"

Then as I said in the other mail we roll no-iommu into an iommufd_ctx
object and just follow the design that userspace must have a single
iommufd_ctx containing all the devices to use the hot reset feature.

Jason
Alex Williamson April 17, 2023, 7:01 p.m. UTC | #71
On Mon, 17 Apr 2023 04:20:27 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Saturday, April 15, 2023 1:11 AM
> > 
> > On Fri, 14 Apr 2023 11:38:24 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > > From: Tian, Kevin <kevin.tian@intel.com>
> > > > Sent: Friday, April 14, 2023 5:12 PM
> > > >  
> > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > Sent: Friday, April 14, 2023 2:07 AM
> > > > >
> > > > > We had already iterated a proposal where the group-id is replaced with
> > > > > a dev-id in the existing ioctl and a flag indicates when the return
> > > > > value is a dev-id vs group-id.  This had a gap that userspace cannot
> > > > > determine if a reset is available given this information since un-owned
> > > > > devices report an invalid dev-id and userspace can't know if it has
> > > > > implicit ownership.  
> > > >  
> > > > >
> > > > > It seems cleaner to me though that we would could still re-use INFO in
> > > > > a similar way, simply defining a new flag bit which is valid only in
> > > > > the case of returning dev-ids and indicates if the reset is available.
> > > > > Therefore in one ioctl, userspace knows if hot-reset is available
> > > > > (based on a kernel determination) and can pull valid dev-ids from the  
> > >
> > > Need to confirm the meaning of hot-reset available flag. I think it
> > > should at least meet below two conditions to set this flag. Although
> > > it may not mean hot-reset is for sure to succeed. (but should be
> > > a high chance).
> > >
> > > 1) dev_set is resettable (all affected device are in dev_set)
> > > 2) affected device are owned by the current user  
> > 
> > Per thread with Kevin, ownership can't always be known by the kernel.
> > Beyond the group vs cdev discussion there, isn't it also possible
> > (though perhaps not recommended) that a user can have multiple iommufd
> > ctxs?  So I think 2) becomes "ownership of the affected dev-set can be
> > inferred from the iommufd_ctx of the calling device", iow, the
> > null-array calling model is available and the flag is redefined to
> > match.  Reset may still be available via the proof-of-ownership model.  
> 
> Yes, if there are multiple iommufd ctxs, this shall fall back to use
> the proof-of-ownership model.
> 
> >   
> > > Also, we need to has assumption that below two cases are rare
> > > if user encounters it, it just bad luck for them. I think the existing
> > > _INFO and hot-reset already has such assumption. So cdev mode
> > > can adopt it as well.
> > >
> > > a) physical topology change (e.g. new devices plugged to affected slot)
> > > b) an affected device is unbound from vfio  
> > 
> > Yes, these are sufficiently rare that we can't do much about them.
> >   
> > > > So the kernel needs to compare the group id between devices with
> > > > valid dev-ids and devices with invalid dev-ids to decide the implicit
> > > > ownership. For noiommu device which has no group_id when
> > > > VFIO_GROUP is off then it's resettable only if having a valid dev_id.  
> > >
> > > In cdev mode, noiommu device doesn't have dev_id as it is not
> > > bound to valid iommufd. So if VFIO_GROUP is off, we may never
> > > allow hot-reset for noiommu devices. But we don't want to have
> > > regression with noiommu devices. Perhaps we may define the usage
> > > of the resettable flag like this:
> > > 1) if it is set, user does not need to own all the affected devices as
> > >     some of them may have been owned implicitly. Kernel should have
> > >     checked it.
> > > 2) if the flag is not set, that means user needs to check ownership
> > >     by itself. It needs to own all the affected devices. If not, don't
> > >    do hot-reset.  
> > 
> > Exactly, the flag essentially indicates that the null-array approach is
> > available, lack of the flag indicates proof-of-ownership is required.
> >   
> > > This way we can still make noiommu devices support hot-reset
> > > just like VFIO_GROUP is on. Because noiommu devices have fake
> > > groups, such groups are all singleton. So checking all affected
> > > devices are opened by user is just same as check all affected
> > > groups.  
> > 
> > Yep.
> >   
> > > > The only corner case with this option is when a user mixes group
> > > > and cdev usages. iirc you mentioned it's a valid usage to be supported.
> > > > In that case the kernel doesn't have sufficient knowledge to judge
> > > > 'resettable' as it doesn't know which groups are opened by this user.
> > > >
> > > > Not sure whether we can leave it in a ugly way so INFO may not tell
> > > > 'resettable' accurately in that weird scenario.  
> > >
> > > This seems not easy to support. If above scenario is allowed there can be
> > > three cases that returns invalid dev_id.
> > > 1) devices not opened by user but owned implicitly  
> > 
> > The cdev approach has a hard time with this in general, it has no way to
> > represent unopened devices. so any case where the nature of an unopened
> > device block reset on the dev-set is rather opaque to the user.
> >   
> > > 2) devices not owned by user  
> > 
> > (and presumable not owned)  We still provide BDF.  Not much difference
> > from the group case here, being able to point to a BDF or group is
> > about all we can do.
> >   
> > > 3) devices opened via group but owned by user  
> > 
> > I think this still works in the proof-of-ownership, passing fds to
> > hot-reset model.  
> 
> Ok. let's see below scenario and user's processing makes sense.
> 
> Say there are five devices (devA, devB, devC, devD, devE) in the same reset
> group. devA and devB are in the same iommu group. devC, devD, and devE have
> separate iommu groups. Say devA is opened via cdev, devB is not opened, devC
> is opened via group, devD is opened cdev but bound to another iommufdctx that
> is different with devA. devE is not opened by any user
> 
> If this INFO is called on devA, user should get a valid dev_id for devA, but
> four invalid dev_ids. The resettable flag should be clear. Below is how user
> to handle the info returned.

INFO from devA returns:

flags: NOT_RESETABLE | DEV_ID
{
  { valid devA-id,  devA-BDF },
  { invalid dev-id, devB-BDF },
  { invalid dev-id, devC-BDF },
  { invalid dev-id, devD-BDF },
  { invalid dev-id, devE-BDF },
}

User knows devA-id, learns devA-BDF

from devC:
{
  { devA/B-group-id, devA-BDF },
  { devA/B-group-id, devB-BDF },
  { devC-group-id,   devC-BDF },
  { devD-group-id,   devD-BDF },
  { devE-group-id,   devE-BDF },
}

User is assumed to know devC group-id + BDF given group semantics,
knows devA ownership, infers devB ownership.

from devD:
flags: NOT_RESETABLE | DEV_ID
{
  { invalid dev-id, devA-BDF },
  { invalid dev-id, devB-BDF },
  { invalid dev-id, devC-BDF },
  { valid devD-id,  devD-BDF },
  { invalid dev-id, devE-BDF },
}

User knows devD-id, learns devD-BDF, knows devA and devC ownership, and
has inferred devB ownership.

> - For devB, user shall get the group_id for devA, and also get group_id for
>   devB, hence able to check ownership of devB by checking the group

Per above, groups are only available through the group devices,
therefore inferred ownership of devB can only be learned from devC.

> - For devC, user can check ownership by the group_id and bdf returned

Yes, the INFO ioctl on devC can confirm devC is affected, but more
importantly this is the bridge to learn BDF of other affected devices
and their groups.

> - For devD, if it is opened by the user, should be able to find it by bdf

I think the reverse, the user presumably already knows the dev-id for
devD and knows that a hot-reset of the calling device necessarily
affects the device, but it learns the BDF, which helps it connect 4 of
the 5 devices affected by the reset.

> - For devE, user shall fail to find it hence consider no ownership on it.

Yes, which is correct.

> To finish the above check, user needs to get group_id via devid an also needs
> to get group_id via device fd. Is it?

Not absolutely required, but the user needs to do a lot of inferring via
BDF.

> The above example may be the most tricky scenario. Is it? user shall not do
> hot-reset as not all affected devices are owned by user. But if devE is also
> opened by user, it could do hot-reset.

Yes, it's not trivial, but Jason is now proposing that we consider
mixing groups, cdevs, and multiple iommufd_ctxs as invalid.  I think
this means that regardless of which device calls INFO, there's only one
answer (assuming same set of devices opened, all cdev, all within same
iommufd_ctx).  Based on what I explained about my understanding of INFO2
and Jason agreed to, I think the output would be:

flags: NOT_RESETABLE | DEV_ID
{
  { valid devA-id,  devA-BDF },
  { valid devC-id,  devC-BDF },
  { valid devD-id,  devD-BDF },
  { invalid dev-id, devE-BDF },
}

Here devB gets dropped because the kernel understands that devB is
unopened, affected, and owned.  It's therefore not a blocker for
hot-reset.  OTOH, devE is unopened, affected, and un-owned, and we
previously agreed against the opportunistic un-opened/un-owned loophole.

If devA and devD were in separate iommufd_ctxs, with devC in the same
ctx as devA, I think this becomes:

INFO on devA:
flags: NOT_RESETABLE | DEV_ID
{
  { valid devA-id,  devA-BDF },
  { valid devC-id,  devC-BDF },
  { invalid dev-id, devD-BDF },
  { invalid dev-id, devE-BDF },
}

INFO on devD:
flags: NOT_RESETABLE | DEV_ID
{
  { invalid dev-id, devA-BDF },
  { invalid dev-id, devB-BDF },
  { invalid dev-id, devC-BDF },
  { valid devD-id,  devD-BDF },
  { invalid dev-id, devE-BDF },
}

I think this illustrates that it makes sense for unopened affected
devices with implicit ownership to always be hidden, while all other
affected devices are fully enumerated.
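
The reporting rule in these examples can be sketched as a small model. This
is not kernel code: struct model_dev, struct info_entry, fill_info(), the
flag bit values and the use of 0 as the invalid dev-id are all assumptions
made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_INVALID_DEV_ID		0u		/* assumed marker */
#define MODEL_FLAG_DEV_ID		(1u << 0)	/* hypothetical bit */
#define MODEL_FLAG_NOT_RESETABLE	(1u << 1)	/* hypothetical bit */

struct model_dev {
	const char *bdf;
	int iommu_group;
	int ctx;		/* iommufd_ctx the device is bound to, -1 if unopened */
	uint32_t dev_id;	/* only meaningful when ctx >= 0 */
};

struct info_entry {
	uint32_t dev_id;
	const char *bdf;
};

/* Unopened device sharing an IOMMU group with a device in the caller's ctx. */
static bool implicitly_owned(const struct model_dev *devs, size_t n,
			     size_t i, int caller_ctx)
{
	if (devs[i].ctx >= 0)
		return false;
	for (size_t j = 0; j < n; j++)
		if (devs[j].ctx == caller_ctx &&
		    devs[j].iommu_group == devs[i].iommu_group)
			return true;
	return false;
}

/*
 * Fill out[] (room for n entries) as seen from caller_ctx and return the
 * number of entries written. Implicitly owned devices are hidden; any
 * other device outside the caller's ctx is listed with an invalid dev-id
 * and marks the set as not resettable.
 */
static size_t fill_info(const struct model_dev *devs, size_t n, int caller_ctx,
			struct info_entry *out, uint32_t *flags)
{
	size_t count = 0;

	assert(caller_ctx >= 0);
	*flags = MODEL_FLAG_DEV_ID;
	for (size_t i = 0; i < n; i++) {
		if (devs[i].ctx == caller_ctx) {
			out[count].dev_id = devs[i].dev_id;
		} else if (implicitly_owned(devs, n, i, caller_ctx)) {
			continue;	/* owned and unopened: not reported */
		} else {
			out[count].dev_id = MODEL_INVALID_DEV_ID;
			*flags |= MODEL_FLAG_NOT_RESETABLE;
		}
		out[count].bdf = devs[i].bdf;
		count++;
	}
	return count;
}
```

Fed the split-context case, the model yields four entries when called for
devA's context and five when called for devD's, matching the lists above.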

> > > User would require more info to tell the above cases from each other.  
> > 
> > Obviously we could be equivalent to the group model if IOMMU groups
> > were exposed for a device and all devices had IOMMU groups, but
> > reasons...
> >   
> > > > > array to associate affected, owned devices, and still has the
> > > > > equivalent information to know that one or more of the devices listed
> > > > > with an invalid dev-id are preventing the hot-reset from being
> > > > > available.
> > > > >
> > > > > Is that an option?  Thanks,
> > > > >  
> > > >
> > > > This works for me if above corner case can be waived.  
> > >
> > > One side check, perhaps already confirmed in prior email. @Alex, So
> > > the reason for the prediction of hot-reset is to avoid the possible
> > > vfio_pci_pre_reset() which does heavy operations like stop DMA and
> > > copy config space. Is it? Any other special reason? Anyhow, this reason
> > > is enough for this prediction per my understanding.  
> > 
> > It's not clear to me what "prediction" is referring to.  
> 
> It is predicting whether hot-reset ioctl can work or not as you mentioned
> in prior discussion.[1].
> 
> "I disagree, as I've argued before, the info ioctl becomes so weak and
> effectively arbitrary from a user perspective at being able to predict
> whether the hot-reset ioctl works that it becomes useless, diminishing
> the entire hot-reset info/execute API."
> 
> [1] https://lore.kernel.org/kvm/20230405134945.29e967be.alex.williamson@redhat.com/

I think we're narrowing in on an interface that isn't as arbitrary.  If
we assume the restrictions that Jason proposes, then cdev is exclusively
a kernel determined reset availability model, where I'd agree that
passing device-fds as a proof of ownership is pointless.  The group
interface would therefore remain exclusively a proof-of-ownership
model since we have no incentive to extend it to kernel-determined
given the limited use case of all affected devices managed by the same
vfio container.

> > As above, I
> > think we can redefine the reset-available flag I proposed to more
> > restrictively indicate that the null-array approach is available based
> > on the dev-set group in relation to the iommufd_ctx of the calling
> > device.  Prediction of the affected devices seems like basic
> > functionality to me, we can't assume the user's usage model, they must
> > be able to make a well informed decision regarding affected devices.
> > Thanks,  
> 
> As my above reply with the five-device scenario. It still needs to get
> group_id to check implicit ownership in the case of sharing the same
> iommu_group.

Moot, but there's actually enough information there to infer IOMMU
groups for each device, but we probably can't prove that would always
be the case.  If we adopt Jason's proposal though, I don't see that we
need either a group-id or BDF capability, the BDF is only for debug
reporting.  However, there is a new burden on the kernel to identify
the affected, un-owned devices for that report.  Thanks,

Alex
Jason Gunthorpe April 17, 2023, 7:31 p.m. UTC | #72
On Mon, Apr 17, 2023 at 01:01:40PM -0600, Alex Williamson wrote:
> Yes, it's not trivial, but Jason is now proposing that we consider
> mixing groups, cdevs, and multiple iommufd_ctxs as invalid.  I think
> this means that regardless of which device calls INFO, there's only one
> answer (assuming same set of devices opened, all cdev, all within same
> iommufd_ctx).  Based on what I explained about my understanding of INFO2
> and Jason agreed to, I think the output would be:
> 
> flags: NOT_RESETABLE | DEV_ID
> {
>   { valid devA-id,  devA-BDF },
>   { valid devC-id,  devC-BDF },
>   { valid devD-id,  devD-BDF },
>   { invalid dev-id, devE-BDF },
> }
> 
> Here devB gets dropped because the kernel understands that devB is
> unopened, affected, and owned.  It's therefore not a blocker for
> hot-reset.

I don't think we want to drop anything because it makes the API
ill suited for the debugging purpose.

devB should be returned with an invalid dev_id if I understand your
example. Maybe it should return with -1 as the dev_id instead of 0, to
make the debugging a bit better.

Userspace should look at only NOT_RESETTABLE to determine if it
proceeds or not, and it should use the valid dev_id list to iterate
over the devices it has open to do the config stuff.
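
As a rough userspace sketch of that rule: the flag bit, the invalid dev-id
value and every name below (struct info_entry, prepare_hot_reset(),
count_valid()) are assumptions for illustration, since none of this is
settled uAPI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HYP_FLAG_NOT_RESETABLE	(1u << 1)	/* hypothetical flag bit */
#define HYP_INVALID_DEV_ID	0u		/* assumed marker, could be -1 */

struct info_entry {
	uint32_t dev_id;
};

typedef void (*dev_cb)(uint32_t dev_id, void *opaque);

/*
 * Decide on the flag alone. When the reset may proceed, call cb for each
 * valid dev-id so the caller can do its per-device config handling.
 */
static bool prepare_hot_reset(uint32_t flags, const struct info_entry *e,
			      size_t n, dev_cb cb, void *opaque)
{
	assert(cb != NULL);
	if (flags & HYP_FLAG_NOT_RESETABLE)
		return false;
	for (size_t i = 0; i < n; i++)
		if (e[i].dev_id != HYP_INVALID_DEV_ID)
			cb(e[i].dev_id, opaque);
	return true;
}

/* Example callback: count the devices the caller would have to handle. */
static void count_valid(uint32_t dev_id, void *opaque)
{
	(void)dev_id;
	(*(size_t *)opaque)++;
}
```

Entries with an invalid dev_id never influence the decision here; they are
left purely for debug output.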

> OTOH, devE is unopened, affected, and un-owned, and we
> previously agreed against the opportunistic un-opened/un-owned loophole.

NOT_RESETABLE should be returned in this case, yes.

If we want to enable userspace to use the loophole it should be an
additional flag, RESETABLE_FOR_NOW or something.

> I think we're narrowing in on an interface that isn't as arbitrary.  If
> we assume the restrictions that Jason proposes, then cdev is exclusively
> a kernel determined reset availability model

Yes, I think this is probably best looking forward.

> where I'd agree that
> passing device-fds as a proof of ownership is pointless.  The group
> interface would therefore remain exclusively a proof-of-ownership
> model since we have no incentive to extend it to kernel-determined
> given the limited use case of all affected devices managed by the same
> vfio container.

Yes

> Moot, but there's actually enough information there to infer IOMMU
> groups for each device, but we probably can't prove that would always
> be the case.  If we adopt Jason's proposal though, I don't see that we
> need either a group-id or BDF capability, the BDF is only for debug
> reporting.  However, there is a new burden on the kernel to identify
> the affected, un-owned devices for that report.  

Yes and yes

Jason
Alex Williamson April 17, 2023, 8:06 p.m. UTC | #73
On Mon, 17 Apr 2023 16:31:56 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Apr 17, 2023 at 01:01:40PM -0600, Alex Williamson wrote:
> > Yes, it's not trivial, but Jason is now proposing that we consider
> > mixing groups, cdevs, and multiple iommufd_ctxs as invalid.  I think
> > this means that regardless of which device calls INFO, there's only one
> > answer (assuming same set of devices opened, all cdev, all within same
> > iommufd_ctx).  Based on what I explained about my understanding of INFO2
> > and Jason agreed to, I think the output would be:
> > 
> > flags: NOT_RESETABLE | DEV_ID
> > {
> >   { valid devA-id,  devA-BDF },
> >   { valid devC-id,  devC-BDF },
> >   { valid devD-id,  devD-BDF },
> >   { invalid dev-id, devE-BDF },
> > }
> > 
> > Here devB gets dropped because the kernel understands that devB is
> > unopened, affected, and owned.  It's therefore not a blocker for
> > hot-reset.  
> 
> I don't think we want to drop anything because it makes the API
> ill suited for the debugging purpose.
> 
> devb should be returned with an invalid dev_id if I understand your
> example. Maybe it should return with -1 as the dev_id instead of 0, to
> make the debugging a bit better.
> 
> Userspace should look at only NOT_RESETTABLE to determine if it
> proceeds or not, and it should use the valid dev_id list to iterate
> over the devices it has open to do the config stuff.

If an affected device is owned, not opened, and not interfering with
the reset, what is it adding to the API to report it for debugging
purposes?  I'm afraid this leads into expanding "invalid dev-id" into an
errno or bitmap of error conditions that the user needs to parse.

> > OTOH, devE is unopened, affected, and un-owned, and we
> > previously agreed against the opportunistic un-opened/un-owned loophole.  
> 
> NOT_RESETABLE should be returned in this case, yes.
> 
> If we want to enable userspace to use the loophole it should be an
> additional flag. RESETABLE_FOR_NOW or something

Ugh, please no.  It's always a volatile result, but a volatile result
that relies on device state outside the scope or control of the user is
not even worthwhile imo.  Thanks,

Alex
Tian, Kevin April 18, 2023, 1:28 a.m. UTC | #74
> From: Jason Gunthorpe
> Sent: Monday, April 17, 2023 9:39 PM
> 
> On Fri, Apr 14, 2023 at 09:11:30AM +0000, Tian, Kevin wrote:
> 
> > The only corner case with this option is when a user mixes group
> > and cdev usages. iirc you mentioned it's a valid usage to be supported.
> > In that case the kernel doesn't have sufficient knowledge to judge
> > 'resettable' as it doesn't know which groups are opened by this user.
> 
> IMHO we don't need to support this combination.
> 
> We can say that to use the hot reset API the user must put all their
> devices into the same iommufd_ctx and cover 100% of the known use
> cases for this.

Makes sense.

> 
> There are already other situations, like nesting, that do force users
> to put everything into one iommufd_ctx.
> 
> No reason to make things harder and more complicated.
> 
> I'm coming to the feeling that we should put no-iommu devices in
> iommufd_ctx's as well. They would be an iommufd_access like
> mdevs. That would clean up the complications they cause here.

This certainly simplifies the matter a lot!

> 
> I suppose we should have done that from the beginning - no-iommu is an
> IOMMUFD access, it just uses a crazy /proc based way to learn the
> PFNs. Making it a proper access and making a real VFIO ioctl that
> calls iommufd_access_pin_pages() and returns the DMA mapped addresses
> to userspace would go a long way to making no-iommu work in a logical,
> usable, way.
> 

Yes. This would provide a more reliable/clean way to learn PFNs for
the noiommu case.
Tian, Kevin April 18, 2023, 3:24 a.m. UTC | #75
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Tuesday, April 18, 2023 4:07 AM
> 
> On Mon, 17 Apr 2023 16:31:56 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Mon, Apr 17, 2023 at 01:01:40PM -0600, Alex Williamson wrote:
> > > Yes, it's not trivial, but Jason is now proposing that we consider
> > > mixing groups, cdevs, and multiple iommufd_ctxs as invalid.  I think
> > > this means that regardless of which device calls INFO, there's only one
> > > answer (assuming same set of devices opened, all cdev, all within same
> > > iommufd_ctx).  Based on what I explained about my understanding of
> INFO2
> > > and Jason agreed to, I think the output would be:
> > >
> > > flags: NOT_RESETABLE | DEV_ID
> > > {
> > >   { valid devA-id,  devA-BDF },
> > >   { valid devC-id,  devC-BDF },
> > >   { valid devD-id,  devD-BDF },
> > >   { invalid dev-id, devE-BDF },
> > > }
> > >
> > > Here devB gets dropped because the kernel understands that devB is
> > > unopened, affected, and owned.  It's therefore not a blocker for
> > > hot-reset.
> >
> > I don't think we want to drop anything because it makes the API
> > ill suited for the debugging purpose.
> >
> > devb should be returned with an invalid dev_id if I understand your
> > example. Maybe it should return with -1 as the dev_id instead of 0, to
> > make the debugging a bit better.
> >
> > Userspace should look at only NOT_RESETTABLE to determine if it
> > proceeds or not, and it should use the valid dev_id list to iterate
> > over the devices it has open to do the config stuff.
> 
> If an affected device is owned, not opened, and not interfering with
> the reset, what is it adding to the API to report it for debugging
> purposes?  I'm afraid this leads into expanding "invalid dev-id" into an

consistent output before and after devB is opened.

> errno or bitmap of error conditions that the user needs to parse.
> 

Not exactly.

If RESETABLE, an invalid dev_id doesn't matter. The user only uses the
valid dev_id list to iterate, as Jason pointed out.

If NOT_RESETTABLE because devE is not assigned to the VM, one can
easily figure that out by comparing the list of affected BDFs with
the configuration of the devices assigned to the VM. Then the invalid
dev_id also doesn't matter.

If NOT_RESETTABLE while devE is already assigned to the VM, then it's
an indication of mixing groups, cdevs or multiple iommufd_ctxs. Then
people should debug with other means/hints to dig out the exact
culprit.
Alex Williamson April 18, 2023, 4:10 a.m. UTC | #76
On Tue, 18 Apr 2023 03:24:46 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Tuesday, April 18, 2023 4:07 AM
> > 
> > On Mon, 17 Apr 2023 16:31:56 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Mon, Apr 17, 2023 at 01:01:40PM -0600, Alex Williamson wrote:  
> > > > Yes, it's not trivial, but Jason is now proposing that we consider
> > > > mixing groups, cdevs, and multiple iommufd_ctxs as invalid.  I think
> > > > this means that regardless of which device calls INFO, there's only one
> > > > answer (assuming same set of devices opened, all cdev, all within same
> > > > iommufd_ctx).  Based on what I explained about my understanding of  
> > INFO2  
> > > > and Jason agreed to, I think the output would be:
> > > >
> > > > flags: NOT_RESETABLE | DEV_ID
> > > > {
> > > >   { valid devA-id,  devA-BDF },
> > > >   { valid devC-id,  devC-BDF },
> > > >   { valid devD-id,  devD-BDF },
> > > >   { invalid dev-id, devE-BDF },
> > > > }
> > > >
> > > > Here devB gets dropped because the kernel understands that devB is
> > > > unopened, affected, and owned.  It's therefore not a blocker for
> > > > hot-reset.  
> > >
> > > I don't think we want to drop anything because it makes the API
> > > ill suited for the debugging purpose.
> > >
> > > devb should be returned with an invalid dev_id if I understand your
> > > example. Maybe it should return with -1 as the dev_id instead of 0, to
> > > make the debugging a bit better.
> > >
> > > Userspace should look at only NOT_RESETTABLE to determine if it
> > > proceeds or not, and it should use the valid dev_id list to iterate
> > > over the devices it has open to do the config stuff.  
> > 
> > If an affected device is owned, not opened, and not interfering with
> > the reset, what is it adding to the API to report it for debugging
> > purposes?  I'm afraid this leads into expanding "invalid dev-id" into an  
> 
> consistent output before and after devB is opened.

In the case where devB is not opened, including it only provides
useless information.  In the case where devB is opened, it needs
to be reported as an opened, affected device.

> > errno or bitmap of error conditions that the user needs to parse.
> >   
> 
> Not exactly.
> 
> If RESETABLE invalid dev_id doesn't matter. The user only use the
> valid dev_id list to iterate as Jason pointed out.

Yes, but...

> If NOT_RESETTABLE due to devE not assigned to the VM one can
> easily figure out the fact by simply looking at the list of affected BDFs
> and the configuration of assigned devices of the VM. Then invalid
> dev_id also doesn't matter.

Huh?

Given:

flags: NOT_RESETABLE | DEV_ID
{
  { valid devA-id,  devA-BDF },
  { invalid dev-id, devB-BDF },
  { valid devC-id,  devC-BDF },
  { valid devD-id,  devD-BDF },
  { invalid dev-id, devE-BDF },
}

How does the user determine that devE is to blame and not devB based on
BDF?  The user cannot rely on sysfs for help, they don't know the IOMMU
grouping, nor do they know the BDF except as inferred by matching valid
dev-ids in the above output.
 
> If NOT_RESETTABLE while devE is already assigned to the VM then it's
> indication of mixing groups, cdevs or multiple iommufd_ctxs. Then
> people should debug with other means/hints to dig out the exact
> culprit.

I don't know what situation you're trying to explain here.  If devE
were opened within the same iommufd_ctx, this becomes:

flags: RESETABLE | DEV_ID
{
  { valid devA-id,  devA-BDF },
  { invalid dev-id, devB-BDF },
  { valid devC-id,  devC-BDF },
  { valid devD-id,  devD-BDF },
  { valid devE-id,  devE-BDF },
}

Yes, the user should only be looking at the flag to determine the
availability of hot-reset, (here's the but) but how is it consistent to
indicate both that hot-reset is available and include an invalid
dev-id?  The consistency as I propose is that an invalid dev-id is only
presented with NOT_RESETTABLE for the device blocking hot-reset.  In
the previous case, devB is not blocking reset and reporting an invalid
dev-id only serves to obfuscate determining the blocking device.

For the cases of affected group-opened devices or separate
iommufd_ctxs, the user gets invalid dev-ids for anything outside of
the calling device's iommufd_ctx.

We haven't discussed how it fails when called on a group-opened device
in a mixed environment.  I'd propose that the INFO ioctl behaves
exactly as it does today, reporting group-id and BDF for each affected
device.  However, the hot-reset ioctl itself is not extended to accept
devicefd because there is no proof-of-ownership model for cdevs.
Therefore even if the user could map group-id to devicefd, they get
-EINVAL calling HOT_RESET with a devicefd when the ioctl is called from
a group-opened device.  Thanks,

Alex
Tian, Kevin April 18, 2023, 5:02 a.m. UTC | #77
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Tuesday, April 18, 2023 12:11 PM
> 
> On Tue, 18 Apr 2023 03:24:46 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Tuesday, April 18, 2023 4:07 AM
> > >
> > > On Mon, 17 Apr 2023 16:31:56 -0300
> > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > > On Mon, Apr 17, 2023 at 01:01:40PM -0600, Alex Williamson wrote:
> > > > > Yes, it's not trivial, but Jason is now proposing that we consider
> > > > > mixing groups, cdevs, and multiple iommufd_ctxs as invalid.  I think
> > > > > this means that regardless of which device calls INFO, there's only one
> > > > > answer (assuming same set of devices opened, all cdev, all within
> same
> > > > > iommufd_ctx).  Based on what I explained about my understanding of
> > > INFO2
> > > > > and Jason agreed to, I think the output would be:
> > > > >
> > > > > flags: NOT_RESETABLE | DEV_ID
> > > > > {
> > > > >   { valid devA-id,  devA-BDF },
> > > > >   { valid devC-id,  devC-BDF },
> > > > >   { valid devD-id,  devD-BDF },
> > > > >   { invalid dev-id, devE-BDF },
> > > > > }
> > > > >
> > > > > Here devB gets dropped because the kernel understands that devB is
> > > > > unopened, affected, and owned.  It's therefore not a blocker for
> > > > > hot-reset.
> > > >
> > > > I don't think we want to drop anything because it makes the API
> > > > ill suited for the debugging purpose.
> > > >
> > > > devb should be returned with an invalid dev_id if I understand your
> > > > example. Maybe it should return with -1 as the dev_id instead of 0, to
> > > > make the debugging a bit better.
> > > >
> > > > Userspace should look at only NOT_RESETTABLE to determine if it
> > > > proceeds or not, and it should use the valid dev_id list to iterate
> > > > over the devices it has open to do the config stuff.
> > >
> > > If an affected device is owned, not opened, and not interfering with
> > > the reset, what is it adding to the API to report it for debugging
> > > purposes?  I'm afraid this leads into expanding "invalid dev-id" into an
> >
> > consistent output before and after devB is opened.
> 
> In the case where devB is not opened including it only provides
> useless information.  In the case where devB is opened it's necessary
> to be reported as an opened, affected device.
> 
> > > errno or bitmap of error conditions that the user needs to parse.
> > >
> >
> > Not exactly.
> >
> > If RESETABLE invalid dev_id doesn't matter. The user only use the
> > valid dev_id list to iterate as Jason pointed out.
> 
> Yes, but...
> 
> > If NOT_RESETTABLE due to devE not assigned to the VM one can
> > easily figure out the fact by simply looking at the list of affected BDFs
> > and the configuration of assigned devices of the VM. Then invalid
> > dev_id also doesn't matter.
> 
> Huh?
> 
> Given:
> 
> flags: NOT_RESETABLE | DEV_ID
> {
>   { valid devA-id,  devA-BDF },
>   { invalid dev-id, devB-BDF },
>   { valid devC-id,  devC-BDF },
>   { valid devD-id,  devD-BDF },
>   { invalid dev-id, devE-BDF },
> }
> 
> How does the user determine that devE is to blame and not devB based on
> BDF?  The user cannot rely on sysfs for help, they don't know the IOMMU
> grouping, nor do they know the BDF except as inferred by matching valid
> dev-ids in the above output.

Hmm, aren't we talking about the person who does the diagnosis? That
person will look at the VM configuration file and see that devA/B/C/D
have been assigned to the VM but not devE.

> 
> > If NOT_RESETTABLE while devE is already assigned to the VM then it's
> > indication of mixing groups, cdevs or multiple iommufd_ctxs. Then
> > people should debug with other means/hints to dig out the exact
> > culprit.
> 
> I don't know what situation you're trying to explain here.  If devE
> were opened within the same iommufd_ctx, this becomes:

It's about a scenario where the mgmt stack has assigned all affected
devices to QEMU, but QEMU itself messed it up with mixed group/cdev
or multiple iommufd_ctxs, thus hitting the NOT_RESETTABLE situation.

> 
> flags: RESETABLE | DEV_ID
> {
>   { valid devA-id,  devA-BDF },
>   { invalid dev-id, devB-BDF },
>   { valid devC-id,  devC-BDF },
>   { valid devD-id,  devD-BDF },
>   { valid devE-id,  devE-BDF },
> }
> 
> Yes, the user should only be looking at the flag to determine the
> availability of hot-reset, (here's the but) but how is it consistent to
> indicate both that hot-reset is available and include an invalid
> dev-id?  The consistency as I propose is that an invalid dev-id is only
> presented with NOT_RESETTABLE for the device blocking hot-reset.  In
> the previous case, devB is not blocking reset and reporting an invalid
> dev-id only serves to obfuscate determining the blocking device.
> 
> For the cases of affected group-opened devices or separate
> iommufd_ctxs, the user gets invalid dev-ids for anything outside of
> the calling device's iommufd_ctx.
> 
> We haven't discussed how it fails when called on a group-opened device
> in a mixed environment.  I'd propose that the INFO ioctl behaves
> exactly as it does today, reporting group-id and BDF for each affected
> device.  However, the hot-reset ioctl itself is not extended to accept
> devicefd because there is no proof-of-ownership model for cdevs.
> Therefore even if the user could map group-id to devicefd, they get
> -EINVAL calling HOT_RESET with a devicefd when the ioctl is called from
> a group-opened device.  Thanks,
> 

Yes, I chatted with Yi about it.

If the calling device of the INFO ioctl is opened by group, then it
behaves as it does today.

If the calling device is opened via cdev, then use the dev_id scheme as
discussed above.

In the hot_reset ioctl, the fd array only accepts group fds.

A cdev can be reset only via a null fd array.

One small open question remains: a null fd array could potentially work
for a group-opened device too if vfio-compat is used. In that case the
devices are in the same iommufd ctx with valid dev_ids even though they
are opened via group. But it's probably not worth blocking?
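
The rules summarized above can be sketched as a small argument check. This is only an illustration of the proposal as stated in this mail, not kernel code: every name below is hypothetical, and the allow_null_for_group parameter stands in for the open question about vfio-compat rather than a decided behavior.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types for illustration; none of these exist in the vfio uAPI. */
enum sketch_open_mode { SKETCH_OPEN_GROUP, SKETCH_OPEN_CDEV };
enum sketch_fd_kind { SKETCH_FD_GROUP, SKETCH_FD_DEVICE };

/*
 * Check the shape of a hot-reset request against the proposed rules.
 * Returns 0 if the request shape is acceptable, -EINVAL otherwise.
 * Ownership checks (group fds, iommufd_ctx) are out of scope here.
 */
static int sketch_check_hot_reset_args(enum sketch_open_mode caller,
				       const enum sketch_fd_kind *fds,
				       size_t count,
				       bool allow_null_for_group)
{
	size_t i;

	if (count == 0) {
		/* Null fd array: the only way to reset a cdev-opened device. */
		if (caller == SKETCH_OPEN_CDEV)
			return 0;
		/* Group-opened caller with a null array: the open question. */
		return allow_null_for_group ? 0 : -EINVAL;
	}

	/* A cdev-opened caller may not pass an fd array at all. */
	if (caller == SKETCH_OPEN_CDEV)
		return -EINVAL;

	/* The fd array only accepts group fds. */
	for (i = 0; i < count; i++) {
		if (fds[i] != SKETCH_FD_GROUP)
			return -EINVAL;
	}
	return 0;
}
```

Passing the vfio-compat case as a parameter keeps the sketch neutral on whether a group-opened device may use the null fd array.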
Yi Liu April 18, 2023, 10:23 a.m. UTC | #78
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, April 17, 2023 9:39 PM
> 
> On Fri, Apr 14, 2023 at 09:11:30AM +0000, Tian, Kevin wrote:
> 
> > The only corner case with this option is when a user mixes group
> > and cdev usages. iirc you mentioned it's a valid usage to be supported.
> > In that case the kernel doesn't have sufficient knowledge to judge
> > 'resettable' as it doesn't know which groups are opened by this user.
> 
> IMHO we don't need to support this combination.

Do you mean we don't support hot-reset for this combination, or we don't
support users using this combination? I guess the former. Right?

> 
> We can say that to use the hot reset API the user must put all their
> devices into the same iommufd_ctx and cover 100% of the known use
> cases for this.
> 
> There are already other situations, like nesting, that do force users
> to put everything into one iommufd_ctx.
> 
> No reason to make things harder and more complicated.

Ditto. We just fail hot-reset for the multiple iommufds case, right?
Otherwise, we need to prevent users from using multiple iommufds.

> I'm coming to the feeling that we should put no-iommu devices in
> iommufd_ctx's as well. They would be an iommufd_access like
> mdevs. That would clean up the complications they cause here.

OK, luckily you have merged the patch series that creates an
iommufd_access for emulated devices in bind. So the cdev series needs
to handle the noiommu case by creating an iommufd_access.

> 
> I suppose we should have done that from the beginning - no-iommu is an
> IOMMUFD access, it just uses a crazy /proc based way to learn the
> PFNs. Making it a proper access and making a real VFIO ioctl that
> calls iommufd_access_pin_pages() and returns the DMA mapped addresses
> to userspace would go a long way to making no-iommu work in a logical,
> usable, way.

This seems to be an improvement for noiommu mode. It can be done later.
For now, generating an access_id and binding noiommu devices to the
iommufd_ctx is enough to support noiommu hot-reset.

Regards,
Yi Liu
Yi Liu April 18, 2023, 10:34 a.m. UTC | #79
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Tuesday, April 18, 2023 12:11 PM
> 
[...]
>
> We haven't discussed how it fails when called on a group-opened device
> in a mixed environment.  I'd propose that the INFO ioctl behaves
> exactly as it does today, reporting group-id and BDF for each affected
> device.  However, the hot-reset ioctl itself is not extended to accept
> devicefd because there is no proof-of-ownership model for cdevs.
> Therefore even if the user could map group-id to devicefd, they get
> -EINVAL calling HOT_RESET with a devicefd when the ioctl is called from
> a group-opened device.  Thanks,

Would it be better to let userspace know that hot reset will fail for
lack of proof-of-ownership when it also has cdev devices? Maybe
the RESETTABLE flag should always be meaningful, even if the calling
device of _INFO is a group-opened device. Old user applications do not
need to check it, as they will never have such a mixed environment. But
new applications, or applications that have been updated to the latest
vfio uapi, should strictly check this flag before going ahead with the
hot-reset.

Regards,
Yi Liu
Jason Gunthorpe April 18, 2023, 12:57 p.m. UTC | #80
On Mon, Apr 17, 2023 at 02:06:42PM -0600, Alex Williamson wrote:
> On Mon, 17 Apr 2023 16:31:56 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Mon, Apr 17, 2023 at 01:01:40PM -0600, Alex Williamson wrote:
> > > Yes, it's not trivial, but Jason is now proposing that we consider
> > > mixing groups, cdevs, and multiple iommufd_ctxs as invalid.  I think
> > > this means that regardless of which device calls INFO, there's only one
> > > answer (assuming same set of devices opened, all cdev, all within same
> > > iommufd_ctx).  Based on what I explained about my understanding of INFO2
> > > and Jason agreed to, I think the output would be:
> > > 
> > > flags: NOT_RESETABLE | DEV_ID
> > > {
> > >   { valid devA-id,  devA-BDF },
> > >   { valid devC-id,  devC-BDF },
> > >   { valid devD-id,  devD-BDF },
> > >   { invalid dev-id, devE-BDF },
> > > }
> > > 
> > > Here devB gets dropped because the kernel understands that devB is
> > > unopened, affected, and owned.  It's therefore not a blocker for
> > > hot-reset.  
> > 
> > I don't think we want to drop anything because it makes the API
> > ill suited for the debugging purpose.
> > 
> > devb should be returned with an invalid dev_id if I understand your
> > example. Maybe it should return with -1 as the dev_id instead of 0, to
> > make the debugging a bit better.
> > 
> > Userspace should look at only NOT_RESETTABLE to determine if it
> > proceeds or not, and it should use the valid dev_id list to iterate
> > over the devices it has open to do the config stuff.
> 
> If an affected device is owned, not opened, and not interfering with
> the reset, what is it adding to the API to report it for debugging
> purposes?

It lets it print the entire group of devices, this is the only way
something can learn the actual list of all BDFs affected.

dev_id can just return 0, we don't need a complex bitmap. Userspace
looks at the flag, if !NOT_RESETABLE then it ignores dev_id=0.
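
The userspace side of this could look roughly like the sketch below. It is only an illustration of the flag-first rule described here: the flag bit, the struct layout, and the function name are all hypothetical and are not taken from the vfio uAPI.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag and entry layout, for illustration only. */
#define SKETCH_FLAG_NOT_RESETTABLE (1u << 0)

struct sketch_dev {
	uint32_t dev_id;	/* 0 means "no valid dev_id" in this sketch */
	uint16_t bus;
	uint8_t devfn;
};

/*
 * Copy the valid dev_ids into out[] (up to cap entries) and return how
 * many valid dev_ids were found. Returns -1 if the flags say hot-reset
 * is not available, in which case out[] is left untouched.
 */
static int sketch_collect_dev_ids(uint32_t flags,
				  const struct sketch_dev *devs, size_t n,
				  uint32_t *out, size_t cap)
{
	size_t i, count = 0;

	/* Only the flag decides whether to proceed. */
	if (flags & SKETCH_FLAG_NOT_RESETTABLE)
		return -1;

	for (i = 0; i < n; i++) {
		if (devs[i].dev_id == 0)
			continue;	/* affected but not opened by us: ignore */
		if (count < cap)
			out[count] = devs[i].dev_id;
		count++;
	}
	return (int)count;
}
```

The BDFs of the skipped entries are still available to the caller for printing the full set of affected devices.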

Jason
Jason Gunthorpe April 18, 2023, 12:59 p.m. UTC | #81
On Tue, Apr 18, 2023 at 05:02:44AM +0000, Tian, Kevin wrote:

> Yes I chatted with Yi about it.
> 
> If the calling device of the INFO ioctl is opened by group then behave
> as it does today.
> 
> If the calling device is opened via cdev then use dev_id scheme as
> discussed above.
> 
> in hot_reset ioctl the fd array only accepts group fd's.
> 
> cdev can be reset only via null fd array.

Agree
 
> It remains a small open that null fd array could potentially work for
> group-opened device too if vfio-compat is used. In that case devices
> are in same iommufd ctx with valid dev_id even though they are opened 
> via group. But probably it's not worthy blocking it?

IMHO not worth the complexity to block. Security is maintained if we
use an iommufd_ctx check.

Jason
Jason Gunthorpe April 18, 2023, 1:02 p.m. UTC | #82
On Tue, Apr 18, 2023 at 10:23:55AM +0000, Liu, Yi L wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Monday, April 17, 2023 9:39 PM
> > 
> > On Fri, Apr 14, 2023 at 09:11:30AM +0000, Tian, Kevin wrote:
> > 
> > > The only corner case with this option is when a user mixes group
> > > and cdev usages. iirc you mentioned it's a valid usage to be supported.
> > > In that case the kernel doesn't have sufficient knowledge to judge
> > > 'resettable' as it doesn't know which groups are opened by this user.
> > 
> > IMHO we don't need to support this combination.
> 
> Do you mean we don't support hot-reset for this combination or we don't
> support user using this combination. I guess the prior one. Right?

Yes

> Ditto. We just fail hot-reset for the multiple iommufds case. Is it?

Yes

> > I suppose we should have done that from the beginning - no-iommu is an
> > IOMMUFD access, it just uses a crazy /proc based way to learn the
> > PFNs. Making it a proper access and making a real VFIO ioctl that
> > calls iommufd_access_pin_pages() and returns the DMA mapped addresses
> > to userspace would go a long way to making no-iommu work in a logical,
> > usable, way.
> 
> This seems to be an improvement for noiommu mode. It can be done later.
> For now, generating access_id and binding noiommu devices with iommufdctx
> is enough for supporting noiommu hot-reset.

Yes, I'm not sure there is much value in improving no-iommu unless
someone also wants to go in and update dpdk.

At some point we will need to revise dpdk to use iommufd, maybe that
would be a good time to fix this too.

The point is that using an access is actually a logical and sensible
thing to do, not a hack to make hot reset work better.

Jason
Alex Williamson April 18, 2023, 4:44 p.m. UTC | #83
On Tue, 18 Apr 2023 05:02:44 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Tuesday, April 18, 2023 12:11 PM
> > 
> > On Tue, 18 Apr 2023 03:24:46 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > Sent: Tuesday, April 18, 2023 4:07 AM
> > > >
> > > > On Mon, 17 Apr 2023 16:31:56 -0300
> > > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > > >  
> > > > > On Mon, Apr 17, 2023 at 01:01:40PM -0600, Alex Williamson wrote:  
> > > > > > Yes, it's not trivial, but Jason is now proposing that we consider
> > > > > > mixing groups, cdevs, and multiple iommufd_ctxs as invalid.  I think
> > > > > > this means that regardless of which device calls INFO, there's only one
> > > > > > answer (assuming same set of devices opened, all cdev, all within  
> > same  
> > > > > > iommufd_ctx).  Based on what I explained about my understanding of  
> > > > INFO2  
> > > > > > and Jason agreed to, I think the output would be:
> > > > > >
> > > > > > flags: NOT_RESETABLE | DEV_ID
> > > > > > {
> > > > > >   { valid devA-id,  devA-BDF },
> > > > > >   { valid devC-id,  devC-BDF },
> > > > > >   { valid devD-id,  devD-BDF },
> > > > > >   { invalid dev-id, devE-BDF },
> > > > > > }
> > > > > >
> > > > > > Here devB gets dropped because the kernel understands that devB is
> > > > > > unopened, affected, and owned.  It's therefore not a blocker for
> > > > > > hot-reset.  
> > > > >
> > > > > I don't think we want to drop anything because it makes the API
> > > > > ill suited for the debugging purpose.
> > > > >
> > > > > devb should be returned with an invalid dev_id if I understand your
> > > > > example. Maybe it should return with -1 as the dev_id instead of 0, to
> > > > > make the debugging a bit better.
> > > > >
> > > > > Userspace should look at only NOT_RESETTABLE to determine if it
> > > > > proceeds or not, and it should use the valid dev_id list to iterate
> > > > > over the devices it has open to do the config stuff.  
> > > >
> > > > If an affected device is owned, not opened, and not interfering with
> > > > the reset, what is it adding to the API to report it for debugging
> > > > purposes?  I'm afraid this leads into expanding "invalid dev-id" into an  
> > >
> > > consistent output before and after devB is opened.  
> > 
> > In the case where devB is not opened including it only provides
> > useless information.  In the case where devB is opened it's necessary
> > to be reported as an opened, affected device.
> >   
> > > > errno or bitmap of error conditions that the user needs to parse.
> > > >  
> > >
> > > Not exactly.
> > >
> > > If RESETABLE invalid dev_id doesn't matter. The user only use the
> > > valid dev_id list to iterate as Jason pointed out.  
> > 
> > Yes, but...
> >   
> > > If NOT_RESETTABLE due to devE not assigned to the VM one can
> > > easily figure out the fact by simply looking at the list of affected BDFs
> > > and the configuration of assigned devices of the VM. Then invalid
> > > dev_id also doesn't matter.  
> > 
> > Huh?
> > 
> > Given:
> > 
> > flags: NOT_RESETABLE | DEV_ID
> > {
> >   { valid devA-id,  devA-BDF },
> >   { invalid dev-id, devB-BDF },
> >   { valid devC-id,  devC-BDF },
> >   { valid devD-id,  devD-BDF },
> >   { invalid dev-id, devE-BDF },
> > }
> > 
> > How does the user determine that devE is to blame and not devB based on
> > BDF?  The user cannot rely on sysfs for help, they don't know the IOMMU
> > grouping, nor do they know the BDF except as inferred by matching valid
> > dev-ids in the above output.  
> 
> emmm aren't we talking about the 'person' who does diagnostic? This guy
> will look at the VM configuration file to know that devA/B/C/D have been
> assigned to the VM but not devE.

Actually the scenario is that devA/C/D are assigned, devB is implicitly
owned, and it's devE that blocks the reset.  If you've followed any of
the community forums for vfio over the years, it should be readily
apparent that placing the burden solely on the end user to perform such
a diagnosis is an unreasonable expectation.

> > > If NOT_RESETTABLE while devE is already assigned to the VM then it's
> > > indication of mixing groups, cdevs or multiple iommufd_ctxs. Then
> > > people should debug with other means/hints to dig out the exact
> > > culprit.  
> > 
> > I don't know what situation you're trying to explain here.  If devE
> > were opened within the same iommufd_ctx, this becomes:  
> 
> It's about a scenario where the mgmt.. stack has assigned all affected
> devices to Qemu but Qemu itself messed it up with mixed group/cdev
> or multiple iommufd_ctx so hitting the NON_RESETTABLE situation.

Is this a reasonable scenario?  I expect the QEMU support to favor cdev
access where available and fd passing methods will only use cdev, so
QEMU should never mess up to create such an environment.  There should
never be a case where a device is exclusively available via group
rather than cdev.

> > flags: RESETABLE | DEV_ID
> > {
> >   { valid devA-id,  devA-BDF },
> >   { invalid dev-id, devB-BDF },
> >   { valid devC-id,  devC-BDF },
> >   { valid devD-id,  devD-BDF },
> >   { valid devE-id,  devE-BDF },
> > }
> > 
> > Yes, the user should only be looking at the flag to determine the
> > availability of hot-reset, (here's the but) but how is it consistent to
> > indicate both that hot-reset is available and include an invalid
> > dev-id?  The consistency as I propose is that an invalid dev-id is only
> > presented with NOT_RESETTABLE for the device blocking hot-reset.  In
> > the previous case, devB is not blocking reset and reporting an invalid
> > dev-id only serves to obfuscate determining the blocking device.
> > 
> > For the cases of affected group-opened devices or separate
> > iommufd_ctxs, the user gets invalid dev-ids for anything outside of
> > the calling device's iommufd_ctx.
> > 
> > We haven't discussed how it fails when called on a group-opened device
> > in a mixed environment.  I'd propose that the INFO ioctl behaves
> > exactly as it does today, reporting group-id and BDF for each affected
> > device.  However, the hot-reset ioctl itself is not extended to accept
> > devicefd because there is no proof-of-ownership model for cdevs.
> > Therefore even if the user could map group-id to devicefd, they get
> > -EINVAL calling HOT_RESET with a devicefd when the ioctl is called from
> > a group-opened device.  Thanks,
> >   
> 
> Yes I chatted with Yi about it.
> 
> If the calling device of the INFO ioctl is opened by group then behave
> as it does today.
> 
> If the calling device is opened via cdev then use dev_id scheme as
> discussed above.
> 
> in hot_reset ioctl the fd array only accepts group fd's.
> 
> cdev can be reset only via null fd array.
> 
> It remains a small open that null fd array could potentially work for
> group-opened device too if vfio-compat is used. In that case devices
> are in same iommufd ctx with valid dev_id even though they are opened 
> via group. But probably it's not worthy blocking it?

Yes, let's not create new models for the compatibility interface, stick
with group-opened = group-id = proof-of-ownership.  Thanks,

Alex
Alex Williamson April 18, 2023, 4:49 p.m. UTC | #84
On Tue, 18 Apr 2023 10:34:45 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Tuesday, April 18, 2023 12:11 PM
> >   
> [...]
> >
> > We haven't discussed how it fails when called on a group-opened device
> > in a mixed environment.  I'd propose that the INFO ioctl behaves
> > exactly as it does today, reporting group-id and BDF for each affected
> > device.  However, the hot-reset ioctl itself is not extended to accept
> > devicefd because there is no proof-of-ownership model for cdevs.
> > Therefore even if the user could map group-id to devicefd, they get
> > -EINVAL calling HOT_RESET with a devicefd when the ioctl is called from
> > a group-opened device.  Thanks,  
> 
> Will it be better to let userspace know it shall fail if invoking hot
> reset due to no proof-of-ownership as it also has cdev devices? Maybe
> the RESETTABLE flag should always be meaningful. Even if the calling
> device of _INFO is group-opened device. Old user applications does not
> need to check it as it will never have such mixed environment. But for
> new applications or the applications that have been updated per latest
> vfio uapi, it should strictly check this flag before going ahead to do
> hot-reset.

The group-opened model cannot consistently predict whether the user can
provide proof-of-ownership.  I don't think we should define a flag
simply because there's a case that we can predict; the definition of
that flag becomes problematic.  Let's not complicate the interface by
trying to optimize a case that will likely never exist in practice and
can be handled via the existing legacy API.  Thanks,

Alex
Alex Williamson April 18, 2023, 6:39 p.m. UTC | #85
On Tue, 18 Apr 2023 09:57:32 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Apr 17, 2023 at 02:06:42PM -0600, Alex Williamson wrote:
> > On Mon, 17 Apr 2023 16:31:56 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Mon, Apr 17, 2023 at 01:01:40PM -0600, Alex Williamson wrote:  
> > > > Yes, it's not trivial, but Jason is now proposing that we consider
> > > > mixing groups, cdevs, and multiple iommufd_ctxs as invalid.  I think
> > > > this means that regardless of which device calls INFO, there's only one
> > > > answer (assuming same set of devices opened, all cdev, all within same
> > > > iommufd_ctx).  Based on what I explained about my understanding of INFO2
> > > > and Jason agreed to, I think the output would be:
> > > > 
> > > > flags: NOT_RESETABLE | DEV_ID
> > > > {
> > > >   { valid devA-id,  devA-BDF },
> > > >   { valid devC-id,  devC-BDF },
> > > >   { valid devD-id,  devD-BDF },
> > > >   { invalid dev-id, devE-BDF },
> > > > }
> > > > 
> > > > Here devB gets dropped because the kernel understands that devB is
> > > > unopened, affected, and owned.  It's therefore not a blocker for
> > > > hot-reset.    
> > > 
> > > I don't think we want to drop anything because it makes the API
> > > ill suited for the debugging purpose.
> > > 
> > > devb should be returned with an invalid dev_id if I understand your
> > > example. Maybe it should return with -1 as the dev_id instead of 0, to
> > > make the debugging a bit better.
> > > 
> > > Userspace should look at only NOT_RESETTABLE to determine if it
> > > proceeds or not, and it should use the valid dev_id list to iterate
> > > over the devices it has open to do the config stuff.  
> > 
> > If an affected device is owned, not opened, and not interfering with
> > the reset, what is it adding to the API to report it for debugging
> > purposes?  
> 
> It lets it print the entire group of devices, this is the only way
> something can learn the actual list of all BDFs affected.

If we do so, userspace must be able to differentiate which devices are
blocking, which necessitates at least a bi-modal invalid dev-id.

> dev_id can just return 0, we don't need a complex bitmap. Userspace
> looks at the flag, if !NOT_RESETABLE then it ignores dev_id=0.

I'm having trouble with a succinct definition of dev-id == 0, is it "A
device affected by the hot-reset, which does not directly
contribute to the availability of the hot-reset, ex. an unopened device
within the same IOMMU group as an opened device (ie. this is not the
device responsible if hot-reset is unavailable).  Whereas dev-id < 0
(== -1) is an affected device which prevents hot-reset, ex. an un-owned
device, device configured within a different iommufd_ctx, or device
opened outside of the vfio cdev API."  Is that about right?  Thanks,
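
To make the proposed split concrete, here is a minimal sketch of how
userspace might classify a reported dev-id under the definitions above.
The HYP_* names are hypothetical, nothing like them exists in the uapi,
and it assumes "dev-id < 0" means exactly -1 viewed as a u32:

```c
#include <stdint.h>

/* Hypothetical values; the thread has not settled on uapi names. */
#define HYP_DEVID_NONBLOCKING	0u		/* affected, not a blocker */
#define HYP_DEVID_BLOCKING	UINT32_MAX	/* -1 as a u32, blocks reset */

enum hyp_devid_class {
	HYP_DEVID_CLASS_VALID,		/* opened by this user, same iommufd_ctx */
	HYP_DEVID_CLASS_NONBLOCKING,	/* e.g. unopened but owned via its group */
	HYP_DEVID_CLASS_BLOCKING,	/* e.g. un-owned or in another iommufd_ctx */
};

/* Map a reported dev-id to one of the three classes discussed above. */
static inline enum hyp_devid_class hyp_classify_devid(uint32_t dev_id)
{
	if (dev_id == HYP_DEVID_BLOCKING)
		return HYP_DEVID_CLASS_BLOCKING;
	if (dev_id == HYP_DEVID_NONBLOCKING)
		return HYP_DEVID_CLASS_NONBLOCKING;
	return HYP_DEVID_CLASS_VALID;
}
```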

Alex
Yi Liu April 20, 2023, 12:10 p.m. UTC | #86
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, April 19, 2023 2:39 AM
> 
> On Tue, 18 Apr 2023 09:57:32 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Mon, Apr 17, 2023 at 02:06:42PM -0600, Alex Williamson wrote:
> > > On Mon, 17 Apr 2023 16:31:56 -0300
> > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > > On Mon, Apr 17, 2023 at 01:01:40PM -0600, Alex Williamson wrote:
> > > > > Yes, it's not trivial, but Jason is now proposing that we consider
> > > > > mixing groups, cdevs, and multiple iommufd_ctxs as invalid.  I think
> > > > > this means that regardless of which device calls INFO, there's only one
> > > > > answer (assuming same set of devices opened, all cdev, all within same
> > > > > iommufd_ctx).  Based on what I explained about my understanding of INFO2
> > > > > and Jason agreed to, I think the output would be:
> > > > >
> > > > > flags: NOT_RESETABLE | DEV_ID
> > > > > {
> > > > >   { valid devA-id,  devA-BDF },
> > > > >   { valid devC-id,  devC-BDF },
> > > > >   { valid devD-id,  devD-BDF },
> > > > >   { invalid dev-id, devE-BDF },
> > > > > }
> > > > >
> > > > > Here devB gets dropped because the kernel understands that devB is
> > > > > unopened, affected, and owned.  It's therefore not a blocker for
> > > > > hot-reset.
> > > >
> > > > I don't think we want to drop anything because it makes the API
> > > > ill suited for the debugging purpose.
> > > >
> > > > devb should be returned with an invalid dev_id if I understand your
> > > > example. Maybe it should return with -1 as the dev_id instead of 0, to
> > > > make the debugging a bit better.
> > > >
> > > > Userspace should look at only NOT_RESETTABLE to determine if it
> > > > proceeds or not, and it should use the valid dev_id list to iterate
> > > > over the devices it has open to do the config stuff.
> > >
> > > If an affected device is owned, not opened, and not interfering with
> > > the reset, what is it adding to the API to report it for debugging
> > > purposes?
> >
> > It lets it print the entire group of devices, this is the only way
> > something can learn the actual list of all BDFs affected.
> 
> If we do so, userspace must be able to differentiate which devices are
> blocking, which necessitates at least a bi-modal invalid dev-id.
> 
> > dev_id can just return 0, we don't need a complex bitmap. Userspace
> > looks at the flag, if !NOT_RESETABLE then it ignores dev_id=0.
> 
> I'm having trouble with a succinct definition of dev-id == 0, is it "A
> device affected by the hot-reset, which does not directly
> contribute to the availability of the hot-reset, ex. an unopened device
> within the same IOMMU group as an opened device (ie. this is not the
> device responsible if hot-reset is unavailable). 

Hiding this device from the list looks fine to me. But the calling user
should not open any new device before finishing the hot-reset. Otherwise,
the user may miss a device that needs pre/post-reset handling. I think
this requirement is acceptable. Is it?

> Whereas dev-id < 0
> (== -1) is an affected device which prevents hot-reset, ex. an un-owned
> device, device configured within a different iommufd_ctx, or device
> opened outside of the vfio cdev API."  Is that about right?  Thanks,

Do you mean to have a separate error code for each of the three
possibilities? The devid is generated by iommufd and is a u32, so I'm not
sure we can define such error codes without reserving some ids in iommufd.

Regards,
Yi Liu
Alex Williamson April 20, 2023, 2:08 p.m. UTC | #87
On Thu, 20 Apr 2023 12:10:20 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, April 19, 2023 2:39 AM
> > 
> > On Tue, 18 Apr 2023 09:57:32 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Mon, Apr 17, 2023 at 02:06:42PM -0600, Alex Williamson wrote:  
> > > > On Mon, 17 Apr 2023 16:31:56 -0300
> > > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > > >  
> > > > > On Mon, Apr 17, 2023 at 01:01:40PM -0600, Alex Williamson wrote:  
> > > > > > Yes, it's not trivial, but Jason is now proposing that we consider
> > > > > > mixing groups, cdevs, and multiple iommufd_ctxs as invalid.  I think
> > > > > > this means that regardless of which device calls INFO, there's only one
> > > > > > answer (assuming same set of devices opened, all cdev, all within same
> > > > > > iommufd_ctx).  Based on what I explained about my understanding of INFO2
> > > > > > and Jason agreed to, I think the output would be:
> > > > > >
> > > > > > flags: NOT_RESETABLE | DEV_ID
> > > > > > {
> > > > > >   { valid devA-id,  devA-BDF },
> > > > > >   { valid devC-id,  devC-BDF },
> > > > > >   { valid devD-id,  devD-BDF },
> > > > > >   { invalid dev-id, devE-BDF },
> > > > > > }
> > > > > >
> > > > > > Here devB gets dropped because the kernel understands that devB is
> > > > > > unopened, affected, and owned.  It's therefore not a blocker for
> > > > > > hot-reset.  
> > > > >
> > > > > I don't think we want to drop anything because it makes the API
> > > > > ill suited for the debugging purpose.
> > > > >
> > > > > devb should be returned with an invalid dev_id if I understand your
> > > > > example. Maybe it should return with -1 as the dev_id instead of 0, to
> > > > > make the debugging a bit better.
> > > > >
> > > > > Userspace should look at only NOT_RESETTABLE to determine if it
> > > > > proceeds or not, and it should use the valid dev_id list to iterate
> > > > > over the devices it has open to do the config stuff.  
> > > >
> > > > If an affected device is owned, not opened, and not interfering with
> > > > the reset, what is it adding to the API to report it for debugging
> > > > purposes?  
> > >
> > > It lets it print the entire group of devices, this is the only way
> > > something can learn the actual list of all BDFs affected.  
> > 
> > If we do so, userspace must be able to differentiate which devices are
> > blocking, which necessitates at least a bi-modal invalid dev-id.
> >   
> > > dev_id can just return 0, we don't need a complex bitmap. Userspace
> > > looks at the flag, if !NOT_RESETABLE then it ignores dev_id=0.  
> > 
> > I'm having trouble with a succinct definition of dev-id == 0, is it "A
> > device affected by the hot-reset, which does not directly
> > contribute to the availability of the hot-reset, ex. an unopened device
> > within the same IOMMU group as an opened device (ie. this is not the
> > device responsible if hot-reset is unavailable).   
> 
> Hide this device in the list looks fine to me. But the calling user should
> not do any new device open before finishing hot-reset. Otherwise, user may
> miss a device that needs to do pre/post reset. I think this requirement is
> acceptable. Is it? 

I think Kevin and Jason are leaning towards reporting the entire
dev-set.  The INFO ioctl has always been a point-in-time reading, no
guarantees are made if the host or user configuration is changed.
Nothing changes in that respect.

> > Whereas dev-id < 0
> > (== -1) is an affected device which prevents hot-reset, ex. an un-owned
> > device, device configured within a different iommufd_ctx, or device
> > opened outside of the vfio cdev API."  Is that about right?  Thanks,  
> 
> Do you mean to have separate err-code for the three possibilities? As
> the devid is generated by iommufd and it is u32. I'm not sure if we can
> have such err-code definition without reserving some ids in iommufd. 

Yes, if we're going to report the full dev-set, I think we need at
least two unique error codes or else the user has no way to determine
the subset of invalid dev-ids which block the reset.  I think Jason is
proposing the set of valid dev-ids are >0, a dev-id of zero indicates
some form of non-blocking, while <0 (or maybe specifically -1)
indicates a blocking device.  I was trying to get consensus on a formal
definition of each of those error codes in my previous reply.  Thanks,

Alex
Jason Gunthorpe April 21, 2023, 10:35 p.m. UTC | #88
On Thu, Apr 20, 2023 at 08:08:39AM -0600, Alex Williamson wrote:

> > Hide this device in the list looks fine to me. But the calling user should
> > not do any new device open before finishing hot-reset. Otherwise, user may
> > miss a device that needs to do pre/post reset. I think this requirement is
> > acceptable. Is it? 
> 
> I think Kevin and Jason are leaning towards reporting the entire
> dev-set.  The INFO ioctl has always been a point-in-time reading, no
> guarantees are made if the host or user configuration is changed.
> Nothing changes in that respect.

Yeah, I think your point about the qemu community forums suggests we
should err toward having qemu provide a fully detailed debug report.
 
> > > Whereas dev-id < 0
> > > (== -1) is an affected device which prevents hot-reset, ex. an un-owned
> > > device, device configured within a different iommufd_ctx, or device
> > > opened outside of the vfio cdev API."  Is that about right?  Thanks,  
> > 
> > Do you mean to have separate err-code for the three possibilities? As
> > the devid is generated by iommufd and it is u32. I'm not sure if we can
> > have such err-code definition without reserving some ids in iommufd. 
> 
> Yes, if we're going to report the full dev-set, I think we need at
> least two unique error codes or else the user has no way to determine
> the subset of invalid dev-ids which block the reset.

If you think this is important to report, we should report 0 and -1,
and adjust the iommufd xarray allocator to reserve -1.

It depends what you want to show for the debugging.

eg if we have debugging where qemu dumps this table:

   BDF   In VM   iommu_group   Has VFIO driver   Has Kernel Driver

By also doing various sysfs probes based on the BDF, then the admin
action to remedy the situation is:

Make "Has VFIO driver = y" or "Has Kernel Driver = n" for every row in
the table to make the reset work.

And we don't need the distinction. Adding the 0/-1 lets you make a
useful table without doing any sysfs work.
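
As a rough sketch of what one row of such a table could look like when
built only from the INFO output: the struct below is a local mirror of
the segment/bus/devfn/dev_id fields, and the helper name, the column
labels, and the 0/-1 meanings are all assumptions taken from this
discussion, not an existing API.

```c
#include <stdint.h>
#include <stdio.h>

/* Local mirror of the fields reported per affected device (hypothetical). */
struct hyp_dep_dev {
	uint32_t dev_id;
	uint16_t segment;
	uint8_t bus;
	uint8_t devfn;	/* slot in bits 7:3, function in bits 2:0 */
};

/* Format "ssss:bb:dd.f  <state>" for one affected device. */
static int hyp_format_row(char *buf, size_t len, const struct hyp_dep_dev *d)
{
	const char *state;

	if (d->dev_id == UINT32_MAX)		/* assumed: -1 blocks the reset */
		state = "blocking";
	else if (d->dev_id == 0)		/* assumed: 0 is non-blocking */
		state = "non-blocking";
	else
		state = "opened";

	return snprintf(buf, len, "%04x:%02x:%02x.%x  %s",
			(unsigned int)d->segment, (unsigned int)d->bus,
			(unsigned int)((d->devfn >> 3) & 0x1f),
			(unsigned int)(d->devfn & 0x7), state);
}
```

The remaining columns (iommu_group, bound driver) would still need the
sysfs probes mentioned above.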

> I think Jason is proposing the set of valid dev-ids are >0, a dev-id
> of zero indicates some form of non-blocking, while <0 (or maybe
> specifically -1) indicates a blocking device.

Yes, 0 and -1 would be fine with those definitions. The only use of
the data is to add a 'blocking use of reset' column to the table
above.

Thanks,
Jason
Yi Liu April 23, 2023, 10:28 a.m. UTC | #89
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, April 18, 2023 9:02 PM
> 
> On Tue, Apr 18, 2023 at 10:23:55AM +0000, Liu, Yi L wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Monday, April 17, 2023 9:39 PM
> > >
> > > On Fri, Apr 14, 2023 at 09:11:30AM +0000, Tian, Kevin wrote:
> > >
> > > > The only corner case with this option is when a user mixes group
> > > > and cdev usages. iirc you mentioned it's a valid usage to be supported.
> > > > In that case the kernel doesn't have sufficient knowledge to judge
> > > > 'resettable' as it doesn't know which groups are opened by this user.
> > >
> > > IMHO we don't need to support this combination.
> >
> > Do you mean we don't support hot-reset for this combination or we don't
> > support user using this combination. I guess the prior one. Right?
> 
> Yes
> 
> > Ditto. We just fail hot-reset for the multiple iommufds case. Is it?
> 
> Yes
> 
> > > I suppose we should have done that from the beginning - no-iommu is an
> > > IOMMUFD access, it just uses a crazy /proc based way to learn the
> > > PFNs. Making it a proper access and making a real VFIO ioctl that
> > > calls iommufd_access_pin_pages() and returns the DMA mapped addresses
> > > to userspace would go a long way to making no-iommu work in a logical,
> > > usable, way.
> >
> > This seems to be an improvement for noiommu mode. It can be done later.
> > For now, generating access_id and binding noiommu devices with iommufdctx
> > is enough for supporting noiommu hot-reset.
> 
> Yes, I'm not sure there is much value in improving no-iommu unless
> someone also wants to go in and update dpdk.
> 
> At some point we will need to revise dpdk to use iommufd, maybe that
> would be a good time to fix this too.

This noiommu improvement would allow the user to attach an ioas to noiommu
devices, right? That could be done by calling iommufd_access_attach(). So
here is a quick question: in the cdev series, shall we allow the attachment
for noiommu? I think the noiommu improvement requires extra effort, so it is
not ready yet. If so, it seems I just need to fail the attachment for noiommu
devices. But once it is ready in the future, how can userspace know that
attach is allowed for noiommu devices? Will that be easy? Or should we just
make the attach a no-op that always succeeds for noiommu devices? Any
suggestions?

Regards,
Yi Liu
Yi Liu April 23, 2023, 2:46 p.m. UTC | #90
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Saturday, April 22, 2023 6:36 AM
> 
> On Thu, Apr 20, 2023 at 08:08:39AM -0600, Alex Williamson wrote:
> 
> > > Hide this device in the list looks fine to me. But the calling user should
> > > not do any new device open before finishing hot-reset. Otherwise, user may
> > > miss a device that needs to do pre/post reset. I think this requirement is
> > > acceptable. Is it?
> >
> > I think Kevin and Jason are leaning towards reporting the entire
> > dev-set.  The INFO ioctl has always been a point-in-time reading, no
> > guarantees are made if the host or user configuration is changed.
> > Nothing changes in that respect.
> 
> Yeah, I think your point about qemu community forums suggests we should
> err toward having qemu provide some fully detailed debug report.
> 
> > > > Whereas dev-id < 0
> > > > (== -1) is an affected device which prevents hot-reset, ex. an un-owned
> > > > device, device configured within a different iommufd_ctx, or device
> > > > opened outside of the vfio cdev API."  Is that about right?  Thanks,
> > >
> > > Do you mean to have separate err-code for the three possibilities? As
> > > the devid is generated by iommufd and it is u32. I'm not sure if we can
> > > have such err-code definition without reserving some ids in iommufd.
> >
> > Yes, if we're going to report the full dev-set, I think we need at
> > least two unique error codes or else the user has no way to determine
> > the subset of invalid dev-ids which block the reset.
> 
> If you think this is important to report we should report 0 and -1,
> and adjust the iommufd xarray allocator to reserve -1

Then the alloc range should be from 1 to 0xfffffffe, since 0 and
0xffffffff (-1 as a u32) would both be reserved.
 
> 
> It depends what you want to show for the debugging.
> 
> eg if we have debugging where qemu dumps this table:
> 
>    BDF   In VM   iommu_group   Has VFIO driver   Has Kernel Driver
> 
> By also doing various sysfs probes based on the BDF, then the admin
> action to remedy the situation is:
> 
> Make "Has VFIO driver = y" or "Has Kernel Driver = n" for every row in
> the table to make the reset work.
> 
> And we don't need the distinction. Adding the 0/-1 lets you make a
> useful table without doing any sysfs work.
>
> > I think Jason is proposing the set of valid dev-ids are >0, a dev-id
> > of zero indicates some form of non-blocking, while <0 (or maybe
> > specifically -1) indicates a blocking device.
> 
> Yes, 0 and -1 would be fine with those definitions. The only use of
> the data is to add a 'blocking use of reset' column to the table
> above.

Should -1 and 0 be defined in the uapi as well? If yes, it does not seem
easy to find proper names for them. Or should we just document them in the
vfio uapi header, saying -1 means blocking and 0 means
no-devid-but-not-blocking?

Regards,
Yi Liu
Jason Gunthorpe April 24, 2023, 5:38 p.m. UTC | #91
On Sun, Apr 23, 2023 at 10:28:58AM +0000, Liu, Yi L wrote:

> This noiommu improvement shall allow user to attach ioas to noiommu devices.
> is it? This may be done by calling iommufd_access_attach(). So there is a
> quick question. In the cdev series, shall we allow the attachment
> for noiommu?

Yes, I think we need to undo the decision we talked about earlier
where no-iommu would be asked for with a -1 iommufd.

All vfio_devices should have an iommufd_ctx when container is compiled
out.

You don't need to do anything with the ctx for no-iommu beyond demand
that userspace provide it.

Jason
Yi Liu April 26, 2023, 7:22 a.m. UTC | #92
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Thursday, April 20, 2023 10:09 PM
[...]
> > > Whereas dev-id < 0
> > > (== -1) is an affected device which prevents hot-reset, ex. an un-owned
> > > device, device configured within a different iommufd_ctx, or device
> > > opened outside of the vfio cdev API."  Is that about right?  Thanks,
> >
> > Do you mean to have separate err-code for the three possibilities? As
> > the devid is generated by iommufd and it is u32. I'm not sure if we can
> > have such err-code definition without reserving some ids in iommufd.
> 
> Yes, if we're going to report the full dev-set, I think we need at
> least two unique error codes or else the user has no way to determine
> the subset of invalid dev-ids which block the reset.  I think Jason is
> proposing the set of valid dev-ids are >0, a dev-id of zero indicates
> some form of non-blocking, while <0 (or maybe specifically -1)
> indicates a blocking device.  I was trying to get consensus on a formal
> definition of each of those error codes in my previous reply.  Thanks,

Seems like the RESETTABLE flag is not needed if we report -1 for the
devices that block hot-reset. Userspace can deduce whether the calling
device is resettable by checking if there is any -1 in the affected
device list.
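
For illustration, the deduction could be as small as the sketch below. It
assumes the blocking marker is -1 stored in a u32; the helper name is
hypothetical and the dev-ids are passed as a plain array for brevity.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return true if any affected device is reported as blocking, i.e. the
 * calling device would not be resettable under the proposed encoding.
 */
static bool hyp_reset_blocked(const uint32_t *dev_ids, size_t count)
{
	size_t i;

	for (i = 0; i < count; i++)
		if (dev_ids[i] == UINT32_MAX)	/* assumed: -1 marks a blocker */
			return true;
	return false;
}
```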

Regards,
Yi Liu
Alex Williamson April 26, 2023, 1:20 p.m. UTC | #93
On Wed, 26 Apr 2023 07:22:17 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Thursday, April 20, 2023 10:09 PM  
> [...]
> > > > Whereas dev-id < 0
> > > > (== -1) is an affected device which prevents hot-reset, ex. an un-owned
> > > > device, device configured within a different iommufd_ctx, or device
> > > > opened outside of the vfio cdev API."  Is that about right?  Thanks,  
> > >
> > > Do you mean to have separate err-code for the three possibilities? As
> > > the devid is generated by iommufd and it is u32. I'm not sure if we can
> > > have such err-code definition without reserving some ids in iommufd.  
> > 
> > Yes, if we're going to report the full dev-set, I think we need at
> > least two unique error codes or else the user has no way to determine
> > the subset of invalid dev-ids which block the reset.  I think Jason is
> > proposing the set of valid dev-ids are >0, a dev-id of zero indicates
> > some form of non-blocking, while <0 (or maybe specifically -1)
> > indicates a blocking device.  I was trying to get consensus on a formal
> > definition of each of those error codes in my previous reply.  Thanks,  
> 
> Seems like RESETTABLE flag is not needed if we report -1 for the devices
> that block hotreset. Userspace can deduce if the calling device is resettable
> or not by checking if there is any -1 in the affected device list.

There is some redundancy there, yes.  Given the desire for a null array
on the actual reset ioctl I assumed there would also be a desire to
streamline the info ioctl such that userspace isn't required to parse
the return array, for example maybe userspace isn't required to pass a
full buffer and can get the reset availability status from only the
header.  Of course it's still the responsibility of userspace to know
the extent of the reset.  Thanks,
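
As a sketch of the sizing side of this: today a caller can pass a
header-only buffer, get -ENOSPC back with count filled in, and then size
the full buffer from that count. The structs below are local mirrors of
the uapi layout in this patch and the helper name is hypothetical;
whether the header alone would also carry the reset availability is the
open question here, not something this code shows.

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirrors of the uapi structures (names are hypothetical). */
struct hyp_dep_dev {
	uint32_t id;		/* group_id or dev_id */
	uint16_t segment;
	uint8_t bus;
	uint8_t devfn;
};

struct hyp_hot_reset_info {
	uint32_t argsz;
	uint32_t flags;
	uint32_t count;
	struct hyp_dep_dev devices[];
};

/* Bytes needed to receive 'count' affected devices plus the header. */
static size_t hyp_info_size(uint32_t count)
{
	return sizeof(struct hyp_hot_reset_info) +
	       (size_t)count * sizeof(struct hyp_dep_dev);
}
```

A caller would set argsz to hyp_info_size(0) for the first call and to
hyp_info_size(count) for the second.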

Alex
Yi Liu April 26, 2023, 3:08 p.m. UTC | #94
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, April 26, 2023 9:20 PM
> 
> On Wed, 26 Apr 2023 07:22:17 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Thursday, April 20, 2023 10:09 PM
> > [...]
> > > > > Whereas dev-id < 0
> > > > > (== -1) is an affected device which prevents hot-reset, ex. an un-owned
> > > > > device, device configured within a different iommufd_ctx, or device
> > > > > opened outside of the vfio cdev API."  Is that about right?  Thanks,
> > > >
> > > > Do you mean to have separate err-code for the three possibilities? As
> > > > the devid is generated by iommufd and it is u32. I'm not sure if we can
> > > > have such err-code definition without reserving some ids in iommufd.
> > >
> > > Yes, if we're going to report the full dev-set, I think we need at
> > > least two unique error codes or else the user has no way to determine
> > > the subset of invalid dev-ids which block the reset.  I think Jason is
> > > proposing the set of valid dev-ids are >0, a dev-id of zero indicates
> > > some form of non-blocking, while <0 (or maybe specifically -1)
> > > indicates a blocking device.  I was trying to get consensus on a formal
> > > definition of each of those error codes in my previous reply.  Thanks,
> >
> > Seems like RESETTABLE flag is not needed if we report -1 for the devices
> > that block hotreset. Userspace can deduce if the calling device is resettable
> > or not by checking if there is any -1 in the affected device list.
> 
> There is some redundancy there, yes.  Given the desire for a null array
> on the actual reset ioctl I assumed there would also be a desire to
> streamline the info ioctl such that userspace isn't required to parse
> the return array, for example maybe userspace isn't required to pass a
> full buffer and can get the reset availability status from only the
> header.  Of course it's still the responsibility of userspace to know
> the extent of the reset.  Thanks,

I kept it and have sent a refreshed version of the hot-reset series.

Patch

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 19f5b075d70a..a5a7e148dce1 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -30,6 +30,7 @@ 
 #if IS_ENABLED(CONFIG_EEH)
 #include <asm/eeh.h>
 #endif
+#include <uapi/linux/iommufd.h>
 
 #include "vfio_pci_priv.h"
 
@@ -767,6 +768,20 @@  static int vfio_pci_get_irq_count(struct vfio_pci_core_device *vdev, int irq_typ
 	return 0;
 }
 
+static struct vfio_device *
+vfio_pci_find_device_in_devset(struct vfio_device_set *dev_set,
+			       struct pci_dev *pdev)
+{
+	struct vfio_device *cur;
+
+	lockdep_assert_held(&dev_set->lock);
+
+	list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
+		if (cur->dev == &pdev->dev)
+			return cur;
+	return NULL;
+}
+
 static int vfio_pci_count_devs(struct pci_dev *pdev, void *data)
 {
 	(*(int *)data)++;
@@ -776,13 +791,20 @@  static int vfio_pci_count_devs(struct pci_dev *pdev, void *data)
 struct vfio_pci_fill_info {
 	int max;
 	int cur;
+	bool require_devid;
+	struct iommufd_ctx *iommufd;
+	struct vfio_device_set *dev_set;
 	struct vfio_pci_dependent_device *devices;
 };
 
 static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
 {
 	struct vfio_pci_fill_info *fill = data;
+	struct vfio_device_set *dev_set = fill->dev_set;
 	struct iommu_group *iommu_group;
+	struct vfio_device *vdev;
+
+	lockdep_assert_held(&dev_set->lock);
 
 	if (fill->cur == fill->max)
 		return -EAGAIN; /* Something changed, try again */
@@ -791,7 +813,21 @@  static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
 	if (!iommu_group)
 		return -EPERM; /* Cannot reset non-isolated devices */
 
-	fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
+	if (fill->require_devid) {
+		/*
+		 * Report dev_id of the devices that are opened as cdev
+		 * and have the same iommufd with the fill->iommufd.
+		 * Otherwise, just fill IOMMUFD_INVALID_ID.
+		 */
+		vdev = vfio_pci_find_device_in_devset(dev_set, pdev);
+		if (vdev && vfio_device_cdev_opened(vdev) &&
+		    fill->iommufd == vfio_iommufd_physical_ictx(vdev))
+			vfio_iommufd_physical_devid(vdev, &fill->devices[fill->cur].dev_id);
+		else
+			fill->devices[fill->cur].dev_id = IOMMUFD_INVALID_ID;
+	} else {
+		fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
+	}
 	fill->devices[fill->cur].segment = pci_domain_nr(pdev->bus);
 	fill->devices[fill->cur].bus = pdev->bus->number;
 	fill->devices[fill->cur].devfn = pdev->devfn;
@@ -1230,17 +1266,27 @@  static int vfio_pci_ioctl_get_pci_hot_reset_info(
 		return -ENOMEM;
 
 	fill.devices = devices;
+	fill.dev_set = vdev->vdev.dev_set;
 
+	mutex_lock(&vdev->vdev.dev_set->lock);
+	if (vfio_device_cdev_opened(&vdev->vdev)) {
+		fill.require_devid = true;
+		fill.iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
+	}
 	ret = vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_fill_devs,
 					    &fill, slot);
+	mutex_unlock(&vdev->vdev.dev_set->lock);
 
 	/*
 	 * If a device was removed between counting and filling, we may come up
 	 * short of fill.max.  If a device was added, we'll have a return of
 	 * -EAGAIN above.
 	 */
-	if (!ret)
+	if (!ret) {
 		hdr.count = fill.cur;
+		if (fill.require_devid)
+			hdr.flags = VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID;
+	}
 
 reset_info_exit:
 	if (copy_to_user(arg, &hdr, minsz))
@@ -2346,12 +2392,10 @@  static bool vfio_dev_in_files(struct vfio_pci_core_device *vdev,
 static int vfio_pci_is_device_in_set(struct pci_dev *pdev, void *data)
 {
 	struct vfio_device_set *dev_set = data;
-	struct vfio_device *cur;
 
-	list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
-		if (cur->dev == &pdev->dev)
-			return 0;
-	return -EBUSY;
+	lockdep_assert_held(&dev_set->lock);
+
+	return vfio_pci_find_device_in_devset(dev_set, pdev) ? 0 : -EBUSY;
 }
 
 /*
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 25432ef213ee..5a34364e3b94 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -650,11 +650,32 @@  enum {
  * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 12,
  *					      struct vfio_pci_hot_reset_info)
  *
+ * This command is used to query the affected devices in the hot reset for
+ * a given device.  The user can use the information reported by this command
+ * to figure out the affected devices among the devices it has opened.
+ * This command always reports the segment, bus and devfn information for
+ * each affected device, and reports either the group_id or the dev_id
+ * depending on how the device being queried is opened.
+ *	- If the device is opened via the traditional group/container manner,
+ *	  this command reports the group_id for each affected device.
+ *
+ *	- If the device is opened as a cdev, this command needs to report
+ *	  dev_id for each affected device and set the
+ *	  VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID flag.  For the affected
+ *	  devices that are not opened as cdev or are bound to a different
+ *	  iommufd than the queried device, report an invalid dev_id to avoid
+ *	  potential dev_id conflicts, as dev_id is local to an iommufd.  For
+ *	  such affected devices, the user shall fall back to the segment, bus
+ *	  and devfn info to map them to opened devices.
+ *
  * Return: 0 on success, -errno on failure:
  *	-enospc = insufficient buffer, -enodev = unsupported for device.
  */
 struct vfio_pci_dependent_device {
-	__u32	group_id;
+	union {
+		__u32	group_id;
+		__u32	dev_id;
+	};
 	__u16	segment;
 	__u8	bus;
 	__u8	devfn; /* Use PCI_SLOT/PCI_FUNC */
@@ -663,6 +684,7 @@  struct vfio_pci_dependent_device {
 struct vfio_pci_hot_reset_info {
 	__u32	argsz;
 	__u32	flags;
+#define VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID	(1 << 0)
 	__u32	count;
 	struct vfio_pci_dependent_device	devices[];
 };
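
For readers following the uapi change, a minimal userspace-side sketch of
how the union might be consumed. The flag value mirrors
VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID from the hunk above; the struct is
a local mirror, the helper name is hypothetical, and the invalid id is
passed in as a parameter because its numeric value is not shown in this
patch.

```c
#include <stdint.h>

#define HYP_FLAG_IOMMUFD_DEV_ID	(1u << 0)	/* mirrors the new uapi flag */

/* Local mirror of struct vfio_pci_dependent_device after this patch. */
struct hyp_dep_dev {
	union {
		uint32_t group_id;
		uint32_t dev_id;
	};
	uint16_t segment;
	uint8_t bus;
	uint8_t devfn;
};

/*
 * Return the index in 'opened' whose dev_id matches the entry, or -1 if
 * the entry carries a group_id, an invalid dev_id, or no match, in which
 * case the caller falls back to matching on segment/bus/devfn.
 */
static int hyp_find_opened(uint32_t flags, const struct hyp_dep_dev *e,
			   const uint32_t *opened, int n, uint32_t invalid_id)
{
	int i;

	if (!(flags & HYP_FLAG_IOMMUFD_DEV_ID) || e->dev_id == invalid_id)
		return -1;

	for (i = 0; i < n; i++)
		if (opened[i] == e->dev_id)
			return i;
	return -1;
}
```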