Message ID | 20230401144429.88673-13-yi.l.liu@intel.com (mailing list archive) |
---|---|
State | New, archived |
Series | Introduce new methods for verifying ownership in vfio PCI hot reset |
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Saturday, April 1, 2023 10:44 PM

> @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
>  	if (!iommu_group)
>  		return -EPERM; /* Cannot reset non-isolated devices */

Hi Alex,

Is disabling iommu a sane way to test vfio noiommu mode? If no, just skip
the below contents.
On Mon, 3 Apr 2023 09:25:06 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Saturday, April 1, 2023 10:44 PM
>
> > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
> >  	if (!iommu_group)
> >  		return -EPERM; /* Cannot reset non-isolated devices */
>
> Hi Alex,
>
> Is disabling iommu a sane way to test vfio noiommu mode?

Yes

> I added intel_iommu=off to disable intel iommu and bind a device to vfio-pci.
> I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0. Bind
> iommufd==-1 can succeed, but failed to get hot reset info due to the above
> group check. Reason is that this happens to have some affected devices, and
> these devices have no valid iommu_group (because they are not bound to vfio-pci
> hence nobody allocates noiommu group for them). So when hot reset info loops
> such devices, it failed with -EPERM. Is this expected?

Hmm, I didn't recall that we put in such a limitation, but given the
minimally intrusive approach to no-iommu and the fact that we never
defined an invalid group ID to return to the user, it makes sense that
we just blocked the ioctl for no-iommu use. I guess we can do the same
for no-iommu cdev.

BTW, what does this series apply on? I'm assuming[1], but I don't see
a branch from Jason yet. Thanks,

Alex

[1]https://lore.kernel.org/all/20230327093351.44505-1-yi.l.liu@intel.com/
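To make the failure Yi reports easier to follow, here is a minimal, self-contained model of the walk (not kernel code). The `model_*` names and the `has_group` flag are hypothetical stand-ins for the slot/bus walk, `vfio_pci_fill_devs()` and the `iommu_group` lookup. The only behavior taken from the quoted hunk is the -EAGAIN and -EPERM checks; that the first non-zero return ends the walk is an assumption consistent with the report above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one PCI device on the affected slot or bus. */
struct model_dev {
	bool has_group;		/* false models a device with no iommu_group */
	unsigned int group_id;
};

struct model_fill {
	size_t max;
	size_t cur;
	unsigned int *group_ids;
};

/* Mirrors the quoted checks: -EAGAIN when full, -EPERM without a group. */
static int model_fill_dev(const struct model_dev *dev, struct model_fill *fill)
{
	if (fill->cur == fill->max)
		return -EAGAIN;
	if (!dev->has_group)
		return -EPERM;
	fill->group_ids[fill->cur++] = dev->group_id;
	return 0;
}

/* Visit every affected device; the first non-zero return aborts the walk. */
static int model_walk(const struct model_dev *devs, size_t n,
		      struct model_fill *fill)
{
	for (size_t i = 0; i < n; i++) {
		int ret = model_fill_dev(&devs[i], fill);

		if (ret)
			return ret;
	}
	return 0;
}
```

In this model a single affected device without a group is enough to fail the whole query, even when the queried device itself has a (noiommu) group.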
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Monday, April 3, 2023 11:02 PM
>
> On Mon, 3 Apr 2023 09:25:06 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
>
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Saturday, April 1, 2023 10:44 PM
> >
> > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
> > >  	if (!iommu_group)
> > >  		return -EPERM; /* Cannot reset non-isolated devices */
> >
> > Hi Alex,
> >
> > Is disabling iommu a sane way to test vfio noiommu mode?
>
> Yes
>
> > I added intel_iommu=off to disable intel iommu and bind a device to vfio-pci.
> > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0. Bind
> > iommufd==-1 can succeed, but failed to get hot reset info due to the above
> > group check. Reason is that this happens to have some affected devices, and
> > these devices have no valid iommu_group (because they are not bound to vfio-pci
> > hence nobody allocates noiommu group for them). So when hot reset info loops
> > such devices, it failed with -EPERM. Is this expected?
>
> Hmm, I didn't recall that we put in such a limitation, but given the
> minimally intrusive approach to no-iommu and the fact that we never
> defined an invalid group ID to return to the user, it makes sense that
> we just blocked the ioctl for no-iommu use. I guess we can do the same
> for no-iommu cdev.

sure.

>
> BTW, what does this series apply on? I'm assuming[1], but I don't see
> a branch from Jason yet. Thanks,

yes, this series is applied on [1]. I put the [1], this series and cdev series
in https://github.com/yiliu1765/iommufd/commits/vfio_device_cdev_v9.

Jason has taken [1] in the below branch. It is based on rc1. So I hesitated
to apply this series and cdev series on top of it. Maybe I should have done
it to make life easier.
On Sat, 1 Apr 2023 07:44:29 -0700
Yi Liu <yi.l.liu@intel.com> wrote:

> for the users that accept device fds passed from management stacks to be
> able to figure out the host reset affected devices among the devices
> opened by the user. This is needed as such users do not have BDF (bus,
> devfn) knowledge about the devices it has opened, hence unable to use
> the information reported by existing VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
> to figure out the affected devices.
>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 58 ++++++++++++++++++++++++++++----
>  include/uapi/linux/vfio.h        | 24 ++++++++++++-
>  2 files changed, 74 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 19f5b075d70a..a5a7e148dce1 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -30,6 +30,7 @@
>  #if IS_ENABLED(CONFIG_EEH)
>  #include <asm/eeh.h>
>  #endif
> +#include <uapi/linux/iommufd.h>
>
>  #include "vfio_pci_priv.h"
>
> @@ -767,6 +768,20 @@ static int vfio_pci_get_irq_count(struct vfio_pci_core_device *vdev, int irq_typ
>  	return 0;
>  }
>
> +static struct vfio_device *
> +vfio_pci_find_device_in_devset(struct vfio_device_set *dev_set,
> +			       struct pci_dev *pdev)
> +{
> +	struct vfio_device *cur;
> +
> +	lockdep_assert_held(&dev_set->lock);
> +
> +	list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
> +		if (cur->dev == &pdev->dev)
> +			return cur;
> +	return NULL;
> +}
> +
>  static int vfio_pci_count_devs(struct pci_dev *pdev, void *data)
>  {
>  	(*(int *)data)++;
> @@ -776,13 +791,20 @@ static int vfio_pci_count_devs(struct pci_dev *pdev, void *data)
>  struct vfio_pci_fill_info {
>  	int max;
>  	int cur;
> +	bool require_devid;
> +	struct iommufd_ctx *iommufd;
> +	struct vfio_device_set *dev_set;
>  	struct vfio_pci_dependent_device *devices;

Poor structure packing, move the bool to the end. Nit, maybe just name
it @devid.

>  };
>
>  static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
>  {
>  	struct vfio_pci_fill_info *fill = data;
> +	struct vfio_device_set *dev_set = fill->dev_set;
>  	struct iommu_group *iommu_group;
> +	struct vfio_device *vdev;
> +
> +	lockdep_assert_held(&dev_set->lock);
>
>  	if (fill->cur == fill->max)
>  		return -EAGAIN; /* Something changed, try again */
> @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
>  	if (!iommu_group)
>  		return -EPERM; /* Cannot reset non-isolated devices */
>
> -	fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
> +	if (fill->require_devid) {

Nit, @vdev could be scoped here.

> +		/*
> +		 * Report dev_id of the devices that are opened as cdev
> +		 * and have the same iommufd with the fill->iommufd.
> +		 * Otherwise, just fill IOMMUFD_INVALID_ID.
> +		 */
> +		vdev = vfio_pci_find_device_in_devset(dev_set, pdev);

I wish I had a better solution to this, but I don't.

> +		if (vdev && vfio_device_cdev_opened(vdev) &&
> +		    fill->iommufd == vfio_iommufd_physical_ictx(vdev))
> +			vfio_iommufd_physical_devid(vdev, &fill->devices[fill->cur].dev_id);

Long line, maybe a pointer to &fill->devices[fill->cur] would help.

> +		else
> +			fill->devices[fill->cur].dev_id = IOMMUFD_INVALID_ID;
> +	} else {
> +		fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
> +	}
>  	fill->devices[fill->cur].segment = pci_domain_nr(pdev->bus);
>  	fill->devices[fill->cur].bus = pdev->bus->number;
>  	fill->devices[fill->cur].devfn = pdev->devfn;
> @@ -1230,17 +1266,27 @@ static int vfio_pci_ioctl_get_pci_hot_reset_info(
>  		return -ENOMEM;
>
>  	fill.devices = devices;
> +	fill.dev_set = vdev->vdev.dev_set;
>
> +	mutex_lock(&vdev->vdev.dev_set->lock);
> +	if (vfio_device_cdev_opened(&vdev->vdev)) {
> +		fill.require_devid = true;
> +		fill.iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
> +	}

We can do this unconditionally:

	fill.devid = vfio_device_cdev_opened(&vdev->vdev);
	fill.iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);

Thanks,
Alex

>  	ret = vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_fill_devs,
>  					    &fill, slot);
> +	mutex_unlock(&vdev->vdev.dev_set->lock);
>
>  	/*
>  	 * If a device was removed between counting and filling, we may come up
>  	 * short of fill.max. If a device was added, we'll have a return of
>  	 * -EAGAIN above.
>  	 */
> -	if (!ret)
> +	if (!ret) {
>  		hdr.count = fill.cur;
> +		if (fill.require_devid)
> +			hdr.flags = VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID;
> +	}
>
>  reset_info_exit:
>  	if (copy_to_user(arg, &hdr, minsz))
> @@ -2346,12 +2392,10 @@ static bool vfio_dev_in_files(struct vfio_pci_core_device *vdev,
>  static int vfio_pci_is_device_in_set(struct pci_dev *pdev, void *data)
>  {
>  	struct vfio_device_set *dev_set = data;
> -	struct vfio_device *cur;
>
> -	list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
> -		if (cur->dev == &pdev->dev)
> -			return 0;
> -	return -EBUSY;
> +	lockdep_assert_held(&dev_set->lock);
> +
> +	return vfio_pci_find_device_in_devset(dev_set, pdev) ? 0 : -EBUSY;
>  }
>
>  /*
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 25432ef213ee..5a34364e3b94 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -650,11 +650,32 @@ enum {
>   * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 12,
>   *					      struct vfio_pci_hot_reset_info)
>   *
> + * This command is used to query the affected devices in the hot reset for
> + * a given device. User could use the information reported by this command
> + * to figure out the affected devices among the devices it has opened.
> + * This command always reports the segment, bus and devfn information for
> + * each affected device, and selectively report the group_id or the dev_id
> + * per the way how the device being queried is opened.
> + *	- If the device is opened via the traditional group/container manner,
> + *	  this command reports the group_id for each affected device.
> + *
> + *	- If the device is opened as a cdev, this command needs to report
> + *	  dev_id for each affected device and set the
> + *	  VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID flag. For the affected
> + *	  devices that are not opened as cdev or bound to different iommufds
> + *	  with the device that is queried, report an invalid dev_id to avoid
> + *	  potential dev_id conflict as dev_id is local to iommufd. For such
> + *	  affected devices, user shall fall back to use the segment, bus and
> + *	  devfn info to map it to opened device.
> + *
>   * Return: 0 on success, -errno on failure:
>   *	-enospc = insufficient buffer, -enodev = unsupported for device.
>   */
>  struct vfio_pci_dependent_device {
> -	__u32	group_id;
> +	union {
> +		__u32	group_id;
> +		__u32	dev_id;
> +	};
>  	__u16	segment;
>  	__u8	bus;
>  	__u8	devfn; /* Use PCI_SLOT/PCI_FUNC */
> @@ -663,6 +684,7 @@ struct vfio_pci_dependent_device {
>  struct vfio_pci_hot_reset_info {
>  	__u32	argsz;
>  	__u32	flags;
> +#define VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID	(1 << 0)
>  	__u32	count;
>  	struct vfio_pci_dependent_device	devices[];
>  };
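For readers following the review, the sketch below shows how the three nits (bool at the end and named `devid`, a local pointer to the current entry, no long line) might combine. It is a standalone model, not the patch: `model_vdev`, `model_fill_id()` and `MODEL_INVALID_ID` are hypothetical, the value chosen for the invalid ID is an assumption, and passing a NULL `vdev` stands in for a device that is not found in the dev_set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: placeholder value standing in for IOMMUFD_INVALID_ID. */
#define MODEL_INVALID_ID 0u

/* Hypothetical model of an opened vfio device; not the kernel struct. */
struct model_vdev {
	bool cdev_opened;
	const void *ictx;	/* iommufd context the device is bound to */
	uint32_t dev_id;
};

/* Local copy of the patched uAPI layout. */
struct model_dependent_device {
	union {
		uint32_t group_id;
		uint32_t dev_id;
	};
	uint16_t segment;
	uint8_t bus;
	uint8_t devfn;
};

struct model_fill_info {
	int max;
	int cur;
	const void *iommufd;
	struct model_dependent_device *devices;
	bool devid;		/* moved to the end and renamed, per the review */
};

/* Fill the ID field of the current entry, then advance to the next one. */
static void model_fill_id(struct model_fill_info *fill,
			  const struct model_vdev *vdev, uint32_t group_id)
{
	/* One pointer to the entry keeps every later line short. */
	struct model_dependent_device *info = &fill->devices[fill->cur];

	if (fill->devid) {
		if (vdev && vdev->cdev_opened && vdev->ictx == fill->iommufd)
			info->dev_id = vdev->dev_id;
		else
			info->dev_id = MODEL_INVALID_ID;
	} else {
		info->group_id = group_id;
	}
	fill->cur++;
}
```

Only a device opened as cdev and bound to the same context as the queried device gets a real `dev_id` here; everything else is reported with the invalid ID.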
Hi Yi,

On 4/1/23 16:44, Yi Liu wrote:
> for the users that accept device fds passed from management stacks to be
> able to figure out the host reset affected devices among the devices
> opened by the user. This is needed as such users do not have BDF (bus,
> devfn) knowledge about the devices it has opened, hence unable to use
> the information reported by existing VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
> to figure out the affected devices.
>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 58 ++++++++++++++++++++++++++++----
>  include/uapi/linux/vfio.h        | 24 ++++++++++++-
>  2 files changed, 74 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 19f5b075d70a..a5a7e148dce1 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -30,6 +30,7 @@
>  #if IS_ENABLED(CONFIG_EEH)
>  #include <asm/eeh.h>
>  #endif
> +#include <uapi/linux/iommufd.h>
>
>  #include "vfio_pci_priv.h"
>
> @@ -767,6 +768,20 @@ static int vfio_pci_get_irq_count(struct vfio_pci_core_device *vdev, int irq_typ
>  	return 0;
>  }
>
> +static struct vfio_device *
> +vfio_pci_find_device_in_devset(struct vfio_device_set *dev_set,
> +			       struct pci_dev *pdev)
> +{
> +	struct vfio_device *cur;
> +
> +	lockdep_assert_held(&dev_set->lock);
> +
> +	list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
> +		if (cur->dev == &pdev->dev)
> +			return cur;
> +	return NULL;
> +}
> +
>  static int vfio_pci_count_devs(struct pci_dev *pdev, void *data)
>  {
>  	(*(int *)data)++;
> @@ -776,13 +791,20 @@ static int vfio_pci_count_devs(struct pci_dev *pdev, void *data)
>  struct vfio_pci_fill_info {
>  	int max;
>  	int cur;
> +	bool require_devid;
> +	struct iommufd_ctx *iommufd;
> +	struct vfio_device_set *dev_set;
>  	struct vfio_pci_dependent_device *devices;
>  };
>
>  static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
>  {
>  	struct vfio_pci_fill_info *fill = data;
> +	struct vfio_device_set *dev_set = fill->dev_set;
>  	struct iommu_group *iommu_group;
> +	struct vfio_device *vdev;
> +
> +	lockdep_assert_held(&dev_set->lock);
>
>  	if (fill->cur == fill->max)
>  		return -EAGAIN; /* Something changed, try again */
> @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
>  	if (!iommu_group)
>  		return -EPERM; /* Cannot reset non-isolated devices */
>
> -	fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
> +	if (fill->require_devid) {
> +		/*
> +		 * Report dev_id of the devices that are opened as cdev
> +		 * and have the same iommufd with the fill->iommufd.
> +		 * Otherwise, just fill IOMMUFD_INVALID_ID.
> +		 */
> +		vdev = vfio_pci_find_device_in_devset(dev_set, pdev);
> +		if (vdev && vfio_device_cdev_opened(vdev) &&
> +		    fill->iommufd == vfio_iommufd_physical_ictx(vdev))
> +			vfio_iommufd_physical_devid(vdev, &fill->devices[fill->cur].dev_id);
> +		else
> +			fill->devices[fill->cur].dev_id = IOMMUFD_INVALID_ID;
> +	} else {
> +		fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
> +	}
>  	fill->devices[fill->cur].segment = pci_domain_nr(pdev->bus);
>  	fill->devices[fill->cur].bus = pdev->bus->number;
>  	fill->devices[fill->cur].devfn = pdev->devfn;
> @@ -1230,17 +1266,27 @@ static int vfio_pci_ioctl_get_pci_hot_reset_info(
>  		return -ENOMEM;
>
>  	fill.devices = devices;
> +	fill.dev_set = vdev->vdev.dev_set;
>
> +	mutex_lock(&vdev->vdev.dev_set->lock);
> +	if (vfio_device_cdev_opened(&vdev->vdev)) {
> +		fill.require_devid = true;
> +		fill.iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
> +	}
>  	ret = vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_fill_devs,
>  					    &fill, slot);
> +	mutex_unlock(&vdev->vdev.dev_set->lock);
>
>  	/*
>  	 * If a device was removed between counting and filling, we may come up
>  	 * short of fill.max. If a device was added, we'll have a return of
>  	 * -EAGAIN above.
>  	 */
> -	if (!ret)
> +	if (!ret) {
>  		hdr.count = fill.cur;
> +		if (fill.require_devid)
> +			hdr.flags = VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID;
> +	}
>
>  reset_info_exit:
>  	if (copy_to_user(arg, &hdr, minsz))
> @@ -2346,12 +2392,10 @@ static bool vfio_dev_in_files(struct vfio_pci_core_device *vdev,
>  static int vfio_pci_is_device_in_set(struct pci_dev *pdev, void *data)
>  {
>  	struct vfio_device_set *dev_set = data;
> -	struct vfio_device *cur;
>
> -	list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
> -		if (cur->dev == &pdev->dev)
> -			return 0;
> -	return -EBUSY;
> +	lockdep_assert_held(&dev_set->lock);
> +
> +	return vfio_pci_find_device_in_devset(dev_set, pdev) ? 0 : -EBUSY;
>  }
>
>  /*
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 25432ef213ee..5a34364e3b94 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -650,11 +650,32 @@ enum {
>   * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 12,
>   *					      struct vfio_pci_hot_reset_info)
>   *
> + * This command is used to query the affected devices in the hot reset for
> + * a given device. User could use the information reported by this command
> + * to figure out the affected devices among the devices it has opened.
> + * This command always reports the segment, bus and devfn information for
> + * each affected device, and selectively report the group_id or the dev_id
> + * per the way how the device being queried is opened.
> + *	- If the device is opened via the traditional group/container manner,
> + *	  this command reports the group_id for each affected device.
> + *
> + *	- If the device is opened as a cdev, this command needs to report

s/needs to report/reports

> + *	  dev_id for each affected device and set the
> + *	  VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID flag. For the affected
> + *	  devices that are not opened as cdev or bound to different iommufds
> + *	  with the device that is queried, report an invalid dev_id to avoid

s/bound to different iommufds with the device that is queried/bound to
iommufds different from the reset device one?

> + *	  potential dev_id conflict as dev_id is local to iommufd. For such
> + *	  affected devices, user shall fall back to use the segment, bus and
> + *	  devfn info to map it to opened device.
> + *
>   * Return: 0 on success, -errno on failure:
>   *	-enospc = insufficient buffer, -enodev = unsupported for device.
>   */
>  struct vfio_pci_dependent_device {
> -	__u32	group_id;
> +	union {
> +		__u32	group_id;
> +		__u32	dev_id;
> +	};
>  	__u16	segment;
>  	__u8	bus;
>  	__u8	devfn; /* Use PCI_SLOT/PCI_FUNC */
> @@ -663,6 +684,7 @@ struct vfio_pci_dependent_device {
>  struct vfio_pci_hot_reset_info {
>  	__u32	argsz;
>  	__u32	flags;
> +#define VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID	(1 << 0)
>  	__u32	count;
>  	struct vfio_pci_dependent_device	devices[];
>  };

Eric
Hi Eric,

> From: Eric Auger <eric.auger@redhat.com>
> Sent: Wednesday, April 5, 2023 8:20 PM
>
> Hi Yi,
> On 4/1/23 16:44, Yi Liu wrote:
> > for the users that accept device fds passed from management stacks to be
> > able to figure out the host reset affected devices among the devices
> > opened by the user. This is needed as such users do not have BDF (bus,
> > devfn) knowledge about the devices it has opened, hence unable to use
> > the information reported by existing VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
> > to figure out the affected devices.
> >
> > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> > ---
> >  drivers/vfio/pci/vfio_pci_core.c | 58 ++++++++++++++++++++++++++++----
> >  include/uapi/linux/vfio.h        | 24 ++++++++++++-
> >  2 files changed, 74 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> > index 19f5b075d70a..a5a7e148dce1 100644
> > --- a/drivers/vfio/pci/vfio_pci_core.c
> > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > @@ -30,6 +30,7 @@
> >  #if IS_ENABLED(CONFIG_EEH)
> >  #include <asm/eeh.h>
> >  #endif
> > +#include <uapi/linux/iommufd.h>
> >
> >  #include "vfio_pci_priv.h"
> >
> > @@ -767,6 +768,20 @@ static int vfio_pci_get_irq_count(struct vfio_pci_core_device *vdev, int irq_typ
> >  	return 0;
> >  }
> >
> > +static struct vfio_device *
> > +vfio_pci_find_device_in_devset(struct vfio_device_set *dev_set,
> > +			       struct pci_dev *pdev)
> > +{
> > +	struct vfio_device *cur;
> > +
> > +	lockdep_assert_held(&dev_set->lock);
> > +
> > +	list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
> > +		if (cur->dev == &pdev->dev)
> > +			return cur;
> > +	return NULL;
> > +}
> > +
> >  static int vfio_pci_count_devs(struct pci_dev *pdev, void *data)
> >  {
> >  	(*(int *)data)++;
> > @@ -776,13 +791,20 @@ static int vfio_pci_count_devs(struct pci_dev *pdev, void *data)
> >  struct vfio_pci_fill_info {
> >  	int max;
> >  	int cur;
> > +	bool require_devid;
> > +	struct iommufd_ctx *iommufd;
> > +	struct vfio_device_set *dev_set;
> >  	struct vfio_pci_dependent_device *devices;
> >  };
> >
> >  static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
> >  {
> >  	struct vfio_pci_fill_info *fill = data;
> > +	struct vfio_device_set *dev_set = fill->dev_set;
> >  	struct iommu_group *iommu_group;
> > +	struct vfio_device *vdev;
> > +
> > +	lockdep_assert_held(&dev_set->lock);
> >
> >  	if (fill->cur == fill->max)
> >  		return -EAGAIN; /* Something changed, try again */
> > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
> >  	if (!iommu_group)
> >  		return -EPERM; /* Cannot reset non-isolated devices */
> >
> > -	fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
> > +	if (fill->require_devid) {
> > +		/*
> > +		 * Report dev_id of the devices that are opened as cdev
> > +		 * and have the same iommufd with the fill->iommufd.
> > +		 * Otherwise, just fill IOMMUFD_INVALID_ID.
> > +		 */
> > +		vdev = vfio_pci_find_device_in_devset(dev_set, pdev);
> > +		if (vdev && vfio_device_cdev_opened(vdev) &&
> > +		    fill->iommufd == vfio_iommufd_physical_ictx(vdev))
> > +			vfio_iommufd_physical_devid(vdev, &fill->devices[fill->cur].dev_id);
> > +		else
> > +			fill->devices[fill->cur].dev_id = IOMMUFD_INVALID_ID;
> > +	} else {
> > +		fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
> > +	}
> >  	fill->devices[fill->cur].segment = pci_domain_nr(pdev->bus);
> >  	fill->devices[fill->cur].bus = pdev->bus->number;
> >  	fill->devices[fill->cur].devfn = pdev->devfn;
> > @@ -1230,17 +1266,27 @@ static int vfio_pci_ioctl_get_pci_hot_reset_info(
> >  		return -ENOMEM;
> >
> >  	fill.devices = devices;
> > +	fill.dev_set = vdev->vdev.dev_set;
> >
> > +	mutex_lock(&vdev->vdev.dev_set->lock);
> > +	if (vfio_device_cdev_opened(&vdev->vdev)) {
> > +		fill.require_devid = true;
> > +		fill.iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
> > +	}
> >  	ret = vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_fill_devs,
> >  					    &fill, slot);
> > +	mutex_unlock(&vdev->vdev.dev_set->lock);
> >
> >  	/*
> >  	 * If a device was removed between counting and filling, we may come up
> >  	 * short of fill.max. If a device was added, we'll have a return of
> >  	 * -EAGAIN above.
> >  	 */
> > -	if (!ret)
> > +	if (!ret) {
> >  		hdr.count = fill.cur;
> > +		if (fill.require_devid)
> > +			hdr.flags = VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID;
> > +	}
> >
> >  reset_info_exit:
> >  	if (copy_to_user(arg, &hdr, minsz))
> > @@ -2346,12 +2392,10 @@ static bool vfio_dev_in_files(struct vfio_pci_core_device *vdev,
> >  static int vfio_pci_is_device_in_set(struct pci_dev *pdev, void *data)
> >  {
> >  	struct vfio_device_set *dev_set = data;
> > -	struct vfio_device *cur;
> >
> > -	list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
> > -		if (cur->dev == &pdev->dev)
> > -			return 0;
> > -	return -EBUSY;
> > +	lockdep_assert_held(&dev_set->lock);
> > +
> > +	return vfio_pci_find_device_in_devset(dev_set, pdev) ? 0 : -EBUSY;
> >  }
> >
> >  /*
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 25432ef213ee..5a34364e3b94 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -650,11 +650,32 @@ enum {
> >   * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 12,
> >   *					      struct vfio_pci_hot_reset_info)
> >   *
> > + * This command is used to query the affected devices in the hot reset for
> > + * a given device. User could use the information reported by this command
> > + * to figure out the affected devices among the devices it has opened.
> > + * This command always reports the segment, bus and devfn information for
> > + * each affected device, and selectively report the group_id or the dev_id
> > + * per the way how the device being queried is opened.
> > + *	- If the device is opened via the traditional group/container manner,
> > + *	  this command reports the group_id for each affected device.
> > + *
> > + *	- If the device is opened as a cdev, this command needs to report
> s/needs to report/reports

got it.

> > + *	  dev_id for each affected device and set the
> > + *	  VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID flag. For the affected
> > + *	  devices that are not opened as cdev or bound to different iommufds
> > + *	  with the device that is queried, report an invalid dev_id to avoid
> s/bound to different iommufds with the device that is queried/bound to
> iommufds different from the reset device one?

hmmm, I'm not a native speaker here. This _INFO is to query if want
hot reset a given device, what devices would be affected. So it appears
the queried device is better. But I'd admit "the queried device" is also
"the reset device". may Alex help pick one.
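The userspace side of the proposed doc comment can be sketched as follows. This is a hedged illustration only: `sketch_opened` and `sketch_match()` are hypothetical bookkeeping, `SKETCH_INVALID_ID` is an assumed placeholder for the invalid dev_id, and the struct is a local copy of the patched layout rather than the real header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same bit as VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID in the quoted patch. */
#define SKETCH_FLAG_IOMMUFD_DEV_ID (1u << 0)
/* Assumption: placeholder value standing in for IOMMUFD_INVALID_ID. */
#define SKETCH_INVALID_ID 0u

/* Local copy of the patched layout of struct vfio_pci_dependent_device. */
struct sketch_dependent_device {
	union {
		uint32_t group_id;
		uint32_t dev_id;
	};
	uint16_t segment;
	uint8_t bus;
	uint8_t devfn;
};

/* Hypothetical userspace record for one device fd the user holds. */
struct sketch_opened {
	uint32_t dev_id;	/* dev_id obtained when the fd was bound */
	bool bdf_known;		/* false if the fd arrived without a BDF */
	uint16_t segment;
	uint8_t bus;
	uint8_t devfn;
};

/*
 * Map one reported entry to an opened device: by dev_id when a valid one
 * was reported, otherwise by segment/bus/devfn. Returns NULL if the entry
 * matches nothing the user has opened.
 */
static const struct sketch_opened *
sketch_match(uint32_t flags, const struct sketch_dependent_device *dep,
	     const struct sketch_opened *opened, size_t n)
{
	bool by_id = (flags & SKETCH_FLAG_IOMMUFD_DEV_ID) &&
		     dep->dev_id != SKETCH_INVALID_ID;

	for (size_t i = 0; i < n; i++) {
		const struct sketch_opened *o = &opened[i];

		if (by_id) {
			if (o->dev_id == dep->dev_id)
				return o;
		} else if (o->bdf_known && o->segment == dep->segment &&
			   o->bus == dep->bus && o->devfn == dep->devfn) {
			return o;
		}
	}
	return NULL;
}
```

A NULL result on the fallback path is the case the commit message worries about: a user that received its fds from a management stack has no BDF to compare against.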
On Wed, Apr 05, 2023 at 10:25:45AM -0600, Alex Williamson wrote:

> But that kind of brings to light the question of what does the user do
> when they encounter this situation.

What does it do now when it encounters a group_id it doesn't
understand? Userspace already doesn't know if the foreign group is
open or not, right?

> reset can complete. If the device is opened by a different user, the
> reset is blocked. The only logical conclusion is that the user should
> try the reset regardless of the result of the info ioctl, which the

IMHO my suggested version is still the overall saner uAPI.

An info that basically returns success/fail if reset is security
authorized and information about the reset groupings.

Actual reset follows the returned groupings automatically.

Easy for qemu. Call the info at startup to confirm reset can be
emulated, use the returned information to propagate the reset groups
to the guest. Trigger the reset with no fuss when the guest asks for
it.

Less weird corner cases.

Jason
On Wed, 5 Apr 2023 13:37:05 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Wed, Apr 05, 2023 at 10:25:45AM -0600, Alex Williamson wrote: > > > But that kind of brings to light the question of what does the user do > > when they encounter this situation. > > What does it do now when it encounters a group_id it doesn't > understand? Userspace already doesn't know if the foreign group is > open or not, right? It's simple, there is currently no screwiness around opened devices. If the caller doesn't own all the groups mapping to the affected devices, hot-reset is not available. > > reset can complete. If the device is opened by a different user, the > > reset is blocked. The only logical conclusion is that the user should > > try the reset regardless of the result of the info ioctl, which the > > IMHO my suggested version is still the overall saner uAPI. > > An info that basically returns success/fail if reset is security > authorized and information about the reset groupings. > > Actual reset follows the returned groupings automatically. > > Easy for qemu. Call the info at startup to confirm reset can be > emulated, use the returned information to propogate the reset groups > to the guest. Trigger the reset with no fuss when the guest asks for > it. > > Less weird corner cases. This leads to scenarios where the info ioctl indicates a hot-reset is initially available, perhaps only because one of the affected devices was not opened at the time, and now it fails when QEMU actually tries to use it. In the group model, QEMU can know the set of affected devices and the required groups, confirm it owns those, and for all practical purposes guarantee that a hot-reset is available (yes, there might be some exceptionally rare topology changes). This goofiness around unopened devices and null-arrays is killing this API. Thanks, Alex
On Wed, Apr 05, 2023 at 10:52:15AM -0600, Alex Williamson wrote:
> On Wed, 5 Apr 2023 13:37:05 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> > On Wed, Apr 05, 2023 at 10:25:45AM -0600, Alex Williamson wrote:
> >
> > > But that kind of brings to light the question of what does the user do
> > > when they encounter this situation.
> >
> > What does it do now when it encounters a group_id it doesn't
> > understand? Userspace already doesn't know if the foreign group is
> > open or not, right?
>
> It's simple, there is currently no screwiness around opened devices.
> If the caller doesn't own all the groups mapping to the affected
> devices, hot-reset is not available.

That still has nasty edge cases. If the reset group spans beyond a
single iommu group you end up with qemu being unable to operate reset
at all, and it is unfixable from an API perspective as we can't pass
in groups that VFIO isn't going to use.

I think you are right, the fact we'd have to return -1 dev_ids to this
modified API is pretty damaging, it doesn't seem like a good
direction.

> This leads to scenarios where the info ioctl indicates a hot-reset is
> initially available, perhaps only because one of the affected devices
> was not opened at the time, and now it fails when QEMU actually tries
> to use it.

I would like it if the APIs toward the kernel were only about the
kernel's security apparatus. It makes it easier to reason about the
kernel side and gives nice, simple, well-defined APIs.

This is a good point that qemu needs to make a policy decision if it
is happy about the VFIO configuration - but that is a policy decision
that should not become entangled with the kernel's security checks.

Today qemu can make this policy choice the same way it does right now
- call _INFO and check the group_ids. It gets the exact same outcome
as today. We already discussed that we need to expose the group ID
through an ioctl someplace.

If this is too awkward we could add a query to the kernel if the cdev
is "reset exclusive" - e.g. the iommufd covers all the groups that span
the reset set.

Jason
On 4/5/23 18:25, Alex Williamson wrote: > On Wed, 5 Apr 2023 14:04:51 +0000 > "Liu, Yi L" <yi.l.liu@intel.com> wrote: > >> Hi Eric, >> >>> From: Eric Auger <eric.auger@redhat.com> >>> Sent: Wednesday, April 5, 2023 8:20 PM >>> >>> Hi Yi, >>> On 4/1/23 16:44, Yi Liu wrote: >>>> for the users that accept device fds passed from management stacks to be >>>> able to figure out the host reset affected devices among the devices >>>> opened by the user. This is needed as such users do not have BDF (bus, >>>> devfn) knowledge about the devices it has opened, hence unable to use >>>> the information reported by existing VFIO_DEVICE_GET_PCI_HOT_RESET_INFO >>>> to figure out the affected devices. >>>> >>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com> >>>> --- >>>> drivers/vfio/pci/vfio_pci_core.c | 58 ++++++++++++++++++++++++++++---- >>>> include/uapi/linux/vfio.h | 24 ++++++++++++- >>>> 2 files changed, 74 insertions(+), 8 deletions(-) >>>> >>>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c >>>> index 19f5b075d70a..a5a7e148dce1 100644 >>>> --- a/drivers/vfio/pci/vfio_pci_core.c >>>> +++ b/drivers/vfio/pci/vfio_pci_core.c >>>> @@ -30,6 +30,7 @@ >>>> #if IS_ENABLED(CONFIG_EEH) >>>> #include <asm/eeh.h> >>>> #endif >>>> +#include <uapi/linux/iommufd.h> >>>> >>>> #include "vfio_pci_priv.h" >>>> >>>> @@ -767,6 +768,20 @@ static int vfio_pci_get_irq_count(struct >>> vfio_pci_core_device *vdev, int irq_typ >>>> return 0; >>>> } >>>> >>>> +static struct vfio_device * >>>> +vfio_pci_find_device_in_devset(struct vfio_device_set *dev_set, >>>> + struct pci_dev *pdev) >>>> +{ >>>> + struct vfio_device *cur; >>>> + >>>> + lockdep_assert_held(&dev_set->lock); >>>> + >>>> + list_for_each_entry(cur, &dev_set->device_list, dev_set_list) >>>> + if (cur->dev == &pdev->dev) >>>> + return cur; >>>> + return NULL; >>>> +} >>>> + >>>> static int vfio_pci_count_devs(struct pci_dev *pdev, void *data) >>>> { >>>> (*(int *)data)++; >>>> @@ -776,13 +791,20 @@ static int 
vfio_pci_count_devs(struct pci_dev *pdev, void >>> *data) >>>> struct vfio_pci_fill_info { >>>> int max; >>>> int cur; >>>> + bool require_devid; >>>> + struct iommufd_ctx *iommufd; >>>> + struct vfio_device_set *dev_set; >>>> struct vfio_pci_dependent_device *devices; >>>> }; >>>> >>>> static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data) >>>> { >>>> struct vfio_pci_fill_info *fill = data; >>>> + struct vfio_device_set *dev_set = fill->dev_set; >>>> struct iommu_group *iommu_group; >>>> + struct vfio_device *vdev; >>>> + >>>> + lockdep_assert_held(&dev_set->lock); >>>> >>>> if (fill->cur == fill->max) >>>> return -EAGAIN; /* Something changed, try again */ >>>> @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void >>> *data) >>>> if (!iommu_group) >>>> return -EPERM; /* Cannot reset non-isolated devices */ >>>> >>>> - fill->devices[fill->cur].group_id = iommu_group_id(iommu_group); >>>> + if (fill->require_devid) { >>>> + /* >>>> + * Report dev_id of the devices that are opened as cdev >>>> + * and have the same iommufd with the fill->iommufd. >>>> + * Otherwise, just fill IOMMUFD_INVALID_ID. 
>>>> + */ >>>> + vdev = vfio_pci_find_device_in_devset(dev_set, pdev); >>>> + if (vdev && vfio_device_cdev_opened(vdev) && >>>> + fill->iommufd == vfio_iommufd_physical_ictx(vdev)) >>>> + vfio_iommufd_physical_devid(vdev, &fill->devices[fill- >>>> cur].dev_id); >>>> + else >>>> + fill->devices[fill->cur].dev_id = IOMMUFD_INVALID_ID; >>>> + } else { >>>> + fill->devices[fill->cur].group_id = iommu_group_id(iommu_group); >>>> + } >>>> fill->devices[fill->cur].segment = pci_domain_nr(pdev->bus); >>>> fill->devices[fill->cur].bus = pdev->bus->number; >>>> fill->devices[fill->cur].devfn = pdev->devfn; >>>> @@ -1230,17 +1266,27 @@ static int vfio_pci_ioctl_get_pci_hot_reset_info( >>>> return -ENOMEM; >>>> >>>> fill.devices = devices; >>>> + fill.dev_set = vdev->vdev.dev_set; >>>> >>>> + mutex_lock(&vdev->vdev.dev_set->lock); >>>> + if (vfio_device_cdev_opened(&vdev->vdev)) { >>>> + fill.require_devid = true; >>>> + fill.iommufd = vfio_iommufd_physical_ictx(&vdev->vdev); >>>> + } >>>> ret = vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_fill_devs, >>>> &fill, slot); >>>> + mutex_unlock(&vdev->vdev.dev_set->lock); >>>> >>>> /* >>>> * If a device was removed between counting and filling, we may come up >>>> * short of fill.max. If a device was added, we'll have a return of >>>> * -EAGAIN above. 
>>>> */
>>>> - if (!ret)
>>>> + if (!ret) {
>>>> hdr.count = fill.cur;
>>>> + if (fill.require_devid)
>>>> + hdr.flags = VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID;
>>>> + }
>>>>
>>>> reset_info_exit:
>>>> if (copy_to_user(arg, &hdr, minsz))
>>>> @@ -2346,12 +2392,10 @@ static bool vfio_dev_in_files(struct vfio_pci_core_device *vdev,
>>>> static int vfio_pci_is_device_in_set(struct pci_dev *pdev, void *data)
>>>> {
>>>> struct vfio_device_set *dev_set = data;
>>>> - struct vfio_device *cur;
>>>>
>>>> - list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
>>>> - if (cur->dev == &pdev->dev)
>>>> - return 0;
>>>> - return -EBUSY;
>>>> + lockdep_assert_held(&dev_set->lock);
>>>> +
>>>> + return vfio_pci_find_device_in_devset(dev_set, pdev) ? 0 : -EBUSY;
>>>> }
>>>>
>>>> /*
>>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>>> index 25432ef213ee..5a34364e3b94 100644
>>>> --- a/include/uapi/linux/vfio.h
>>>> +++ b/include/uapi/linux/vfio.h
>>>> @@ -650,11 +650,32 @@ enum {
>>>> * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 12,
>>>> * struct vfio_pci_hot_reset_info)
>>>> *
>>>> + * This command is used to query the affected devices in the hot reset for
>>>> + * a given device. User could use the information reported by this command
>>>> + * to figure out the affected devices among the devices it has opened.

The 'opened' terminology does not look sufficient here, because it is
not only a matter of the device being opened using cdev: it also needs
to have been bound to an iommufd, the dev_id being the output of the
dev-iommufd binding.

By the way, I am now confused. What happens if the reset impacts some
devices that are not bound to an iommu ctx? Previously we returned the
iommu group, which always pre-exists, but now you will report an
invalid id?

>>>> + * This command always reports the segment, bus and devfn information for
>>>> + * each affected device, and selectively report the group_id or the dev_id
>>>> + * per the way how the device being queried is opened.
>>>> + * - If the device is opened via the traditional group/container manner,
>>>> + * this command reports the group_id for each affected device.
>>>> + *
>>>> + * - If the device is opened as a cdev, this command needs to report
>>> s/needs to report/reports
>> got it.
>>
>>>> + * dev_id for each affected device and set the
>>>> + * VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID flag. For the affected
>>>> + * devices that are not opened as cdev or bound to different iommufds
>>>> + * with the device that is queried, report an invalid dev_id to avoid

or not bound at all

>>> s/bound to different iommufds with the device that is queried/bound to
>>> iommufds different from the reset device one?
>> hmmm, I'm not a native speaker here. This _INFO is to query if want
>> hot reset a given device, what devices would be affected. So it appears
>> the queried device is better. But I'd admit "the queried device" is also
>> "the reset device". may Alex help pick one.
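The dev_id reporting rule in the quoted uAPI comment can be illustrated from the userspace side. The following is only a sketch under stated assumptions: the entry layout, the `SKETCH_INVALID_DEV_ID` constant and both helper names are hypothetical stand-ins, not the definitions from the series, and the real value of `IOMMUFD_INVALID_ID` is not shown in this thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for IOMMUFD_INVALID_ID; the real value is not shown here. */
#define SKETCH_INVALID_DEV_ID UINT32_MAX

/* Hypothetical mirror of one affected-device entry when dev_ids are reported. */
struct sketch_dependent_device {
	uint32_t dev_id;   /* valid only for devices bound to the caller's iommufd */
	uint16_t segment;
	uint8_t bus;
	uint8_t devfn;
};

/* Count affected devices that were reported with an invalid dev_id. */
static size_t sketch_count_unowned(const struct sketch_dependent_device *devs,
				   size_t count)
{
	size_t unowned = 0;

	for (size_t i = 0; i < count; i++)
		if (devs[i].dev_id == SKETCH_INVALID_DEV_ID)
			unowned++;
	return unowned;
}

/* True when every affected device carries a valid dev_id. */
static bool sketch_all_owned(const struct sketch_dependent_device *devs,
			     size_t count)
{
	return sketch_count_unowned(devs, count) == 0;
}
```

A caller that sees any invalid dev_id cannot tell from this data alone whether the device is unopened, opened through a group, or bound to another iommufd, which is the ambiguity the replies in this thread argue about.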
On Wed, 5 Apr 2023 14:23:43 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Apr 05, 2023 at 10:52:15AM -0600, Alex Williamson wrote:
> > On Wed, 5 Apr 2023 13:37:05 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > > On Wed, Apr 05, 2023 at 10:25:45AM -0600, Alex Williamson wrote:
> > >
> > > > But that kind of brings to light the question of what does the user do
> > > > when they encounter this situation.
> > >
> > > What does it do now when it encounters a group_id it doesn't
> > > understand? Userspace already doesn't know if the foreign group is
> > > open or not, right?
> >
> > It's simple, there is currently no screwiness around opened devices.
> > If the caller doesn't own all the groups mapping to the affected
> > devices, hot-reset is not available.
>
> That still has nasty edge cases. If the reset group spans beyond a
> single iommu group you end up with qemu being unable to operate reset
> at all, and it is unfixable from an API perspective as we can't pass
> in groups that VFIO isn't going to use.

Hmm, s/nasty/niche/? Yes, QEMU currently has no way to own a group
without assigning a device from the group, but technically that could
be fixed within QEMU. If QEMU doesn't own that affected group, then it
can't very well count on that group to not be used in some other way
when it comes time to actually do a hot-reset.

> I think you are right, the fact we'd have to return -1 dev_ids to this
> modified API is pretty damaging, it doesn't seem like a good
> direction.
>
> > This leads to scenarios where the info ioctl indicates a hot-reset is
> > initially available, perhaps only because one of the affected devices
> > was not opened at the time, and now it fails when QEMU actually tries
> > to use it.
>
> I would like it if the APIs toward the kernel were only about the
> kernel's security apparatus. It is makes it easier to reason about the
> kernel side and gives nice simple well defined APIs.

Usability needs to be a consideration as well. An interface where the
result is effectively arbitrary from a user perspective because the
kernel is solely focused on whether the operation is allowed,
evaluating constraints that the user is unaware of and cannot control,
is unusable.

> This is a good point that qemu needs to make a policy decision if it
> is happy about the VFIO configuration - but that is a policy decision
> that should not become entangled with the kernel's security checks.
>
> Today qemu can make this policy choice the same way it does right now
> - call _INFO and check the group_ids. It gets the exact same outcome
> as today. We already discussed that we need to expose the group ID
> through an ioctl someplace.

QEMU can make a policy decision today because the kernel provides a
sufficiently reliable interface, i.e. based on the set of owned groups, a
hot-reset is all but guaranteed to work. If we focus only on whether a
given reset is allowed from a kernel perspective and ignore that
userspace needs some predictability of the kernel behavior, then QEMU
cannot reasonably make that policy decision.

> If this is too awkward we could add a query to the kernel if the cdev
> is "reset exclusive" - eg the iommufd covers all the groups that span
> the reset set.

That's essentially what we have if there are valid dev-ids for each
affected device in the info ioctl. I don't think it helps the user
experience to create loopholes where the hot-reset ioctl can still work
in spite of those missing devices. The group interface uses the fact
that ownership of the group implies ownership of all devices within the
group such that the user only needs to prove group ownership.

But we still have underlying groups even with the cdev model, with the
same ownership principles, so don't we just need to prove group
ownership based on a device fd rather than a group fd?

For example, we have a VFIO_DEVICE_GET_INFO ioctl that supports
capability chains, we could add a capability that reports the group ID
for the device. The hot-reset info ioctl remains as it is today,
reporting group-ids and bdfs. The hot-reset ioctl itself is modified to
transparently support either group fds or device fds. The user can now
map cdevs to group-ids and therefore follow the same rules as groups,
providing at least one representative device fd for each group. We've
essentially already enabled this by allowing the limit of user-provided
fds equal to the number of affected devices.

Does that work? Thanks,

Alex
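The ownership rule proposed in the message above (one representative device fd per affected group) might be checked in userspace roughly as follows. This is a sketch, not an existing API: it assumes the caller has already collected the group IDs of the affected devices from the info ioctl and the group IDs of its own device fds from the suggested (not yet existing) DEVICE_GET_INFO capability. Both function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if group_id is among the groups the user holds a device fd for. */
static bool sketch_group_owned(uint32_t group_id, const uint32_t *owned,
			       size_t n_owned)
{
	for (size_t i = 0; i < n_owned; i++)
		if (owned[i] == group_id)
			return true;
	return false;
}

/*
 * True if every group reported for the affected devices is covered by at
 * least one representative device fd held by the user.
 */
static bool sketch_hot_reset_available(const uint32_t *affected_groups,
				       size_t n_affected,
				       const uint32_t *owned_groups,
				       size_t n_owned)
{
	for (size_t i = 0; i < n_affected; i++)
		if (!sketch_group_owned(affected_groups[i], owned_groups,
					n_owned))
			return false;
	return true;
}
```

The check is the same one a group-based user performs today; only the source of the owned group IDs changes.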
On Wed, 5 Apr 2023 12:56:21 -0600 Alex Williamson <alex.williamson@redhat.com> wrote: > On Wed, 5 Apr 2023 14:23:43 -0300 > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > On Wed, Apr 05, 2023 at 10:52:15AM -0600, Alex Williamson wrote: > > > On Wed, 5 Apr 2023 13:37:05 -0300 > > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > > > On Wed, Apr 05, 2023 at 10:25:45AM -0600, Alex Williamson wrote: > > > > > > > > > But that kind of brings to light the question of what does the user do > > > > > when they encounter this situation. > > > > > > > > What does it do now when it encounters a group_id it doesn't > > > > understand? Userspace already doesn't know if the foreign group is > > > > open or not, right? > > > > > > It's simple, there is currently no screwiness around opened devices. > > > If the caller doesn't own all the groups mapping to the affected > > > devices, hot-reset is not available. > > > > That still has nasty edge cases. If the reset group spans beyond a > > single iommu group you end up with qemu being unable to operate reset > > at all, and it is unfixable from an API perspective as we can't pass > > in groups that VFIO isn't going to use. > > Hmm, s/nasty/niche/? Yes, QEMU currently has no way to own a group > without assigning a device from the group, but technically that could > be fixed within QEMU. If QEMU doesn't own that affected group, then it > can't very well count on that group to not be used in some other way > when it comes time to actually do a hot-reset. > > > I think you are right, the fact we'd have to return -1 dev_ids to this > > modified API is pretty damaging, it doesn't seem like a good > > direction. > > > > > This leads to scenarios where the info ioctl indicates a hot-reset is > > > initially available, perhaps only because one of the affected devices > > > was not opened at the time, and now it fails when QEMU actually tries > > > to use it. 
> > > > I would like it if the APIs toward the kernel were only about the > > kernel's security apparatus. It is makes it easier to reason about the > > kernel side and gives nice simple well defined APIs. > > Usability needs to be a consideration as well. An interface where the > result is effectively arbitrary from a user perspective because the > kernel is solely focused on whether the operation is allowed, > evaluating constraints that the user is unaware of and cannot control, > is unusable. > > > This is a good point that qemu needs to make a policy decision if it > > is happy about the VFIO configuration - but that is a policy decision > > that should not become entangled with the kernel's security checks. > > > > Today qemu can make this policy choice the same way it does right now > > - call _INFO and check the group_ids. It gets the exact same outcome > > as today. We already discussed that we need to expose the group ID > > through an ioctl someplace. > > QEMU can make a policy decision today because the kernel provides a > sufficiently reliable interface, ie. based on the set of owned groups, a > hot-reset is all but guaranteed to work. If we focus only on whether a > given reset is allowed from a kernel perspective and ignore that > userspace needs some predictability of the kernel behavior, then QEMU > cannot reasonable make that policy decision. > > > If this is too awkward we could add a query to the kernel if the cdev > > is "reset exclusive" - eg the iommufd covers all the groups that span > > the reset set. > > That's essentially what we have if there are valid dev-ids for each > affected device in the info ioctl. I don't think it helps the user > experience to create loopholes where the hot-reset ioctl can still work > in spite of those missing devices. The group interface uses the fact > that ownership of the group implies ownership of all devices within the > group such that the user only needs to prove group ownership. 
> > But we still have underlying groups even with the cdev model, with the > same ownership principles, so don't we just need to prove group > ownership based on a device fd rather than a group fd? > > For example, we have a VFIO_DEVICE_GET_INFO ioctl that supports > capability chains, we could add a capability that reports the group ID > for the device. The hot-reset info ioctl remains as it is today, > reporting group-ids and bdfs. The hot-reset ioctl itself is modified to > transparently support either group fds or device fds. The user can now > map cdevs to group-ids and therefore follow the same rules as groups, > providing at least one representative device fd for each group. We've > essentially already enabled this by allowing the limit of user provided > fds equal to the number of affected devices. If I'm not mistaken, I think this resolves cdev no-iommu to work equivalently to groups as well. Thanks, Alex
On Wed, Apr 05, 2023 at 12:56:21PM -0600, Alex Williamson wrote:

> Usability needs to be a consideration as well. An interface where the
> result is effectively arbitrary from a user perspective because the
> kernel is solely focused on whether the operation is allowed,
> evaluating constraints that the user is unaware of and cannot control,
> is unusable.

Considering this API is only invoked by qemu we might be overdoing
this usability and 'no shoot in foot' view.

> > This is a good point that qemu needs to make a policy decision if it
> > is happy about the VFIO configuration - but that is a policy decision
> > that should not become entangled with the kernel's security checks.
> >
> > Today qemu can make this policy choice the same way it does right now
> > - call _INFO and check the group_ids. It gets the exact same outcome
> > as today. We already discussed that we need to expose the group ID
> > through an ioctl someplace.
>
> QEMU can make a policy decision today because the kernel provides a
> sufficiently reliable interface, ie. based on the set of owned groups, a
> hot-reset is all but guaranteed to work.

And we don't change that with cdev. If qemu wants to make the policy
decision it keeps using the exact same _INFO interface to make that
decision, the same as it always has.

We weaken the actual reset action to only consider the security side.

Applications that want this exclusive reset group policy simply must
check it on their own. It is a reasonable API design.

> > If this is too awkward we could add a query to the kernel if the cdev
> > is "reset exclusive" - eg the iommufd covers all the groups that span
> > the reset set.
>
> That's essentially what we have if there are valid dev-ids for each
> affected device in the info ioctl.

If you have dev-ids for everything, yes. If you don't, then you can't
make the same policy choice using a dev-id interface.

> I don't think it helps the user experience to create loopholes where
> the hot-reset ioctl can still work in spite of those missing
> devices.

I disagree. The easy straightforward design is that the reset ioctl
works if the process has security permissions. Mixing a policy check
into the kernel on this path is creating complexity we don't really
need.

I don't view it as a loophole, it is flexibility to use the API in a
way that is different from what qemu wants - e.g. an app like dpdk may
be willing to tolerate a reset group that becomes unavailable after
startup. Who knows, why should we force this in the kernel?

> For example, we have a VFIO_DEVICE_GET_INFO ioctl that supports
> capability chains, we could add a capability that reports the group ID
> for the device.

I was going to put that in an iommufd ioctl so it works with VDPA too,
but sure, let's assume we can get the group ID from a cdev fd.

> The hot-reset info ioctl remains as it is today, reporting group-ids
> and bdfs.

Sure, but userspace still needs to know how to map the reset sets into
dev-ids. Remember the reason we started doing this is because we don't
have easy access to the BDF anymore.

I like leaving this ioctl alone, let's go back to a dedicated ioctl to
return the dev_ids.

> The hot-reset ioctl itself is modified to transparently
> support either group fds or device fds. The user can now map cdevs
> to group-ids and therefore follow the same rules as groups,
> providing at least one representative device fd for each group.

This looks like a very complex uapi compared to the empty list option,
but it seems like it would work.

Jason
On Wed, 5 Apr 2023 16:21:09 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Wed, Apr 05, 2023 at 12:56:21PM -0600, Alex Williamson wrote: > > Usability needs to be a consideration as well. An interface where the > > result is effectively arbitrary from a user perspective because the > > kernel is solely focused on whether the operation is allowed, > > evaluating constraints that the user is unaware of and cannot control, > > is unusable. > > Considering this API is only invoked by qemu we might be overdoing > this usability and 'no shoot in foot' view. Ok, I'm not sure why we're diminishing the de facto vfio userspace... > > > This is a good point that qemu needs to make a policy decision if it > > > is happy about the VFIO configuration - but that is a policy decision > > > that should not become entangled with the kernel's security checks. > > > > > > Today qemu can make this policy choice the same way it does right now > > > - call _INFO and check the group_ids. It gets the exact same outcome > > > as today. We already discussed that we need to expose the group ID > > > through an ioctl someplace. > > > > QEMU can make a policy decision today because the kernel provides a > > sufficiently reliable interface, ie. based on the set of owned groups, a > > hot-reset is all but guaranteed to work. > > And we don't change that with cdev. If qemu wants to make the policy > decision it keeps using the exact same _INFO interface to make that > decision same it has always made. > > We weaken the actual reset action to only consider the security side. > > Applications that want this exclusive reset group policy simply must > check it on their own. It is a reasonable API design. I disagree, as I've argued before, the info ioctl becomes so weak and effectively arbitrary from a user perspective at being able to predict whether the hot-reset ioctl works that it becomes useless, diminishing the entire hot-reset info/execute API. 
> > > If this is too awkward we could add a query to the kernel if the cdev > > > is "reset exclusive" - eg the iommufd covers all the groups that span > > > the reset set. > > > > That's essentially what we have if there are valid dev-ids for each > > affected device in the info ioctl. > > If you have dev-ids for everything, yes. If you don't, then you can't > make the same policy choice using a dev-id interface. Exactly, you can't make any policy choice because the success or failure of the hot-reset ioctl can't be known. > > I don't think it helps the user experience to create loopholes where > > the hot-reset ioctl can still work in spite of those missing > > devices. > > I disagree. The easy straightforward design is that the reset ioctl > works if the process has security permissions. Mixing a policy check > into the kernel on this path is creating complexity we don't really > need. > > I don't view it as a loophole, it is flexability to use the API in a > way that is different from what qemu wants - eg an app like dpdk may > be willing to tolerate a reset group that becomes unavailable after > startup. Who knows, why should we force this in the kernel? Because look at all the problems it's causing to try to introduce these loopholes without also introducing subtle bugs. There's an argument that we're overly strict, which is better than the alternative, which seems to be what we're dabbling with. It is a straightforward interface for the hot-reset ioctl to mirror the information provided via the hot-reset info ioctl. > > For example, we have a VFIO_DEVICE_GET_INFO ioctl that supports > > capability chains, we could add a capability that reports the group ID > > for the device. > > I was going to put that in an iommufd ioctl so it works with VDPA too, > but sure, lets assume we can get the group ID from a cdev fd. > > > The hot-reset info ioctl remains as it is today, reporting group-ids > > and bdfs. 
> > Sure, but userspace still needs to know how to map the reset sets into > dev-ids. No, it doesn't. > Remember the reason we started doing this is because we don't > have easy access to the BDF anymore. We don't need it, the info ioctl provides the groups, the group association can be learned from the DEVICE_GET_INFO ioctl, the hot-reset ioctl only requires a single representative fd per affected group. dev-ids not required. > I like leaving this ioctl alone, lets go back to a dedicated ioctl to > return the dev_ids. I don't see any justification for this. We could add another PCI specific DEVICE_GET_INFO capability to report the bdf if we really need it, but reporting the group seems sufficient for this use case. > > The hot-reset ioctl itself is modified to transparently > > support either group fds or device fds. The user can now map cdevs > > to group-ids and therefore follow the same rules as groups, > > providing at least one representative device fd for each group. > > This looks like a very complex uapi compared to the empty list option, > but it seems like it would work. It's the same API that we have now. What's complex is trying to figure out all the subtle side-effects from the loopholes that are being proposed in this series. Thanks, Alex
On Wed, Apr 05, 2023 at 01:49:45PM -0600, Alex Williamson wrote: > > > QEMU can make a policy decision today because the kernel provides a > > > sufficiently reliable interface, ie. based on the set of owned groups, a > > > hot-reset is all but guaranteed to work. > > > > And we don't change that with cdev. If qemu wants to make the policy > > decision it keeps using the exact same _INFO interface to make that > > decision same it has always made. > > > > We weaken the actual reset action to only consider the security side. > > > > Applications that want this exclusive reset group policy simply must > > check it on their own. It is a reasonable API design. > > I disagree, as I've argued before, the info ioctl becomes so weak and > effectively arbitrary from a user perspective at being able to predict > whether the hot-reset ioctl works that it becomes useless, diminishing > the entire hot-reset info/execute API. reset should be strictly more permissive than INFO. If INFO predicts reset is permitted then reset should succeed. We don't change INFO so it cannot "becomes so weak" ?? We don't care about the cases where INFO says it will not succeed but reset does (temporarily) succeed. I don't get what argument you are trying to make or what you think is diminished.. Again, userspace calls INFO, if info says yes then reset *always works*, exactly just like today. Userspace will call reset with a 0 length FD list and it uses a security only check that is strictly more permissive than what get_info will return. So the new check is simple in the kernel and always works in the cases we need it to work. What is getting things into trouble is insisting that RESET have additional restrictions beyond the minimum checks required for security. > > I don't view it as a loophole, it is flexability to use the API in a > > way that is different from what qemu wants - eg an app like dpdk may > > be willing to tolerate a reset group that becomes unavailable after > > startup. 
Who knows, why should we force this in the kernel? > > Because look at all the problems it's causing to try to introduce these > loopholes without also introducing subtle bugs.

These problems are coming from trying to do this integrated version, not from my approach! AFAICT there was nothing wrong with my original plan of using the empty fd list for reset. What Yi has here is some mashup of what you and I both suggested.

> > Remember the reason we started doing this is because we don't > > have easy access to the BDF anymore. > > We don't need it, the info ioctl provides the groups, the group > association can be learned from the DEVICE_GET_INFO ioctl, the > hot-reset ioctl only requires a single representative fd per affected > group. dev-ids not required.

I'm not talking about triggering the ioctl. I'm talking about whatever else qemu needs to do so that the VM is aware of the reset groups device-by-device on its side so nested VFIO in the VM reflects the same data as the hypervisor. Maybe it doesn't do this right now, but the kernel API should continue to provide the data.

> > I like leaving this ioctl alone, let's go back to a dedicated ioctl to > > return the dev_ids. > > I don't see any justification for this. We could add another PCI > specific DEVICE_GET_INFO capability to report the bdf if we really need > it, but reporting the group seems sufficient for this use case.

What I imagine is a single new ioctl 'get reset group 2' or something. It returns a list of dev_ids in the reset group. It has an output flag if the reset is reliable. This is the only ioctl user space needs to call. The reliable test is done by simply calling the ioctl and throwing away the dev ids. The mapping of the VM's reset groups is done by processing the dev_ids to vRIDs and flowing that into the VM somehow. We don't expose group_ids, and we don't expose BDF. It is much simpler and cleaner to use.
A BDF DEVICE_GET_INFO and the existing reset INFO will encode the same data too, it is just not as elegant and requires userspace to do a lot more work to keep track of the 3 different identifiers. > > This looks like a very complex uapi compared to the empty list option, > > but it seems like it would work. > > It's the same API that we have now. What's complex is trying to figure > out all the subtle side-effects from the loopholes that are being > proposed in this series. Thanks, I might agree with you if we weren't now going backwards - ideas didn't work out and Yi has to throw stuff away. :( Jason
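To make the "get reset group 2" idea above concrete, here is a rough userspace-side sketch of what consuming such a result might look like. Everything in it is hypothetical: no such ioctl exists in the thread beyond Jason's description, and `struct reset_group_result`, `RESET_GROUP_FLAG_RELIABLE`, `struct vrid_map`, and both helpers are illustrative names only. The vRID is assumed to be a 16-bit value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical output flag: set when the reset is reliable. */
#define RESET_GROUP_FLAG_RELIABLE (1u << 0)

/* Hypothetical result of the imagined ioctl: the dev_ids in the reset
 * group plus output flags. */
struct reset_group_result {
	uint32_t flags;
	uint32_t count;
	const uint32_t *dev_ids;
};

/* Hypothetical VMM bookkeeping: dev_id to virtual requester ID. */
struct vrid_map {
	uint32_t dev_id;
	uint16_t vrid;
};

/* The "reliable test": look only at the flag and ignore the dev_ids. */
static bool reset_is_reliable(const struct reset_group_result *res)
{
	assert(res != NULL);
	return (res->flags & RESET_GROUP_FLAG_RELIABLE) != 0;
}

/*
 * Translate the reported dev_ids into vRIDs so the VM can be told which
 * of its devices share a reset group. Returns the number of vRIDs
 * written, or -1 if a dev_id is unknown or the output is too small.
 */
static int reset_group_to_vrids(const struct reset_group_result *res,
				const struct vrid_map *map, size_t nmap,
				uint16_t *out, size_t out_cap)
{
	assert(res != NULL && map != NULL && out != NULL);

	if (res->count > out_cap)
		return -1;

	for (uint32_t i = 0; i < res->count; i++) {
		size_t j;

		for (j = 0; j < nmap; j++)
			if (map[j].dev_id == res->dev_ids[i])
				break;
		if (j == nmap)
			return -1; /* dev_id with no known vRID */
		out[i] = map[j].vrid;
	}
	return (int)res->count;
}
```

In this shape userspace tracks a single identifier (the dev_id) rather than group ID, BDF, and dev_id together, which is the simplification being argued for.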
Hi Eric, > From: Eric Auger <eric.auger@redhat.com> > Sent: Thursday, April 6, 2023 1:58 AM [...] > >>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h > >>>> index 25432ef213ee..5a34364e3b94 100644 > >>>> --- a/include/uapi/linux/vfio.h > >>>> +++ b/include/uapi/linux/vfio.h > >>>> @@ -650,11 +650,32 @@ enum { > >>>> * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + > 12, > >>>> * struct vfio_pci_hot_reset_info) > >>>> * > >>>> + * This command is used to query the affected devices in the hot reset for > >>>> + * a given device. User could use the information reported by this command > >>>> + * to figure out the affected devices among the devices it has opened. > the 'opened' terminology does not look sufficient here because it is not > only a matter of the device being opened using cdev but it also needs to > have been bound to an iommufd, dev_id being the output of the > dev-iommufd binding. > > By the way I am now confused. What does happen if the reset impact some > devices which are not bound to an iommu ctx. Previously we returned the > iommu group which always pre-exists but now you will report invalid id? For such devices, user could use the bdf information to check if affected device is opened by the user. If yes, do some necessary preparation on the device before issuing hot reset. Regards, Yi Liu > >>>> + * This command always reports the segment, bus and devfn information for > >>>> + * each affected device, and selectively report the group_id or the dev_id > >>>> + * per the way how the device being queried is opened. > >>>> + * - If the device is opened via the traditional group/container manner, > >>>> + * this command reports the group_id for each affected device. > >>>> + * > >>>> + * - If the device is opened as a cdev, this command needs to report > >>> s/needs to report/reports > >> got it. > >> > >>>> + * dev_id for each affected device and set the > >>>> + * VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID flag. 
For the > affected > >>>> + * devices that are not opened as cdev or bound to different iommufds > >>>> + * with the device that is queried, report an invalid dev_id to avoid > or not bound at all > >>> s/bound to different iommufds with the device that is queried/bound to > >>> iommufds different from the reset device one? > >> hmmm, I'm not a native speaker here. This _INFO is to query if want > >> hot reset a given device, what devices would be affected. So it appears > >> the queried device is better. But I'd admit "the queried device" is also > >> "the reset device". may Alex help pick one.
Hi Alex, > From: Alex Williamson <alex.williamson@redhat.com> > Sent: Thursday, April 6, 2023 3:50 AM > > On Wed, 5 Apr 2023 16:21:09 -0300 > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > On Wed, Apr 05, 2023 at 12:56:21PM -0600, Alex Williamson wrote: > > > Usability needs to be a consideration as well. An interface where the > > > result is effectively arbitrary from a user perspective because the > > > kernel is solely focused on whether the operation is allowed, > > > evaluating constraints that the user is unaware of and cannot control, > > > is unusable. > > > > Considering this API is only invoked by qemu we might be overdoing > > this usability and 'no shoot in foot' view. > > Ok, I'm not sure why we're diminishing the de facto vfio userspace... > > > > > This is a good point that qemu needs to make a policy decision if it > > > > is happy about the VFIO configuration - but that is a policy decision > > > > that should not become entangled with the kernel's security checks. > > > > > > > > Today qemu can make this policy choice the same way it does right now > > > > - call _INFO and check the group_ids. It gets the exact same outcome > > > > as today. We already discussed that we need to expose the group ID > > > > through an ioctl someplace. > > > > > > QEMU can make a policy decision today because the kernel provides a > > > sufficiently reliable interface, ie. based on the set of owned groups, a > > > hot-reset is all but guaranteed to work. > > > > And we don't change that with cdev. If qemu wants to make the policy > > decision it keeps using the exact same _INFO interface to make that > > decision same it has always made. > > > > We weaken the actual reset action to only consider the security side. > > > > Applications that want this exclusive reset group policy simply must > > check it on their own. It is a reasonable API design. 
> > I disagree, as I've argued before, the info ioctl becomes so weak and > effectively arbitrary from a user perspective at being able to predict > whether the hot-reset ioctl works that it becomes useless, diminishing > the entire hot-reset info/execute API. > > > > If this is too awkward we could add a query to the kernel if the cdev > > > > is "reset exclusive" - eg the iommufd covers all the groups that span > > > > the reset set. > > > > > > That's essentially what we have if there are valid dev-ids for each > > > affected device in the info ioctl. > > > > If you have dev-ids for everything, yes. If you don't, then you can't > > make the same policy choice using a dev-id interface. > > Exactly, you can't make any policy choice because the success or > failure of the hot-reset ioctl can't be known.

Could you elaborate a bit on what the policy is here? As far as I know, QEMU uses the information reported by _INFO to check:
- whether all the affected groups are owned by the current QEMU [1]
- whether the affected devices are opened by the current QEMU; if so, QEMU needs to call vfio_pci_pre_reset() to prepare them before issuing the hot reset [1]

[1] vfio_pci_hot_reset() in https://github.com/qemu/qemu/blob/master/hw/vfio/pci.c

> > > I don't think it helps the user experience to create loopholes where > > > the hot-reset ioctl can still work in spite of those missing > > > devices. > > > > I disagree. The easy straightforward design is that the reset ioctl > > works if the process has security permissions. Mixing a policy check > > into the kernel on this path is creating complexity we don't really > > need. > > > > I don't view it as a loophole, it is flexibility to use the API in a > > way that is different from what qemu wants - eg an app like dpdk may > > be willing to tolerate a reset group that becomes unavailable after > > startup. Who knows, why should we force this in the kernel?
> > Because look at all the problems it's causing to try to introduce these > loopholes without also introducing subtle bugs. There's an argument > that we're overly strict, which is better than the alternative, which > seems to be what we're dabbling with. It is a straightforward > interface for the hot-reset ioctl to mirror the information provided > via the hot-reset info ioctl. > > > > For example, we have a VFIO_DEVICE_GET_INFO ioctl that supports > > > capability chains, we could add a capability that reports the group ID > > > for the device. > > > > I was going to put that in an iommufd ioctl so it works with VDPA too, > > but sure, let's assume we can get the group ID from a cdev fd. > > > > > The hot-reset info ioctl remains as it is today, reporting group-ids > > > and bdfs. > > > > Sure, but userspace still needs to know how to map the reset sets into > > dev-ids. > > No, it doesn't. > > > Remember the reason we started doing this is because we don't > > have easy access to the BDF anymore. > > We don't need it, the info ioctl provides the groups, the group > association can be learned from the DEVICE_GET_INFO ioctl, the > hot-reset ioctl only requires a single representative fd per affected > group. dev-ids not required. > > > I like leaving this ioctl alone, let's go back to a dedicated ioctl to > > return the dev_ids. > > I don't see any justification for this. We could add another PCI > specific DEVICE_GET_INFO capability to report the bdf if we really need > it, but reporting the group seems sufficient for this use case.

IMHO, knowing the group may not be enough. Take QEMU as an example. QEMU not only needs to ensure that it owns the group, it also needs to prepare the devices that are already in use and are affected by the hot reset of a newly opened device. With only group knowledge, QEMU may blindly prepare all the devices that are already opened and belong to the same iommu group. But as I understood from the discussion, an iommu group is not the same as the hot reset scope (a.k.a. dev_set), is it? Devices in one iommu_group may span multiple hot reset scopes. In that case, getting the bdf info from the cdev fd is necessary.

Regards, Yi Liu
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Thursday, April 6, 2023 7:23 AM > > On Wed, Apr 05, 2023 at 01:49:45PM -0600, Alex Williamson wrote: > > > > > QEMU can make a policy decision today because the kernel provides a > > > > sufficiently reliable interface, ie. based on the set of owned groups, a > > > > hot-reset is all but guaranteed to work. > > > > > > And we don't change that with cdev. If qemu wants to make the policy > > > decision it keeps using the exact same _INFO interface to make that > > > decision same it has always made. > > > > > > We weaken the actual reset action to only consider the security side. > > > > > > Applications that want this exclusive reset group policy simply must > > > check it on their own. It is a reasonable API design. > > > > I disagree, as I've argued before, the info ioctl becomes so weak and > > effectively arbitrary from a user perspective at being able to predict > > whether the hot-reset ioctl works that it becomes useless, diminishing > > the entire hot-reset info/execute API. > > reset should be strictly more permissive than INFO. If INFO predicts > reset is permitted then reset should succeed. > > We don't change INFO so it cannot "becomes so weak" ?? > > We don't care about the cases where INFO says it will not succeed but > reset does (temporarily) succeed. > > I don't get what argument you are trying to make or what you think is > diminished.. > > Again, userspace calls INFO, if info says yes then reset *always > works*, exactly just like today. > > Userspace will call reset with a 0 length FD list and it uses a > security only check that is strictly more permissive than what > get_info will return. So the new check is simple in the kernel and > always works in the cases we need it to work. > > What is getting things into trouble is insisting that RESET have > additional restrictions beyond the minimum checks required for > security. 
> > > > I don't view it as a loophole, it is flexibility to use the API in a > > > way that is different from what qemu wants - eg an app like dpdk may > > > be willing to tolerate a reset group that becomes unavailable after > > > startup. Who knows, why should we force this in the kernel? > > > > Because look at all the problems it's causing to try to introduce these > > loopholes without also introducing subtle bugs. > > These problems are coming from trying to do this integrated version, > not from my approach! > > AFAICT there was nothing wrong with my original plan of using the > empty fd list for reset. What Yi has here is some mashup of what you > and I both suggested.

Hi Alex, Jason,

That could be the reason. So let me try to summarize what this series changes and the impact of each change, as far as I know.

1) Only check the ownership of opened devices in the dev_set in the HOT_RESET ioctl.
   - Impact: it changes the relationship between _INFO and HOT_RESET. Per "Each group must have IOMMU protection established for the ioctl to succeed." in [1], the existing design means userspace should own all the affected groups before attempting HOT_RESET. With the change here, the user does not need to have opened all affected groups; the hot reset can succeed as long as the devices in the groups it has not opened are simply unopened and can be reset.

   [1] https://patchwork.kernel.org/project/linux-pci/patch/20130814200845.21923.64284.stgit@bling.home/

2) Allow passing a zero-length fd array to do hot reset.
   - Impact: this uses the iommufd as the ownership check on the kernel side. It is only supposed to be used by users that open the cdev, not by users that open the group. The drawback is that it cannot cover noiommu devices, as noiommu does not use iommufd at all. But it works well for most cases.

3) Allow hot reset to succeed when the dev_set is a singleton.
   - Impact: this makes sense, but it seems to blur the boundary between the group path and the cdev path w.r.t. the zero-length fd approach. The group path can succeed in doing a hot reset even when it passes an empty fd array, if the dev_set happens to be a singleton.

4) Allow passing device fds to do hot reset.
   - Impact: this is a new way to do hot reset. It should have no impact.

5) Extend _INFO to report devid.
   - Impact: this changes the way the user decodes the reported info. Either devid or groupid is returned, depending on how the queried device was opened. Because it was suggested to support the scenario in which some devices are opened via cdev and others via group, we have to return invalid_devid for a device opened via group when it is affected by the hot reset of a device opened via cdev.

   This was proposed to support the future device fd passing usage, which is only available in the cdev path.

To me the major confusion comes from 1) and 3). 1) changes the meaning of _INFO and HOT_RESET, while 3) blurs the boundary.

Here are my thoughts:

For 1), it was proposed for the reason in [2]. We'd like a scenario that works in the group path to work in the cdev path as well. But IMHO, we may just accept that the cdev path cannot support such a scenario, to avoid a subtle change to the uapi. Otherwise, we need another HOT_RESET ioctl, or a hint in the HOT_RESET ioctl, to tell the kernel whether the relaxed ownership check is expected. Maybe this is awkward. But if we want to keep it, we should do it with the user's awareness.

[2] https://lore.kernel.org/kvm/Y%2FdobS6gdSkxnPH7@nvidia.com/

For 3), it was proposed when discussing hot reset for noiommu [3]. But it does not make hot reset always work for noiommu in cdev, only in the case where the dev_set is a singleton. So it is more of a general optimization that lets the kernel skip the ownership check. But to make use of it, we may need to test it before sanitizing the group fds from the user or doing the iommufd check. Maybe the dev_set singleton test in this series is not well placed. If so, I can modify it further.

[3] https://lore.kernel.org/kvm/ZACX+Np%2FIY7ygqL5@nvidia.com/

Regards, Yi Liu

> > > > Remember the reason we started doing this is because we don't > > > have easy access to the BDF anymore. > > > > We don't need it, the info ioctl provides the groups, the group > > association can be learned from the DEVICE_GET_INFO ioctl, the > > hot-reset ioctl only requires a single representative fd per affected > > group. dev-ids not required. > > I'm not talking about triggering the ioctl. > > I'm talking about whatever else qemu needs to do so that the VM is > aware of the reset groups device-by-device on its side so nested VFIO > in the VM reflects the same data as the hypervisor. Maybe it doesn't > do this right now, but the kernel API should continue to provide the > data. > > > > I like leaving this ioctl alone, let's go back to a dedicated ioctl to > > > return the dev_ids. > > > > I don't see any justification for this. We could add another PCI > > specific DEVICE_GET_INFO capability to report the bdf if we really need > > it, but reporting the group seems sufficient for this use case. > > What I imagine is a single new ioctl 'get reset group 2' or something. > It returns a list of dev_ids in the reset group. It has an output flag > if the reset is reliable. This is the only ioctl user space needs to > call. > > The reliable test is done by simply calling the ioctl and throwing > away the dev ids. The mapping of the VM's reset groups is done by > processing the dev_ids to vRIDs and flowing that into the VM somehow. > > We don't expose group_ids, and we don't expose BDF. It is much simpler > and cleaner to use. > > A BDF DEVICE_GET_INFO and the existing reset INFO will encode the same > data too, it is just not as elegant and requires userspace to do a lot > more work to keep track of the 3 different identifiers. > > > > This looks like a very complex uapi compared to the empty list option, > > > but it seems like it would work.
> > > > It's the same API that we have now. What's complex is trying to figure > > out all the subtle side-effects from the loopholes that are being > > proposed in this series. Thanks, > > I might agree with you if we weren't now going backwards - > ideas didn't work out and Yi has to throw stuff away. :( > > Jason
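The disagreement over "strictly more permissive" can be illustrated with a toy model of the two checks being compared. This is not the kernel's logic, only a sketch under assumptions: each device in the reset set is reduced to two hypothetical booleans, the INFO-based policy is modelled as "the caller owns every affected device", and the security-only check from item 1) of the summary is modelled as "every opened device is owned by the caller".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-device state for one member of the reset set. */
struct reset_set_dev {
	bool owned_by_caller; /* caller owns the device (via its group or iommufd) */
	bool opened;          /* some user currently has the device open */
};

/* Policy derived from the INFO result: every affected device is owned. */
static bool info_policy_ok(const struct reset_set_dev *devs, size_t n)
{
	assert(devs != NULL);
	for (size_t i = 0; i < n; i++)
		if (!devs[i].owned_by_caller)
			return false;
	return true;
}

/* Security-only check: unopened devices are ignored, opened ones must
 * be owned by the caller. */
static bool security_only_ok(const struct reset_set_dev *devs, size_t n)
{
	assert(devs != NULL);
	for (size_t i = 0; i < n; i++)
		if (devs[i].opened && !devs[i].owned_by_caller)
			return false;
	return true;
}
```

In this model `info_policy_ok()` implies `security_only_ok()`, which is Jason's point. The extra cases that only the second check accepts depend on `opened`, which can change outside the caller's control, which is Alex's objection.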
On Thu, 6 Apr 2023 06:34:08 +0000 "Liu, Yi L" <yi.l.liu@intel.com> wrote: > Hi Alex, > > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Thursday, April 6, 2023 3:50 AM > > > > On Wed, 5 Apr 2023 16:21:09 -0300 > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > On Wed, Apr 05, 2023 at 12:56:21PM -0600, Alex Williamson wrote: > > > > Usability needs to be a consideration as well. An interface where the > > > > result is effectively arbitrary from a user perspective because the > > > > kernel is solely focused on whether the operation is allowed, > > > > evaluating constraints that the user is unaware of and cannot control, > > > > is unusable. > > > > > > Considering this API is only invoked by qemu we might be overdoing > > > this usability and 'no shoot in foot' view. > > > > Ok, I'm not sure why we're diminishing the de facto vfio userspace... > > > > > > > This is a good point that qemu needs to make a policy decision if it > > > > > is happy about the VFIO configuration - but that is a policy decision > > > > > that should not become entangled with the kernel's security checks. > > > > > > > > > > Today qemu can make this policy choice the same way it does right now > > > > > - call _INFO and check the group_ids. It gets the exact same outcome > > > > > as today. We already discussed that we need to expose the group ID > > > > > through an ioctl someplace. > > > > > > > > QEMU can make a policy decision today because the kernel provides a > > > > sufficiently reliable interface, ie. based on the set of owned groups, a > > > > hot-reset is all but guaranteed to work. > > > > > > And we don't change that with cdev. If qemu wants to make the policy > > > decision it keeps using the exact same _INFO interface to make that > > > decision same it has always made. > > > > > > We weaken the actual reset action to only consider the security side. 
> > > > > > Applications that want this exclusive reset group policy simply must > > > check it on their own. It is a reasonable API design. > > > > I disagree, as I've argued before, the info ioctl becomes so weak and > > effectively arbitrary from a user perspective at being able to predict > > whether the hot-reset ioctl works that it becomes useless, diminishing > > the entire hot-reset info/execute API. > > > > > > > If this is too awkward we could add a query to the kernel if the cdev > > > > > is "reset exclusive" - eg the iommufd covers all the groups that span > > > > > the reset set. > > > > > > > > That's essentially what we have if there are valid dev-ids for each > > > > affected device in the info ioctl. > > > > > > If you have dev-ids for everything, yes. If you don't, then you can't > > > make the same policy choice using a dev-id interface. > > > > Exactly, you can't make any policy choice because the success or > > failure of the hot-reset ioctl can't be known. > > could you elaborate a bit about what the policy is here. As far as I know, > QEMU makes use of the information reported by _INFO to check: > - if all the affected groups are owned by the current QEMU[1] > - if the affected devices are opened by the current QEMU, if yes, QEMU > needs to use vfio_pci_pre_reset() to do preparation before issuing > hot rest[1] > > [1] vfio_pci_hot_reset() in https://github.com/qemu/qemu/blob/master/hw/vfio/pci.c Regarding the policy decisions, look for instance at the distinction between vfio_pci_hot_reset_one() vs vfio_pci_hot_reset_multi(), or the way QEMU will opt for a bus reset if it believes only a PM reset is available. In my proposal, I did miss that if _INFO reports the group and bdf that allows QEMU to associate fd passed devices to a group affected by the reset, but not specifically whether the device is affected by the reset. 
I think that would be justification for capabilities on the DEVICE_GET_INFO ioctl to report both the group and PCI address as separate capabilities. > > > > I don't think it helps the user experience to create loopholes where > > > > the hot-reset ioctl can still work in spite of those missing > > > > devices. > > > > > > I disagree. The easy straightforward design is that the reset ioctl > > > works if the process has security permissions. Mixing a policy check > > > into the kernel on this path is creating complexity we don't really > > > need. > > > > > > I don't view it as a loophole, it is flexability to use the API in a > > > way that is different from what qemu wants - eg an app like dpdk may > > > be willing to tolerate a reset group that becomes unavailable after > > > startup. Who knows, why should we force this in the kernel? > > > > Because look at all the problems it's causing to try to introduce these > > loopholes without also introducing subtle bugs. There's an argument > > that we're overly strict, which is better than the alternative, which > > seems to be what we're dabbling with. It is a straightforward > > interface for the hot-reset ioctl to mirror the information provided > > via the hot-reset info ioctl. > > > > > > For example, we have a VFIO_DEVICE_GET_INFO ioctl that supports > > > > capability chains, we could add a capability that reports the group ID > > > > for the device. > > > > > > I was going to put that in an iommufd ioctl so it works with VDPA too, > > > but sure, lets assume we can get the group ID from a cdev fd. > > > > > > > The hot-reset info ioctl remains as it is today, reporting group-ids > > > > and bdfs. > > > > > > Sure, but userspace still needs to know how to map the reset sets into > > > dev-ids. > > > > No, it doesn't. > > > > > Remember the reason we started doing this is because we don't > > > have easy access to the BDF anymore. 
> > > > We don't need it, the info ioctl provides the groups, the group > > association can be learned from the DEVICE_GET_INFO ioctl, the > > hot-reset ioctl only requires a single representative fd per affected > > group. dev-ids not required. > > > > > I like leaving this ioctl alone, lets go back to a dedicated ioctl to > > > return the dev_ids. > > > > I don't see any justification for this. We could add another PCI > > specific DEVICE_GET_INFO capability to report the bdf if we really need > > it, but reporting the group seems sufficient for this use case. > > IMHO, the knowledge of group may be not enough. Take QEMU as an example. > QEMU not only needs to ensure the group is owned by it, it also needs to > do preparation on the devices that are already in use and affected by > the hot reset on a new opened device. If there is only group knowledge, > QEMU may blindly prepares all the devices that are already opened and > belong to the same iommu group. But as I got in the discussion iommu > group is not equal to hot reset scope (a.k.a. dev_set). is it? It is > possible that devices in an iommu_group may span into multiple hot > reset scope. For such case, get bdf info from cdev fd is necessary. Yes, you're correct, group and reset scope are not equivalent, so we'd require a means to get both the group and the bdf for the device. Knowing the bdf allows the user to know which opened devices are directly affected by the reset, knowing the group allows the user to know if ancillary affected devices are within the set of groups the user owns and therefore effectively under their purview. Thanks, Alex
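The distinction Alex draws above, with the bdf identifying directly affected opened devices and the group identifying ancillary devices within the user's purview, might be sketched as follows. It assumes userspace has already learned the group ID and PCI address of each device it opened, for example through the DEVICE_GET_INFO capabilities suggested in this thread. The struct and function names are hypothetical, and treating "owned groups" as "groups of opened devices" is a simplification made for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct pci_addr {
	uint16_t segment;
	uint8_t bus;
	uint8_t devfn;
};

/* One entry of the hot-reset info result (local stand-in). */
struct affected_dev {
	uint32_t group_id;
	struct pci_addr addr;
};

/* Hypothetical record kept by userspace for each device it has opened. */
struct opened_dev {
	uint32_t group_id;
	struct pci_addr addr;
};

enum affect_class {
	AFFECT_OPENED,      /* opened by the user: prepare it before the reset */
	AFFECT_OWNED_GROUP, /* not opened, but within a group the user owns */
	AFFECT_UNOWNED      /* outside the user's groups */
};

static int same_addr(const struct pci_addr *a, const struct pci_addr *b)
{
	return a->segment == b->segment && a->bus == b->bus && a->devfn == b->devfn;
}

static enum affect_class classify_affected(const struct affected_dev *dev,
					   const struct opened_dev *opened,
					   size_t nopened)
{
	int group_owned = 0;

	assert(dev != NULL && opened != NULL);

	for (size_t i = 0; i < nopened; i++) {
		/* A bdf match means this exact device is directly affected. */
		if (same_addr(&dev->addr, &opened[i].addr))
			return AFFECT_OPENED;
		/* A group match only implies ownership, not an opened device. */
		if (dev->group_id == opened[i].group_id)
			group_owned = 1;
	}
	return group_owned ? AFFECT_OWNED_GROUP : AFFECT_UNOWNED;
}
```

Only devices classified as `AFFECT_OPENED` need pre-reset preparation. An opened device that shares a group with an affected device but does not itself appear in the info result is left alone, which covers the case where one group spans several reset scopes.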
On Thu, 6 Apr 2023 10:02:10 +0000 "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > From: Jason Gunthorpe <jgg@nvidia.com> > > Sent: Thursday, April 6, 2023 7:23 AM > > > > On Wed, Apr 05, 2023 at 01:49:45PM -0600, Alex Williamson wrote: > > > > > > > QEMU can make a policy decision today because the kernel provides a > > > > > sufficiently reliable interface, ie. based on the set of owned groups, a > > > > > hot-reset is all but guaranteed to work. > > > > > > > > And we don't change that with cdev. If qemu wants to make the policy > > > > decision it keeps using the exact same _INFO interface to make that > > > > decision same it has always made. > > > > > > > > We weaken the actual reset action to only consider the security side. > > > > > > > > Applications that want this exclusive reset group policy simply must > > > > check it on their own. It is a reasonable API design. > > > > > > I disagree, as I've argued before, the info ioctl becomes so weak and > > > effectively arbitrary from a user perspective at being able to predict > > > whether the hot-reset ioctl works that it becomes useless, diminishing > > > the entire hot-reset info/execute API. > > > > reset should be strictly more permissive than INFO. If INFO predicts > > reset is permitted then reset should succeed. > > > > We don't change INFO so it cannot "becomes so weak" ?? > > > > We don't care about the cases where INFO says it will not succeed but > > reset does (temporarily) succeed. > > > > I don't get what argument you are trying to make or what you think is > > diminished.. > > > > Again, userspace calls INFO, if info says yes then reset *always > > works*, exactly just like today. > > > > Userspace will call reset with a 0 length FD list and it uses a > > security only check that is strictly more permissive than what > > get_info will return. So the new check is simple in the kernel and > > always works in the cases we need it to work. 
> > > > What is getting things into trouble is insisting that RESET have > > additional restrictions beyond the minimum checks required for > > security. > > > > > > I don't view it as a loophole, it is flexability to use the API in a > > > > way that is different from what qemu wants - eg an app like dpdk may > > > > be willing to tolerate a reset group that becomes unavailable after > > > > startup. Who knows, why should we force this in the kernel? > > > > > > Because look at all the problems it's causing to try to introduce these > > > loopholes without also introducing subtle bugs. > > > > These problems are coming from tring to do this integrated version, > > not from my approach! > > > > AFAICT there was nothing wrong with my original plan of using the > > empty fd list for reset. What Yi has here is some mashup of what you > > and I both suggested. > > Hi Alex, Jason, > > could be this reason. So let me try to gather the changes of this series > does and the impact as far as I know. > > 1) only check the ownership of opened devices in the dev_set > in HOT_RESET ioctl. > - Impact: it changes the relationship between _INFO and HOT_RESET. > As " Each group must have IOMMU protection established for the > ioctl to succeed." in [1], existing design actually means userspace > should own all the affected groups before heading to do HOT_RESET. > With the change here, the user does not need to ensure all affected > groups are opened and it can do hot-reset successfully as long as the > devices in the affected group are just un-opened and can be reset. > > [1] https://patchwork.kernel.org/project/linux-pci/patch/20130814200845.21923.64284.stgit@bling.home/ Where whether a device is opened is subject to change outside of the user's control. 
This essentially allows the user to perform hot-resets
of devices outside of their ownership so long as the device is not
used elsewhere, versus the current requirement that the user own all the
affected groups, which implies device ownership.  It's not been
justified why this feature needs to exist, imo.

> 2) Allow passing zero-length fd array to do hot reset
>    - Impact: this uses the iommufd as ownership check in the kernel side.
>      It is only supposed to be used by the users that open cdev instead of
>      users that open group. The drawback is that it cannot cover the noiommu
>      devices as noiommu does not use iommufd at all. But it works well for
>      most cases.

The "only supposed to be used" is problematic here, we're extending all
the interfaces to transparently accept group and device fds, but here
we need to make a distinction because the ioctl needs to perform one
way for groups and another way for devices, which it currently doesn't
do.  As above, I've not seen sufficient justification for this other
than references to reducing complexity, but the only userspace expected
to make use of this interface already has equivalent complexity.

> 3) Allow hot reset be successful when the dev_set is singleton
>    - Impact: this makes sense but it seems to mess up the boundary between
>      the group path and cdev path w.r.t. the usage of zero-length fd approach.
>      The group path can succeed to do hot reset even if it is passing an empty
>      fd array if the dev_set happens to be singleton.

Again, what is the justification for requiring this, it seems to be
only a hack towards no-iommu support with cdev, which we can achieve by
other means.  Why have we not needed this in the group model?  It
introduces subtle loopholes, so while maybe we could, I don't see why we
should, therefore I cannot agree with "this makes sense".

> 4) Allow passing device fd to do hot reset
>    - Impact: this is a new way for hot reset. should have no impact.
>
> 5) Extend the _INFO to report devid
>    - Impact: this changes the way user to decode the info reported back.
>      devid and groupid are returned per the way the queried device is opened.
>      Since it was suggested to support the scenario in which some devices
>      are opened via cdev while some devices are opened via group. This makes
>      us to return invalid_devid for the device that is opened via group if
>      it is affected by the hot reset of a device that is opened via cdev.
>
>      This was proposed to support the future device fd passing usage which is
>      only available in cdev path.

I think this is fundamentally flawed because of the scope of the
dev-id.  We can only provide dev-ids for devices which belong to the
same iommufd of the calling device, thus there are multiple instances
where no dev-id can be provided.  The group-id and bdf are static
properties of the devices, regardless of their ownership.  The bdf
provides the specific device level association while the group-id
indicates implied, static ownership.

> To me the major confusion is from 1) and 3). 1) changes the meaning of
> _INFO and HOT_RESET, while 3) messes up the boundary.

As above, I think 2) is also an issue.

> Here is my thought:
>
> For 1), it was proposed due to below reason[2]. We'd like to make a scenario
> that works in the group path be workable in cdev path as well. But IMHO, we
> may just accept that cdev path cannot work for such scenario to avoid subtle
> change to uapi. Otherwise, we need to have another HOT_RESET ioctl or a
> hint in HOT_RESET ioctl to tell the kernel whether relaxed ownership check
> is expected. Maybe this is awkward. But if we want to keep it, we'd do it
> with the awareness by user.
>
> [2] https://lore.kernel.org/kvm/Y%2FdobS6gdSkxnPH7@nvidia.com/

The group association is that relaxed ownership test.  Yes, there are
corner cases where we have a dual function card with separate IOMMU
groups, where a user owning function 0 could do a bus reset because
function 1 is temporarily unused, but so what, what good is that, have
we ever had an issue raised because of this?  The user can't rely on
the unopened state of the other function.  It's an entirely
opportunistic optimization.

The much more typical scenario is that a multi-function device does not
provide isolation, all the functions are in the same group and because
of the association of the group the user has implied ownership of the
other devices for the purpose of a reset.

> For 3), it was proposed when discussing the hot reset for noiommu[3]. But
> it does not make hot reset always workable for noiommu in cdev, just in
> case dev_set is singleton. So it is more of a general optimization that can
> make the kernel skip the ownership check. But to make use of it, we may
> need to test it before sanitizing the group fds from user or the iommufd
> check. Maybe the dev_set singleton test in this series is not well placed.
> If so, I can further modify it.
>
> [3] https://lore.kernel.org/kvm/ZACX+Np%2FIY7ygqL5@nvidia.com/

As above, this seems to be some optimization related to no-iommu for
cdev because we don't have an iommufd association for the device in
no-iommu mode.  Note however that the current group interface doesn't
care about the IOMMU context of the devices.  We only need proof that
the user owns the affected groups.  So why are we bringing iommufd
context anywhere into this interface, here or the null-array interface?

It seems like the minor difference with cdev is that a) we're passing
device fds rather than group fds, and b) those device fds need to be
validated as having device access to complete the proof of ownership
relative to the group.  Otherwise we add capabilities to
DEVICE_GET_INFO to support the device fd passing model where the user
doesn't know the device group or bdf and allow the reset ioctl itself
to accept device fds (extracting the group relationship for those which
the user has configured for access).  Thanks,

Alex
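For readers following the ownership argument above, the rule "we only need proof that the user owns the affected groups" can be modelled in a few lines of C. This is a simplified sketch and not the kernel's implementation: `group_is_owned()` and `hot_reset_allowed()` are hypothetical names, and groups are reduced to plain integer IDs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if group_id appears in the caller's owned set. */
bool group_is_owned(int group_id, const int *owned, size_t n_owned)
{
    for (size_t i = 0; i < n_owned; i++)
        if (owned[i] == group_id)
            return true;
    return false;
}

/*
 * Hypothetical model of the group-based test: every group affected by the
 * bus/slot reset must be owned by the caller. Whether the devices in an
 * unowned group happen to be unopened right now is deliberately ignored.
 */
bool hot_reset_allowed(const int *affected, size_t n_affected,
                       const int *owned, size_t n_owned)
{
    assert(affected != NULL || n_affected == 0);
    assert(owned != NULL || n_owned == 0);

    for (size_t i = 0; i < n_affected; i++)
        if (!group_is_owned(affected[i], owned, n_owned))
            return false;
    return true;
}
```

In this model, a dual-function card whose functions sit in groups 7 and 9 can only be reset by a caller that owns both groups; owning group 7 alone is refused even if function 1 is idle.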
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Friday, April 7, 2023 1:54 AM > > On Thu, 6 Apr 2023 10:02:10 +0000 > "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > Sent: Thursday, April 6, 2023 7:23 AM > > > > > > On Wed, Apr 05, 2023 at 01:49:45PM -0600, Alex Williamson wrote: > > > > > > > > > QEMU can make a policy decision today because the kernel provides a > > > > > > sufficiently reliable interface, ie. based on the set of owned groups, a > > > > > > hot-reset is all but guaranteed to work. > > > > > > > > > > And we don't change that with cdev. If qemu wants to make the policy > > > > > decision it keeps using the exact same _INFO interface to make that > > > > > decision same it has always made. > > > > > > > > > > We weaken the actual reset action to only consider the security side. > > > > > > > > > > Applications that want this exclusive reset group policy simply must > > > > > check it on their own. It is a reasonable API design. > > > > > > > > I disagree, as I've argued before, the info ioctl becomes so weak and > > > > effectively arbitrary from a user perspective at being able to predict > > > > whether the hot-reset ioctl works that it becomes useless, diminishing > > > > the entire hot-reset info/execute API. > > > > > > reset should be strictly more permissive than INFO. If INFO predicts > > > reset is permitted then reset should succeed. > > > > > > We don't change INFO so it cannot "becomes so weak" ?? > > > > > > We don't care about the cases where INFO says it will not succeed but > > > reset does (temporarily) succeed. > > > > > > I don't get what argument you are trying to make or what you think is > > > diminished.. > > > > > > Again, userspace calls INFO, if info says yes then reset *always > > > works*, exactly just like today. 
> > > > > > Userspace will call reset with a 0 length FD list and it uses a > > > security only check that is strictly more permissive than what > > > get_info will return. So the new check is simple in the kernel and > > > always works in the cases we need it to work. > > > > > > What is getting things into trouble is insisting that RESET have > > > additional restrictions beyond the minimum checks required for > > > security. > > > > > > > > I don't view it as a loophole, it is flexability to use the API in a > > > > > way that is different from what qemu wants - eg an app like dpdk may > > > > > be willing to tolerate a reset group that becomes unavailable after > > > > > startup. Who knows, why should we force this in the kernel? > > > > > > > > Because look at all the problems it's causing to try to introduce these > > > > loopholes without also introducing subtle bugs. > > > > > > These problems are coming from tring to do this integrated version, > > > not from my approach! > > > > > > AFAICT there was nothing wrong with my original plan of using the > > > empty fd list for reset. What Yi has here is some mashup of what you > > > and I both suggested. > > > > Hi Alex, Jason, > > > > could be this reason. So let me try to gather the changes of this series > > does and the impact as far as I know. > > > > 1) only check the ownership of opened devices in the dev_set > > in HOT_RESET ioctl. > > - Impact: it changes the relationship between _INFO and HOT_RESET. > > As " Each group must have IOMMU protection established for the > > ioctl to succeed." in [1], existing design actually means userspace > > should own all the affected groups before heading to do HOT_RESET. > > With the change here, the user does not need to ensure all affected > > groups are opened and it can do hot-reset successfully as long as the > > devices in the affected group are just un-opened and can be reset. 
> > > > [1] https://patchwork.kernel.org/project/linux- > pci/patch/20130814200845.21923.64284.stgit@bling.home/ > > Where whether a device is opened is subject to change outside of the > user's control. This essentially allows the user to perform hot-resets > of devices outside of their ownership so long as the device is not > used elsewhere, versus the current requirement that the user own all the > affected groups, which implies device ownership. It's not been > justified why this feature needs to exist, imo. > > > 2) Allow passing zero-length fd array to do hot reset > > - Impact: this uses the iommufd as ownership check in the kernel side. > > It is only supposed to be used by the users that open cdev instead of > > users that open group. The drawback is that it cannot cover the noiommu > > devices as noiommu does not use iommufd at all. But it works well for > > most cases. > > The "only supposed to be used" is problematic here, we're extending all > the interfaces to transparently accept group and device fds, but here > we need to make a distinction because the ioctl needs to perform one > way for groups and another way for devices, which it currently doesn't > do. As above, I've not seen sufficient justification for this other > than references to reducing complexity, but the only userspace expected > to make use of this interface already has equivalent complexity. > > > 3) Allow hot reset be successful when the dev_set is singleton > > - Impact: this makes sense but it seems to mess up the boundary between > > the group path and cdev path w.r.t. the usage of zero-length fd approach. > > The group path can succeed to do hot reset even if it is passing an empty > > fd array if the dev_set happens to be singleton. > > Again, what is the justification for requiring this, it seems to be > only a hack towards no-iommu support with cdev, which we can achieve by > other means. Why have we not needed this in the group model? 
It > introduces subtle loopholes, so while maybe we could, I don't see why we > should, therefore I cannot agree with "this makes sense". > > > 4) Allow passing device fd to do hot reset > > - Impact: this is a new way for hot reset. should have no impact. > > > > 5) Extend the _INFO to report devid > > - Impact: this changes the way user to decode the info reported back. > > devid and groupid are returned per the way the queried device is opened. > > Since it was suggested to support the scenario in which some devices > > are opened via cdev while some devices are opened via group. This makes > > us to return invalid_devid for the device that is opened via group if > > it is affected by the hot reset of a device that is opened via cdev. > > > > This was proposed to support the future device fd passing usage which is > > only available in cdev path. > > I think this is fundamentally flawed because of the scope of the > dev-id. We can only provide dev-ids for devices which belong to the > same iommufd of the calling device, thus there are multiple instances > where no dev-id can be provided. The group-id and bdf are static > properties of the devices, regardless of their ownership. The bdf > provides the specific device level association while the group-id > indicates implied, static ownership. > > > To me the major confusion is from 1) and 3). 1) changes the meaning of > > _INFO and HOT_RESET, while 3) messes up the boundary. > > As above, I think 2) is also an issue. > > > Here is my thought: > > > > For 1), it was proposed due to below reason[2]. We'd like to make a scenario > > that works in the group path be workable in cdev path as well. But IMHO, we > > may just accept that cdev path cannot work for such scenario to avoid sublte > > change to uapi. Otherwise, we need to have another HOT_RESET ioctl or a > > hint in HOT_RESET ioctl to tell the kernel whether relaxed ownership check > > is expected. Maybe this is awkward. 
But if we want to keep it, we'd do it > > with the awareness by user. > > > > [2] https://lore.kernel.org/kvm/Y%2FdobS6gdSkxnPH7@nvidia.com/ > > The group association is that relaxed ownership test. Yes, there are > corner cases where we have a dual function card with separate IOMMU > groups, where a user owning function 0 could do a bus reset because > function 1 is temporarily unused, but so what, what good is that, have > we ever had an issue raised because of this? The user can't rely on > the unopened state of the other function. It's an entirely > opportunistic optimization. > > The much more typical scenario is that a multi-function device does not > provide isolation, all the functions are in the same group and because > of the association of the group the user has implied ownership of the > other devices for the purpose of a reset. > > > For 3), it was proposed when discussing the hot reset for noiommu[3]. But > > it does not make hot reset always workable for noiommu in cdev, just in > > case dev_set is singleton. So it is more of a general optimization that can > > make the kernel skip the ownership check. But to make use of it, we may > > need to test it before sanitizing the group fds from user or the iommufd > > check. Maybe the dev_set singleton test in this series is not well placed. > > If so, I can further modify it. > > > > [3] https://lore.kernel.org/kvm/ZACX+Np%2FIY7ygqL5@nvidia.com/ > > As above, this seems to be some optimization related to no-iommu for > cdev because we don't have an iommufd association for the device in > no-iommu mode. Note however that the current group interface doesn't > care about the IOMMU context of the devices. We only need proof that > the user owns the affected groups. So why are we bringing iommufd > context anywhere into this interface, here or the null-array interface? 
>
> It seems like the minor difference with cdev is that a) we're passing
> device fds rather than group fds, and b) those device fds need to be
> validated as having device access to complete the proof of ownership
> relative to the group.  Otherwise we add capabilities to
> DEVICE_GET_INFO to support the device fd passing model where the user
> doesn't know the device group or bdf and allow the reset ioctl itself
> to accept device fds (extracting the group relationship for those which
> the user has configured for access).  Thanks,

So your suggestion is to drop 1), 2), 3) and 5), keep 4), and add a new
bdf/group capability to DEVICE_GET_INFO to retrieve the group_id and bdf.
In this way, the existing _INFO ioctl can be reused without any change.
Is that right?

Regards,
Yi Liu
Hi Alex,

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Monday, April 3, 2023 11:02 PM
>
> On Mon, 3 Apr 2023 09:25:06 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
>
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Saturday, April 1, 2023 10:44 PM
> >
> > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
> > >  	if (!iommu_group)
> > >  		return -EPERM; /* Cannot reset non-isolated devices */
> >
> > Hi Alex,
> >
> > Is disabling iommu a sane way to test vfio noiommu mode?
>
> Yes
>
> > I added intel_iommu=off to disable intel iommu and bind a device to vfio-pci.
> > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0. Bind
> > iommufd==-1 can succeed, but failed to get hot reset info due to the above
> > group check. Reason is that this happens to have some affected devices, and
> > these devices have no valid iommu_group (because they are not bound to vfio-pci
> > hence nobody allocates noiommu group for them). So when hot reset info loops
> > such devices, it failed with -EPERM. Is this expected?
>
> Hmm, I didn't recall that we put in such a limitation, but given the
> minimally intrusive approach to no-iommu and the fact that we never
> defined an invalid group ID to return to the user, it makes sense that
> we just blocked the ioctl for no-iommu use.  I guess we can do the same
> for no-iommu cdev.

I just realized a further issue related to this limitation. Remember that
we may eventually compile out the vfio group infrastructure. Say I want to
test noiommu: I may boot such a kernel with the iommu disabled. I think
the _INFO ioctl would fail as there is no iommu_group. Does it mean we will
not support hot reset for noiommu in the future if the vfio group
infrastructure is compiled out?

As discussed in another thread, we are going to add a new bdf/group
capability to DEVICE_GET_INFO. If the above kernel is booted, shall we
exclude the new bdf/group capability, or add a flag in the capability to
mark the group_id as invalid?

Regards,
Yi Liu
On Fri, 7 Apr 2023 10:09:58 +0000 "Liu, Yi L" <yi.l.liu@intel.com> wrote: > Hi Alex, > > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Monday, April 3, 2023 11:02 PM > > > > On Mon, 3 Apr 2023 09:25:06 +0000 > > "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > > > > > From: Liu, Yi L <yi.l.liu@intel.com> > > > > Sent: Saturday, April 1, 2023 10:44 PM > > > > > > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void > > *data) > > > > if (!iommu_group) > > > > return -EPERM; /* Cannot reset non-isolated devices */ > > > > > > Hi Alex, > > > > > > Is disabling iommu a sane way to test vfio noiommu mode? > > > > Yes > > > > > I added intel_iommu=off to disable intel iommu and bind a device to vfio-pci. > > > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0. Bind > > > iommufd==-1 can succeed, but failed to get hot reset info due to the above > > > group check. Reason is that this happens to have some affected devices, and > > > these devices have no valid iommu_group (because they are not bound to vfio-pci > > > hence nobody allocates noiommu group for them). So when hot reset info loops > > > such devices, it failed with -EPERM. Is this expected? > > > > Hmm, I didn't recall that we put in such a limitation, but given the > > minimally intrusive approach to no-iommu and the fact that we never > > defined an invalid group ID to return to the user, it makes sense that > > we just blocked the ioctl for no-iommu use. I guess we can do the same > > for no-iommu cdev. > > I just realize a further issue related to this limitation. Remember that we > may finally compile out the vfio group infrastructure in the future. Say I > want to test noiommu, I may boot such a kernel with iommu disabled. I think > the _INFO ioctl would fail as there is no iommu_group. Does it mean we will > not support hot reset for noiommu in future if vfio group infrastructure is > compiled out? 
We're talking about IOMMU groups, IOMMU groups are always present
regardless of whether we expose a vfio group interface to userspace.
Remember, we create IOMMU groups even in the no-iommu case.  Even with
pure cdev, there are underlying IOMMU groups that maintain the DMA
ownership.

> As another thread, we are going to add a new bdf/group capability to
> DEVICE_GET_INFO. If the above kernel is booted, shall we exclude the new
> bdf/group capability or add a flag in the capability to mark the group_id
> is invalid?

As above, there's always an IOMMU group, it's never invalid.  Thanks,

Alex
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Friday, April 7, 2023 8:04 PM > > > > > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void > > > *data) > > > > > if (!iommu_group) > > > > > return -EPERM; /* Cannot reset non-isolated devices */ [1] > > > > > > > > Hi Alex, > > > > > > > > Is disabling iommu a sane way to test vfio noiommu mode? > > > > > > Yes > > > > > > > I added intel_iommu=off to disable intel iommu and bind a device to vfio-pci. > > > > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0. Bind > > > > iommufd==-1 can succeed, but failed to get hot reset info due to the above > > > > group check. Reason is that this happens to have some affected devices, and > > > > these devices have no valid iommu_group (because they are not bound to vfio- > pci > > > > hence nobody allocates noiommu group for them). So when hot reset info loops > > > > such devices, it failed with -EPERM. Is this expected? > > > > > > Hmm, I didn't recall that we put in such a limitation, but given the > > > minimally intrusive approach to no-iommu and the fact that we never > > > defined an invalid group ID to return to the user, it makes sense that > > > we just blocked the ioctl for no-iommu use. I guess we can do the same > > > for no-iommu cdev. > > > > I just realize a further issue related to this limitation. Remember that we > > may finally compile out the vfio group infrastructure in the future. Say I > > want to test noiommu, I may boot such a kernel with iommu disabled. I think > > the _INFO ioctl would fail as there is no iommu_group. Does it mean we will > > not support hot reset for noiommu in future if vfio group infrastructure is > > compiled out? > > We're talking about IOMMU groups, IOMMU groups are always present > regardless of whether we expose a vfio group interface to userspace. > Remember, we create IOMMU groups even in the no-iommu case. 
Even with
> pure cdev, there are underlying IOMMU groups that maintain the DMA
> ownership.

Hmm. Per [1], when the iommu is disabled there is no iommu_group for a
given device unless it is registered to VFIO, at which point a fake group
is created. That's why I hit the limitation [1]. When vfio_group is
compiled out, even the fake group goes away.

>
> > As another thread, we are going to add a new bdf/group capability to
> > DEVICE_GET_INFO. If the above kernel is booted, shall we exclude the new
> > bdf/group capability or add a flag in the capability to mark the group_id
> > is invalid?
>
> As above, there's always an IOMMU group, it's never invalid.  Thanks,

Regards,
Yi Liu
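Whether a given device currently has an IOMMU group can be checked from userspace through the `iommu_group` symlink in the device's sysfs directory. The following is a minimal sketch of that lookup, assuming the usual `/sys/bus/pci/devices/<bdf>/iommu_group` layout; `parse_iommu_group()` and `pci_iommu_group()` are illustrative names, not part of any library.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Extract the trailing group number from a link target; -1 if malformed. */
int parse_iommu_group(const char *target)
{
    const char *base;
    char *end;
    long id;

    assert(target != NULL);
    base = strrchr(target, '/');
    base = base ? base + 1 : target;
    if (*base == '\0')
        return -1;

    errno = 0;
    id = strtol(base, &end, 10);
    if (errno || *end != '\0' || id < 0 || id > INT_MAX)
        return -1;
    return (int)id;
}

/* Look up the IOMMU group of a PCI device such as "0000:3b:00.0". */
int pci_iommu_group(const char *bdf)
{
    char path[256], target[256];
    ssize_t len;
    int n;

    n = snprintf(path, sizeof(path),
                 "/sys/bus/pci/devices/%s/iommu_group", bdf);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    len = readlink(path, target, sizeof(target) - 1);
    if (len < 0)
        return -1;  /* no link: the device has no IOMMU group right now */
    target[len] = '\0';
    return parse_iommu_group(target);
}
```

A result of -1 for an affected device corresponds to the situation reported in this thread: with the IOMMU disabled, a device that is not bound to vfio has no group for the hot reset info path to report.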
On Fri, 7 Apr 2023 13:24:25 +0000 "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Friday, April 7, 2023 8:04 PM > > > > > > > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void > > > > *data) > > > > > > if (!iommu_group) > > > > > > return -EPERM; /* Cannot reset non-isolated devices */ > > [1] > > > > > > > > > > > Hi Alex, > > > > > > > > > > Is disabling iommu a sane way to test vfio noiommu mode? > > > > > > > > Yes > > > > > > > > > I added intel_iommu=off to disable intel iommu and bind a device to vfio-pci. > > > > > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0. Bind > > > > > iommufd==-1 can succeed, but failed to get hot reset info due to the above > > > > > group check. Reason is that this happens to have some affected devices, and > > > > > these devices have no valid iommu_group (because they are not bound to vfio- > > pci > > > > > hence nobody allocates noiommu group for them). So when hot reset info loops > > > > > such devices, it failed with -EPERM. Is this expected? > > > > > > > > Hmm, I didn't recall that we put in such a limitation, but given the > > > > minimally intrusive approach to no-iommu and the fact that we never > > > > defined an invalid group ID to return to the user, it makes sense that > > > > we just blocked the ioctl for no-iommu use. I guess we can do the same > > > > for no-iommu cdev. > > > > > > I just realize a further issue related to this limitation. Remember that we > > > may finally compile out the vfio group infrastructure in the future. Say I > > > want to test noiommu, I may boot such a kernel with iommu disabled. I think > > > the _INFO ioctl would fail as there is no iommu_group. Does it mean we will > > > not support hot reset for noiommu in future if vfio group infrastructure is > > > compiled out? 
> >
> > We're talking about IOMMU groups, IOMMU groups are always present
> > regardless of whether we expose a vfio group interface to userspace.
> > Remember, we create IOMMU groups even in the no-iommu case.  Even with
> > pure cdev, there are underlying IOMMU groups that maintain the DMA
> > ownership.
>
> hmmm. As [1], when iommu is disabled, there will be no iommu_group for a
> given device unless it is registered to VFIO, which a fake group is created.
> That's why I hit the limitation [1]. When vfio_group is compiled out, then
> even fake group goes away.

In the vfio group case, [1] can be hit with no-iommu only when there
are affected devices which are not bound to vfio.  Why are we not
allocating an IOMMU group to no-iommu devices when vfio group is
disabled?  Thanks,

Alex
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Friday, April 7, 2023 9:52 PM > > On Fri, 7 Apr 2023 13:24:25 +0000 > "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > Sent: Friday, April 7, 2023 8:04 PM > > > > > > > > > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, > void > > > > > *data) > > > > > > > if (!iommu_group) > > > > > > > return -EPERM; /* Cannot reset non-isolated devices */ > > > > [1] > > > > > > > > > > > > > > Hi Alex, > > > > > > > > > > > > Is disabling iommu a sane way to test vfio noiommu mode? > > > > > > > > > > Yes > > > > > > > > > > > I added intel_iommu=off to disable intel iommu and bind a device to vfio-pci. > > > > > > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0. > Bind > > > > > > iommufd==-1 can succeed, but failed to get hot reset info due to the above > > > > > > group check. Reason is that this happens to have some affected devices, and > > > > > > these devices have no valid iommu_group (because they are not bound to > vfio- > > > pci > > > > > > hence nobody allocates noiommu group for them). So when hot reset info > loops > > > > > > such devices, it failed with -EPERM. Is this expected? > > > > > > > > > > Hmm, I didn't recall that we put in such a limitation, but given the > > > > > minimally intrusive approach to no-iommu and the fact that we never > > > > > defined an invalid group ID to return to the user, it makes sense that > > > > > we just blocked the ioctl for no-iommu use. I guess we can do the same > > > > > for no-iommu cdev. > > > > > > > > I just realize a further issue related to this limitation. Remember that we > > > > may finally compile out the vfio group infrastructure in the future. Say I > > > > want to test noiommu, I may boot such a kernel with iommu disabled. I think > > > > the _INFO ioctl would fail as there is no iommu_group. 
Does it mean we will
> > > > not support hot reset for noiommu in future if vfio group infrastructure is
> > > > compiled out?
> > >
> > > We're talking about IOMMU groups, IOMMU groups are always present
> > > regardless of whether we expose a vfio group interface to userspace.
> > > Remember, we create IOMMU groups even in the no-iommu case.  Even with
> > > pure cdev, there are underlying IOMMU groups that maintain the DMA
> > > ownership.
> >
> > hmmm. As [1], when iommu is disabled, there will be no iommu_group for a
> > given device unless it is registered to VFIO, which a fake group is created.
> > That's why I hit the limitation [1]. When vfio_group is compiled out, then
> > even fake group goes away.
>
> In the vfio group case, [1] can be hit with no-iommu only when there
> are affected devices which are not bound to vfio.

Yes, because vfio allocates a fake group when a device is registered to it.

> Why are we not
> allocating an IOMMU group to no-iommu devices when vfio group is
> disabled?  Thanks,

Hmm. With the patch below applied and CONFIG_VFIO_GROUP=n, the vfio group
code is configured out and vfio_device_set_group() just returns 0. So when
there is no vfio group, the fake group also goes away.

https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/

Regards,
Yi Liu
On Fri, 7 Apr 2023 14:04:02 +0000 "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Friday, April 7, 2023 9:52 PM > > > > On Fri, 7 Apr 2023 13:24:25 +0000 > > "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > > Sent: Friday, April 7, 2023 8:04 PM > > > > > > > > > > > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, > > void > > > > > > *data) > > > > > > > > if (!iommu_group) > > > > > > > > return -EPERM; /* Cannot reset non-isolated devices */ > > > > > > [1] > > > > > > > > > > > > > > > > > Hi Alex, > > > > > > > > > > > > > > Is disabling iommu a sane way to test vfio noiommu mode? > > > > > > > > > > > > Yes > > > > > > > > > > > > > I added intel_iommu=off to disable intel iommu and bind a device to vfio-pci. > > > > > > > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0. > > Bind > > > > > > > iommufd==-1 can succeed, but failed to get hot reset info due to the above > > > > > > > group check. Reason is that this happens to have some affected devices, and > > > > > > > these devices have no valid iommu_group (because they are not bound to > > vfio- > > > > pci > > > > > > > hence nobody allocates noiommu group for them). So when hot reset info > > loops > > > > > > > such devices, it failed with -EPERM. Is this expected? > > > > > > > > > > > > Hmm, I didn't recall that we put in such a limitation, but given the > > > > > > minimally intrusive approach to no-iommu and the fact that we never > > > > > > defined an invalid group ID to return to the user, it makes sense that > > > > > > we just blocked the ioctl for no-iommu use. I guess we can do the same > > > > > > for no-iommu cdev. > > > > > > > > > > I just realize a further issue related to this limitation. Remember that we > > > > > may finally compile out the vfio group infrastructure in the future. 
Say I
> > > > > want to test noiommu, I may boot such a kernel with iommu disabled. I think
> > > > > the _INFO ioctl would fail as there is no iommu_group. Does it mean we will
> > > > > not support hot reset for noiommu in future if vfio group infrastructure is
> > > > > compiled out?
> > > >
> > > > We're talking about IOMMU groups, IOMMU groups are always present
> > > > regardless of whether we expose a vfio group interface to userspace.
> > > > Remember, we create IOMMU groups even in the no-iommu case.  Even with
> > > > pure cdev, there are underlying IOMMU groups that maintain the DMA
> > > > ownership.
> > >
> > > hmmm. As [1], when iommu is disabled, there will be no iommu_group for a
> > > given device unless it is registered to VFIO, which a fake group is created.
> > > That's why I hit the limitation [1]. When vfio_group is compiled out, then
> > > even fake group goes away.
> >
> > In the vfio group case, [1] can be hit with no-iommu only when there
> > are affected devices which are not bound to vfio.
>
> yes. because vfio would allocate fake group when device is registered to
> it.
>
> > Why are we not
> > allocating an IOMMU group to no-iommu devices when vfio group is
> > disabled?  Thanks,
>
> hmmm. when the vfio group code is configured out. The
> vfio_device_set_group() just returns 0 after below patch is
> applied and CONFIG_VFIO_GROUP=n. So when there is no
> vfio group, the fake group also goes away.
>
> https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/

Is this a fundamental issue or just a problem with the current
implementation proposal?  It seems like the latter.  FWIW, I also don't
see a taint happening in the cdev path for no-iommu use.  Thanks,

Alex
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Friday, April 7, 2023 11:14 PM > > On Fri, 7 Apr 2023 14:04:02 +0000 > "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > Sent: Friday, April 7, 2023 9:52 PM > > > > > > On Fri, 7 Apr 2023 13:24:25 +0000 > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > > > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > > > Sent: Friday, April 7, 2023 8:04 PM > > > > > > > > > > > > > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev > *pdev, > > > void > > > > > > > *data) > > > > > > > > > if (!iommu_group) > > > > > > > > > return -EPERM; /* Cannot reset non-isolated devices */ > > > > > > > > [1] > > > > > > > > > > > > > > > > > > > > Hi Alex, > > > > > > > > > > > > > > > > Is disabling iommu a sane way to test vfio noiommu mode? > > > > > > > > > > > > > > Yes > > > > > > > > > > > > > > > I added intel_iommu=off to disable intel iommu and bind a device to vfio- > pci. > > > > > > > > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0. > > > Bind > > > > > > > > iommufd==-1 can succeed, but failed to get hot reset info due to the > above > > > > > > > > group check. Reason is that this happens to have some affected devices, > and > > > > > > > > these devices have no valid iommu_group (because they are not bound to > > > vfio- > > > > > pci > > > > > > > > hence nobody allocates noiommu group for them). So when hot reset info > > > loops > > > > > > > > such devices, it failed with -EPERM. Is this expected? > > > > > > > > > > > > > > Hmm, I didn't recall that we put in such a limitation, but given the > > > > > > > minimally intrusive approach to no-iommu and the fact that we never > > > > > > > defined an invalid group ID to return to the user, it makes sense that > > > > > > > we just blocked the ioctl for no-iommu use. I guess we can do the same > > > > > > > for no-iommu cdev. 
> > > > > > > > > > > > I just realize a further issue related to this limitation. Remember that we > > > > > > may finally compile out the vfio group infrastructure in the future. Say I > > > > > > want to test noiommu, I may boot such a kernel with iommu disabled. I think > > > > > > the _INFO ioctl would fail as there is no iommu_group. Does it mean we will > > > > > > not support hot reset for noiommu in future if vfio group infrastructure is > > > > > > compiled out? > > > > > > > > > > We're talking about IOMMU groups, IOMMU groups are always present > > > > > regardless of whether we expose a vfio group interface to userspace. > > > > > Remember, we create IOMMU groups even in the no-iommu case. Even with > > > > > pure cdev, there are underlying IOMMU groups that maintain the DMA > > > > > ownership. > > > > > > > > hmmm. As [1], when iommu is disabled, there will be no iommu_group for a > > > > given device unless it is registered to VFIO, which a fake group is created. > > > > That's why I hit the limitation [1]. When vfio_group is compiled out, then > > > > even fake group goes away. > > > > > > In the vfio group case, [1] can be hit with no-iommu only when there > > > are affected devices which are not bound to vfio. > > > > yes. because vfio would allocate fake group when device is registered to > > it. > > > > > Why are we not > > > allocating an IOMMU group to no-iommu devices when vfio group is > > > disabled? Thanks, > > > > hmmm. when the vfio group code is configured out. The > > vfio_device_set_group() just returns 0 after below patch is > > applied and CONFIG_VFIO_GROUP=n. So when there is no > > vfio group, the fake group also goes away. > > > > https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/ > > Is this a fundamental issue or just a problem with the current > implementation proposal? It seems like the latter. FWIW, I also don't > see a taint happening in the cdev path for no-iommu use. Thanks, yes. the latter case. 
The reason I raised it here is to confirm the policy for the new group/bdf capability in DEVICE_GET_INFO. If there is no IOMMU group, perhaps I only need to exclude the new group/bdf capability from the DEVICE_GET_INFO cap chain. Is that right? Regards, Yi Liu
On Fri, 7 Apr 2023 15:47:10 +0000 "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Friday, April 7, 2023 11:14 PM > > > > On Fri, 7 Apr 2023 14:04:02 +0000 > > "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > > Sent: Friday, April 7, 2023 9:52 PM > > > > > > > > On Fri, 7 Apr 2023 13:24:25 +0000 > > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > > > > > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > > > > Sent: Friday, April 7, 2023 8:04 PM > > > > > > > > > > > > > > > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev > > *pdev, > > > > void > > > > > > > > *data) > > > > > > > > > > if (!iommu_group) > > > > > > > > > > return -EPERM; /* Cannot reset non-isolated devices */ > > > > > > > > > > [1] > > > > > > > > > > > > > > > > > > > > > > > Hi Alex, > > > > > > > > > > > > > > > > > > Is disabling iommu a sane way to test vfio noiommu mode? > > > > > > > > > > > > > > > > Yes > > > > > > > > > > > > > > > > > I added intel_iommu=off to disable intel iommu and bind a device to vfio- > > pci. > > > > > > > > > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0. > > > > Bind > > > > > > > > > iommufd==-1 can succeed, but failed to get hot reset info due to the > > above > > > > > > > > > group check. Reason is that this happens to have some affected devices, > > and > > > > > > > > > these devices have no valid iommu_group (because they are not bound to > > > > vfio- > > > > > > pci > > > > > > > > > hence nobody allocates noiommu group for them). So when hot reset info > > > > loops > > > > > > > > > such devices, it failed with -EPERM. Is this expected? 
> > > > > > > > > > > > > > > > Hmm, I didn't recall that we put in such a limitation, but given the > > > > > > > > minimally intrusive approach to no-iommu and the fact that we never > > > > > > > > defined an invalid group ID to return to the user, it makes sense that > > > > > > > > we just blocked the ioctl for no-iommu use. I guess we can do the same > > > > > > > > for no-iommu cdev. > > > > > > > > > > > > > > I just realize a further issue related to this limitation. Remember that we > > > > > > > may finally compile out the vfio group infrastructure in the future. Say I > > > > > > > want to test noiommu, I may boot such a kernel with iommu disabled. I think > > > > > > > the _INFO ioctl would fail as there is no iommu_group. Does it mean we will > > > > > > > not support hot reset for noiommu in future if vfio group infrastructure is > > > > > > > compiled out? > > > > > > > > > > > > We're talking about IOMMU groups, IOMMU groups are always present > > > > > > regardless of whether we expose a vfio group interface to userspace. > > > > > > Remember, we create IOMMU groups even in the no-iommu case. Even with > > > > > > pure cdev, there are underlying IOMMU groups that maintain the DMA > > > > > > ownership. > > > > > > > > > > hmmm. As [1], when iommu is disabled, there will be no iommu_group for a > > > > > given device unless it is registered to VFIO, which a fake group is created. > > > > > That's why I hit the limitation [1]. When vfio_group is compiled out, then > > > > > even fake group goes away. > > > > > > > > In the vfio group case, [1] can be hit with no-iommu only when there > > > > are affected devices which are not bound to vfio. > > > > > > yes. because vfio would allocate fake group when device is registered to > > > it. > > > > > > > Why are we not > > > > allocating an IOMMU group to no-iommu devices when vfio group is > > > > disabled? Thanks, > > > > > > hmmm. when the vfio group code is configured out. 
The > > > vfio_device_set_group() just returns 0 after below patch is > > > applied and CONFIG_VFIO_GROUP=n. So when there is no > > > vfio group, the fake group also goes away. > > > > > > https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/ > > > > Is this a fundamental issue or just a problem with the current > > implementation proposal? It seems like the latter. FWIW, I also don't > > see a taint happening in the cdev path for no-iommu use. Thanks, > > yes. the latter case. The reason I raised it here is to confirm the > policy on the new group/bdf capability in the DEVICE_GET_INFO. If > there is no iommu group, perhaps I only need to exclude the new > group/bdf capability from the cap chain of DEVICE_GET_INFO. is it? I think we need to revisit the question of why allocating an IOMMU group for a no-iommu device is exclusive to the vfio group support. We've already been down the path of trying to report a field that only exists for devices with certain properties with dev-id. It doesn't work well. I think we've said all along that while the cdev interface is device based, there are still going to be underlying IOMMU groups for the user to be aware of, they're just not as much a fundamental part of the interface. There should not be a case where a device doesn't have a group to report. Thanks, Alex
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Saturday, April 8, 2023 5:07 AM > > On Fri, 7 Apr 2023 15:47:10 +0000 > "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > Sent: Friday, April 7, 2023 11:14 PM > > > > > > On Fri, 7 Apr 2023 14:04:02 +0000 > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > > > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > > > Sent: Friday, April 7, 2023 9:52 PM > > > > > > > > > > On Fri, 7 Apr 2023 13:24:25 +0000 > > > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > > > > > > > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > > > > > Sent: Friday, April 7, 2023 8:04 PM > > > > > > > > > > > > > > > > > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev > > > *pdev, > > > > > void > > > > > > > > > *data) > > > > > > > > > > > if (!iommu_group) > > > > > > > > > > > return -EPERM; /* Cannot reset non-isolated devices > */ > > > > > > > > > > > > [1] > > > > > > > > > > > > > > > > > > > > > > > > > > Hi Alex, > > > > > > > > > > > > > > > > > > > > Is disabling iommu a sane way to test vfio noiommu mode? > > > > > > > > > > > > > > > > > > Yes > > > > > > > > > > > > > > > > > > > I added intel_iommu=off to disable intel iommu and bind a device to > vfio- > > > pci. > > > > > > > > > > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu- > vfio0. > > > > > Bind > > > > > > > > > > iommufd==-1 can succeed, but failed to get hot reset info due to the > > > above > > > > > > > > > > group check. Reason is that this happens to have some affected > devices, > > > and > > > > > > > > > > these devices have no valid iommu_group (because they are not > bound to > > > > > vfio- > > > > > > > pci > > > > > > > > > > hence nobody allocates noiommu group for them). So when hot reset > info > > > > > loops > > > > > > > > > > such devices, it failed with -EPERM. Is this expected? 
> > > > > > > > > > > > > > > > > > Hmm, I didn't recall that we put in such a limitation, but given the > > > > > > > > > minimally intrusive approach to no-iommu and the fact that we never > > > > > > > > > defined an invalid group ID to return to the user, it makes sense that > > > > > > > > > we just blocked the ioctl for no-iommu use. I guess we can do the same > > > > > > > > > for no-iommu cdev. > > > > > > > > > > > > > > > > I just realize a further issue related to this limitation. Remember that we > > > > > > > > may finally compile out the vfio group infrastructure in the future. Say I > > > > > > > > want to test noiommu, I may boot such a kernel with iommu disabled. I > think > > > > > > > > the _INFO ioctl would fail as there is no iommu_group. Does it mean we > will > > > > > > > > not support hot reset for noiommu in future if vfio group infrastructure is > > > > > > > > compiled out? > > > > > > > > > > > > > > We're talking about IOMMU groups, IOMMU groups are always present > > > > > > > regardless of whether we expose a vfio group interface to userspace. > > > > > > > Remember, we create IOMMU groups even in the no-iommu case. Even > with > > > > > > > pure cdev, there are underlying IOMMU groups that maintain the DMA > > > > > > > ownership. > > > > > > > > > > > > hmmm. As [1], when iommu is disabled, there will be no iommu_group for a > > > > > > given device unless it is registered to VFIO, which a fake group is created. > > > > > > That's why I hit the limitation [1]. When vfio_group is compiled out, then > > > > > > even fake group goes away. > > > > > > > > > > In the vfio group case, [1] can be hit with no-iommu only when there > > > > > are affected devices which are not bound to vfio. > > > > > > > > yes. because vfio would allocate fake group when device is registered to > > > > it. > > > > > > > > > Why are we not > > > > > allocating an IOMMU group to no-iommu devices when vfio group is > > > > > disabled? 
Thanks, > > > > > > > > hmmm. when the vfio group code is configured out. The > > > > vfio_device_set_group() just returns 0 after below patch is > > > > applied and CONFIG_VFIO_GROUP=n. So when there is no > > > > vfio group, the fake group also goes away. > > > > > > > > https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/ > > > > > > Is this a fundamental issue or just a problem with the current > > > implementation proposal? It seems like the latter. FWIW, I also don't > > > see a taint happening in the cdev path for no-iommu use. Thanks, > > > > yes. the latter case. The reason I raised it here is to confirm the > > policy on the new group/bdf capability in the DEVICE_GET_INFO. If > > there is no iommu group, perhaps I only need to exclude the new > > group/bdf capability from the cap chain of DEVICE_GET_INFO. is it? > > I think we need to revisit the question of why allocating an IOMMU > group for a no-iommu device is exclusive to the vfio group support. For no-iommu device, the iommu group is a fake group allocated by vfio. is it? And the fake group allocation is part of the vfio group code. It is the vfio_device_set_group() in group.c. If vfio group code is not compiled in, vfio does not allocate fake groups. Detail for this compiling can be found in link [1]. > We've already been down the path of trying to report a field that only > exists for devices with certain properties with dev-id. It doesn't > work well. I think we've said all along that while the cdev interface > is device based, there are still going to be underlying IOMMU groups > for the user to be aware of, they're just not as much a fundamental > part of the interface. There should not be a case where a device > doesn't have a group to report. Thanks, As the patch in link [1] makes vfio group optional, so if compile a kernel with CONFIG_VFIO_GROUP=n, and boot it with iommu disabled, then there is no group to report. 
Perhaps this is not a typical usage, but it is still a sane usage of noiommu mode, as I confirmed with you in this thread. So when that case comes up, we need to consider what to report in the group field. Perhaps I muddled the discussion by referring to a patch that is part of another series, but I think it should be considered when talking about the group to be reported. [1] https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/ Regards, Yi Liu
On Sat, 8 Apr 2023 05:07:16 +0000 "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Saturday, April 8, 2023 5:07 AM > > > > On Fri, 7 Apr 2023 15:47:10 +0000 > > "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > > Sent: Friday, April 7, 2023 11:14 PM > > > > > > > > On Fri, 7 Apr 2023 14:04:02 +0000 > > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > > > > > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > > > > Sent: Friday, April 7, 2023 9:52 PM > > > > > > > > > > > > On Fri, 7 Apr 2023 13:24:25 +0000 > > > > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > > > > > > > > > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > > > > > > Sent: Friday, April 7, 2023 8:04 PM > > > > > > > > > > > > > > > > > > > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev > > > > *pdev, > > > > > > void > > > > > > > > > > *data) > > > > > > > > > > > > if (!iommu_group) > > > > > > > > > > > > return -EPERM; /* Cannot reset non-isolated devices > > */ > > > > > > > > > > > > > > [1] > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi Alex, > > > > > > > > > > > > > > > > > > > > > > Is disabling iommu a sane way to test vfio noiommu mode? > > > > > > > > > > > > > > > > > > > > Yes > > > > > > > > > > > > > > > > > > > > > I added intel_iommu=off to disable intel iommu and bind a device to > > vfio- > > > > pci. > > > > > > > > > > > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu- > > vfio0. > > > > > > Bind > > > > > > > > > > > iommufd==-1 can succeed, but failed to get hot reset info due to the > > > > above > > > > > > > > > > > group check. 
Reason is that this happens to have some affected > > devices, > > > > and > > > > > > > > > > > these devices have no valid iommu_group (because they are not > > bound to > > > > > > vfio- > > > > > > > > pci > > > > > > > > > > > hence nobody allocates noiommu group for them). So when hot reset > > info > > > > > > loops > > > > > > > > > > > such devices, it failed with -EPERM. Is this expected? > > > > > > > > > > > > > > > > > > > > Hmm, I didn't recall that we put in such a limitation, but given the > > > > > > > > > > minimally intrusive approach to no-iommu and the fact that we never > > > > > > > > > > defined an invalid group ID to return to the user, it makes sense that > > > > > > > > > > we just blocked the ioctl for no-iommu use. I guess we can do the same > > > > > > > > > > for no-iommu cdev. > > > > > > > > > > > > > > > > > > I just realize a further issue related to this limitation. Remember that we > > > > > > > > > may finally compile out the vfio group infrastructure in the future. Say I > > > > > > > > > want to test noiommu, I may boot such a kernel with iommu disabled. I > > think > > > > > > > > > the _INFO ioctl would fail as there is no iommu_group. Does it mean we > > will > > > > > > > > > not support hot reset for noiommu in future if vfio group infrastructure is > > > > > > > > > compiled out? > > > > > > > > > > > > > > > > We're talking about IOMMU groups, IOMMU groups are always present > > > > > > > > regardless of whether we expose a vfio group interface to userspace. > > > > > > > > Remember, we create IOMMU groups even in the no-iommu case. Even > > with > > > > > > > > pure cdev, there are underlying IOMMU groups that maintain the DMA > > > > > > > > ownership. > > > > > > > > > > > > > > hmmm. As [1], when iommu is disabled, there will be no iommu_group for a > > > > > > > given device unless it is registered to VFIO, which a fake group is created. > > > > > > > That's why I hit the limitation [1]. 
When vfio_group is compiled out, then > > > > > > > even fake group goes away. > > > > > > > > > > > > In the vfio group case, [1] can be hit with no-iommu only when there > > > > > > are affected devices which are not bound to vfio. > > > > > > > > > > yes. because vfio would allocate fake group when device is registered to > > > > > it. > > > > > > > > > > > Why are we not > > > > > > allocating an IOMMU group to no-iommu devices when vfio group is > > > > > > disabled? Thanks, > > > > > > > > > > hmmm. when the vfio group code is configured out. The > > > > > vfio_device_set_group() just returns 0 after below patch is > > > > > applied and CONFIG_VFIO_GROUP=n. So when there is no > > > > > vfio group, the fake group also goes away. > > > > > > > > > > https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/ > > > > > > > > Is this a fundamental issue or just a problem with the current > > > > implementation proposal? It seems like the latter. FWIW, I also don't > > > > see a taint happening in the cdev path for no-iommu use. Thanks, > > > > > > yes. the latter case. The reason I raised it here is to confirm the > > > policy on the new group/bdf capability in the DEVICE_GET_INFO. If > > > there is no iommu group, perhaps I only need to exclude the new > > > group/bdf capability from the cap chain of DEVICE_GET_INFO. is it? > > > > I think we need to revisit the question of why allocating an IOMMU > > group for a no-iommu device is exclusive to the vfio group support. > > For no-iommu device, the iommu group is a fake group allocated by vfio. > is it? And the fake group allocation is part of the vfio group code. > It is the vfio_device_set_group() in group.c. If vfio group code is not > compiled in, vfio does not allocate fake groups. Detail for this compiling > can be found in link [1]. > > > We've already been down the path of trying to report a field that only > > exists for devices with certain properties with dev-id. It doesn't > > work well. 
I think we've said all along that while the cdev interface > > is device based, there are still going to be underlying IOMMU groups > > for the user to be aware of, they're just not as much a fundamental > > part of the interface. There should not be a case where a device > > doesn't have a group to report. Thanks, > > As the patch in link [1] makes vfio group optional, so if compile a kernel > with CONFIG_VFIO_GROUP=n, and boot it with iommu disabled, then there is no > group to report. Perhaps this is not a typical usage but still a sane usage > for noiommu mode as I confirmed with you in this thread. So when it comes, > needs to consider what to report for the group field. > > Perhaps I messed up the discussion by referring to a patch that is part of > another series. But I think it should be considered when talking about the > group to be reported. The question is whether the split that group.c code handles both the vfio group AND creation of the IOMMU group in such cases is the correct split. I'm not arguing that the way the code is currently laid out has the fake IOMMU group for no-iommu devices created in vfio group specific code, but we have a common interface that makes use of IOMMU group information for which we don't have an equivalent alternative data field to report. We've shown that dev-id doesn't work here because dev-ids only exist for devices within the user's IOMMU context. Also reporting an invalid ID of any sort fails to indicate the potential implied ownership. Therefore I recognize that if this interface is to report an IOMMU group, then the creation of fake IOMMU groups existing only in vfio group code would need to be refactored. Thanks, Alex
On 2023/4/8 22:20, Alex Williamson wrote: > On Sat, 8 Apr 2023 05:07:16 +0000 > "Liu, Yi L" <yi.l.liu@intel.com> wrote: > >>> From: Alex Williamson <alex.williamson@redhat.com> >>> Sent: Saturday, April 8, 2023 5:07 AM >>> >>> On Fri, 7 Apr 2023 15:47:10 +0000 >>> "Liu, Yi L" <yi.l.liu@intel.com> wrote: >>> >>>>> From: Alex Williamson <alex.williamson@redhat.com> >>>>> Sent: Friday, April 7, 2023 11:14 PM >>>>> >>>>> On Fri, 7 Apr 2023 14:04:02 +0000 >>>>> "Liu, Yi L" <yi.l.liu@intel.com> wrote: >>>>> >>>>>>> From: Alex Williamson <alex.williamson@redhat.com> >>>>>>> Sent: Friday, April 7, 2023 9:52 PM >>>>>>> >>>>>>> On Fri, 7 Apr 2023 13:24:25 +0000 >>>>>>> "Liu, Yi L" <yi.l.liu@intel.com> wrote: >>>>>>> >>>>>>>>> From: Alex Williamson <alex.williamson@redhat.com> >>>>>>>>> Sent: Friday, April 7, 2023 8:04 PM >>>>>>>>> >>>>>>>>>>>>> @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev >>>>> *pdev, >>>>>>> void >>>>>>>>>>> *data) >>>>>>>>>>>>> if (!iommu_group) >>>>>>>>>>>>> return -EPERM; /* Cannot reset non-isolated devices >>> */ >>>>>>>> >>>>>>>> [1] >>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Hi Alex, >>>>>>>>>>>> >>>>>>>>>>>> Is disabling iommu a sane way to test vfio noiommu mode? >>>>>>>>>>> >>>>>>>>>>> Yes >>>>>>>>>>> >>>>>>>>>>>> I added intel_iommu=off to disable intel iommu and bind a device to >>> vfio- >>>>> pci. >>>>>>>>>>>> I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu- >>> vfio0. >>>>>>> Bind >>>>>>>>>>>> iommufd==-1 can succeed, but failed to get hot reset info due to the >>>>> above >>>>>>>>>>>> group check. Reason is that this happens to have some affected >>> devices, >>>>> and >>>>>>>>>>>> these devices have no valid iommu_group (because they are not >>> bound to >>>>>>> vfio- >>>>>>>>> pci >>>>>>>>>>>> hence nobody allocates noiommu group for them). So when hot reset >>> info >>>>>>> loops >>>>>>>>>>>> such devices, it failed with -EPERM. Is this expected? 
>>>>>>>>>>> >>>>>>>>>>> Hmm, I didn't recall that we put in such a limitation, but given the >>>>>>>>>>> minimally intrusive approach to no-iommu and the fact that we never >>>>>>>>>>> defined an invalid group ID to return to the user, it makes sense that >>>>>>>>>>> we just blocked the ioctl for no-iommu use. I guess we can do the same >>>>>>>>>>> for no-iommu cdev. >>>>>>>>>> >>>>>>>>>> I just realize a further issue related to this limitation. Remember that we >>>>>>>>>> may finally compile out the vfio group infrastructure in the future. Say I >>>>>>>>>> want to test noiommu, I may boot such a kernel with iommu disabled. I >>> think >>>>>>>>>> the _INFO ioctl would fail as there is no iommu_group. Does it mean we >>> will >>>>>>>>>> not support hot reset for noiommu in future if vfio group infrastructure is >>>>>>>>>> compiled out? >>>>>>>>> >>>>>>>>> We're talking about IOMMU groups, IOMMU groups are always present >>>>>>>>> regardless of whether we expose a vfio group interface to userspace. >>>>>>>>> Remember, we create IOMMU groups even in the no-iommu case. Even >>> with >>>>>>>>> pure cdev, there are underlying IOMMU groups that maintain the DMA >>>>>>>>> ownership. >>>>>>>> >>>>>>>> hmmm. As [1], when iommu is disabled, there will be no iommu_group for a >>>>>>>> given device unless it is registered to VFIO, which a fake group is created. >>>>>>>> That's why I hit the limitation [1]. When vfio_group is compiled out, then >>>>>>>> even fake group goes away. >>>>>>> >>>>>>> In the vfio group case, [1] can be hit with no-iommu only when there >>>>>>> are affected devices which are not bound to vfio. >>>>>> >>>>>> yes. because vfio would allocate fake group when device is registered to >>>>>> it. >>>>>> >>>>>>> Why are we not >>>>>>> allocating an IOMMU group to no-iommu devices when vfio group is >>>>>>> disabled? Thanks, >>>>>> >>>>>> hmmm. when the vfio group code is configured out. 
The >>>>>> vfio_device_set_group() just returns 0 after below patch is >>>>>> applied and CONFIG_VFIO_GROUP=n. So when there is no >>>>>> vfio group, the fake group also goes away. >>>>>> >>>>>> https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/ >>>>> >>>>> Is this a fundamental issue or just a problem with the current >>>>> implementation proposal? It seems like the latter. FWIW, I also don't >>>>> see a taint happening in the cdev path for no-iommu use. Thanks, >>>> >>>> yes. the latter case. The reason I raised it here is to confirm the >>>> policy on the new group/bdf capability in the DEVICE_GET_INFO. If >>>> there is no iommu group, perhaps I only need to exclude the new >>>> group/bdf capability from the cap chain of DEVICE_GET_INFO. is it? >>> >>> I think we need to revisit the question of why allocating an IOMMU >>> group for a no-iommu device is exclusive to the vfio group support. >> >> For no-iommu device, the iommu group is a fake group allocated by vfio. >> is it? And the fake group allocation is part of the vfio group code. >> It is the vfio_device_set_group() in group.c. If vfio group code is not >> compiled in, vfio does not allocate fake groups. Detail for this compiling >> can be found in link [1]. >> >>> We've already been down the path of trying to report a field that only >>> exists for devices with certain properties with dev-id. It doesn't >>> work well. I think we've said all along that while the cdev interface >>> is device based, there are still going to be underlying IOMMU groups >>> for the user to be aware of, they're just not as much a fundamental >>> part of the interface. There should not be a case where a device >>> doesn't have a group to report. Thanks, >> >> As the patch in link [1] makes vfio group optional, so if compile a kernel >> with CONFIG_VFIO_GROUP=n, and boot it with iommu disabled, then there is no >> group to report. 
Perhaps this is not a typical usage but still a sane usage >> for noiommu mode as I confirmed with you in this thread. So when it comes, >> needs to consider what to report for the group field. >> >> Perhaps I messed up the discussion by referring to a patch that is part of >> another series. But I think it should be considered when talking about the >> group to be reported.
>
> The question is whether the split that group.c code handles both the
> vfio group AND creation of the IOMMU group in such cases is the correct
> split. I'm not arguing that the way the code is currently laid out has
> the fake IOMMU group for no-iommu devices created in vfio group
> specific code, but we have a common interface that makes use of IOMMU
> group information for which we don't have an equivalent alternative
> data field to report.

Yes. It is needed so that _HOT_RESET_INFO works for noiommu devices.

> We've shown that dev-id doesn't work here because dev-ids only exist
> for devices within the user's IOMMU context. Also reporting an invalid
> ID of any sort fails to indicate the potential implied ownership.
> Therefore I recognize that if this interface is to report an IOMMU
> group, then the creation of fake IOMMU groups existing only in vfio
> group code would need to be refactored. Thanks,

Yeah, the IOMMU group creation needs to move back to vfio_main.c. That would be a prerequisite for [1].

[1] https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/

I'll also try out your suggestion to add a capability like the one below and link it into the vfio_device_info cap chain.

#define VFIO_DEVICE_INFO_CAP_PCI_BDF 5

struct vfio_device_info_cap_pci_bdf {
	struct vfio_info_cap_header header;
	__u32 group_id;
	__u16 segment;
	__u8 bus;
	__u8 devfn; /* Use PCI_SLOT/PCI_FUNC */
};
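
For illustration only, here is a rough userspace-side sketch of how such a capability might be located in an info buffer and how devfn would be decoded. It assumes the capability keeps the layout proposed above and that `next` is an offset from the start of the info structure, with 0 ending the chain, as in the existing vfio cap chains. The names `cap_header`, `cap_pci_bdf`, `CAP_PCI_BDF_ID`, `SLOT`, `FUNC` and `find_cap` are hypothetical local stand-ins, not uapi definitions; the two macros mirror what PCI_SLOT/PCI_FUNC compute.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the vfio capability header layout (hypothetical name). */
struct cap_header {
	uint16_t id;
	uint16_t version;
	uint32_t next;	/* offset of the next cap from the buffer start, 0 ends the chain */
};

/* ID and layout as proposed in this thread only; an assumption, not a released uapi. */
#define CAP_PCI_BDF_ID 5

struct cap_pci_bdf {
	struct cap_header header;
	uint32_t group_id;
	uint16_t segment;
	uint8_t bus;
	uint8_t devfn;
};

/* Same arithmetic as PCI_SLOT()/PCI_FUNC(): slot in bits 7:3, function in bits 2:0. */
#define SLOT(devfn) (((devfn) >> 3) & 0x1f)
#define FUNC(devfn) ((devfn) & 0x07)

/*
 * Walk the cap chain in buf starting at offset first and copy the first
 * capability whose id matches into out. Returns 0 on success, -1 if the
 * capability is absent or the chain is malformed.
 */
static int find_cap(const uint8_t *buf, size_t len, uint32_t first,
		    uint16_t id, void *out, size_t out_len)
{
	uint32_t off = first;
	size_t hops = 0;

	while (off != 0) {
		struct cap_header hdr;

		/* Reject offsets that leave no room for a header. */
		if (off > len || len - off < sizeof(hdr))
			return -1;
		/* Guard against a chain that loops back on itself. */
		if (++hops > len / sizeof(hdr))
			return -1;

		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.id == id) {
			if (len - off < out_len)
				return -1;
			memcpy(out, buf + off, out_len);
			return 0;
		}
		off = hdr.next;
	}
	return -1;
}
```

A caller would pass the buffer filled by the info ioctl, its size, and the reported first cap offset, then read group_id, segment, bus, SLOT(devfn) and FUNC(devfn) from the copied structure. A return of -1 covers the case raised earlier in the thread where the capability is simply left out of the chain.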
On Sun, 9 Apr 2023 19:58:47 +0800 Yi Liu <yi.l.liu@intel.com> wrote: > On 2023/4/8 22:20, Alex Williamson wrote: > > On Sat, 8 Apr 2023 05:07:16 +0000 > > "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > > >>> From: Alex Williamson <alex.williamson@redhat.com> > >>> Sent: Saturday, April 8, 2023 5:07 AM > >>> > >>> On Fri, 7 Apr 2023 15:47:10 +0000 > >>> "Liu, Yi L" <yi.l.liu@intel.com> wrote: > >>> > >>>>> From: Alex Williamson <alex.williamson@redhat.com> > >>>>> Sent: Friday, April 7, 2023 11:14 PM > >>>>> > >>>>> On Fri, 7 Apr 2023 14:04:02 +0000 > >>>>> "Liu, Yi L" <yi.l.liu@intel.com> wrote: > >>>>> > >>>>>>> From: Alex Williamson <alex.williamson@redhat.com> > >>>>>>> Sent: Friday, April 7, 2023 9:52 PM > >>>>>>> > >>>>>>> On Fri, 7 Apr 2023 13:24:25 +0000 > >>>>>>> "Liu, Yi L" <yi.l.liu@intel.com> wrote: > >>>>>>> > >>>>>>>>> From: Alex Williamson <alex.williamson@redhat.com> > >>>>>>>>> Sent: Friday, April 7, 2023 8:04 PM > >>>>>>>>> > >>>>>>>>>>>>> @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev > >>>>> *pdev, > >>>>>>> void > >>>>>>>>>>> *data) > >>>>>>>>>>>>> if (!iommu_group) > >>>>>>>>>>>>> return -EPERM; /* Cannot reset non-isolated devices > >>> */ > >>>>>>>> > >>>>>>>> [1] > >>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> Hi Alex, > >>>>>>>>>>>> > >>>>>>>>>>>> Is disabling iommu a sane way to test vfio noiommu mode? > >>>>>>>>>>> > >>>>>>>>>>> Yes > >>>>>>>>>>> > >>>>>>>>>>>> I added intel_iommu=off to disable intel iommu and bind a device to > >>> vfio- > >>>>> pci. > >>>>>>>>>>>> I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu- > >>> vfio0. > >>>>>>> Bind > >>>>>>>>>>>> iommufd==-1 can succeed, but failed to get hot reset info due to the > >>>>> above > >>>>>>>>>>>> group check. 
Reason is that this happens to have some affected > >>> devices, > >>>>> and > >>>>>>>>>>>> these devices have no valid iommu_group (because they are not > >>> bound to > >>>>>>> vfio- > >>>>>>>>> pci > >>>>>>>>>>>> hence nobody allocates noiommu group for them). So when hot reset > >>> info > >>>>>>> loops > >>>>>>>>>>>> such devices, it failed with -EPERM. Is this expected? > >>>>>>>>>>> > >>>>>>>>>>> Hmm, I didn't recall that we put in such a limitation, but given the > >>>>>>>>>>> minimally intrusive approach to no-iommu and the fact that we never > >>>>>>>>>>> defined an invalid group ID to return to the user, it makes sense that > >>>>>>>>>>> we just blocked the ioctl for no-iommu use. I guess we can do the same > >>>>>>>>>>> for no-iommu cdev. > >>>>>>>>>> > >>>>>>>>>> I just realize a further issue related to this limitation. Remember that we > >>>>>>>>>> may finally compile out the vfio group infrastructure in the future. Say I > >>>>>>>>>> want to test noiommu, I may boot such a kernel with iommu disabled. I > >>> think > >>>>>>>>>> the _INFO ioctl would fail as there is no iommu_group. Does it mean we > >>> will > >>>>>>>>>> not support hot reset for noiommu in future if vfio group infrastructure is > >>>>>>>>>> compiled out? > >>>>>>>>> > >>>>>>>>> We're talking about IOMMU groups, IOMMU groups are always present > >>>>>>>>> regardless of whether we expose a vfio group interface to userspace. > >>>>>>>>> Remember, we create IOMMU groups even in the no-iommu case. Even > >>> with > >>>>>>>>> pure cdev, there are underlying IOMMU groups that maintain the DMA > >>>>>>>>> ownership. > >>>>>>>> > >>>>>>>> hmmm. As [1], when iommu is disabled, there will be no iommu_group for a > >>>>>>>> given device unless it is registered to VFIO, which a fake group is created. > >>>>>>>> That's why I hit the limitation [1]. When vfio_group is compiled out, then > >>>>>>>> even fake group goes away. 
> >>>>>>> > >>>>>>> In the vfio group case, [1] can be hit with no-iommu only when there > >>>>>>> are affected devices which are not bound to vfio. > >>>>>> > >>>>>> yes. because vfio would allocate fake group when device is registered to > >>>>>> it. > >>>>>> > >>>>>>> Why are we not > >>>>>>> allocating an IOMMU group to no-iommu devices when vfio group is > >>>>>>> disabled? Thanks, > >>>>>> > >>>>>> hmmm. when the vfio group code is configured out. The > >>>>>> vfio_device_set_group() just returns 0 after below patch is > >>>>>> applied and CONFIG_VFIO_GROUP=n. So when there is no > >>>>>> vfio group, the fake group also goes away. > >>>>>> > >>>>>> https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/ > >>>>> > >>>>> Is this a fundamental issue or just a problem with the current > >>>>> implementation proposal? It seems like the latter. FWIW, I also don't > >>>>> see a taint happening in the cdev path for no-iommu use. Thanks, > >>>> > >>>> yes. the latter case. The reason I raised it here is to confirm the > >>>> policy on the new group/bdf capability in the DEVICE_GET_INFO. If > >>>> there is no iommu group, perhaps I only need to exclude the new > >>>> group/bdf capability from the cap chain of DEVICE_GET_INFO. is it? > >>> > >>> I think we need to revisit the question of why allocating an IOMMU > >>> group for a no-iommu device is exclusive to the vfio group support. > >> > >> For no-iommu device, the iommu group is a fake group allocated by vfio. > >> is it? And the fake group allocation is part of the vfio group code. > >> It is the vfio_device_set_group() in group.c. If vfio group code is not > >> compiled in, vfio does not allocate fake groups. Detail for this compiling > >> can be found in link [1]. > >> > >>> We've already been down the path of trying to report a field that only > >>> exists for devices with certain properties with dev-id. It doesn't > >>> work well. 
I think we've said all along that while the cdev interface > >>> is device based, there are still going to be underlying IOMMU groups > >>> for the user to be aware of, they're just not as much a fundamental > >>> part of the interface. There should not be a case where a device > >>> doesn't have a group to report. Thanks, > >> > >> As the patch in link [1] makes vfio group optional, so if compile a kernel > >> with CONFIG_VFIO_GROUP=n, and boot it with iommu disabled, then there is no > >> group to report. Perhaps this is not a typical usage but still a sane usage > >> for noiommu mode as I confirmed with you in this thread. So when it comes, > >> needs to consider what to report for the group field. > >> > >> Perhaps I messed up the discussion by referring to a patch that is part of > >> another series. But I think it should be considered when talking about the > >> group to be reported. > > > > The question is whether the split that group.c code handles both the > > vfio group AND creation of the IOMMU group in such cases is the correct > > split. I'm not arguing that the way the code is currently laid out has > > the fake IOMMU group for no-iommu devices created in vfio group > > specific code, but we have a common interface that makes use of IOMMU > > group information for which we don't have an equivalent alternative > > data field to report. > > yes. It is needed to ensure _HOT_RESET_INFO workable for noiommu devices. > > > We've shown that dev-id doesn't work here because dev-ids only exist > > for devices within the user's IOMMU context. Also reporting an invalid > > ID of any sort fails to indicate the potential implied ownership. > > Therefore I recognize that if this interface is to report an IOMMU > > group, then the creation of fake IOMMU groups existing only in vfio > > group code would need to be refactored. Thanks, > > yeah, needs to move the iommu group creation back to vfio_main.c. 
This
> would be a prerequisite for [1]
>
> [1] https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/
>
> I'll also try out your suggestion to add a capability like below and link
> it in the vfio_device_info cap chain.
>
> #define VFIO_DEVICE_INFO_CAP_PCI_BDF 5
>
> struct vfio_device_info_cap_pci_bdf {
> 	struct vfio_info_cap_header header;
> 	__u32 group_id;
> 	__u16 segment;
> 	__u8 bus;
> 	__u8 devfn; /* Use PCI_SLOT/PCI_FUNC */
> };
>

Group-id and bdf should be separate capabilities: all devices should
report a group-id capability, and only PCI devices a bdf capability.

Thanks,
Alex
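The split suggested here (one group-id capability for every device, one bdf capability for PCI devices only) can be modeled in plain userspace C. This is only a sketch under stated assumptions, not kernel code: the header layout mirrors vfio_info_cap_header, but the capability IDs, the two payload structs, and the helpers build_example() and find_cap() are hypothetical names chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same field layout as vfio_info_cap_header. */
struct cap_header {
    uint16_t id;
    uint16_t version;
    uint32_t next;      /* offset of the next capability, 0 ends the chain */
};

/* HYPOTHETICAL capability IDs and payloads, for illustration only. */
#define CAP_GROUP_ID 100
#define CAP_PCI_BDF  101

struct cap_group_id {
    struct cap_header header;
    uint32_t group_id;
};

struct cap_pci_bdf {
    struct cap_header header;
    uint16_t segment;
    uint8_t bus;
    uint8_t devfn;
};

/* Return the offset of the capability with the given id, or 0 if absent. */
static uint32_t find_cap(const uint8_t *buf, size_t len, uint32_t first,
                         uint16_t id)
{
    uint32_t off = first;

    while (off != 0 && (size_t)off + sizeof(struct cap_header) <= len) {
        struct cap_header hdr;

        memcpy(&hdr, buf + off, sizeof(hdr));
        if (hdr.id == id)
            return off;
        if (hdr.next <= off)    /* end of chain or malformed backward link */
            break;
        off = hdr.next;
    }
    return 0;
}

/*
 * Build a chain in buf: every device gets the group-id capability,
 * only a PCI device also gets the bdf capability.
 * Returns the offset of the first capability, or 0 if buf is too small.
 */
static uint32_t build_example(uint8_t *buf, size_t len, int is_pci)
{
    struct cap_group_id grp = { .header = { CAP_GROUP_ID, 1, 0 }, .group_id = 7 };
    struct cap_pci_bdf bdf = { .header = { CAP_PCI_BDF, 1, 0 },
                               .segment = 0, .bus = 1, .devfn = 0x01 };
    const uint32_t grp_off = 8, bdf_off = 24;

    if (len < bdf_off + sizeof(bdf))
        return 0;
    memset(buf, 0, len);
    if (is_pci) {
        grp.header.next = bdf_off;
        memcpy(buf + bdf_off, &bdf, sizeof(bdf));
    }
    memcpy(buf + grp_off, &grp, sizeof(grp));
    return grp_off;
}
```

With this layout a non-PCI device simply has no bdf entry in its chain, so userspace does not need a flag to mark the field as invalid.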
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Sunday, April 9, 2023 9:30 PM
[...]
> > yeah, needs to move the iommu group creation back to vfio_main.c. This
> > would be a prerequisite for [1]
> >
> > [1] https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/
> >
> > I'll also try out your suggestion to add a capability like below and link
> > it in the vfio_device_info cap chain.
> >
> > #define VFIO_DEVICE_INFO_CAP_PCI_BDF 5
> >
> > struct vfio_device_info_cap_pci_bdf {
> > 	struct vfio_info_cap_header header;
> > 	__u32 group_id;
> > 	__u16 segment;
> > 	__u8 bus;
> > 	__u8 devfn; /* Use PCI_SLOT/PCI_FUNC */
> > };
> >
>
> Group-id and bdf should be separate capabilities, all device should
> report a group-id capability and only PCI devices a bdf capability.

OK. Since this is to support the device fd passing usage, we need to
let all the vfio device drivers report the group-id capability, is that
right? So we may have a helper like the one below in vfio_main.c. How
about the sample drivers? It seems unnecessary for them, right?

int vfio_pci_info_add_group_cap(struct device *dev, struct vfio_info_cap *caps)
{
	struct vfio_pci_device_info_cap_group cap = {
		.header.id = VFIO_DEVICE_INFO_CAP_GROUP_ID,
		.header.version = 1,
	};
	struct iommu_group *iommu_group;

	/* Look up the group from the struct device passed in */
	iommu_group = iommu_group_get(dev);
	if (!iommu_group) {
		kfree(caps->buf);
		return -EPERM;
	}

	cap.group_id = iommu_group_id(iommu_group);

	iommu_group_put(iommu_group);

	return vfio_info_add_capability(caps, &cap.header, sizeof(cap));
}

Regards,
Yi Liu
On Mon, 10 Apr 2023 08:48:54 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Sunday, April 9, 2023 9:30 PM
> [...]
> > > yeah, needs to move the iommu group creation back to vfio_main.c. This
> > > would be a prerequisite for [1]
> > >
> > > [1] https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/
> > >
> > > I'll also try out your suggestion to add a capability like below and link
> > > it in the vfio_device_info cap chain.
> > >
> > > #define VFIO_DEVICE_INFO_CAP_PCI_BDF 5
> > >
> > > struct vfio_device_info_cap_pci_bdf {
> > > 	struct vfio_info_cap_header header;
> > > 	__u32 group_id;
> > > 	__u16 segment;
> > > 	__u8 bus;
> > > 	__u8 devfn; /* Use PCI_SLOT/PCI_FUNC */
> > > };
> > >
> >
> > Group-id and bdf should be separate capabilities, all device should
> > report a group-id capability and only PCI devices a bdf capability.
>
> ok. Since this is to support the device fd passing usage, so we need to
> let all the vfio device drivers report group-id capability. is it? So may
> have a below helper in vfio_main.c. How about the sample drivers?
> seems not necessary for them. right?

The more common we can make it, the better, but if it ends up that the
individual drivers need to initialize the capability then it would
probably be limited to those drivers with a need to expose the group.
That includes sample drivers, for the purpose of illustrating the
interface, and of course anything based on vfio-pci-core, which exposes
hot-reset. Thanks,

Alex
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Monday, April 10, 2023 10:41 PM
>
> On Mon, 10 Apr 2023 08:48:54 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
>
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Sunday, April 9, 2023 9:30 PM
> > [...]
> > > > yeah, needs to move the iommu group creation back to vfio_main.c. This
> > > > would be a prerequisite for [1]
> > > >
> > > > [1] https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/
> > > >
> > > > I'll also try out your suggestion to add a capability like below and link
> > > > it in the vfio_device_info cap chain.
> > > >
> > > > #define VFIO_DEVICE_INFO_CAP_PCI_BDF 5
> > > >
> > > > struct vfio_device_info_cap_pci_bdf {
> > > > 	struct vfio_info_cap_header header;
> > > > 	__u32 group_id;
> > > > 	__u16 segment;
> > > > 	__u8 bus;
> > > > 	__u8 devfn; /* Use PCI_SLOT/PCI_FUNC */
> > > > };
> > > >
> > >
> > > Group-id and bdf should be separate capabilities, all device should
> > > report a group-id capability and only PCI devices a bdf capability.
> >
> > ok. Since this is to support the device fd passing usage, so we need to
> > let all the vfio device drivers report group-id capability. is it? So may
> > have a below helper in vfio_main.c. How about the sample drivers?
> > seems not necessary for them. right?
>
> The more common we can make it, the better, but if it ends up that the
> individual drivers need to initialize the capability then it would
> probably be limited to those driver with a need to expose the group.

This looks to be such a case. vfio_device_info is assembled by the
individual drivers, so reporting the group_id capability as a common
behavior means changing all of them.
I have a quick draft for it in the commit below:

https://github.com/yiliu1765/iommufd/commit/ff4b8bee90761961041126305183a9a7e0f0542d
https://github.com/yiliu1765/iommufd/commits/report_group_id

> Sample drivers for the purpose of illustrating the interface and of
> course anything based on vfio-pci-core which exposes hot-reset. Thanks

Do you see any sample drivers that need to report the group_id cap?
IMHO, it seems not.

Regards,
Yi Liu
On Mon, 10 Apr 2023 15:18:27 +0000 "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Monday, April 10, 2023 10:41 PM > > > > On Mon, 10 Apr 2023 08:48:54 +0000 > > "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > > Sent: Sunday, April 9, 2023 9:30 PM > > > [...] > > > > > yeah, needs to move the iommu group creation back to vfio_main.c. This > > > > > would be a prerequisite for [1] > > > > > > > > > > [1] https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/ > > > > > > > > > > I'll also try out your suggestion to add a capability like below and link > > > > > it in the vfio_device_info cap chain. > > > > > > > > > > #define VFIO_DEVICE_INFO_CAP_PCI_BDF 5 > > > > > > > > > > struct vfio_device_info_cap_pci_bdf { > > > > > struct vfio_info_cap_header header; > > > > > __u32 group_id; > > > > > __u16 segment; > > > > > __u8 bus; > > > > > __u8 devfn; /* Use PCI_SLOT/PCI_FUNC */ > > > > > }; > > > > > > > > > > > > > Group-id and bdf should be separate capabilities, all device should > > > > report a group-id capability and only PCI devices a bdf capability. > > > > > > ok. Since this is to support the device fd passing usage, so we need to > > > let all the vfio device drivers report group-id capability. is it? So may > > > have a below helper in vfio_main.c. How about the sample drivers? > > > seems not necessary for them. right? > > > > The more common we can make it, the better, but if it ends up that the > > individual drivers need to initialize the capability then it would > > probably be limited to those driver with a need to expose the group. > > looks to be such a case. vfio_device_info is assembled by the individual > drivers. If want to report group_id capability as a common behavior, needs > to change all of them. 
Had a quick draft for it as below commit:
>
> https://github.com/yiliu1765/iommufd/commit/ff4b8bee90761961041126305183a9a7e0f0542d
>
> https://github.com/yiliu1765/iommufd/commits/report_group_id
>
> > Sample drivers for the purpose of illustrating the interface and of
> > course anything based on vfio-pci-core which exposes hot-reset. Thanks
>
> do you see any sample drivers need to report group_id cap? IMHO, seems
> no.

As in the quoted text, part of the purpose of the sample drivers is to
act both as a proof-of-concept and illustration of the API, therefore
gratuitous exposure of such capabilities should be encouraged. They
would also provide a proof point of an mdev device, ie. emulated IOMMU
device, exposing the capability. Thanks,

Alex
Hi Alex, > From: Alex Williamson <alex.williamson@redhat.com> > Sent: Friday, April 7, 2023 8:04 PM > > On Fri, 7 Apr 2023 10:09:58 +0000 > "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > > Hi Alex, > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > Sent: Monday, April 3, 2023 11:02 PM > > > > > > On Mon, 3 Apr 2023 09:25:06 +0000 > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > > > > > > > From: Liu, Yi L <yi.l.liu@intel.com> > > > > > Sent: Saturday, April 1, 2023 10:44 PM > > > > > > > > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void > > > *data) > > > > > if (!iommu_group) > > > > > return -EPERM; /* Cannot reset non-isolated devices */ > > > > > > > > Hi Alex, > > > > > > > > Is disabling iommu a sane way to test vfio noiommu mode? > > > > > > Yes > > > > > > > I added intel_iommu=off to disable intel iommu and bind a device to vfio-pci. > > > > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0. Bind > > > > iommufd==-1 can succeed, but failed to get hot reset info due to the above > > > > group check. Reason is that this happens to have some affected devices, and > > > > these devices have no valid iommu_group (because they are not bound to vfio- > pci > > > > hence nobody allocates noiommu group for them). So when hot reset info loops > > > > such devices, it failed with -EPERM. Is this expected? > > > > > > Hmm, I didn't recall that we put in such a limitation, but given the > > > minimally intrusive approach to no-iommu and the fact that we never > > > defined an invalid group ID to return to the user, it makes sense that > > > we just blocked the ioctl for no-iommu use. I guess we can do the same > > > for no-iommu cdev. > > > > I just realize a further issue related to this limitation. Remember that we > > may finally compile out the vfio group infrastructure in the future. Say I > > want to test noiommu, I may boot such a kernel with iommu disabled. 
I think
> > the _INFO ioctl would fail as there is no iommu_group. Does it mean we will
> > not support hot reset for noiommu in future if vfio group infrastructure is
> > compiled out?
>
> We're talking about IOMMU groups, IOMMU groups are always present
> regardless of whether we expose a vfio group interface to userspace.
> Remember, we create IOMMU groups even in the no-iommu case. Even with
> pure cdev, there are underlying IOMMU groups that maintain the DMA
> ownership.

I just realized that there is one case that does not have an iommu
group, although it is not implemented yet. There was a discussion on
SIOV support. IIRC, it was agreed that there is no need to allocate an
iommu_group for the SIOV case. Kevin or Jason can keep me honest here.
I failed to find the link to this discussion.

> > As another thread, we are going to add a new bdf/group capability to
> > DEVICE_GET_INFO. If the above kernel is booted, shall we exclude the new
> > bdf/group capability or add a flag in the capability to mark the group_id
> > is invalid?
>
> As above, there's always an IOMMU group, it's never invalid. Thanks,

Regards,
Yi Liu
On Thu, Apr 06, 2023 at 11:53:47AM -0600, Alex Williamson wrote:

> Where whether a device is opened is subject to change outside of the
> user's control. This essentially allows the user to perform hot-resets
> of devices outside of their ownership so long as the device is not
> used elsewhere, versus the current requirement that the user own all the
> affected groups, which implies device ownership. It's not been
> justified why this feature needs to exist, imo.

The cdev API doesn't have the notion that owning a group means you
"own" some collection of devices. It still happens as a side effect,
but it isn't obviously part of the API. I'm really loath to
re-introduce that group-based concept just for this. We are trying
to reduce the group API surface.

How about a different direction?

We add a new uAPI for cdev mode that is "take ownership of the reset
group". Maybe it can be a flag during bind.

When requested, vfio will ensure that every device in the reset group
is only bound to this iommufd_ctx or left closed. Now and in the
future. Since no-iommu has no iommufd_ctx this means we can open only
one device in the reset group.

With this flag RESET is guaranteed to always work by definition.

We continue with the zero-length FD, but we can just replace the
security checks with a check if we are in reset group ownership mode.

_INFO is unchanged.

We decide if we add a new IOCTL to return the BDF so the existing
_INFO can get back to the dev_id or a new IOCTL that returns the
dev_id list of the reset group.

Userspace is required to figure out the extent of the reset, but we
don't require that userspace prove to the kernel it did this when
requesting the reset.

Jason
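The rule proposed above, "every device in the reset group is only bound to this iommufd_ctx or left closed", can be written down as a small userspace model. This is a minimal sketch and not kernel code: struct model_dev, the integer context id, and can_own_reset_group() are hypothetical names used only to illustrate the check.

```c
#include <stdbool.h>
#include <stddef.h>

/* HYPOTHETICAL model of one device in a reset group (dev_set). */
struct model_dev {
    bool opened;    /* a device fd is currently open */
    int ctx_id;     /* context the device is bound to, valid only if opened */
};

/*
 * Return true if ownership of the reset group could be granted to ctx_id:
 * every device is either closed or already bound to that same context.
 */
static bool can_own_reset_group(const struct model_dev *devs, size_t n,
                                int ctx_id)
{
    for (size_t i = 0; i < n; i++) {
        if (devs[i].opened && devs[i].ctx_id != ctx_id)
            return false;
    }
    return true;
}
```

The model only covers the check at the time of the request; the "now and in the future" part of the proposal would additionally require refusing later opens from other contexts, which is not shown here.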
On Fri, Apr 07, 2023 at 03:07:21PM -0600, Alex Williamson wrote:

> I think we need to revisit the question of why allocating an IOMMU
> group for a no-iommu device is exclusive to the vfio group support.

One of the points of this effort is to reduce the co-mingling of iommu
and VFIO. We should not create the fake iommu groups for no-iommu. The
_INFO API reporting the group is not a good reason to wreck this clean
separation.

Jason
On Sun, Apr 09, 2023 at 07:29:51AM -0600, Alex Williamson wrote:

> > struct vfio_device_info_cap_pci_bdf {
> > 	struct vfio_info_cap_header header;
> > 	__u32 group_id;
> > 	__u16 segment;
> > 	__u8 bus;
> > 	__u8 devfn; /* Use PCI_SLOT/PCI_FUNC */
> > };
> >
>
> Group-id and bdf should be separate capabilities, all device should
> report a group-id capability and only PCI devices a bdf capability.

Group should be reported by iommufd using a generic ioctl, and not be
part of VFIO.

This should report BDF only and only work for PCI.

Jason
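For a BDF-only report as described above, userspace would turn the segment, bus, and devfn fields into the usual PCI address string. A minimal sketch follows; the two macros use the same arithmetic as the kernel's PCI_SLOT() and PCI_FUNC(), while format_bdf() is a hypothetical helper name introduced here for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Same arithmetic as the kernel's PCI_SLOT()/PCI_FUNC() macros. */
#define PCI_SLOT(devfn) (((devfn) >> 3) & 0x1f)
#define PCI_FUNC(devfn) ((devfn) & 0x07)

/*
 * Format segment:bus:slot.func into out.
 * Returns the value of snprintf(), i.e. the length the full string needs.
 */
static int format_bdf(char *out, size_t len, uint16_t segment, uint8_t bus,
                      uint8_t devfn)
{
    return snprintf(out, len, "%04x:%02x:%02x.%x",
                    (unsigned)segment, (unsigned)bus,
                    (unsigned)PCI_SLOT(devfn), (unsigned)PCI_FUNC(devfn));
}
```

For example, devfn 0x19 decodes to slot 3, function 1, so segment 0 and bus 1 give the string 0000:01:03.1.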
On Tue, 11 Apr 2023 10:24:58 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Thu, Apr 06, 2023 at 11:53:47AM -0600, Alex Williamson wrote: > > > Where whether a device is opened is subject to change outside of the > > user's control. This essentially allows the user to perform hot-resets > > of devices outside of their ownership so long as the device is not > > used elsewhere, versus the current requirement that the user own all the > > affected groups, which implies device ownership. It's not been > > justified why this feature needs to exist, imo. > > The cdev API doesn't have the notion that owning a group means you > "own" some collection of devices. It still happens as a side effect, > but it isn't obviously part of the API. I'm really loath to > re-introduce that group-based concept just for this. We are trying > reduce the group API surface. > > How about a different direction. > > We add a new uAPI for cdev mode that is "take ownership of the reset > group". Maybe it can be a flag in during bind. > > When requested vfio will ensure that every device in the reset group > is only bound to this iommufd_ctx or left closed. Now and in the > future. Since no-iommu has no iommufd_ctx this means we can open only > one device in the reset group. > > With this flag RESET is guaranteed to always work by definition. > > We continue with the zero-length FD, but we can just replace the > security checks with a check if we are in reset group ownership mode. > > _INFO is unchanged. > > We decide if we add a new IOCTL to return the BDF so the existing > _INFO can get back to the dev_id or a new IOCTL that returns the > dev_id list of the reset group. > > Userspace is required to figure out the extent of the reset, but we > don't require that userspace prove to the kernel it did this when > requesting the reset. 
Take for example a multi-function PCIe device with ACS isolation between functions, are you going to allow a user who has only been granted ownership of a subset of functions control of the entire dev_set? It seems this proposal essentially extends the ownership model to the greater of the dev_set or iommu group, apparently neither of which are explicitly exposed to the user in the cdev API. How does a user determine when devices cannot be used independently in the cdev API? Thanks, Alex
[Appears the list got dropped, replying to my previous message to re-add] On Tue, 11 Apr 2023 13:32:16 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Tue, Apr 11, 2023 at 09:54:17AM -0600, Alex Williamson wrote: > > On Tue, 11 Apr 2023 10:24:58 -0300 > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > On Thu, Apr 06, 2023 at 11:53:47AM -0600, Alex Williamson wrote: > > > > > > > Where whether a device is opened is subject to change outside of the > > > > user's control. This essentially allows the user to perform hot-resets > > > > of devices outside of their ownership so long as the device is not > > > > used elsewhere, versus the current requirement that the user own all the > > > > affected groups, which implies device ownership. It's not been > > > > justified why this feature needs to exist, imo. > > > > > > The cdev API doesn't have the notion that owning a group means you > > > "own" some collection of devices. It still happens as a side effect, > > > but it isn't obviously part of the API. I'm really loath to > > > re-introduce that group-based concept just for this. We are trying > > > reduce the group API surface. > > > > > > How about a different direction. > > > > > > We add a new uAPI for cdev mode that is "take ownership of the reset > > > group". Maybe it can be a flag in during bind. > > > > > > When requested vfio will ensure that every device in the reset group > > > is only bound to this iommufd_ctx or left closed. Now and in the > > > future. Since no-iommu has no iommufd_ctx this means we can open only > > > one device in the reset group. > > > > > > With this flag RESET is guaranteed to always work by definition. > > > > > > We continue with the zero-length FD, but we can just replace the > > > security checks with a check if we are in reset group ownership mode. > > > > > > _INFO is unchanged. 
> > >
> > > We decide if we add a new IOCTL to return the BDF so the existing
> > > _INFO can get back to the dev_id or a new IOCTL that returns the
> > > dev_id list of the reset group.
> > >
> > > Userspace is required to figure out the extent of the reset, but we
> > > don't require that userspace prove to the kernel it did this when
> > > requesting the reset.
> >
> > Take for example a multi-function PCIe device with ACS isolation between
> > functions, are you going to allow a user who has only been granted
> > ownership of a subset of functions control of the entire dev_set?
>
> Our cdev model says that opening a cdev locks out other cdevs from
> independent use, eg because of the group sharing. Extending this to
> include the reset group as well seems consistent.

The DMA ownership model based on the IOMMU group is consistent with
legacy vfio, but now you're proposing a new ownership model that
optionally allows a user to extend their ownership, opportunistically
lock out other users, and wreak havoc for management utilities that
also have no insight into dev_sets or userspace driver behavior.

> There is some security concern here, but that goes both ways, a 3rd
> party should not be able to break an application that needs to use
> this RESET and had sufficient privileges to assert an ownership.

There are clearly scenarios we have now that could break. For example,
today if QEMU doesn't own all the IOMMU groups for a multi-function
device, it can't do a reset, and the remaining functions are available
for other users. As I understand the proposal, QEMU now gets to attempt
to claim ownership of the dev_set, so it opportunistically extends its
ownership and may block other users from the affected devices. Ordering
makes this effectively unpredictable: if a userspace like DPDK that
doesn't assert dev_set ownership is started first, QEMU can start and
be denied hot-reset support. In the reverse ordering, the DPDK
application can be locked out by QEMU.
> I'd say anyone should be able to assert RESET ownership if, like > today, the iommufd_ctx has all the groups of the dev_set inside > it. Once asserted it becomes safe against all forms of hotplug, and > continues to be safe even if some of the devices are closed. eg hot > unplugging from the VM doesn't change the availability of RESET. > > This comes from your ask that qemu know clearly if RESET works, and it > doesn't change while qemu is running. This seems stronger and clearer > than the current implicit scheme. It also doesn't require usespace to > do any calculations with groups or BDFs to figure out of RESET is > available, kernel confirms it directly. As above, clarity and predictability seem lacking in this proposal. With the current scheme, the ownership of the affected devices is implied if they exist within an owned group, but the strength of that ownership is clear. Affected devices outside the set of owned groups says that hot-reset is unavailable without any of this "but QEMU might be able to request it" or "unless the affected device is currently unopened" variables. > > seems this proposal essentially extends the ownership model to the > > greater of the dev_set or iommu group, apparently neither of which > > are explicitly exposed to the user in the cdev API. > > IIRC the group id can be learned from sysfs before opening the cdev > file. Something like /sys/class/vfio/XX/../../iommu_group And in the passed cdev fd model... ? > We should also have an iommufd ioctl to report the "same ioas" > groupings of dev_ids to make it easy on userspace. I haven't checked > to see what the current qemu patches are doing with this.. Seems we're ignoring that no-iommu doesn't have a valid iommufd. > > How does a user determine when devices cannot be used independently > > in the cdev API? > > We have this problem right now. The only way to learn the reset group > is to call the _INFO ioctl. 
We could add a sysfs "pci_reset_group" > under /sys/class/vfio/XX/ if something needs it earlier. For all the complaints about complexity, now we're asking management tools to not only take into account IOMMU groups, but also reset groups, and some inferred knowledge about the application and devices to speculate whether reset group ownership is taken by a given userspace?? Thanks, Alex
On Tue, Apr 11, 2023 at 11:11:17AM -0600, Alex Williamson wrote:
> [Appears the list got dropped, replying to my previous message to re-add]

Wow, this got messed up a lot; mutt drops the cc when replying for some
reason. I think it is fixed up now.

> > Our cdev model says that opening a cdev locks out other cdevs from
> > independent use, eg because of the group sharing. Extending this to
> > include the reset group as well seems consistent.
>
> The DMA ownership model based on the IOMMU group is consistent with
> legacy vfio, but now you're proposing a new ownership model that
> optionally allows a user to extend their ownership, opportunistically
> lock out other users, and wreaking havoc for management utilities that
> also have no insight into dev_sets or userspace driver behavior.

I suggested below that the ownership require enough open devices - so
it doesn't "extend ownership opportunistically", and there is no
havoc.

Management tools already need to understand dev_set if they want to
offer reliable reset support to the VMs. Same as today.

> > There is some security concern here, but that goes both ways, a 3rd
> > party should not be able to break an application that needs to use
> > this RESET and had sufficient privileges to assert an ownership.
>
> There are clearly scenarios we have now that could break. For example,
> today if QEMU doesn't own all the IOMMU groups for a mult-function
> device, it can't do a reset, the remaining functions are available for
> other users.

Sure, and we can keep that with this approach.

> As I understand the proposal, QEMU now gets to attempt to
> claim ownership of the dev_set, so it opportunistically extends its
> ownership and may block other users from the affected devices.

We can decide the policy for the kernel to accept a claim. I suggested
below "same as today" - it must hold all the groups within the
iommufd_ctx.

The main point is to make this claiming operation qemu needs to do
clearer and more explicit.
I view this as better than trying to guess
if it successfully made the claim by inspecting the _INFO output.

> > I'd say anyone should be able to assert RESET ownership if, like
> > today, the iommufd_ctx has all the groups of the dev_set inside
> > it. Once asserted it becomes safe against all forms of hotplug, and
> > continues to be safe even if some of the devices are closed. eg hot
> > unplugging from the VM doesn't change the availability of RESET.
> >
> > This comes from your ask that qemu know clearly if RESET works, and it
> > doesn't change while qemu is running. This seems stronger and clearer
> > than the current implicit scheme. It also doesn't require usespace to
> > do any calculations with groups or BDFs to figure out of RESET is
> > available, kernel confirms it directly.
>
> As above, clarity and predictability seem lacking in this proposal.
> With the current scheme, the ownership of the affected devices is
> implied if they exist within an owned group, but the strength of that
> ownership is clear.

The same logic holds here.

Ownership is claimed the same as today, by having all groups
represented in the iommufd_ctx. This seems just as clear as today.

> > > seems this proposal essentially extends the ownership model to the
> > > greater of the dev_set or iommu group, apparently neither of which
> > > are explicitly exposed to the user in the cdev API.
> >
> > IIRC the group id can be learned from sysfs before opening the cdev
> > file. Something like /sys/class/vfio/XX/../../iommu_group
>
> And in the passed cdev fd model... ?

IMHO we should try to avoid needing to expose group_id specifically to
userspace. We are missing a way to learn the "same ioas" restriction
in iommufd, and it should provide that directly based on dev_ids.

Otherwise, if we really really need group_id then iommufd should
provide an ioctl to get it.
Let's find a good reason first.

> > We should also have an iommufd ioctl to report the "same ioas"
> > groupings of dev_ids to make it easy on userspace. I haven't checked
> > to see what the current qemu patches are doing with this..
>
> Seems we're ignoring that no-iommu doesn't have a valid iommufd.

no-iommu doesn't and shouldn't have iommu_groups either. It also
doesn't have an IOAS so querying for same-IOAS is not necessary.

The simplest option for no-iommu is to require it to pass in every
device fd to the reset ioctl.

> > > How does a user determine when devices cannot be used independently
> > > in the cdev API?
> >
> > We have this problem right now. The only way to learn the reset group
> > is to call the _INFO ioctl. We could add a sysfs "pci_reset_group"
> > under /sys/class/vfio/XX/ if something needs it earlier.
>
> For all the complaints about complexity, now we're asking management
> tools to not only take into account IOMMU groups, but also reset
> groups, and some inferred knowledge about the application and devices
> to speculate whether reset group ownership is taken by a given
> userspace??

No, we are trying to keep things pretty much the same as today without
resorting to exposing a lot of group related concepts.

The reset group is a clear concept that already exists and isn't
exposed. If we really need to know about it then it should be exposed
on its own, as a separate discussion from this cdev stuff.

I want to re-focus on the basics of what cdev is supposed to be doing,
because several of the ideas you suggested seem against this direction:

 - cdev does not have, and cannot rely on vfio_groups. We enforce this
   by compiling all the vfio_group infrastructure out. iommu_groups
   continue to exist.

   So converting a cdev to a vfio_group is not an allowed operation.

 - no-iommu should not have iommu_groups. We enforce this by compiling
   out all the no-iommu vfio_group infrastructure.
- cdev APIs should ideally not require the user to know the group_id,
  we should try hard to design APIs to avoid this.

We have solved every other problem but reset like this, I would like to get past reset without compromising the above.

Jason
On Tue, 11 Apr 2023 15:40:07 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Tue, Apr 11, 2023 at 11:11:17AM -0600, Alex Williamson wrote: > > [Appears the list got dropped, replying to my previous message to re-add] > > Wowo this got mesed up alot, mutt drops the cc when replying for some > reason. I think it is fixed up now > > > > Our cdev model says that opening a cdev locks out other cdevs from > > > independent use, eg because of the group sharing. Extending this to > > > include the reset group as well seems consistent. > > > > The DMA ownership model based on the IOMMU group is consistent with > > legacy vfio, but now you're proposing a new ownership model that > > optionally allows a user to extend their ownership, opportunistically > > lock out other users, and wreaking havoc for management utilities that > > also have no insight into dev_sets or userspace driver behavior. > > I suggested below that the owership require enough open devices - so > it doesn't "extend ownership opportunistically", and there is no > havoc. > > Management tools already need to understand dev_set if they want to > offer reliable reset support to the VMs. Same as today. I don't think that's true. Our primary hot-reset use case is GPUs and subordinate functions, where the isolation and reset scope are often sufficiently similar to make hot-reset possible, regardless whether all the functions are assigned to a VM. I don't think you'll find any management tools that takes reset scope into account otherwise. > > > There is some security concern here, but that goes both ways, a 3rd > > > party should not be able to break an application that needs to use > > > this RESET and had sufficient privileges to assert an ownership. > > > > There are clearly scenarios we have now that could break. For example, > > today if QEMU doesn't own all the IOMMU groups for a mult-function > > device, it can't do a reset, the remaining functions are available for > > other users. 
> > Sure, and we can keep that with this approach. > > > As I understand the proposal, QEMU now gets to attempt to > > claim ownership of the dev_set, so it opportunistically extends its > > ownership and may block other users from the affected devices. > > We can decide the policy for the kernel to accept a claim. I suggested > below "same as today" - it must hold all the groups within the > iommufd_ctx. It must hold all the groups [that the user doesn't know about because it's not a formal part of the cdev API] within the iommufd_ctx? > The main point is to make this claiming operation qemu needs to do > clearer and more explicit. I view this as better than trying to guess > if it successfully made the claim by inspecting the _INFO output. There is no guessing in the current API. Guessing is what happens when hot-reset magically works because one of the devices wasn't opened at the time, or the iommufd_ctx happens to hold all the affected groups that the user doesn't have an API to understand. The current API has a very concise requirement, the user must own all of the groups affected by the hot-reset in order to effect a hot-reset. > > > I'd say anyone should be able to assert RESET ownership if, like > > > today, the iommufd_ctx has all the groups of the dev_set inside > > > it. Once asserted it becomes safe against all forms of hotplug, and > > > continues to be safe even if some of the devices are closed. eg hot > > > unplugging from the VM doesn't change the availability of RESET. > > > > > > This comes from your ask that qemu know clearly if RESET works, and it > > > doesn't change while qemu is running. This seems stronger and clearer > > > than the current implicit scheme. It also doesn't require usespace to > > > do any calculations with groups or BDFs to figure out of RESET is > > > available, kernel confirms it directly. > > > > As above, clarity and predictability seem lacking in this proposal. 
> > With the current scheme, the ownership of the affected devices is > > implied if they exist within an owned group, but the strength of that > > ownership is clear. > > Same logic holds here > > Ownership is claimed same as today by having all groups representated > in the iommufd_ctx. This seems just as clear as today. I don't know if anyone else is having this trouble, but I'm seeing conflicting requirements. The cdev API is not to expose groups unless a requirement is found to need them, of which this is apparently not one, but all the groups need to be represented in the iommufd_ctx in order to make use of this interface. How is that clear? > > > > seems this proposal essentially extends the ownership model to the > > > > greater of the dev_set or iommu group, apparently neither of which > > > > are explicitly exposed to the user in the cdev API. > > > > > > IIRC the group id can be learned from sysfs before opening the cdev > > > file. Something like /sys/class/vfio/XX/../../iommu_group > > > > And in the passed cdev fd model... ? > > IMHO we should try to avoid needing to expose group_id specifically to > userspace. We are missing a way to learn the "same ioas" restriction > in iommufd, and it should provide that directly based on dev_ids. Is this yet another "we need to expose groups to understand the ioas restriction but we're not going to because reasons" argument? > Otherwise if we really really need group_id then iommufd should > provide an ioctl to get it. Let's find a good reason first If needing to have all of the groups represented in an iommufd_ctx in order to effect a reset without allowing the user to know the set of affected groups and device to group relationship isn't a reason... well I'm just lost. > > > We should also have an iommufd ioctl to report the "same ioas" > > > groupings of dev_ids to make it easy on userspace. I haven't checked > > > to see what the current qemu patches are doing with this.. 
> > > > Seems we're ignoring that no-iommu doesn't have a valid iommufd. > > no-iommu doesn't and shouldn't have iommu_groups either. It also > doesn't have an IOAS so querying for same-IOAS is not necessary. > > The simplest option for no-iommu is to require it to pass in every > device fd to the reset ioctl. Which ironically is exactly how it ends up working today, each no-iommu device has a fake IOMMU group, so every affected device (group) needs to be provided. > > > > How does a user determine when devices cannot be used independently > > > > in the cdev API? > > > > > > We have this problem right now. The only way to learn the reset group > > > is to call the _INFO ioctl. We could add a sysfs "pci_reset_group" > > > under /sys/class/vfio/XX/ if something needs it earlier. > > > > For all the complaints about complexity, now we're asking management > > tools to not only take into account IOMMU groups, but also reset > > groups, and some inferred knowledge about the application and devices > > to speculate whether reset group ownership is taken by a given > > userspace?? > > No, we are trying to keep things pretty much the same as today without > resorting to exposing a lot of group related concepts. > > The reset group is a clear concept that already exists and isn't > exposed. If we really need to know about it then it should be exposed > on its own, as a seperate discussion from this cdev stuff. "[A]nd isn't exposed"... what exactly is the hot-reset INFO ioctl exposing if not that? > I want to re-focus on the basics of what cdev is supposed to be doing, > because several of the idea you suggested seem against this direction: > > - cdev does not have, and cannot rely on vfio_groups. We enforce this > by compiling all the vfio_group infrastructure out. iommu_groups > continue to exist. > > So converting a cdev to a vfio_group is not an allowed operation. My only statements in this respect were towards the notion that IOMMU groups continue to exist. 
I'm well aware of the desire to deprecate and remove vfio groups. > - no-iommu should not have iommu_groups. We enforce this by compiling > out all the no-iommu vfio_group infrastructure. This is not logically inferred from the above if IOMMU groups continue to exist and continue to be a basis for describing DMA ownership as well as "reset groups" > - cdev APIs should ideally not require the user to know the group_id, > we should try hard to design APIs to avoid this. This is a nuance, group_id vs group, where it's been previously discussed that users will need to continue to know the boundaries of a group for the purpose of DMA isolation and potentially IOAS independence should cdev/iommufd choose to tackle those topics. > We have solved every other problem but reset like this, I would like > to get past reset without compromising the above. "These aren't the droids we're looking for." What is the actual proposal here? You've said that hot-reset works if the iommufd_ctx has representation from each affected group, the INFO ioctl remains as it is, which suggests that it's reporting group ID and BDF, yet only sysfs tells the user the relation between a vfio cdev and a group and we're trying to enable a pass-by-fd model for cdev where the user has no reference to a sysfs node for the device. Show me how these pieces fit together. OTOH, if we say IOMMU groups continue to exist [agreed], every vfio device has an IOMMU group, and there's an API to learn the group ID, the solution becomes much more clear and no-iommu devices require no special cases or restrictions. Not only does the INFO ioctl remain the same, but the hot-reset ioctl itself remains effectively the same accepting either vfio cdevs or groups. Thanks, Alex
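The requirement described above, that the user must own every group affected by the hot-reset, can be sketched as a plain check over the affected-device list returned by the INFO ioctl. This is only a minimal userspace model under stated assumptions, not kernel or QEMU code: `struct dep_dev` is a local mirror of the fields of `struct vfio_pci_dependent_device` from `<linux/vfio.h>`, and `group_is_owned`, `first_unowned_device` and the array of owned group IDs are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of struct vfio_pci_dependent_device (linux/vfio.h),
 * redefined here so the sketch builds without kernel headers. */
struct dep_dev {
    uint32_t group_id;
    uint16_t segment;
    uint8_t bus;
    uint8_t devfn;
};

/* Hypothetical helper: returns 1 if group_id is among the groups the
 * user has opened, 0 otherwise. */
static int group_is_owned(uint32_t group_id, const uint32_t *owned,
                          size_t n_owned)
{
    for (size_t i = 0; i < n_owned; i++)
        if (owned[i] == group_id)
            return 1;
    return 0;
}

/* Hypothetical helper: walks the affected devices and returns the index
 * of the first one whose group is not owned, or -1 if every affected
 * group is owned (the condition under which hot-reset is permitted). */
static long first_unowned_device(const struct dep_dev *devs, size_t n_devs,
                                 const uint32_t *owned, size_t n_owned)
{
    assert(devs != NULL || n_devs == 0);
    for (size_t i = 0; i < n_devs; i++)
        if (!group_is_owned(devs[i].group_id, owned, n_owned))
            return (long)i;
    return -1;
}
```

Returning the index rather than a bare yes/no lets a caller report which group is missing, so the result stays actionable for the user.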
On Tue, Apr 11, 2023 at 03:58:27PM -0600, Alex Williamson wrote: > > Management tools already need to understand dev_set if they want to > > offer reliable reset support to the VMs. Same as today. > > I don't think that's true. Our primary hot-reset use case is GPUs and > subordinate functions, where the isolation and reset scope are often > sufficiently similar to make hot-reset possible, regardless whether > all the functions are assigned to a VM. I don't think you'll find any > management tools that takes reset scope into account otherwise. When I think of "reliable reset support" I think of the management tool offering a checkbox that says "ensure PCI function reset availability" and if checked it will not launch the VM without a working reset. If the user configures a set of VFIO devices and then hopes they get working reset, that is fine, but doesn't require any reporting of reset groups, or iommu groups to the management layer to work. > > > As I understand the proposal, QEMU now gets to attempt to > > > claim ownership of the dev_set, so it opportunistically extends its > > > ownership and may block other users from the affected devices. > > > > We can decide the policy for the kernel to accept a claim. I suggested > > below "same as today" - it must hold all the groups within the > > iommufd_ctx. > > It must hold all the groups [that the user doesn't know about because > it's not a formal part of the cdev API] within the iommufd_ctx? You keep going back to this, but I maintain userspace doesn't care. qemu is given a list of VFIO devices to use, all it wants to know is if it is allowed to use reset or not. Why should it need to know groups and group_ids to get that binary signal out of the kernel? > > The simplest option for no-iommu is to require it to pass in every > > device fd to the reset ioctl. 
> > Which ironically is exactly how it ends up working today, each no-iommu > device has a fake IOMMU group, so every affected device (group) needs > to be provided. Sure, that is probably the way forward for no-iommu. Not that anyone uses it.. The kicker is we don't force the user to generate a de-duplicated list of devices FDs, one per group, just because. > > I want to re-focus on the basics of what cdev is supposed to be doing, > > because several of the idea you suggested seem against this direction: > > > > - cdev does not have, and cannot rely on vfio_groups. We enforce this > > by compiling all the vfio_group infrastructure out. iommu_groups > > continue to exist. > > > > So converting a cdev to a vfio_group is not an allowed operation. > > My only statements in this respect were towards the notion that IOMMU > groups continue to exist. I'm well aware of the desire to deprecate > and remove vfio groups. Yes > > - no-iommu should not have iommu_groups. We enforce this by compiling > > out all the no-iommu vfio_group infrastructure. > > This is not logically inferred from the above if IOMMU groups continue > to exist and continue to be a basis for describing DMA ownership as > well as "reset groups" It is not ment to flow out of the above, it is a seperate statement. I want the iommu_group mechanism to stop being abused outside the iommu core code. The only thing that should be creating groups is an attached iommu driver operating under ops->device_group(). VFIO needed this to support mdev and no-iommu. We already have mdev free of iommu_groups, I would like no-iommu to also be free of it too, we are very close. That would leave POWER as the only abuser of the iommu_group_add_device() API, and it is only doing it because it hasn't got a proper iommu driver implementation yet. 
It turns out their abuse is mislocked and maybe racy to boot :( > > - cdev APIs should ideally not require the user to know the group_id, > > we should try hard to design APIs to avoid this. > > This is a nuance, group_id vs group, where it's been previously > discussed that users will need to continue to know the boundaries of a > group for the purpose of DMA isolation and potentially IOAS > independence should cdev/iommufd choose to tackle those topics. Yes, group_id is a value we have no specific use for and would require userspace to keep seperate track of. I'd prefer to rely on dev_id as much as possible instead. > What is the actual proposal here? I don't know anymore, you don't seem to like this direction either... > You've said that hot-reset works if the iommufd_ctx has > representation from each affected group, the INFO ioctl remains as > it is, which suggests that it's reporting group ID and BDF, yet only > sysfs tells the user the relation between a vfio cdev and a group > and we're trying to enable a pass-by-fd model for cdev where the > user has no reference to a sysfs node for the device. Show me how > these pieces fit together. I prefer the version where INFO2 returns the dev_id, but info can work if we do the BDF cap like you suggested to Yi > OTOH, if we say IOMMU groups continue to exist [agreed], every vfio > device has an IOMMU group I don't desire every VFIO device to have an iommu_group. I want VFIO devices with real IOMMU drivers to have an iommu_group. mdev and no-iommu should not. I don't want to add them back into the design just so INFO has a value to return. I'd rather give no-iommu a dummy dev_id in iommufdctx then give it an iommu_group... I see this problem as a few basic requirements from a qemu-like application: 1) Does the configuration I was given support reset right now? 2) Will the configuration I was given support reset for the duration of my execution? 3) What groups of the devices I already have open does the reset effect? 
4) For debugging, report to the user the full list of devices in the reset group, in a way that relates back to sysfs.
5) A way to trigger a reset on a group of devices

#1/#2 is the API I suggested here. Ask the kernel if the current configuration works, and ask it to keep it working.

#3 is either INFO and a CAP for BDF or INFO2 reporting dev_id

#4 is either INFO and print the BDFs or INFO2 reporting the struct vfio_device IDR # (eg /sys/class/vfio/vfioXXX/).

#5 is adjusting the FD list in existing RESET ioctl. Removing the need for userspace to specify a minimal exact list of FDs means userspace doesn't need the information to figure out what that list actually is. Pass a 0 length list and use iommufdctx.

None of these requirements suggests to me that qemu needs to know the group_id, or that it needs to have enough information to know how to fix an unavailable reset.

Did I miss a requirement here?

Regards,
Jason
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Wednesday, April 12, 2023 5:58 AM > > On Tue, 11 Apr 2023 15:40:07 -0300 > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > On Tue, Apr 11, 2023 at 11:11:17AM -0600, Alex Williamson wrote: > > > [Appears the list got dropped, replying to my previous message to re-add] > > > > Wowo this got mesed up alot, mutt drops the cc when replying for some > > reason. I think it is fixed up now > > > > > > Our cdev model says that opening a cdev locks out other cdevs from > > > > independent use, eg because of the group sharing. Extending this to > > > > include the reset group as well seems consistent. > > > > > > The DMA ownership model based on the IOMMU group is consistent with > > > legacy vfio, but now you're proposing a new ownership model that > > > optionally allows a user to extend their ownership, opportunistically > > > lock out other users, and wreaking havoc for management utilities that > > > also have no insight into dev_sets or userspace driver behavior. > > > > I suggested below that the owership require enough open devices - so > > it doesn't "extend ownership opportunistically", and there is no > > havoc. > > > > Management tools already need to understand dev_set if they want to > > offer reliable reset support to the VMs. Same as today. > > I don't think that's true. Our primary hot-reset use case is GPUs and > subordinate functions, where the isolation and reset scope are often > sufficiently similar to make hot-reset possible, regardless whether > all the functions are assigned to a VM. I don't think you'll find any > management tools that takes reset scope into account otherwise. If we only care about the primary case where iommu group and reset scope matches, then why would the new claim model in Jason's proposal urge the management tools to understand the reset scope now? btw in your earlier replies you pointed out the issue of unpredictable ordering on a multi-function device e.g. 
depending on which one runs first, dpdk or qemu will block the other. But I wonder what the actual use is of allowing both to run when neither can do a reset, due to the affected reset scope in the current model. If a vfio user cannot do a reset, doesn't that imply it hasn't acquired full permission on the device, so that Jason's proposal of explicitly failing it is actually a cleaner model?

Thanks
Kevin
> From: Jason Gunthorpe
> Sent: Wednesday, April 12, 2023 8:01 AM
>
> I see this problem as a few basic requirements from a qemu-like
> application:
>
> 1) Does the configuration I was given support reset right now?
> 2) Will the configuration I was given support reset for the duration
>    of my execution?
> 3) What groups of the devices I already have open does the reset
>    effect?
> 4) For debugging, report to the user the full list of devices in the
>    reset group, in a way that relates back to sysfs.
> 5) A way to trigger a reset on a group of devices
>
> #1/#2 is the API I suggested here. Ask the kernel if the current
> configuration works, and ask it to keep it working.
>
> #3 is either INFO and a CAP for BDF or INFO2 reporting dev_id
>
> #4 is either INFO and print the BDFs or INFO2 reporting the struct
> vfio_device IDR # (eg /sys/class/vfio/vfioXXX/).

mdev doesn't have a BDF. Of course it doesn't support hot_reset either. But it's presented to userspace as a PCI device. Is it weird for a PCI device not to provide a BDF cap? From this point of view the vfio_device IDR# sounds more generic.

>
> #5 is adjusting the FD list in existing RESET ioctl. Remove the need
> for userspace to specify a minimal exact list of FDs means userspace
> doesn't need the information to figure out what that list actually
> is. Pass a 0 length list and use iommufdctx.
>
> None of these requirements suggests to me that qemu needs to know the
> group_id, or that it needs to have enough information to know how to
> fix an unavailable reset.
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Wednesday, April 12, 2023 8:01 AM > > On Tue, Apr 11, 2023 at 03:58:27PM -0600, Alex Williamson wrote: > > > > Management tools already need to understand dev_set if they want to > > > offer reliable reset support to the VMs. Same as today. > > > > I don't think that's true. Our primary hot-reset use case is GPUs and > > subordinate functions, where the isolation and reset scope are often > > sufficiently similar to make hot-reset possible, regardless whether > > all the functions are assigned to a VM. I don't think you'll find any > > management tools that takes reset scope into account otherwise. > > When I think of "reliable reset support" I think of the management > tool offering a checkbox that says "ensure PCI function reset > availability" and if checked it will not launch the VM without a > working reset. > > If the user configures a set of VFIO devices and then hopes they get > working reset, that is fine, but doesn't require any reporting of > reset groups, or iommu groups to the management layer to work. > > > > > As I understand the proposal, QEMU now gets to attempt to > > > > claim ownership of the dev_set, so it opportunistically extends its > > > > ownership and may block other users from the affected devices. > > > > > > We can decide the policy for the kernel to accept a claim. I suggested > > > below "same as today" - it must hold all the groups within the > > > iommufd_ctx. > > > > It must hold all the groups [that the user doesn't know about because > > it's not a formal part of the cdev API] within the iommufd_ctx? > > You keep going back to this, but I maintain userspace doesn't > care. qemu is given a list of VFIO devices to use, all it wants to > know is if it is allowed to use reset or not. Why should it need to > know groups and group_ids to get that binary signal out of the kernel? 
> > > > The simplest option for no-iommu is to require it to pass in every > > > device fd to the reset ioctl. > > > > Which ironically is exactly how it ends up working today, each no-iommu > > device has a fake IOMMU group, so every affected device (group) needs > > to be provided. > > Sure, that is probably the way forward for no-iommu. Not that anyone > uses it.. > > The kicker is we don't force the user to generate a de-duplicated list > of devices FDs, one per group, just because. > > > > I want to re-focus on the basics of what cdev is supposed to be doing, > > > because several of the idea you suggested seem against this direction: > > > > > > - cdev does not have, and cannot rely on vfio_groups. We enforce this > > > by compiling all the vfio_group infrastructure out. iommu_groups > > > continue to exist. > > > > > > So converting a cdev to a vfio_group is not an allowed operation. > > > > My only statements in this respect were towards the notion that IOMMU > > groups continue to exist. I'm well aware of the desire to deprecate > > and remove vfio groups. > > Yes > > > > - no-iommu should not have iommu_groups. We enforce this by compiling > > > out all the no-iommu vfio_group infrastructure. > > > > This is not logically inferred from the above if IOMMU groups continue > > to exist and continue to be a basis for describing DMA ownership as > > well as "reset groups" > > It is not ment to flow out of the above, it is a seperate statement. I > want the iommu_group mechanism to stop being abused outside the iommu > core code. The only thing that should be creating groups is an > attached iommu driver operating under ops->device_group(). > > VFIO needed this to support mdev and no-iommu. We already have mdev > free of iommu_groups, I would like no-iommu to also be free of it too, > we are very close. 
> > That would leave POWER as the only abuser of the > iommu_group_add_device() API, and it is only doing it because it > hasn't got a proper iommu driver implementation yet. It turns out > their abuse is mislocked and maybe racy to boot :( > > > > - cdev APIs should ideally not require the user to know the group_id, > > > we should try hard to design APIs to avoid this. > > > > This is a nuance, group_id vs group, where it's been previously > > discussed that users will need to continue to know the boundaries of a > > group for the purpose of DMA isolation and potentially IOAS > > independence should cdev/iommufd choose to tackle those topics. > > Yes, group_id is a value we have no specific use for and would require > userspace to keep seperate track of. I'd prefer to rely on dev_id as > much as possible instead. > > > What is the actual proposal here? > > I don't know anymore, you don't seem to like this direction either... > > > You've said that hot-reset works if the iommufd_ctx has > > representation from each affected group, the INFO ioctl remains as > > it is, which suggests that it's reporting group ID and BDF, yet only > > sysfs tells the user the relation between a vfio cdev and a group > > and we're trying to enable a pass-by-fd model for cdev where the > > user has no reference to a sysfs node for the device. Show me how > > these pieces fit together. > > I prefer the version where INFO2 returns the dev_id, but info can work > if we do the BDF cap like you suggested to Yi > > > OTOH, if we say IOMMU groups continue to exist [agreed], every vfio > > device has an IOMMU group > > I don't desire every VFIO device to have an iommu_group. I want VFIO > devices with real IOMMU drivers to have an iommu_group. mdev and > no-iommu should not. I don't want to add them back into the design > just so INFO has a value to return. > > I'd rather give no-iommu a dummy dev_id in iommufdctx then give it an > iommu_group... 
>
> I see this problem as a few basic requirements from a qemu-like
> application:
>
> 1) Does the configuration I was given support reset right now?
> 2) Will the configuration I was given support reset for the duration
>    of my execution?
> 3) What groups of the devices I already have open does the reset
>    effect?
> 4) For debugging, report to the user the full list of devices in the
>    reset group, in a way that relates back to sysfs.
> 5) A way to trigger a reset on a group of devices
>
> #1/#2 is the API I suggested here. Ask the kernel if the current
> configuration works, and ask it to keep it working.
>
> #3 is either INFO and a CAP for BDF or INFO2 reporting dev_id
>
> #4 is either INFO and print the BDFs or INFO2 reporting the struct
> vfio_device IDR # (eg /sys/class/vfio/vfioXXX/).

I hope we can have a clear statement on the _INFO or INFO2 usage. Today, per QEMU's implementation, the output of _INFO is used to:

1) do a self-check to see if all the affected groups are opened by the current user before it can invoke hot-reset.
2) figure out the devices that are already opened by the user. Such a device may already be in use, so QEMU needs to save its state before the hot-reset and restore it afterwards.

It seems we are relaxing the self-check, since it may be done by locking the reset group. Is that right?

> #5 is adjusting the FD list in existing RESET ioctl. Remove the need
> for userspace to specify a minimal exact list of FDs means userspace
> doesn't need the information to figure out what that list actually
> is. Pass a 0 length list and use iommufdctx.

If the reset group is locked, there seems to be no need to check iommufdctx.

Thanks,
Yi Liu
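The second use of the _INFO output described above, finding which already-opened devices a hot-reset would affect so their state can be saved and restored, might look roughly like the following. This is a hedged sketch and not QEMU's actual code: `struct dep_dev` is a local mirror of `struct vfio_pci_dependent_device` from `<linux/vfio.h>`, while `struct opened_dev`, `same_bdf` and `mark_affected_opened` are hypothetical names, and matching by segment/bus/devfn is an assumption about how the lookup could be done.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of struct vfio_pci_dependent_device (linux/vfio.h). */
struct dep_dev {
    uint32_t group_id;
    uint16_t segment;
    uint8_t bus;
    uint8_t devfn;
};

/* Hypothetical record for a device this process has already opened. */
struct opened_dev {
    uint16_t segment;
    uint8_t bus;
    uint8_t devfn;
    int needs_save_restore; /* set when the hot-reset affects this device */
};

/* Two entries name the same PCI function if segment, bus and devfn match. */
static int same_bdf(const struct dep_dev *a, const struct opened_dev *b)
{
    return a->segment == b->segment && a->bus == b->bus &&
           a->devfn == b->devfn;
}

/* Flags every opened device that appears in the affected list and
 * returns how many were flagged. */
static size_t mark_affected_opened(const struct dep_dev *affected,
                                   size_t n_affected,
                                   struct opened_dev *opened, size_t n_opened)
{
    size_t marked = 0;

    assert(affected != NULL || n_affected == 0);
    for (size_t i = 0; i < n_opened; i++) {
        opened[i].needs_save_restore = 0;
        for (size_t j = 0; j < n_affected; j++) {
            if (same_bdf(&affected[j], &opened[i])) {
                opened[i].needs_save_restore = 1;
                marked++;
                break;
            }
        }
    }
    return marked;
}
```

The flagged devices are the ones whose state would be saved before the reset and restored after it; devices outside the affected list are left alone.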
On Wed, Apr 12, 2023 at 07:27:43AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe
> > Sent: Wednesday, April 12, 2023 8:01 AM
> >
> > I see this problem as a few basic requirements from a qemu-like
> > application:
> >
> > 1) Does the configuration I was given support reset right now?
> > 2) Will the configuration I was given support reset for the duration
> >    of my execution?
> > 3) What groups of the devices I already have open does the reset
> >    effect?
> > 4) For debugging, report to the user the full list of devices in the
> >    reset group, in a way that relates back to sysfs.
> > 5) A way to trigger a reset on a group of devices
> >
> > #1/#2 is the API I suggested here. Ask the kernel if the current
> > configuration works, and ask it to keep it working.
> >
> > #3 is either INFO and a CAP for BDF or INFO2 reporting dev_id
> >
> > #4 is either INFO and print the BDFs or INFO2 reporting the struct
> > vfio_device IDR # (eg /sys/class/vfio/vfioXXX/).
>
> mdev doesn't have BDF. Of course it doesn't support hot_reset either.

It should support a reset.. Maybe idxd doesn't, but it should be part of the SIOV model. Our SIOV devices would need it for instance.

> but it's presented to userspace as a pci device. Is it weird for a pci
> device which doesn't provide a BDF cap?

It is weird for a PCI device, but it is not weird for a VFIO device. Leaking the physical labels out of the uAPI is not clean, IMHO.

> from this point the vfio_device IDR# sounds more generic.

Yes, I was thinking about this for the SIOV model.

Jason
On Tue, 11 Apr 2023 21:01:06 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Tue, Apr 11, 2023 at 03:58:27PM -0600, Alex Williamson wrote: > > > > Management tools already need to understand dev_set if they want to > > > offer reliable reset support to the VMs. Same as today. > > > > I don't think that's true. Our primary hot-reset use case is GPUs and > > subordinate functions, where the isolation and reset scope are often > > sufficiently similar to make hot-reset possible, regardless whether > > all the functions are assigned to a VM. I don't think you'll find any > > management tools that takes reset scope into account otherwise. > > When I think of "reliable reset support" I think of the management > tool offering a checkbox that says "ensure PCI function reset > availability" and if checked it will not launch the VM without a > working reset. This doesn't exist. > If the user configures a set of VFIO devices and then hopes they get > working reset, that is fine, but doesn't require any reporting of > reset groups, or iommu groups to the management layer to work. I think there's more than hope involved here, there are recipes to create working hot-reset configurations because it is well specified and predictable currently. QEMU can indicate whether hot-reset is available thanks to the information provided in the INFO ioctl and a VM that owns the necessary set of groups may consistently and repeatedly perform hot-resets. > > > > As I understand the proposal, QEMU now gets to attempt to > > > > claim ownership of the dev_set, so it opportunistically extends its > > > > ownership and may block other users from the affected devices. > > > > > > We can decide the policy for the kernel to accept a claim. I suggested > > > below "same as today" - it must hold all the groups within the > > > iommufd_ctx. > > > > It must hold all the groups [that the user doesn't know about because > > it's not a formal part of the cdev API] within the iommufd_ctx? 
> > You keep going back to this, but I maintain userspace doesn't > care. qemu is given a list of VFIO devices to use, all it wants to > know is if it is allowed to use reset or not. Why should it need to > know groups and group_ids to get that binary signal out of the kernel? hw/vfio/pci.c:2320 error_report("vfio: Cannot reset device %s, " "depends on group %d which is not owned.", vdev->vbasedev.name, devices[i].group_id); That creates a feedback loop where a user can take corrective action with actual information in hand to resolve the issue. > > > The simplest option for no-iommu is to require it to pass in every > > > device fd to the reset ioctl. > > > > Which ironically is exactly how it ends up working today, each no-iommu > > device has a fake IOMMU group, so every affected device (group) needs > > to be provided. > > Sure, that is probably the way forward for no-iommu. Not that anyone > uses it.. > > The kicker is we don't force the user to generate a de-duplicated list > of devices FDs, one per group, just because. So on one hand you're asking for simplicity, but on the other you're criticizing a trivial simplification that we chose to allow the user to pass number of group fds equal to number of devices affected so that the user doesn't need to take that step to de-duplicate the list. We can't win. > > > I want to re-focus on the basics of what cdev is supposed to be doing, > > > because several of the idea you suggested seem against this direction: > > > > > > - cdev does not have, and cannot rely on vfio_groups. We enforce this > > > by compiling all the vfio_group infrastructure out. iommu_groups > > > continue to exist. > > > > > > So converting a cdev to a vfio_group is not an allowed operation. > > > > My only statements in this respect were towards the notion that IOMMU > > groups continue to exist. I'm well aware of the desire to deprecate > > and remove vfio groups. > > Yes > > > > - no-iommu should not have iommu_groups. 
We enforce this by compiling > > > out all the no-iommu vfio_group infrastructure. > > > > This is not logically inferred from the above if IOMMU groups continue > > to exist and continue to be a basis for describing DMA ownership as > > well as "reset groups" > > It is not ment to flow out of the above, it is a seperate statement. I > want the iommu_group mechanism to stop being abused outside the iommu > core code. The only thing that should be creating groups is an > attached iommu driver operating under ops->device_group(). > > VFIO needed this to support mdev and no-iommu. We already have mdev > free of iommu_groups, I would like no-iommu to also be free of it too, > we are very close. > > That would leave POWER as the only abuser of the > iommu_group_add_device() API, and it is only doing it because it > hasn't got a proper iommu driver implementation yet. It turns out > their abuse is mislocked and maybe racy to boot :( > > > > - cdev APIs should ideally not require the user to know the group_id, > > > we should try hard to design APIs to avoid this. > > > > This is a nuance, group_id vs group, where it's been previously > > discussed that users will need to continue to know the boundaries of a > > group for the purpose of DMA isolation and potentially IOAS > > independence should cdev/iommufd choose to tackle those topics. > > Yes, group_id is a value we have no specific use for and would require > userspace to keep seperate track of. I'd prefer to rely on dev_id as > much as possible instead. But dev-id only has meaning in relation to an iommufd_ctx, so it fails to be useful in the context of implied ownership. > > What is the actual proposal here? > > I don't know anymore, you don't seem to like this direction either... 
> > > You've said that hot-reset works if the iommufd_ctx has > > representation from each affected group, the INFO ioctl remains as > > it is, which suggests that it's reporting group ID and BDF, yet only > > sysfs tells the user the relation between a vfio cdev and a group > > and we're trying to enable a pass-by-fd model for cdev where the > > user has no reference to a sysfs node for the device. Show me how > > these pieces fit together. > > I prefer the version where INFO2 returns the dev_id, but info can work > if we do the BDF cap like you suggested to Yi As discussed ad nauseam, dev-id is useless if an affected device is not already within the iommufd ctx. BDF provides a mapping to specific affected devices, but can't express implied ownership. Group id provides the implied ownership, but can't express specific devices. As Yi has pointed out, QEMU needs to know both if it has ownership of all the affected devices, both direct and implied, and which specific devices that it owns are affected. > > OTOH, if we say IOMMU groups continue to exist [agreed], every vfio > > device has an IOMMU group > > I don't desire every VFIO device to have an iommu_group. I want VFIO > devices with real IOMMU drivers to have an iommu_group. mdev and > no-iommu should not. I don't want to add them back into the design > just so INFO has a value to return. > > I'd rather give no-iommu a dummy dev_id in iommufdctx then give it an > iommu_group... It's not been shown to me that dev-id is a useful replacement for anything here. > I see this problem as a few basic requirements from a qemu-like > application: > > 1) Does the configuration I was given support reset right now? > 2) Will the configuration I was given support reset for the duration > of my execution? > 3) What groups of the devices I already have open does the reset > effect? > 4) For debugging, report to the user the full list of devices in the > reset group, in a way that relates back to sysfs. 
> 5) Away to trigger a reset on a group of devices > > #1/#2 is the API I suggested here. Ask the kernel if the current > configuration works, and ask it to keep it working. That is super sketchy because you're also advocating for opportunistically supporting reset if the instantaneous conditions allow is (ex. unopened devices), and going back and forth whether "ask it to keep working" suggests that a user is able to extend their granted ownership themselves. I think both needs to be based on some form of granted, not requested, ownership and not opportunism. > #3 is either INFO and a CAP for BDF or INFO2 reporting dev_id Where dev-id is useful for... ? I think there's a misuse of "groups" in 3) above, userspace needs to know specific devices affected, thus BDF. > #4 is either INFO and print the BDFs or INFO2 reporting the struct > vfio_device IDR # (eg /sys/class/vfio/vfioXXX/). We can't assume that all the affected devices are bound to vfio, therefore we cannot assume a vfio_device IDR exists. > #5 is adjusting the FD list in existing RESET ioctl. Remove the need > for userspace to specify a minimal exact list of FDs means userspace > doesn't need the information to figure out what that list actually > is. Pass a 0 length list and use iommufdctx. "...doesn't need the information to figure out what the list actually is." That's false, userspace needs the information whether it uses it to make a list or not, ex. pre- and post-reset processing of specific affected devices. Furthermore, supporting a zero length array removes context from the existing ioctl, which has been shown to make it prone to creating gaps in legacy group use cases, so I don't understand why this optimization is so pervasive or important. > None of these requirements suggests to me that qemu needs to know the > group_id, or that it needs to have enough information to know how to > fix an unavailable reset. > > Did I miss a requirement here? So what is the exact proposal? 
We can't have an INFO ioctl that simply returns error if the ownership requirements are not met as that doesn't support 4). So we need one or more ioctls that a) indicates whether the ownership requirements are met and b) indicates the set of affected devices. Is b) only the set of affected devices within the calling devices iommufd_ctx (ie. dev-ids), in which case we need c) a way to report the overall set of affected devices regardless of ownership in support of 4), BDF? Are we back to replacing group-ids with dev-ids in the INFO structure, where an invalid dev-id either indicates an affected device with implied ownership (ok) or a gap in ownership (bad) and a flag somewhere is meant to indicate the overall disposition based on the availability of reset? I'm not sure how that fully supports 4) since the user can't determine if a given invalid dev-id is in fact a blocker, so do we end up with multiple invalid IDs, perhaps one to indicate unknown but ok and another to indicate an ownership gap? Are devices outside of the iommufd_ctx, but with implied ownership via group omitted entirely from the lists? I think we need an actual proposal here. Thanks, Alex
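Editorial sketch, not part of the original mail: requirement 4) asks for a device list that "relates back to sysfs", and the thread converges on BDF for that. The snippet below shows one way to turn a segment/bus/devfn triple into the usual sysfs-style PCI address. The struct dep_dev type and the dep_dev_bdf() helper are hypothetical names used only for illustration; the only facts relied on are that devfn carries the slot in its upper five bits and the function in its lower three.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical local mirror of one affected-device entry (illustration only). */
struct dep_dev {
    uint32_t group_id;
    uint16_t segment;
    uint8_t bus;
    uint8_t devfn;   /* slot in bits 7:3, function in bits 2:0 */
};

/*
 * Format an entry as a sysfs-style PCI address (domain:bus:slot.func),
 * e.g. "0000:01:00.1". Returns what snprintf() returns.
 */
static int dep_dev_bdf(const struct dep_dev *d, char *buf, size_t len)
{
    return snprintf(buf, len, "%04x:%02x:%02x.%x",
                    (unsigned)d->segment, (unsigned)d->bus,
                    (unsigned)(d->devfn >> 3), (unsigned)(d->devfn & 7));
}
```

A debug print built this way works whether or not the affected device is bound to vfio, which is the point made above against relying on a vfio_device IDR.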
On Wed, 12 Apr 2023 10:09:32 +0000 "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > From: Jason Gunthorpe <jgg@nvidia.com> > > Sent: Wednesday, April 12, 2023 8:01 AM > > > > On Tue, Apr 11, 2023 at 03:58:27PM -0600, Alex Williamson wrote: > > > > > > Management tools already need to understand dev_set if they want to > > > > offer reliable reset support to the VMs. Same as today. > > > > > > I don't think that's true. Our primary hot-reset use case is GPUs and > > > subordinate functions, where the isolation and reset scope are often > > > sufficiently similar to make hot-reset possible, regardless whether > > > all the functions are assigned to a VM. I don't think you'll find any > > > management tools that takes reset scope into account otherwise. > > > > When I think of "reliable reset support" I think of the management > > tool offering a checkbox that says "ensure PCI function reset > > availability" and if checked it will not launch the VM without a > > working reset. > > > > If the user configures a set of VFIO devices and then hopes they get > > working reset, that is fine, but doesn't require any reporting of > > reset groups, or iommu groups to the management layer to work. > > > > > > > As I understand the proposal, QEMU now gets to attempt to > > > > > claim ownership of the dev_set, so it opportunistically extends its > > > > > ownership and may block other users from the affected devices. > > > > > > > > We can decide the policy for the kernel to accept a claim. I suggested > > > > below "same as today" - it must hold all the groups within the > > > > iommufd_ctx. > > > > > > It must hold all the groups [that the user doesn't know about because > > > it's not a formal part of the cdev API] within the iommufd_ctx? > > > > You keep going back to this, but I maintain userspace doesn't > > care. qemu is given a list of VFIO devices to use, all it wants to > > know is if it is allowed to use reset or not. 
Why should it need to > > know groups and group_ids to get that binary signal out of the kernel? > > > > > > The simplest option for no-iommu is to require it to pass in every > > > > device fd to the reset ioctl. > > > > > > Which ironically is exactly how it ends up working today, each no-iommu > > > device has a fake IOMMU group, so every affected device (group) needs > > > to be provided. > > > > Sure, that is probably the way forward for no-iommu. Not that anyone > > uses it.. > > > > The kicker is we don't force the user to generate a de-duplicated list > > of devices FDs, one per group, just because. > > > > > > I want to re-focus on the basics of what cdev is supposed to be doing, > > > > because several of the idea you suggested seem against this direction: > > > > > > > > - cdev does not have, and cannot rely on vfio_groups. We enforce this > > > > by compiling all the vfio_group infrastructure out. iommu_groups > > > > continue to exist. > > > > > > > > So converting a cdev to a vfio_group is not an allowed operation. > > > > > > My only statements in this respect were towards the notion that IOMMU > > > groups continue to exist. I'm well aware of the desire to deprecate > > > and remove vfio groups. > > > > Yes > > > > > > - no-iommu should not have iommu_groups. We enforce this by compiling > > > > out all the no-iommu vfio_group infrastructure. > > > > > > This is not logically inferred from the above if IOMMU groups continue > > > to exist and continue to be a basis for describing DMA ownership as > > > well as "reset groups" > > > > It is not ment to flow out of the above, it is a seperate statement. I > > want the iommu_group mechanism to stop being abused outside the iommu > > core code. The only thing that should be creating groups is an > > attached iommu driver operating under ops->device_group(). > > > > VFIO needed this to support mdev and no-iommu. 
We already have mdev > > free of iommu_groups, I would like no-iommu to also be free of it too, > > we are very close. > > > > That would leave POWER as the only abuser of the > > iommu_group_add_device() API, and it is only doing it because it > > hasn't got a proper iommu driver implementation yet. It turns out > > their abuse is mislocked and maybe racy to boot :( > > > > > > - cdev APIs should ideally not require the user to know the group_id, > > > > we should try hard to design APIs to avoid this. > > > > > > This is a nuance, group_id vs group, where it's been previously > > > discussed that users will need to continue to know the boundaries of a > > > group for the purpose of DMA isolation and potentially IOAS > > > independence should cdev/iommufd choose to tackle those topics. > > > > Yes, group_id is a value we have no specific use for and would require > > userspace to keep seperate track of. I'd prefer to rely on dev_id as > > much as possible instead. > > > > > What is the actual proposal here? > > > > I don't know anymore, you don't seem to like this direction either... > > > > > You've said that hot-reset works if the iommufd_ctx has > > > representation from each affected group, the INFO ioctl remains as > > > it is, which suggests that it's reporting group ID and BDF, yet only > > > sysfs tells the user the relation between a vfio cdev and a group > > > and we're trying to enable a pass-by-fd model for cdev where the > > > user has no reference to a sysfs node for the device. Show me how > > > these pieces fit together. > > > > I prefer the version where INFO2 returns the dev_id, but info can work > > if we do the BDF cap like you suggested to Yi > > > > > OTOH, if we say IOMMU groups continue to exist [agreed], every vfio > > > device has an IOMMU group > > > > I don't desire every VFIO device to have an iommu_group. I want VFIO > > devices with real IOMMU drivers to have an iommu_group. mdev and > > no-iommu should not. 
I don't want to add them back into the design > > just so INFO has a value to return. > > > > I'd rather give no-iommu a dummy dev_id in iommufdctx then give it an > > iommu_group... > > > > I see this problem as a few basic requirements from a qemu-like > > application: > > > > 1) Does the configuration I was given support reset right now? > > 2) Will the configuration I was given support reset for the duration > > of my execution? > > 3) What groups of the devices I already have open does the reset > > effect? > > 4) For debugging, report to the user the full list of devices in the > > reset group, in a way that relates back to sysfs. > > 5) Away to trigger a reset on a group of devices > > > > #1/#2 is the API I suggested here. Ask the kernel if the current > > configuration works, and ask it to keep it working. > > > > #3 is either INFO and a CAP for BDF or INFO2 reporting dev_id > > > > #4 is either INFO and print the BDFs or INFO2 reporting the struct > > vfio_device IDR # (eg /sys/class/vfio/vfioXXX/). > > I hope we can have a clear statement on the _INFO or INFO2 usage. > Today, per QEMU's implementation, the output of _INFO is used to: > > 1) do a self-check to see if all the affected groups are opened by the > current user before it can invoke hot-reset. > 2) figure out the devices that are already opened by the user. QEMU > needs to save the state of such devices as the device may already > been in use. If so, its state should be saved and restored prior/post > the hot-reset. > > Seems like we are relaxing the self-check as it may be done by locking > the reset group. is it? I hope not. Locking the reset group suggests the user is able to extend their ownership. IMO we should not allow that. Thanks, Alex
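Editorial sketch, not part of the original mail: the two uses of _INFO that Yi lists can be modelled as two small loops over the returned array. This is a simplified userspace model under stated assumptions, not QEMU's actual code. The struct dep_dev type and all function names are hypothetical, and "owned" is assumed to be the list of group ids the user has opened.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical local mirror of one affected-device entry (illustration only). */
struct dep_dev {
    uint32_t group_id;
    uint16_t segment;
    uint8_t bus;
    uint8_t devfn;
};

static bool group_owned(uint32_t group_id, const uint32_t *owned, size_t n_owned)
{
    for (size_t i = 0; i < n_owned; i++)
        if (owned[i] == group_id)
            return true;
    return false;
}

/*
 * Use 1: self-check. Returns the index of the first affected entry whose
 * group is not owned (the entry to name in an error message), or -1 if
 * every affected group is owned.
 */
static long first_unowned(const struct dep_dev *devs, size_t n,
                          const uint32_t *owned, size_t n_owned)
{
    for (size_t i = 0; i < n; i++)
        if (!group_owned(devs[i].group_id, owned, n_owned))
            return (long)i;
    return -1;
}

static bool same_address(const struct dep_dev *a, const struct dep_dev *b)
{
    return a->segment == b->segment && a->bus == b->bus && a->devfn == b->devfn;
}

/*
 * Use 2: count how many affected entries correspond to devices the user
 * already has open, i.e. the devices whose state must be saved and restored.
 */
static size_t count_opened_affected(const struct dep_dev *devs, size_t n,
                                    const struct dep_dev *opened, size_t n_opened)
{
    size_t hits = 0;

    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n_opened; j++)
            if (same_address(&devs[i], &opened[j])) {
                hits++;
                break;
            }
    return hits;
}
```

The first loop needs group ids (ownership), the second needs addresses (specific devices), which is why the thread keeps returning to the point that neither identifier alone is enough.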
On Wed, 12 Apr 2023 12:05:50 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Wed, Apr 12, 2023 at 07:27:43AM +0000, Tian, Kevin wrote: > > > From: Jason Gunthorpe > > > Sent: Wednesday, April 12, 2023 8:01 AM > > > > > > I see this problem as a few basic requirements from a qemu-like > > > application: > > > > > > 1) Does the configuration I was given support reset right now? > > > 2) Will the configuration I was given support reset for the duration > > > of my execution? > > > 3) What groups of the devices I already have open does the reset > > > effect? > > > 4) For debugging, report to the user the full list of devices in the > > > reset group, in a way that relates back to sysfs. > > > 5) Away to trigger a reset on a group of devices > > > > > > #1/#2 is the API I suggested here. Ask the kernel if the current > > > configuration works, and ask it to keep it working. > > > > > > #3 is either INFO and a CAP for BDF or INFO2 reporting dev_id > > > > > > #4 is either INFO and print the BDFs or INFO2 reporting the struct > > > vfio_device IDR # (eg /sys/class/vfio/vfioXXX/). > > > > mdev doesn't have BDF. Of course it doesn't support hot_reset either. > > It should support a reset.. Maybe idxd doesn't, but it should be part > of the SIOV model. Our SIOV devices would need it for instance. IIRC we require mdev devices to support VFIO_DEVICE_RESET, hot-reset is a different beast. I assume SIOV device support would also require VFIO_DEVICE_RESET support and hot-reset would also be irrelevant to them. > > but it's presented to userspace as a pci device. Is it weird for a pci > > device which doesn't provide a BDF cap? > > It is weird for a PCI device, but it is not weird for a VFIO > device. Leaking the physical labels out of the uAPI is not clean, > IMHO. > > > from this point the vfio_device IDR# sounds more generic. > > Yes, I was thinking about this for the SIOV model. 
Seems like we're off on a tangent. The hot-reset ioctl is not relevant to devices simply because they expose a vfio-pci API; there is an underlying hardware aspect to it that anything only virtualizing a vfio-pci API shouldn't be concerned with. Thanks, Alex
On Wed, Apr 12, 2023 at 10:50:45AM -0600, Alex Williamson wrote: > > You keep going back to this, but I maintain userspace doesn't > > care. qemu is given a list of VFIO devices to use, all it wants to > > know is if it is allowed to use reset or not. Why should it need to > > know groups and group_ids to get that binary signal out of the kernel? > > hw/vfio/pci.c:2320 > error_report("vfio: Cannot reset device %s, " > "depends on group %d which is not owned.", > vdev->vbasedev.name, devices[i].group_id); > > That creates a feedback loop where a user can take corrective action > with actual information in hand to resolve the issue. Which is why I listed debugging as requirement #4, and solve requirement #4 by using the existing INFO and printing the BDF list it returns. > > The kicker is we don't force the user to generate a de-duplicated list > > of devices FDs, one per group, just because. > > So on one hand you're asking for simplicity, but on the other you're > criticizing a trivial simplification that we chose to allow the user to > pass number of group fds equal to number of devices affected so that > the user doesn't need to take that step to de-duplicate the list. We > can't win. It is not a simplification because the kernel is wired to accept only a list of exactly that group length, no more no less. It turns into a pointless puzzle that userspace has to solve, and it can only solve it by knowing about groups. If we get rid of groups we have to do something about this so userspace doesn't need to do the calculation. That is the point of this change. > > > You've said that hot-reset works if the iommufd_ctx has > > > representation from each affected group, the INFO ioctl remains as > > > it is, which suggests that it's reporting group ID and BDF, yet only > > > sysfs tells the user the relation between a vfio cdev and a group > > > and we're trying to enable a pass-by-fd model for cdev where the > > > user has no reference to a sysfs node for the device. 
Show me how > > > these pieces fit together. > > > > I prefer the version where INFO2 returns the dev_id, but info can work > > if we do the BDF cap like you suggested to Yi > > As discussed ad nauseam, dev-id is useless if an affected device is not > already within the iommufd ctx. The purpose of INFO2 is to satisfy requirement #3 - which is to report the affected devices *that are already opened*. For this dev_id is fine. There is nothing qemu can do with devices that are outside its iommufdctx, so it is pointless to tell it about them. It will generate the debug print of #4 using INFO. I don't think we need a single API here. > > I see this problem as a few basic requirements from a qemu-like > > application: > > > > 1) Does the configuration I was given support reset right now? > > 2) Will the configuration I was given support reset for the duration > > of my execution? > > 3) What groups of the devices I already have open does the reset > > effect? > > 4) For debugging, report to the user the full list of devices in the > > reset group, in a way that relates back to sysfs. > > 5) Away to trigger a reset on a group of devices > > > > #1/#2 is the API I suggested here. Ask the kernel if the current > > configuration works, and ask it to keep it working. > > That is super sketchy because you're also advocating for > opportunistically supporting reset if the instantaneous conditions > allow is (ex. unopened devices), and going back and forth whether "ask > it to keep working" suggests that a user is able to extend their > granted ownership themselves. I think both needs to be based on some > form of granted, not requested, ownership and not opportunism. Ok, let's give up on ownership then. > > #3 is either INFO and a CAP for BDF or INFO2 reporting dev_id > > Where dev-id is useful for... ? I think there's a misuse of "groups" > in 3) above, userspace needs to know specific devices affected, thus > BDF.
I did not mean "group of devices" to mean iommu_group, I meant "the set of devices affected by the reset". > > #4 is either INFO and print the BDFs or INFO2 reporting the struct > > vfio_device IDR # (eg /sys/class/vfio/vfioXXX/). > > We can't assume that all the affected devices are bound to vfio, > therefore we cannot assume a vfio_device IDR exists. So BDF is better for the debugging print. > > #5 is adjusting the FD list in existing RESET ioctl. Remove the need > > for userspace to specify a minimal exact list of FDs means userspace > > doesn't need the information to figure out what that list actually > > is. Pass a 0 length list and use iommufdctx. > > "...doesn't need the information to figure out what the list actually > is." That's false, userspace needs the information whether it uses it > to make a list or not, #3 is the need for the affected devices, and it is already covered. I mean that #5 should not need this, #5 is only about triggering the reset. What I want is a #5 action that does not require doing a calculation on group IDs. At the core, without any notion of groups, #5 requires userspace to pass in every opened device FD and the kernel checks that every opened device is in the passed FD list. Closed devices are ignored. Devices with unattached drivers are ignored. #5 does not need the answer to requirement #2. > So we need one or more ioctls that a) indicates whether > the ownership requirements are met If we reject the ownership direction, then I go back to suggesting that INFO2 should do this. > b) indicates the set of affected > devices. INFO2 will return the dev_id which is sufficient to satisfy requirement #3 > Is b) only the set of affected devices within the calling > devices iommufd_ctx (ie. dev-ids), I vote yes > in which case we need c) a way to > report the overall set of affected devices regardless of ownership in > support of 4), BDF? Yes, continue to use INFO unmodified.
> Are we back to replacing group-ids with dev-ids in the INFO structure, > where an invalid dev-id either indicates an affected device with > implied ownership (ok) or a gap in ownership (bad) and a flag somewhere > is meant to indicate the overall disposition based on the availability > of reset? As you explore in the following this gets ugly. I prefer to keep INFO unchanged and add INFO2. So maybe we should make patches that look something like this, try to come up with a workable INFO2 and squeeze no-iommu into it somehow. Jason
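Editorial sketch, not part of the original mail: the group-free rule Jason describes for #5 (every opened device in the reset set must appear in the passed FD list, while closed devices and devices with no driver attached are ignored) can be written as a small predicate. This is a userspace model of the rule, not kernel code. The struct set_dev type, its fields, and both function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one device in the set affected by the reset. */
struct set_dev {
    int open_fd;        /* fd through which the device is open, or -1 if closed */
    bool driver_bound;  /* false if no driver is attached to the device */
};

static bool fd_listed(int fd, const int *fds, size_t n_fds)
{
    for (size_t i = 0; i < n_fds; i++)
        if (fds[i] == fd)
            return true;
    return false;
}

/*
 * Returns true if the caller's FD list covers every opened device in the
 * set. Closed devices and devices without a driver are skipped, as in the
 * description above. No group ids are consulted.
 */
static bool reset_list_covers(const struct set_dev *set, size_t n,
                              const int *fds, size_t n_fds)
{
    for (size_t i = 0; i < n; i++) {
        if (!set[i].driver_bound || set[i].open_fd < 0)
            continue;
        if (!fd_listed(set[i].open_fd, fds, n_fds))
            return false;
    }
    return true;
}
```

In this model a device opened by some other user fails the check simply because its fd cannot appear in the caller's list, so the predicate never extends the caller's ownership.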
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Wednesday, April 12, 2023 11:06 PM > > On Wed, Apr 12, 2023 at 07:27:43AM +0000, Tian, Kevin wrote: > > > From: Jason Gunthorpe > > > Sent: Wednesday, April 12, 2023 8:01 AM > > > > > > I see this problem as a few basic requirements from a qemu-like > > > application: > > > > > > 1) Does the configuration I was given support reset right now? > > > 2) Will the configuration I was given support reset for the duration > > > of my execution? > > > 3) What groups of the devices I already have open does the reset > > > effect? > > > 4) For debugging, report to the user the full list of devices in the > > > reset group, in a way that relates back to sysfs. > > > 5) Away to trigger a reset on a group of devices > > > > > > #1/#2 is the API I suggested here. Ask the kernel if the current > > > configuration works, and ask it to keep it working. > > > > > > #3 is either INFO and a CAP for BDF or INFO2 reporting dev_id > > > > > > #4 is either INFO and print the BDFs or INFO2 reporting the struct > > > vfio_device IDR # (eg /sys/class/vfio/vfioXXX/). > > > > mdev doesn't have BDF. Of course it doesn't support hot_reset either. > > It should support a reset.. Maybe idxd doesn't, but it should be part > of the SIOV model. Our SIOV devices would need it for instance. yes, supporting VFIO_DEVICE_RESET is assumed. That is required by the siov spec. Then no need to support hot_reset. > > > but it's presented to userspace as a pci device. Is it weird for a pci > > device which doesn't provide a BDF cap? > > It is weird for a PCI device, but it is not weird for a VFIO > device. Leaking the physical labels out of the uAPI is not clean, > IMHO. yes. Reporting pasid is also incorrect since it's invisible to user. > > > from this point the vfio_device IDR# sounds more generic. > > Yes, I was thinking about this for the SIOV model. > > Jason
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Thursday, April 13, 2023 4:07 AM > > > > in which case we need c) a way to > > report the overall set of affected devices regardless of ownership in > > support of 4), BDF? > > Yes, continue to use INFO unmodified. > > > Are we back to replacing group-ids with dev-ids in the INFO structure, > > where an invalid dev-id either indicates an affected device with > > implied ownership (ok) or a gap in ownership (bad) and a flag somewhere > > is meant to indicate the overall disposition based on the availability > > of reset? > > As you explore in the following this gets ugly. I prefer to keep INFO > unchanged and add INFO2. > INFO needs a change when VFIO_GROUP is disabled. Now it assumes a valid iommu group always exists: vfio_pci_fill_devs() { ... iommu_group = iommu_group_get(&pdev->dev); if (!iommu_group) return -EPERM; /* Cannot reset non-isolated devices */ ... } Probably we need a special value e.g. -1 to represent noiommu case given valid group ids are positive. with that plus BDF cap, I'm curious what is the actual purpose of INFO2 or why cannot requirement#3 reuse the information collected via existing INFO? For each opened device Qemu can find the related group id via sysfs (if group exists) or an optional GROUP cap and use that id to match the group id in INFO. For noiommu it has a group id if VFIO_GROUP=y then same case. For noiommu if VFIO_GROUP=n just do exact match based on BDF. Either way the information returned by INFO is a superset of knowing the reset scope between opened devices. Thanks Kevin
On Thu, Apr 13, 2023 at 08:25:52AM +0000, Tian, Kevin wrote: > > From: Jason Gunthorpe <jgg@nvidia.com> > > Sent: Thursday, April 13, 2023 4:07 AM > > > > > > > in which case we need c) a way to > > > report the overall set of affected devices regardless of ownership in > > > support of 4), BDF? > > > > Yes, continue to use INFO unmodified. > > > > > Are we back to replacing group-ids with dev-ids in the INFO structure, > > > where an invalid dev-id either indicates an affected device with > > > implied ownership (ok) or a gap in ownership (bad) and a flag somewhere > > > is meant to indicate the overall disposition based on the availability > > > of reset? > > > > As you explore in the following this gets ugly. I prefer to keep INFO > > unchanged and add INFO2. > > > > INFO needs a change when VFIO_GROUP is disabled. Now it assumes > a valid iommu group always exists: > > vfio_pci_fill_devs() > { > ... > iommu_group = iommu_group_get(&pdev->dev); > if (!iommu_group) > return -EPERM; /* Cannot reset non-isolated devices */ > ... > } This can still work in a ugly way. With a INFO2 the only purpose of INFO would be debugging, so if someone uses no-iommu, with hotreset and misconfigures it then the only downside is they don't get the debugging print. But we know of nothing that uses this combination anyhow.. > with that plus BDF cap, I'm curious what is the actual purpose of > INFO2 or why cannot requirement#3 reuse the information collected > via existing INFO? It can - it is just more complicated for userspace to do it, it has to extract and match the BDFs and then run some algorithm to determine if the opened devices cover the right set of devices in the reset group, and it has to have some special code for no-iommu. VS info2 would return the dev_id's and a single yes/no if the right set is present. Kernel runs the algorithm instead of userspace, it seems more abstract this way. 
Also, if we make iommufd return a 'ioas dev_id group' as well it composes nicely that userspace just needs one translation from dev_id. Jason
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Thursday, April 13, 2023 7:51 PM > > On Thu, Apr 13, 2023 at 08:25:52AM +0000, Tian, Kevin wrote: > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > Sent: Thursday, April 13, 2023 4:07 AM > > > > > > > > > > in which case we need c) a way to > > > > report the overall set of affected devices regardless of ownership in > > > > support of 4), BDF? > > > > > > Yes, continue to use INFO unmodified. > > > > > > > Are we back to replacing group-ids with dev-ids in the INFO structure, > > > > where an invalid dev-id either indicates an affected device with > > > > implied ownership (ok) or a gap in ownership (bad) and a flag somewhere > > > > is meant to indicate the overall disposition based on the availability > > > > of reset? > > > > > > As you explore in the following this gets ugly. I prefer to keep INFO > > > unchanged and add INFO2. > > > > > > > INFO needs a change when VFIO_GROUP is disabled. Now it assumes > > a valid iommu group always exists: > > > > vfio_pci_fill_devs() > > { > > ... > > iommu_group = iommu_group_get(&pdev->dev); > > if (!iommu_group) > > return -EPERM; /* Cannot reset non-isolated devices */ > > ... > > } > > This can still work in a ugly way. With a INFO2 the only purpose of > INFO would be debugging, so if someone uses no-iommu, with hotreset > and misconfigures it then the only downside is they don't get the > debugging print. But we know of nothing that uses this combination > anyhow.. Today, at least QEMU will not go to do hot-reset if _INFO fails. I think this check may need to be relaxed if want _INFO work when there is no VFIO_GROUP (also no fake iommu_group). Regards, Yi Liu
On Thu, Apr 13, 2023 at 02:35:57PM +0000, Liu, Yi L wrote: > Today, at least QEMU will not go to do hot-reset if _INFO fails. I think > this check may need to be relaxed if want _INFO work when there is > no VFIO_GROUP (also no fake iommu_group). Current qemu does not work if there is no VFIO_GROUP, so it doesn't matter. In cdev mode qemu should work differently, we can make the kernel return -1 for group_id and qemu can ignore group_id for the debug print, or we can just make it fail. Given qemu doesn't, and can't, support no-iommu this is pretty fringe stuff. Jason
On Thu, 13 Apr 2023 08:50:45 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Thu, Apr 13, 2023 at 08:25:52AM +0000, Tian, Kevin wrote: > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > Sent: Thursday, April 13, 2023 4:07 AM > > > > > > > > > > in which case we need c) a way to > > > > report the overall set of affected devices regardless of ownership in > > > > support of 4), BDF? > > > > > > Yes, continue to use INFO unmodified. > > > > > > > Are we back to replacing group-ids with dev-ids in the INFO structure, > > > > where an invalid dev-id either indicates an affected device with > > > > implied ownership (ok) or a gap in ownership (bad) and a flag somewhere > > > > is meant to indicate the overall disposition based on the availability > > > > of reset? > > > > > > As you explore in the following this gets ugly. I prefer to keep INFO > > > unchanged and add INFO2. > > > > > > > INFO needs a change when VFIO_GROUP is disabled. Now it assumes > > a valid iommu group always exists: > > > > vfio_pci_fill_devs() > > { > > ... > > iommu_group = iommu_group_get(&pdev->dev); > > if (!iommu_group) > > return -EPERM; /* Cannot reset non-isolated devices */ > > ... > > } > > This can still work in a ugly way. With a INFO2 the only purpose of > INFO would be debugging, so if someone uses no-iommu, with hotreset > and misconfigures it then the only downside is they don't get the > debugging print. But we know of nothing that uses this combination > anyhow.. > > > with that plus BDF cap, I'm curious what is the actual purpose of > > INFO2 or why cannot requirement#3 reuse the information collected > > via existing INFO? > > It can - it is just more complicated for userspace to do it, it has to > extract and match the BDFs and then run some algorithm to determine if > the opened devices cover the right set of devices in the reset group, > and it has to have some special code for no-iommu. 
> > VS info2 would return the dev_id's and a single yes/no if the right > set is present. Kernel runs the algorithm instead of userspace, it > seems more abstract this way. > > Also, if we make iommufd return a 'ioas dev_id group' as well it > composes nicely that userspace just needs one translation from dev_id. IIUC, the semantics we're proposing are that an INFO2 ioctl would return success or failure indicating whether the user has sufficient ownership of the affected devices, and in the success case returns an array of affected dev-ids within the user's iommufd_ctx. Unopened, affected devices are not reported via INFO2, and unopened, affected devices outside the user's scope of ownership (ie. outside the owned IOMMU group) will generate a failure condition. As for the INFO ioctl, it's described as unchanged, which does raise the question of what is reported for IOMMU groups and how the value there coherently relates to anything else in the cdev-exclusive vfio API... We had already iterated a proposal where the group-id is replaced with a dev-id in the existing ioctl and a flag indicates when the return value is a dev-id vs group-id. This had a gap that userspace cannot determine if a reset is available given this information since un-owned devices report an invalid dev-id and userspace can't know if it has implicit ownership. It seems cleaner to me, though, that we could still re-use INFO in a similar way, simply defining a new flag bit which is valid only in the case of returning dev-ids and indicates if the reset is available. Therefore in one ioctl, userspace knows if hot-reset is available (based on a kernel determination) and can pull valid dev-ids from the array to associate affected, owned devices, and still has the equivalent information to know that one or more of the devices listed with an invalid dev-id are preventing the hot-reset from being available. Is that an option? Thanks, Alex
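As a rough illustration of how userspace might consume such an extended INFO result, the sketch below assumes two flag bits (one marking the id field as a dev-id, one marking the reset as available) and an "invalid dev-id" encoding of 0. All names and values here (HR_INFO_FLAG_DEV_ID, HR_INFO_FLAG_RESETTABLE, HR_INVALID_DEV_ID, struct hr_info) are hypothetical; the thread does not define them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits and invalid-id encoding (not an existing uAPI). */
#define HR_INFO_FLAG_DEV_ID     (1u << 0) /* id holds a dev-id, not a group-id */
#define HR_INFO_FLAG_RESETTABLE (1u << 1) /* kernel inferred enough ownership */
#define HR_INVALID_DEV_ID       0u

struct hr_dev {
	uint32_t id;       /* dev-id or group-id, depending on flags */
	uint16_t segment;
	uint8_t  bus;
	uint8_t  devfn;
};

struct hr_info {
	uint32_t flags;
	uint32_t count;
	const struct hr_dev *devices;
};

/* True only when dev-ids are reported and the kernel set the reset flag. */
static bool hr_reset_available(const struct hr_info *info)
{
	assert(info);
	return (info->flags & HR_INFO_FLAG_DEV_ID) &&
	       (info->flags & HR_INFO_FLAG_RESETTABLE);
}

/*
 * When the reset is not available, at least one entry with an invalid
 * dev-id is the reason. Store up to max such entries in out and return
 * how many exist. Returns 0 in group-id mode or when reset is available.
 */
static size_t hr_blocking_candidates(const struct hr_info *info,
				     const struct hr_dev **out, size_t max)
{
	size_t found = 0;

	assert(info && (info->count == 0 || info->devices));
	if (!(info->flags & HR_INFO_FLAG_DEV_ID) || hr_reset_available(info))
		return 0;

	for (uint32_t i = 0; i < info->count; i++) {
		if (info->devices[i].id != HR_INVALID_DEV_ID)
			continue;
		if (out && found < max)
			out[found] = &info->devices[i];
		found++;
	}
	return found;
}
```

Note that the candidates are only candidates: with the flag clear, userspace cannot tell from this data alone which invalid entries are implicitly owned and which are truly blocking.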
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Friday, April 14, 2023 2:07 AM > > We had already iterated a proposal where the group-id is replaced with > a dev-id in the existing ioctl and a flag indicates when the return > value is a dev-id vs group-id. This had a gap that userspace cannot > determine if a reset is available given this information since un-owned > devices report an invalid dev-id and userspace can't know if it has > implicit ownership. > > It seems cleaner to me though that we would could still re-use INFO in > a similar way, simply defining a new flag bit which is valid only in > the case of returning dev-ids and indicates if the reset is available. > Therefore in one ioctl, userspace knows if hot-reset is available > (based on a kernel determination) and can pull valid dev-ids from the So the kernel needs to compare the group id between devices with valid dev-ids and devices with invalid dev-ids to decide the implicit ownership. For noiommu device which has no group_id when VFIO_GROUP is off then it's resettable only if having a valid dev_id. The only corner case with this option is when a user mixes group and cdev usages. iirc you mentioned it's a valid usage to be supported. In that case the kernel doesn't have sufficient knowledge to judge 'resettable' as it doesn't know which groups are opened by this user. Not sure whether we can leave it in a ugly way so INFO may not tell 'resettable' accurately in that weird scenario. > array to associate affected, owned devices, and still has the > equivalent information to know that one or more of the devices listed > with an invalid dev-id are preventing the hot-reset from being > available. > > Is that an option? Thanks, > This works for me if above corner case can be waived.
> From: Tian, Kevin <kevin.tian@intel.com> > Sent: Friday, April 14, 2023 5:12 PM > > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Friday, April 14, 2023 2:07 AM > > > > We had already iterated a proposal where the group-id is replaced with > > a dev-id in the existing ioctl and a flag indicates when the return > > value is a dev-id vs group-id. This had a gap that userspace cannot > > determine if a reset is available given this information since un-owned > > devices report an invalid dev-id and userspace can't know if it has > > implicit ownership. > > > > > It seems cleaner to me though that we would could still re-use INFO in > > a similar way, simply defining a new flag bit which is valid only in > > the case of returning dev-ids and indicates if the reset is available. > > Therefore in one ioctl, userspace knows if hot-reset is available > > (based on a kernel determination) and can pull valid dev-ids from the We need to confirm the meaning of the hot-reset-available flag. I think at least the two conditions below must hold for the flag to be set, although the flag does not guarantee that hot-reset succeeds (it should only make success very likely). 1) the dev_set is resettable (all affected devices are in the dev_set) 2) the affected devices are owned by the current user Also, we need to assume that the two cases below are rare; if a user hits one of them, that is just bad luck. I think the existing _INFO and hot-reset already make this assumption, so cdev mode can adopt it as well. a) physical topology change (e.g. new devices plugged into an affected slot) b) an affected device is unbound from vfio > So the kernel needs to compare the group id between devices with > valid dev-ids and devices with invalid dev-ids to decide the implicit > ownership. For noiommu device which has no group_id when > VFIO_GROUP is off then it's resettable only if having a valid dev_id. In cdev mode, a noiommu device doesn't have a dev_id, as it is not bound to a valid iommufd.
So if VFIO_GROUP is off, we may never allow hot-reset for noiommu devices. But we don't want a regression for noiommu devices. Perhaps we could define the usage of the resettable flag like this: 1) if it is set, the user does not need to own all the affected devices, as some of them may be owned implicitly. The kernel should have checked this. 2) if the flag is not set, the user needs to check ownership by itself. It needs to own all the affected devices. If it does not, it must not do hot-reset. This way noiommu devices can still support hot-reset just as they do when VFIO_GROUP is on. Because noiommu devices have fake groups, such groups are all singletons. So checking that all affected devices are opened by the user is the same as checking all affected groups. > The only corner case with this option is when a user mixes group > and cdev usages. iirc you mentioned it's a valid usage to be supported. > In that case the kernel doesn't have sufficient knowledge to judge > 'resettable' as it doesn't know which groups are opened by this user. > > Not sure whether we can leave it in a ugly way so INFO may not tell > 'resettable' accurately in that weird scenario. This does not seem easy to support. If the above scenario is allowed, there can be three cases that return an invalid dev_id. 1) devices not opened by user but owned implicitly 2) devices not owned by user 3) devices opened via group but owned by user The user would need more info to tell these cases apart. > > array to associate affected, owned devices, and still has the > > equivalent information to know that one or more of the devices listed > > with an invalid dev-id are preventing the hot-reset from being > > available. > > > > Is that an option? Thanks, > > > > This works for me if above corner case can be waived. One side check, perhaps already confirmed in a prior email.
@Alex, so the reason for predicting whether hot-reset will work is to avoid a needless vfio_pci_pre_reset(), which does heavy operations such as stopping DMA and copying config space. Is that right? Is there any other special reason? In any case, my understanding is that this reason alone is enough to justify the prediction. Regards, Yi Liu
On Fri, 14 Apr 2023 09:11:30 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote: > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Friday, April 14, 2023 2:07 AM > > > > We had already iterated a proposal where the group-id is replaced with > > a dev-id in the existing ioctl and a flag indicates when the return > > value is a dev-id vs group-id. This had a gap that userspace cannot > > determine if a reset is available given this information since un-owned > > devices report an invalid dev-id and userspace can't know if it has > > implicit ownership. > > > > It seems cleaner to me though that we would could still re-use INFO in > > a similar way, simply defining a new flag bit which is valid only in > > the case of returning dev-ids and indicates if the reset is available. > > Therefore in one ioctl, userspace knows if hot-reset is available > > (based on a kernel determination) and can pull valid dev-ids from the > > So the kernel needs to compare the group id between devices with > valid dev-ids and devices with invalid dev-ids to decide the implicit > ownership. For noiommu device which has no group_id when > VFIO_GROUP is off then it's resettable only if having a valid dev_id. With no-iommu and VFIO_GROUP on, each no-iommu device gets its own group and the user must have ownership of each affected group, so there's really no difference here. Every affected no-iommu device must be owned in either case. > The only corner case with this option is when a user mixes group > and cdev usages. iirc you mentioned it's a valid usage to be supported. > In that case the kernel doesn't have sufficient knowledge to judge > 'resettable' as it doesn't know which groups are opened by this user. So for example we might have a 2-function device, fn0 is opened via cdev and part of an iommufd ctx and fn1 is opened via the group interface and potentially bound to a type1 container context.
In the INFO/INFO2 proposal, the INFO ioctl would return an array reporting the group and BDF for each function. The INFO ioctl is callable from either device (aiui). The INFO2 ioctl would fail on the group opened device because it doesn't have an iommufd_ctx. When called on the cdev opened device, INFO2 would fail because the dev-set is not represented within the iommufd_ctx. Is this right? In my proposal, the INFO ioctl can also be called on either device. When called on the cdev opened device, the return structure provides dev-ids, with a flag indicating as much. The cdev device has a valid dev-id, the group device an invalid one. The reset-available flag is clear because the kernel cannot infer ownership of the group opened device. When called on the group opened device, the IOMMU group and BDF are returned for each device. So both approaches have similar issues here, but I think there's an advantage to the approach of extending INFO. In that case, the user still gets the dev-id of the affected cdev device and therefore could build a hot-reset ioctl call using a combination of groupfds and devicefds, even if the cdev opened devices are passed by fd. Perhaps it's obvious that the hot-reset device is itself affected by the reset, but I think the example scenario could be extended to one where there are multiple cdev opened devices and one or more group opened devices. AIUI, the INFO2 proposal essentially only returns success if the null-array approach is supported, ie. the kernel can infer the full ownership of the dev-set. However, I think we could still support a proof-of-ownership based hot-reset with devicefds and groupfds provided by the user. I think what this means is that the flag we're exposing is not "hot-reset available", but really whether the kernel can infer ownership and the ownership conditions are satisfied.
Therefore it essentially only flags the availability of the null-array interface while the proof-of-ownership approach is always available. > Not sure whether we can leave it in a ugly way so INFO may not tell > 'resettable' accurately in that weird scenario. Is it still ugly with the above design? Thanks, Alex
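The distinction between the two calling models can be sketched as a small userspace decision helper. This is only an illustration under stated assumptions: "null-array" is taken to mean a hot-reset call that passes no fds, "proof-of-ownership" one that passes a device fd or group fd covering every affected device, and the type and function names (hr_call_model, hr_affected, hr_pick_model) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Which hot-reset calling model userspace could use (hypothetical enum). */
enum hr_call_model {
	HR_CALL_NULL_ARRAY,         /* kernel infers ownership, no fds passed */
	HR_CALL_PROOF_OF_OWNERSHIP, /* user passes device fds and/or group fds */
	HR_CALL_UNAVAILABLE,        /* some affected device cannot be proven */
};

/* Per affected device: does the user hold a device fd or group fd for it? */
struct hr_affected {
	bool user_has_fd;
};

/*
 * If the kernel flagged that it can infer ownership, the null-array model
 * is usable. Otherwise the user must be able to prove ownership of every
 * affected device with an fd it already holds.
 */
static enum hr_call_model hr_pick_model(bool kernel_inferred_ownership,
					const struct hr_affected *devs,
					size_t count)
{
	assert(count == 0 || devs);

	if (kernel_inferred_ownership)
		return HR_CALL_NULL_ARRAY;

	for (size_t i = 0; i < count; i++) {
		if (!devs[i].user_has_fd)
			return HR_CALL_UNAVAILABLE;
	}
	return HR_CALL_PROOF_OF_OWNERSHIP;
}
```

In the 2-function example above, fn0 (cdev) and fn1 (group) are both held by the user, so the helper would select the proof-of-ownership model even though the flag is clear.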
On Fri, 14 Apr 2023 11:38:24 +0000 "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > From: Tian, Kevin <kevin.tian@intel.com> > > Sent: Friday, April 14, 2023 5:12 PM > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > Sent: Friday, April 14, 2023 2:07 AM > > > > > > We had already iterated a proposal where the group-id is replaced with > > > a dev-id in the existing ioctl and a flag indicates when the return > > > value is a dev-id vs group-id. This had a gap that userspace cannot > > > determine if a reset is available given this information since un-owned > > > devices report an invalid dev-id and userspace can't know if it has > > > implicit ownership. > > > > > > > > It seems cleaner to me though that we would could still re-use INFO in > > > a similar way, simply defining a new flag bit which is valid only in > > > the case of returning dev-ids and indicates if the reset is available. > > > Therefore in one ioctl, userspace knows if hot-reset is available > > > (based on a kernel determination) and can pull valid dev-ids from the > > Need to confirm the meaning of hot-reset available flag. I think it > should at least meet below two conditions to set this flag. Although > it may not mean hot-reset is for sure to succeed. (but should be > a high chance). > > 1) dev_set is resettable (all affected device are in dev_set) > 2) affected device are owned by the current user Per thread with Kevin, ownership can't always be known by the kernel. Beyond the group vs cdev discussion there, isn't it also possible (though perhaps not recommended) that a user can have multiple iommufd ctxs? So I think 2) becomes "ownership of the affected dev-set can be inferred from the iommufd_ctx of the calling device", iow, the null-array calling model is available and the flag is redefined to match. Reset may still be available via the proof-of-ownership model. > Also, we need to has assumption that below two cases are rare > if user encounters it, it just bad luck for them. 
I think the existing > _INFO and hot-reset already has such assumption. So cdev mode > can adopt it as well. > > a) physical topology change (e.g. new devices plugged to affected slot) > b) an affected device is unbound from vfio Yes, these are sufficiently rare that we can't do much about them. > > So the kernel needs to compare the group id between devices with > > valid dev-ids and devices with invalid dev-ids to decide the implicit > > ownership. For noiommu device which has no group_id when > > VFIO_GROUP is off then it's resettable only if having a valid dev_id. > > In cdev mode, noiommu device doesn't have dev_id as it is not > bound to valid iommufd. So if VFIO_GROUP is off, we may never > allow hot-reset for noiommu devices. But we don't want to have > regression with noiommu devices. Perhaps we may define the usage > of the resettable flag like this: > 1) if it is set, user does not need to own all the affected devices as > some of them may have been owned implicitly. Kernel should have > checked it. > 2) if the flag is not set, that means user needs to check ownership > by itself. It needs to own all the affected devices. If not, don't > do hot-reset. Exactly, the flag essentially indicates that the null-array approach is available, lack of the flag indicates proof-of-ownership is required. > This way we can still make noiommu devices support hot-reset > just like VFIO_GROUP is on. Because noiommu devices have fake > groups, such groups are all singleton. So checking all affected > devices are opened by user is just same as check all affected > groups. Yep. > > The only corner case with this option is when a user mixes group > > and cdev usages. iirc you mentioned it's a valid usage to be supported. > > In that case the kernel doesn't have sufficient knowledge to judge > > 'resettable' as it doesn't know which groups are opened by this user. 
> > > > Not sure whether we can leave it in a ugly way so INFO may not tell > > 'resettable' accurately in that weird scenario. > > This seems not easy to support. If above scenario is allowed there can be > three cases that returns invalid dev_id. > 1) devices not opened by user but owned implicitly The cdev approach has a hard time with this in general, it has no way to represent unopened devices, so any case where the nature of an unopened device blocks reset on the dev-set is rather opaque to the user. > 2) devices not owned by user (and presumably not owned) We still provide BDF. Not much difference from the group case here, being able to point to a BDF or group is about all we can do. > 3) devices opened via group but owned by user I think this still works in the proof-of-ownership, passing fds to hot-reset model. > User would require more info to tell the above cases from each other. Obviously we could be equivalent to the group model if IOMMU groups were exposed for a device and all devices had IOMMU groups, but reasons... > > > array to associate affected, owned devices, and still has the > > > equivalent information to know that one or more of the devices listed > > > with an invalid dev-id are preventing the hot-reset from being > > > available. > > > > > > Is that an option? Thanks, > > > > > > > This works for me if above corner case can be waived. > > One side check, perhaps already confirmed in prior email. @Alex, So > the reason for the prediction of hot-reset is to avoid the possible > vfio_pci_pre_reset() which does heavy operations like stop DMA and > copy config space. Is it? Any other special reason? Anyhow, this reason > is enough for this prediction per my understanding. It's not clear to me what "prediction" is referring to.
As above, I think we can redefine the reset-available flag I proposed to more restrictively indicate that the null-array approach is available based on the dev-set group in relation to the iommufd_ctx of the calling device. Prediction of the affected devices seems like basic functionality to me, we can't assume the user's usage model, they must be able to make a well informed decision regarding affected devices. Thanks, Alex
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Saturday, April 15, 2023 1:11 AM > > On Fri, 14 Apr 2023 11:38:24 +0000 > "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > > > From: Tian, Kevin <kevin.tian@intel.com> > > > Sent: Friday, April 14, 2023 5:12 PM > > > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > > Sent: Friday, April 14, 2023 2:07 AM > > > > > > > > We had already iterated a proposal where the group-id is replaced with > > > > a dev-id in the existing ioctl and a flag indicates when the return > > > > value is a dev-id vs group-id. This had a gap that userspace cannot > > > > determine if a reset is available given this information since un-owned > > > > devices report an invalid dev-id and userspace can't know if it has > > > > implicit ownership. > > > > > > > > > > > It seems cleaner to me though that we would could still re-use INFO in > > > > a similar way, simply defining a new flag bit which is valid only in > > > > the case of returning dev-ids and indicates if the reset is available. > > > > Therefore in one ioctl, userspace knows if hot-reset is available > > > > (based on a kernel determination) and can pull valid dev-ids from the > > > > Need to confirm the meaning of hot-reset available flag. I think it > > should at least meet below two conditions to set this flag. Although > > it may not mean hot-reset is for sure to succeed. (but should be > > a high chance). > > > > 1) dev_set is resettable (all affected device are in dev_set) > > 2) affected device are owned by the current user > > Per thread with Kevin, ownership can't always be known by the kernel. > Beyond the group vs cdev discussion there, isn't it also possible > (though perhaps not recommended) that a user can have multiple iommufd > ctxs? So I think 2) becomes "ownership of the affected dev-set can be > inferred from the iommufd_ctx of the calling device", iow, the > null-array calling model is available and the flag is redefined to > match. 
Reset may still be available via the proof-of-ownership model. Yes, if there are multiple iommufd ctxs, this shall fall back to use the proof-of-ownership model. > > > Also, we need to has assumption that below two cases are rare > > if user encounters it, it just bad luck for them. I think the existing > > _INFO and hot-reset already has such assumption. So cdev mode > > can adopt it as well. > > > > a) physical topology change (e.g. new devices plugged to affected slot) > > b) an affected device is unbound from vfio > > Yes, these are sufficiently rare that we can't do much about them. > > > > So the kernel needs to compare the group id between devices with > > > valid dev-ids and devices with invalid dev-ids to decide the implicit > > > ownership. For noiommu device which has no group_id when > > > VFIO_GROUP is off then it's resettable only if having a valid dev_id. > > > > In cdev mode, noiommu device doesn't have dev_id as it is not > > bound to valid iommufd. So if VFIO_GROUP is off, we may never > > allow hot-reset for noiommu devices. But we don't want to have > > regression with noiommu devices. Perhaps we may define the usage > > of the resettable flag like this: > > 1) if it is set, user does not need to own all the affected devices as > > some of them may have been owned implicitly. Kernel should have > > checked it. > > 2) if the flag is not set, that means user needs to check ownership > > by itself. It needs to own all the affected devices. If not, don't > > do hot-reset. > > Exactly, the flag essentially indicates that the null-array approach is > available, lack of the flag indicates proof-of-ownership is required. > > > This way we can still make noiommu devices support hot-reset > > just like VFIO_GROUP is on. Because noiommu devices have fake > > groups, such groups are all singleton. So checking all affected > > devices are opened by user is just same as check all affected > > groups. > > Yep. 
> > > > The only corner case with this option is when a user mixes group > > > and cdev usages. iirc you mentioned it's a valid usage to be supported. > > > In that case the kernel doesn't have sufficient knowledge to judge > > > 'resettable' as it doesn't know which groups are opened by this user. > > > > > > Not sure whether we can leave it in a ugly way so INFO may not tell > > > 'resettable' accurately in that weird scenario. > > > > This seems not easy to support. If above scenario is allowed there can be > > three cases that returns invalid dev_id. > > 1) devices not opened by user but owned implicitly > > The cdev approach has a hard time with this in general, it has no way to > represent unopened devices. so any case where the nature of an unopened > device block reset on the dev-set is rather opaque to the user. > > > 2) devices not owned by user > > (and presumable not owned) We still provide BDF. Not much difference > from the group case here, being able to point to a BDF or group is > about all we can do. > > > 3) devices opened via group but owned by user > > I think this still works in the proof-of-ownership, passing fds to > hot-reset model. OK, let's check whether the scenario below and the user's handling of it make sense. Say there are five devices (devA, devB, devC, devD, devE) in the same reset group. devA and devB are in the same iommu group. devC, devD, and devE have separate iommu groups. Say devA is opened via cdev, devB is not opened, devC is opened via group, devD is opened via cdev but bound to a different iommufd_ctx than devA, and devE is not opened by any user. If this INFO is called on devA, the user should get a valid dev_id for devA but four invalid dev_ids. The resettable flag should be clear. Below is how the user would handle the returned info.
- For devB, the user gets the group_id of devA and the group_id of devB, and can therefore check ownership of devB by comparing the groups - For devC, the user can check ownership via the returned group_id and BDF - For devD, if it is opened by the user, the user should be able to find it by BDF - For devE, the user fails to find it and hence concludes it has no ownership of it. To finish the above check, the user needs to get the group_id via the dev_id and also via the device fd. Is that right? The above example may be the trickiest scenario. Is that right? The user shall not do hot-reset here, as not all affected devices are owned by the user. But if devE were also opened by the user, it could do hot-reset. > > User would require more info to tell the above cases from each other. > > Obviously we could be equivalent to the group model if IOMMU groups > were exposed for a device and all devices had IOMMU groups, but > reasons... > > > > > array to associate affected, owned devices, and still has the > > > > equivalent information to know that one or more of the devices listed > > > > with an invalid dev-id are preventing the hot-reset from being > > > > available. > > > > > > > > Is that an option? Thanks, > > > > > > > > > > This works for me if above corner case can be waived. > > > > One side check, perhaps already confirmed in prior email. @Alex, So > > the reason for the prediction of hot-reset is to avoid the possible > > vfio_pci_pre_reset() which does heavy operations like stop DMA and > > copy config space. Is it? Any other special reason? Anyhow, this reason > > is enough for this prediction per my understanding. > > It's not clear to me what "prediction" is referring to. It refers to predicting whether the hot-reset ioctl can work, as you mentioned in a prior discussion [1].
"I disagree, as I've argued before, the info ioctl becomes so weak and effectively arbitrary from a user perspective at being able to predict whether the hot-reset ioctl works that it becomes useless, diminishing the entire hot-reset info/execute API." [1] https://lore.kernel.org/kvm/20230405134945.29e967be.alex.williamson@redhat.com/ > As above, I > think we can redefine the reset-available flag I proposed to more > restrictively indicate that the null-array approach is available based > on the dev-set group in relation to the iommufd_ctx of the calling > device. Prediction of the affected devices seems like basic > functionality to me, we can't assume the user's usage model, they must > be able to make a well informed decision regarding affected devices. > Thanks, As my above reply with the five-device scenario. It still needs to get group_id to check implicit ownership in the case of sharing the same iommu_group. Regards, Yi Liu
On Fri, Apr 14, 2023 at 09:11:30AM +0000, Tian, Kevin wrote: > The only corner case with this option is when a user mixes group > and cdev usages. iirc you mentioned it's a valid usage to be supported. > In that case the kernel doesn't have sufficient knowledge to judge > 'resettable' as it doesn't know which groups are opened by this user. IMHO we don't need to support this combination. We can say that to use the hot reset API the user must put all their devices into the same iommufd_ctx and cover 100% of the known use cases for this. There are already other situations, like nesting, that do force users to put everything into one iommufd_ctx. No reason to make things harder and more complicated. I'm coming to the feeling that we should put no-iommu devices in iommufd_ctx's as well. They would be an iommufd_access like mdevs. That would clean up the complications they cause here. I suppose we should have done that from the beginning - no-iommu is an IOMMUFD access, it just uses a crazy /proc based way to learn the PFNs. Making it a proper access and making a real VFIO ioctl that calls iommufd_access_pin_pages() and returns the DMA mapped addresses to userspace would go a long way to making no-iommu work in a logical, usable, way. Jason
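To make the single-iommufd_ctx rule concrete, here is a toy model of the ownership inference, applied to the five-device scenario from earlier in the thread (devA and devB share an IOMMU group; devC is opened via the group interface; devD is bound to a different iommufd_ctx; devE is unopened). It is a sketch under stated assumptions, not kernel code: struct sim_dev, CTX_NONE, the group and ctx numbers, and both helper functions are hypothetical, and a group-opened device is modelled simply as having no iommufd_ctx.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CTX_NONE (-1) /* device is not bound to any iommufd_ctx */

struct sim_dev {
	const char *name;
	int iommu_group; /* illustrative IOMMU group number */
	int ctx;         /* illustrative iommufd_ctx id, or CTX_NONE */
};

/* The five-device scenario; the calling device devA lives in ctx 0. */
static const struct sim_dev five_devs[] = {
	{ "devA", 1, 0 },        /* opened via cdev, bound to ctx 0 */
	{ "devB", 1, CTX_NONE }, /* unopened, shares devA's IOMMU group */
	{ "devC", 2, CTX_NONE }, /* opened via group, so no iommufd_ctx */
	{ "devD", 3, 1 },        /* opened via cdev, bound to another ctx */
	{ "devE", 4, CTX_NONE }, /* not opened by anyone */
};

/*
 * Ownership as inferable from one ctx: the device is bound to that ctx,
 * or it shares an IOMMU group with a device that is (implicit ownership).
 */
static bool sim_owned_by_ctx(const struct sim_dev *devs, size_t n,
			     size_t idx, int ctx)
{
	assert(devs && idx < n && ctx != CTX_NONE);

	if (devs[idx].ctx == ctx)
		return true;

	for (size_t i = 0; i < n; i++) {
		if (devs[i].ctx == ctx &&
		    devs[i].iommu_group == devs[idx].iommu_group)
			return true;
	}
	return false;
}

/* The kernel could infer full ownership only if every device passes. */
static bool sim_null_array_available(const struct sim_dev *devs, size_t n,
				     int ctx)
{
	for (size_t i = 0; i < n; i++) {
		if (!sim_owned_by_ctx(devs, n, i, ctx))
			return false;
	}
	return true;
}
```

In this model devA and devB pass for ctx 0 while devC, devD and devE do not, so inference fails for the mixed setup. If all opened devices sat in one iommufd_ctx and no affected device were left outside an owned group, the check would pass.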
On Thu, Apr 13, 2023 at 12:07:12PM -0600, Alex Williamson wrote: > IIUC, the semantics we're proposing is that an INFO2 ioctl would return > success or failure indicating whether the user has sufficient ownership > of the affected devices, Or a flag, but yes > and in the success case returns an array of > affected dev-ids within the user's iommufd_ctx. Unopened, affected > devices, are not reported via INFO2, and unopened, affected devices > outside the user's scope of ownership (ie. outside the owned IOMMU > group) will generate a failure condition. Yes > As for the INFO ioctl, it's described as unchanged, which does raise > the question of what is reported for IOMMU groups and how does the > value there coherently relate to anything else in the cdev-exclusive > vfio API... For cdev mode the value of the group_id has no functional purpose. INFO has no functional purpose beyond debugging. The cdev enabled userspace should print the BDFs from the INFO in a debug message and ignore the group_id. Kernel will still fill the group_id using the iommu_group_get() stuff, and set -1 for no-iommu. > We had already iterated a proposal where the group-id is replaced with > a dev-id in the existing ioctl and a flag indicates when the return > value is a dev-id vs group-id. This had a gap that userspace cannot > determine if a reset is available given this information since un-owned > devices report an invalid dev-id and userspace can't know if it has > implicit ownership. IIRC, yes. > It seems cleaner to me though that we would could still re-use INFO in > a similar way, simply defining a new flag bit which is valid only in > the case of returning dev-ids and indicates if the reset is > available. Yes, it could be done like this as well. INFO2 is more of a discussion object; how we encode it in the uAPI matters a lot less.
The point is that INFO2, as an idea, returns information that no other existing API returns: the "ownership passed flag" and the "dev_id list". Then, as I said in the other mail, we roll no-iommu into an iommufd_ctx object and just follow the design that userspace must have a single iommufd_ctx containing all the devices to use the hot reset feature. Jason
On Mon, 17 Apr 2023 04:20:27 +0000 "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Saturday, April 15, 2023 1:11 AM > > > > On Fri, 14 Apr 2023 11:38:24 +0000 > > "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > > > > > From: Tian, Kevin <kevin.tian@intel.com> > > > > Sent: Friday, April 14, 2023 5:12 PM > > > > > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > > > Sent: Friday, April 14, 2023 2:07 AM > > > > > > > > > > We had already iterated a proposal where the group-id is replaced with > > > > > a dev-id in the existing ioctl and a flag indicates when the return > > > > > value is a dev-id vs group-id. This had a gap that userspace cannot > > > > > determine if a reset is available given this information since un-owned > > > > > devices report an invalid dev-id and userspace can't know if it has > > > > > implicit ownership. > > > > > > > > > > > > > > It seems cleaner to me though that we would could still re-use INFO in > > > > > a similar way, simply defining a new flag bit which is valid only in > > > > > the case of returning dev-ids and indicates if the reset is available. > > > > > Therefore in one ioctl, userspace knows if hot-reset is available > > > > > (based on a kernel determination) and can pull valid dev-ids from the > > > > > > Need to confirm the meaning of hot-reset available flag. I think it > > > should at least meet below two conditions to set this flag. Although > > > it may not mean hot-reset is for sure to succeed. (but should be > > > a high chance). > > > > > > 1) dev_set is resettable (all affected device are in dev_set) > > > 2) affected device are owned by the current user > > > > Per thread with Kevin, ownership can't always be known by the kernel. > > Beyond the group vs cdev discussion there, isn't it also possible > > (though perhaps not recommended) that a user can have multiple iommufd > > ctxs? 
So I think 2) becomes "ownership of the affected dev-set can be > > inferred from the iommufd_ctx of the calling device", iow, the > > null-array calling model is available and the flag is redefined to > > match. Reset may still be available via the proof-of-ownership model. > > Yes, if there are multiple iommufd ctxs, this shall fall back to use > the proof-of-ownership model. > > > > > > Also, we need to has assumption that below two cases are rare > > > if user encounters it, it just bad luck for them. I think the existing > > > _INFO and hot-reset already has such assumption. So cdev mode > > > can adopt it as well. > > > > > > a) physical topology change (e.g. new devices plugged to affected slot) > > > b) an affected device is unbound from vfio > > > > Yes, these are sufficiently rare that we can't do much about them. > > > > > > So the kernel needs to compare the group id between devices with > > > > valid dev-ids and devices with invalid dev-ids to decide the implicit > > > > ownership. For noiommu device which has no group_id when > > > > VFIO_GROUP is off then it's resettable only if having a valid dev_id. > > > > > > In cdev mode, noiommu device doesn't have dev_id as it is not > > > bound to valid iommufd. So if VFIO_GROUP is off, we may never > > > allow hot-reset for noiommu devices. But we don't want to have > > > regression with noiommu devices. Perhaps we may define the usage > > > of the resettable flag like this: > > > 1) if it is set, user does not need to own all the affected devices as > > > some of them may have been owned implicitly. Kernel should have > > > checked it. > > > 2) if the flag is not set, that means user needs to check ownership > > > by itself. It needs to own all the affected devices. If not, don't > > > do hot-reset. > > > > Exactly, the flag essentially indicates that the null-array approach is > > available, lack of the flag indicates proof-of-ownership is required. 
> > > > > This way we can still make noiommu devices support hot-reset > > > just like VFIO_GROUP is on. Because noiommu devices have fake > > > groups, such groups are all singleton. So checking all affected > > > devices are opened by user is just same as check all affected > > > groups. > > > > Yep. > > > > > > The only corner case with this option is when a user mixes group > > > > and cdev usages. iirc you mentioned it's a valid usage to be supported. > > > > In that case the kernel doesn't have sufficient knowledge to judge > > > > 'resettable' as it doesn't know which groups are opened by this user. > > > > > > > > Not sure whether we can leave it in a ugly way so INFO may not tell > > > > 'resettable' accurately in that weird scenario. > > > > > > This seems not easy to support. If above scenario is allowed there can be > > > three cases that returns invalid dev_id. > > > 1) devices not opened by user but owned implicitly > > > > The cdev approach has a hard time with this in general, it has no way to > > represent unopened devices. so any case where the nature of an unopened > > device block reset on the dev-set is rather opaque to the user. > > > > > 2) devices not owned by user > > > > (and presumable not owned) We still provide BDF. Not much difference > > from the group case here, being able to point to a BDF or group is > > about all we can do. > > > > > 3) devices opened via group but owned by user > > > > I think this still works in the proof-of-ownership, passing fds to > > hot-reset model. > > Ok. let's see below scenario and user's processing makes sense. > > Say there are five devices (devA, devB, devC, devD, devE) in the same reset > group. devA and devB are in the same iommu group. devC, devD, and devE have > separate iommu groups. Say devA is opened via cdev, devB is not opened, devC > is opened via group, devD is opened cdev but bound to another iommufdctx that > is different with devA. 
devE is not opened by any user > > If this INFO is called on devA, user should get a valid dev_id for devA, but > four invalid dev_ids. The resettable flag should be clear. Below is how user > to handle the info returned. INFO from devA returns: flags: NOT_RESETABLE | DEV_ID { { valid devA-id, devA-BDF }, { invalid dev-id, devB-BDF }, { invalid dev-id, devC-BDF }, { invalid dev-id, devD-BDF }, { invalid dev-id, devE-BDF }, } User knows devA-id, learns devA-BDF from devC: { { devA/B-group-id, devA-BDF }, { devA/B-group-id, devB-BDF }, { devC-group-id, devC-BDF }, { devD-group-id, devD-BDF }, { devE-group-id, devE-BDF }, } User is assumed to know devC group-id + BDF given group semantics, knows devA ownership, infers devB ownership. from devD: flags: NOT_RESETABLE | DEV_ID { { invalid dev-id, devA-BDF }, { invalid dev-id, devB-BDF }, { invalid dev-id, devC-BDF }, { valid devD-id, devD-BDF }, { invalid dev-id, devE-BDF }, } User knows devD-id, learns devD-bdf, knows devA and devC ownership, and inferred devB ownership > - For devB, user shall get the group_id for devA, and also get group_id for > devB, hence able to check ownership of devB by checking the group Per above, groups are only available through the group devices, therefore inferred ownership of devB can only be learned from devC. > - For devC, user can check ownership by the group_id and bdf returned Yes, the INFO ioctl on devC can confirm devC is affected, but more importantly this is the bridge to learn BDF of other affected devices and their groups. > - For devD, if it is opened by the user, should be able to find it by bdf I think the reverse, the user presumably already knows the dev-id for devD and knows that a hot-reset of the calling device necessarily affects the device, but it learns the BDF, which helps it connect 4 of the 5 device affected by the reset. > - For devE, user shall fail to find it hence consider no ownership on it. Yes, which is correct. 
> To finish the above check, user needs to get group_id via devid an also needs > to get group_id via device fd. Is it? Not absolutely required, but the user needs to do a lot of inferring via BDF. > The above example may be the most tricky scenario. Is it? user shall not do > hot-reset as not all affected devices are owned by user. But if devE is also > opened by user, it could do hot-reset. Yes, it's not trivial, but Jason is now proposing that we consider mixing groups, cdevs, and multiple iommufd_ctxs as invalid. I think this means that regardless of which device calls INFO, there's only one answer (assuming same set of devices opened, all cdev, all within same iommufd_ctx). Based on what I explained about my understanding of INFO2 and Jason agreed to, I think the output would be: flags: NOT_RESETABLE | DEV_ID { { valid devA-id, devA-BDF }, { valid devC-id, devC-BDF }, { valid devD-id, devD-BDF }, { invalid dev-id, devE-BDF }, } Here devB gets dropped because the kernel understands that devB is unopened, affected, and owned. It's therefore not a blocker for hot-reset. OTOH, devE is unopened, affected, and un-owned, and we previously agreed against the opportunistic un-opened/un-owned loophole. If devA and devD were separate iommufd_ctxs, with devC in the same ctx as devA, I think this becomes: INFO on devA: flags: NOT_RESETABLE | DEV_ID { { valid devA-id, devA-BDF }, { valid devC-id, devC-BDF }, { invalid dev-id, devD-BDF }, { invalid dev-id, devE-BDF }, } INFO on devD: flags: NOT_RESETABLE | DEV_ID { { invalid dev-id, devA-BDF }, { invalid dev-id, devB-BDF }, { invalid dev-id, devC-BDF }, { valid devD-id, devD-BDF }, { invalid dev-id, devE-BDF }, } I think this illustrates that it makes sense for unopened affected devices with implicit ownership to always be hidden, but otherwise are fully enumerated. > > > User would require more info to tell the above cases from each other. 
> > > > Obviously we could be equivalent to the group model if IOMMU groups > > were exposed for a device and all devices had IOMMU groups, but > > reasons... > > > > > > > array to associate affected, owned devices, and still has the > > > > > equivalent information to know that one or more of the devices listed > > > > > with an invalid dev-id are preventing the hot-reset from being > > > > > available. > > > > > > > > > > Is that an option? Thanks, > > > > > > > > > > > > > This works for me if above corner case can be waived. > > > > > > One side check, perhaps already confirmed in prior email. @Alex, So > > > the reason for the prediction of hot-reset is to avoid the possible > > > vfio_pci_pre_reset() which does heavy operations like stop DMA and > > > copy config space. Is it? Any other special reason? Anyhow, this reason > > > is enough for this prediction per my understanding. > > > > It's not clear to me what "prediction" is referring to. > > It is predicting whether hot-reset ioctl can work or not as you mentioned > in prior discussion.[1]. > > "I disagree, as I've argued before, the info ioctl becomes so weak and > effectively arbitrary from a user perspective at being able to predict > whether the hot-reset ioctl works that it becomes useless, diminishing > the entire hot-reset info/execute API." > > [1] https://lore.kernel.org/kvm/20230405134945.29e967be.alex.williamson@redhat.com/ I think we're narrowing in on an interface that isn't as arbitrary. If we assume the restrictions that Jason proposes, then cdev is exclusively a kernel determined reset availability model, where I'd agree that passing device-fds as a proof of ownership is pointless. The group interface would therefore remain exclusively a proof-of-ownership model since we have no incentive to extend it to kernel-determined given the limited use case of all affected devices managed by the same vfio container. 
> > As above, I > > think we can redefine the reset-available flag I proposed to more > > restrictively indicate that the null-array approach is available based > > on the dev-set group in relation to the iommufd_ctx of the calling > > device. Prediction of the affected devices seems like basic > > functionality to me, we can't assume the user's usage model, they must > > be able to make a well informed decision regarding affected devices. > > Thanks, > > As my above reply with the five-device scenario. It still needs to get > group_id to check implicit ownership in the case of sharing the same > iommu_group. Moot, but there's actually enough information there to infer IOMMU groups for each device, but we probably can't prove that would always be the case. If we adopt Jason's proposal though, I don't see that we need either a group-id or BDF capability, the BDF is only for debug reporting. However, there is a new burden on the kernel to identify the affected, un-owned devices for that report. Thanks, Alex
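The ownership rule discussed in the message above (a device counts as owned when it is opened in the caller's iommufd_ctx, or when it is unopened but shares an IOMMU group with a device opened there) can be sketched as a small model. This is only an illustration of the reasoning, not kernel code: `struct affected_dev`, `NO_CTX`, `dev_is_owned()` and `dev_set_resettable()` are hypothetical names, and integer ids stand in for real IOMMU groups and iommufd contexts.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical marker for "not opened by anyone". */
#define NO_CTX (-1)

/* Hypothetical model of one device affected by the hot reset. */
struct affected_dev {
	const char *name;	/* stands in for the BDF */
	int iommu_group;	/* IOMMU group the device belongs to */
	int ctx;		/* iommufd_ctx id it is opened in, or NO_CTX */
};

/*
 * A device is owned by the caller if it is opened in the caller's
 * iommufd_ctx, or if it is unopened but shares an IOMMU group with
 * a device that is opened in that ctx (implicit ownership).
 */
static bool dev_is_owned(const struct affected_dev *devs, size_t n,
			 size_t i, int caller_ctx)
{
	if (devs[i].ctx == caller_ctx)
		return true;
	if (devs[i].ctx != NO_CTX)
		return false;	/* opened in some other ctx */

	for (size_t j = 0; j < n; j++) {
		if (devs[j].ctx == caller_ctx &&
		    devs[j].iommu_group == devs[i].iommu_group)
			return true;
	}
	return false;
}

/* The null-array reset is available only if every affected device is owned. */
static bool dev_set_resettable(const struct affected_dev *devs, size_t n,
			       int caller_ctx)
{
	for (size_t i = 0; i < n; i++) {
		if (!dev_is_owned(devs, n, i, caller_ctx))
			return false;
	}
	return true;
}
```

With the five-device example (devA and devB sharing a group, devE unopened in its own group), devB is owned implicitly while devE blocks the reset until it is opened in the same ctx.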
On Mon, Apr 17, 2023 at 01:01:40PM -0600, Alex Williamson wrote: > Yes, it's not trivial, but Jason is now proposing that we consider > mixing groups, cdevs, and multiple iommufd_ctxs as invalid. I think > this means that regardless of which device calls INFO, there's only one > answer (assuming same set of devices opened, all cdev, all within same > iommufd_ctx). Based on what I explained about my understanding of INFO2 > and Jason agreed to, I think the output would be: > > flags: NOT_RESETABLE | DEV_ID > { > { valid devA-id, devA-BDF }, > { valid devC-id, devC-BDF }, > { valid devD-id, devD-BDF }, > { invalid dev-id, devE-BDF }, > } > > Here devB gets dropped because the kernel understands that devB is > unopened, affected, and owned. It's therefore not a blocker for > hot-reset. I don't think we want to drop anything because it makes the API ill suited for the debugging purpose. devb should be returned with an invalid dev_id if I understand your example. Maybe it should return with -1 as the dev_id instead of 0, to make the debugging a bit better. Userspace should look at only NOT_RESETTABLE to determine if it proceeds or not, and it should use the valid dev_id list to iterate over the devices it has open to do the config stuff. > OTOH, devE is unopened, affected, and un-owned, and we > previously agreed against the opportunistic un-opened/un-owned loophole. NOT_RESETABLE should be returned in this case, yes. If we want to enable userspace to use the loophole it should be an additional flag. RESETABLE_FOR_NOW or something > I think we're narrowing in on an interface that isn't as arbitrary. If > we assume the restrictions that Jason proposes, then cdev is exclusively > a kernel determined reset availability model Yes, I think this is probably best looking forward. > where I'd agree that > passing device-fds as a proof of ownership is pointless. 
The group > interface would therefore remain exclusively a proof-of-ownership > model since we have no incentive to extend it to kernel-determined > given the limited use case of all affected devices managed by the same > vfio container. Yes > Moot, but there's actually enough information there to infer IOMMU > groups for each device, but we probably can't prove that would always > be the case. If we adopt Jason's proposal though, I don't see that we > need either a group-id or BDF capability, the BDF is only for debug > reporting. However, there is a new burden on the kernel to identify > the affected, un-owned devices for that report. Yes and yes Jason
On Mon, 17 Apr 2023 16:31:56 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Mon, Apr 17, 2023 at 01:01:40PM -0600, Alex Williamson wrote: > > Yes, it's not trivial, but Jason is now proposing that we consider > > mixing groups, cdevs, and multiple iommufd_ctxs as invalid. I think > > this means that regardless of which device calls INFO, there's only one > > answer (assuming same set of devices opened, all cdev, all within same > > iommufd_ctx). Based on what I explained about my understanding of INFO2 > > and Jason agreed to, I think the output would be: > > > > flags: NOT_RESETABLE | DEV_ID > > { > > { valid devA-id, devA-BDF }, > > { valid devC-id, devC-BDF }, > > { valid devD-id, devD-BDF }, > > { invalid dev-id, devE-BDF }, > > } > > > > Here devB gets dropped because the kernel understands that devB is > > unopened, affected, and owned. It's therefore not a blocker for > > hot-reset. > > I don't think we want to drop anything because it makes the API > ill suited for the debugging purpose. > > devb should be returned with an invalid dev_id if I understand your > example. Maybe it should return with -1 as the dev_id instead of 0, to > make the debugging a bit better. > > Userspace should look at only NOT_RESETTABLE to determine if it > proceeds or not, and it should use the valid dev_id list to iterate > over the devices it has open to do the config stuff. If an affected device is owned, not opened, and not interfering with the reset, what is it adding to the API to report it for debugging purposes? I'm afraid this leads into expanding "invalid dev-id" into an errno or bitmap of error conditions that the user needs to parse. > > OTOH, devE is unopened, affected, and un-owned, and we > > previously agreed against the opportunistic un-opened/un-owned loophole. > > NOT_RESETABLE should be returned in this case, yes. > > If we want to enable userspace to use the loophole it should be an > additional flag. RESETABLE_FOR_NOW or something Ugh, please no. 
It's always a volatile result, but a volatile result that relies on device state outside the scope or control of the user is not even worthwhile imo. Thanks, Alex
> From: Jason Gunthorpe > Sent: Monday, April 17, 2023 9:39 PM > > On Fri, Apr 14, 2023 at 09:11:30AM +0000, Tian, Kevin wrote: > > > The only corner case with this option is when a user mixes group > > and cdev usages. iirc you mentioned it's a valid usage to be supported. > > In that case the kernel doesn't have sufficient knowledge to judge > > 'resettable' as it doesn't know which groups are opened by this user. > > IMHO we don't need to support this combination. > > We can say that to use the hot reset API the user must put all their > devices into the same iommufd_ctx and cover 100% of the known use > cases for this. Makes sense. > > There are already other situations, like nesting, that do force users > to put everything into one iommufd_ctx. > > No reason to make things harder and more complicated. > > I'm coming to the feeling that we should put no-iommu devices in > iommufd_ctx's as well. They would be an iommufd_access like > mdevs. That would clean up the complications they cause here. This certainly simplifies the matter a lot! > > I suppose we should have done that from the beginning - no-iommu is an > IOMMUFD access, it just uses a crazy /proc based way to learn the > PFNs. Making it a proper access and making a real VFIO ioctl that > calls iommufd_access_pin_pages() and returns the DMA mapped addresses > to userspace would go a long way to making no-iommu work in a logical, > usable, way. > Yes. This would provide a more reliable/clean way to learn PFNs for the noiommu case.
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Tuesday, April 18, 2023 4:07 AM > > On Mon, 17 Apr 2023 16:31:56 -0300 > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > On Mon, Apr 17, 2023 at 01:01:40PM -0600, Alex Williamson wrote: > > > Yes, it's not trivial, but Jason is now proposing that we consider > > > mixing groups, cdevs, and multiple iommufd_ctxs as invalid. I think > > > this means that regardless of which device calls INFO, there's only one > > > answer (assuming same set of devices opened, all cdev, all within same > > > iommufd_ctx). Based on what I explained about my understanding of > INFO2 > > > and Jason agreed to, I think the output would be: > > > > > > flags: NOT_RESETABLE | DEV_ID > > > { > > > { valid devA-id, devA-BDF }, > > > { valid devC-id, devC-BDF }, > > > { valid devD-id, devD-BDF }, > > > { invalid dev-id, devE-BDF }, > > > } > > > > > > Here devB gets dropped because the kernel understands that devB is > > > unopened, affected, and owned. It's therefore not a blocker for > > > hot-reset. > > > > I don't think we want to drop anything because it makes the API > > ill suited for the debugging purpose. > > > > devb should be returned with an invalid dev_id if I understand your > > example. Maybe it should return with -1 as the dev_id instead of 0, to > > make the debugging a bit better. > > > > Userspace should look at only NOT_RESETTABLE to determine if it > > proceeds or not, and it should use the valid dev_id list to iterate > > over the devices it has open to do the config stuff. > > If an affected device is owned, not opened, and not interfering with > the reset, what is it adding to the API to report it for debugging > purposes? I'm afraid this leads into expanding "invalid dev-id" into an consistent output before and after devB is opened. > errno or bitmap of error conditions that the user needs to parse. > Not exactly. If RESETABLE invalid dev_id doesn't matter. 
The user only uses the valid dev_id list to iterate, as Jason pointed out. If it is NOT_RESETTABLE because devE is not assigned to the VM, one can easily figure that out by looking at the list of affected BDFs and the configuration of the VM's assigned devices. Then the invalid dev_id also doesn't matter. If it is NOT_RESETTABLE while devE is already assigned to the VM, then it's an indication of mixing groups, cdevs, or multiple iommufd_ctxs. Then people should debug with other means/hints to dig out the exact culprit.
On Tue, 18 Apr 2023 03:24:46 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote: > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Tuesday, April 18, 2023 4:07 AM > > > > On Mon, 17 Apr 2023 16:31:56 -0300 > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > On Mon, Apr 17, 2023 at 01:01:40PM -0600, Alex Williamson wrote: > > > > Yes, it's not trivial, but Jason is now proposing that we consider > > > > mixing groups, cdevs, and multiple iommufd_ctxs as invalid. I think > > > > this means that regardless of which device calls INFO, there's only one > > > > answer (assuming same set of devices opened, all cdev, all within same > > > > iommufd_ctx). Based on what I explained about my understanding of > > INFO2 > > > > and Jason agreed to, I think the output would be: > > > > > > > > flags: NOT_RESETABLE | DEV_ID > > > > { > > > > { valid devA-id, devA-BDF }, > > > > { valid devC-id, devC-BDF }, > > > > { valid devD-id, devD-BDF }, > > > > { invalid dev-id, devE-BDF }, > > > > } > > > > > > > > Here devB gets dropped because the kernel understands that devB is > > > > unopened, affected, and owned. It's therefore not a blocker for > > > > hot-reset. > > > > > > I don't think we want to drop anything because it makes the API > > > ill suited for the debugging purpose. > > > > > > devb should be returned with an invalid dev_id if I understand your > > > example. Maybe it should return with -1 as the dev_id instead of 0, to > > > make the debugging a bit better. > > > > > > Userspace should look at only NOT_RESETTABLE to determine if it > > > proceeds or not, and it should use the valid dev_id list to iterate > > > over the devices it has open to do the config stuff. > > > > If an affected device is owned, not opened, and not interfering with > > the reset, what is it adding to the API to report it for debugging > > purposes? I'm afraid this leads into expanding "invalid dev-id" into an > > consistent output before and after devB is opened. 
In the case where devB is not opened including it only provides useless information. In the case where devB is opened it's necessary to be reported as an opened, affected device.

> > errno or bitmap of error conditions that the user needs to parse.
> >
>
> Not exactly.
>
> If RESETABLE invalid dev_id doesn't matter. The user only use the
> valid dev_id list to iterate as Jason pointed out.

Yes, but...

> If NOT_RESETTABLE due to devE not assigned to the VM one can
> easily figure out the fact by simply looking at the list of affected BDFs
> and the configuration of assigned devices of the VM. Then invalid
> dev_id also doesn't matter.

Huh?

Given:

flags: NOT_RESETABLE | DEV_ID
{
  { valid devA-id, devA-BDF },
  { invalid dev-id, devB-BDF },
  { valid devC-id, devC-BDF },
  { valid devD-id, devD-BDF },
  { invalid dev-id, devE-BDF },
}

How does the user determine that devE is to blame and not devB based on BDF? The user cannot rely on sysfs for help, they don't know the IOMMU grouping, nor do they know the BDF except as inferred by matching valid dev-ids in the above output.

> If NOT_RESETTABLE while devE is already assigned to the VM then it's
> indication of mixing groups, cdevs or multiple iommufd_ctxs. Then
> people should debug with other means/hints to dig out the exact
> culprit.

I don't know what situation you're trying to explain here. If devE were opened within the same iommufd_ctx, this becomes:

flags: RESETABLE | DEV_ID
{
  { valid devA-id, devA-BDF },
  { invalid dev-id, devB-BDF },
  { valid devC-id, devC-BDF },
  { valid devD-id, devD-BDF },
  { valid devE-id, devE-BDF },
}

Yes, the user should only be looking at the flag to determine the availability of hot-reset, (here's the but) but how is it consistent to indicate both that hot-reset is available and include an invalid dev-id? The consistency as I propose is that an invalid dev-id is only presented with NOT_RESETTABLE for the device blocking hot-reset. In the previous case, devB is not blocking reset and reporting an invalid dev-id only serves to obfuscate determining the blocking device.

For the cases of affected group-opened devices or separate iommufd_ctxs, the user gets invalid dev-ids for anything outside of the calling device's iommufd_ctx.

We haven't discussed how it fails when called on a group-opened device in a mixed environment. I'd propose that the INFO ioctl behaves exactly as it does today, reporting group-id and BDF for each affected device. However, the hot-reset ioctl itself is not extended to accept devicefd because there is no proof-of-ownership model for cdevs. Therefore even if the user could map group-id to devicefd, they get -EINVAL calling HOT_RESET with a devicefd when the ioctl is called from a group-opened device.

Thanks,
Alex
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Tuesday, April 18, 2023 12:11 PM > > On Tue, 18 Apr 2023 03:24:46 +0000 > "Tian, Kevin" <kevin.tian@intel.com> wrote: > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > Sent: Tuesday, April 18, 2023 4:07 AM > > > > > > On Mon, 17 Apr 2023 16:31:56 -0300 > > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > > > On Mon, Apr 17, 2023 at 01:01:40PM -0600, Alex Williamson wrote: > > > > > Yes, it's not trivial, but Jason is now proposing that we consider > > > > > mixing groups, cdevs, and multiple iommufd_ctxs as invalid. I think > > > > > this means that regardless of which device calls INFO, there's only one > > > > > answer (assuming same set of devices opened, all cdev, all within > same > > > > > iommufd_ctx). Based on what I explained about my understanding of > > > INFO2 > > > > > and Jason agreed to, I think the output would be: > > > > > > > > > > flags: NOT_RESETABLE | DEV_ID > > > > > { > > > > > { valid devA-id, devA-BDF }, > > > > > { valid devC-id, devC-BDF }, > > > > > { valid devD-id, devD-BDF }, > > > > > { invalid dev-id, devE-BDF }, > > > > > } > > > > > > > > > > Here devB gets dropped because the kernel understands that devB is > > > > > unopened, affected, and owned. It's therefore not a blocker for > > > > > hot-reset. > > > > > > > > I don't think we want to drop anything because it makes the API > > > > ill suited for the debugging purpose. > > > > > > > > devb should be returned with an invalid dev_id if I understand your > > > > example. Maybe it should return with -1 as the dev_id instead of 0, to > > > > make the debugging a bit better. > > > > > > > > Userspace should look at only NOT_RESETTABLE to determine if it > > > > proceeds or not, and it should use the valid dev_id list to iterate > > > > over the devices it has open to do the config stuff. 
> > > > > > If an affected device is owned, not opened, and not interfering with > > > the reset, what is it adding to the API to report it for debugging > > > purposes? I'm afraid this leads into expanding "invalid dev-id" into an > > > > consistent output before and after devB is opened. > > In the case where devB is not opened including it only provides > useless information. In the case where devB is opened it's necessary > to be reported as an opened, affected device. > > > > errno or bitmap of error conditions that the user needs to parse. > > > > > > > Not exactly. > > > > If RESETABLE invalid dev_id doesn't matter. The user only use the > > valid dev_id list to iterate as Jason pointed out. > > Yes, but... > > > If NOT_RESETTABLE due to devE not assigned to the VM one can > > easily figure out the fact by simply looking at the list of affected BDFs > > and the configuration of assigned devices of the VM. Then invalid > > dev_id also doesn't matter. > > Huh? > > Given: > > flags: NOT_RESETABLE | DEV_ID > { > { valid devA-id, devA-BDF }, > { invalid dev-id, devB-BDF }, > { valid devC-id, devC-BDF }, > { valid devD-id, devD-BDF }, > { invalid dev-id, devE-BDF }, > } > > How does the user determine that devE is to blame and not devB based on > BDF? The user cannot rely on sysfs for help, they don't know the IOMMU > grouping, nor do they know the BDF except as inferred by matching valid > dev-ids in the above output. emmm aren't we talking about the 'person' who does diagnostic? This guy will look at the VM configuration file to know that devA/B/C/D have been assigned to the VM but not devE. > > > If NOT_RESETTABLE while devE is already assigned to the VM then it's > > indication of mixing groups, cdevs or multiple iommufd_ctxs. Then > > people should debug with other means/hints to dig out the exact > > culprit. > > I don't know what situation you're trying to explain here. 
If devE > were opened within the same iommufd_ctx, this becomes: It's about a scenario where the mgmt.. stack has assigned all affected devices to Qemu but Qemu itself messed it up with mixed group/cdev or multiple iommufd_ctx so hitting the NON_RESETTABLE situation. > > flags: RESETABLE | DEV_ID > { > { valid devA-id, devA-BDF }, > { invalid dev-id, devB-BDF }, > { valid devC-id, devC-BDF }, > { valid devD-id, devD-BDF }, > { valid devE-id, devE-BDF }, > } > > Yes, the user should only be looking at the flag to determine the > availability of hot-reset, (here's the but) but how is it consistent to > indicate both that hot-reset is available and include an invalid > dev-id? The consistency as I propose is that an invalid dev-id is only > presented with NOT_RESETTABLE for the device blocking hot-reset. In > the previous case, devB is not blocking reset and reporting an invalid > dev-id only serves to obfuscate determining the blocking device. > > For the cases of affected group-opened devices or separate > iommufd_ctxs, the user gets invalid dev-ids for anything outside of > the calling device's iommufd_ctx. > > We haven't discussed how it fails when called on a group-opened device > in a mixed environment. I'd propose that the INFO ioctl behaves > exactly as it does today, reporting group-id and BDF for each affected > device. However, the hot-reset ioctl itself is not extended to accept > devicefd because there is no proof-of-ownership model for cdevs. > Therefore even if the user could map group-id to devicefd, they get > -EINVAL calling HOT_RESET with a devicefd when the ioctl is called from > a group-opened device. Thanks, > Yes I chatted with Yi about it. If the calling device of the INFO ioctl is opened by group then behave as it does today. If the calling device is opened via cdev then use dev_id scheme as discussed above. in hot_reset ioctl the fd array only accepts group fd's. cdev can be reset only via null fd array. 
One small open remains: the null fd array could potentially work for group-opened devices too if vfio-compat is used. In that case the devices are in the same iommufd ctx with valid dev_ids even though they are opened via group. But it's probably not worth blocking?
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Monday, April 17, 2023 9:39 PM > > On Fri, Apr 14, 2023 at 09:11:30AM +0000, Tian, Kevin wrote: > > > The only corner case with this option is when a user mixes group > > and cdev usages. iirc you mentioned it's a valid usage to be supported. > > In that case the kernel doesn't have sufficient knowledge to judge > > 'resettable' as it doesn't know which groups are opened by this user. > > IMHO we don't need to support this combination. Do you mean we don't support hot-reset for this combination or we don't support user using this combination. I guess the prior one. Right? > > We can say that to use the hot reset API the user must put all their > devices into the same iommufd_ctx and cover 100% of the known use > cases for this. > > There are already other situations, like nesting, that do force users > to put everything into one iommufd_ctx. > > No reason to make things harder and more complicated. Ditto. We just fail hot-reset for the multiple iommufds case. Is it? Otherwise, we need to prevent users from using multiple iommufds. > I'm coming to the feeling that we should put no-iommu devices in > iommufd_ctx's as well. They would be an iommufd_access like > mdevs. That would clean up the complications they cause here. Ok, the lucky thing is you have merged the patch series that creates iommufd_access for emulated devices in bind. So cdev series needs to handle noiommu case by creating iommufd_access. > > I suppose we should have done that from the beginning - no-iommu is an > IOMMUFD access, it just uses a crazy /proc based way to learn the > PFNs. Making it a proper access and making a real VFIO ioctl that > calls iommufd_access_pin_pages() and returns the DMA mapped addresses > to userspace would go a long way to making no-iommu work in a logical, > usable, way. This seems to be an improvement for noiommu mode. It can be done later. 
For now, generating access_id and binding noiommu devices with iommufdctx is enough for supporting noiommu hot-reset. Regards, Yi Liu
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Tuesday, April 18, 2023 12:11 PM > [...] > > We haven't discussed how it fails when called on a group-opened device > in a mixed environment. I'd propose that the INFO ioctl behaves > exactly as it does today, reporting group-id and BDF for each affected > device. However, the hot-reset ioctl itself is not extended to accept > devicefd because there is no proof-of-ownership model for cdevs. > Therefore even if the user could map group-id to devicefd, they get > -EINVAL calling HOT_RESET with a devicefd when the ioctl is called from > a group-opened device. Thanks, Would it be better to let userspace know that a hot reset would fail for lack of proof-of-ownership, since it also has cdev devices? Maybe the RESETTABLE flag should always be meaningful, even if the calling device of _INFO is a group-opened device. Old user applications do not need to check it, as they will never have such a mixed environment. But new applications, or applications that have been updated per the latest vfio uapi, should strictly check this flag before going ahead with a hot-reset. Regards, Yi Liu
On Mon, Apr 17, 2023 at 02:06:42PM -0600, Alex Williamson wrote: > On Mon, 17 Apr 2023 16:31:56 -0300 > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > On Mon, Apr 17, 2023 at 01:01:40PM -0600, Alex Williamson wrote: > > > Yes, it's not trivial, but Jason is now proposing that we consider > > > mixing groups, cdevs, and multiple iommufd_ctxs as invalid. I think > > > this means that regardless of which device calls INFO, there's only one > > > answer (assuming same set of devices opened, all cdev, all within same > > > iommufd_ctx). Based on what I explained about my understanding of INFO2 > > > and Jason agreed to, I think the output would be: > > > > > > flags: NOT_RESETABLE | DEV_ID > > > { > > > { valid devA-id, devA-BDF }, > > > { valid devC-id, devC-BDF }, > > > { valid devD-id, devD-BDF }, > > > { invalid dev-id, devE-BDF }, > > > } > > > > > > Here devB gets dropped because the kernel understands that devB is > > > unopened, affected, and owned. It's therefore not a blocker for > > > hot-reset. > > > > I don't think we want to drop anything because it makes the API > > ill suited for the debugging purpose. > > > > devb should be returned with an invalid dev_id if I understand your > > example. Maybe it should return with -1 as the dev_id instead of 0, to > > make the debugging a bit better. > > > > Userspace should look at only NOT_RESETTABLE to determine if it > > proceeds or not, and it should use the valid dev_id list to iterate > > over the devices it has open to do the config stuff. > > If an affected device is owned, not opened, and not interfering with > the reset, what is it adding to the API to report it for debugging > purposes? It lets it print the entire group of devices, this is the only way something can learn the actual list of all BDFs affected. dev_id can just return 0, we don't need a complex bitmap. Userspace looks at the flag, if !NOT_RESETABLE then it ignores dev_id=0. Jason
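The userspace side described above (look only at the flag to decide, then walk the valid dev_ids) might look roughly like the following. This is a minimal sketch under assumptions: the flag names, the `HYP_INVALID_DEV_ID` value of 0, and `struct hyp_dep_dev` are hypothetical placeholders for whatever the final uAPI defines, with the entry layout loosely modelled on the existing per-device INFO entry (an id plus segment/bus/devfn).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits and sentinel; the real uAPI names may differ. */
#define HYP_FLAG_DEV_ID		(1u << 0)	/* entries carry dev-ids */
#define HYP_FLAG_NOT_RESETABLE	(1u << 1)	/* kernel says reset is blocked */
#define HYP_INVALID_DEV_ID	0u

/* Hypothetical per-device entry returned by INFO. */
struct hyp_dep_dev {
	uint32_t dev_id;
	uint16_t segment;
	uint8_t bus;
	uint8_t devfn;
};

/* Userspace decides from the flags alone, never from individual dev-ids. */
static bool hyp_reset_available(uint32_t flags)
{
	return (flags & HYP_FLAG_DEV_ID) && !(flags & HYP_FLAG_NOT_RESETABLE);
}

/*
 * Collect the valid dev-ids so the caller can iterate over the devices
 * it has open (for example to save and restore config space). Entries
 * with an invalid dev-id are skipped. Returns the number stored.
 */
static size_t hyp_collect_owned(const struct hyp_dep_dev *devs, size_t n,
				uint32_t *out, size_t out_len)
{
	size_t count = 0;

	for (size_t i = 0; i < n && count < out_len; i++) {
		if (devs[i].dev_id != HYP_INVALID_DEV_ID)
			out[count++] = devs[i].dev_id;
	}
	return count;
}
```

In this model an entry such as devB (affected, unopened, implicitly owned) is still listed with its BDF for reporting, but it does not influence the decision.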
On Tue, Apr 18, 2023 at 05:02:44AM +0000, Tian, Kevin wrote:

> Yes I chatted with Yi about it.
>
> If the calling device of the INFO ioctl is opened by group then behave
> as it does today.
>
> If the calling device is opened via cdev then use dev_id scheme as
> discussed above.
>
> in hot_reset ioctl the fd array only accepts group fd's.
>
> cdev can be reset only via null fd array.

Agree

> It remains a small open that null fd array could potentially work for
> group-opened device too if vfio-compat is used. In that case devices
> are in same iommufd ctx with valid dev_id even though they are opened
> via group. But probably it's not worthy blocking it?

IMHO not worth the complexity to block. Security is maintained if we
use an iommufd_ctx check.

Jason
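
The argument rule agreed above (the fd array carries only group fds, and a null array is the cdev path) could be sketched as a standalone check. This is a userspace illustration only, not the kernel's implementation; `sketch_fd_kind` and `sketch_check_reset_fds()` are hypothetical names, and whether a null array is also honoured for group-opened devices is still under discussion in this thread.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical tag for what kind of fd the user passed in the array. */
enum sketch_fd_kind { SKETCH_FD_GROUP, SKETCH_FD_DEVICE };

/*
 * Returns 0 if the array shape is acceptable, -EINVAL otherwise.
 * Only group fds may appear in the array; a device fd is rejected.
 */
static int sketch_check_reset_fds(const enum sketch_fd_kind *fds, size_t count)
{
	for (size_t i = 0; i < count; i++)
		if (fds[i] != SKETCH_FD_GROUP)
			return -EINVAL;

	/* count == 0 is the null array: ownership is checked via iommufd_ctx. */
	return 0;
}
```

A caller holding only cdev fds would pass `count == 0`; passing a device fd in the array yields `-EINVAL`, matching the behaviour Alex proposed.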
On Tue, Apr 18, 2023 at 10:23:55AM +0000, Liu, Yi L wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Monday, April 17, 2023 9:39 PM
> >
> > On Fri, Apr 14, 2023 at 09:11:30AM +0000, Tian, Kevin wrote:
> >
> > > The only corner case with this option is when a user mixes group
> > > and cdev usages. iirc you mentioned it's a valid usage to be supported.
> > > In that case the kernel doesn't have sufficient knowledge to judge
> > > 'resettable' as it doesn't know which groups are opened by this user.
> >
> > IMHO we don't need to support this combination.
>
> Do you mean we don't support hot-reset for this combination or we don't
> support user using this combination. I guess the prior one. Right?

Yes

> Ditto. We just fail hot-reset for the multiple iommufds case. Is it?

Yes

> > I suppose we should have done that from the beginning - no-iommu is an
> > IOMMUFD access, it just uses a crazy /proc based way to learn the
> > PFNs. Making it a proper access and making a real VFIO ioctl that
> > calls iommufd_access_pin_pages() and returns the DMA mapped addresses
> > to userspace would go a long way to making no-iommu work in a logical,
> > usable, way.
>
> This seems to be an improvement for noiommu mode. It can be done later.
> For now, generating access_id and binding noiommu devices with iommufdctx
> is enough for supporting noiommu hot-reset.

Yes, I'm not sure there is much value in improving no-iommu unless
someone also wants to go in and update dpdk.

At some point we will need to revise dpdk to use iommufd, maybe that
would be a good time to fix this too.

The point is that using an access is actually a logical and sensible
thing to do, not a hack to make hot reset work better.

Jason
On Tue, 18 Apr 2023 05:02:44 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote: > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Tuesday, April 18, 2023 12:11 PM > > > > On Tue, 18 Apr 2023 03:24:46 +0000 > > "Tian, Kevin" <kevin.tian@intel.com> wrote: > > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > > Sent: Tuesday, April 18, 2023 4:07 AM > > > > > > > > On Mon, 17 Apr 2023 16:31:56 -0300 > > > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > > > > > On Mon, Apr 17, 2023 at 01:01:40PM -0600, Alex Williamson wrote: > > > > > > Yes, it's not trivial, but Jason is now proposing that we consider > > > > > > mixing groups, cdevs, and multiple iommufd_ctxs as invalid. I think > > > > > > this means that regardless of which device calls INFO, there's only one > > > > > > answer (assuming same set of devices opened, all cdev, all within > > same > > > > > > iommufd_ctx). Based on what I explained about my understanding of > > > > INFO2 > > > > > > and Jason agreed to, I think the output would be: > > > > > > > > > > > > flags: NOT_RESETABLE | DEV_ID > > > > > > { > > > > > > { valid devA-id, devA-BDF }, > > > > > > { valid devC-id, devC-BDF }, > > > > > > { valid devD-id, devD-BDF }, > > > > > > { invalid dev-id, devE-BDF }, > > > > > > } > > > > > > > > > > > > Here devB gets dropped because the kernel understands that devB is > > > > > > unopened, affected, and owned. It's therefore not a blocker for > > > > > > hot-reset. > > > > > > > > > > I don't think we want to drop anything because it makes the API > > > > > ill suited for the debugging purpose. > > > > > > > > > > devb should be returned with an invalid dev_id if I understand your > > > > > example. Maybe it should return with -1 as the dev_id instead of 0, to > > > > > make the debugging a bit better. 
> > > > > > > > > > Userspace should look at only NOT_RESETTABLE to determine if it > > > > > proceeds or not, and it should use the valid dev_id list to iterate > > > > > over the devices it has open to do the config stuff. > > > > > > > > If an affected device is owned, not opened, and not interfering with > > > > the reset, what is it adding to the API to report it for debugging > > > > purposes? I'm afraid this leads into expanding "invalid dev-id" into an > > > > > > consistent output before and after devB is opened. > > > > In the case where devB is not opened including it only provides > > useless information. In the case where devB is opened it's necessary > > to be reported as an opened, affected device. > > > > > > errno or bitmap of error conditions that the user needs to parse. > > > > > > > > > > Not exactly. > > > > > > If RESETABLE invalid dev_id doesn't matter. The user only use the > > > valid dev_id list to iterate as Jason pointed out. > > > > Yes, but... > > > > > If NOT_RESETTABLE due to devE not assigned to the VM one can > > > easily figure out the fact by simply looking at the list of affected BDFs > > > and the configuration of assigned devices of the VM. Then invalid > > > dev_id also doesn't matter. > > > > Huh? > > > > Given: > > > > flags: NOT_RESETABLE | DEV_ID > > { > > { valid devA-id, devA-BDF }, > > { invalid dev-id, devB-BDF }, > > { valid devC-id, devC-BDF }, > > { valid devD-id, devD-BDF }, > > { invalid dev-id, devE-BDF }, > > } > > > > How does the user determine that devE is to blame and not devB based on > > BDF? The user cannot rely on sysfs for help, they don't know the IOMMU > > grouping, nor do they know the BDF except as inferred by matching valid > > dev-ids in the above output. > > emmm aren't we talking about the 'person' who does diagnostic? This guy > will look at the VM configuration file to know that devA/B/C/D have been > assigned to the VM but not devE. 
Actually the scenario is that devA/C/D are assigned, devB is implicitly owned, and it's devE that blocks the reset. If you've followed any of the community forums for vfio over the years, it should be readily apparent that placing the burden solely on the end user to perform such a diagnosis is an unreasonable expectation. > > > If NOT_RESETTABLE while devE is already assigned to the VM then it's > > > indication of mixing groups, cdevs or multiple iommufd_ctxs. Then > > > people should debug with other means/hints to dig out the exact > > > culprit. > > > > I don't know what situation you're trying to explain here. If devE > > were opened within the same iommufd_ctx, this becomes: > > It's about a scenario where the mgmt.. stack has assigned all affected > devices to Qemu but Qemu itself messed it up with mixed group/cdev > or multiple iommufd_ctx so hitting the NON_RESETTABLE situation. Is this a reasonable scenario? I expect the QEMU support to favor cdev access where available and fd passing methods will only use cdev, so QEMU should never mess up to create such an environment. There should never be a case where a device is exclusively available via group rather than cdev. > > flags: RESETABLE | DEV_ID > > { > > { valid devA-id, devA-BDF }, > > { invalid dev-id, devB-BDF }, > > { valid devC-id, devC-BDF }, > > { valid devD-id, devD-BDF }, > > { valid devE-id, devE-BDF }, > > } > > > > Yes, the user should only be looking at the flag to determine the > > availability of hot-reset, (here's the but) but how is it consistent to > > indicate both that hot-reset is available and include an invalid > > dev-id? The consistency as I propose is that an invalid dev-id is only > > presented with NOT_RESETTABLE for the device blocking hot-reset. In > > the previous case, devB is not blocking reset and reporting an invalid > > dev-id only serves to obfuscate determining the blocking device. 
> > > > For the cases of affected group-opened devices or separate > > iommufd_ctxs, the user gets invalid dev-ids for anything outside of > > the calling device's iommufd_ctx. > > > > We haven't discussed how it fails when called on a group-opened device > > in a mixed environment. I'd propose that the INFO ioctl behaves > > exactly as it does today, reporting group-id and BDF for each affected > > device. However, the hot-reset ioctl itself is not extended to accept > > devicefd because there is no proof-of-ownership model for cdevs. > > Therefore even if the user could map group-id to devicefd, they get > > -EINVAL calling HOT_RESET with a devicefd when the ioctl is called from > > a group-opened device. Thanks, > > > > Yes I chatted with Yi about it. > > If the calling device of the INFO ioctl is opened by group then behave > as it does today. > > If the calling device is opened via cdev then use dev_id scheme as > discussed above. > > in hot_reset ioctl the fd array only accepts group fd's. > > cdev can be reset only via null fd array. > > It remains a small open that null fd array could potentially work for > group-opened device too if vfio-compat is used. In that case devices > are in same iommufd ctx with valid dev_id even though they are opened > via group. But probably it's not worthy blocking it? Yes, let's not create new models for the compatibility interface, stick with group-opened = group-id = proof-of-ownership. Thanks, Alex
On Tue, 18 Apr 2023 10:34:45 +0000 "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Tuesday, April 18, 2023 12:11 PM > > > [...] > > > > We haven't discussed how it fails when called on a group-opened device > > in a mixed environment. I'd propose that the INFO ioctl behaves > > exactly as it does today, reporting group-id and BDF for each affected > > device. However, the hot-reset ioctl itself is not extended to accept > > devicefd because there is no proof-of-ownership model for cdevs. > > Therefore even if the user could map group-id to devicefd, they get > > -EINVAL calling HOT_RESET with a devicefd when the ioctl is called from > > a group-opened device. Thanks, > > Will it be better to let userspace know it shall fail if invoking hot > reset due to no proof-of-ownership as it also has cdev devices? Maybe > the RESETTABLE flag should always be meaningful. Even if the calling > device of _INFO is group-opened device. Old user applications does not > need to check it as it will never have such mixed environment. But for > new applications or the applications that have been updated per latest > vfio uapi, it should strictly check this flag before going ahead to do > hot-reset. The group-opened model cannot consistently predict whether the user can provide proof-of-ownership. I don't think we should define a flag simply because there's a case that we can predict, the definition of that flag becomes problematic. Let's not complicate the interface by trying to optimize a case that will likely never exist in practice and can be handled via the existing legacy API. Thanks, Alex
On Tue, 18 Apr 2023 09:57:32 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Mon, Apr 17, 2023 at 02:06:42PM -0600, Alex Williamson wrote: > > On Mon, 17 Apr 2023 16:31:56 -0300 > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > On Mon, Apr 17, 2023 at 01:01:40PM -0600, Alex Williamson wrote: > > > > Yes, it's not trivial, but Jason is now proposing that we consider > > > > mixing groups, cdevs, and multiple iommufd_ctxs as invalid. I think > > > > this means that regardless of which device calls INFO, there's only one > > > > answer (assuming same set of devices opened, all cdev, all within same > > > > iommufd_ctx). Based on what I explained about my understanding of INFO2 > > > > and Jason agreed to, I think the output would be: > > > > > > > > flags: NOT_RESETABLE | DEV_ID > > > > { > > > > { valid devA-id, devA-BDF }, > > > > { valid devC-id, devC-BDF }, > > > > { valid devD-id, devD-BDF }, > > > > { invalid dev-id, devE-BDF }, > > > > } > > > > > > > > Here devB gets dropped because the kernel understands that devB is > > > > unopened, affected, and owned. It's therefore not a blocker for > > > > hot-reset. > > > > > > I don't think we want to drop anything because it makes the API > > > ill suited for the debugging purpose. > > > > > > devb should be returned with an invalid dev_id if I understand your > > > example. Maybe it should return with -1 as the dev_id instead of 0, to > > > make the debugging a bit better. > > > > > > Userspace should look at only NOT_RESETTABLE to determine if it > > > proceeds or not, and it should use the valid dev_id list to iterate > > > over the devices it has open to do the config stuff. > > > > If an affected device is owned, not opened, and not interfering with > > the reset, what is it adding to the API to report it for debugging > > purposes? > > It lets it print the entire group of devices, this is the only way > something can learn the actual list of all BDFs affected. 
If we do so, userspace must be able to differentiate which devices are
blocking, which necessitates at least a bi-modal invalid dev-id.

> dev_id can just return 0, we don't need a complex bitmap. Userspace
> looks at the flag, if !NOT_RESETABLE then it ignores dev_id=0.

I'm having trouble with a succinct definition of dev-id == 0, is it "A
device affected by the hot-reset, which does not directly
contribute to the availability of the hot-reset, ex. an unopened device
within the same IOMMU group as an opened device (ie. this is not the
device responsible if hot-reset is unavailable). Whereas dev-id < 0
(== -1) is an affected device which prevents hot-reset, ex. an un-owned
device, device configured within a different iommufd_ctx, or device
opened outside of the vfio cdev API." Is that about right? Thanks,

Alex
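
The three-way reading of dev-id that Alex is trying to pin down can be written out as a small userspace sketch. All names here are hypothetical, and the values 0 and -1 are simply the ones proposed in this thread, not a settled uapi.

```c
#include <stdint.h>

/* Hypothetical names; 0 and -1 are the values proposed in this thread. */
#define SKETCH_DEVID_NON_BLOCKING 0u         /* affected, but not a blocker */
#define SKETCH_DEVID_BLOCKING     UINT32_MAX /* -1 as a u32: blocks the reset */

enum sketch_devid_class {
	SKETCH_DEV_VALID,        /* opened within the caller's iommufd_ctx */
	SKETCH_DEV_NON_BLOCKING, /* e.g. unopened, but owned through its group */
	SKETCH_DEV_BLOCKING,     /* un-owned, other ctx, or not opened via cdev */
};

/* Map a reported dev-id onto the three cases discussed above. */
static enum sketch_devid_class sketch_classify_devid(uint32_t devid)
{
	if (devid == SKETCH_DEVID_BLOCKING)
		return SKETCH_DEV_BLOCKING;
	if (devid == SKETCH_DEVID_NON_BLOCKING)
		return SKETCH_DEV_NON_BLOCKING;
	return SKETCH_DEV_VALID;
}
```

Under this reading only the third case needs any user action before a hot reset can succeed.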
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Wednesday, April 19, 2023 2:39 AM > > On Tue, 18 Apr 2023 09:57:32 -0300 > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > On Mon, Apr 17, 2023 at 02:06:42PM -0600, Alex Williamson wrote: > > > On Mon, 17 Apr 2023 16:31:56 -0300 > > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > > > On Mon, Apr 17, 2023 at 01:01:40PM -0600, Alex Williamson wrote: > > > > > Yes, it's not trivial, but Jason is now proposing that we consider > > > > > mixing groups, cdevs, and multiple iommufd_ctxs as invalid. I think > > > > > this means that regardless of which device calls INFO, there's only one > > > > > answer (assuming same set of devices opened, all cdev, all within same > > > > > iommufd_ctx). Based on what I explained about my understanding of INFO2 > > > > > and Jason agreed to, I think the output would be: > > > > > > > > > > flags: NOT_RESETABLE | DEV_ID > > > > > { > > > > > { valid devA-id, devA-BDF }, > > > > > { valid devC-id, devC-BDF }, > > > > > { valid devD-id, devD-BDF }, > > > > > { invalid dev-id, devE-BDF }, > > > > > } > > > > > > > > > > Here devB gets dropped because the kernel understands that devB is > > > > > unopened, affected, and owned. It's therefore not a blocker for > > > > > hot-reset. > > > > > > > > I don't think we want to drop anything because it makes the API > > > > ill suited for the debugging purpose. > > > > > > > > devb should be returned with an invalid dev_id if I understand your > > > > example. Maybe it should return with -1 as the dev_id instead of 0, to > > > > make the debugging a bit better. > > > > > > > > Userspace should look at only NOT_RESETTABLE to determine if it > > > > proceeds or not, and it should use the valid dev_id list to iterate > > > > over the devices it has open to do the config stuff. 
> > > > > > If an affected device is owned, not opened, and not interfering with > > > the reset, what is it adding to the API to report it for debugging > > > purposes? > > > > It lets it print the entire group of devices, this is the only way > > something can learn the actual list of all BDFs affected. > > If we do so, userspace must be able to differentiate which devices are > blocking, which necessitates at least a bi-modal invalid dev-id. > > > dev_id can just return 0, we don't need a complex bitmap. Userspace > > looks at the flag, if !NOT_RESETABLE then it ignores dev_id=0. > > I'm having trouble with a succinct definition of dev-id == 0, is it "A > device affected by the hot-reset reset, which does not directly > contribute to the availability of the hot-reset, ex. an unopened device > within the same IOMMU group as an opened device (ie. this is not the > device responsible if hot-reset is unavailable). Hide this device in the list looks fine to me. But the calling user should not do any new device open before finishing hot-reset. Otherwise, user may miss a device that needs to do pre/post reset. I think this requirement is acceptable. Is it? > Whereas dev-id < 0 > (== -1) is an affected device which prevents hot-reset, ex. an un-owned > device, device configured within a different iommufd_ctx, or device > opened outside of the vfio cdev API." Is that about right? Thanks, Do you mean to have separate err-code for the three possibilities? As the devid is generated by iommufd and it is u32. I'm not sure if we can have such err-code definition without reserving some ids in iommufd. Regards, Yi Liu
On Thu, 20 Apr 2023 12:10:20 +0000 "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Wednesday, April 19, 2023 2:39 AM > > > > On Tue, 18 Apr 2023 09:57:32 -0300 > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > On Mon, Apr 17, 2023 at 02:06:42PM -0600, Alex Williamson wrote: > > > > On Mon, 17 Apr 2023 16:31:56 -0300 > > > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > > > > > On Mon, Apr 17, 2023 at 01:01:40PM -0600, Alex Williamson wrote: > > > > > > Yes, it's not trivial, but Jason is now proposing that we consider > > > > > > mixing groups, cdevs, and multiple iommufd_ctxs as invalid. I think > > > > > > this means that regardless of which device calls INFO, there's only one > > > > > > answer (assuming same set of devices opened, all cdev, all within same > > > > > > iommufd_ctx). Based on what I explained about my understanding of INFO2 > > > > > > and Jason agreed to, I think the output would be: > > > > > > > > > > > > flags: NOT_RESETABLE | DEV_ID > > > > > > { > > > > > > { valid devA-id, devA-BDF }, > > > > > > { valid devC-id, devC-BDF }, > > > > > > { valid devD-id, devD-BDF }, > > > > > > { invalid dev-id, devE-BDF }, > > > > > > } > > > > > > > > > > > > Here devB gets dropped because the kernel understands that devB is > > > > > > unopened, affected, and owned. It's therefore not a blocker for > > > > > > hot-reset. > > > > > > > > > > I don't think we want to drop anything because it makes the API > > > > > ill suited for the debugging purpose. > > > > > > > > > > devb should be returned with an invalid dev_id if I understand your > > > > > example. Maybe it should return with -1 as the dev_id instead of 0, to > > > > > make the debugging a bit better. > > > > > > > > > > Userspace should look at only NOT_RESETTABLE to determine if it > > > > > proceeds or not, and it should use the valid dev_id list to iterate > > > > > over the devices it has open to do the config stuff. 
> > > > > > > > If an affected device is owned, not opened, and not interfering with > > > > the reset, what is it adding to the API to report it for debugging > > > > purposes? > > > > > > It lets it print the entire group of devices, this is the only way > > > something can learn the actual list of all BDFs affected. > > > > If we do so, userspace must be able to differentiate which devices are > > blocking, which necessitates at least a bi-modal invalid dev-id. > > > > > dev_id can just return 0, we don't need a complex bitmap. Userspace > > > looks at the flag, if !NOT_RESETABLE then it ignores dev_id=0. > > > > I'm having trouble with a succinct definition of dev-id == 0, is it "A > > device affected by the hot-reset reset, which does not directly > > contribute to the availability of the hot-reset, ex. an unopened device > > within the same IOMMU group as an opened device (ie. this is not the > > device responsible if hot-reset is unavailable). > > Hide this device in the list looks fine to me. But the calling user should > not do any new device open before finishing hot-reset. Otherwise, user may > miss a device that needs to do pre/post reset. I think this requirement is > acceptable. Is it? I think Kevin and Jason are leaning towards reporting the entire dev-set. The INFO ioctl has always been a point-in-time reading, no guarantees are made if the host or user configuration is changed. Nothing changes in that respect. > > Whereas dev-id < 0 > > (== -1) is an affected device which prevents hot-reset, ex. an un-owned > > device, device configured within a different iommufd_ctx, or device > > opened outside of the vfio cdev API." Is that about right? Thanks, > > Do you mean to have separate err-code for the three possibilities? As > the devid is generated by iommufd and it is u32. I'm not sure if we can > have such err-code definition without reserving some ids in iommufd. 
Yes, if we're going to report the full dev-set, I think we need at least two unique error codes or else the user has no way to determine the subset of invalid dev-ids which block the reset. I think Jason is proposing the set of valid dev-ids are >0, a dev-id of zero indicates some form of non-blocking, while <0 (or maybe specifically -1) indicates a blocking device. I was trying to get consensus on a formal definition of each of those error codes in my previous reply. Thanks, Alex
On Thu, Apr 20, 2023 at 08:08:39AM -0600, Alex Williamson wrote:

> > Hide this device in the list looks fine to me. But the calling user should
> > not do any new device open before finishing hot-reset. Otherwise, user may
> > miss a device that needs to do pre/post reset. I think this requirement is
> > acceptable. Is it?
>
> I think Kevin and Jason are leaning towards reporting the entire
> dev-set. The INFO ioctl has always been a point-in-time reading, no
> guarantees are made if the host or user configuration is changed.
> Nothing changes in that respect.

Yeah, I think your point about qemu community forums suggests we should
err toward having qemu provide some fully detailed debug report.

> > > Whereas dev-id < 0
> > > (== -1) is an affected device which prevents hot-reset, ex. an un-owned
> > > device, device configured within a different iommufd_ctx, or device
> > > opened outside of the vfio cdev API." Is that about right? Thanks,
> >
> > Do you mean to have separate err-code for the three possibilities? As
> > the devid is generated by iommufd and it is u32. I'm not sure if we can
> > have such err-code definition without reserving some ids in iommufd.
>
> Yes, if we're going to report the full dev-set, I think we need at
> least two unique error codes or else the user has no way to determine
> the subset of invalid dev-ids which block the reset.

If you think this is important to report we should report 0 and -1,
and adjust the iommufd xarray allocator to reserve -1.

It depends what you want to show for the debugging.

eg if we have debugging where qemu dumps this table:

   BDF   In VM   iommu_group   Has VFIO driver   Has Kernel Driver

By also doing various sysfs probes based on the BDF, then the admin
action to remedy the situation is:

  Make "Has VFIO driver = y" or "Has Kernel Driver = n" for every row in
  the table to make the reset work.

And we don't need the distinction. Adding the 0/-1 lets you make a
useful table without doing any sysfs work.

> I think Jason is proposing the set of valid dev-ids are >0, a dev-id
> of zero indicates some form of non-blocking, while <0 (or maybe
> specifically -1) indicates a blocking device.

Yes, 0 and -1 would be fine with those definitions. The only use of
the data is to add a 'blocking use of reset' column to the table
above.

Thanks,
Jason
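
As a rough illustration of the debug table Jason describes, the sketch below formats one row from a BDF and a reported dev-id, with no sysfs probing. The function name, the column labels, and the reading of a valid dev-id as "opened" are assumptions made for this example only; the 0 and -1 values follow the proposal above.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical names; 0 and -1 are the values proposed in this thread. */
#define SKETCH_DEVID_NON_BLOCKING 0u
#define SKETCH_DEVID_BLOCKING     UINT32_MAX

/* Format one table row into buf; returns what snprintf() returns. */
static int sketch_format_row(char *buf, size_t len, uint16_t segment,
			     uint8_t bus, uint8_t devfn, uint32_t devid)
{
	/* A valid dev-id means the device is opened in the caller's ctx. */
	int opened = devid != SKETCH_DEVID_NON_BLOCKING &&
		     devid != SKETCH_DEVID_BLOCKING;
	int blocks = devid == SKETCH_DEVID_BLOCKING;

	/* devfn packs the PCI slot in bits 7:3 and the function in bits 2:0. */
	return snprintf(buf, len,
			"%04x:%02x:%02x.%x  opened=%c  blocks-reset=%c",
			(unsigned)segment, (unsigned)bus,
			(unsigned)(devfn >> 3), (unsigned)(devfn & 7),
			opened ? 'y' : 'n', blocks ? 'y' : 'n');
}
```

A management tool could emit one such row per affected device and point the admin at the rows marked `blocks-reset=y`.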
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Tuesday, April 18, 2023 9:02 PM > > On Tue, Apr 18, 2023 at 10:23:55AM +0000, Liu, Yi L wrote: > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > Sent: Monday, April 17, 2023 9:39 PM > > > > > > On Fri, Apr 14, 2023 at 09:11:30AM +0000, Tian, Kevin wrote: > > > > > > > The only corner case with this option is when a user mixes group > > > > and cdev usages. iirc you mentioned it's a valid usage to be supported. > > > > In that case the kernel doesn't have sufficient knowledge to judge > > > > 'resettable' as it doesn't know which groups are opened by this user. > > > > > > IMHO we don't need to support this combination. > > > > Do you mean we don't support hot-reset for this combination or we don't > > support user using this combination. I guess the prior one. Right? > > Yes > > > Ditto. We just fail hot-reset for the multiple iommufds case. Is it? > > Yes > > > > I suppose we should have done that from the beginning - no-iommu is an > > > IOMMUFD access, it just uses a crazy /proc based way to learn the > > > PFNs. Making it a proper access and making a real VFIO ioctl that > > > calls iommufd_access_pin_pages() and returns the DMA mapped addresses > > > to userspace would go a long way to making no-iommu work in a logical, > > > usable, way. > > > > This seems to be an improvement for noiommu mode. It can be done later. > > For now, generating access_id and binding noiommu devices with iommufdctx > > is enough for supporting noiommu hot-reset. > > Yes, I'm not sure there is much value in improving no-iommu unless > someone also wants to go in and update dpdk. > > At some point we will need to revise dpdk to use iommufd, maybe that > would be a good time to fix this too. This noiommu improvement shall allow user to attach ioas to noiommu devices. is it? This may be done by calling iommufd_access_attach(). So there is a quick question. In the cdev series, shall we allow the attachment for noiommu? 
I think the noiommu improvement will require extra effort, so it is not
ready yet. If so, it seems I just need to fail the attachment for
noiommu devices. But once it is ready in the future, how can userspace
know that attach is allowed for noiommu devices? Will that be easy? Or
should we just make attach a no-op that always succeeds for noiommu
devices? Any suggestions?

Regards,
Yi Liu
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Saturday, April 22, 2023 6:36 AM > > On Thu, Apr 20, 2023 at 08:08:39AM -0600, Alex Williamson wrote: > > > > Hide this device in the list looks fine to me. But the calling user should > > > not do any new device open before finishing hot-reset. Otherwise, user may > > > miss a device that needs to do pre/post reset. I think this requirement is > > > acceptable. Is it? > > > > I think Kevin and Jason are leaning towards reporting the entire > > dev-set. The INFO ioctl has always been a point-in-time reading, no > > guarantees are made if the host or user configuration is changed. > > Nothing changes in that respect. > > Yeah, I think your point about qemu community formus suggest we should > err toward having qemu provide some fully detailed debug report. > > > > > Whereas dev-id < 0 > > > > (== -1) is an affected device which prevents hot-reset, ex. an un-owned > > > > device, device configured within a different iommufd_ctx, or device > > > > opened outside of the vfio cdev API." Is that about right? Thanks, > > > > > > Do you mean to have separate err-code for the three possibilities? As > > > the devid is generated by iommufd and it is u32. I'm not sure if we can > > > have such err-code definition without reserving some ids in iommufd. > > > > Yes, if we're going to report the full dev-set, I think we need at > > least two unique error codes or else the user has no way to determine > > the subset of invalid dev-ids which block the reset. > > If you think this is important to report we should report 0 and -1, > and adjust the iommufd xarray allocator to reserve -1 Then the alloc range should be from 1 to 0xffffffff. > > It depends what you want to show for the debugging. 
> > eg if we have debugging where qemu dumps this table: > > BDF In VM iommu_group Has VFIO driver Has Kernel Driver > > By also doing various sysfs probes based on the BDF, then the admin > action to remedy the situation is: > > Make "Has VFIO driver = y" or "Has Kernel Driver = n" for every row in > the table to make the reset work. > > And we don't need the distinction. Adding the 0/-1 lets you make a > useful table without doing any sysfs work. > > > I think Jason is proposing the set of valid dev-ids are >0, a dev-id > > of zero indicates some form of non-blocking, while <0 (or maybe > > specifically -1) indicates a blocking device. > > Yes, 0 and -1 would be fine with those definitions. The only use of > the data is to add a 'blocking use of reset' colum to the table > above.. Should -1 and 0 be defined in uapi as well? If yes, this seems not easy to get a proper naming for them. Or just document it in vfio uapi header to say -1 (blocking) and 0 (no-devid-but-not-blocking) blabla. Regards, Yi Liu
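
To make the ID-range question concrete, here is a toy allocator that never hands out either reserved value. It is a standalone sketch and not the iommufd xarray allocator; the names are hypothetical, and it assumes that both 0 and 0xffffffff (-1 as a u32) are kept out of the valid range, which puts the upper bound at 0xfffffffe.

```c
#include <stdint.h>

/* Assumed valid range once 0 and -1 (0xffffffff) are both reserved. */
#define SKETCH_ID_MIN 1u
#define SKETCH_ID_MAX (UINT32_MAX - 1u)

/* Toy bump allocator; a zero-initialised instance starts at SKETCH_ID_MIN. */
struct sketch_id_alloc {
	uint32_t next;
};

/* Returns 0 and stores the new ID in *out, or -1 when the range is used up. */
static int sketch_id_alloc_next(struct sketch_id_alloc *a, uint32_t *out)
{
	if (a->next < SKETCH_ID_MIN)
		a->next = SKETCH_ID_MIN;
	if (a->next > SKETCH_ID_MAX)
		return -1;
	*out = a->next++;
	return 0;
}
```

With the range bounded this way, a reported dev-id of 0 or -1 can never collide with an ID belonging to a real object.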
On Sun, Apr 23, 2023 at 10:28:58AM +0000, Liu, Yi L wrote:

> This noiommu improvement shall allow user to attach ioas to noiommu devices.
> is it? This may be done by calling iommufd_access_attach(). So there is a
> quick question. In the cdev series, shall we allow the attachment
> for noiommu?

Yes, I think we need to undo the decision we talked about earlier
where no-iommu would be asked for with a -1 iommufd. All vfio_devices
should have an iommufd_ctx when container is compiled out.

You don't need to do anything with the ctx for no-iommu beyond demand
that userspace provide it.

Jason
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Thursday, April 20, 2023 10:09 PM [...] > > > Whereas dev-id < 0 > > > (== -1) is an affected device which prevents hot-reset, ex. an un-owned > > > device, device configured within a different iommufd_ctx, or device > > > opened outside of the vfio cdev API." Is that about right? Thanks, > > > > Do you mean to have separate err-code for the three possibilities? As > > the devid is generated by iommufd and it is u32. I'm not sure if we can > > have such err-code definition without reserving some ids in iommufd. > > Yes, if we're going to report the full dev-set, I think we need at > least two unique error codes or else the user has no way to determine > the subset of invalid dev-ids which block the reset. I think Jason is > proposing the set of valid dev-ids are >0, a dev-id of zero indicates > some form of non-blocking, while <0 (or maybe specifically -1) > indicates a blocking device. I was trying to get consensus on a formal > definition of each of those error codes in my previous reply. Thanks, Seems like RESETTABLE flag is not needed if we report -1 for the devices that block hotreset. Userspace can deduce if the calling device is resettable or not by checking if there is any -1 in the affected device list. Regards, Yi Liu
On Wed, 26 Apr 2023 07:22:17 +0000 "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Thursday, April 20, 2023 10:09 PM > [...] > > > > Whereas dev-id < 0 > > > > (== -1) is an affected device which prevents hot-reset, ex. an un-owned > > > > device, device configured within a different iommufd_ctx, or device > > > > opened outside of the vfio cdev API." Is that about right? Thanks, > > > > > > Do you mean to have separate err-code for the three possibilities? As > > > the devid is generated by iommufd and it is u32. I'm not sure if we can > > > have such err-code definition without reserving some ids in iommufd. > > > > Yes, if we're going to report the full dev-set, I think we need at > > least two unique error codes or else the user has no way to determine > > the subset of invalid dev-ids which block the reset. I think Jason is > > proposing the set of valid dev-ids are >0, a dev-id of zero indicates > > some form of non-blocking, while <0 (or maybe specifically -1) > > indicates a blocking device. I was trying to get consensus on a formal > > definition of each of those error codes in my previous reply. Thanks, > > Seems like RESETTABLE flag is not needed if we report -1 for the devices > that block hotreset. Userspace can deduce if the calling device is resettable > or not by checking if there is any -1 in the affected device list. There is some redundancy there, yes. Given the desire for a null array on the actual reset ioctl I assumed there would also be a desire to streamline the info ioctl such that userspace isn't required to parse the return array, for example maybe userspace isn't required to pass a full buffer and can get the reset availability status from only the header. Of course it's still the responsibility of userspace to know the extent of the reset. Thanks, Alex
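
The redundancy Yi points out can be shown directly: if blocking devices are reported as -1, resettability is derivable by scanning the list. The sketch below is a userspace illustration with hypothetical names, assuming the -1 encoding proposed in this thread; it is the per-entry scan that a header-only flag would let userspace skip.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name; -1 as a u32 marks a device that blocks the reset. */
#define SKETCH_DEVID_BLOCKING UINT32_MAX

/* True when no entry in the affected-device list blocks the hot reset. */
static bool sketch_list_is_resettable(const uint32_t *devids, size_t count)
{
	for (size_t i = 0; i < count; i++)
		if (devids[i] == SKETCH_DEVID_BLOCKING)
			return false;
	return true;
}
```

Keeping a flag in the header means a caller that only wants a yes/no answer would not need to supply a buffer large enough for the full list.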
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, April 26, 2023 9:20 PM
>
> On Wed, 26 Apr 2023 07:22:17 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
>
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Thursday, April 20, 2023 10:09 PM
> > [...]
> > > > > Whereas dev-id < 0
> > > > > (== -1) is an affected device which prevents hot-reset, ex. an un-owned
> > > > > device, device configured within a different iommufd_ctx, or device
> > > > > opened outside of the vfio cdev API." Is that about right? Thanks,
> > > >
> > > > Do you mean to have separate err-code for the three possibilities? As
> > > > the devid is generated by iommufd and it is u32. I'm not sure if we can
> > > > have such err-code definition without reserving some ids in iommufd.
> > >
> > > Yes, if we're going to report the full dev-set, I think we need at
> > > least two unique error codes or else the user has no way to determine
> > > the subset of invalid dev-ids which block the reset. I think Jason is
> > > proposing the set of valid dev-ids are >0, a dev-id of zero indicates
> > > some form of non-blocking, while <0 (or maybe specifically -1)
> > > indicates a blocking device. I was trying to get consensus on a formal
> > > definition of each of those error codes in my previous reply. Thanks,
> >
> > Seems like RESETTABLE flag is not needed if we report -1 for the devices
> > that block hotreset. Userspace can deduce if the calling device is resettable
> > or not by checking if there is any -1 in the affected device list.
>
> There is some redundancy there, yes. Given the desire for a null array
> on the actual reset ioctl I assumed there would also be a desire to
> streamline the info ioctl such that userspace isn't required to parse
> the return array, for example maybe userspace isn't required to pass a
> full buffer and can get the reset availability status from only the
> header. Of course it's still the responsibility of userspace to know
> the extent of the reset. Thanks,

I kept it and have sent a refreshed version of the hot-reset series.
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 19f5b075d70a..a5a7e148dce1 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -30,6 +30,7 @@
 #if IS_ENABLED(CONFIG_EEH)
 #include <asm/eeh.h>
 #endif
+#include <uapi/linux/iommufd.h>
 
 #include "vfio_pci_priv.h"
 
@@ -767,6 +768,20 @@ static int vfio_pci_get_irq_count(struct vfio_pci_core_device *vdev, int irq_typ
 	return 0;
 }
 
+static struct vfio_device *
+vfio_pci_find_device_in_devset(struct vfio_device_set *dev_set,
+			       struct pci_dev *pdev)
+{
+	struct vfio_device *cur;
+
+	lockdep_assert_held(&dev_set->lock);
+
+	list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
+		if (cur->dev == &pdev->dev)
+			return cur;
+	return NULL;
+}
+
 static int vfio_pci_count_devs(struct pci_dev *pdev, void *data)
 {
 	(*(int *)data)++;
@@ -776,13 +791,20 @@ static int vfio_pci_count_devs(struct pci_dev *pdev, void *data)
 struct vfio_pci_fill_info {
 	int max;
 	int cur;
+	bool require_devid;
+	struct iommufd_ctx *iommufd;
+	struct vfio_device_set *dev_set;
 	struct vfio_pci_dependent_device *devices;
 };
 
 static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
 {
 	struct vfio_pci_fill_info *fill = data;
+	struct vfio_device_set *dev_set = fill->dev_set;
 	struct iommu_group *iommu_group;
+	struct vfio_device *vdev;
+
+	lockdep_assert_held(&dev_set->lock);
 
 	if (fill->cur == fill->max)
 		return -EAGAIN; /* Something changed, try again */
@@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
 	if (!iommu_group)
 		return -EPERM; /* Cannot reset non-isolated devices */
 
-	fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
+	if (fill->require_devid) {
+		/*
+		 * Report dev_id of the devices that are opened as cdev
+		 * and have the same iommufd with the fill->iommufd.
+		 * Otherwise, just fill IOMMUFD_INVALID_ID.
+		 */
+		vdev = vfio_pci_find_device_in_devset(dev_set, pdev);
+		if (vdev && vfio_device_cdev_opened(vdev) &&
+		    fill->iommufd == vfio_iommufd_physical_ictx(vdev))
+			vfio_iommufd_physical_devid(vdev, &fill->devices[fill->cur].dev_id);
+		else
+			fill->devices[fill->cur].dev_id = IOMMUFD_INVALID_ID;
+	} else {
+		fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
+	}
 	fill->devices[fill->cur].segment = pci_domain_nr(pdev->bus);
 	fill->devices[fill->cur].bus = pdev->bus->number;
 	fill->devices[fill->cur].devfn = pdev->devfn;
@@ -1230,17 +1266,27 @@ static int vfio_pci_ioctl_get_pci_hot_reset_info(
 		return -ENOMEM;
 
 	fill.devices = devices;
+	fill.dev_set = vdev->vdev.dev_set;
 
+	mutex_lock(&vdev->vdev.dev_set->lock);
+	if (vfio_device_cdev_opened(&vdev->vdev)) {
+		fill.require_devid = true;
+		fill.iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
+	}
 	ret = vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_fill_devs,
 					    &fill, slot);
+	mutex_unlock(&vdev->vdev.dev_set->lock);
 
 	/*
 	 * If a device was removed between counting and filling, we may come up
 	 * short of fill.max. If a device was added, we'll have a return of
 	 * -EAGAIN above.
 	 */
-	if (!ret)
+	if (!ret) {
 		hdr.count = fill.cur;
+		if (fill.require_devid)
+			hdr.flags = VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID;
+	}
 
 reset_info_exit:
 	if (copy_to_user(arg, &hdr, minsz))
@@ -2346,12 +2392,10 @@ static bool vfio_dev_in_files(struct vfio_pci_core_device *vdev,
 static int vfio_pci_is_device_in_set(struct pci_dev *pdev, void *data)
 {
 	struct vfio_device_set *dev_set = data;
-	struct vfio_device *cur;
 
-	list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
-		if (cur->dev == &pdev->dev)
-			return 0;
-	return -EBUSY;
+	lockdep_assert_held(&dev_set->lock);
+
+	return vfio_pci_find_device_in_devset(dev_set, pdev) ? 0 : -EBUSY;
 }
 
 /*
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 25432ef213ee..5a34364e3b94 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -650,11 +650,32 @@ enum {
  * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 12,
  *					      struct vfio_pci_hot_reset_info)
  *
+ * This command is used to query the affected devices in the hot reset for
+ * a given device. User could use the information reported by this command
+ * to figure out the affected devices among the devices it has opened.
+ * This command always reports the segment, bus and devfn information for
+ * each affected device, and selectively report the group_id or the dev_id
+ * per the way how the device being queried is opened.
+ * - If the device is opened via the traditional group/container manner,
+ *   this command reports the group_id for each affected device.
+ *
+ * - If the device is opened as a cdev, this command needs to report
+ *   dev_id for each affected device and set the
+ *   VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID flag. For the affected
+ *   devices that are not opened as cdev or bound to different iommufds
+ *   with the device that is queried, report an invalid dev_id to avoid
+ *   potential dev_id conflict as dev_id is local to iommufd. For such
+ *   affected devices, user shall fall back to use the segment, bus and
+ *   devfn info to map it to opened device.
+ *
  * Return: 0 on success, -errno on failure:
  *	-enospc = insufficient buffer, -enodev = unsupported for device.
  */
 struct vfio_pci_dependent_device {
-	__u32	group_id;
+	union {
+		__u32	group_id;
+		__u32	dev_id;
+	};
 	__u16	segment;
 	__u8	bus;
 	__u8	devfn; /* Use PCI_SLOT/PCI_FUNC */
@@ -663,6 +684,7 @@ struct vfio_pci_dependent_device {
 struct vfio_pci_hot_reset_info {
 	__u32	argsz;
 	__u32	flags;
+#define VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID	(1 << 0)
 	__u32	count;
 	struct vfio_pci_dependent_device	devices[];
 };
for the users that accept device fds passed from management stacks to be
able to figure out the hot reset affected devices among the devices
opened by the user. This is needed as such users do not have BDF (bus,
devfn) knowledge about the devices they have opened, and hence are unable
to use the information reported by existing
VFIO_DEVICE_GET_PCI_HOT_RESET_INFO to figure out the affected devices.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 58 ++++++++++++++++++++++++++++----
 include/uapi/linux/vfio.h        | 24 ++++++++++++-
 2 files changed, 74 insertions(+), 8 deletions(-)
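As an illustration of how a user without BDF knowledge might consume the extended info, the following sketch matches a reported entry against the dev_ids the user obtained when binding its devices to iommufd. The structs are local mirrors of the uAPI layout in this patch so the example builds without patched headers. The `sketch_opened_device` bookkeeping and the numeric value chosen for the invalid dev_id are assumptions; the patch only names IOMMUFD_INVALID_ID.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID from this patch. */
#define SKETCH_HOT_RESET_FLAG_IOMMUFD_DEV_ID (1u << 0)
/* Assumption: placeholder for IOMMUFD_INVALID_ID; the real value may differ. */
#define SKETCH_INVALID_DEV_ID 0u

/* Local mirror of struct vfio_pci_dependent_device after this patch. */
struct sketch_dependent_device {
	union {
		uint32_t group_id;
		uint32_t dev_id;
	};
	uint16_t segment;
	uint8_t  bus;
	uint8_t  devfn;
};

/* Hypothetical per-device state kept by the user for each opened cdev. */
struct sketch_opened_device {
	uint32_t dev_id; /* id received when the device was bound to iommufd */
	int      fd;     /* device fd passed in by the management stack */
};

/*
 * Returns the index in opened[] whose dev_id matches the reported entry,
 * or -1 when the entry carries a group_id, carries the invalid dev_id,
 * or matches none of the opened devices.
 */
static int sketch_match_by_devid(uint32_t flags,
				 const struct sketch_dependent_device *dep,
				 const struct sketch_opened_device *opened,
				 size_t nr_opened)
{
	assert(dep != NULL);

	if (!(flags & SKETCH_HOT_RESET_FLAG_IOMMUFD_DEV_ID))
		return -1; /* union holds group_id, not dev_id */
	if (dep->dev_id == SKETCH_INVALID_DEV_ID)
		return -1; /* caller falls back to segment/bus/devfn */

	for (size_t i = 0; i < nr_opened; i++)
		if (opened[i].dev_id == dep->dev_id)
			return (int)i;
	return -1;
}
```

A return of -1 for an entry with the flag set corresponds to the case the uAPI comment describes, where the user falls back to segment, bus and devfn.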