Message ID | 20230327093458.44939-11-yi.l.liu@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Introduce new methods for verifying ownership in vfio PCI hot reset | expand |
On Mon, 27 Mar 2023 02:34:58 -0700 Yi Liu <yi.l.liu@intel.com> wrote: > This is a preparation for vfio device cdev as cdev gives userspace the > capability to open device cdev fd and management stack (e.g. libvirt) > could pass the device fd to the actual user (e.g. QEMU). As a result, > the actual user has no idea about the device's bus:devfn information. > This is a problem when user uses VFIO_DEVICE_GET_PCI_HOT_RESET_INFO to > know the hot reset affected device scope as this ioctl returns bus:devfn > info. For the fd passing usage, the acutal user cannot map the bus:devfn > to the devices it has opened via the fd passed from management stack. So > a new ioctl is required. > > This new ioctl reports the list of iommufd dev_id that is opened by the > user. If there is affected device that is not bound to vfio driver or > opened by another user, this command shall fail with -EPERM. For the > noiommu mode in the vfio device cdev path, this shall fail as no dev_id > would be generated, hence nothing to report. > > This ioctl is useless to the users that open vfio group as such users > have no idea about the iommufd dev_id and it can use the existing > VFIO_DEVICE_GET_PCI_HOT_RESET_INFO. The user that uses the traditional > mode vfio group/container would be failed if invoking this ioctl. But > the user that uses the iommufd compat mode vfio group/container shall > succeed. This is harmless as long as user cannot make use of it and > should use VFIO_DEVICE_GET_PCI_HOT_RESET_INFO. So VFIO_DEVICE_GET_PCI_HOT_RESET_INFO reports a group and bdf, but VFIO_DEVICE_GET_PCI_HOT_RESET_*GROUP*_INFO is meant for the non-group, cdev use case and returns a dev_id rather than a group??? Additionally, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO has a flags arg that isn't used, why do we need a new ioctl vs defining VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID. In fact, we could define vfio_dependent_device as: struct vfio_pci_dependent_device { union { __u32 group_id; __u32 dev_id; } __u16 segment; __u8 bus; __u8 devfn; }; If the user calls with the above flag, dev_id is valid, otherwise group_id. Perhaps segment:buus:devfn could still be filled in with a NULL/invalid dev_id if the user doesn't have permissions for the device so that debugging from userspace isn't so opaque. Thanks, Alex > Signed-off-by: Yi Liu <yi.l.liu@intel.com> > --- > drivers/vfio/pci/vfio_pci_core.c | 98 ++++++++++++++++++++++++++++++++ > include/uapi/linux/vfio.h | 47 +++++++++++++++ > 2 files changed, 145 insertions(+) > > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c > index 19f5b075d70a..45edf4e9b98b 100644 > --- a/drivers/vfio/pci/vfio_pci_core.c > +++ b/drivers/vfio/pci/vfio_pci_core.c > @@ -1181,6 +1181,102 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev, > return ret; > } > > +static struct pci_dev * > +vfio_pci_dev_set_resettable(struct vfio_device_set *dev_set); > + > +static int vfio_pci_ioctl_get_pci_hot_reset_group_info( > + struct vfio_pci_core_device *vdev, > + struct vfio_pci_hot_reset_group_info __user *arg) > +{ > + unsigned long minsz = > + offsetofend(struct vfio_pci_hot_reset_group_info, count); > + struct vfio_pci_hot_reset_group_info hdr; > + struct iommufd_ctx *iommufd, *cur_iommufd; > + u32 count = 0, index = 0, *devices = NULL; > + struct vfio_pci_core_device *cur; > + bool slot = false; > + int ret = 0; > + > + if (copy_from_user(&hdr, arg, minsz)) > + return -EFAULT; > + > + if (hdr.argsz < minsz) > + return -EINVAL; > + > + hdr.flags = 0; > + > + /* Can we do a slot or bus reset or neither? */ > + if (!pci_probe_reset_slot(vdev->pdev->slot)) > + slot = true; > + else if (pci_probe_reset_bus(vdev->pdev->bus)) > + return -ENODEV; > + > + mutex_lock(&vdev->vdev.dev_set->lock); > + if (!vfio_pci_dev_set_resettable(vdev->vdev.dev_set)) { > + ret = -EPERM; > + goto out_unlock; > + } > + > + iommufd = vfio_iommufd_physical_ictx(&vdev->vdev); > + if (!iommufd) { > + ret = -EPERM; > + goto out_unlock; > + } > + > + /* How many devices are affected? */ > + ret = vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_count_devs, > + &count, slot); > + if (ret) > + goto out_unlock; > + > + WARN_ON(!count); /* Should always be at least one */ > + > + /* > + * If there's enough space, fill it now, otherwise return -ENOSPC and > + * the number of devices affected. > + */ > + if (hdr.argsz < sizeof(hdr) + (count * sizeof(*devices))) { > + ret = -ENOSPC; > + hdr.count = count; > + goto reset_info_exit; > + } > + > + devices = kcalloc(count, sizeof(*devices), GFP_KERNEL); > + if (!devices) { > + ret = -ENOMEM; > + goto reset_info_exit; > + } > + > + list_for_each_entry(cur, &vdev->vdev.dev_set->device_list, vdev.dev_set_list) { > + cur_iommufd = vfio_iommufd_physical_ictx(&cur->vdev); > + if (cur->vdev.open_count) { > + if (cur_iommufd != iommufd) { > + ret = -EPERM; > + break; > + } > + ret = vfio_iommufd_physical_devid(&cur->vdev, &devices[index]); > + if (ret) > + break; > + index++; > + } > + } > + > +reset_info_exit: > + if (copy_to_user(arg, &hdr, minsz)) > + ret = -EFAULT; > + > + if (!ret) { > + if (copy_to_user(&arg->devices, devices, > + hdr.count * sizeof(*devices))) > + ret = -EFAULT; > + } > + > + kfree(devices); > +out_unlock: > + mutex_unlock(&vdev->vdev.dev_set->lock); > + return ret; > +} > + > static int vfio_pci_ioctl_get_pci_hot_reset_info( > struct vfio_pci_core_device *vdev, > struct vfio_pci_hot_reset_info __user *arg) > @@ -1404,6 +1500,8 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd, > return vfio_pci_ioctl_get_irq_info(vdev, uarg); > case VFIO_DEVICE_GET_PCI_HOT_RESET_INFO: > return vfio_pci_ioctl_get_pci_hot_reset_info(vdev, uarg); > + case VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO: > + return vfio_pci_ioctl_get_pci_hot_reset_group_info(vdev, uarg); > case VFIO_DEVICE_GET_REGION_INFO: > return vfio_pci_ioctl_get_region_info(vdev, uarg); > case VFIO_DEVICE_IOEVENTFD: > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h > index 25432ef213ee..61b801dfd40b 100644 > --- a/include/uapi/linux/vfio.h > +++ b/include/uapi/linux/vfio.h > @@ -669,6 +669,53 @@ struct vfio_pci_hot_reset_info { > > #define VFIO_DEVICE_GET_PCI_HOT_RESET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12) > > +/** > + * VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 12, > + * struct vfio_pci_hot_reset_group_info) > + * > + * This is used in the vfio device cdev mode. It returns the list of > + * affected devices (represented by iommufd dev_id) when hot reset is > + * issued on the current device with which this ioctl is invoked. It > + * only includes the devices that are opened by the current user in the > + * time of this command is invoked. This list may change when user opens > + * new device or close opened device, hence user should re-invoke it > + * after open/close devices. This command has no guarantee on the result > + * of VFIO_DEVICE_PCI_HOT_RESET since the not-opened affected device can > + * be by other users in the window between the two ioctls. If the affected > + * devices are opened by multiple users, the VFIO_DEVICE_PCI_HOT_RESET > + * shall fail, detail can check the description of VFIO_DEVICE_PCI_HOT_RESET. > + * > + * For the users that open vfio group/container, this ioctl is useless as > + * they have no idea about the iommufd dev_id returned by this ioctl. For > + * the users of the traditional mode vfio group/container, this ioctl will > + * fail as this mode does not use iommufd hence no dev_id to report back. > + * For the users of the iommufd compat mode vfio group/container, this ioctl > + * would succeed as this mode uses iommufd as container fd. But such users > + * still have no idea about the iommufd dev_id as the dev_id is only stored > + * in kernel in this mode. For the users of the vfio group/container, the > + * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO should be used to know the hot reset > + * affected devices. > + * > + * Return: 0 on success, -errno on failure: > + * -enospc = insufficient buffer; > + * -enodev = unsupported for device; > + * -eperm = no permission for device, this error comes: > + * - when there are affected devices that are opened but > + * not bound to the same iommufd with the current device > + * with which this ioctl is invoked, > + * - there are affected devices that are not bound to vfio > + * driver yet. > + * - no valid iommufd is bound (e.g. noiommu mode) > + */ > +struct vfio_pci_hot_reset_group_info { > + __u32 argsz; > + __u32 flags; > + __u32 count; > + __u32 devices[]; > +}; > + > +#define VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO _IO(VFIO_TYPE, VFIO_BASE + 18) > + > /** > * VFIO_DEVICE_PCI_HOT_RESET - _IOW(VFIO_TYPE, VFIO_BASE + 13, > * struct vfio_pci_hot_reset)
On Mon, 27 Mar 2023 13:26:19 -0600 Alex Williamson <alex.williamson@redhat.com> wrote: > On Mon, 27 Mar 2023 02:34:58 -0700 > Yi Liu <yi.l.liu@intel.com> wrote: > > > This is a preparation for vfio device cdev as cdev gives userspace the > > capability to open device cdev fd and management stack (e.g. libvirt) > > could pass the device fd to the actual user (e.g. QEMU). As a result, > > the actual user has no idea about the device's bus:devfn information. > > This is a problem when user uses VFIO_DEVICE_GET_PCI_HOT_RESET_INFO to > > know the hot reset affected device scope as this ioctl returns bus:devfn > > info. For the fd passing usage, the acutal user cannot map the bus:devfn > > to the devices it has opened via the fd passed from management stack. So > > a new ioctl is required. > > > > This new ioctl reports the list of iommufd dev_id that is opened by the > > user. If there is affected device that is not bound to vfio driver or > > opened by another user, this command shall fail with -EPERM. For the > > noiommu mode in the vfio device cdev path, this shall fail as no dev_id > > would be generated, hence nothing to report. > > > > This ioctl is useless to the users that open vfio group as such users > > have no idea about the iommufd dev_id and it can use the existing > > VFIO_DEVICE_GET_PCI_HOT_RESET_INFO. The user that uses the traditional > > mode vfio group/container would be failed if invoking this ioctl. But > > the user that uses the iommufd compat mode vfio group/container shall > > succeed. This is harmless as long as user cannot make use of it and > > should use VFIO_DEVICE_GET_PCI_HOT_RESET_INFO. > > > So VFIO_DEVICE_GET_PCI_HOT_RESET_INFO reports a group and bdf, but > VFIO_DEVICE_GET_PCI_HOT_RESET_*GROUP*_INFO is meant for the non-group, > cdev use case and returns a dev_id rather than a group??? > > Additionally, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO has a flags arg that > isn't used, why do we need a new ioctl vs defining > VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID. In fact, we could define > vfio_dependent_device as: > > struct vfio_pci_dependent_device { > union { > __u32 group_id; > __u32 dev_id; > } > __u16 segment; > __u8 bus; > __u8 devfn; > }; > > If the user calls with the above flag, dev_id is valid, otherwise > group_id. Perhaps segment:buus:devfn could still be filled in with a > NULL/invalid dev_id if the user doesn't have permissions for the device > so that debugging from userspace isn't so opaque. Thanks, > > Alex > > > Signed-off-by: Yi Liu <yi.l.liu@intel.com> > > --- > > drivers/vfio/pci/vfio_pci_core.c | 98 ++++++++++++++++++++++++++++++++ > > include/uapi/linux/vfio.h | 47 +++++++++++++++ > > 2 files changed, 145 insertions(+) > > > > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c > > index 19f5b075d70a..45edf4e9b98b 100644 > > --- a/drivers/vfio/pci/vfio_pci_core.c > > +++ b/drivers/vfio/pci/vfio_pci_core.c > > @@ -1181,6 +1181,102 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev, > > return ret; > > } > > > > +static struct pci_dev * > > +vfio_pci_dev_set_resettable(struct vfio_device_set *dev_set); > > + > > +static int vfio_pci_ioctl_get_pci_hot_reset_group_info( > > + struct vfio_pci_core_device *vdev, > > + struct vfio_pci_hot_reset_group_info __user *arg) > > +{ > > + unsigned long minsz = > > + offsetofend(struct vfio_pci_hot_reset_group_info, count); > > + struct vfio_pci_hot_reset_group_info hdr; > > + struct iommufd_ctx *iommufd, *cur_iommufd; > > + u32 count = 0, index = 0, *devices = NULL; > > + struct vfio_pci_core_device *cur; > > + bool slot = false; > > + int ret = 0; > > + > > + if (copy_from_user(&hdr, arg, minsz)) > > + return -EFAULT; > > + > > + if (hdr.argsz < minsz) > > + return -EINVAL; > > + > > + hdr.flags = 0; > > + > > + /* Can we do a slot or bus reset or neither? */ > > + if (!pci_probe_reset_slot(vdev->pdev->slot)) > > + slot = true; > > + else if (pci_probe_reset_bus(vdev->pdev->bus)) > > + return -ENODEV; > > + > > + mutex_lock(&vdev->vdev.dev_set->lock); > > + if (!vfio_pci_dev_set_resettable(vdev->vdev.dev_set)) { > > + ret = -EPERM; > > + goto out_unlock; > > + } > > + > > + iommufd = vfio_iommufd_physical_ictx(&vdev->vdev); > > + if (!iommufd) { > > + ret = -EPERM; > > + goto out_unlock; > > + } > > + > > + /* How many devices are affected? */ > > + ret = vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_count_devs, > > + &count, slot); > > + if (ret) > > + goto out_unlock; > > + > > + WARN_ON(!count); /* Should always be at least one */ > > + > > + /* > > + * If there's enough space, fill it now, otherwise return -ENOSPC and > > + * the number of devices affected. > > + */ > > + if (hdr.argsz < sizeof(hdr) + (count * sizeof(*devices))) { > > + ret = -ENOSPC; > > + hdr.count = count; > > + goto reset_info_exit; > > + } > > + > > + devices = kcalloc(count, sizeof(*devices), GFP_KERNEL); > > + if (!devices) { > > + ret = -ENOMEM; > > + goto reset_info_exit; > > + } > > + > > + list_for_each_entry(cur, &vdev->vdev.dev_set->device_list, vdev.dev_set_list) { > > + cur_iommufd = vfio_iommufd_physical_ictx(&cur->vdev); > > + if (cur->vdev.open_count) { > > + if (cur_iommufd != iommufd) { > > + ret = -EPERM; > > + break; > > + } > > + ret = vfio_iommufd_physical_devid(&cur->vdev, &devices[index]); > > + if (ret) > > + break; > > + index++; > > + } > > + } This also makes use of vfio_pci_for_each_slot_or_bus() to iterate affected devices, but then still assumes those affected devices are necessarily represented in the dev_set. For example, a two-function device with ACS isolation can have function 0 bound to vfio and function 1 bound to a native host driver. The above code requires the user to pass a buffer large enough for both functions, but only function 0 is part of the dev_set, so function 1 is not reported as affected, nor does it generate an error. It looks like we also fail to set hdr.count except in the error path above, so the below copy_to_user() copies a user specified range beyond the end the allocated devices buffer out to userspace! Thanks, Alex > > + > > +reset_info_exit: > > + if (copy_to_user(arg, &hdr, minsz)) > > + ret = -EFAULT; > > + > > + if (!ret) { > > + if (copy_to_user(&arg->devices, devices, > > + hdr.count * sizeof(*devices))) > > + ret = -EFAULT; > > + } > > + > > + kfree(devices); > > +out_unlock: > > + mutex_unlock(&vdev->vdev.dev_set->lock); > > + return ret; > > +} > > + > > static int vfio_pci_ioctl_get_pci_hot_reset_info( > > struct vfio_pci_core_device *vdev, > > struct vfio_pci_hot_reset_info __user *arg) > > @@ -1404,6 +1500,8 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd, > > return vfio_pci_ioctl_get_irq_info(vdev, uarg); > > case VFIO_DEVICE_GET_PCI_HOT_RESET_INFO: > > return vfio_pci_ioctl_get_pci_hot_reset_info(vdev, uarg); > > + case VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO: > > + return vfio_pci_ioctl_get_pci_hot_reset_group_info(vdev, uarg); > > case VFIO_DEVICE_GET_REGION_INFO: > > return vfio_pci_ioctl_get_region_info(vdev, uarg); > > case VFIO_DEVICE_IOEVENTFD: > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h > > index 25432ef213ee..61b801dfd40b 100644 > > --- a/include/uapi/linux/vfio.h > > +++ b/include/uapi/linux/vfio.h > > @@ -669,6 +669,53 @@ struct vfio_pci_hot_reset_info { > > > > #define VFIO_DEVICE_GET_PCI_HOT_RESET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12) > > > > +/** > > + * VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 12, > > + * struct vfio_pci_hot_reset_group_info) > > + * > > + * This is used in the vfio device cdev mode. It returns the list of > > + * affected devices (represented by iommufd dev_id) when hot reset is > > + * issued on the current device with which this ioctl is invoked. It > > + * only includes the devices that are opened by the current user in the > > + * time of this command is invoked. This list may change when user opens > > + * new device or close opened device, hence user should re-invoke it > > + * after open/close devices. This command has no guarantee on the result > > + * of VFIO_DEVICE_PCI_HOT_RESET since the not-opened affected device can > > + * be by other users in the window between the two ioctls. If the affected > > + * devices are opened by multiple users, the VFIO_DEVICE_PCI_HOT_RESET > > + * shall fail, detail can check the description of VFIO_DEVICE_PCI_HOT_RESET. > > + * > > + * For the users that open vfio group/container, this ioctl is useless as > > + * they have no idea about the iommufd dev_id returned by this ioctl. For > > + * the users of the traditional mode vfio group/container, this ioctl will > > + * fail as this mode does not use iommufd hence no dev_id to report back. > > + * For the users of the iommufd compat mode vfio group/container, this ioctl > > + * would succeed as this mode uses iommufd as container fd. But such users > > + * still have no idea about the iommufd dev_id as the dev_id is only stored > > + * in kernel in this mode. For the users of the vfio group/container, the > > + * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO should be used to know the hot reset > > + * affected devices. > > + * > > + * Return: 0 on success, -errno on failure: > > + * -enospc = insufficient buffer; > > + * -enodev = unsupported for device; > > + * -eperm = no permission for device, this error comes: > > + * - when there are affected devices that are opened but > > + * not bound to the same iommufd with the current device > > + * with which this ioctl is invoked, > > + * - there are affected devices that are not bound to vfio > > + * driver yet. > > + * - no valid iommufd is bound (e.g. noiommu mode) > > + */ > > +struct vfio_pci_hot_reset_group_info { > > + __u32 argsz; > > + __u32 flags; > > + __u32 count; > > + __u32 devices[]; > > +}; > > + > > +#define VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO _IO(VFIO_TYPE, VFIO_BASE + 18) > > + > > /** > > * VFIO_DEVICE_PCI_HOT_RESET - _IOW(VFIO_TYPE, VFIO_BASE + 13, > > * struct vfio_pci_hot_reset) >
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Tuesday, March 28, 2023 3:26 AM > > On Mon, 27 Mar 2023 02:34:58 -0700 > Yi Liu <yi.l.liu@intel.com> wrote: > > > This is a preparation for vfio device cdev as cdev gives userspace the > > capability to open device cdev fd and management stack (e.g. libvirt) > > could pass the device fd to the actual user (e.g. QEMU). As a result, > > the actual user has no idea about the device's bus:devfn information. > > This is a problem when user uses VFIO_DEVICE_GET_PCI_HOT_RESET_INFO to > > know the hot reset affected device scope as this ioctl returns bus:devfn > > info. For the fd passing usage, the acutal user cannot map the bus:devfn > > to the devices it has opened via the fd passed from management stack. So > > a new ioctl is required. > > > > This new ioctl reports the list of iommufd dev_id that is opened by the > > user. If there is affected device that is not bound to vfio driver or > > opened by another user, this command shall fail with -EPERM. For the > > noiommu mode in the vfio device cdev path, this shall fail as no dev_id > > would be generated, hence nothing to report. > > > > This ioctl is useless to the users that open vfio group as such users > > have no idea about the iommufd dev_id and it can use the existing > > VFIO_DEVICE_GET_PCI_HOT_RESET_INFO. The user that uses the traditional > > mode vfio group/container would be failed if invoking this ioctl. But > > the user that uses the iommufd compat mode vfio group/container shall > > succeed. This is harmless as long as user cannot make use of it and > > should use VFIO_DEVICE_GET_PCI_HOT_RESET_INFO. > > > So VFIO_DEVICE_GET_PCI_HOT_RESET_INFO reports a group and bdf, but > VFIO_DEVICE_GET_PCI_HOT_RESET_*GROUP*_INFO is meant for the non- > group, > cdev use case and returns a dev_id rather than a group??? Yes, this is the meaning, but poor naming here ☹ I also struggled on it. Perhaps your below Suggestion makes more sense. Introducing a flag and reuse the existing _INFO ioctl. > Additionally, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO has a flags arg that > isn't used, why do we need a new ioctl vs defining > VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID. Sure. I can follow this suggestion. BTW. I have a doubt here. This new flag is set by user. What if in the future kernel has new extensions and needs to report something new to the user and add new flags to tell user? Such flag is set by kernel. Then the flags field may have two kinds of flags (some set by user while some set by kernel). Will it mess up the flags space? > In fact, we could define vfio_dependent_device as: > > struct vfio_pci_dependent_device { > union { > __u32 group_id; > __u32 dev_id; > } > __u16 segment; > __u8 bus; > __u8 devfn; > }; > > If the user calls with the above flag, dev_id is valid, otherwise > group_id. Perhaps segment:buus:devfn could still be filled in with a > NULL/invalid dev_id if the user doesn't have permissions for the device > so that debugging from userspace isn't so opaque. Thanks, Also, have one question here. Should the invalid dev_id be defined in the vfio uapi or iommufd uapi? Maybe the latter one since dev_id is generated by iommufd subsystem. Regards, Yi Liu > > > Signed-off-by: Yi Liu <yi.l.liu@intel.com> > > --- > > drivers/vfio/pci/vfio_pci_core.c | 98 > ++++++++++++++++++++++++++++++++ > > include/uapi/linux/vfio.h | 47 +++++++++++++++ > > 2 files changed, 145 insertions(+) > > > > diff --git a/drivers/vfio/pci/vfio_pci_core.c > b/drivers/vfio/pci/vfio_pci_core.c > > index 19f5b075d70a..45edf4e9b98b 100644 > > --- a/drivers/vfio/pci/vfio_pci_core.c > > +++ b/drivers/vfio/pci/vfio_pci_core.c > > @@ -1181,6 +1181,102 @@ static int vfio_pci_ioctl_reset(struct > vfio_pci_core_device *vdev, > > return ret; > > } > > > > +static struct pci_dev * > > +vfio_pci_dev_set_resettable(struct vfio_device_set *dev_set); > > + > > +static int vfio_pci_ioctl_get_pci_hot_reset_group_info( > > + struct vfio_pci_core_device *vdev, > > + struct vfio_pci_hot_reset_group_info __user *arg) > > +{ > > + unsigned long minsz = > > + offsetofend(struct vfio_pci_hot_reset_group_info, count); > > + struct vfio_pci_hot_reset_group_info hdr; > > + struct iommufd_ctx *iommufd, *cur_iommufd; > > + u32 count = 0, index = 0, *devices = NULL; > > + struct vfio_pci_core_device *cur; > > + bool slot = false; > > + int ret = 0; > > + > > + if (copy_from_user(&hdr, arg, minsz)) > > + return -EFAULT; > > + > > + if (hdr.argsz < minsz) > > + return -EINVAL; > > + > > + hdr.flags = 0; > > + > > + /* Can we do a slot or bus reset or neither? */ > > + if (!pci_probe_reset_slot(vdev->pdev->slot)) > > + slot = true; > > + else if (pci_probe_reset_bus(vdev->pdev->bus)) > > + return -ENODEV; > > + > > + mutex_lock(&vdev->vdev.dev_set->lock); > > + if (!vfio_pci_dev_set_resettable(vdev->vdev.dev_set)) { > > + ret = -EPERM; > > + goto out_unlock; > > + } > > + > > + iommufd = vfio_iommufd_physical_ictx(&vdev->vdev); > > + if (!iommufd) { > > + ret = -EPERM; > > + goto out_unlock; > > + } > > + > > + /* How many devices are affected? */ > > + ret = vfio_pci_for_each_slot_or_bus(vdev->pdev, > vfio_pci_count_devs, > > + &count, slot); > > + if (ret) > > + goto out_unlock; > > + > > + WARN_ON(!count); /* Should always be at least one */ > > + > > + /* > > + * If there's enough space, fill it now, otherwise return -ENOSPC and > > + * the number of devices affected. > > + */ > > + if (hdr.argsz < sizeof(hdr) + (count * sizeof(*devices))) { > > + ret = -ENOSPC; > > + hdr.count = count; > > + goto reset_info_exit; > > + } > > + > > + devices = kcalloc(count, sizeof(*devices), GFP_KERNEL); > > + if (!devices) { > > + ret = -ENOMEM; > > + goto reset_info_exit; > > + } > > + > > + list_for_each_entry(cur, &vdev->vdev.dev_set->device_list, > vdev.dev_set_list) { > > + cur_iommufd = vfio_iommufd_physical_ictx(&cur->vdev); > > + if (cur->vdev.open_count) { > > + if (cur_iommufd != iommufd) { > > + ret = -EPERM; > > + break; > > + } > > + ret = vfio_iommufd_physical_devid(&cur->vdev, > &devices[index]); > > + if (ret) > > + break; > > + index++; > > + } > > + } > > + > > +reset_info_exit: > > + if (copy_to_user(arg, &hdr, minsz)) > > + ret = -EFAULT; > > + > > + if (!ret) { > > + if (copy_to_user(&arg->devices, devices, > > + hdr.count * sizeof(*devices))) > > + ret = -EFAULT; > > + } > > + > > + kfree(devices); > > +out_unlock: > > + mutex_unlock(&vdev->vdev.dev_set->lock); > > + return ret; > > +} > > + > > static int vfio_pci_ioctl_get_pci_hot_reset_info( > > struct vfio_pci_core_device *vdev, > > struct vfio_pci_hot_reset_info __user *arg) > > @@ -1404,6 +1500,8 @@ long vfio_pci_core_ioctl(struct vfio_device > *core_vdev, unsigned int cmd, > > return vfio_pci_ioctl_get_irq_info(vdev, uarg); > > case VFIO_DEVICE_GET_PCI_HOT_RESET_INFO: > > return vfio_pci_ioctl_get_pci_hot_reset_info(vdev, uarg); > > + case VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO: > > + return vfio_pci_ioctl_get_pci_hot_reset_group_info(vdev, > uarg); > > case VFIO_DEVICE_GET_REGION_INFO: > > return vfio_pci_ioctl_get_region_info(vdev, uarg); > > case VFIO_DEVICE_IOEVENTFD: > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h > > index 25432ef213ee..61b801dfd40b 100644 > > --- a/include/uapi/linux/vfio.h > > +++ b/include/uapi/linux/vfio.h > > @@ -669,6 +669,53 @@ struct vfio_pci_hot_reset_info { > > > > #define VFIO_DEVICE_GET_PCI_HOT_RESET_INFO _IO(VFIO_TYPE, > VFIO_BASE + 12) > > > > +/** > > + * VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO - _IOWR(VFIO_TYPE, > VFIO_BASE + 12, > > + * struct > vfio_pci_hot_reset_group_info) > > + * > > + * This is used in the vfio device cdev mode. It returns the list of > > + * affected devices (represented by iommufd dev_id) when hot reset is > > + * issued on the current device with which this ioctl is invoked. It > > + * only includes the devices that are opened by the current user in the > > + * time of this command is invoked. This list may change when user > opens > > + * new device or close opened device, hence user should re-invoke it > > + * after open/close devices. This command has no guarantee on the > result > > + * of VFIO_DEVICE_PCI_HOT_RESET since the not-opened affected device > can > > + * be by other users in the window between the two ioctls. If the > affected > > + * devices are opened by multiple users, the > VFIO_DEVICE_PCI_HOT_RESET > > + * shall fail, detail can check the description of > VFIO_DEVICE_PCI_HOT_RESET. > > + * > > + * For the users that open vfio group/container, this ioctl is useless as > > + * they have no idea about the iommufd dev_id returned by this ioctl. > For > > + * the users of the traditional mode vfio group/container, this ioctl will > > + * fail as this mode does not use iommufd hence no dev_id to report > back. > > + * For the users of the iommufd compat mode vfio group/container, this > ioctl > > + * would succeed as this mode uses iommufd as container fd. But such > users > > + * still have no idea about the iommufd dev_id as the dev_id is only > stored > > + * in kernel in this mode. For the users of the vfio group/container, the > > + * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO should be used to know the > hot reset > > + * affected devices. > > + * > > + * Return: 0 on success, -errno on failure: > > + * -enospc = insufficient buffer; > > + * -enodev = unsupported for device; > > + * -eperm = no permission for device, this error comes: > > + * - when there are affected devices that are opened but > > + * not bound to the same iommufd with the current device > > + * with which this ioctl is invoked, > > + * - there are affected devices that are not bound to vfio > > + * driver yet. > > + * - no valid iommufd is bound (e.g. noiommu mode) > > + */ > > +struct vfio_pci_hot_reset_group_info { > > + __u32 argsz; > > + __u32 flags; > > + __u32 count; > > + __u32 devices[]; > > +}; > > + > > +#define VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO > _IO(VFIO_TYPE, VFIO_BASE + 18) > > + > > /** > > * VFIO_DEVICE_PCI_HOT_RESET - _IOW(VFIO_TYPE, VFIO_BASE + 13, > > * struct vfio_pci_hot_reset)
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Tuesday, March 28, 2023 4:41 AM > > On Mon, 27 Mar 2023 13:26:19 -0600 > Alex Williamson <alex.williamson@redhat.com> wrote: > [...] > > > diff --git a/drivers/vfio/pci/vfio_pci_core.c > b/drivers/vfio/pci/vfio_pci_core.c > > > index 19f5b075d70a..45edf4e9b98b 100644 > > > --- a/drivers/vfio/pci/vfio_pci_core.c > > > +++ b/drivers/vfio/pci/vfio_pci_core.c > > > @@ -1181,6 +1181,102 @@ static int vfio_pci_ioctl_reset(struct > vfio_pci_core_device *vdev, > > > return ret; > > > } > > > > > > +static struct pci_dev * > > > +vfio_pci_dev_set_resettable(struct vfio_device_set *dev_set); > > > + > > > +static int vfio_pci_ioctl_get_pci_hot_reset_group_info( > > > + struct vfio_pci_core_device *vdev, > > > + struct vfio_pci_hot_reset_group_info __user *arg) > > > +{ > > > + unsigned long minsz = > > > + offsetofend(struct vfio_pci_hot_reset_group_info, count); > > > + struct vfio_pci_hot_reset_group_info hdr; > > > + struct iommufd_ctx *iommufd, *cur_iommufd; > > > + u32 count = 0, index = 0, *devices = NULL; > > > + struct vfio_pci_core_device *cur; > > > + bool slot = false; > > > + int ret = 0; > > > + > > > + if (copy_from_user(&hdr, arg, minsz)) > > > + return -EFAULT; > > > + > > > + if (hdr.argsz < minsz) > > > + return -EINVAL; > > > + > > > + hdr.flags = 0; > > > + > > > + /* Can we do a slot or bus reset or neither? */ > > > + if (!pci_probe_reset_slot(vdev->pdev->slot)) > > > + slot = true; > > > + else if (pci_probe_reset_bus(vdev->pdev->bus)) > > > + return -ENODEV; > > > + > > > + mutex_lock(&vdev->vdev.dev_set->lock); > > > + if (!vfio_pci_dev_set_resettable(vdev->vdev.dev_set)) { [1] > > > + ret = -EPERM; > > > + goto out_unlock; > > > + } > > > + > > > + iommufd = vfio_iommufd_physical_ictx(&vdev->vdev); > > > + if (!iommufd) { > > > + ret = -EPERM; > > > + goto out_unlock; > > > + } > > > + > > > + /* How many devices are affected? */ > > > + ret = vfio_pci_for_each_slot_or_bus(vdev->pdev, > vfio_pci_count_devs, > > > + &count, slot); > > > + if (ret) > > > + goto out_unlock; > > > + > > > + WARN_ON(!count); /* Should always be at least one */ > > > + > > > + /* > > > + * If there's enough space, fill it now, otherwise return -ENOSPC and > > > + * the number of devices affected. > > > + */ > > > + if (hdr.argsz < sizeof(hdr) + (count * sizeof(*devices))) { > > > + ret = -ENOSPC; > > > + hdr.count = count; > > > + goto reset_info_exit; > > > + } > > > + > > > + devices = kcalloc(count, sizeof(*devices), GFP_KERNEL); > > > + if (!devices) { > > > + ret = -ENOMEM; > > > + goto reset_info_exit; > > > + } > > > + > > > + list_for_each_entry(cur, &vdev->vdev.dev_set->device_list, > vdev.dev_set_list) { > > > + cur_iommufd = vfio_iommufd_physical_ictx(&cur->vdev); > > > + if (cur->vdev.open_count) { > > > + if (cur_iommufd != iommufd) { > > > + ret = -EPERM; > > > + break; > > > + } > > > + ret = vfio_iommufd_physical_devid(&cur->vdev, > &devices[index]); > > > + if (ret) > > > + break; > > > + index++; > > > + } > > > + } > > This also makes use of vfio_pci_for_each_slot_or_bus() to iterate > affected devices, but then still assumes those affected devices are > necessarily represented in the dev_set. For example, a two-function > device with ACS isolation can have function 0 bound to vfio and > function 1 bound to a native host driver. The above code requires the > user to pass a buffer large enough for both functions, but only > function 0 is part of the dev_set, so function 1 is not reported as > affected, nor does it generate an error. The vfio_pci_dev_set_resettable() is used at [1] to check if all the affected devices are in the dev_set. If not, then it fails at the first place. So in the following code, looping the devices in the dev_set is equivalent to looping devices with vfio_pci_for_each_slot_or_bus(). I need to loop dev_set as dev_set has the vfio_device which is more convenient to get dev_id. > > It looks like we also fail to set hdr.count except in the error path > above, so the below copy_to_user() copies a user specified range beyond > the end the allocated devices buffer out to userspace! Thanks, Oops, yes, it is. My test only has one device affected, so it does not hit the problem. ☹ Thanks, Yi Liu > Alex > > > > + > > > +reset_info_exit: > > > + if (copy_to_user(arg, &hdr, minsz)) > > > + ret = -EFAULT; > > > + > > > + if (!ret) { > > > + if (copy_to_user(&arg->devices, devices, > > > + hdr.count * sizeof(*devices))) > > > + ret = -EFAULT; > > > + } > > > + > > > + kfree(devices); > > > +out_unlock: > > > + mutex_unlock(&vdev->vdev.dev_set->lock); > > > + return ret; > > > +} > > > + > > > static int vfio_pci_ioctl_get_pci_hot_reset_info( > > > struct vfio_pci_core_device *vdev, > > > struct vfio_pci_hot_reset_info __user *arg) > > > @@ -1404,6 +1500,8 @@ long vfio_pci_core_ioctl(struct vfio_device > *core_vdev, unsigned int cmd, > > > return vfio_pci_ioctl_get_irq_info(vdev, uarg); > > > case VFIO_DEVICE_GET_PCI_HOT_RESET_INFO: > > > return vfio_pci_ioctl_get_pci_hot_reset_info(vdev, uarg); > > > + case VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO: > > > + return vfio_pci_ioctl_get_pci_hot_reset_group_info(vdev, > uarg); > > > case VFIO_DEVICE_GET_REGION_INFO: > > > return vfio_pci_ioctl_get_region_info(vdev, uarg); > > > case VFIO_DEVICE_IOEVENTFD: > > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h > > > index 25432ef213ee..61b801dfd40b 100644 > > > --- a/include/uapi/linux/vfio.h > > > +++ b/include/uapi/linux/vfio.h > > > @@ -669,6 +669,53 @@ struct vfio_pci_hot_reset_info { > > > > > > #define VFIO_DEVICE_GET_PCI_HOT_RESET_INFO _IO(VFIO_TYPE, > VFIO_BASE + 12) > > > > > > +/** > > > + * VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO - > _IOWR(VFIO_TYPE, VFIO_BASE + 12, > > > + * struct > vfio_pci_hot_reset_group_info) > > > + * > > > + * This is used in the vfio device cdev mode. It returns the list of > > > + * affected devices (represented by iommufd dev_id) when hot reset is > > > + * issued on the current device with which this ioctl is invoked. It > > > + * only includes the devices that are opened by the current user in the > > > + * time of this command is invoked. This list may change when user > opens > > > + * new device or close opened device, hence user should re-invoke it > > > + * after open/close devices. This command has no guarantee on the > result > > > + * of VFIO_DEVICE_PCI_HOT_RESET since the not-opened affected > device can > > > + * be by other users in the window between the two ioctls. If the > affected > > > + * devices are opened by multiple users, the > VFIO_DEVICE_PCI_HOT_RESET > > > + * shall fail, detail can check the description of > VFIO_DEVICE_PCI_HOT_RESET. > > > + * > > > + * For the users that open vfio group/container, this ioctl is useless as > > > + * they have no idea about the iommufd dev_id returned by this ioctl. > For > > > + * the users of the traditional mode vfio group/container, this ioctl will > > > + * fail as this mode does not use iommufd hence no dev_id to report > back. > > > + * For the users of the iommufd compat mode vfio group/container, > this ioctl > > > + * would succeed as this mode uses iommufd as container fd. But such > users > > > + * still have no idea about the iommufd dev_id as the dev_id is only > stored > > > + * in kernel in this mode. For the users of the vfio group/container, the > > > + * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO should be used to know > the hot reset > > > + * affected devices. > > > + * > > > + * Return: 0 on success, -errno on failure: > > > + * -enospc = insufficient buffer; > > > + * -enodev = unsupported for device; > > > + * -eperm = no permission for device, this error comes: > > > + * - when there are affected devices that are opened but > > > + * not bound to the same iommufd with the current device > > > + * with which this ioctl is invoked, > > > + * - there are affected devices that are not bound to vfio > > > + * driver yet. > > > + * - no valid iommufd is bound (e.g. noiommu mode) > > > + */ > > > +struct vfio_pci_hot_reset_group_info { > > > + __u32 argsz; > > > + __u32 flags; > > > + __u32 count; > > > + __u32 devices[]; > > > +}; > > > + > > > +#define VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO > _IO(VFIO_TYPE, VFIO_BASE + 18) > > > + > > > /** > > > * VFIO_DEVICE_PCI_HOT_RESET - _IOW(VFIO_TYPE, VFIO_BASE + 13, > > > * struct vfio_pci_hot_reset) > >
> From: Liu, Yi L <yi.l.liu@intel.com> > Sent: Tuesday, March 28, 2023 11:32 AM > > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Tuesday, March 28, 2023 3:26 AM > > > > Additionally, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO has a flags arg that > > isn't used, why do we need a new ioctl vs defining > > VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID. > > Sure. I can follow this suggestion. BTW. I have a doubt here. This new flag > is set by user. What if in the future kernel has new extensions and needs > to report something new to the user and add new flags to tell user? Such > flag is set by kernel. Then the flags field may have two kinds of flags (some > set by user while some set by kernel). Will it mess up the flags space? > flags in a GET_INFO ioctl is for output. if user needs to use flags as input to select different type of info then it should be split into multiple GET_INFO cmds.
On Mon, Mar 27, 2023 at 02:34:58AM -0700, Yi Liu wrote: > + devices = kcalloc(count, sizeof(*devices), GFP_KERNEL); > + if (!devices) { > + ret = -ENOMEM; > + goto reset_info_exit; > + } This doesn't need to be so complicated > + list_for_each_entry(cur, &vdev->vdev.dev_set->device_list, vdev.dev_set_list) { > + cur_iommufd = vfio_iommufd_physical_ictx(&cur->vdev); > + if (cur->vdev.open_count) { > + if (cur_iommufd != iommufd) { > + ret = -EPERM; > + break; > + } > + ret = vfio_iommufd_physical_devid(&cur->vdev, &devices[index]); u32 device; if (index >= hdr.count) return -ENOSPC; ret = vfio_iommufd_physical_devid(&cur->vdev, &devices); ... if (put_user(&arg->devices[index], device)) -EFAULT index++; Jason
On Tue, 28 Mar 2023 06:19:06 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote: > > From: Liu, Yi L <yi.l.liu@intel.com> > > Sent: Tuesday, March 28, 2023 11:32 AM > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > Sent: Tuesday, March 28, 2023 3:26 AM > > > > > > Additionally, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO has a flags arg that > > > isn't used, why do we need a new ioctl vs defining > > > VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID. > > > > Sure. I can follow this suggestion. BTW. I have a doubt here. This new flag > > is set by user. What if in the future kernel has new extensions and needs > > to report something new to the user and add new flags to tell user? Such > > flag is set by kernel. Then the flags field may have two kinds of flags (some > > set by user while some set by kernel). Will it mess up the flags space? > > > > flags in a GET_INFO ioctl is for output. > > if user needs to use flags as input to select different type of info then it should > be split into multiple GET_INFO cmds. I don't know that that's actually a rule, however we don't currently test flags is zero for input, so in this case I think we are stuck with it only being for output. Alternatively, should VFIO_DEVICE_GET_PCI_HOT_RESET_INFO automatically return the dev_id variant of the output and set a flag to indicate this is the case when called on a device fd opened as a cdev? Thanks, Alex
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Tuesday, March 28, 2023 10:26 PM > > On Tue, 28 Mar 2023 06:19:06 +0000 > "Tian, Kevin" <kevin.tian@intel.com> wrote: > > > > From: Liu, Yi L <yi.l.liu@intel.com> > > > Sent: Tuesday, March 28, 2023 11:32 AM > > > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > > Sent: Tuesday, March 28, 2023 3:26 AM > > > > > > > > Additionally, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO has a flags arg > that > > > > isn't used, why do we need a new ioctl vs defining > > > > VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID. > > > > > > Sure. I can follow this suggestion. BTW. I have a doubt here. This new > flag > > > is set by user. What if in the future kernel has new extensions and needs > > > to report something new to the user and add new flags to tell user? Such > > > flag is set by kernel. Then the flags field may have two kinds of flags > (some > > > set by user while some set by kernel). Will it mess up the flags space? > > > > > > > flags in a GET_INFO ioctl is for output. > > > > if user needs to use flags as input to select different type of info then it > should > > be split into multiple GET_INFO cmds. > > I don't know that that's actually a rule, however we don't currently > test flags is zero for input, so in this case I think we are stuck with > it only being for output. > > Alternatively, should VFIO_DEVICE_GET_PCI_HOT_RESET_INFO > automatically > return the dev_id variant of the output and set a flag to indicate this > is the case when called on a device fd opened as a cdev? Thanks, Personally I prefer that user asks for dev_id info explicitly. The major reason that we return dev_id is that the group/bdf info is not enough for the device fd passing case. But if qemu opens device by itself, the group/bdf info is still enough. So a device opened as a cdev doesn't mean it should return dev_id, it depends on if user has the bdf knowledge. Regards, Yi Liu
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Tuesday, March 28, 2023 8:40 PM > > On Mon, Mar 27, 2023 at 02:34:58AM -0700, Yi Liu wrote: > > > + devices = kcalloc(count, sizeof(*devices), GFP_KERNEL); > > + if (!devices) { > > + ret = -ENOMEM; > > + goto reset_info_exit; > > + } > > This doesn't need to be so complicated > > > + list_for_each_entry(cur, &vdev->vdev.dev_set->device_list, > vdev.dev_set_list) { > > + cur_iommufd = vfio_iommufd_physical_ictx(&cur->vdev); > > + if (cur->vdev.open_count) { > > + if (cur_iommufd != iommufd) { > > + ret = -EPERM; > > + break; > > + } > > + ret = vfio_iommufd_physical_devid(&cur->vdev, > &devices[index]); > > > u32 device; > > if (index >= hdr.count) > return -ENOSPC; > > ret = vfio_iommufd_physical_devid(&cur->vdev, &devices); Ok, so the whole point is that if cur->vdev->iommufd_ctx==null, then the vfio_iommufd_physical_devid() shall fail as well. > ... > > if (put_user(&arg->devices[index], device)) Will modify it. let's close the "separate ioctl" v.s. "reuse ioctl + new flag" open with Alex first. Thanks, Yi Liu > -EFAULT > > index++; > > Jason
On Tue, 28 Mar 2023 14:38:12 +0000 "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Tuesday, March 28, 2023 10:26 PM > > > > On Tue, 28 Mar 2023 06:19:06 +0000 > > "Tian, Kevin" <kevin.tian@intel.com> wrote: > > > > > > From: Liu, Yi L <yi.l.liu@intel.com> > > > > Sent: Tuesday, March 28, 2023 11:32 AM > > > > > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > > > Sent: Tuesday, March 28, 2023 3:26 AM > > > > > > > > > > Additionally, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO has a flags arg > > that > > > > > isn't used, why do we need a new ioctl vs defining > > > > > VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID. > > > > > > > > Sure. I can follow this suggestion. BTW. I have a doubt here. This new > > flag > > > > is set by user. What if in the future kernel has new extensions and needs > > > > to report something new to the user and add new flags to tell user? Such > > > > flag is set by kernel. Then the flags field may have two kinds of flags > > (some > > > > set by user while some set by kernel). Will it mess up the flags space? > > > > > > > > > > flags in a GET_INFO ioctl is for output. > > > > > > if user needs to use flags as input to select different type of info then it > > should > > > be split into multiple GET_INFO cmds. > > > > I don't know that that's actually a rule, however we don't currently > > test flags is zero for input, so in this case I think we are stuck with > > it only being for output. > > > > Alternatively, should VFIO_DEVICE_GET_PCI_HOT_RESET_INFO > > automatically > > return the dev_id variant of the output and set a flag to indicate this > > is the case when called on a device fd opened as a cdev? Thanks, > > Personally I prefer that user asks for dev_id info explicitly. The major reason > that we return dev_id is that the group/bdf info is not enough for the device > fd passing case. But if qemu opens device by itself, the group/bdf info is still > enough. So a device opened as a cdev doesn't mean it should return dev_id, > it depends on if user has the bdf knowledge. But if QEMU opens the cdev, vs getting it from the group, does it make any sense to return a set of group-ids + bdf in the host-reset info? I'm inclined to think the answer is no. Per my previous suggestion, I think we should always return the bdf. We can't know if the user is accessing through an fd they opened themselves or were passed, but it allows an actually useful debugging report if userspace can know "I can't perform a hot reset of this device because my iommufd context doesn't know about device <bdf>", vs an opaque -EPERM. Therefore I think we're only discussing the data conveyed in the u32, a group-id or dev_id. Thanks, Alex
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Tuesday, March 28, 2023 10:46 PM > > On Tue, 28 Mar 2023 14:38:12 +0000 > "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > Sent: Tuesday, March 28, 2023 10:26 PM > > > > > > On Tue, 28 Mar 2023 06:19:06 +0000 > > > "Tian, Kevin" <kevin.tian@intel.com> wrote: > > > > > > > > From: Liu, Yi L <yi.l.liu@intel.com> > > > > > Sent: Tuesday, March 28, 2023 11:32 AM > > > > > > > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > > > > Sent: Tuesday, March 28, 2023 3:26 AM > > > > > > > > > > > > Additionally, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO has a flags > arg > > > that > > > > > > isn't used, why do we need a new ioctl vs defining > > > > > > VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID. > > > > > > > > > > Sure. I can follow this suggestion. BTW. I have a doubt here. This > new > > > flag > > > > > is set by user. What if in the future kernel has new extensions and > needs > > > > > to report something new to the user and add new flags to tell user? > Such > > > > > flag is set by kernel. Then the flags field may have two kinds of flags > > > (some > > > > > set by user while some set by kernel). Will it mess up the flags space? > > > > > > > > > > > > > flags in a GET_INFO ioctl is for output. > > > > > > > > if user needs to use flags as input to select different type of info then it > > > should > > > > be split into multiple GET_INFO cmds. > > > > > > I don't know that that's actually a rule, however we don't currently > > > test flags is zero for input, so in this case I think we are stuck with > > > it only being for output. > > > > > > Alternatively, should VFIO_DEVICE_GET_PCI_HOT_RESET_INFO > > > automatically > > > return the dev_id variant of the output and set a flag to indicate this > > > is the case when called on a device fd opened as a cdev? Thanks, > > > > Personally I prefer that user asks for dev_id info explicitly. The major > reason > > that we return dev_id is that the group/bdf info is not enough for the > device > > fd passing case. But if qemu opens device by itself, the group/bdf info is > still > > enough. So a device opened as a cdev doesn't mean it should return > dev_id, > > it depends on if user has the bdf knowledge. > > But if QEMU opens the cdev, vs getting it from the group, does it make > any sense to return a set of group-ids + bdf in the host-reset info? > I'm inclined to think the answer is no. > > Per my previous suggestion, I think we should always return the bdf. We > can't know if the user is accessing through an fd they opened > themselves or were passed, Oh, yes. I'm convinced by this reason since only cdev mode supports device fd passing. So I'll reuse the existing _INFO and let kernel set a flag to mark the returned info is dev_id+bdf. A check. If the device that the _INFIO is invoked is opened via cdev, but there are devices in the dev_set that are got via VFIO_GROUP_GET_DEVICE_FD, should I fail it or allow it? > but it allows an actually useful debugging > report if userspace can know "I can't perform a hot reset of this > device because my iommufd context doesn't know about device <bdf>", vs > an opaque -EPERM. Therefore I think we're only discussing the data > conveyed in the u32, a group-id or dev_id. Thanks, Sure. Regards, Yi Liu
On Tue, 28 Mar 2023 15:00:42 +0000 "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Tuesday, March 28, 2023 10:46 PM > > > > On Tue, 28 Mar 2023 14:38:12 +0000 > > "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > > Sent: Tuesday, March 28, 2023 10:26 PM > > > > > > > > On Tue, 28 Mar 2023 06:19:06 +0000 > > > > "Tian, Kevin" <kevin.tian@intel.com> wrote: > > > > > > > > > > From: Liu, Yi L <yi.l.liu@intel.com> > > > > > > Sent: Tuesday, March 28, 2023 11:32 AM > > > > > > > > > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > > > > > Sent: Tuesday, March 28, 2023 3:26 AM > > > > > > > > > > > > > > Additionally, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO has a flags > > arg > > > > that > > > > > > > isn't used, why do we need a new ioctl vs defining > > > > > > > VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID. > > > > > > > > > > > > Sure. I can follow this suggestion. BTW. I have a doubt here. This > > new > > > > flag > > > > > > is set by user. What if in the future kernel has new extensions and > > needs > > > > > > to report something new to the user and add new flags to tell user? > > Such > > > > > > flag is set by kernel. Then the flags field may have two kinds of flags > > > > (some > > > > > > set by user while some set by kernel). Will it mess up the flags space? > > > > > > > > > > > > > > > > flags in a GET_INFO ioctl is for output. > > > > > > > > > > if user needs to use flags as input to select different type of info then it > > > > should > > > > > be split into multiple GET_INFO cmds. > > > > > > > > I don't know that that's actually a rule, however we don't currently > > > > test flags is zero for input, so in this case I think we are stuck with > > > > it only being for output. > > > > > > > > Alternatively, should VFIO_DEVICE_GET_PCI_HOT_RESET_INFO > > > > automatically > > > > return the dev_id variant of the output and set a flag to indicate this > > > > is the case when called on a device fd opened as a cdev? Thanks, > > > > > > Personally I prefer that user asks for dev_id info explicitly. The major > > reason > > > that we return dev_id is that the group/bdf info is not enough for the > > device > > > fd passing case. But if qemu opens device by itself, the group/bdf info is > > still > > > enough. So a device opened as a cdev doesn't mean it should return > > dev_id, > > > it depends on if user has the bdf knowledge. > > > > But if QEMU opens the cdev, vs getting it from the group, does it make > > any sense to return a set of group-ids + bdf in the host-reset info? > > I'm inclined to think the answer is no. > > > > Per my previous suggestion, I think we should always return the bdf. We > > can't know if the user is accessing through an fd they opened > > themselves or were passed, > > Oh, yes. I'm convinced by this reason since only cdev mode supports device fd > passing. So I'll reuse the existing _INFO and let kernel set a flag to mark the returned > info is dev_id+bdf. > > A check. If the device that the _INFIO is invoked is opened via cdev, but there > are devices in the dev_set that are got via VFIO_GROUP_GET_DEVICE_FD, should > I fail it or allow it? It's a niche case, but I think it needs to be allowed. We'd still report the bdf for those devices, but make use of the invalid/null dev-id. I think this empowers userspace that they could make the same call on a group opened fd if necessary. An alternative would be to redefine the returned data structure entirely with a flag per entry describing the output, but then I think we need to invent a kernel policy of which gets reported, which seems overly complicated if our goal is to phase out group usage. Make sense, or will this bite us? Thanks, Alex
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Tuesday, March 28, 2023 11:18 PM > > On Tue, 28 Mar 2023 15:00:42 +0000 > "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > Sent: Tuesday, March 28, 2023 10:46 PM > > > > > > On Tue, 28 Mar 2023 14:38:12 +0000 > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > > > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > > > Sent: Tuesday, March 28, 2023 10:26 PM > > > > > > > > > > On Tue, 28 Mar 2023 06:19:06 +0000 > > > > > "Tian, Kevin" <kevin.tian@intel.com> wrote: > > > > > > > > > > > > From: Liu, Yi L <yi.l.liu@intel.com> > > > > > > > Sent: Tuesday, March 28, 2023 11:32 AM > > > > > > > > > > > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > > > > > > Sent: Tuesday, March 28, 2023 3:26 AM > > > > > > > > > > > > > > > > Additionally, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO has a > flags > > > arg > > > > > that > > > > > > > > isn't used, why do we need a new ioctl vs defining > > > > > > > > VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID. > > > > > > > > > > > > > > Sure. I can follow this suggestion. BTW. I have a doubt here. This > > > new > > > > > flag > > > > > > > is set by user. What if in the future kernel has new extensions and > > > needs > > > > > > > to report something new to the user and add new flags to tell > user? > > > Such > > > > > > > flag is set by kernel. Then the flags field may have two kinds of > flags > > > > > (some > > > > > > > set by user while some set by kernel). Will it mess up the flags > space? > > > > > > > > > > > > > > > > > > > flags in a GET_INFO ioctl is for output. > > > > > > > > > > > > if user needs to use flags as input to select different type of info > then it > > > > > should > > > > > > be split into multiple GET_INFO cmds. > > > > > > > > > > I don't know that that's actually a rule, however we don't currently > > > > > test flags is zero for input, so in this case I think we are stuck with > > > > > it only being for output. > > > > > > > > > > Alternatively, should VFIO_DEVICE_GET_PCI_HOT_RESET_INFO > > > > > automatically > > > > > return the dev_id variant of the output and set a flag to indicate this > > > > > is the case when called on a device fd opened as a cdev? Thanks, > > > > > > > > Personally I prefer that user asks for dev_id info explicitly. The major > > > reason > > > > that we return dev_id is that the group/bdf info is not enough for the > > > device > > > > fd passing case. But if qemu opens device by itself, the group/bdf info > is > > > still > > > > enough. So a device opened as a cdev doesn't mean it should return > > > dev_id, > > > > it depends on if user has the bdf knowledge. > > > > > > But if QEMU opens the cdev, vs getting it from the group, does it make > > > any sense to return a set of group-ids + bdf in the host-reset info? > > > I'm inclined to think the answer is no. > > > > > > Per my previous suggestion, I think we should always return the bdf. We > > > can't know if the user is accessing through an fd they opened > > > themselves or were passed, > > > > Oh, yes. I'm convinced by this reason since only cdev mode supports > device fd > > passing. So I'll reuse the existing _INFO and let kernel set a flag to mark > the returned > > info is dev_id+bdf. > > > > A check. If the device that the _INFIO is invoked is opened via cdev, but > there > > are devices in the dev_set that are got via VFIO_GROUP_GET_DEVICE_FD, > should > > I fail it or allow it? > > It's a niche case, but I think it needs to be allowed. I'm also wondering if it is common in the future. Actually, a user should be preferred to either use the group or cdev, but not both. Otherwise, it looks to be bad programming.:-) Also, as an earlier remark from Jason. If there are affected devices that are opened by other users, the new _INFO should fail with -EPERM. I know this remark was for the new _INFO ioctl. But now, we are going to reuse the existing _INFO, so I'd also want to check if we still need this policy? If yes, then it is a problem to check the owner of the devices that are opened by the group path. https://lore.kernel.org/kvm/ZBsF950laMs2ldFc@nvidia.com/ > We'd still > report the bdf for those devices, but make use of the invalid/null > dev-id. I think this empowers userspace that they could make the same > call on a group opened fd if necessary. For the devices opened via group path, it should have an entry that includes invalid_dev_id+bdf. So user can map it to the device. But there is no group_id, this may be fine since group is just a shortcut to find the device. Is it? > An alternative would be to > redefine the returned data structure entirely with a flag per entry > describing the output, but then I think we need to invent a kernel > policy of which gets reported, which seems overly complicated if our > goal is to phase out group usage. Make sense, or will this bite us? This does smell complicated.
On Tue, 28 Mar 2023 15:45:38 +0000 "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Tuesday, March 28, 2023 11:18 PM > > > > On Tue, 28 Mar 2023 15:00:42 +0000 > > "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > > Sent: Tuesday, March 28, 2023 10:46 PM > > > > > > > > On Tue, 28 Mar 2023 14:38:12 +0000 > > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > > > > > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > > > > Sent: Tuesday, March 28, 2023 10:26 PM > > > > > > > > > > > > On Tue, 28 Mar 2023 06:19:06 +0000 > > > > > > "Tian, Kevin" <kevin.tian@intel.com> wrote: > > > > > > > > > > > > > > From: Liu, Yi L <yi.l.liu@intel.com> > > > > > > > > Sent: Tuesday, March 28, 2023 11:32 AM > > > > > > > > > > > > > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > > > > > > > Sent: Tuesday, March 28, 2023 3:26 AM > > > > > > > > > > > > > > > > > > Additionally, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO has a > > flags > > > > arg > > > > > > that > > > > > > > > > isn't used, why do we need a new ioctl vs defining > > > > > > > > > VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID. > > > > > > > > > > > > > > > > Sure. I can follow this suggestion. BTW. I have a doubt here. This > > > > new > > > > > > flag > > > > > > > > is set by user. What if in the future kernel has new extensions and > > > > needs > > > > > > > > to report something new to the user and add new flags to tell > > user? > > > > Such > > > > > > > > flag is set by kernel. Then the flags field may have two kinds of > > flags > > > > > > (some > > > > > > > > set by user while some set by kernel). Will it mess up the flags > > space? > > > > > > > > > > > > > > > > > > > > > > flags in a GET_INFO ioctl is for output. > > > > > > > > > > > > > > if user needs to use flags as input to select different type of info > > then it > > > > > > should > > > > > > > be split into multiple GET_INFO cmds. > > > > > > > > > > > > I don't know that that's actually a rule, however we don't currently > > > > > > test flags is zero for input, so in this case I think we are stuck with > > > > > > it only being for output. > > > > > > > > > > > > Alternatively, should VFIO_DEVICE_GET_PCI_HOT_RESET_INFO > > > > > > automatically > > > > > > return the dev_id variant of the output and set a flag to indicate this > > > > > > is the case when called on a device fd opened as a cdev? Thanks, > > > > > > > > > > Personally I prefer that user asks for dev_id info explicitly. The major > > > > reason > > > > > that we return dev_id is that the group/bdf info is not enough for the > > > > device > > > > > fd passing case. But if qemu opens device by itself, the group/bdf info > > is > > > > still > > > > > enough. So a device opened as a cdev doesn't mean it should return > > > > dev_id, > > > > > it depends on if user has the bdf knowledge. > > > > > > > > But if QEMU opens the cdev, vs getting it from the group, does it make > > > > any sense to return a set of group-ids + bdf in the host-reset info? > > > > I'm inclined to think the answer is no. > > > > > > > > Per my previous suggestion, I think we should always return the bdf. We > > > > can't know if the user is accessing through an fd they opened > > > > themselves or were passed, > > > > > > Oh, yes. I'm convinced by this reason since only cdev mode supports > > device fd > > > passing. So I'll reuse the existing _INFO and let kernel set a flag to mark > > the returned > > > info is dev_id+bdf. > > > > > > A check. If the device that the _INFIO is invoked is opened via cdev, but > > there > > > are devices in the dev_set that are got via VFIO_GROUP_GET_DEVICE_FD, > > should > > > I fail it or allow it? > > > > It's a niche case, but I think it needs to be allowed. > > I'm also wondering if it is common in the future. Actually, a user should be > preferred to either use the group or cdev, but not both. Otherwise, it looks > to be bad programming.:-) > > Also, as an earlier remark from Jason. If there are affected devices that are > opened by other users, the new _INFO should fail with -EPERM. I know this > remark was for the new _INFO ioctl. But now, we are going to reuse the > existing _INFO, so I'd also want to check if we still need this policy? If yes, > then it is a problem to check the owner of the devices that are opened by > the group path. > > https://lore.kernel.org/kvm/ZBsF950laMs2ldFc@nvidia.com/ Personally I don't like the suggestion to fail with -EPERM if the user doesn't own all the affected devices. This isn't a "probe if I can do a reset" ioctl, it's a "provide information about the devices affected by a reset to know how to call the hot-reset ioctl". We're returning the bdf to the cdev version of this ioctl for exactly this debugging purpose when the devices are not owned, that becomes useless if we give up an return -EPERM if ownership doesn't align. > > We'd still > > report the bdf for those devices, but make use of the invalid/null > > dev-id. I think this empowers userspace that they could make the same > > call on a group opened fd if necessary. > > For the devices opened via group path, it should have an entry that > includes invalid_dev_id+bdf. So user can map it to the device. But > there is no group_id, this may be fine since group is just a shortcut > to find the device. Is it? Yes, it could be argued that the group-id itself is superfluous, the user can determine the group via the bdf, but it also aligns with the hot-reset ioctl, which currently requires the group fd. Thanks, Alex
On Tue, Mar 28, 2023 at 09:18:01AM -0600, Alex Williamson wrote: > It's a niche case, but I think it needs to be allowed. We'd still > report the bdf for those devices, but make use of the invalid/null > dev-id. IDK, it makes the whole implementation much more complicated. Instead of just copying the current dev_set to the output and calling vfio_pci_dev_set_resettable() we need to do something more complex.. Keeping the current ioctl as-is means this IOCTL can be used to do any debugging by getting the actual BDF list. It means we can make the a new ioctl simple and just return the dev_id array without these edge complications. I don't think merging two different ioctls is helping make things simple.. It seems like it does what qemu wants: call the new IOCTL, if it fails, call the old IOCTL and print out the BDF list to help debug and then exit. On success use the data in the new ioctl to generate the machine configuration to pass the reset grouping into the VM. When reset actually comes in just trigger it. Jason
On Tue, 28 Mar 2023 13:29:23 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Tue, Mar 28, 2023 at 09:18:01AM -0600, Alex Williamson wrote: > > > It's a niche case, but I think it needs to be allowed. We'd still > > report the bdf for those devices, but make use of the invalid/null > > dev-id. > > IDK, it makes the whole implementation much more complicated. Instead > of just copying the current dev_set to the output and calling > vfio_pci_dev_set_resettable() we need to do something more complex.. > > Keeping the current ioctl as-is means this IOCTL can be used to do any > debugging by getting the actual BDF list. > > It means we can make the a new ioctl simple and just return the dev_id > array without these edge complications. I don't think merging two > different ioctls is helping make things simple.. OTOH, we already have an ioctl that essentially "does the right thing", we just want to swap out one return field for another. So which is more complicated, adding another ioctl that does not quite the same thing but still needing to maintain the old ioctl for detailed information, or making the old ioctl bi-modal to return the appropriate information for the type of device used to access it? > It seems like it does what qemu wants: call the new IOCTL, if it > fails, call the old IOCTL and print out the BDF list to help debug and > then exit. Userspace is already dealing with a variable length array for the return value, why would it ever want to repeat that process to get debugging info. Besides, wouldn't QEMU prefer the similarity of making the same call for groups and cdev and simply keying on the data type of one field? > On success use the data in the new ioctl to generate the machine > configuration to pass the reset grouping into the VM. > > When reset actually comes in just trigger it. "Just trigger it" is the same in either case. It seems bold to play the complexity argument when we already have a function that does 90% the correct thing where we can share much of the implementation and userspace code without duplicating, but still relying on a legacy interface for debugging. Thanks, Alex
On Tue, Mar 28, 2023 at 01:09:49PM -0600, Alex Williamson wrote: > "Just trigger it" is the same in either case. It seems bold to play > the complexity argument when we already have a function that does 90% > the correct thing where we can share much of the implementation and > userspace code without duplicating, but still relying on a legacy > interface for debugging. It just doesn't seem like we are sharing a lot, this patch is a whole new function and you are saying the unique implementation needs to be more complex still. Maybe the next version will be able to share more?? Like can we just patch the existing code that sets the group_id/dev_id or somethiing??? Still, I'm not sure that qemu will share really share much here, because if it runs in group mode then it has to parse the result in a totally different way than if it runs in dev_id mode. The ioctl call is only a line or two. I imagine qemu will end up with two different functions that both return some kind of internal list of qemu devices in the reset group. Jason
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Wednesday, March 29, 2023 12:00 AM > [...] > > > > A check. If the device that the _INFIO is invoked is opened via cdev, > but > > > there > > > > are devices in the dev_set that are got via > VFIO_GROUP_GET_DEVICE_FD, > > > should > > > > I fail it or allow it? > > > > > > It's a niche case, but I think it needs to be allowed. > > > > I'm also wondering if it is common in the future. Actually, a user should be > > preferred to either use the group or cdev, but not both. Otherwise, it > looks > > to be bad programming.:-) > > > > Also, as an earlier remark from Jason. If there are affected devices that > are > > opened by other users, the new _INFO should fail with -EPERM. I know > this > > remark was for the new _INFO ioctl. But now, we are going to reuse the > > existing _INFO, so I'd also want to check if we still need this policy? If yes, > > then it is a problem to check the owner of the devices that are opened by > > the group path. > > > > https://lore.kernel.org/kvm/ZBsF950laMs2ldFc@nvidia.com/ > > Personally I don't like the suggestion to fail with -EPERM if the user > doesn't own all the affected devices. This isn't a "probe if I can do > a reset" ioctl, it's a "provide information about the devices affected > by a reset to know how to call the hot-reset ioctl". We're returning > the bdf to the cdev version of this ioctl for exactly this debugging > purpose when the devices are not owned, that becomes useless if we give > up an return -EPERM if ownership doesn't align. Jason's suggestion makes sense for returning the case of returning dev_id as dev_id is local to iommufd. If there are devices in the same dev_set are opened by multiple users, multiple iommufd would be used. Then the dev_id would have overlap. e.g. a dev_set has three devices. Device A and B are opened by the current user as cdev, dev_id #1 and #2 are generated. While device C opened by another user as cdev, dev_id #n is generated for it. If dev_id #n happens to be #1, then user gets two info entries that have the same dev_id. I know the existing _INFO does not have such policy since the group/bdf info are unique in the system. But for the dev_id case, seems really necessary to fail if the dev_set is not only opened by one user. From this angle, will it be good to have two ioctls. Sorry for the back-forth here. ☹ > > > We'd still > > > report the bdf for those devices, but make use of the invalid/null > > > dev-id. I think this empowers userspace that they could make the same > > > call on a group opened fd if necessary. > > > > For the devices opened via group path, it should have an entry that > > includes invalid_dev_id+bdf. So user can map it to the device. But > > there is no group_id, this may be fine since group is just a shortcut > > to find the device. Is it? > > Yes, it could be argued that the group-id itself is superfluous, the > user can determine the group via the bdf, but it also aligns with the > hot-reset ioctl, which currently requires the group fd. Thanks, I see. For the existing _INFO and HOT_RESET ioctl, I have no doubt. Both of them use the group as a short-cut. Regards, Yi Liu
> From: Liu, Yi L <yi.l.liu@intel.com> > Sent: Wednesday, March 29, 2023 11:14 AM > > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Wednesday, March 29, 2023 12:00 AM > > > > > > Personally I don't like the suggestion to fail with -EPERM if the user > > doesn't own all the affected devices. This isn't a "probe if I can do > > a reset" ioctl, it's a "provide information about the devices affected > > by a reset to know how to call the hot-reset ioctl". We're returning > > the bdf to the cdev version of this ioctl for exactly this debugging > > purpose when the devices are not owned, that becomes useless if we give > > up an return -EPERM if ownership doesn't align. > > Jason's suggestion makes sense for returning the case of returning dev_id > as dev_id is local to iommufd. If there are devices in the same dev_set are > opened by multiple users, multiple iommufd would be used. Then the > dev_id would have overlap. e.g. a dev_set has three devices. Device A and > B are opened by the current user as cdev, dev_id #1 and #2 are generated. > While device C opened by another user as cdev, dev_id #n is generated for > it. If dev_id #n happens to be #1, then user gets two info entries that have > the same dev_id. > In Alex's proposal you'll set a invalid dev_id for device C so the user can still get the info for diagnostic purpose instead of seeing an -EPERM error. btw I found an open about fd pass scheme which may affect the choice here. In concept even with cdev we still expect the userspace to maintain the group knowledge so it won't inadvertently attempt to assign devices in the same group to different IOAS's. It also needs such knowledge when constructing guest topology. with fd passed in Qemu has no way to associate the fd to a group. We could extend bind_iommufd to return the group id or introduce a new ioctl to query it per dev_id. Once that is in place looks we don't need a new _INFO ioctl? Thanks Kevin
On Wed, 29 Mar 2023 09:41:26 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote: > > From: Liu, Yi L <yi.l.liu@intel.com> > > Sent: Wednesday, March 29, 2023 11:14 AM > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > Sent: Wednesday, March 29, 2023 12:00 AM > > > > > > > > > Personally I don't like the suggestion to fail with -EPERM if the user > > > doesn't own all the affected devices. This isn't a "probe if I can do > > > a reset" ioctl, it's a "provide information about the devices affected > > > by a reset to know how to call the hot-reset ioctl". We're returning > > > the bdf to the cdev version of this ioctl for exactly this debugging > > > purpose when the devices are not owned, that becomes useless if we give > > > up an return -EPERM if ownership doesn't align. > > > > Jason's suggestion makes sense for returning the case of returning dev_id > > as dev_id is local to iommufd. If there are devices in the same dev_set are > > opened by multiple users, multiple iommufd would be used. Then the > > dev_id would have overlap. e.g. a dev_set has three devices. Device A and > > B are opened by the current user as cdev, dev_id #1 and #2 are generated. > > While device C opened by another user as cdev, dev_id #n is generated for > > it. If dev_id #n happens to be #1, then user gets two info entries that have > > the same dev_id. > > > > In Alex's proposal you'll set a invalid dev_id for device C so the user can > still get the info for diagnostic purpose instead of seeing an -EPERM error. Yes, we shouldn't be reporting dev_ids outside of the user's iommufd context. > btw I found an open about fd pass scheme which may affect the choice here. > > In concept even with cdev we still expect the userspace to maintain the > group knowledge so it won't inadvertently attempt to assign devices in > the same group to different IOAS's. It also needs such knowledge when > constructing guest topology. > > with fd passed in Qemu has no way to associate the fd to a group. Hmm, QEMU tries to get the group for the device address space in the guest, so finding an existing group with a different address space indeed allows QEMU to know of this conflict since the group is the fundamental unit IOMMU context in the legacy vfio model. > We could extend bind_iommufd to return the group id or introduce a > new ioctl to query it per dev_id. That would be ironic to go to all this trouble to remove groups from the API only to have them show up here. But with a cdev interface, don't we break that model of conflating isolation and address-ability? For example, devices within a group cannot be bound to separate iommufds due to lack of isolation, which is handled via DMA ownership, but barring DMA aliasing issues, due to conventional PCI buses or quirks, cdev could allow devices within the same group to be managed by separate IOAS's. So the group information really isn't enough for userspace to infer address space restrictions with cdev anyway. Therefore aren't we expecting this to be denied at attach_ioas() and QEMU shouldn't be making these sorts of assumptions for cdev anyway? Thanks, Alex
On Wed, Mar 29, 2023 at 09:41:26AM +0000, Tian, Kevin wrote: > We could extend bind_iommufd to return the group id or introduce a > new ioctl to query it per dev_id. > Once that is in place looks we don't need a new _INFO ioctl? The iommu_group and the reset group are different things The issue is processing the BDF strings, not the group ID. Probably we should have some way for iommufd to report the group_id from the dev_id? Jason
On Wed, Mar 29, 2023 at 09:49:44AM -0600, Alex Williamson wrote: > > We could extend bind_iommufd to return the group id or introduce a > > new ioctl to query it per dev_id. > > That would be ironic to go to all this trouble to remove groups from > the API only to have them show up here. Groups always had to be part of the API for advanced cases like qemu - the point was to make them a small side bit of information not front and center in control of everything. > For example, devices within a group cannot be bound to separate > iommufds due to lack of isolation, which is handled via DMA ownership, > but barring DMA aliasing issues, due to conventional PCI buses or > quirks, cdev could allow devices within the same group to be managed by > separate IOAS's. Maybe some future kernel could do this, the API allows it at least.. > So the group information really isn't enough for > userspace to infer address space restrictions with cdev anyway. > > Therefore aren't we expecting this to be denied at attach_ioas() and > QEMU shouldn't be making these sorts of assumptions for cdev anyway? I guess we could make an API specifically to report same-iommu_domina information? I was assuming qemu would use the group for now as I don't see a likely future when we would relax that restriction.. So I was keeping a "add it when we need it" attitude here. Jason
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Wednesday, March 29, 2023 11:50 PM > > On Wed, Mar 29, 2023 at 09:41:26AM +0000, Tian, Kevin wrote: > > > We could extend bind_iommufd to return the group id or introduce a > > new ioctl to query it per dev_id. > > > Once that is in place looks we don't need a new _INFO ioctl? > > The iommu_group and the reset group are different things > > The issue is processing the BDF strings, not the group ID. > > Probably we should have some way for iommufd to report the group_id > from the dev_id? > Yes, that is my thought. Though iommu_group and reset group are different things we could still leverage existing _INFO ioctl once there is a way to associated dev_id to group_id.
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Wednesday, March 29, 2023 11:58 PM > > On Wed, Mar 29, 2023 at 09:49:44AM -0600, Alex Williamson wrote: > > > > We could extend bind_iommufd to return the group id or introduce a > > > new ioctl to query it per dev_id. > > > > That would be ironic to go to all this trouble to remove groups from > > the API only to have them show up here. > > Groups always had to be part of the API for advanced cases like qemu - > the point was to make them a small side bit of information not front > and center in control of everything. Agree. > > > For example, devices within a group cannot be bound to separate > > iommufds due to lack of isolation, which is handled via DMA ownership, > > but barring DMA aliasing issues, due to conventional PCI buses or > > quirks, cdev could allow devices within the same group to be managed by > > separate IOAS's. > > Maybe some future kernel could do this, the API allows it at least.. > > > So the group information really isn't enough for > > userspace to infer address space restrictions with cdev anyway. > > > > Therefore aren't we expecting this to be denied at attach_ioas() and > > QEMU shouldn't be making these sorts of assumptions for cdev anyway? > > I guess we could make an API specifically to report same-iommu_domina > information? > > I was assuming qemu would use the group for now as I don't see a > likely future when we would relax that restriction.. So I was keeping > a "add it when we need it" attitude here. > IIRC we discussed this subgroup concept in the thread of reviewing my high level design proposal 2yrs ago. The consensus at the moment was that subgroup is architecturally allowed w/o DMA aliasing issues but we're yet to see a real demand of relaxing current group restriction to support it. Also with time moving newer platforms should have less multi-devices group so the need of subgroup is further decreased. So I'm also inclined to laying the existing group restriction with cdev for now. Then can we make a decision how this group_id might be reported? In nesting series we'll have a GET_INFO ioctl per dev_id. It could be extended to report group_id too. Or alternatively just return it in BIND_IOMMUFD together with dev_id.
> From: Tian, Kevin <kevin.tian@intel.com> > Sent: Thursday, March 30, 2023 9:10 AM > > > From: Jason Gunthorpe <jgg@nvidia.com> > > Sent: Wednesday, March 29, 2023 11:50 PM > > > > On Wed, Mar 29, 2023 at 09:41:26AM +0000, Tian, Kevin wrote: > > > > > We could extend bind_iommufd to return the group id or introduce a > > > new ioctl to query it per dev_id. > > > > > Once that is in place looks we don't need a new _INFO ioctl? > > > > The iommu_group and the reset group are different things > > > > The issue is processing the BDF strings, not the group ID. > > > > Probably we should have some way for iommufd to report the group_id > > from the dev_id? > > > > Yes, that is my thought. Though iommu_group and reset group are > different things we could still leverage existing _INFO ioctl once there > is a way to associated dev_id to group_id. Please ignore this comment. Yes they are different things so even if a dev_id is in a group_id reported on a reset BDF string it doesn't mean this dev_id is in the reset group. Qemu can know that all affected devices are either owned by itself or not used by other processes if dev_id's opened by itself can be associated to all group_id's reported in the BDF strings. But it still lacks of information to tell the reset dependency within those opened devices within Qemu. So we do need a new _INFO ioctl for cdev. :/
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Wednesday, March 29, 2023 11:50 PM > > On Wed, 29 Mar 2023 09:41:26 +0000 > "Tian, Kevin" <kevin.tian@intel.com> wrote: > > > > From: Liu, Yi L <yi.l.liu@intel.com> > > > Sent: Wednesday, March 29, 2023 11:14 AM > > > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > > Sent: Wednesday, March 29, 2023 12:00 AM > > > > > > > > > > > > Personally I don't like the suggestion to fail with -EPERM if the user > > > > doesn't own all the affected devices. This isn't a "probe if I can do > > > > a reset" ioctl, it's a "provide information about the devices affected > > > > by a reset to know how to call the hot-reset ioctl". We're returning > > > > the bdf to the cdev version of this ioctl for exactly this debugging > > > > purpose when the devices are not owned, that becomes useless if we give > > > > up an return -EPERM if ownership doesn't align. > > > > > > Jason's suggestion makes sense for returning the case of returning dev_id > > > as dev_id is local to iommufd. If there are devices in the same dev_set are > > > opened by multiple users, multiple iommufd would be used. Then the > > > dev_id would have overlap. e.g. a dev_set has three devices. Device A and > > > B are opened by the current user as cdev, dev_id #1 and #2 are generated. > > > While device C opened by another user as cdev, dev_id #n is generated for > > > it. If dev_id #n happens to be #1, then user gets two info entries that have > > > the same dev_id. > > > > > > > In Alex's proposal you'll set a invalid dev_id for device C so the user can > > still get the info for diagnostic purpose instead of seeing an -EPERM error. > > Yes, we shouldn't be reporting dev_ids outside of the user's iommufd > context. Following Alex's suggestion, here are two commits to extend existing _INFO to report dev_id. From ad5c81366813c5effd707a0b5f5e797f5fdb3115 Mon Sep 17 00:00:00 2001 From: Yi Liu <yi.l.liu@intel.com> Date: Thu, 30 Mar 2023 05:29:36 -0700 Subject: [PATCH] vfio: Mark cdev usage in vfio_device There are users that needs to check if vfio_device is opened as cdev. e.g. vfio-pci. Signed-off-by: Yi Liu <yi.l.liu@intel.com> --- drivers/vfio/device_cdev.c | 2 ++ include/linux/vfio.h | 14 ++++++++++++++ 2 files changed, 16 insertions(+) diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c index b5de997bff6d..56f3bbe34680 100644 --- a/drivers/vfio/device_cdev.c +++ b/drivers/vfio/device_cdev.c @@ -66,6 +66,7 @@ void vfio_device_cdev_close(struct vfio_device_file *df) return; mutex_lock(&device->dev_set->lock); + device->cdev_opened = false; vfio_device_close(df); vfio_device_put_kvm(device); if (df->iommufd) @@ -180,6 +181,7 @@ long vfio_device_ioctl_bind_iommufd(struct vfio_device_file *df, * read/write/mmap */ smp_store_release(&df->access_granted, true); + device->cdev_opened = true; mutex_unlock(&device->dev_set->lock); return 0; diff --git a/include/linux/vfio.h b/include/linux/vfio.h index 1367605d617c..86efc1640940 100644 --- a/include/linux/vfio.h +++ b/include/linux/vfio.h @@ -58,6 +58,7 @@ struct vfio_device { struct device device; /* device.kref covers object life circle */ #if IS_ENABLED(CONFIG_VFIO_DEVICE_CDEV) struct cdev cdev; + bool cdev_opened; #endif refcount_t refcount; /* user count on registered device*/ unsigned int open_count; @@ -167,6 +168,19 @@ static inline int vfio_iommufd_physical_devid(struct vfio_device *vdev, u32 *id) ((void (*)(struct vfio_device *vdev)) NULL) #endif +#if IS_ENABLED(CONFIG_VFIO_DEVICE_CDEV) +static inline bool vfio_device_cdev_opened(struct vfio_device *device) +{ + lockdep_assert_held(&device->dev_set->lock); + return device->cdev_opened; +} +#else +static inline bool vfio_device_cdev_opened(struct vfio_device *device) +{ + return false; +} +#endif + /** * @migration_set_state: Optional callback to change the migration state for * devices that support migration. It's mandatory for
> From: Liu, Yi L <yi.l.liu@intel.com> > Sent: Thursday, March 30, 2023 8:48 PM > if (fill->cur == fill->max) > return -EAGAIN; /* Something changed, try again */ > @@ -791,7 +812,24 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data) > if (!iommu_group) > return -EPERM; /* Cannot reset non-isolated devices */ > > - fill->devices[fill->cur].group_id = iommu_group_id(iommu_group); > + /* > + * If dev_id is needed, fill in the dev_id field, otherwise > + * fill in group_id. > + */ > + if (fill->require_devid) { > + /* > + * Report the devices that are opened as cdev and have > + * the same iommufd with the fill->iommufd. Otherwise, > + * just fill in an IOMMUFD_INVALID_ID. > + */ > + vdev = vfio_pci_find_device_in_devset(dev_set, pdev); > + if (vdev && !vfio_device_cdev_opened(vdev) && a typo..it should be if (vdev && vfio_device_cdev_opened(vdev) && > + fill->iommufd == vfio_iommufd_physical_ictx(vdev)) > + vfio_iommufd_physical_devid(vdev, &fill->devices[fill- > >cur].dev_id); > + fill->devices[fill->cur].dev_id = IOMMUFD_INVALID_ID; > + } else { > + fill->devices[fill->cur].group_id = iommu_group_id(iommu_group); > + } > fill->devices[fill->cur].segment = pci_domain_nr(pdev->bus); > fill->devices[fill->cur].bus = pdev->bus->number; > fill->devices[fill->cur].devfn = pdev->devfn; Regards, Yi Liu
On Thu, Mar 30, 2023 at 01:17:03AM +0000, Tian, Kevin wrote: > In nesting series we'll have a GET_INFO ioctl per dev_id. It could be > extended to report group_id too. I like the GET_INFO because it would work with VDPA, is it possible? Jason
On Thu, Mar 30, 2023 at 12:48:03PM +0000, Liu, Yi L wrote: > + /* > + * If dev_id is needed, fill in the dev_id field, otherwise > + * fill in group_id. > + */ > + if (fill->require_devid) { > + /* > + * Report the devices that are opened as cdev and have > + * the same iommufd with the fill->iommufd. Otherwise, > + * just fill in an IOMMUFD_INVALID_ID. > + */ > + vdev = vfio_pci_find_device_in_devset(dev_set, pdev); > + if (vdev && !vfio_device_cdev_opened(vdev) && > + fill->iommufd == vfio_iommufd_physical_ictx(vdev)) > + vfio_iommufd_physical_devid(vdev, &fill->devices[fill->cur].dev_id); > + fill->devices[fill->cur].dev_id = IOMMUFD_INVALID_ID; This needs an else? I suggest to check for VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID on input as well. I know the old kernels don't enforce this but at least we could start enforcing it going forward so that the group path would reject it to catch userspace bugs. May as well fix it up to fully validate the flags Jason
On Thu, 30 Mar 2023 19:44:55 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Thu, Mar 30, 2023 at 12:48:03PM +0000, Liu, Yi L wrote: > > + /* > > + * If dev_id is needed, fill in the dev_id field, otherwise > > + * fill in group_id. > > + */ > > + if (fill->require_devid) { > > + /* > > + * Report the devices that are opened as cdev and have > > + * the same iommufd with the fill->iommufd. Otherwise, > > + * just fill in an IOMMUFD_INVALID_ID. > > + */ > > + vdev = vfio_pci_find_device_in_devset(dev_set, pdev); > > + if (vdev && !vfio_device_cdev_opened(vdev) && > > + fill->iommufd == vfio_iommufd_physical_ictx(vdev)) > > + vfio_iommufd_physical_devid(vdev, &fill->devices[fill->cur].dev_id); > > + fill->devices[fill->cur].dev_id = IOMMUFD_INVALID_ID; > > This needs an else? > > I suggest to check for VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID on input > as well. I know the old kernels don't enforce this but at least we > could start enforcing it going forward so that the group path would > reject it to catch userspace bugs. > > May as well fix it up to fully validate the flags Is this under the guise of "if nobody complains it's ok, otherwise revert" plan? We report dev-id based on the nature of the device, not the provided flags, so I'm not sure I follow how this protects the group path, unless we've failed to clear the output flags on that path with this change. Thanks, Alex
On Thu, Mar 30, 2023 at 05:05:31PM -0600, Alex Williamson wrote: > > I suggest to check for VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID on input > > as well. I know the old kernels don't enforce this but at least we > > could start enforcing it going forward so that the group path would > > reject it to catch userspace bugs. > > > > May as well fix it up to fully validate the flags > > Is this under the guise of "if nobody complains it's ok, otherwise > revert" plan? Yah, assuming people don't pass in uninited structs. Jason
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index 19f5b075d70a..45edf4e9b98b 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -1181,6 +1181,102 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev, return ret; } +static struct pci_dev * +vfio_pci_dev_set_resettable(struct vfio_device_set *dev_set); + +static int vfio_pci_ioctl_get_pci_hot_reset_group_info( + struct vfio_pci_core_device *vdev, + struct vfio_pci_hot_reset_group_info __user *arg) +{ + unsigned long minsz = + offsetofend(struct vfio_pci_hot_reset_group_info, count); + struct vfio_pci_hot_reset_group_info hdr; + struct iommufd_ctx *iommufd, *cur_iommufd; + u32 count = 0, index = 0, *devices = NULL; + struct vfio_pci_core_device *cur; + bool slot = false; + int ret = 0; + + if (copy_from_user(&hdr, arg, minsz)) + return -EFAULT; + + if (hdr.argsz < minsz) + return -EINVAL; + + hdr.flags = 0; + + /* Can we do a slot or bus reset or neither? */ + if (!pci_probe_reset_slot(vdev->pdev->slot)) + slot = true; + else if (pci_probe_reset_bus(vdev->pdev->bus)) + return -ENODEV; + + mutex_lock(&vdev->vdev.dev_set->lock); + if (!vfio_pci_dev_set_resettable(vdev->vdev.dev_set)) { + ret = -EPERM; + goto out_unlock; + } + + iommufd = vfio_iommufd_physical_ictx(&vdev->vdev); + if (!iommufd) { + ret = -EPERM; + goto out_unlock; + } + + /* How many devices are affected? */ + ret = vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_count_devs, + &count, slot); + if (ret) + goto out_unlock; + + WARN_ON(!count); /* Should always be at least one */ + + /* + * If there's enough space, fill it now, otherwise return -ENOSPC and + * the number of devices affected. + */ + if (hdr.argsz < sizeof(hdr) + (count * sizeof(*devices))) { + ret = -ENOSPC; + hdr.count = count; + goto reset_info_exit; + } + + devices = kcalloc(count, sizeof(*devices), GFP_KERNEL); + if (!devices) { + ret = -ENOMEM; + goto reset_info_exit; + } + + list_for_each_entry(cur, &vdev->vdev.dev_set->device_list, vdev.dev_set_list) { + cur_iommufd = vfio_iommufd_physical_ictx(&cur->vdev); + if (cur->vdev.open_count) { + if (cur_iommufd != iommufd) { + ret = -EPERM; + break; + } + ret = vfio_iommufd_physical_devid(&cur->vdev, &devices[index]); + if (ret) + break; + index++; + } + } + +reset_info_exit: + if (copy_to_user(arg, &hdr, minsz)) + ret = -EFAULT; + + if (!ret) { + if (copy_to_user(&arg->devices, devices, + hdr.count * sizeof(*devices))) + ret = -EFAULT; + } + + kfree(devices); +out_unlock: + mutex_unlock(&vdev->vdev.dev_set->lock); + return ret; +} + static int vfio_pci_ioctl_get_pci_hot_reset_info( struct vfio_pci_core_device *vdev, struct vfio_pci_hot_reset_info __user *arg) @@ -1404,6 +1500,8 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd, return vfio_pci_ioctl_get_irq_info(vdev, uarg); case VFIO_DEVICE_GET_PCI_HOT_RESET_INFO: return vfio_pci_ioctl_get_pci_hot_reset_info(vdev, uarg); + case VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO: + return vfio_pci_ioctl_get_pci_hot_reset_group_info(vdev, uarg); case VFIO_DEVICE_GET_REGION_INFO: return vfio_pci_ioctl_get_region_info(vdev, uarg); case VFIO_DEVICE_IOEVENTFD: diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index 25432ef213ee..61b801dfd40b 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -669,6 +669,53 @@ struct vfio_pci_hot_reset_info { #define VFIO_DEVICE_GET_PCI_HOT_RESET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12) +/** + * VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 12, + * struct vfio_pci_hot_reset_group_info) + * + * This is used in the vfio device cdev mode. It returns the list of + * affected devices (represented by iommufd dev_id) when hot reset is + * issued on the current device with which this ioctl is invoked. It + * only includes the devices that are opened by the current user in the + * time of this command is invoked. This list may change when user opens + * new device or close opened device, hence user should re-invoke it + * after open/close devices. This command has no guarantee on the result + * of VFIO_DEVICE_PCI_HOT_RESET since the not-opened affected device can + * be by other users in the window between the two ioctls. If the affected + * devices are opened by multiple users, the VFIO_DEVICE_PCI_HOT_RESET + * shall fail, detail can check the description of VFIO_DEVICE_PCI_HOT_RESET. + * + * For the users that open vfio group/container, this ioctl is useless as + * they have no idea about the iommufd dev_id returned by this ioctl. For + * the users of the traditional mode vfio group/container, this ioctl will + * fail as this mode does not use iommufd hence no dev_id to report back. + * For the users of the iommufd compat mode vfio group/container, this ioctl + * would succeed as this mode uses iommufd as container fd. But such users + * still have no idea about the iommufd dev_id as the dev_id is only stored + * in kernel in this mode. For the users of the vfio group/container, the + * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO should be used to know the hot reset + * affected devices. + * + * Return: 0 on success, -errno on failure: + * -enospc = insufficient buffer; + * -enodev = unsupported for device; + * -eperm = no permission for device, this error comes: + * - when there are affected devices that are opened but + * not bound to the same iommufd with the current device + * with which this ioctl is invoked, + * - there are affected devices that are not bound to vfio + * driver yet. + * - no valid iommufd is bound (e.g. noiommu mode) + */ +struct vfio_pci_hot_reset_group_info { + __u32 argsz; + __u32 flags; + __u32 count; + __u32 devices[]; +}; + +#define VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO _IO(VFIO_TYPE, VFIO_BASE + 18) + /** * VFIO_DEVICE_PCI_HOT_RESET - _IOW(VFIO_TYPE, VFIO_BASE + 13, * struct vfio_pci_hot_reset)
This is a preparation for vfio device cdev as cdev gives userspace the capability to open device cdev fd and management stack (e.g. libvirt) could pass the device fd to the actual user (e.g. QEMU). As a result, the actual user has no idea about the device's bus:devfn information. This is a problem when user uses VFIO_DEVICE_GET_PCI_HOT_RESET_INFO to know the hot reset affected device scope as this ioctl returns bus:devfn info. For the fd passing usage, the acutal user cannot map the bus:devfn to the devices it has opened via the fd passed from management stack. So a new ioctl is required. This new ioctl reports the list of iommufd dev_id that is opened by the user. If there is affected device that is not bound to vfio driver or opened by another user, this command shall fail with -EPERM. For the noiommu mode in the vfio device cdev path, this shall fail as no dev_id would be generated, hence nothing to report. This ioctl is useless to the users that open vfio group as such users have no idea about the iommufd dev_id and it can use the existing VFIO_DEVICE_GET_PCI_HOT_RESET_INFO. The user that uses the traditional mode vfio group/container would be failed if invoking this ioctl. But the user that uses the iommufd compat mode vfio group/container shall succeed. This is harmless as long as user cannot make use of it and should use VFIO_DEVICE_GET_PCI_HOT_RESET_INFO. Signed-off-by: Yi Liu <yi.l.liu@intel.com> --- drivers/vfio/pci/vfio_pci_core.c | 98 ++++++++++++++++++++++++++++++++ include/uapi/linux/vfio.h | 47 +++++++++++++++ 2 files changed, 145 insertions(+)