Message ID | 20230221034812.138051-10-yi.l.liu@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Add vfio_device cdev for iommufd support | expand |
> From: Liu, Yi L <yi.l.liu@intel.com> > Sent: Tuesday, February 21, 2023 11:48 AM > > /* > * We can't let userspace give us an arbitrarily large buffer to copy, > - * so verify how many we think there could be. Note groups can have > - * multiple devices so one group per device is the max. > + * so verify how many we think there could be. Note user may > provide > + * a set of groups, group can have multiple devices so one group per > + * device is the max. well this change doesn't include cdev > @@ -1320,7 +1321,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct > vfio_pci_core_device *vdev, > } > > /* Ensure the FD is a vfio FD.*/ > - if (!vfio_file_is_valid(file)) { > + if (!vfio_file_is_device_opened(file)) { > fput(file); > ret = -EINVAL; > break; that function is not just for checking device. Probably rename it to vfio_file_is_reset_valid(). btw this patch is insufficient to handle device fd. The current logic requires every device in the dev_set covered by provided fd's: static bool vfio_dev_in_groups(struct vfio_pci_core_device *vdev, struct vfio_pci_group_info *groups) { unsigned int i; for (i = 0; i < groups->count; i++) if (vfio_file_has_dev(groups->files[i], &vdev->vdev)) return true; return false; } Presumably when cdev fd is provided above should compare iommu group of the fd and that of the vdev. Otherwise it expects the user to have full access to every device in the set which is impractical.
> From: Tian, Kevin <kevin.tian@intel.com> > Sent: Wednesday, February 22, 2023 3:26 PM > > > From: Liu, Yi L <yi.l.liu@intel.com> > > Sent: Tuesday, February 21, 2023 11:48 AM > > > > /* > > * We can't let userspace give us an arbitrarily large buffer to copy, > > - * so verify how many we think there could be. Note groups can > have > > - * multiple devices so one group per device is the max. > > + * so verify how many we think there could be. Note user may > > provide > > + * a set of groups, group can have multiple devices so one group per > > + * device is the max. > > well this change doesn't include cdev For cdev, it should be the number of devices.
On Wed, Feb 22, 2023 at 01:35:06PM +0000, Liu, Yi L wrote: > > btw this patch is insufficient to handle device fd. The current logic > > requires every device in the dev_set covered by provided fd's: Yes, which is what it should be > > static bool vfio_dev_in_groups(struct vfio_pci_core_device *vdev, > > struct vfio_pci_group_info *groups) > > { > > unsigned int i; > > > > for (i = 0; i < groups->count; i++) > > if (vfio_file_has_dev(groups->files[i], &vdev->vdev)) > > return true; > > return false; > > } > > > > Presumably when cdev fd is provided above should compare iommu > > group of the fd and that of the vdev. Otherwise it expects the user > > to have full access to every device in the set which is impractical. No, it should check the dev's directly, userspace has to provide every dev in the dev set to do a reset. We should not allow userspace to take a shortcut based on hidden group stuff. The dev set is already unrelated to the groups, and userspace cannot discover the devset, so nothing has changed. This is looking worse to me. I think we should not require userspace to pass in lists of devices here. The simpler solution is to just take in a single iommufd and use that as the ownership proof. Something like the below. Jason diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c index d81f93a321afcb..a5833bfdd7307e 100644 --- a/drivers/iommu/iommufd/device.c +++ b/drivers/iommu/iommufd/device.c @@ -114,6 +114,34 @@ struct iommufd_device *iommufd_device_bind(struct iommufd_ctx *ictx, } EXPORT_SYMBOL_NS_GPL(iommufd_device_bind, IOMMUFD); +/** + * iommufd_ctx_has_device - True if the struct device is bound to this ictx + * @ictx: iommufd file descriptor + * @dev: Pointer to a physical device struct + * + * True if a iommufd_device_bind() is present for dev. + */ +bool iommufd_ctx_has_device(struct iommufd_ctx *ictx, struct device *dev) +{ + unsigned long index; + struct iommufd_object *obj; + + if (!ictx) + return false; + + xa_lock(&ictx->objects); + xa_for_each(&ictx->objects, index, obj) { + if (obj->type == IOMMUFD_OBJ_DEVICE && + container_of(obj, struct iommufd_device, obj)->dev == dev) { + xa_unlock(&ictx->objects); + return true; + } + } + xa_unlock(&ictx->objects); + return false; +} +EXPORT_SYMBOL_NS_GPL(iommufd_ctx_has_device, IOMMUFD); + /** * iommufd_device_unbind - Undo iommufd_device_bind() * @idev: Device returned by iommufd_device_bind() diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index 26a541cc64d114..28f6db1b81c1af 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -27,6 +27,7 @@ #include <linux/vgaarb.h> #include <linux/nospec.h> #include <linux/sched/mm.h> +#include <linux/iommufd.h> #if IS_ENABLED(CONFIG_EEH) #include <asm/eeh.h> #endif @@ -179,7 +180,8 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev) struct vfio_pci_group_info; static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set); static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, - struct vfio_pci_group_info *groups); + struct vfio_pci_group_info *groups, + struct iommufd_ctx *iommufd_ctx); /* * INTx masking requires the ability to disable INTx signaling via PCI_COMMAND @@ -1254,29 +1256,17 @@ static int vfio_pci_ioctl_get_pci_hot_reset_info( return ret; } -static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev, - struct vfio_pci_hot_reset __user *arg) +static int +vfio_pci_ioctl_pci_hot_reset_groups(struct vfio_pci_core_device *vdev, + struct vfio_pci_hot_reset *hdr, + struct vfio_pci_hot_reset __user *arg) { - unsigned long minsz = offsetofend(struct vfio_pci_hot_reset, count); - struct vfio_pci_hot_reset hdr; int32_t *group_fds; struct file **files; struct vfio_pci_group_info info; bool slot = false; int file_idx, count = 0, ret = 0; - if (copy_from_user(&hdr, arg, minsz)) - return -EFAULT; - - if (hdr.argsz < minsz || hdr.flags) - return -EINVAL; - - /* Can we do a slot or bus reset or neither? */ - if (!pci_probe_reset_slot(vdev->pdev->slot)) - slot = true; - else if (pci_probe_reset_bus(vdev->pdev->bus)) - return -ENODEV; - /* * We can't let userspace give us an arbitrarily large buffer to copy, * so verify how many we think there could be. Note groups can have @@ -1288,11 +1278,11 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev, return ret; /* Somewhere between 1 and count is OK */ - if (!hdr.count || hdr.count > count) + if (!hdr->count || hdr->count > count) return -EINVAL; - group_fds = kcalloc(hdr.count, sizeof(*group_fds), GFP_KERNEL); - files = kcalloc(hdr.count, sizeof(*files), GFP_KERNEL); + group_fds = kcalloc(hdr->count, sizeof(*group_fds), GFP_KERNEL); + files = kcalloc(hdr->count, sizeof(*files), GFP_KERNEL); if (!group_fds || !files) { kfree(group_fds); kfree(files); @@ -1300,7 +1290,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev, } if (copy_from_user(group_fds, arg->group_fds, - hdr.count * sizeof(*group_fds))) { + hdr->count * sizeof(*group_fds))) { kfree(group_fds); kfree(files); return -EFAULT; @@ -1311,7 +1301,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev, * interface and store the group and iommu ID. This ensures the group * is held across the reset. */ - for (file_idx = 0; file_idx < hdr.count; file_idx++) { + for (file_idx = 0; file_idx < hdr->count; file_idx++) { struct file *file = fget(group_fds[file_idx]); if (!file) { @@ -1335,10 +1325,10 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev, if (ret) goto hot_reset_release; - info.count = hdr.count; + info.count = hdr->count; info.files = files; - ret = vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, &info); + ret = vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, &info, NULL); hot_reset_release: for (file_idx--; file_idx >= 0; file_idx--) @@ -1348,6 +1338,50 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev, return ret; } +static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev, + struct vfio_pci_hot_reset __user *arg) +{ + unsigned long minsz = offsetofend(struct vfio_pci_hot_reset, count); + struct vfio_pci_hot_reset hdr; + struct iommufd_ctx *iommufd; + bool slot = false; + struct fd f; + int32_t fd; + int ret; + + if (copy_from_user(&hdr, arg, minsz)) + return -EFAULT; + + if (hdr.argsz < minsz || hdr.flags) + return -EINVAL; + + /* Can we do a slot or bus reset or neither? */ + if (!pci_probe_reset_slot(vdev->pdev->slot)) + slot = true; + else if (pci_probe_reset_bus(vdev->pdev->bus)) + return -ENODEV; + + if (hdr.count != 1) + return vfio_pci_ioctl_pci_hot_reset_groups(vdev, &hdr, arg); + + if (copy_from_user(&fd, arg->group_fds, sizeof(fd))) + return -EFAULT; + + f = fdget(fd); + if (!f.file) + return -EBADF; + iommufd = iommufd_ctx_from_file(f.file); + if (IS_ERR(iommufd)) { + fdput(f); + return vfio_pci_ioctl_pci_hot_reset_groups(vdev, &hdr, arg); + } + fdput(f); + + ret = vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, NULL, iommufd); + iommufd_ctx_put(iommufd); + return ret; +} + static int vfio_pci_ioctl_ioeventfd(struct vfio_pci_core_device *vdev, struct vfio_device_ioeventfd __user *arg) { @@ -2317,6 +2351,9 @@ static bool vfio_dev_in_groups(struct vfio_pci_core_device *vdev, { unsigned int i; + if (!groups) + return false; + for (i = 0; i < groups->count; i++) if (vfio_file_has_dev(groups->files[i], &vdev->vdev)) return true; @@ -2398,7 +2435,8 @@ static int vfio_pci_dev_set_pm_runtime_get(struct vfio_device_set *dev_set) * get each memory_lock. */ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, - struct vfio_pci_group_info *groups) + struct vfio_pci_group_info *groups, + struct iommufd_ctx *iommufd_ctx) { struct vfio_pci_core_device *cur_mem; struct vfio_pci_core_device *cur_vma; @@ -2432,7 +2470,8 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, * Test whether all the affected devices are contained by the * set of groups provided by the user. */ - if (!vfio_dev_in_groups(cur_vma, groups)) { + if (!vfio_dev_in_groups(cur_vma, groups) && + !iommufd_ctx_has_device(iommufd_ctx, &cur_vma->pdev->dev)) { ret = -EINVAL; goto err_undo; } diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h index 650d45629647a7..1f58673701cb1e 100644 --- a/include/linux/iommufd.h +++ b/include/linux/iommufd.h @@ -58,6 +58,7 @@ void iommufd_access_unpin_pages(struct iommufd_access *access, int iommufd_access_rw(struct iommufd_access *access, unsigned long iova, void *data, size_t len, unsigned int flags); int iommufd_vfio_compat_ioas_id(struct iommufd_ctx *ictx, u32 *out_ioas_id); +bool iommufd_ctx_has_device(struct iommufd_ctx *ictx, struct device *dev); #else /* !CONFIG_IOMMUFD */ static inline struct iommufd_ctx *iommufd_ctx_from_file(struct file *file) { @@ -94,5 +95,12 @@ static inline int iommufd_vfio_compat_ioas_id(struct iommufd_ctx *ictx, { return -EOPNOTSUPP; } + +static inline bool iommufd_ctx_has_device(struct iommufd_ctx *ictx, + struct device *dev) +{ + return false; +} + #endif /* CONFIG_IOMMUFD */ #endif
> From: Jason Gunthorpe > Sent: Thursday, February 23, 2023 1:18 AM > > > > static bool vfio_dev_in_groups(struct vfio_pci_core_device *vdev, > > > struct vfio_pci_group_info *groups) > > > { > > > unsigned int i; > > > > > > for (i = 0; i < groups->count; i++) > > > if (vfio_file_has_dev(groups->files[i], &vdev->vdev)) > > > return true; > > > return false; > > > } > > > > > > Presumably when cdev fd is provided above should compare iommu > > > group of the fd and that of the vdev. Otherwise it expects the user > > > to have full access to every device in the set which is impractical. > > No, it should check the dev's directly, userspace has to provide every > dev in the dev set to do a reset. We should not allow userspace to > take a shortcut based on hidden group stuff. > > The dev set is already unrelated to the groups, and userspace cannot > discover the devset, so nothing has changed. Agree. But I envision there might be a user-visible impact. Say a scenario where group happens to overlap with devset. Let's say two devices in the group/devset. An existing deployment assigns only dev1 to Qemu. In this case dev1 is resettable via group fd given dev2 cannot be opened by another user. Now the admin upgrades Qemu to a newer version incorporating cdev and your change. Then immediately dev1 cannot be reset since dev2 is not opened by this Qemu. Do we consider it as a regression? Or is the answer to ask the user to upgrade the mgmt stack? > > This is looking worse to me. I think we should not require userspace > to pass in lists of devices here. The simpler solution is to just take > in a single iommufd and use that as the ownership proof. Something > like the below. > As you said the dev set info is not exposed to the admin today. It's only available via VFIO_DEVICE_GET_PCI_HOT_RESET_INFO after a device is opened. My question is more on whether in real deployments the mgmt stack always tries to identify the reset dependency indirectly (is there a reliable way?) and assign all relevant devices to one VM. If it's not the case, then this change (requiring user to open all devices in the dev set) can certainly cause regression in those deployments because old group-level check covers more devices hence has higher possibility of being resettable than what your change implies. Alex probably has more insight into this usage open. Thanks Kevin
On Thu, Feb 23, 2023 at 07:55:21AM +0000, Tian, Kevin wrote: > > From: Jason Gunthorpe > > Sent: Thursday, February 23, 2023 1:18 AM > > > > > > static bool vfio_dev_in_groups(struct vfio_pci_core_device *vdev, > > > > struct vfio_pci_group_info *groups) > > > > { > > > > unsigned int i; > > > > > > > > for (i = 0; i < groups->count; i++) > > > > if (vfio_file_has_dev(groups->files[i], &vdev->vdev)) > > > > return true; > > > > return false; > > > > } > > > > > > > > Presumably when cdev fd is provided above should compare iommu > > > > group of the fd and that of the vdev. Otherwise it expects the user > > > > to have full access to every device in the set which is impractical. > > > > No, it should check the dev's directly, userspace has to provide every > > dev in the dev set to do a reset. We should not allow userspace to > > take a shortcut based on hidden group stuff. > > > > The dev set is already unrelated to the groups, and userspace cannot > > discover the devset, so nothing has changed. > > Agree. But I envision there might be a user-visible impact. > > Say a scenario where group happens to overlap with devset. Let's say > two devices in the group/devset. > > An existing deployment assigns only dev1 to Qemu. In this case dev1 > is resettable via group fd given dev2 cannot be opened by another > user. Oh, that is just because we took a shortcut in this logic and assumed that if the group is open then all the devices are opened by the same security domain. But we can also more clearly state that any closed device is acceptable for reset and doesn't need to be presented. So, like this: if (cur_vma->vdev.open_count && !vfio_dev_in_groups(cur_vma, groups) && !iommufd_ctx_has_device(iommufd_ctx, &cur_vma->pdev->dev)) { ret = -EINVAL; goto err_undo; } Jason
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Thursday, February 23, 2023 9:22 PM > > On Thu, Feb 23, 2023 at 07:55:21AM +0000, Tian, Kevin wrote: > > > From: Jason Gunthorpe > > > Sent: Thursday, February 23, 2023 1:18 AM > > > > > > > > static bool vfio_dev_in_groups(struct vfio_pci_core_device *vdev, > > > > > struct vfio_pci_group_info *groups) > > > > > { > > > > > unsigned int i; > > > > > > > > > > for (i = 0; i < groups->count; i++) > > > > > if (vfio_file_has_dev(groups->files[i], &vdev->vdev)) > > > > > return true; > > > > > return false; > > > > > } > > > > > > > > > > Presumably when cdev fd is provided above should compare iommu > > > > > group of the fd and that of the vdev. Otherwise it expects the user > > > > > to have full access to every device in the set which is impractical. > > > > > > No, it should check the dev's directly, userspace has to provide every > > > dev in the dev set to do a reset. We should not allow userspace to > > > take a shortcut based on hidden group stuff. > > > > > > The dev set is already unrelated to the groups, and userspace cannot > > > discover the devset, so nothing has changed. > > > > Agree. But I envision there might be a user-visible impact. > > > > Say a scenario where group happens to overlap with devset. Let's say > > two devices in the group/devset. > > > > An existing deployment assigns only dev1 to Qemu. In this case dev1 > > is resettable via group fd given dev2 cannot be opened by another > > user. > > Oh, that is just because we took a shortcut in this logic and assumed > that if the group is open then all the devices are opened by the same > security domain. > > But we can also more clearly state that any closed device is > acceptable for reset and doesn't need to be presented. > > So, like this: > > if (cur_vma->vdev.open_count && > !vfio_dev_in_groups(cur_vma, groups) && > !iommufd_ctx_has_device(iommufd_ctx, &cur_vma- > >pdev->dev)) { > ret = -EINVAL; > goto err_undo; > } > Yes, this makes sense. Yi, while you are incorporating this change please also update the uapi header. Rename 'group_fds[]' to 'fds[]' and add comment to explain that it could be an array of group fds or a single iommufd.
On Fri, Feb 24, 2023 at 02:21:33AM +0000, Tian, Kevin wrote: > Yi, while you are incorporating this change please also update the > uapi header. Rename 'group_fds[]' to 'fds[]' and add comment to > explain that it could be an array of group fds or a single iommufd. Upon reflection we can probably make it even simpler and just have a 0 length fd array mean to use the iommufd the vfio_device is already associated with And the check for correctness can be simplified to simply see if each vfio_device in the dev_set is attached to the same iommufd ctx pointer instead of searching the xarray. Would need to double check that the locking is OK but seems doable Jason
> From: Jason Gunthorpe > Sent: Friday, February 24, 2023 10:36 AM > > On Fri, Feb 24, 2023 at 02:21:33AM +0000, Tian, Kevin wrote: > > > Yi, while you are incorporating this change please also update the > > uapi header. Rename 'group_fds[]' to 'fds[]' and add comment to > > explain that it could be an array of group fds or a single iommufd. > > Upon reflection we can probably make it even simpler and just have a 0 > length fd array mean to use the iommufd the vfio_device is already > associated with > > And the check for correctness can be simplified to simply see if each > vfio_device in the dev_set is attached to the same iommufd ctx > pointer instead of searching the xarray. yes, this is simpler > > Would need to double check that the locking is OK but seems doable > Locking is fine since dev_set->lock is already held in the reset path.
> From: Tian, Kevin <kevin.tian@intel.com> > Sent: Friday, February 24, 2023 10:48 AM > > > From: Jason Gunthorpe > > Sent: Friday, February 24, 2023 10:36 AM > > > > On Fri, Feb 24, 2023 at 02:21:33AM +0000, Tian, Kevin wrote: > > > > > Yi, while you are incorporating this change please also update the > > > uapi header. Rename 'group_fds[]' to 'fds[]' and add comment to > > > explain that it could be an array of group fds or a single iommufd. > > > > Upon reflection we can probably make it even simpler and just have a 0 > > length fd array mean to use the iommufd the vfio_device is already > > associated with > > > > And the check for correctness can be simplified to simply see if each > > vfio_device in the dev_set is attached to the same iommufd ctx > > pointer instead of searching the xarray. How about the hot reset info path? We can still keep reporting the current information to userspace. Isn't it? another tricky question. If user passess iommufd down for reset in the vfio iommufd compatible mode, should we support it as well? > yes, this is simpler > > > > > Would need to double check that the locking is OK but seems doable > > > > Locking is fine since dev_set->lock is already held in the reset path. dev_set->lock is held prior to call bind_iommufd, so I agree locking is ok. Regards, Yi Liu
> From: Liu, Yi L <yi.l.liu@intel.com> > Sent: Friday, February 24, 2023 11:44 AM > > > Upon reflection we can probably make it even simpler and just have a 0 > > > length fd array mean to use the iommufd the vfio_device is already > > > associated with > > > > > > And the check for correctness can be simplified to simply see if each > > > vfio_device in the dev_set is attached to the same iommufd ctx > > > pointer instead of searching the xarray. > > How about the hot reset info path? We can still keep reporting the > current information to userspace. Isn't it? No need to change that. It's already reported per device. > > another tricky question. If user passess iommufd down for reset > in the vfio iommufd compatible mode, should we support it as > well? > I don't see why we want to ban it. It does change the result from error (vfio container) to success (iommufd vfio-compat) when using the container fd/iommufd. But do we actually have a use case relying on such error pattern? On the other hand an user who knows the presence of vfio-compat should be allowed to pass iommufd to reset even when it still uses the legacy group/container interfaces.
> From: Tian, Kevin <kevin.tian@intel.com> > Sent: Friday, February 24, 2023 11:57 AM > > > From: Liu, Yi L <yi.l.liu@intel.com> > > Sent: Friday, February 24, 2023 11:44 AM > > > > Upon reflection we can probably make it even simpler and just have a > 0 > > > > length fd array mean to use the iommufd the vfio_device is already > > > > associated with > > > > > > > > And the check for correctness can be simplified to simply see if each > > > > vfio_device in the dev_set is attached to the same iommufd ctx > > > > pointer instead of searching the xarray. > > > > How about the hot reset info path? We can still keep reporting the > > current information to userspace. Isn't it? > > No need to change that. It's already reported per device. > > > > > another tricky question. If user passess iommufd down for reset > > in the vfio iommufd compatible mode, should we support it as > > well? > > > > I don't see why we want to ban it. It does change the result from > error (vfio container) to success (iommufd vfio-compat) when using > the container fd/iommufd. But do we actually have a use case > relying on such error pattern? > > On the other hand an user who knows the presence of vfio-compat > should be allowed to pass iommufd to reset even when it still uses > the legacy group/container interfaces. Yes. although I guess no user would do such a strange thing if no special reason.
On Fri, Feb 24, 2023 at 03:43:37AM +0000, Liu, Yi L wrote: > > From: Tian, Kevin <kevin.tian@intel.com> > > Sent: Friday, February 24, 2023 10:48 AM > > > > > From: Jason Gunthorpe > > > Sent: Friday, February 24, 2023 10:36 AM > > > > > > On Fri, Feb 24, 2023 at 02:21:33AM +0000, Tian, Kevin wrote: > > > > > > > Yi, while you are incorporating this change please also update the > > > > uapi header. Rename 'group_fds[]' to 'fds[]' and add comment to > > > > explain that it could be an array of group fds or a single iommufd. > > > > > > Upon reflection we can probably make it even simpler and just have a 0 > > > length fd array mean to use the iommufd the vfio_device is already > > > associated with > > > > > > And the check for correctness can be simplified to simply see if each > > > vfio_device in the dev_set is attached to the same iommufd ctx > > > pointer instead of searching the xarray. > > How about the hot reset info path? We can still keep reporting the > current information to userspace. Isn't it? Yeah, but I wonder if it is useful > another tricky question. If user passess iommufd down for reset > in the vfio iommufd compatible mode, should we support it as > well? I would say if the 0 fds mode is used and the current vfio_Device does not have an iommufd ctx then fail. That is the only requirement, however it got that ctx doesn't matter. > > Locking is fine since dev_set->lock is already held in the reset path. > > dev_set->lock is held prior to call bind_iommufd, so I agree locking > is ok. As long as the vdev's iommufd ctx and opencount cannot change under the devset lock, which I think is the case. It should be documented though in the vfio core code, as it is a bit subtle what the devset lock actually covers. Jason
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Friday, February 24, 2023 10:36 AM > > On Fri, Feb 24, 2023 at 02:21:33AM +0000, Tian, Kevin wrote: > > > Yi, while you are incorporating this change please also update the > > uapi header. Rename 'group_fds[]' to 'fds[]' and add comment to > > explain that it could be an array of group fds or a single iommufd. > > Upon reflection we can probably make it even simpler and just have a 0 > length fd array mean to use the iommufd the vfio_device is already > associated with > > And the check for correctness can be simplified to simply see if each > vfio_device in the dev_set is attached to the same iommufd ctx > pointer instead of searching the xarray. Sorry, it appears to me the below concern is not solved as above logic still requires userspace to open and bind devices to the same iommufd. " > Say a scenario where group happens to overlap with devset. Let's say > two devices in the group/devset. > > An existing deployment assigns only dev1 to Qemu. In this case dev1 > is resettable via group fd given dev2 cannot be opened by another > user. " Thus, I think we still need to search the xarray to check if a device is bound to iommufd or not. And this check needs to be more relaxed. E.g. dev1 is bound to iommufd, but dev2 has not. However, they have the same group, so dev2 should be considered to be "bound" as well. When 0 length fd is used, vfio just gets the iommufd_ctx from the device that is to be reset, then check if all other devices in the dev_set are considered as bound with the below interface. +/** + * iommufd_ctx_has_group_for_device - True if any alias of struct device + is bound to this ictx + * @ictx: iommufd file descriptor + * @dev: Pointer to a physical device struct + * + * True if a iommufd_device_bind() is present for any alias of this dev + */ +bool iommufd_ctx_has_group_for_device(struct iommufd_ctx *ictx, struct device *dev) +{ + unsigned long index; + struct iommu_group *group; + struct iommufd_object *obj; + + if (!ictx) + return false; + + group = iommu_group_get(dev); + if (!group) + return false; + + xa_lock(&ictx->objects); + xa_for_each(&ictx->objects, index, obj) { + if (obj->type == IOMMUFD_OBJ_DEVICE) && + container_of(obj, struct iommufd_device, obj)->group == group) { + xa_unlock(&ictx->objects); + iommu_group_put(group); + return true; + } + } + xa_unlock(&ictx->objects); + iommu_group_put(group); + return false; +} +EXPORT_SYMBOL_NS_GPL(iommufd_ctx_has_group_for_device, IOMMUFD); Regards Yi Liu
On Sun, Feb 26, 2023 at 08:59:01AM +0000, Liu, Yi L wrote: > > From: Jason Gunthorpe <jgg@nvidia.com> > > Sent: Friday, February 24, 2023 10:36 AM > > > > On Fri, Feb 24, 2023 at 02:21:33AM +0000, Tian, Kevin wrote: > > > > > Yi, while you are incorporating this change please also update the > > > uapi header. Rename 'group_fds[]' to 'fds[]' and add comment to > > > explain that it could be an array of group fds or a single iommufd. > > > > Upon reflection we can probably make it even simpler and just have a 0 > > length fd array mean to use the iommufd the vfio_device is already > > associated with > > > > And the check for correctness can be simplified to simply see if each > > vfio_device in the dev_set is attached to the same iommufd ctx > > pointer instead of searching the xarray. > > Sorry, it appears to me the below concern is not solved as above logic > still requires userspace to open and bind devices to the same iommufd. > > " > > Say a scenario where group happens to overlap with devset. Let's say > > two devices in the group/devset. > > > > An existing deployment assigns only dev1 to Qemu. In this case dev1 > > is resettable via group fd given dev2 cannot be opened by another > > user. > " You solve this by checking for a 0 open count as already discussed. Jason
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Monday, February 27, 2023 7:40 AM > On Sun, Feb 26, 2023 at 08:59:01AM +0000, Liu, Yi L wrote: > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > Sent: Friday, February 24, 2023 10:36 AM > > > > > > On Fri, Feb 24, 2023 at 02:21:33AM +0000, Tian, Kevin wrote: > > > > > > > Yi, while you are incorporating this change please also update the > > > > uapi header. Rename 'group_fds[]' to 'fds[]' and add comment to > > > > explain that it could be an array of group fds or a single iommufd. > > > > > > Upon reflection we can probably make it even simpler and just have a 0 > > > length fd array mean to use the iommufd the vfio_device is already > > > associated with > > > > > > And the check for correctness can be simplified to simply see if each > > > vfio_device in the dev_set is attached to the same iommufd ctx > > > pointer instead of searching the xarray. > > > > Sorry, it appears to me the below concern is not solved as above logic > > still requires userspace to open and bind devices to the same iommufd. > > > > " > > > Say a scenario where group happens to overlap with devset. Let's say > > > two devices in the group/devset. > > > > > > An existing deployment assigns only dev1 to Qemu. In this case dev1 > > > is resettable via group fd given dev2 cannot be opened by another > > > user. > > " > > You solve this by checking for a 0 open count as already discussed. Ok. this scenario is solved. so if open_count is non-zero, it should be either bound with iommufd or should be within groups as your below suggestion. if (cur_vma->vdev.open_count && !vfio_dev_in_groups(cur_vma, groups) && !iommufd_ctx_has_device(iommufd_ctx, &cur_vma->pdev->dev)) { ret = -EINVAL; goto err_undo; } Btw. In cdev path, open_count++ and iommufd bound are done in a single dev_set->lock lock and unlock pair, so if cur_vma->vdev has iommufd_ctx, then its open_count is non-zero. I have another scenario that dev1 and dev2 are from different groups but happen to be in the same dev_set, and userspace only opens dev1, this logic will allow userspace to reset dev1, but dev2 may be opened by another userspace. This seems to be a problem in my prior thinking. However, dev_set->lock Is held in the reset path, so other userspace cannot open and bind cur_vma->vdev to iommufd during reset.
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index 827524510f3f..09086fefd515 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -1280,8 +1280,9 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev, /* * We can't let userspace give us an arbitrarily large buffer to copy, - * so verify how many we think there could be. Note groups can have - * multiple devices so one group per device is the max. + * so verify how many we think there could be. Note user may provide + * a set of groups, group can have multiple devices so one group per + * device is the max. */ ret = vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_count_devs, &count, slot); @@ -1308,7 +1309,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev, } /* - * For each group_fd, get the group file, this ensures the group + * For each fd, get the file, this ensures the group or device * is held across the reset. */ for (file_idx = 0; file_idx < hdr.count; file_idx++) { @@ -1320,7 +1321,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev, } /* Ensure the FD is a vfio FD.*/ - if (!vfio_file_is_valid(file)) { + if (!vfio_file_is_device_opened(file)) { fput(file); ret = -EINVAL; break; @@ -2430,7 +2431,7 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, list_for_each_entry(cur_vma, &dev_set->device_list, vdev.dev_set_list) { /* * Test whether all the affected devices are contained by the - * set of groups provided by the user. + * set of files provided by the user. */ if (!vfio_dev_in_groups(cur_vma, groups)) { ret = -EINVAL; diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c index 91c8f25393db..2c851e172586 100644 --- a/drivers/vfio/vfio_main.c +++ b/drivers/vfio/vfio_main.c @@ -1209,6 +1209,25 @@ bool vfio_file_is_valid(struct file *file) } EXPORT_SYMBOL_GPL(vfio_file_is_valid); +/** + * vfio_file_is_device_opened - True if the file is fully opened + * @file: VFIO group file or VFIO device file + */ +bool vfio_file_is_device_opened(struct file *file) +{ + struct vfio_device *device; + + if (vfio_group_from_file(file)) + return true; + + device = vfio_device_from_file(file); + if (device) + return READ_ONCE(device->open_count); + + return false; +} +EXPORT_SYMBOL_GPL(vfio_file_is_device_opened); + /** * vfio_file_enforced_coherent - True if the DMA associated with the VFIO file * is always CPU cache coherent diff --git a/include/linux/vfio.h b/include/linux/vfio.h index 6a07e1c6c38e..615f8a081a41 100644 --- a/include/linux/vfio.h +++ b/include/linux/vfio.h @@ -249,6 +249,7 @@ bool vfio_file_is_valid(struct file *file); bool vfio_file_enforced_coherent(struct file *file); void vfio_file_set_kvm(struct file *file, struct kvm *kvm); bool vfio_file_has_dev(struct file *file, struct vfio_device *device); +bool vfio_file_is_device_opened(struct file *file); #define VFIO_PIN_PAGES_MAX_ENTRIES (PAGE_SIZE/sizeof(unsigned long))
This prepares for using vfio device cdev as no group fd will be opened in device cdev usage. vfio_file_is_device_opened() is added for checking a given vfio file is able to be a proof for the device ownership or not. The reason is that the cdev path has the device opened in an in-between state, in which the device is not fully opened. But to be proof of ownership, device should be fully opened. This also updates some comments as it also accepts device fd passed by user. The uapi has no change, but user can specify a set of device fds in the struct vfio_pci_hot_reset::group_fds field. Signed-off-by: Yi Liu <yi.l.liu@intel.com> --- drivers/vfio/pci/vfio_pci_core.c | 11 ++++++----- drivers/vfio/vfio_main.c | 19 +++++++++++++++++++ include/linux/vfio.h | 1 + 3 files changed, 26 insertions(+), 5 deletions(-)