[v4,09/19] vfio/pci: Accept device fd for hot reset

Message ID	20230221034812.138051-10-yi.l.liu@intel.com (mailing list archive)
State	New, archived
Headers	Return-Path: <intel-gfx-bounces@lists.freedesktop.org>
	From: Yi Liu <yi.l.liu@intel.com>
	To: alex.williamson@redhat.com, jgg@nvidia.com, kevin.tian@intel.com
	Date: Mon, 20 Feb 2023 19:48:02 -0800
	Message-Id: <20230221034812.138051-10-yi.l.liu@intel.com>
	In-Reply-To: <20230221034812.138051-1-yi.l.liu@intel.com>
	References: <20230221034812.138051-1-yi.l.liu@intel.com>
	MIME-Version: 1.0
	Content-Transfer-Encoding: 8bit
	Subject: [Intel-gfx] [PATCH v4 09/19] vfio/pci: Accept device fd for hot reset
	Precedence: list
	Cc: linux-s390@vger.kernel.org, yi.l.liu@intel.com, yi.y.sun@linux.intel.com, kvm@vger.kernel.org, mjrosato@linux.ibm.com, joro@8bytes.org, cohuck@redhat.com, xudong.hao@intel.com, peterx@redhat.com, yan.y.zhao@intel.com, eric.auger@redhat.com, terrence.xu@intel.com, nicolinc@nvidia.com, shameerali.kolothum.thodi@huawei.com, suravee.suthikulpanit@amd.com, intel-gfx@lists.freedesktop.org, chao.p.peng@linux.intel.com, lulu@redhat.com, intel-gvt-dev@lists.freedesktop.org, jasowang@redhat.com
	Errors-To: intel-gfx-bounces@lists.freedesktop.org
	Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>
Series	Add vfio_device cdev for iommufd support

Yi Liu Feb. 21, 2023, 3:48 a.m. UTC

This prepares for the vfio device cdev path, in which no group fd is
opened.

vfio_file_is_device_opened() is added to check whether a given vfio file
can serve as proof of device ownership. The check is needed because the
cdev path has an in-between state in which the device is not yet fully
opened, and only a fully opened device can be proof of ownership.

This also updates some comments, since the ioctl now accepts device fds
passed by the user. The uapi is unchanged, but the user can specify a set
of device fds in the struct vfio_pci_hot_reset::group_fds field.
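
For illustration only, here is a minimal userspace sketch of how the
argument buffer for the hot reset ioctl might be laid out when device fds
are placed in the group_fds array. struct hot_reset_req is a local mirror
of the struct vfio_pci_hot_reset layout and hot_reset_req_build() is a
hypothetical helper; neither is part of this patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Local mirror of the uapi struct vfio_pci_hot_reset layout (illustration). */
struct hot_reset_req {
	uint32_t argsz;
	uint32_t flags;
	uint32_t count;
	int32_t  group_fds[];	/* with this patch, entries may be device fds */
};

/* Allocate and fill a request carrying nfds descriptors; caller frees it. */
struct hot_reset_req *hot_reset_req_build(const int32_t *fds, uint32_t nfds)
{
	size_t sz = sizeof(struct hot_reset_req) + (size_t)nfds * sizeof(int32_t);
	struct hot_reset_req *req;

	assert(fds || nfds == 0);
	req = calloc(1, sz);
	if (!req)
		return NULL;
	req->argsz = (uint32_t)sz;	/* total size, header plus fd array */
	req->count = nfds;
	if (nfds)
		memcpy(req->group_fds, fds, (size_t)nfds * sizeof(int32_t));
	return req;
}
```

A real caller would include <linux/vfio.h>, use struct vfio_pci_hot_reset
directly, and pass the buffer to the VFIO_DEVICE_PCI_HOT_RESET ioctl on
the device fd.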

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 11 ++++++-----
 drivers/vfio/vfio_main.c         | 19 +++++++++++++++++++
 include/linux/vfio.h             |  1 +
 3 files changed, 26 insertions(+), 5 deletions(-)

Tian, Kevin Feb. 22, 2023, 7:26 a.m. UTC | #1

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Tuesday, February 21, 2023 11:48 AM
> 
>  	/*
>  	 * We can't let userspace give us an arbitrarily large buffer to copy,
> -	 * so verify how many we think there could be.  Note groups can have
> -	 * multiple devices so one group per device is the max.
> +	 * so verify how many we think there could be.  Note user may provide
> +	 * a set of groups, group can have multiple devices so one group per
> +	 * device is the max.

well this change doesn't include cdev

> @@ -1320,7 +1321,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
>  		}
> 
>  		/* Ensure the FD is a vfio FD.*/
> -		if (!vfio_file_is_valid(file)) {
> +		if (!vfio_file_is_device_opened(file)) {
>  			fput(file);
>  			ret = -EINVAL;
>  			break;

that function is not just for checking device.

Probably rename it to vfio_file_is_reset_valid().

btw this patch is insufficient to handle device fd. The current logic
requires every device in the dev_set covered by provided fd's:

static bool vfio_dev_in_groups(struct vfio_pci_core_device *vdev,
                               struct vfio_pci_group_info *groups)
{
 	unsigned int i;

	for (i = 0; i < groups->count; i++)
		if (vfio_file_has_dev(groups->files[i], &vdev->vdev))
			return true;
	return false;
}

Presumably, when a cdev fd is provided, the check above should compare
the iommu group of the fd with that of the vdev. Otherwise it expects the
user to have full access to every device in the set, which is impractical.
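
To make the coverage requirement concrete, here is a minimal userspace
sketch (not kernel code). The types cov_dev and cov_group_list and both
helper names are hypothetical stand-ins for vfio_pci_core_device,
vfio_pci_group_info and vfio_file_has_dev(); each device simply records
the ID of the group it belongs to.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a device in a dev_set. */
struct cov_dev {
	int group_id;
};

/* Hypothetical stand-in for the group fds the user passed to the ioctl. */
struct cov_group_list {
	const int *group_ids;
	size_t count;
};

/* Models vfio_dev_in_groups(): is the device's group among those provided? */
bool cov_dev_in_groups(const struct cov_dev *dev,
		       const struct cov_group_list *groups)
{
	size_t i;

	assert(dev && groups);
	for (i = 0; i < groups->count; i++)
		if (groups->group_ids[i] == dev->group_id)
			return true;
	return false;
}

/* The reset is allowed only if every device in the set is covered. */
bool cov_set_covered(const struct cov_dev *set, size_t ndevs,
		     const struct cov_group_list *groups)
{
	size_t i;

	for (i = 0; i < ndevs; i++)
		if (!cov_dev_in_groups(&set[i], groups))
			return false;
	return true;
}
```

With three devices in groups 1, 1 and 2, providing only group 1 fails the
check, while providing groups 1 and 2 passes it.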

Yi Liu Feb. 22, 2023, 1:35 p.m. UTC | #2

> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Wednesday, February 22, 2023 3:26 PM
> 
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Tuesday, February 21, 2023 11:48 AM
> >
> >  	/*
> >  	 * We can't let userspace give us an arbitrarily large buffer to copy,
> > -	 * so verify how many we think there could be.  Note groups can have
> > -	 * multiple devices so one group per device is the max.
> > +	 * so verify how many we think there could be.  Note user may provide
> > +	 * a set of groups, group can have multiple devices so one group per
> > +	 * device is the max.
> 
> well this change doesn't include cdev

For cdev, it should be the number of devices.

Jason Gunthorpe Feb. 22, 2023, 5:17 p.m. UTC | #3

On Wed, Feb 22, 2023 at 01:35:06PM +0000, Liu, Yi L wrote:

> > btw this patch is insufficient to handle device fd. The current logic
> > requires every device in the dev_set covered by provided fd's:

Yes, which is what it should be

> > static bool vfio_dev_in_groups(struct vfio_pci_core_device *vdev,
> >                                struct vfio_pci_group_info *groups)
> > {
> >  	unsigned int i;
> > 
> > 	for (i = 0; i < groups->count; i++)
> > 		if (vfio_file_has_dev(groups->files[i], &vdev->vdev))
> > 			return true;
> > 	return false;
> > }
> > 
> > Presumably when cdev fd is provided above should compare iommu
> > group of the fd and that of the vdev. Otherwise it expects the user
> > to have full access to every device in the set which is impractical.

No, it should check the dev's directly, userspace has to provide every
dev in the dev set to do a reset. We should not allow userspace to
take a shortcut based on hidden group stuff.

The dev set is already unrelated to the groups, and userspace cannot
discover the devset, so nothing has changed.

This is looking worse to me. I think we should not require userspace
to pass in lists of devices here. The simpler solution is to just take
in a single iommufd and use that as the ownership proof. Something
like the below.

Jason

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index d81f93a321afcb..a5833bfdd7307e 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -114,6 +114,34 @@ struct iommufd_device *iommufd_device_bind(struct iommufd_ctx *ictx,
 }
 EXPORT_SYMBOL_NS_GPL(iommufd_device_bind, IOMMUFD);
 
+/**
+ * iommufd_ctx_has_device - True if the struct device is bound to this ictx
+ * @ictx: iommufd file descriptor
+ * @dev: Pointer to a physical device struct
+ *
+ * True if a iommufd_device_bind() is present for dev.
+ */
+bool iommufd_ctx_has_device(struct iommufd_ctx *ictx, struct device *dev)
+{
+	unsigned long index;
+	struct iommufd_object *obj;
+
+	if (!ictx)
+		return false;
+
+	xa_lock(&ictx->objects);
+	xa_for_each(&ictx->objects, index, obj) {
+		if (obj->type == IOMMUFD_OBJ_DEVICE &&
+		    container_of(obj, struct iommufd_device, obj)->dev == dev) {
+			xa_unlock(&ictx->objects);
+			return true;
+		}
+	}
+	xa_unlock(&ictx->objects);
+	return false;
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_ctx_has_device, IOMMUFD);
+
 /**
  * iommufd_device_unbind - Undo iommufd_device_bind()
  * @idev: Device returned by iommufd_device_bind()
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 26a541cc64d114..28f6db1b81c1af 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -27,6 +27,7 @@
 #include <linux/vgaarb.h>
 #include <linux/nospec.h>
 #include <linux/sched/mm.h>
+#include <linux/iommufd.h>
 #if IS_ENABLED(CONFIG_EEH)
 #include <asm/eeh.h>
 #endif
@@ -179,7 +180,8 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
 struct vfio_pci_group_info;
 static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
 static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
-				      struct vfio_pci_group_info *groups);
+				      struct vfio_pci_group_info *groups,
+				      struct iommufd_ctx *iommufd_ctx);
 
 /*
  * INTx masking requires the ability to disable INTx signaling via PCI_COMMAND
@@ -1254,29 +1256,17 @@ static int vfio_pci_ioctl_get_pci_hot_reset_info(
 	return ret;
 }
 
-static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
-					struct vfio_pci_hot_reset __user *arg)
+static int
+vfio_pci_ioctl_pci_hot_reset_groups(struct vfio_pci_core_device *vdev,
+				    struct vfio_pci_hot_reset *hdr,
+				    struct vfio_pci_hot_reset __user *arg)
 {
-	unsigned long minsz = offsetofend(struct vfio_pci_hot_reset, count);
-	struct vfio_pci_hot_reset hdr;
 	int32_t *group_fds;
 	struct file **files;
 	struct vfio_pci_group_info info;
 	bool slot = false;
 	int file_idx, count = 0, ret = 0;
 
-	if (copy_from_user(&hdr, arg, minsz))
-		return -EFAULT;
-
-	if (hdr.argsz < minsz || hdr.flags)
-		return -EINVAL;
-
-	/* Can we do a slot or bus reset or neither? */
-	if (!pci_probe_reset_slot(vdev->pdev->slot))
-		slot = true;
-	else if (pci_probe_reset_bus(vdev->pdev->bus))
-		return -ENODEV;
-
 	/*
 	 * We can't let userspace give us an arbitrarily large buffer to copy,
 	 * so verify how many we think there could be.  Note groups can have
@@ -1288,11 +1278,11 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 		return ret;
 
 	/* Somewhere between 1 and count is OK */
-	if (!hdr.count || hdr.count > count)
+	if (!hdr->count || hdr->count > count)
 		return -EINVAL;
 
-	group_fds = kcalloc(hdr.count, sizeof(*group_fds), GFP_KERNEL);
-	files = kcalloc(hdr.count, sizeof(*files), GFP_KERNEL);
+	group_fds = kcalloc(hdr->count, sizeof(*group_fds), GFP_KERNEL);
+	files = kcalloc(hdr->count, sizeof(*files), GFP_KERNEL);
 	if (!group_fds || !files) {
 		kfree(group_fds);
 		kfree(files);
@@ -1300,7 +1290,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 	}
 
 	if (copy_from_user(group_fds, arg->group_fds,
-			   hdr.count * sizeof(*group_fds))) {
+			   hdr->count * sizeof(*group_fds))) {
 		kfree(group_fds);
 		kfree(files);
 		return -EFAULT;
@@ -1311,7 +1301,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 	 * interface and store the group and iommu ID.  This ensures the group
 	 * is held across the reset.
 	 */
-	for (file_idx = 0; file_idx < hdr.count; file_idx++) {
+	for (file_idx = 0; file_idx < hdr->count; file_idx++) {
 		struct file *file = fget(group_fds[file_idx]);
 
 		if (!file) {
@@ -1335,10 +1325,10 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 	if (ret)
 		goto hot_reset_release;
 
-	info.count = hdr.count;
+	info.count = hdr->count;
 	info.files = files;
 
-	ret = vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, &info);
+	ret = vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, &info, NULL);
 
 hot_reset_release:
 	for (file_idx--; file_idx >= 0; file_idx--)
@@ -1348,6 +1338,50 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 	return ret;
 }
 
+static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
+					struct vfio_pci_hot_reset __user *arg)
+{
+	unsigned long minsz = offsetofend(struct vfio_pci_hot_reset, count);
+	struct vfio_pci_hot_reset hdr;
+	struct iommufd_ctx *iommufd;
+	bool slot = false;
+	struct fd f;
+	int32_t fd;
+	int ret;
+
+	if (copy_from_user(&hdr, arg, minsz))
+		return -EFAULT;
+
+	if (hdr.argsz < minsz || hdr.flags)
+		return -EINVAL;
+
+	/* Can we do a slot or bus reset or neither? */
+	if (!pci_probe_reset_slot(vdev->pdev->slot))
+		slot = true;
+	else if (pci_probe_reset_bus(vdev->pdev->bus))
+		return -ENODEV;
+
+	if (hdr.count != 1)
+		return vfio_pci_ioctl_pci_hot_reset_groups(vdev, &hdr, arg);
+
+	if (copy_from_user(&fd, arg->group_fds, sizeof(fd)))
+		return -EFAULT;
+
+	f = fdget(fd);
+	if (!f.file)
+		return -EBADF;
+	iommufd = iommufd_ctx_from_file(f.file);
+	if (IS_ERR(iommufd)) {
+		fdput(f);
+		return vfio_pci_ioctl_pci_hot_reset_groups(vdev, &hdr, arg);
+	}
+	fdput(f);
+
+	ret = vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, NULL, iommufd);
+	iommufd_ctx_put(iommufd);
+	return ret;
+}
+
 static int vfio_pci_ioctl_ioeventfd(struct vfio_pci_core_device *vdev,
 				    struct vfio_device_ioeventfd __user *arg)
 {
@@ -2317,6 +2351,9 @@ static bool vfio_dev_in_groups(struct vfio_pci_core_device *vdev,
 {
 	unsigned int i;
 
+	if (!groups)
+		return false;
+
 	for (i = 0; i < groups->count; i++)
 		if (vfio_file_has_dev(groups->files[i], &vdev->vdev))
 			return true;
@@ -2398,7 +2435,8 @@ static int vfio_pci_dev_set_pm_runtime_get(struct vfio_device_set *dev_set)
  * get each memory_lock.
  */
 static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
-				      struct vfio_pci_group_info *groups)
+				      struct vfio_pci_group_info *groups,
+				      struct iommufd_ctx *iommufd_ctx)
 {
 	struct vfio_pci_core_device *cur_mem;
 	struct vfio_pci_core_device *cur_vma;
@@ -2432,7 +2470,8 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 		 * Test whether all the affected devices are contained by the
 		 * set of groups provided by the user.
 		 */
-		if (!vfio_dev_in_groups(cur_vma, groups)) {
+		if (!vfio_dev_in_groups(cur_vma, groups) &&
+		    !iommufd_ctx_has_device(iommufd_ctx, &cur_vma->pdev->dev)) {
 			ret = -EINVAL;
 			goto err_undo;
 		}
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 650d45629647a7..1f58673701cb1e 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -58,6 +58,7 @@ void iommufd_access_unpin_pages(struct iommufd_access *access,
 int iommufd_access_rw(struct iommufd_access *access, unsigned long iova,
 		      void *data, size_t len, unsigned int flags);
 int iommufd_vfio_compat_ioas_id(struct iommufd_ctx *ictx, u32 *out_ioas_id);
+bool iommufd_ctx_has_device(struct iommufd_ctx *ictx, struct device *dev);
 #else /* !CONFIG_IOMMUFD */
 static inline struct iommufd_ctx *iommufd_ctx_from_file(struct file *file)
 {
@@ -94,5 +95,12 @@ static inline int iommufd_vfio_compat_ioas_id(struct iommufd_ctx *ictx,
 {
 	return -EOPNOTSUPP;
 }
+
+static inline bool iommufd_ctx_has_device(struct iommufd_ctx *ictx,
+					  struct device *dev)
+{
+	return false;
+}
+
 #endif /* CONFIG_IOMMUFD */
 #endif

Tian, Kevin Feb. 23, 2023, 7:55 a.m. UTC | #4

> From: Jason Gunthorpe
> Sent: Thursday, February 23, 2023 1:18 AM
> 
> > > static bool vfio_dev_in_groups(struct vfio_pci_core_device *vdev,
> > >                                struct vfio_pci_group_info *groups)
> > > {
> > >  	unsigned int i;
> > >
> > > 	for (i = 0; i < groups->count; i++)
> > > 		if (vfio_file_has_dev(groups->files[i], &vdev->vdev))
> > > 			return true;
> > > 	return false;
> > > }
> > >
> > > Presumably when cdev fd is provided above should compare iommu
> > > group of the fd and that of the vdev. Otherwise it expects the user
> > > to have full access to every device in the set which is impractical.
> 
> No, it should check the dev's directly, userspace has to provide every
> dev in the dev set to do a reset. We should not allow userspace to
> take a shortcut based on hidden group stuff.
> 
> The dev set is already unrelated to the groups, and userspace cannot
> discover the devset, so nothing has changed.

Agree. But I envision there might be a user-visible impact.

Say a scenario where group happens to overlap with devset. Let's say
two devices in the group/devset.

An existing deployment assigns only dev1 to Qemu. In this case dev1
is resettable via group fd given dev2 cannot be opened by another
user.

Now the admin upgrades Qemu to a newer version incorporating
cdev and your change. Then immediately dev1 cannot be reset
since dev2 is not opened by this Qemu.

Do we consider it as a regression? Or is the answer to ask the user
to upgrade the mgmt stack?
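
The difference between the two ownership proofs in this scenario can be
sketched in userspace as follows. This is only a toy model: struct cmp_dev
and both helpers are hypothetical, and the group-level helper covers only
the simple case where the user holds a single group.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct cmp_dev {
	int group_id;		/* IOMMU group the device belongs to */
	bool user_opened;	/* has this user opened the device itself? */
};

/* Group-level proof: every device in the set belongs to the held group. */
bool covered_by_group(const struct cmp_dev *set, size_t n, int held_group)
{
	size_t i;

	assert(set);
	for (i = 0; i < n; i++)
		if (set[i].group_id != held_group)
			return false;
	return true;
}

/* Strict device-level proof: every device must be opened by this user. */
bool covered_by_devices(const struct cmp_dev *set, size_t n)
{
	size_t i;

	assert(set);
	for (i = 0; i < n; i++)
		if (!set[i].user_opened)
			return false;
	return true;
}
```

For a dev_set of { dev1 opened, dev2 not opened }, both in group 1, the
group-level check succeeds while the strict device-level check fails,
which is the behavior change in question.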

> 
> This is looking worse to me. I think we should not require userspace
> to pass in lists of devices here. The simpler solution is to just take
> in a single iommufd and use that as the ownership proof. Something
> like the below.
> 

As you said the dev set info is not exposed to the admin today. It's
only available via VFIO_DEVICE_GET_PCI_HOT_RESET_INFO after a
device is opened.

My question is more on whether in real deployments the mgmt
stack always tries to identify the reset dependency indirectly (is there
a reliable way?) and assign all relevant devices to one VM. If it's not
the case, then this change (requiring user to open all devices in the
dev set) can certainly cause regression in those deployments because
old group-level check covers more devices hence has higher possibility
of being resettable than what your change implies.

Alex probably has more insight into this open usage question.

Thanks
Kevin

Jason Gunthorpe Feb. 23, 2023, 1:21 p.m. UTC | #5

On Thu, Feb 23, 2023 at 07:55:21AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe
> > Sent: Thursday, February 23, 2023 1:18 AM
> > 
> > > > static bool vfio_dev_in_groups(struct vfio_pci_core_device *vdev,
> > > >                                struct vfio_pci_group_info *groups)
> > > > {
> > > >  	unsigned int i;
> > > >
> > > > 	for (i = 0; i < groups->count; i++)
> > > > 		if (vfio_file_has_dev(groups->files[i], &vdev->vdev))
> > > > 			return true;
> > > > 	return false;
> > > > }
> > > >
> > > > Presumably when cdev fd is provided above should compare iommu
> > > > group of the fd and that of the vdev. Otherwise it expects the user
> > > > to have full access to every device in the set which is impractical.
> > 
> > No, it should check the dev's directly, userspace has to provide every
> > dev in the dev set to do a reset. We should not allow userspace to
> > take a shortcut based on hidden group stuff.
> > 
> > The dev set is already unrelated to the groups, and userspace cannot
> > discover the devset, so nothing has changed.
> 
> Agree. But I envision there might be a user-visible impact.
> 
> Say a scenario where group happens to overlap with devset. Let's say
> two devices in the group/devset.
> 
> An existing deployment assigns only dev1 to Qemu. In this case dev1
> is resettable via group fd given dev2 cannot be opened by another
> user.

Oh, that is just because we took a shortcut in this logic and assumed
that if the group is open then all the devices are opened by the same
security domain.

But we can also more clearly state that any closed device is
acceptable for reset and doesn't need to be presented.

So, like this:

		if (cur_vma->vdev.open_count &&
		    !vfio_dev_in_groups(cur_vma, groups) &&
		    !iommufd_ctx_has_device(iommufd_ctx, &cur_vma->pdev->dev)) {
			ret = -EINVAL;
			goto err_undo;
		}

Jason
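
As a rough userspace model of the condition above (a sketch, not the
kernel implementation): a device blocks the reset only if it is open and
is neither covered by a provided group nor bound to the caller's iommufd.
struct rst_dev and the two helpers are hypothetical names, and -1 stands
in for -EINVAL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct rst_dev {
	unsigned int open_count;	/* 0 means nobody has the device open */
	bool in_groups;			/* covered by a provided group fd */
	bool bound_to_ctx;		/* bound to the caller's iommufd */
};

/* A device blocks the reset only if it is open and not provably the caller's. */
bool rst_dev_blocks_reset(const struct rst_dev *dev)
{
	assert(dev);
	return dev->open_count && !dev->in_groups && !dev->bound_to_ctx;
}

/* Returns 0 if the whole set may be reset, -1 otherwise. */
int rst_check_dev_set(const struct rst_dev *set, size_t ndevs)
{
	size_t i;

	for (i = 0; i < ndevs; i++)
		if (rst_dev_blocks_reset(&set[i]))
			return -1;
	return 0;
}
```

A set with one bound device and one closed device passes; a set where the
second device is open but unowned by the caller is rejected.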

Tian, Kevin Feb. 24, 2023, 2:21 a.m. UTC | #6

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, February 23, 2023 9:22 PM
> 
> On Thu, Feb 23, 2023 at 07:55:21AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe
> > > Sent: Thursday, February 23, 2023 1:18 AM
> > >
> > > > > static bool vfio_dev_in_groups(struct vfio_pci_core_device *vdev,
> > > > >                                struct vfio_pci_group_info *groups)
> > > > > {
> > > > >  	unsigned int i;
> > > > >
> > > > > 	for (i = 0; i < groups->count; i++)
> > > > > 		if (vfio_file_has_dev(groups->files[i], &vdev->vdev))
> > > > > 			return true;
> > > > > 	return false;
> > > > > }
> > > > >
> > > > > Presumably when cdev fd is provided above should compare iommu
> > > > > group of the fd and that of the vdev. Otherwise it expects the user
> > > > > to have full access to every device in the set which is impractical.
> > >
> > > No, it should check the dev's directly, userspace has to provide every
> > > dev in the dev set to do a reset. We should not allow userspace to
> > > take a shortcut based on hidden group stuff.
> > >
> > > The dev set is already unrelated to the groups, and userspace cannot
> > > discover the devset, so nothing has changed.
> >
> > Agree. But I envision there might be a user-visible impact.
> >
> > Say a scenario where group happens to overlap with devset. Let's say
> > two devices in the group/devset.
> >
> > An existing deployment assigns only dev1 to Qemu. In this case dev1
> > is resettable via group fd given dev2 cannot be opened by another
> > user.
> 
> Oh, that is just because we took a shortcut in this logic and assumed
> that if the group is open then all the devices are opened by the same
> security domain.
> 
> But we can also more clearly state that any closed device is
> acceptable for reset and doesn't need to be presented.
> 
> So, like this:
> 
> 		if (cur_vma->vdev.open_count &&
> 		    !vfio_dev_in_groups(cur_vma, groups) &&
> 		    !iommufd_ctx_has_device(iommufd_ctx, &cur_vma->pdev->dev)) {
> 			ret = -EINVAL;
> 			goto err_undo;
> 		}
> 

Yes, this makes sense.

Yi, while you are incorporating this change please also update the
uapi header. Rename 'group_fds[]' to 'fds[]' and add comment to
explain that it could be an array of group fds or a single iommufd.

Jason Gunthorpe Feb. 24, 2023, 2:36 a.m. UTC | #7

On Fri, Feb 24, 2023 at 02:21:33AM +0000, Tian, Kevin wrote:

> Yi, while you are incorporating this change please also update the
> uapi header. Rename 'group_fds[]' to 'fds[]' and add comment to
> explain that it could be an array of group fds or a single iommufd.

Upon reflection we can probably make it even simpler and just have a 0
length fd array mean to use the iommufd the vfio_device is already
associated with

And the check for correctness can be simplified to simply see if each
vfio_device in the dev_set is attached to the same iommufd ctx
pointer instead of searching the xarray.

Would need to double check that the locking is OK but seems doable

Jason

Tian, Kevin Feb. 24, 2023, 2:48 a.m. UTC | #8

> From: Jason Gunthorpe
> Sent: Friday, February 24, 2023 10:36 AM
> 
> On Fri, Feb 24, 2023 at 02:21:33AM +0000, Tian, Kevin wrote:
> 
> > Yi, while you are incorporating this change please also update the
> > uapi header. Rename 'group_fds[]' to 'fds[]' and add comment to
> > explain that it could be an array of group fds or a single iommufd.
> 
> Upon reflection we can probably make it even simpler and just have a 0
> length fd array mean to use the iommufd the vfio_device is already
> associated with
> 
> And the check for correctness can be simplified to simply see if each
> vfio_device in the dev_set is attached to the same iommufd ctx
> pointer instead of searching the xarray.

yes, this is simpler

> 
> Would need to double check that the locking is OK but seems doable
> 

Locking is fine since dev_set->lock is already held in the reset path.

Yi Liu Feb. 24, 2023, 3:43 a.m. UTC | #9

> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Friday, February 24, 2023 10:48 AM
> 
> > From: Jason Gunthorpe
> > Sent: Friday, February 24, 2023 10:36 AM
> >
> > On Fri, Feb 24, 2023 at 02:21:33AM +0000, Tian, Kevin wrote:
> >
> > > Yi, while you are incorporating this change please also update the
> > > uapi header. Rename 'group_fds[]' to 'fds[]' and add comment to
> > > explain that it could be an array of group fds or a single iommufd.
> >
> > Upon reflection we can probably make it even simpler and just have a 0
> > length fd array mean to use the iommufd the vfio_device is already
> > associated with
> >
> > And the check for correctness can be simplified to simply see if each
> > vfio_device in the dev_set is attached to the same iommufd ctx
> > pointer instead of searching the xarray.

How about the hot reset info path? We can still keep reporting the
current information to userspace, can't we?

Another tricky question: if the user passes an iommufd down for reset
in the vfio iommufd compatible mode, should we support it as
well?

> yes, this is simpler
> 
> >
> > Would need to double check that the locking is OK but seems doable
> >
> 
> Locking is fine since dev_set->lock is already held in the reset path.

dev_set->lock is held prior to calling bind_iommufd, so I agree locking
is ok.

Regards,
Yi Liu

Tian, Kevin Feb. 24, 2023, 3:56 a.m. UTC | #10

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Friday, February 24, 2023 11:44 AM
> > > Upon reflection we can probably make it even simpler and just have a 0
> > > length fd array mean to use the iommufd the vfio_device is already
> > > associated with
> > >
> > > And the check for correctness can be simplified to simply see if each
> > > vfio_device in the dev_set is attached to the same iommufd ctx
> > > pointer instead of searching the xarray.
> 
> How about the hot reset info path? We can still keep reporting the
> current information to userspace. Isn't it?

No need to change that. It's already reported per device.

> 
> another tricky question. If user passes iommufd down for reset
> in the vfio iommufd compatible mode, should we support it as
> well?
> 

I don't see why we want to ban it. It does change the result from
error (vfio container) to success (iommufd vfio-compat) when using
the container fd/iommufd. But do we actually have a use case
relying on such error pattern?

On the other hand a user who knows the presence of vfio-compat
should be allowed to pass iommufd to reset even when it still uses
the legacy group/container interfaces.

Yi Liu Feb. 24, 2023, 5:09 a.m. UTC | #11

> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Friday, February 24, 2023 11:57 AM
> 
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Friday, February 24, 2023 11:44 AM
> > > > Upon reflection we can probably make it even simpler and just have a 0
> > > > length fd array mean to use the iommufd the vfio_device is already
> > > > associated with
> > > >
> > > > And the check for correctness can be simplified to simply see if each
> > > > vfio_device in the dev_set is attached to the same iommufd ctx
> > > > pointer instead of searching the xarray.
> >
> > How about the hot reset info path? We can still keep reporting the
> > current information to userspace. Isn't it?
> 
> No need to change that. It's already reported per device.
> 
> >
> > another tricky question. If user passes iommufd down for reset
> > in the vfio iommufd compatible mode, should we support it as
> > well?
> >
> 
> I don't see why we want to ban it. It does change the result from
> error (vfio container) to success (iommufd vfio-compat) when using
> the container fd/iommufd. But do we actually have a use case
> relying on such error pattern?
>
> On the other hand a user who knows the presence of vfio-compat
> should be allowed to pass iommufd to reset even when it still uses
> the legacy group/container interfaces.

Yes, although I guess no user would do such a strange thing without a
special reason.

Jason Gunthorpe Feb. 24, 2023, 2:30 p.m. UTC | #12

On Fri, Feb 24, 2023 at 03:43:37AM +0000, Liu, Yi L wrote:
> > From: Tian, Kevin <kevin.tian@intel.com>
> > Sent: Friday, February 24, 2023 10:48 AM
> > 
> > > From: Jason Gunthorpe
> > > Sent: Friday, February 24, 2023 10:36 AM
> > >
> > > On Fri, Feb 24, 2023 at 02:21:33AM +0000, Tian, Kevin wrote:
> > >
> > > > Yi, while you are incorporating this change please also update the
> > > > uapi header. Rename 'group_fds[]' to 'fds[]' and add comment to
> > > > explain that it could be an array of group fds or a single iommufd.
> > >
> > > Upon reflection we can probably make it even simpler and just have a 0
> > > length fd array mean to use the iommufd the vfio_device is already
> > > associated with
> > >
> > > And the check for correctness can be simplified to simply see if each
> > > vfio_device in the dev_set is attached to the same iommufd ctx
> > > pointer instead of searching the xarray.
> 
> How about the hot reset info path? We can still keep reporting the
> current information to userspace. Isn't it?

Yeah, but I wonder if it is useful
 
> another tricky question. If user passes iommufd down for reset
> in the vfio iommufd compatible mode, should we support it as
> well?

I would say if the 0 fds mode is used and the current vfio_device does
not have an iommufd ctx then fail.

That is the only requirement; how it got that ctx doesn't matter.

> > Locking is fine since dev_set->lock is already held in the reset path.
> 
> dev_set->lock is held prior to call bind_iommufd, so I agree locking
> is ok.

As long as the vdev's iommufd ctx and opencount cannot change under
the devset lock, which I think is the case. It should be documented
though in the vfio core code, as it is a bit subtle what the devset
lock actually covers.

Jason
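
Putting the points made so far together, the zero-length fd array mode
could be modeled in userspace roughly as below. This is a sketch under
assumptions, not the eventual kernel code: struct toy_ctx stands in for
struct iommufd_ctx, struct ctx_dev and toy_zero_fd_reset_check() are
hypothetical, and -1 stands in for an errno value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_ctx { int id; };	/* hypothetical stand-in for struct iommufd_ctx */

struct ctx_dev {
	unsigned int open_count;
	const struct toy_ctx *ctx;	/* NULL if not bound to any iommufd */
};

/*
 * The reset is requested on set[target]; ownership is proven by that
 * device's own ctx. Returns 0 on success, -1 on failure.
 */
int toy_zero_fd_reset_check(const struct ctx_dev *set, size_t ndevs,
			    size_t target)
{
	const struct toy_ctx *ctx;
	size_t i;

	assert(set && target < ndevs);
	ctx = set[target].ctx;
	if (!ctx)
		return -1;	/* requesting device has no iommufd ctx: fail */

	for (i = 0; i < ndevs; i++) {
		if (!set[i].open_count)
			continue;	/* closed devices need no ownership proof */
		if (set[i].ctx != ctx)
			return -1;	/* opened under a different (or no) ctx */
	}
	return 0;
}
```

The comparison is a plain pointer equality test, which is what makes this
simpler than searching the xarray; in the kernel it would rely on the ctx
and open count being stable under the devset lock.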

Yi Liu Feb. 26, 2023, 8:59 a.m. UTC | #13

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, February 24, 2023 10:36 AM
> 
> On Fri, Feb 24, 2023 at 02:21:33AM +0000, Tian, Kevin wrote:
> 
> > Yi, while you are incorporating this change please also update the
> > uapi header. Rename 'group_fds[]' to 'fds[]' and add comment to
> > explain that it could be an array of group fds or a single iommufd.
> 
> Upon reflection we can probably make it even simpler and just have a 0
> length fd array mean to use the iommufd the vfio_device is already
> associated with
> 
> And the check for correctness can be simplified to simply see if each
> vfio_device in the dev_set is attached to the same iommufd ctx
> pointer instead of searching the xarray.

Sorry, it appears to me the below concern is not solved as above logic
still requires userspace to open and bind devices to the same iommufd.

"
  > Say a scenario where group happens to overlap with devset. Let's say
  > two devices in the group/devset.
  > 
  > An existing deployment assigns only dev1 to Qemu. In this case dev1
  > is resettable via group fd given dev2 cannot be opened by another
  > user.
"

Thus, I think we still need to search the xarray to check whether a device
is bound to the iommufd or not. And this check needs to be more relaxed.
E.g. dev1 is bound to the iommufd, but dev2 is not. However, they are in
the same group, so dev2 should be considered "bound" as well.
When a 0-length fd array is used, vfio just gets the iommufd_ctx from the
device that is to be reset, then checks whether all other devices in the
dev_set are considered bound, using the interface below.

+/**
+ * iommufd_ctx_has_group_for_device - True if any alias of struct device
+ *                                    is bound to this ictx
+ * @ictx: iommufd file descriptor
+ * @dev: Pointer to a physical device struct
+ *
+ * True if a iommufd_device_bind() is present for any alias of this dev
+ */
+bool iommufd_ctx_has_group_for_device(struct iommufd_ctx *ictx, struct device *dev)
+{
+	unsigned long index;
+	struct iommu_group *group;
+	struct iommufd_object *obj;
+
+	if (!ictx)
+		return false;
+
+	group = iommu_group_get(dev);
+	if (!group)
+		return false;
+
+	xa_lock(&ictx->objects);
+	xa_for_each(&ictx->objects, index, obj) {
+		if (obj->type == IOMMUFD_OBJ_DEVICE &&
+		    container_of(obj, struct iommufd_device, obj)->group == group) {
+			xa_unlock(&ictx->objects);
+			iommu_group_put(group);
+			return true;
+		}
+	}
+	xa_unlock(&ictx->objects);
+	iommu_group_put(group);
+	return false;
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_ctx_has_group_for_device, IOMMUFD);

Regards
Yi Liu

Jason Gunthorpe Feb. 26, 2023, 11:40 p.m. UTC | #14

On Sun, Feb 26, 2023 at 08:59:01AM +0000, Liu, Yi L wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Friday, February 24, 2023 10:36 AM
> > 
> > On Fri, Feb 24, 2023 at 02:21:33AM +0000, Tian, Kevin wrote:
> > 
> > > Yi, while you are incorporating this change please also update the
> > > uapi header. Rename 'group_fds[]' to 'fds[]' and add comment to
> > > explain that it could be an array of group fds or a single iommufd.
> > 
> > Upon reflection we can probably make it even simpler and just have a 0
> > length fd array mean to use the iommufd the vfio_device is already
> > associated with
> > 
> > And the check for correctness can be simplified to simply see if each
> > vfio_device in the dev_set is attached to the same iommufd ctx
> > pointer instead of searching the xarray.
> 
> Sorry, it appears to me the below concern is not solved as above logic
> still requires userspace to open and bind devices to the same iommufd.
> 
> "
>   > Say a scenario where group happens to overlap with devset. Let's say
>   > two devices in the group/devset.
>   > 
>   > An existing deployment assigns only dev1 to Qemu. In this case dev1
>   > is resettable via group fd given dev2 cannot be opened by another
>   > user.
> "

You solve this by checking for a 0 open count as already discussed.

Jason

Yi Liu Feb. 27, 2023, 2:53 a.m. UTC | #15

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, February 27, 2023 7:40 AM
> On Sun, Feb 26, 2023 at 08:59:01AM +0000, Liu, Yi L wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Friday, February 24, 2023 10:36 AM
> > >
> > > On Fri, Feb 24, 2023 at 02:21:33AM +0000, Tian, Kevin wrote:
> > >
> > > > Yi, while you are incorporating this change please also update the
> > > > uapi header. Rename 'group_fds[]' to 'fds[]' and add comment to
> > > > explain that it could be an array of group fds or a single iommufd.
> > >
> > > Upon reflection we can probably make it even simpler and just have a 0
> > > length fd array mean to use the iommufd the vfio_device is already
> > > associated with
> > >
> > > And the check for correctness can be simplified to simply see if each
> > > vfio_device in the dev_set is attached to the same iommufd ctx
> > > pointer instead of searching the xarray.
> >
> > Sorry, it appears to me the below concern is not solved as above logic
> > still requires userspace to open and bind devices to the same iommufd.
> >
> > "
> >   > Say a scenario where group happens to overlap with devset. Let's say
> >   > two devices in the group/devset.
> >   >
> >   > An existing deployment assigns only dev1 to Qemu. In this case dev1
> >   > is resettable via group fd given dev2 cannot be opened by another
> >   > user.
> > "
> 
> You solve this by checking for a 0 open count as already discussed.

Ok, this scenario is solved. So if open_count is non-zero, the device
should be either bound to the iommufd or within the groups, as in your
suggestion below.

		if (cur_vma->vdev.open_count &&
		    !vfio_dev_in_groups(cur_vma, groups) &&
		    !iommufd_ctx_has_device(iommufd_ctx, &cur_vma->pdev->dev)) {
			ret = -EINVAL;
			goto err_undo;
		}

Btw, in the cdev path, open_count++ and the iommufd bind are done within a
single dev_set->lock lock/unlock pair, so if cur_vma->vdev has an
iommufd_ctx, then its open_count is non-zero. I had another scenario in
mind: dev1 and dev2 are from different groups but happen to be in
the same dev_set, and userspace only opens dev1. This logic will allow
userspace to reset dev1, but dev2 may be opened by another userspace.
This seemed to be a problem in my prior thinking. However, dev_set->lock
is held in the reset path, so another userspace cannot open and bind
cur_vma->vdev to an iommufd during the reset.
