diff mbox series

[v1,9/9] smaples: add vfio-mdev-pci driver

Message ID 1560000071-3543-10-git-send-email-yi.l.liu@intel.com (mailing list archive)
State New, archived
Headers show
Series vfio_pci: wrap pci device as a mediated device | expand

Commit Message

Yi Liu June 8, 2019, 1:21 p.m. UTC
This patch adds sample driver named vfio-mdev-pci. It is to wrap
a PCI device as a mediated device. For a pci device, once bound
to vfio-mdev-pci driver, user space access of this device will
go through vfio mdev framework. The usage of the device follows
mdev management method. e.g. user should create a mdev before
exposing the device to user-space.

Benefit of this new driver would be acting as a sample driver
for recent changes from "vfio/mdev: IOMMU aware mediated device"
patchset. Also it could be a good experiment driver for future
device specific mdev migration support.

To use this driver:
a) build and load vfio-mdev-pci.ko module
   execute "make menuconfig" and config CONFIG_SAMPLE_VFIO_MDEV_PCI
   then load it with following command
   > sudo modprobe vfio
   > sudo modprobe vfio-pci
   > sudo insmod drivers/vfio/pci/vfio-mdev-pci.ko

b) unbind original device driver
   e.g. use following command to unbind its original driver
   > echo $dev_bdf > /sys/bus/pci/devices/$dev_bdf/driver/unbind

c) bind vfio-mdev-pci driver to the physical device
   > echo $vend_id $dev_id > /sys/bus/pci/drivers/vfio-mdev-pci/new_id

d) check the supported mdev instances
   > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/
     vfio-mdev-pci-type1
   > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\
     vfio-mdev-pci-type1/
     available_instances  create  device_api  devices  name

e)  create mdev on this physical device (only 1 instance)
   > echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1003" > \
     /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\
     vfio-mdev-pci-type1/create

f) passthru the mdev to guest
   add the following line in Qemu boot command
   -device vfio-pci,\
    sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003

g) destroy mdev
   > echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003/\
     remove

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/pci/Makefile        |   6 +
 drivers/vfio/pci/vfio_mdev_pci.c | 403 +++++++++++++++++++++++++++++++++++++++
 samples/Kconfig                  |  11 ++
 3 files changed, 420 insertions(+)
 create mode 100644 drivers/vfio/pci/vfio_mdev_pci.c

Comments

Alex Williamson June 20, 2019, 4:26 a.m. UTC | #1
On Sat,  8 Jun 2019 21:21:11 +0800
Liu Yi L <yi.l.liu@intel.com> wrote:

> This patch adds sample driver named vfio-mdev-pci. It is to wrap
> a PCI device as a mediated device. For a pci device, once bound
> to vfio-mdev-pci driver, user space access of this device will
> go through vfio mdev framework. The usage of the device follows
> mdev management method. e.g. user should create a mdev before
> exposing the device to user-space.
> 
> Benefit of this new driver would be acting as a sample driver
> for recent changes from "vfio/mdev: IOMMU aware mediated device"
> patchset. Also it could be a good experiment driver for future
> device specific mdev migration support.
> 
> To use this driver:
> a) build and load vfio-mdev-pci.ko module
>    execute "make menuconfig" and config CONFIG_SAMPLE_VFIO_MDEV_PCI
>    then load it with following command
>    > sudo modprobe vfio
>    > sudo modprobe vfio-pci
>    > sudo insmod drivers/vfio/pci/vfio-mdev-pci.ko  
> 
> b) unbind original device driver
>    e.g. use following command to unbind its original driver
>    > echo $dev_bdf > /sys/bus/pci/devices/$dev_bdf/driver/unbind  
> 
> c) bind vfio-mdev-pci driver to the physical device
>    > echo $vend_id $dev_id > /sys/bus/pci/drivers/vfio-mdev-pci/new_id  
> 
> d) check the supported mdev instances
>    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/  
>      vfio-mdev-pci-type1
>    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\  
>      vfio-mdev-pci-type1/
>      available_instances  create  device_api  devices  name


I think the static type name here is a problem (and why does it
include "type1"?).  We generally consider that a type defines a
software compatible mdev, but in this case any PCI device wrapped in
vfio-mdev-pci gets the same mdev type.  This is only a sample driver,
but that's a bad precedent.  I've taken a stab at fixing this in the
patch below, using the PCI vendor ID, device ID, subsystem vendor ID,
subsystem device ID, class code, and revision to try to make the type
as specific to the physical device assigned as we can through PCI.

> 
> e)  create mdev on this physical device (only 1 instance)
>    > echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1003" > \  
>      /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\
>      vfio-mdev-pci-type1/create

Whoops, available_instances always reports 1 and it doesn't appear that
the create function prevents additional mdevs.  Also addressed in the
patch below.
 
> f) passthru the mdev to guest
>    add the following line in Qemu boot command
>    -device vfio-pci,\
>     sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003
> 
> g) destroy mdev
>    > echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003/\  
>      remove
> 

I also found that unbinding the parent device doesn't unregister with
mdev, so it cannot be bound again, also fixed below.

However, the patch below just makes the mdev interface behave
correctly, I can't make it work on my system because commit
7bd50f0cd2fd ("vfio/type1: Add domain at(de)taching group helpers")
used iommu_attach_device() rather than iommu_attach_group() for non-aux
mdev iommu_device.  Is there a requirement that the mdev parent device
is in a singleton iommu group?  If this is a simplification, then
vfio-mdev-pci should not bind to devices where this is violated since
there's no way to use the device.  Can we support it though?

If I have two devices in the same group and bind them both to
vfio-mdev-pci, I end up with three groups, one for each mdev device and
the original physical device group.  vfio.c works with the mdev groups
and will try to match both groups to the container.  vfio_iommu_type1.c
also works with the mdev groups, except for the point where we actually
try to attach a group to a domain, which is the only window where we use
the iommu_device rather than the provided group, but we don't record
that anywhere.  Should struct vfio_group have a pointer to a reference
counted object that tracks the actual iommu_group attached, such that
we can determine that the group is already attached to the domain and
not try to attach again?  Ideally I'd be able to bind one device to
vfio-pci, the other to vfio-mdev-pci, and be able to use them both
within the same container.  It seems like this should be possible, it's
the same effective iommu configuration as if they were both bound to
vfio-pci.  Thanks,

Alex

diff --git a/drivers/vfio/pci/vfio_mdev_pci.c b/drivers/vfio/pci/vfio_mdev_pci.c
index 07c8067b3f73..09143d3e5473 100644
--- a/drivers/vfio/pci/vfio_mdev_pci.c
+++ b/drivers/vfio/pci/vfio_mdev_pci.c
@@ -65,18 +65,22 @@ MODULE_PARM_DESC(disable_idle_d3,
 
 static struct pci_driver vfio_mdev_pci_driver;
 
-static ssize_t
-name_show(struct kobject *kobj, struct device *dev, char *buf)
-{
-	return sprintf(buf, "%s-type1\n", dev_name(dev));
-}
-
-MDEV_TYPE_ATTR_RO(name);
+struct vfio_mdev_pci_device {
+	struct vfio_pci_device vdev;
+	struct mdev_parent_ops ops;
+	struct attribute_group *groups[2];
+	struct attribute_group attr;
+	atomic_t avail;
+};
 
 static ssize_t
 available_instances_show(struct kobject *kobj, struct device *dev, char *buf)
 {
-	return sprintf(buf, "%d\n", 1);
+	struct vfio_mdev_pci_device *vmdev;
+
+	vmdev = pci_get_drvdata(to_pci_dev(dev));
+
+	return sprintf(buf, "%d\n", atomic_read(&vmdev->avail));
 }
 
 MDEV_TYPE_ATTR_RO(available_instances);
@@ -90,62 +94,57 @@ static ssize_t device_api_show(struct kobject *kobj, struct device *dev,
 MDEV_TYPE_ATTR_RO(device_api);
 
 static struct attribute *vfio_mdev_pci_types_attrs[] = {
-	&mdev_type_attr_name.attr,
 	&mdev_type_attr_device_api.attr,
 	&mdev_type_attr_available_instances.attr,
 	NULL,
 };
 
-static struct attribute_group vfio_mdev_pci_type_group1 = {
-	.name  = "type1",
-	.attrs = vfio_mdev_pci_types_attrs,
-};
-
-struct attribute_group *vfio_mdev_pci_type_groups[] = {
-	&vfio_mdev_pci_type_group1,
-	NULL,
-};
-
 struct vfio_mdev_pci {
 	struct vfio_pci_device *vdev;
 	struct mdev_device *mdev;
-	unsigned long handle;
 };
 
 static int vfio_mdev_pci_create(struct kobject *kobj, struct mdev_device *mdev)
 {
 	struct device *pdev;
-	struct vfio_pci_device *vdev;
+	struct vfio_mdev_pci_device *vmdev;
 	struct vfio_mdev_pci *pmdev;
 	int ret;
 
 	pdev = mdev_parent_dev(mdev);
-	vdev = dev_get_drvdata(pdev);
+	vmdev = dev_get_drvdata(pdev);
+
+	if (atomic_dec_if_positive(&vmdev->avail) < 0)
+		return -ENOSPC;
+
 	pmdev = kzalloc(sizeof(struct vfio_mdev_pci), GFP_KERNEL);
-	if (pmdev == NULL) {
-		ret = -EBUSY;
-		goto out;
-	}
+	if (!pmdev)
+		return -ENOMEM;
 
 	pmdev->mdev = mdev;
-	pmdev->vdev = vdev;
+	pmdev->vdev = &vmdev->vdev;
 	mdev_set_drvdata(mdev, pmdev);
 	ret = mdev_set_iommu_device(mdev_dev(mdev), pdev);
 	if (ret) {
 		pr_info("%s, failed to config iommu isolation for mdev: %s on pf: %s\n",
 			__func__, dev_name(mdev_dev(mdev)), dev_name(pdev));
-		goto out;
+		kfree(pmdev);
+		atomic_inc(&vmdev->avail);
+		return ret;
 	}
 
-out:
-	return ret;
+	return 0;
 }
 
 static int vfio_mdev_pci_remove(struct mdev_device *mdev)
 {
 	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
+	struct vfio_mdev_pci_device *vmdev;
+
+	vmdev = container_of(pmdev->vdev, struct vfio_mdev_pci_device, vdev);
 
 	kfree(pmdev);
+	atomic_inc(&vmdev->avail);
 	pr_info("%s, succeeded for mdev: %s\n", __func__,
 		     dev_name(mdev_dev(mdev)));
 
@@ -237,24 +236,12 @@ static ssize_t vfio_mdev_pci_write(struct mdev_device *mdev,
 	return vfio_pci_write(pmdev->vdev, (char __user *)buf, count, ppos);
 }
 
-static const struct mdev_parent_ops vfio_mdev_pci_ops = {
-	.supported_type_groups	= vfio_mdev_pci_type_groups,
-	.create			= vfio_mdev_pci_create,
-	.remove			= vfio_mdev_pci_remove,
-
-	.open			= vfio_mdev_pci_open,
-	.release		= vfio_mdev_pci_release,
-
-	.read			= vfio_mdev_pci_read,
-	.write			= vfio_mdev_pci_write,
-	.mmap			= vfio_mdev_pci_mmap,
-	.ioctl			= vfio_mdev_pci_ioctl,
-};
-
 static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
 				       const struct pci_device_id *id)
 {
+	struct vfio_mdev_pci_device *vmdev;
 	struct vfio_pci_device *vdev;
+	const struct mdev_parent_ops *ops;
 	int ret;
 
 	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
@@ -273,10 +260,38 @@ static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
 		return -EBUSY;
 	}
 
-	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
-	if (!vdev)
+	vmdev = kzalloc(sizeof(*vmdev), GFP_KERNEL);
+	if (!vmdev)
 		return -ENOMEM;
 
+	vmdev->attr.name = kasprintf(GFP_KERNEL,
+				     "%04x:%04x:%04x:%04x:%06x:%02x",
+				     pdev->vendor, pdev->device,
+				     pdev->subsystem_vendor,
+				     pdev->subsystem_device, pdev->class,
+				     pdev->revision);
+	if (!vmdev->attr.name) {
+		kfree(vmdev);
+		return -ENOMEM;
+	}
+
+	atomic_set(&vmdev->avail, 1);
+
+	vmdev->attr.attrs = vfio_mdev_pci_types_attrs;
+	vmdev->groups[0] = &vmdev->attr;
+
+	vmdev->ops.supported_type_groups = vmdev->groups;
+	vmdev->ops.create = vfio_mdev_pci_create;
+	vmdev->ops.remove = vfio_mdev_pci_remove;
+	vmdev->ops.open	= vfio_mdev_pci_open;
+	vmdev->ops.release = vfio_mdev_pci_release;
+	vmdev->ops.read = vfio_mdev_pci_read;
+	vmdev->ops.write = vfio_mdev_pci_write;
+	vmdev->ops.mmap = vfio_mdev_pci_mmap;
+	vmdev->ops.ioctl = vfio_mdev_pci_ioctl;
+	ops = &vmdev->ops;
+
+	vdev = &vmdev->vdev;
 	vdev->pdev = pdev;
 	vdev->irq_type = VFIO_PCI_NUM_IRQS;
 	mutex_init(&vdev->igate);
@@ -289,7 +304,7 @@ static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
 #endif
 	vdev->disable_idle_d3 = disable_idle_d3;
 
-	pci_set_drvdata(pdev, vdev);
+	pci_set_drvdata(pdev, vmdev);
 
 	ret = vfio_pci_reflck_attach(vdev);
 	if (ret) {
@@ -320,7 +335,7 @@ static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
 		vfio_pci_set_power_state(vdev, PCI_D3hot);
 	}
 
-	ret = mdev_register_device(&pdev->dev, &vfio_mdev_pci_ops);
+	ret = mdev_register_device(&pdev->dev, ops);
 	if (ret)
 		pr_err("Cannot register mdev for device %s\n",
 			dev_name(&pdev->dev));
@@ -332,12 +347,17 @@ static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
 
 static void vfio_mdev_pci_driver_remove(struct pci_dev *pdev)
 {
+	struct vfio_mdev_pci_device *vmdev;
 	struct vfio_pci_device *vdev;
 
-	vdev = pci_get_drvdata(pdev);
-	if (!vdev)
+	mdev_unregister_device(&pdev->dev);
+
+	vmdev = pci_get_drvdata(pdev);
+	if (!vmdev)
 		return;
 
+	vdev = &vmdev->vdev;
+
 	vfio_pci_reflck_put(vdev->reflck);
 
 	kfree(vdev->region);
@@ -355,7 +375,8 @@ static void vfio_mdev_pci_driver_remove(struct pci_dev *pdev)
 				VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM);
 	}
 
-	kfree(vdev);
+	kfree(vmdev->attr.name);
+	kfree(vmdev);
 }
 
 static struct pci_driver vfio_mdev_pci_driver = {
Yi Liu June 20, 2019, 1 p.m. UTC | #2
Hi Alex,

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Thursday, June 20, 2019 12:27 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> 
> On Sat,  8 Jun 2019 21:21:11 +0800
> Liu Yi L <yi.l.liu@intel.com> wrote:
> 
> > This patch adds sample driver named vfio-mdev-pci. It is to wrap
> > a PCI device as a mediated device. For a pci device, once bound
> > to vfio-mdev-pci driver, user space access of this device will
> > go through vfio mdev framework. The usage of the device follows
> > mdev management method. e.g. user should create a mdev before
> > exposing the device to user-space.
> >
> > Benefit of this new driver would be acting as a sample driver
> > for recent changes from "vfio/mdev: IOMMU aware mediated device"
> > patchset. Also it could be a good experiment driver for future
> > device specific mdev migration support.
> >
> > To use this driver:
> > a) build and load vfio-mdev-pci.ko module
> >    execute "make menuconfig" and config CONFIG_SAMPLE_VFIO_MDEV_PCI
> >    then load it with following command
> >    > sudo modprobe vfio
> >    > sudo modprobe vfio-pci
> >    > sudo insmod drivers/vfio/pci/vfio-mdev-pci.ko
> >
> > b) unbind original device driver
> >    e.g. use following command to unbind its original driver
> >    > echo $dev_bdf > /sys/bus/pci/devices/$dev_bdf/driver/unbind
> >
> > c) bind vfio-mdev-pci driver to the physical device
> >    > echo $vend_id $dev_id > /sys/bus/pci/drivers/vfio-mdev-pci/new_id
> >
> > d) check the supported mdev instances
> >    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/
> >      vfio-mdev-pci-type1
> >    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\
> >      vfio-mdev-pci-type1/
> >      available_instances  create  device_api  devices  name
> 
> 
> I think the static type name here is a problem (and why does it
> include "type1"?).  We generally consider that a type defines a
> software compatible mdev, but in this case any PCI device wrapped in
> vfio-mdev-pci gets the same mdev type.  This is only a sample driver,
> but that's a bad precedent.  I've taken a stab at fixing this in the
> patch below, using the PCI vendor ID, device ID, subsystem vendor ID,
> subsystem device ID, class code, and revision to try to make the type
> as specific to the physical device assigned as we can through PCI.

Thanks, it is much better than what I proposed.

> 
> >
> > e)  create mdev on this physical device (only 1 instance)
> >    > echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1003" > \
> >      /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\
> >      vfio-mdev-pci-type1/create
> 
> Whoops, available_instances always reports 1 and it doesn't appear that
> the create function prevents additional mdevs.  Also addressed in the
> patch below.

yep, thanks.

> 
> > f) passthru the mdev to guest
> >    add the following line in Qemu boot command
> >    -device vfio-pci,\
> >     sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003
> >
> > g) destroy mdev
> >    > echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003/\
> >      remove
> >
> 
> I also found that unbinding the parent device doesn't unregister with
> mdev, so it cannot be bound again, also fixed below.

Oops, good catch. :-)

> However, the patch below just makes the mdev interface behave
> correctly, I can't make it work on my system because commit
> 7bd50f0cd2fd ("vfio/type1: Add domain at(de)taching group helpers")

What error did you encounter. I tested the patch with a device in a
singleton iommu group. I'm also searching a proper machine with
multiple devices in an iommu group and test it.

> used iommu_attach_device() rather than iommu_attach_group() for non-aux
> mdev iommu_device.  Is there a requirement that the mdev parent device
> is in a singleton iommu group?

I don't think there should have such limitation. Per my understanding,
vfio-mdev-pci should also be able to bind to devices which shares
iommu group with other devices. vfio-pci works well for such devices.
And since the two drivers share most of the codes, I think vfio-mdev-pci
should naturally support it as well.

> If this is a simplification, then
> vfio-mdev-pci should not bind to devices where this is violated since
> there's no way to use the device.  Can we support it though?

yeah, I think we need to support it.

> If I have two devices in the same group and bind them both to
> vfio-mdev-pci, I end up with three groups, one for each mdev device and
> the original physical device group.  vfio.c works with the mdev groups
> and will try to match both groups to the container.  vfio_iommu_type1.c
> also works with the mdev groups, except for the point where we actually
> try to attach a group to a domain, which is the only window where we use
> the iommu_device rather than the provided group, but we don't record
> that anywhere.  Should struct vfio_group have a pointer to a reference
> counted object that tracks the actual iommu_group attached, such that
> we can determine that the group is already attached to the domain and
> not try to attach again? 

Agreed, we need to avoid such duplicated attach. Instead of adding
reference counted object in vfio_group. I'm also considering the logic
below:

    /*
      * Do this check in vfio_iommu_type1_attach_group(), after mdev_group
      * is initialized.
      */
    if (vfio_group->mdev_group) {
         /*
           * vfio_group->mdev_group is true means vfio_group->iommu_group
           * is not the actual iommu_group which is going to be attached to
           * domain. To avoid duplicate iommu_group attach, needs to check if
           * the actual iommu_group. vfio_get_parent_iommu_group() is a
           * newly added helper function which returns the actual attach
           * iommu_group going to be attached for this mdev group.
              */
         p_iommu_group = vfio_get_parent_iommu_group(
                                                                         vfio_group->iommu_group);
         list_for_each_entry(d, &iommu->domain_list, next) {
                 if (find_iommu_group(d, p_iommu_group)) {
                         mutex_unlock(&iommu->lock);
                         // skip group attach;
                 }
         }

> Ideally I'd be able to bind one device to
> vfio-pci, the other to vfio-mdev-pci, and be able to use them both
> within the same container.  It seems like this should be possible, it's
> the same effective iommu configuration as if they were both bound to
> vfio-pci.  Thanks,

Agreed. Will test it. And thanks for the fix patch below. I've test it
with a device in a singleton iommu group. Need to test the scenario
you mentioned above. :-)

Thanks,
Yi Liu

> 
> Alex
> 
> diff --git a/drivers/vfio/pci/vfio_mdev_pci.c b/drivers/vfio/pci/vfio_mdev_pci.c
> index 07c8067b3f73..09143d3e5473 100644
> --- a/drivers/vfio/pci/vfio_mdev_pci.c
> +++ b/drivers/vfio/pci/vfio_mdev_pci.c
> @@ -65,18 +65,22 @@ MODULE_PARM_DESC(disable_idle_d3,
> 
>  static struct pci_driver vfio_mdev_pci_driver;
> 
> -static ssize_t
> -name_show(struct kobject *kobj, struct device *dev, char *buf)
> -{
> -	return sprintf(buf, "%s-type1\n", dev_name(dev));
> -}
> -
> -MDEV_TYPE_ATTR_RO(name);
> +struct vfio_mdev_pci_device {
> +	struct vfio_pci_device vdev;
> +	struct mdev_parent_ops ops;
> +	struct attribute_group *groups[2];
> +	struct attribute_group attr;
> +	atomic_t avail;
> +};
> 
>  static ssize_t
>  available_instances_show(struct kobject *kobj, struct device *dev, char *buf)
>  {
> -	return sprintf(buf, "%d\n", 1);
> +	struct vfio_mdev_pci_device *vmdev;
> +
> +	vmdev = pci_get_drvdata(to_pci_dev(dev));
> +
> +	return sprintf(buf, "%d\n", atomic_read(&vmdev->avail));
>  }
> 
>  MDEV_TYPE_ATTR_RO(available_instances);
> @@ -90,62 +94,57 @@ static ssize_t device_api_show(struct kobject *kobj, struct
> device *dev,
>  MDEV_TYPE_ATTR_RO(device_api);
> 
>  static struct attribute *vfio_mdev_pci_types_attrs[] = {
> -	&mdev_type_attr_name.attr,
>  	&mdev_type_attr_device_api.attr,
>  	&mdev_type_attr_available_instances.attr,
>  	NULL,
>  };
> 
> -static struct attribute_group vfio_mdev_pci_type_group1 = {
> -	.name  = "type1",
> -	.attrs = vfio_mdev_pci_types_attrs,
> -};
> -
> -struct attribute_group *vfio_mdev_pci_type_groups[] = {
> -	&vfio_mdev_pci_type_group1,
> -	NULL,
> -};
> -
>  struct vfio_mdev_pci {
>  	struct vfio_pci_device *vdev;
>  	struct mdev_device *mdev;
> -	unsigned long handle;
>  };
> 
>  static int vfio_mdev_pci_create(struct kobject *kobj, struct mdev_device *mdev)
>  {
>  	struct device *pdev;
> -	struct vfio_pci_device *vdev;
> +	struct vfio_mdev_pci_device *vmdev;
>  	struct vfio_mdev_pci *pmdev;
>  	int ret;
> 
>  	pdev = mdev_parent_dev(mdev);
> -	vdev = dev_get_drvdata(pdev);
> +	vmdev = dev_get_drvdata(pdev);
> +
> +	if (atomic_dec_if_positive(&vmdev->avail) < 0)
> +		return -ENOSPC;
> +
>  	pmdev = kzalloc(sizeof(struct vfio_mdev_pci), GFP_KERNEL);
> -	if (pmdev == NULL) {
> -		ret = -EBUSY;
> -		goto out;
> -	}
> +	if (!pmdev)
> +		return -ENOMEM;
> 
>  	pmdev->mdev = mdev;
> -	pmdev->vdev = vdev;
> +	pmdev->vdev = &vmdev->vdev;
>  	mdev_set_drvdata(mdev, pmdev);
>  	ret = mdev_set_iommu_device(mdev_dev(mdev), pdev);
>  	if (ret) {
>  		pr_info("%s, failed to config iommu isolation for mdev: %s on
> pf: %s\n",
>  			__func__, dev_name(mdev_dev(mdev)), dev_name(pdev));
> -		goto out;
> +		kfree(pmdev);
> +		atomic_inc(&vmdev->avail);
> +		return ret;
>  	}
> 
> -out:
> -	return ret;
> +	return 0;
>  }
> 
>  static int vfio_mdev_pci_remove(struct mdev_device *mdev)
>  {
>  	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> +	struct vfio_mdev_pci_device *vmdev;
> +
> +	vmdev = container_of(pmdev->vdev, struct vfio_mdev_pci_device, vdev);
> 
>  	kfree(pmdev);
> +	atomic_inc(&vmdev->avail);
>  	pr_info("%s, succeeded for mdev: %s\n", __func__,
>  		     dev_name(mdev_dev(mdev)));
> 
> @@ -237,24 +236,12 @@ static ssize_t vfio_mdev_pci_write(struct mdev_device
> *mdev,
>  	return vfio_pci_write(pmdev->vdev, (char __user *)buf, count, ppos);
>  }
> 
> -static const struct mdev_parent_ops vfio_mdev_pci_ops = {
> -	.supported_type_groups	= vfio_mdev_pci_type_groups,
> -	.create			= vfio_mdev_pci_create,
> -	.remove			= vfio_mdev_pci_remove,
> -
> -	.open			= vfio_mdev_pci_open,
> -	.release		= vfio_mdev_pci_release,
> -
> -	.read			= vfio_mdev_pci_read,
> -	.write			= vfio_mdev_pci_write,
> -	.mmap			= vfio_mdev_pci_mmap,
> -	.ioctl			= vfio_mdev_pci_ioctl,
> -};
> -
>  static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
>  				       const struct pci_device_id *id)
>  {
> +	struct vfio_mdev_pci_device *vmdev;
>  	struct vfio_pci_device *vdev;
> +	const struct mdev_parent_ops *ops;
>  	int ret;
> 
>  	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
> @@ -273,10 +260,38 @@ static int vfio_mdev_pci_driver_probe(struct pci_dev
> *pdev,
>  		return -EBUSY;
>  	}
> 
> -	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
> -	if (!vdev)
> +	vmdev = kzalloc(sizeof(*vmdev), GFP_KERNEL);
> +	if (!vmdev)
>  		return -ENOMEM;
> 
> +	vmdev->attr.name = kasprintf(GFP_KERNEL,
> +				     "%04x:%04x:%04x:%04x:%06x:%02x",
> +				     pdev->vendor, pdev->device,
> +				     pdev->subsystem_vendor,
> +				     pdev->subsystem_device, pdev->class,
> +				     pdev->revision);
> +	if (!vmdev->attr.name) {
> +		kfree(vmdev);
> +		return -ENOMEM;
> +	}
> +
> +	atomic_set(&vmdev->avail, 1);
> +
> +	vmdev->attr.attrs = vfio_mdev_pci_types_attrs;
> +	vmdev->groups[0] = &vmdev->attr;
> +
> +	vmdev->ops.supported_type_groups = vmdev->groups;
> +	vmdev->ops.create = vfio_mdev_pci_create;
> +	vmdev->ops.remove = vfio_mdev_pci_remove;
> +	vmdev->ops.open	= vfio_mdev_pci_open;
> +	vmdev->ops.release = vfio_mdev_pci_release;
> +	vmdev->ops.read = vfio_mdev_pci_read;
> +	vmdev->ops.write = vfio_mdev_pci_write;
> +	vmdev->ops.mmap = vfio_mdev_pci_mmap;
> +	vmdev->ops.ioctl = vfio_mdev_pci_ioctl;
> +	ops = &vmdev->ops;
> +
> +	vdev = &vmdev->vdev;
>  	vdev->pdev = pdev;
>  	vdev->irq_type = VFIO_PCI_NUM_IRQS;
>  	mutex_init(&vdev->igate);
> @@ -289,7 +304,7 @@ static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
>  #endif
>  	vdev->disable_idle_d3 = disable_idle_d3;
> 
> -	pci_set_drvdata(pdev, vdev);
> +	pci_set_drvdata(pdev, vmdev);
> 
>  	ret = vfio_pci_reflck_attach(vdev);
>  	if (ret) {
> @@ -320,7 +335,7 @@ static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
>  		vfio_pci_set_power_state(vdev, PCI_D3hot);
>  	}
> 
> -	ret = mdev_register_device(&pdev->dev, &vfio_mdev_pci_ops);
> +	ret = mdev_register_device(&pdev->dev, ops);
>  	if (ret)
>  		pr_err("Cannot register mdev for device %s\n",
>  			dev_name(&pdev->dev));
> @@ -332,12 +347,17 @@ static int vfio_mdev_pci_driver_probe(struct pci_dev
> *pdev,
> 
>  static void vfio_mdev_pci_driver_remove(struct pci_dev *pdev)
>  {
> +	struct vfio_mdev_pci_device *vmdev;
>  	struct vfio_pci_device *vdev;
> 
> -	vdev = pci_get_drvdata(pdev);
> -	if (!vdev)
> +	mdev_unregister_device(&pdev->dev);
> +
> +	vmdev = pci_get_drvdata(pdev);
> +	if (!vmdev)
>  		return;
> 
> +	vdev = &vmdev->vdev;
> +
>  	vfio_pci_reflck_put(vdev->reflck);
> 
>  	kfree(vdev->region);
> @@ -355,7 +375,8 @@ static void vfio_mdev_pci_driver_remove(struct pci_dev
> *pdev)
>  				VGA_RSRC_LEGACY_IO |
> VGA_RSRC_LEGACY_MEM);
>  	}
> 
> -	kfree(vdev);
> +	kfree(vmdev->attr.name);
> +	kfree(vmdev);
>  }
> 
>  static struct pci_driver vfio_mdev_pci_driver = {
Alex Williamson June 20, 2019, 9:07 p.m. UTC | #3
On Thu, 20 Jun 2019 13:00:34 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> Hi Alex,
> 
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Thursday, June 20, 2019 12:27 PM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > 
> > On Sat,  8 Jun 2019 21:21:11 +0800
> > Liu Yi L <yi.l.liu@intel.com> wrote:
> >   
> > > This patch adds sample driver named vfio-mdev-pci. It is to wrap
> > > a PCI device as a mediated device. For a pci device, once bound
> > > to vfio-mdev-pci driver, user space access of this device will
> > > go through vfio mdev framework. The usage of the device follows
> > > mdev management method. e.g. user should create a mdev before
> > > exposing the device to user-space.
> > >
> > > Benefit of this new driver would be acting as a sample driver
> > > for recent changes from "vfio/mdev: IOMMU aware mediated device"
> > > patchset. Also it could be a good experiment driver for future
> > > device specific mdev migration support.
> > >
> > > To use this driver:
> > > a) build and load vfio-mdev-pci.ko module
> > >    execute "make menuconfig" and config CONFIG_SAMPLE_VFIO_MDEV_PCI
> > >    then load it with following command  
> > >    > sudo modprobe vfio
> > >    > sudo modprobe vfio-pci
> > >    > sudo insmod drivers/vfio/pci/vfio-mdev-pci.ko  
> > >
> > > b) unbind original device driver
> > >    e.g. use following command to unbind its original driver  
> > >    > echo $dev_bdf > /sys/bus/pci/devices/$dev_bdf/driver/unbind  
> > >
> > > c) bind vfio-mdev-pci driver to the physical device  
> > >    > echo $vend_id $dev_id > /sys/bus/pci/drivers/vfio-mdev-pci/new_id  
> > >
> > > d) check the supported mdev instances  
> > >    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/  
> > >      vfio-mdev-pci-type1  
> > >    > ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\  
> > >      vfio-mdev-pci-type1/
> > >      available_instances  create  device_api  devices  name  
> > 
> > 
> > I think the static type name here is a problem (and why does it
> > include "type1"?).  We generally consider that a type defines a
> > software compatible mdev, but in this case any PCI device wrapped in
> > vfio-mdev-pci gets the same mdev type.  This is only a sample driver,
> > but that's a bad precedent.  I've taken a stab at fixing this in the
> > patch below, using the PCI vendor ID, device ID, subsystem vendor ID,
> > subsystem device ID, class code, and revision to try to make the type
> > as specific to the physical device assigned as we can through PCI.  
> 
> Thanks, it is much better than what I proposed.
> 
> >   
> > >
> > > e)  create mdev on this physical device (only 1 instance)  
> > >    > echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1003" > \  
> > >      /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\
> > >      vfio-mdev-pci-type1/create  
> > 
> > Whoops, available_instances always reports 1 and it doesn't appear that
> > the create function prevents additional mdevs.  Also addressed in the
> > patch below.  
> 
> yep, thanks.
> 
> >   
> > > f) passthru the mdev to guest
> > >    add the following line in Qemu boot command
> > >    -device vfio-pci,\
> > >     sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003
> > >
> > > g) destroy mdev  
> > >    > echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003/\  
> > >      remove
> > >  
> > 
> > I also found that unbinding the parent device doesn't unregister with
> > mdev, so it cannot be bound again, also fixed below.  
> 
> Oops, good catch. :-)
> 
> > However, the patch below just makes the mdev interface behave
> > correctly, I can't make it work on my system because commit
> > 7bd50f0cd2fd ("vfio/type1: Add domain at(de)taching group helpers")  
> 
> What error did you encounter. I tested the patch with a device in a
> singleton iommu group. I'm also searching a proper machine with
> multiple devices in an iommu group and test it.

In vfio_iommu_type1, iommu backed mdev devices use the
iommu_attach_device() interface, which includes:

        if (iommu_group_device_count(group) != 1)
                goto out_unlock;

So it's impossible to use with non-singleton groups currently.

> > used iommu_attach_device() rather than iommu_attach_group() for non-aux
> > mdev iommu_device.  Is there a requirement that the mdev parent device
> > is in a singleton iommu group?  
> 
> I don't think there should have such limitation. Per my understanding,
> vfio-mdev-pci should also be able to bind to devices which shares
> iommu group with other devices. vfio-pci works well for such devices.
> And since the two drivers share most of the codes, I think vfio-mdev-pci
> should naturally support it as well.

Yes, the difference though is that vfio.c knows when devices are in the
same group, which mdev vfio.c only knows about the non-iommu backed
group, not the group that is actually used for the iommu backing.  So
we either need to enlighten vfio.c or further abstract those details in
vfio_iommu_type1.c.
 
> > If this is a simplification, then
> > vfio-mdev-pci should not bind to devices where this is violated since
> > there's no way to use the device.  Can we support it though?  
> 
> yeah, I think we need to support it.
> 
> > If I have two devices in the same group and bind them both to
> > vfio-mdev-pci, I end up with three groups, one for each mdev device and
> > the original physical device group.  vfio.c works with the mdev groups
> > and will try to match both groups to the container.  vfio_iommu_type1.c
> > also works with the mdev groups, except for the point where we actually
> > try to attach a group to a domain, which is the only window where we use
> > the iommu_device rather than the provided group, but we don't record
> > that anywhere.  Should struct vfio_group have a pointer to a reference
> > counted object that tracks the actual iommu_group attached, such that
> > we can determine that the group is already attached to the domain and
> > not try to attach again?   
> 
> Agreed, we need to avoid such duplicated attach. Instead of adding
> reference counted object in vfio_group. I'm also considering the logic
> below:
> 
>     /*
>       * Do this check in vfio_iommu_type1_attach_group(), after mdev_group
>       * is initialized.
>       */
>     if (vfio_group->mdev_group) {
>          /*
>            * vfio_group->mdev_group is true means vfio_group->iommu_group
>            * is not the actual iommu_group which is going to be attached to
>            * domain. To avoid duplicate iommu_group attach, needs to check if
>            * the actual iommu_group. vfio_get_parent_iommu_group() is a
>            * newly added helper function which returns the actual attach
>            * iommu_group going to be attached for this mdev group.
>               */
>          p_iommu_group = vfio_get_parent_iommu_group(
>                                                                          vfio_group->iommu_group);
>          list_for_each_entry(d, &iommu->domain_list, next) {
>                  if (find_iommu_group(d, p_iommu_group)) {
>                          mutex_unlock(&iommu->lock);
>                          // skip group attach;
>                  }
>          }

We don't currently create a struct vfio_group for the parent, only for
the mdev iommu group.  The iommu_attach for an iommu backed mdev
doesn't leave any traces of where it is actually attached, we just
count on retracing our steps for the detach.  That's why I'm thinking
we need an object somewhere to track it and it needs to be reference
counted so that if both a vfio-mdev-pci device and a vfio-pci device
are using it, we leave it in place if either one is removed.
 
> > Ideally I'd be able to bind one device to
> > vfio-pci, the other to vfio-mdev-pci, and be able to use them both
> > within the same container.  It seems like this should be possible, it's
> > the same effective iommu configuration as if they were both bound to
> > vfio-pci.  Thanks,  
> 
> Agreed. Will test it. And thanks for the fix patch below. I've test it
> with a device in a singleton iommu group. Need to test the scenario
> you mentioned above. :-)

Thanks!

Alex
Yi Liu June 21, 2019, 10:23 a.m. UTC | #4
Hi Alex,

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Friday, June 21, 2019 5:08 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> 
> On Thu, 20 Jun 2019 13:00:34 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > Hi Alex,
> >
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Thursday, June 20, 2019 12:27 PM
> > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > >
> > > On Sat,  8 Jun 2019 21:21:11 +0800
> > > Liu Yi L <yi.l.liu@intel.com> wrote:
> > >
> > > > This patch adds sample driver named vfio-mdev-pci. It is to wrap
> > > > a PCI device as a mediated device. For a pci device, once bound
> > > > to vfio-mdev-pci driver, user space access of this device will
> > > > go through vfio mdev framework. The usage of the device follows
> > > > mdev management method. e.g. user should create a mdev before
> > > > exposing the device to user-space.
[...]
> >
> > > However, the patch below just makes the mdev interface behave
> > > correctly, I can't make it work on my system because commit
> > > 7bd50f0cd2fd ("vfio/type1: Add domain at(de)taching group helpers")
> >
> > What error did you encounter. I tested the patch with a device in a
> > singleton iommu group. I'm also searching a proper machine with
> > multiple devices in an iommu group and test it.
> 
> In vfio_iommu_type1, iommu backed mdev devices use the
> iommu_attach_device() interface, which includes:
> 
>         if (iommu_group_device_count(group) != 1)
>                 goto out_unlock;
> 
> So it's impossible to use with non-singleton groups currently.

Hmmm, I think it is no longer good to use iommu_attach_device() for iommu
backed mdev devices now. In this flow, the purpose here is to attach a device
to a domain and no need to check whether the device is in a singleton iommu
group. I think it would be better to use __iommu_attach_device() instead of
iommu_attach_device().

Also I found a potential mutex lock issue if using iommu_attach_device().
In vfio_iommu_attach_group(), it uses iommu_group_for_each_dev() to loop
all the devices in the group. It holds group->mutex. And then vfio_mdev_attach_domain()
calls iommu_attach_device() which also tries to get group->mutex. This would be
an issue. If you are fine with it, I may post another patch for it. :-)

> > > used iommu_attach_device() rather than iommu_attach_group() for non-aux
> > > mdev iommu_device.  Is there a requirement that the mdev parent device
> > > is in a singleton iommu group?
> >
> > I don't think there should have such limitation. Per my understanding,
> > vfio-mdev-pci should also be able to bind to devices which shares
> > iommu group with other devices. vfio-pci works well for such devices.
> > And since the two drivers share most of the codes, I think vfio-mdev-pci
> > should naturally support it as well.
> 
> Yes, the difference though is that vfio.c knows when devices are in the
> same group, which mdev vfio.c only knows about the non-iommu backed
> group, not the group that is actually used for the iommu backing.  So
> we either need to enlighten vfio.c or further abstract those details in
> vfio_iommu_type1.c.

Not sure if it is necessary to introduce more changes to vfio.c or
vfio_iommu_type1.c. If it's only for the scenario which two devices share an
iommu_group, I guess it could be supported by using __iommu_attach_device()
which has no device counting for the group. But maybe I missed something
here. It would be great if you can elaborate a bit for it. :-)

> 
> > > If this is a simplification, then
> > > vfio-mdev-pci should not bind to devices where this is violated since
> > > there's no way to use the device.  Can we support it though?
> >
> > yeah, I think we need to support it.
> >
> > > If I have two devices in the same group and bind them both to
> > > vfio-mdev-pci, I end up with three groups, one for each mdev device and
> > > the original physical device group.  vfio.c works with the mdev groups
> > > and will try to match both groups to the container.  vfio_iommu_type1.c
> > > also works with the mdev groups, except for the point where we actually
> > > try to attach a group to a domain, which is the only window where we use
> > > the iommu_device rather than the provided group, but we don't record
> > > that anywhere.  Should struct vfio_group have a pointer to a reference
> > > counted object that tracks the actual iommu_group attached, such that
> > > we can determine that the group is already attached to the domain and
> > > not try to attach again?
> >
> > Agreed, we need to avoid such duplicated attach. Instead of adding
> > reference counted object in vfio_group. I'm also considering the logic
> > below:

Re-walked the code, I find the duplicated attach will happen on the vfio-mdev-pci
device as vfio_mdev_attach_domain() only attaches the parent devices of
iommu backed mdevs instead of all the devices within the physical iommu_group.
While for a vfio-pci device, it will use iommu_attach_group() which attaches all the
devices within the iommu backed group. The same with detach,
vfio_mdev_detach_domain() detaches selective devices instead of all devices within
the iommu backed group.

> >     /*
> >       * Do this check in vfio_iommu_type1_attach_group(), after mdev_group
> >       * is initialized.
> >       */
> >     if (vfio_group->mdev_group) {
> >          /*
> >            * vfio_group->mdev_group is true means vfio_group->iommu_group
> >            * is not the actual iommu_group which is going to be attached to
> >            * domain. To avoid duplicate iommu_group attach, needs to check if
> >            * the actual iommu_group. vfio_get_parent_iommu_group() is a
> >            * newly added helper function which returns the actual attach
> >            * iommu_group going to be attached for this mdev group.
> >               */
> >          p_iommu_group = vfio_get_parent_iommu_group(
> >                                                                          vfio_group->iommu_group);
> >          list_for_each_entry(d, &iommu->domain_list, next) {
> >                  if (find_iommu_group(d, p_iommu_group)) {
> >                          mutex_unlock(&iommu->lock);
> >                          // skip group attach;
> >                  }
> >          }
> 
> We don't currently create a struct vfio_group for the parent, only for
> the mdev iommu group.  The iommu_attach for an iommu backed mdev
> doesn't leave any traces of where it is actually attached, we just
> count on retracing our steps for the detach.  That's why I'm thinking
> we need an object somewhere to track it and it needs to be reference
> counted so that if both a vfio-mdev-pci device and a vfio-pci device
> are using it, we leave it in place if either one is removed.

Hmmm, here we are talking about tracking in iommu_group level though
no good idea on where the object should  be placed yet. However, we may
need to tack in device level as I mentioned in above paragraph. If not,
there may be sequence issue. e.g. if vfio-mdev-pci device is attached
firstly, then the object will be initialized, and when vfio-pci device is
attached, we will find the attach should be skipped and just inc the ref count.
But actually it should not be skipped since the vfio-mdev-pci attach does not
attach all devices within the iommu backed group.

What's more, regards to sIOV case,  a parent devices may have multiple
mdevs and the mdevs may be assigned to the same VM. Thus there will be multiple
attach on this parent device. This also makes me believe track in device level would
be better. 

> 
> > > Ideally I'd be able to bind one device to
> > > vfio-pci, the other to vfio-mdev-pci, and be able to use them both
> > > within the same container.  It seems like this should be possible, it's
> > > the same effective iommu configuration as if they were both bound to
> > > vfio-pci.  Thanks,
> >
> > Agreed. Will test it. And thanks for the fix patch below. I've test it
> > with a device in a singleton iommu group. Need to test the scenario
> > you mentioned above. :-)
> 
> Thanks!

You are welcomed. :-)

> Alex

Regards,
Yi Liu
Alex Williamson June 21, 2019, 3:57 p.m. UTC | #5
On Fri, 21 Jun 2019 10:23:10 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> Hi Alex,
> 
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Friday, June 21, 2019 5:08 AM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > 
> > On Thu, 20 Jun 2019 13:00:34 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > Hi Alex,
> > >  
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Thursday, June 20, 2019 12:27 PM
> > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > > >
> > > > On Sat,  8 Jun 2019 21:21:11 +0800
> > > > Liu Yi L <yi.l.liu@intel.com> wrote:
> > > >  
> > > > > This patch adds sample driver named vfio-mdev-pci. It is to wrap
> > > > > a PCI device as a mediated device. For a pci device, once bound
> > > > > to vfio-mdev-pci driver, user space access of this device will
> > > > > go through vfio mdev framework. The usage of the device follows
> > > > > mdev management method. e.g. user should create a mdev before
> > > > > exposing the device to user-space.  
> [...]
> > >  
> > > > However, the patch below just makes the mdev interface behave
> > > > correctly, I can't make it work on my system because commit
> > > > 7bd50f0cd2fd ("vfio/type1: Add domain at(de)taching group helpers")  
> > >
> > > What error did you encounter. I tested the patch with a device in a
> > > singleton iommu group. I'm also searching a proper machine with
> > > multiple devices in an iommu group and test it.  
> > 
> > In vfio_iommu_type1, iommu backed mdev devices use the
> > iommu_attach_device() interface, which includes:
> > 
> >         if (iommu_group_device_count(group) != 1)
> >                 goto out_unlock;
> > 
> > So it's impossible to use with non-singleton groups currently.  
> 
> Hmmm, I think it is no longer good to use iommu_attach_device() for iommu
> backed mdev devices now. In this flow, the purpose here is to attach a device
> to a domain and no need to check whether the device is in a singleton iommu
> group. I think it would be better to use __iommu_attach_device() instead of
> iommu_attach_device().

That's a static and unexported, it's intentionally not an exposed
interface.  We can't attach devices in the same group to separate
domains allocated through iommu_domain_alloc(), this would violate the
iommu group isolation principles.

> Also I found a potential mutex lock issue if using iommu_attach_device().
> In vfio_iommu_attach_group(), it uses iommu_group_for_each_dev() to loop
> all the devices in the group. It holds group->mutex. And then vfio_mdev_attach_domain()
> calls iommu_attach_device() which also tries to get group->mutex. This would be
> an issue. If you are fine with it, I may post another patch for it. :-)

Gack, yes, please send a patch.

> > > > used iommu_attach_device() rather than iommu_attach_group() for non-aux
> > > > mdev iommu_device.  Is there a requirement that the mdev parent device
> > > > is in a singleton iommu group?  
> > >
> > > I don't think there should have such limitation. Per my understanding,
> > > vfio-mdev-pci should also be able to bind to devices which shares
> > > iommu group with other devices. vfio-pci works well for such devices.
> > > And since the two drivers share most of the codes, I think vfio-mdev-pci
> > > should naturally support it as well.  
> > 
> > Yes, the difference though is that vfio.c knows when devices are in the
> > same group, which mdev vfio.c only knows about the non-iommu backed
> > group, not the group that is actually used for the iommu backing.  So
> > we either need to enlighten vfio.c or further abstract those details in
> > vfio_iommu_type1.c.  
> 
> Not sure if it is necessary to introduce more changes to vfio.c or
> vfio_iommu_type1.c. If it's only for the scenario which two devices share an
> iommu_group, I guess it could be supported by using __iommu_attach_device()
> which has no device counting for the group. But maybe I missed something
> here. It would be great if you can elaborate a bit for it. :-)

We need to use the group semantics, there's a reason
__iommu_attach_device() is not exposed, it's an internal helper.  I
think there's no way around that we need to somewhere track the actual
group we're attaching to and have the smarts to re-use it for other
devices in the same group.
 
> > > > If this is a simplification, then
> > > > vfio-mdev-pci should not bind to devices where this is violated since
> > > > there's no way to use the device.  Can we support it though?  
> > >
> > > yeah, I think we need to support it.
> > >  
> > > > If I have two devices in the same group and bind them both to
> > > > vfio-mdev-pci, I end up with three groups, one for each mdev device and
> > > > the original physical device group.  vfio.c works with the mdev groups
> > > > and will try to match both groups to the container.  vfio_iommu_type1.c
> > > > also works with the mdev groups, except for the point where we actually
> > > > try to attach a group to a domain, which is the only window where we use
> > > > the iommu_device rather than the provided group, but we don't record
> > > > that anywhere.  Should struct vfio_group have a pointer to a reference
> > > > counted object that tracks the actual iommu_group attached, such that
> > > > we can determine that the group is already attached to the domain and
> > > > not try to attach again?  
> > >
> > > Agreed, we need to avoid such duplicated attach. Instead of adding
> > > reference counted object in vfio_group. I'm also considering the logic
> > > below:  
> 
> Re-walked the code, I find the duplicated attach will happen on the vfio-mdev-pci
> device as vfio_mdev_attach_domain() only attaches the parent devices of
> iommu backed mdevs instead of all the devices within the physical iommu_group.
> While for a vfio-pci device, it will use iommu_attach_group() which attaches all the
> devices within the iommu backed group. The same with detach,
> vfio_mdev_detach_domain() detaches selective devices instead of all devices within
> the iommu backed group.

Yep, that's not good, for the non-aux case we need to follow the usual
group semantics or else we're limited to singleton groups.

> > >     /*
> > >       * Do this check in vfio_iommu_type1_attach_group(), after mdev_group
> > >       * is initialized.
> > >       */
> > >     if (vfio_group->mdev_group) {
> > >          /*
> > >            * vfio_group->mdev_group is true means vfio_group->iommu_group
> > >            * is not the actual iommu_group which is going to be attached to
> > >            * domain. To avoid duplicate iommu_group attach, needs to check if
> > >            * the actual iommu_group. vfio_get_parent_iommu_group() is a
> > >            * newly added helper function which returns the actual attach
> > >            * iommu_group going to be attached for this mdev group.
> > >               */
> > >          p_iommu_group = vfio_get_parent_iommu_group(
> > >                                                                          vfio_group->iommu_group);
> > >          list_for_each_entry(d, &iommu->domain_list, next) {
> > >                  if (find_iommu_group(d, p_iommu_group)) {
> > >                          mutex_unlock(&iommu->lock);
> > >                          // skip group attach;
> > >                  }
> > >          }  
> > 
> > We don't currently create a struct vfio_group for the parent, only for
> > the mdev iommu group.  The iommu_attach for an iommu backed mdev
> > doesn't leave any traces of where it is actually attached, we just
> > count on retracing our steps for the detach.  That's why I'm thinking
> > we need an object somewhere to track it and it needs to be reference
> > counted so that if both a vfio-mdev-pci device and a vfio-pci device
> > are using it, we leave it in place if either one is removed.  
> 
> Hmmm, here we are talking about tracking in iommu_group level though
> no good idea on where the object should  be placed yet. However, we may
> need to tack in device level as I mentioned in above paragraph. If not,
> there may be sequence issue. e.g. if vfio-mdev-pci device is attached
> firstly, then the object will be initialized, and when vfio-pci device is
> attached, we will find the attach should be skipped and just inc the ref count.
> But actually it should not be skipped since the vfio-mdev-pci attach does not
> attach all devices within the iommu backed group.

We can't do that though, the entire group needs to be attached.

> What's more, regards to sIOV case,  a parent devices may have multiple
> mdevs and the mdevs may be assigned to the same VM. Thus there will be multiple
> attach on this parent device. This also makes me believe track in device level would
> be better. 

The aux domain support essentially specifies that the device can be
attached to multiple domains, so I think we're ok for device-level
group attach there, but not for bare iommu backed devices.  Thanks,

Alex
Yi Liu June 24, 2019, 8:20 a.m. UTC | #6
Hi Alex,

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Friday, June 21, 2019 11:58 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> 
> On Fri, 21 Jun 2019 10:23:10 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > Hi Alex,
> >
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Friday, June 21, 2019 5:08 AM
> > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > >
> > > On Thu, 20 Jun 2019 13:00:34 +0000
> > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > >
> > > > Hi Alex,
> > > >
> > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > Sent: Thursday, June 20, 2019 12:27 PM
> > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > > > >
> > > > > On Sat,  8 Jun 2019 21:21:11 +0800
> > > > > Liu Yi L <yi.l.liu@intel.com> wrote:
> > > > >
> > > > > > This patch adds sample driver named vfio-mdev-pci. It is to wrap
> > > > > > a PCI device as a mediated device. For a pci device, once bound
> > > > > > to vfio-mdev-pci driver, user space access of this device will
> > > > > > go through vfio mdev framework. The usage of the device follows
> > > > > > mdev management method. e.g. user should create a mdev before
> > > > > > exposing the device to user-space.
> > [...]
> > > >
> > > > > However, the patch below just makes the mdev interface behave
> > > > > correctly, I can't make it work on my system because commit
> > > > > 7bd50f0cd2fd ("vfio/type1: Add domain at(de)taching group helpers")
> > > >
> > > > What error did you encounter. I tested the patch with a device in a
> > > > singleton iommu group. I'm also searching a proper machine with
> > > > multiple devices in an iommu group and test it.
> > >
> > > In vfio_iommu_type1, iommu backed mdev devices use the
> > > iommu_attach_device() interface, which includes:
> > >
> > >         if (iommu_group_device_count(group) != 1)
> > >                 goto out_unlock;
> > >
> > > So it's impossible to use with non-singleton groups currently.
> >
> > Hmmm, I think it is no longer good to use iommu_attach_device() for iommu
> > backed mdev devices now. In this flow, the purpose here is to attach a device
> > to a domain and no need to check whether the device is in a singleton iommu
> > group. I think it would be better to use __iommu_attach_device() instead of
> > iommu_attach_device().
> 
> That's a static and unexported, it's intentionally not an exposed
> interface.  We can't attach devices in the same group to separate
> domains allocated through iommu_domain_alloc(), this would violate the
> iommu group isolation principles.

Go it. :-) Then not good to expose such interface. But to support devices in
non-singleton iommu group, we need to have a new interface which doesn't
count the devices but attach all the devices.

> > Also I found a potential mutex lock issue if using iommu_attach_device().
> > In vfio_iommu_attach_group(), it uses iommu_group_for_each_dev() to loop
> > all the devices in the group. It holds group->mutex. And then
> vfio_mdev_attach_domain()
> > calls iommu_attach_device() which also tries to get group->mutex. This would be
> > an issue. If you are fine with it, I may post another patch for it. :-)
> 
> Gack, yes, please send a patch.

Would do it, may be together with the support of vfio-mdev-pci on devices in
non-singleton iommu group.

> 
> > > > > used iommu_attach_device() rather than iommu_attach_group() for non-aux
> > > > > mdev iommu_device.  Is there a requirement that the mdev parent device
> > > > > is in a singleton iommu group?
> > > >
> > > > I don't think there should have such limitation. Per my understanding,
> > > > vfio-mdev-pci should also be able to bind to devices which shares
> > > > iommu group with other devices. vfio-pci works well for such devices.
> > > > And since the two drivers share most of the codes, I think vfio-mdev-pci
> > > > should naturally support it as well.
> > >
> > > Yes, the difference though is that vfio.c knows when devices are in the
> > > same group, which mdev vfio.c only knows about the non-iommu backed
> > > group, not the group that is actually used for the iommu backing.  So
> > > we either need to enlighten vfio.c or further abstract those details in
> > > vfio_iommu_type1.c.
> >
> > Not sure if it is necessary to introduce more changes to vfio.c or
> > vfio_iommu_type1.c. If it's only for the scenario which two devices share an
> > iommu_group, I guess it could be supported by using __iommu_attach_device()
> > which has no device counting for the group. But maybe I missed something
> > here. It would be great if you can elaborate a bit for it. :-)
> 
> We need to use the group semantics, there's a reason
> __iommu_attach_device() is not exposed, it's an internal helper.  I
> think there's no way around that we need to somewhere track the actual
> group we're attaching to and have the smarts to re-use it for other
> devices in the same group.

Hmmm, exposing __iommu_attach_device() is not good, let's forget it. :-)

> > > > > If this is a simplification, then
> > > > > vfio-mdev-pci should not bind to devices where this is violated since
> > > > > there's no way to use the device.  Can we support it though?
> > > >
> > > > yeah, I think we need to support it.
> > > >
> > > > > If I have two devices in the same group and bind them both to
> > > > > vfio-mdev-pci, I end up with three groups, one for each mdev device and
> > > > > the original physical device group.  vfio.c works with the mdev groups
> > > > > and will try to match both groups to the container.  vfio_iommu_type1.c
> > > > > also works with the mdev groups, except for the point where we actually
> > > > > try to attach a group to a domain, which is the only window where we use
> > > > > the iommu_device rather than the provided group, but we don't record
> > > > > that anywhere.  Should struct vfio_group have a pointer to a reference
> > > > > counted object that tracks the actual iommu_group attached, such that
> > > > > we can determine that the group is already attached to the domain and
> > > > > not try to attach again?
> > > >
> > > > Agreed, we need to avoid such duplicated attach. Instead of adding
> > > > reference counted object in vfio_group. I'm also considering the logic
> > > > below:
> >
> > Re-walked the code, I find the duplicated attach will happen on the vfio-mdev-pci
> > device as vfio_mdev_attach_domain() only attaches the parent devices of
> > iommu backed mdevs instead of all the devices within the physical iommu_group.
> > While for a vfio-pci device, it will use iommu_attach_group() which attaches all the
> > devices within the iommu backed group. The same with detach,
> > vfio_mdev_detach_domain() detaches selective devices instead of all devices
> within
> > the iommu backed group.
> 
> Yep, that's not good, for the non-aux case we need to follow the usual
> group semantics or else we're limited to singleton groups.

yep.

> 
> > > >     /*
> > > >       * Do this check in vfio_iommu_type1_attach_group(), after mdev_group
> > > >       * is initialized.
> > > >       */
> > > >     if (vfio_group->mdev_group) {
> > > >          /*
> > > >            * vfio_group->mdev_group is true means vfio_group->iommu_group
> > > >            * is not the actual iommu_group which is going to be attached to
> > > >            * domain. To avoid duplicate iommu_group attach, needs to check if
> > > >            * the actual iommu_group. vfio_get_parent_iommu_group() is a
> > > >            * newly added helper function which returns the actual attach
> > > >            * iommu_group going to be attached for this mdev group.
> > > >               */
> > > >          p_iommu_group = vfio_get_parent_iommu_group(
> > > >                                                                          vfio_group->iommu_group);
> > > >          list_for_each_entry(d, &iommu->domain_list, next) {
> > > >                  if (find_iommu_group(d, p_iommu_group)) {
> > > >                          mutex_unlock(&iommu->lock);
> > > >                          // skip group attach;
> > > >                  }
> > > >          }
> > >
> > > We don't currently create a struct vfio_group for the parent, only for
> > > the mdev iommu group.  The iommu_attach for an iommu backed mdev
> > > doesn't leave any traces of where it is actually attached, we just
> > > count on retracing our steps for the detach.  That's why I'm thinking
> > > we need an object somewhere to track it and it needs to be reference
> > > counted so that if both a vfio-mdev-pci device and a vfio-pci device
> > > are using it, we leave it in place if either one is removed.
> >
> > Hmmm, here we are talking about tracking in iommu_group level though
> > no good idea on where the object should  be placed yet. However, we may
> > need to tack in device level as I mentioned in above paragraph. If not,
> > there may be sequence issue. e.g. if vfio-mdev-pci device is attached
> > firstly, then the object will be initialized, and when vfio-pci device is
> > attached, we will find the attach should be skipped and just inc the ref count.
> > But actually it should not be skipped since the vfio-mdev-pci attach does not
> > attach all devices within the iommu backed group.
> 
> We can't do that though, the entire group needs to be attached.

Agree, may be getting another interface which is similar with
iommu_attach_device(), but works for devices which is in non-singleton
groups. So the attach for iommu backed mdev will also result in a sound
attach to all the devices which share iommu group with the parent device.
This is just like vfio-pci devices. For the object for tracking purpose may be
as below:

struct vfio_iommu_object {
	struct iommu_group *group;
	struct kref kref;
};

And I think it should be per-domain and per-iommu backed group since
aux-domain support allows a iommu backed group to be attached to
multiple domains. I'm considering if it is ok to have a list in vfio_domain.
Before each domain attach, vfio should do a check in the list if the iommu
backed group has been attached already. For vfio-pci devices, use its iommu
group to do a search in the list. For vfio-mdev-pci devices, use its parent
devices iommu group to do a search. Thus avoid duplicate attach. Thoughts?
 
> > What's more, regards to sIOV case,  a parent devices may have multiple
> > mdevs and the mdevs may be assigned to the same VM. Thus there will be multiple
> > attach on this parent device. This also makes me believe track in device level would
> > be better.
> 
> The aux domain support essentially specifies that the device can be
> attached to multiple domains, so I think we're ok for device-level
> group attach there, but not for bare iommu backed devices.  Thanks,

Got it.

Thanks,
Yi Liu
Alex Williamson June 28, 2019, 3:07 p.m. UTC | #7
On Mon, 24 Jun 2019 08:20:38 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> Hi Alex,
> 
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Friday, June 21, 2019 11:58 PM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > 
> > On Fri, 21 Jun 2019 10:23:10 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > Hi Alex,
> > >  
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Friday, June 21, 2019 5:08 AM
> > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > > >
> > > > On Thu, 20 Jun 2019 13:00:34 +0000
> > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > > >  
> > > > > Hi Alex,
> > > > >  
> > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > Sent: Thursday, June 20, 2019 12:27 PM
> > > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > > > > >
> > > > > > On Sat,  8 Jun 2019 21:21:11 +0800
> > > > > > Liu Yi L <yi.l.liu@intel.com> wrote:
> > > > > >  
> > > > > > > This patch adds sample driver named vfio-mdev-pci. It is to wrap
> > > > > > > a PCI device as a mediated device. For a pci device, once bound
> > > > > > > to vfio-mdev-pci driver, user space access of this device will
> > > > > > > go through vfio mdev framework. The usage of the device follows
> > > > > > > mdev management method. e.g. user should create a mdev before
> > > > > > > exposing the device to user-space.  
> > > [...]  
> > > > >  
> > > > > > However, the patch below just makes the mdev interface behave
> > > > > > correctly, I can't make it work on my system because commit
> > > > > > 7bd50f0cd2fd ("vfio/type1: Add domain at(de)taching group helpers")  
> > > > >
> > > > > What error did you encounter. I tested the patch with a device in a
> > > > > singleton iommu group. I'm also searching a proper machine with
> > > > > multiple devices in an iommu group and test it.  
> > > >
> > > > In vfio_iommu_type1, iommu backed mdev devices use the
> > > > iommu_attach_device() interface, which includes:
> > > >
> > > >         if (iommu_group_device_count(group) != 1)
> > > >                 goto out_unlock;
> > > >
> > > > So it's impossible to use with non-singleton groups currently.  
> > >
> > > Hmmm, I think it is no longer good to use iommu_attach_device() for iommu
> > > backed mdev devices now. In this flow, the purpose here is to attach a device
> > > to a domain and no need to check whether the device is in a singleton iommu
> > > group. I think it would be better to use __iommu_attach_device() instead of
> > > iommu_attach_device().  
> > 
> > That's a static and unexported, it's intentionally not an exposed
> > interface.  We can't attach devices in the same group to separate
> > domains allocated through iommu_domain_alloc(), this would violate the
> > iommu group isolation principles.  
> 
> Go it. :-) Then not good to expose such interface. But to support devices in
> non-singleton iommu group, we need to have a new interface which doesn't
> count the devices but attach all the devices.

We have iommu_attach_group(), we just need to track which groups are
attached.
 
> > > Also I found a potential mutex lock issue if using iommu_attach_device().
> > > In vfio_iommu_attach_group(), it uses iommu_group_for_each_dev() to loop
> > > all the devices in the group. It holds group->mutex. And then  
> > vfio_mdev_attach_domain()  
> > > calls iommu_attach_device() which also tries to get group->mutex. This would be
> > > an issue. If you are fine with it, I may post another patch for it. :-)  
> > 
> > Gack, yes, please send a patch.  
> 
> Would do it, may be together with the support of vfio-mdev-pci on devices in
> non-singleton iommu group.
> 
> >   
> > > > > > used iommu_attach_device() rather than iommu_attach_group() for non-aux
> > > > > > mdev iommu_device.  Is there a requirement that the mdev parent device
> > > > > > is in a singleton iommu group?  
> > > > >
> > > > > I don't think there should have such limitation. Per my understanding,
> > > > > vfio-mdev-pci should also be able to bind to devices which shares
> > > > > iommu group with other devices. vfio-pci works well for such devices.
> > > > > And since the two drivers share most of the codes, I think vfio-mdev-pci
> > > > > should naturally support it as well.  
> > > >
> > > > Yes, the difference though is that vfio.c knows when devices are in the
> > > > same group, which mdev vfio.c only knows about the non-iommu backed
> > > > group, not the group that is actually used for the iommu backing.  So
> > > > we either need to enlighten vfio.c or further abstract those details in
> > > > vfio_iommu_type1.c.  
> > >
> > > Not sure if it is necessary to introduce more changes to vfio.c or
> > > vfio_iommu_type1.c. If it's only for the scenario which two devices share an
> > > iommu_group, I guess it could be supported by using __iommu_attach_device()
> > > which has no device counting for the group. But maybe I missed something
> > > here. It would be great if you can elaborate a bit for it. :-)  
> > 
> > We need to use the group semantics, there's a reason
> > __iommu_attach_device() is not exposed, it's an internal helper.  I
> > think there's no way around that we need to somewhere track the actual
> > group we're attaching to and have the smarts to re-use it for other
> > devices in the same group.  
> 
> Hmmm, exposing __iommu_attach_device() is not good, let's forget it. :-)
> 
> > > > > > If this is a simplification, then
> > > > > > vfio-mdev-pci should not bind to devices where this is violated since
> > > > > > there's no way to use the device.  Can we support it though?  
> > > > >
> > > > > yeah, I think we need to support it.
> > > > >  
> > > > > > If I have two devices in the same group and bind them both to
> > > > > > vfio-mdev-pci, I end up with three groups, one for each mdev device and
> > > > > > the original physical device group.  vfio.c works with the mdev groups
> > > > > > and will try to match both groups to the container.  vfio_iommu_type1.c
> > > > > > also works with the mdev groups, except for the point where we actually
> > > > > > try to attach a group to a domain, which is the only window where we use
> > > > > > the iommu_device rather than the provided group, but we don't record
> > > > > > that anywhere.  Should struct vfio_group have a pointer to a reference
> > > > > > counted object that tracks the actual iommu_group attached, such that
> > > > > > we can determine that the group is already attached to the domain and
> > > > > > not try to attach again?  
> > > > >
> > > > > Agreed, we need to avoid such duplicated attach. Instead of adding
> > > > > reference counted object in vfio_group. I'm also considering the logic
> > > > > below:  
> > >
> > > Re-walked the code, I find the duplicated attach will happen on the vfio-mdev-pci
> > > device as vfio_mdev_attach_domain() only attaches the parent devices of
> > > iommu backed mdevs instead of all the devices within the physical iommu_group.
> > > While for a vfio-pci device, it will use iommu_attach_group() which attaches all the
> > > devices within the iommu backed group. The same with detach,
> > > vfio_mdev_detach_domain() detaches selective devices instead of all devices  
> > within  
> > > the iommu backed group.  
> > 
> > Yep, that's not good, for the non-aux case we need to follow the usual
> > group semantics or else we're limited to singleton groups.  
> 
> yep.
> 
> >   
> > > > >     /*
> > > > >       * Do this check in vfio_iommu_type1_attach_group(), after mdev_group
> > > > >       * is initialized.
> > > > >       */
> > > > >     if (vfio_group->mdev_group) {
> > > > >          /*
> > > > >            * vfio_group->mdev_group is true means vfio_group->iommu_group
> > > > >            * is not the actual iommu_group which is going to be attached to
> > > > >            * domain. To avoid duplicate iommu_group attach, needs to check if
> > > > >            * the actual iommu_group. vfio_get_parent_iommu_group() is a
> > > > >            * newly added helper function which returns the actual attach
> > > > >            * iommu_group going to be attached for this mdev group.
> > > > >               */
> > > > >          p_iommu_group = vfio_get_parent_iommu_group(
> > > > >                                                                          vfio_group->iommu_group);
> > > > >          list_for_each_entry(d, &iommu->domain_list, next) {
> > > > >                  if (find_iommu_group(d, p_iommu_group)) {
> > > > >                          mutex_unlock(&iommu->lock);
> > > > >                          // skip group attach;
> > > > >                  }
> > > > >          }  
> > > >
> > > > We don't currently create a struct vfio_group for the parent, only for
> > > > the mdev iommu group.  The iommu_attach for an iommu backed mdev
> > > > doesn't leave any traces of where it is actually attached, we just
> > > > count on retracing our steps for the detach.  That's why I'm thinking
> > > > we need an object somewhere to track it and it needs to be reference
> > > > counted so that if both a vfio-mdev-pci device and a vfio-pci device
> > > > are using it, we leave it in place if either one is removed.  
> > >
> > > Hmmm, here we are talking about tracking in iommu_group level though
> > > no good idea on where the object should  be placed yet. However, we may
> > > need to tack in device level as I mentioned in above paragraph. If not,
> > > there may be sequence issue. e.g. if vfio-mdev-pci device is attached
> > > firstly, then the object will be initialized, and when vfio-pci device is
> > > attached, we will find the attach should be skipped and just inc the ref count.
> > > But actually it should not be skipped since the vfio-mdev-pci attach does not
> > > attach all devices within the iommu backed group.  
> > 
> > We can't do that though, the entire group needs to be attached.  
> 
> Agree, may be getting another interface which is similar with
> iommu_attach_device(), but works for devices which is in non-singleton
> groups. So the attach for iommu backed mdev will also result in a sound
> attach to all the devices which share iommu group with the parent device.

iommu_attach_group()...

> This is just like vfio-pci devices. For the object for tracking purpose may be
> as below:
> 
> struct vfio_iommu_object {
> 	struct iommu_group *group;
> 	struct kref kref;
> };
> 
> And I think it should be per-domain and per-iommu backed group since
> aux-domain support allows a iommu backed group to be attached to
> multiple domains. I'm considering if it is ok to have a list in vfio_domain.
> Before each domain attach, vfio should do a check in the list if the iommu
> backed group has been attached already. For vfio-pci devices, use its iommu
> group to do a search in the list. For vfio-mdev-pci devices, use its parent
> devices iommu group to do a search. Thus avoid duplicate attach. Thoughts?

vfio_iommu_type1 already creates a struct vfio_iommu per container,
which includes a linked list of struct vfio_domain objects, where each
vfio_domain has a list of struct vfio_group objects.  So we need to
include the iommu device iommu group in that latter list somehow.
Thanks,

Alex
Yi Liu July 3, 2019, 8:25 a.m. UTC | #8
Hi Alex,

Thanks for the comments. Have four inline responses below. And one
of them need your further help. :-)
.
> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Friday, June 28, 2019 11:08 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> 
> On Mon, 24 Jun 2019 08:20:38 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > Hi Alex,
> >
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Friday, June 21, 2019 11:58 PM
> > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > >
> > > On Fri, 21 Jun 2019 10:23:10 +0000
> > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > >
> > > > Hi Alex,
> > > >
> > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > Sent: Friday, June 21, 2019 5:08 AM
> > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > > > >
> > > > > On Thu, 20 Jun 2019 13:00:34 +0000 "Liu, Yi L"
> > > > > <yi.l.liu@intel.com> wrote:
> > > > >
> > > > > > Hi Alex,
> > > > > >
> > > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > > Sent: Thursday, June 20, 2019 12:27 PM
> > > > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci
> > > > > > > driver
> > > > > > >
> > > > > > > On Sat,  8 Jun 2019 21:21:11 +0800 Liu Yi L
> > > > > > > <yi.l.liu@intel.com> wrote:
> > > > > > >
> > > > > > > > This patch adds sample driver named vfio-mdev-pci. It is
> > > > > > > > to wrap a PCI device as a mediated device. For a pci
> > > > > > > > device, once bound to vfio-mdev-pci driver, user space
> > > > > > > > access of this device will go through vfio mdev framework.
> > > > > > > > The usage of the device follows mdev management method.
> > > > > > > > e.g. user should create a mdev before exposing the device to user-space.
> > > > [...]
> > > > > >
> > > > > > > However, the patch below just makes the mdev interface
> > > > > > > behave correctly, I can't make it work on my system because
> > > > > > > commit 7bd50f0cd2fd ("vfio/type1: Add domain at(de)taching
> > > > > > > group helpers")
> > > > > >
> > > > > > What error did you encounter. I tested the patch with a device
> > > > > > in a singleton iommu group. I'm also searching a proper
> > > > > > machine with multiple devices in an iommu group and test it.
> > > > >
> > > > > In vfio_iommu_type1, iommu backed mdev devices use the
> > > > > iommu_attach_device() interface, which includes:
> > > > >
> > > > >         if (iommu_group_device_count(group) != 1)
> > > > >                 goto out_unlock;
> > > > >
> > > > > So it's impossible to use with non-singleton groups currently.
> > > >
> > > > Hmmm, I think it is no longer good to use iommu_attach_device()
> > > > for iommu backed mdev devices now. In this flow, the purpose here
> > > > is to attach a device to a domain and no need to check whether the
> > > > device is in a singleton iommu group. I think it would be better
> > > > to use __iommu_attach_device() instead of iommu_attach_device().
> > >
> > > That's a static and unexported, it's intentionally not an exposed
> > > interface.  We can't attach devices in the same group to separate
> > > domains allocated through iommu_domain_alloc(), this would violate
> > > the iommu group isolation principles.
> >
> > Go it. :-) Then not good to expose such interface. But to support
> > devices in non-singleton iommu group, we need to have a new interface
> > which doesn't count the devices but attach all the devices.
> 
> We have iommu_attach_group(), we just need to track which groups are attached.

yep.

> > > > Also I found a potential mutex lock issue if using iommu_attach_device().
> > > > In vfio_iommu_attach_group(), it uses iommu_group_for_each_dev()
> > > > to loop all the devices in the group. It holds group->mutex. And
> > > > then
> > > vfio_mdev_attach_domain()
> > > > calls iommu_attach_device() which also tries to get group->mutex.
> > > > This would be an issue. If you are fine with it, I may post
> > > > another patch for it. :-)
> > >
> > > Gack, yes, please send a patch.
> >
> > Would do it, may be together with the support of vfio-mdev-pci on
> > devices in non-singleton iommu group.
> >
> > >
> > > > > > > used iommu_attach_device() rather than iommu_attach_group()
> > > > > > > for non-aux mdev iommu_device.  Is there a requirement that
> > > > > > > the mdev parent device is in a singleton iommu group?
> > > > > >
> > > > > > I don't think there should have such limitation. Per my
> > > > > > understanding, vfio-mdev-pci should also be able to bind to
> > > > > > devices which shares iommu group with other devices. vfio-pci works well
> for such devices.
> > > > > > And since the two drivers share most of the codes, I think
> > > > > > vfio-mdev-pci should naturally support it as well.
> > > > >
> > > > > Yes, the difference though is that vfio.c knows when devices are
> > > > > in the same group, which mdev vfio.c only knows about the
> > > > > non-iommu backed group, not the group that is actually used for
> > > > > the iommu backing.  So we either need to enlighten vfio.c or
> > > > > further abstract those details in vfio_iommu_type1.c.
> > > >
> > > > Not sure if it is necessary to introduce more changes to vfio.c or
> > > > vfio_iommu_type1.c. If it's only for the scenario which two
> > > > devices share an iommu_group, I guess it could be supported by
> > > > using __iommu_attach_device() which has no device counting for the
> > > > group. But maybe I missed something here. It would be great if you
> > > > can elaborate a bit for it. :-)
> > >
> > > We need to use the group semantics, there's a reason
> > > __iommu_attach_device() is not exposed, it's an internal helper.  I
> > > think there's no way around that we need to somewhere track the
> > > actual group we're attaching to and have the smarts to re-use it for
> > > other devices in the same group.
> >
> > Hmmm, exposing __iommu_attach_device() is not good, let's forget it.
> > :-)
> >
> > > > > > > If this is a simplification, then vfio-mdev-pci should not
> > > > > > > bind to devices where this is violated since there's no way
> > > > > > > to use the device.  Can we support it though?
> > > > > > 
> > > > > > yeah, I think we need to support it.

I've already made vfio-mdev-pci driver work for non-singleton iommu
group. e.g. for devices in a single iommu group, I can bind the devices
to eithervfio-pci or vfio-mdev-pci and then passthru them to a VM. And
it will fail if user tries to passthru a vfio-mdev-pci device via vfio-pci
manner "-device vfio-pci,host=01:00.1". In other words, vfio-mdev-pci
device can only passthru via
"-device vfio-pci,sysfsdev=/sys/bus/mdev/devices/UUID". This is what
we expect.

However, I encountered a problem when trying to prevent user from
passthru these devices to different VMs. I've tried in my side, and I
can passthru vfio-pci device and vfio-mdev-pci device to different
VMs. But actually this operation should be failed. If all the devices
are bound to vfio-pci, Qemu will open iommu backed group. So
Qemu can check if a given group has already been used by an
AddressSpace (a.ka. VM) in vfio_get_group() thus to prevent
user from passthru these devices to different VMs if the devices
are in the same iommu backed group. However, here for a
vfio-mdev-pci device, it has a new group and group ID, Qemu
will not be able to detect if the other devices (share iommu group
with vfio-mdev-pci device) are passthru to existing VMs. This is the
major problem for vfio-mdev-pci to support non-singleton group
in my side now. Even all devices are bound to vfio-mdev-pci driver,
Qemu is still unable to check since all the vfio-mdev-pci devices
have a separate mdev group.

To fix it, may need Qemu to do more things. E.g. If it tries to use a
non-singleton iommu backed group, it needs to check if any mdev
group is created and used by an existing VM. Also it needs check if
iommu backed group is passthru to an existing VM when trying to
use a mdev group. For singleton iommu backed group and
aux-domain enabled physical device, still allow to passthru mdev
group to different VMs. To achieve these checks, Qemu may need
to have knowledge whether a group is iommu backed and singleton
or not. Do you think it is good to expose such info to userspace? or
any other idea? :-)

> > > > > >
> > > > > > > If I have two devices in the same group and bind them both
> > > > > > > to vfio-mdev-pci, I end up with three groups, one for each
> > > > > > > mdev device and the original physical device group.  vfio.c
> > > > > > > works with the mdev groups and will try to match both groups
> > > > > > > to the container.  vfio_iommu_type1.c also works with the
> > > > > > > mdev groups, except for the point where we actually try to
> > > > > > > attach a group to a domain, which is the only window where
> > > > > > > we use the iommu_device rather than the provided group, but
> > > > > > > we don't record that anywhere.  Should struct vfio_group
> > > > > > > have a pointer to a reference counted object that tracks the
> > > > > > > actual iommu_group attached, such that we can determine that the group
> is already attached to the domain and not try to attach again?
> > > > > >
> > > > > > Agreed, we need to avoid such duplicated attach. Instead of
> > > > > > adding reference counted object in vfio_group. I'm also
> > > > > > considering the logic
> > > > > > below:
> > > >
> > > > Re-walked the code, I find the duplicated attach will happen on
> > > > the vfio-mdev-pci device as vfio_mdev_attach_domain() only
> > > > attaches the parent devices of iommu backed mdevs instead of all the devices
> within the physical iommu_group.
> > > > While for a vfio-pci device, it will use iommu_attach_group()
> > > > which attaches all the devices within the iommu backed group. The
> > > > same with detach,
> > > > vfio_mdev_detach_domain() detaches selective devices instead of
> > > > all devices
> > > within
> > > > the iommu backed group.
> > >
> > > Yep, that's not good, for the non-aux case we need to follow the
> > > usual group semantics or else we're limited to singleton groups.
> >
> > yep.
> >
> > >
> > > > > >     /*
> > > > > >       * Do this check in vfio_iommu_type1_attach_group(), after mdev_group
> > > > > >       * is initialized.
> > > > > >       */
> > > > > >     if (vfio_group->mdev_group) {
> > > > > >          /*
> > > > > >            * vfio_group->mdev_group is true means vfio_group->iommu_group
> > > > > >            * is not the actual iommu_group which is going to be attached to
> > > > > >            * domain. To avoid duplicate iommu_group attach, needs to check if
> > > > > >            * the actual iommu_group. vfio_get_parent_iommu_group() is a
> > > > > >            * newly added helper function which returns the actual attach
> > > > > >            * iommu_group going to be attached for this mdev group.
> > > > > >               */
> > > > > >          p_iommu_group = vfio_get_parent_iommu_group(
> > > > > >                                                                          vfio_group->iommu_group);
> > > > > >          list_for_each_entry(d, &iommu->domain_list, next) {
> > > > > >                  if (find_iommu_group(d, p_iommu_group)) {
> > > > > >                          mutex_unlock(&iommu->lock);
> > > > > >                          // skip group attach;
> > > > > >                  }
> > > > > >          }
> > > > >
> > > > > We don't currently create a struct vfio_group for the parent,
> > > > > only for the mdev iommu group.  The iommu_attach for an iommu
> > > > > backed mdev doesn't leave any traces of where it is actually
> > > > > attached, we just count on retracing our steps for the detach.
> > > > > That's why I'm thinking we need an object somewhere to track it
> > > > > and it needs to be reference counted so that if both a
> > > > > vfio-mdev-pci device and a vfio-pci device are using it, we leave it in place if
> either one is removed.
> > > >
> > > > Hmmm, here we are talking about tracking in iommu_group level
> > > > though no good idea on where the object should  be placed yet.
> > > > However, we may need to tack in device level as I mentioned in
> > > > above paragraph. If not, there may be sequence issue. e.g. if
> > > > vfio-mdev-pci device is attached firstly, then the object will be
> > > > initialized, and when vfio-pci device is attached, we will find the attach should
> be skipped and just inc the ref count.
> > > > But actually it should not be skipped since the vfio-mdev-pci
> > > > attach does not attach all devices within the iommu backed group.
> > >
> > > We can't do that though, the entire group needs to be attached.
> >
> > Agree, may be getting another interface which is similar with
> > iommu_attach_device(), but works for devices which is in non-singleton
> > groups. So the attach for iommu backed mdev will also result in a
> > sound attach to all the devices which share iommu group with the parent device.
> 
> iommu_attach_group()...

got it. :-)

> > This is just like vfio-pci devices. For the object for tracking
> > purpose may be as below:
> >
> > struct vfio_iommu_object {
> > 	struct iommu_group *group;
> > 	struct kref kref;
> > };
> >
> > And I think it should be per-domain and per-iommu backed group since
> > aux-domain support allows a iommu backed group to be attached to
> > multiple domains. I'm considering if it is ok to have a list in vfio_domain.
> > Before each domain attach, vfio should do a check in the list if the
> > iommu backed group has been attached already. For vfio-pci devices,
> > use its iommu group to do a search in the list. For vfio-mdev-pci
> > devices, use its parent devices iommu group to do a search. Thus avoid duplicate
> attach. Thoughts?
> 
> vfio_iommu_type1 already creates a struct vfio_iommu per container, which
> includes a linked list of struct vfio_domain objects, where each vfio_domain has a
> list of struct vfio_group objects.  So we need to include the iommu device iommu
> group in that latter list somehow.
> Thanks,

Sure, will try it.

Thanks,
Yi Liu
Alex Williamson July 3, 2019, 5:22 p.m. UTC | #9
On Wed, 3 Jul 2019 08:25:25 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> Hi Alex,
> 
> Thanks for the comments. Have four inline responses below. And one
> of them need your further help. :-)
> .
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Friday, June 28, 2019 11:08 PM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > 
> > On Mon, 24 Jun 2019 08:20:38 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > Hi Alex,
> > >  
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Friday, June 21, 2019 11:58 PM
> > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > > >
> > > > On Fri, 21 Jun 2019 10:23:10 +0000
> > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > > >  
> > > > > Hi Alex,
> > > > >  
> > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > Sent: Friday, June 21, 2019 5:08 AM
> > > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > > > > >
> > > > > > On Thu, 20 Jun 2019 13:00:34 +0000 "Liu, Yi L"
> > > > > > <yi.l.liu@intel.com> wrote:
> > > > > >  
> > > > > > > Hi Alex,
> > > > > > >  
> > > > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > > > Sent: Thursday, June 20, 2019 12:27 PM
> > > > > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci
> > > > > > > > driver
> > > > > > > >
> > > > > > > > On Sat,  8 Jun 2019 21:21:11 +0800 Liu Yi L
> > > > > > > > <yi.l.liu@intel.com> wrote:
> > > > > > > >  
> > > > > > > > > This patch adds sample driver named vfio-mdev-pci. It is
> > > > > > > > > to wrap a PCI device as a mediated device. For a pci
> > > > > > > > > device, once bound to vfio-mdev-pci driver, user space
> > > > > > > > > access of this device will go through vfio mdev framework.
> > > > > > > > > The usage of the device follows mdev management method.
> > > > > > > > > e.g. user should create a mdev before exposing the device to user-space.  
> > > > > [...]  
> > > > > > >  
> > > > > > > > However, the patch below just makes the mdev interface
> > > > > > > > behave correctly, I can't make it work on my system because
> > > > > > > > commit 7bd50f0cd2fd ("vfio/type1: Add domain at(de)taching
> > > > > > > > group helpers")  
> > > > > > >
> > > > > > > What error did you encounter. I tested the patch with a device
> > > > > > > in a singleton iommu group. I'm also searching a proper
> > > > > > > machine with multiple devices in an iommu group and test it.  
> > > > > >
> > > > > > In vfio_iommu_type1, iommu backed mdev devices use the
> > > > > > iommu_attach_device() interface, which includes:
> > > > > >
> > > > > >         if (iommu_group_device_count(group) != 1)
> > > > > >                 goto out_unlock;
> > > > > >
> > > > > > So it's impossible to use with non-singleton groups currently.  
> > > > >
> > > > > Hmmm, I think it is no longer good to use iommu_attach_device()
> > > > > for iommu backed mdev devices now. In this flow, the purpose here
> > > > > is to attach a device to a domain and no need to check whether the
> > > > > device is in a singleton iommu group. I think it would be better
> > > > > to use __iommu_attach_device() instead of iommu_attach_device().  
> > > >
> > > > That's a static and unexported, it's intentionally not an exposed
> > > > interface.  We can't attach devices in the same group to separate
> > > > domains allocated through iommu_domain_alloc(), this would violate
> > > > the iommu group isolation principles.  
> > >
> > > Go it. :-) Then not good to expose such interface. But to support
> > > devices in non-singleton iommu group, we need to have a new interface
> > > which doesn't count the devices but attach all the devices.  
> > 
> > We have iommu_attach_group(), we just need to track which groups are attached.  
> 
> yep.
> 
> > > > > Also I found a potential mutex lock issue if using iommu_attach_device().
> > > > > In vfio_iommu_attach_group(), it uses iommu_group_for_each_dev()
> > > > > to loop all the devices in the group. It holds group->mutex. And
> > > > > then  
> > > > vfio_mdev_attach_domain()  
> > > > > calls iommu_attach_device() which also tries to get group->mutex.
> > > > > This would be an issue. If you are fine with it, I may post
> > > > > another patch for it. :-)  
> > > >
> > > > Gack, yes, please send a patch.  
> > >
> > > Would do it, may be together with the support of vfio-mdev-pci on
> > > devices in non-singleton iommu group.
> > >  
> > > >  
> > > > > > > > used iommu_attach_device() rather than iommu_attach_group()
> > > > > > > > for non-aux mdev iommu_device.  Is there a requirement that
> > > > > > > > the mdev parent device is in a singleton iommu group?  
> > > > > > >
> > > > > > > I don't think there should have such limitation. Per my
> > > > > > > understanding, vfio-mdev-pci should also be able to bind to
> > > > > > > devices which shares iommu group with other devices. vfio-pci works well  
> > for such devices.  
> > > > > > > And since the two drivers share most of the codes, I think
> > > > > > > vfio-mdev-pci should naturally support it as well.  
> > > > > >
> > > > > > Yes, the difference though is that vfio.c knows when devices are
> > > > > > in the same group, which mdev vfio.c only knows about the
> > > > > > non-iommu backed group, not the group that is actually used for
> > > > > > the iommu backing.  So we either need to enlighten vfio.c or
> > > > > > further abstract those details in vfio_iommu_type1.c.  
> > > > >
> > > > > Not sure if it is necessary to introduce more changes to vfio.c or
> > > > > vfio_iommu_type1.c. If it's only for the scenario which two
> > > > > devices share an iommu_group, I guess it could be supported by
> > > > > using __iommu_attach_device() which has no device counting for the
> > > > > group. But maybe I missed something here. It would be great if you
> > > > > can elaborate a bit for it. :-)  
> > > >
> > > > We need to use the group semantics, there's a reason
> > > > __iommu_attach_device() is not exposed, it's an internal helper.  I
> > > > think there's no way around that we need to somewhere track the
> > > > actual group we're attaching to and have the smarts to re-use it for
> > > > other devices in the same group.  
> > >
> > > Hmmm, exposing __iommu_attach_device() is not good, let's forget it.
> > > :-)
> > >  
> > > > > > > > If this is a simplification, then vfio-mdev-pci should not
> > > > > > > > bind to devices where this is violated since there's no way
> > > > > > > > to use the device.  Can we support it though?  
> > > > > > > 
> > > > > > > yeah, I think we need to support it.  
> 
> I've already made vfio-mdev-pci driver work for non-singleton iommu
> group. e.g. for devices in a single iommu group, I can bind the devices
> to eithervfio-pci or vfio-mdev-pci and then passthru them to a VM. And
> it will fail if user tries to passthru a vfio-mdev-pci device via vfio-pci
> manner "-device vfio-pci,host=01:00.1". In other words, vfio-mdev-pci
> device can only passthru via
> "-device vfio-pci,sysfsdev=/sys/bus/mdev/devices/UUID". This is what
> we expect.
> 
> However, I encountered a problem when trying to prevent user from
> passthru these devices to different VMs. I've tried in my side, and I
> can passthru vfio-pci device and vfio-mdev-pci device to different
> VMs. But actually this operation should be failed. If all the devices
> are bound to vfio-pci, Qemu will open iommu backed group. So
> Qemu can check if a given group has already been used by an
> AddressSpace (a.ka. VM) in vfio_get_group() thus to prevent
> user from passthru these devices to different VMs if the devices
> are in the same iommu backed group. However, here for a
> vfio-mdev-pci device, it has a new group and group ID, Qemu
> will not be able to detect if the other devices (share iommu group
> with vfio-mdev-pci device) are passthru to existing VMs. This is the
> major problem for vfio-mdev-pci to support non-singleton group
> in my side now. Even all devices are bound to vfio-mdev-pci driver,
> Qemu is still unable to check since all the vfio-mdev-pci devices
> have a separate mdev group.
> 
> To fix it, may need Qemu to do more things. E.g. If it tries to use a
> non-singleton iommu backed group, it needs to check if any mdev
> group is created and used by an existing VM. Also it needs check if
> iommu backed group is passthru to an existing VM when trying to
> use a mdev group. For singleton iommu backed group and
> aux-domain enabled physical device, still allow to passthru mdev
> group to different VMs. To achieve these checks, Qemu may need
> to have knowledge whether a group is iommu backed and singleton
> or not. Do you think it is good to expose such info to userspace? or
> any other idea? :-)

QEMU is never responsible for isolating a group, QEMU is just a
userspace driver, it's vfio's responsibility to prevent the user from
splitting groups in ways that are not allowed.  QEMU does not know the
true group association, it only knows the "virtual" group of the mdev
device.  QEMU will create a container and add the mdev virtual group to
the container.  In the kernel, the type1 backend should actually do an
iommu_attach_group(), attaching the iommu_device group to the domain.
When QEMU processes the next device, it will have a different group,
but (assuming no vIOMMU) it will try to attach it to the same
container, which should work because the iommu_device group backing the
mdev virtual group is already attached to this domain.

If we had two separate QEMU processes, each with an mdev device from a
common group at the iommu_device level, the type1 backend should fail
to attach the group to the container for the later caller.  I'd think
this should fail at the iommu_attach_group() call since the group we're
trying to attach is already attached to another domain.

It's really unfortunate that we don't have the mdev inheriting the
iommu group of the iommu_device so that userspace can really understand
this relationship.  A separate group makes sense for the aux-domain
case, and is (I guess) not a significant issue in the case of a
singleton iommu_device group, but it's pretty awkward here.  Perhaps
this is something we should correct in design of iommu backed mdevs.
Thanks,

Alex
Yi Liu July 4, 2019, 9:11 a.m. UTC | #10
Hi Alex,

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Thursday, July 4, 2019 1:22 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> 
> On Wed, 3 Jul 2019 08:25:25 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > Hi Alex,
> >
> > Thanks for the comments. Have four inline responses below. And one
> > of them need your further help. :-)

[...]

> > > > >
> > > > > > > > > used iommu_attach_device() rather than iommu_attach_group()
> > > > > > > > > for non-aux mdev iommu_device.  Is there a requirement that
> > > > > > > > > the mdev parent device is in a singleton iommu group?
> > > > > > > >
> > > > > > > > I don't think there should have such limitation. Per my
> > > > > > > > understanding, vfio-mdev-pci should also be able to bind to
> > > > > > > > devices which shares iommu group with other devices. vfio-pci works
> well
> > > for such devices.
> > > > > > > > And since the two drivers share most of the codes, I think
> > > > > > > > vfio-mdev-pci should naturally support it as well.
> > > > > > >
> > > > > > > Yes, the difference though is that vfio.c knows when devices are
> > > > > > > in the same group, which mdev vfio.c only knows about the
> > > > > > > non-iommu backed group, not the group that is actually used for
> > > > > > > the iommu backing.  So we either need to enlighten vfio.c or
> > > > > > > further abstract those details in vfio_iommu_type1.c.
> > > > > >
> > > > > > Not sure if it is necessary to introduce more changes to vfio.c or
> > > > > > vfio_iommu_type1.c. If it's only for the scenario which two
> > > > > > devices share an iommu_group, I guess it could be supported by
> > > > > > using __iommu_attach_device() which has no device counting for the
> > > > > > group. But maybe I missed something here. It would be great if you
> > > > > > can elaborate a bit for it. :-)
> > > > >
> > > > > We need to use the group semantics, there's a reason
> > > > > __iommu_attach_device() is not exposed, it's an internal helper.  I
> > > > > think there's no way around that we need to somewhere track the
> > > > > actual group we're attaching to and have the smarts to re-use it for
> > > > > other devices in the same group.
> > > >
> > > > Hmmm, exposing __iommu_attach_device() is not good, let's forget it.
> > > > :-)
> > > >
> > > > > > > > > If this is a simplification, then vfio-mdev-pci should not
> > > > > > > > > bind to devices where this is violated since there's no way
> > > > > > > > > to use the device.  Can we support it though?
> > > > > > > >
> > > > > > > > yeah, I think we need to support it.
> >
> > I've already made vfio-mdev-pci driver work for non-singleton iommu
> > group. e.g. for devices in a single iommu group, I can bind the devices
> > to eithervfio-pci or vfio-mdev-pci and then passthru them to a VM. And
> > it will fail if user tries to passthru a vfio-mdev-pci device via vfio-pci
> > manner "-device vfio-pci,host=01:00.1". In other words, vfio-mdev-pci
> > device can only passthru via
> > "-device vfio-pci,sysfsdev=/sys/bus/mdev/devices/UUID". This is what
> > we expect.
> >
> > However, I encountered a problem when trying to prevent user from
> > passthru these devices to different VMs. I've tried in my side, and I
> > can passthru vfio-pci device and vfio-mdev-pci device to different
> > VMs. But actually this operation should be failed. If all the devices
> > are bound to vfio-pci, Qemu will open iommu backed group. So
> > Qemu can check if a given group has already been used by an
> > AddressSpace (a.ka. VM) in vfio_get_group() thus to prevent
> > user from passthru these devices to different VMs if the devices
> > are in the same iommu backed group. However, here for a
> > vfio-mdev-pci device, it has a new group and group ID, Qemu
> > will not be able to detect if the other devices (share iommu group
> > with vfio-mdev-pci device) are passthru to existing VMs. This is the
> > major problem for vfio-mdev-pci to support non-singleton group
> > in my side now. Even all devices are bound to vfio-mdev-pci driver,
> > Qemu is still unable to check since all the vfio-mdev-pci devices
> > have a separate mdev group.
> >
> > To fix it, may need Qemu to do more things. E.g. If it tries to use a
> > non-singleton iommu backed group, it needs to check if any mdev
> > group is created and used by an existing VM. Also it needs check if
> > iommu backed group is passthru to an existing VM when trying to
> > use a mdev group. For singleton iommu backed group and
> > aux-domain enabled physical device, still allow to passthru mdev
> > group to different VMs. To achieve these checks, Qemu may need
> > to have knowledge whether a group is iommu backed and singleton
> > or not. Do you think it is good to expose such info to userspace? or
> > any other idea? :-)
> 
> QEMU is never responsible for isolating a group, QEMU is just a
> userspace driver, it's vfio's responsibility to prevent the user from
> splitting groups in ways that are not allowed.  QEMU does not know the

yep, also my concern.

> true group association, it only knows the "virtual" group of the mdev
> device.  QEMU will create a container and add the mdev virtual group to
> the container.  In the kernel, the type1 backend should actually do an
> iommu_attach_group(), attaching the iommu_device group to the domain.
> When QEMU processes the next device, it will have a different group,
> but (assuming no vIOMMU) it will try to attach it to the same
> container, which should work because the iommu_device group backing the
> mdev virtual group is already attached to this domain.
> If we had two separate QEMU processes, each with an mdev device from a
> common group at the iommu_device level, the type1 backend should fail
> to attach the group to the container for the later caller.  I'd think
> this should fail at the iommu_attach_group() call since the group we're
> trying to attach is already attached to another domain.

Agree with you. At first, I want to fail it in similar way with vfio-pci devices.
For vfio-pci devices from a common group, vfio will fail the operation around
/dev/vfio/group_id open if user tries to assign the vfio-pci devices from common
group to multiple QEMU processes. Meanwhile, QEMU will avoid to open a
/dev/vfio/group_id multiple times, so current vfio/QEMU works well for 
non- singleton group (no vIOMMU). Unfortunately, looks like we have no way
to fail vfio-mdev-pci devices in similar mechanism as each mdev has a separate
group. So yes, I agree with you that we may fail it around the group attach
phase. Below is my draft idea:

In vfio_iommu_type1_attach_group(), we need to do the following checks.

if (mdev_group) {
	if (iommu_device group enabled aux-domain) {
		/*
		  * iommu_group enabled aux-domain means the iommu_devices
		  * in this group are aux-domain enabled. e.g. SIOV capable devices.
		  * Also, I think for aux-domain enabled group, it essentially means
		  * the group is a singleton group as SIOV capable devices require
		  * to be in a singleton group.
		  */
		 iommu_aux_attach_device();
	} else {
		/*
		  * needs to check the group->opened in vfio.c. Just like what
		  * vfio_group_fopen() does. May be a new VFIO interface required
		  * here since the group->opened is within vfio.c.
		  * vfio_iommu_device_group_opened_inc() will inc group->opened, so
		  * that other VM will fail when trying to open the group. And another
		  * VFIO interface is also required to dec group->opened when VM is
		  * down.
		  */
		if (vfio_iommu_device_group_opened_inc(iommu_device_group))
			return -EBUSY;
		iommu_attach_gorup(iommu_device_group);
	}
}

The concern here is the two new VFIO interfaces. Any thoughts on this proposal? :-)

> It's really unfortunate that we don't have the mdev inheriting the
> iommu group of the iommu_device so that userspace can really understand
> this relationship.  A separate group makes sense for the aux-domain
> case, and is (I guess) not a significant issue in the case of a
> singleton iommu_device group, but it's pretty awkward here.  Perhaps
> this is something we should correct in design of iommu backed mdevs.

Yeah, for aux-domain case, it is not significant issue as aux-domain essentially
means singleton iommu_devie group. And in early time, when designing the support
for wrap pci as a mdev, we also considered to let vfio-mdev-pci to reuse
iommu_device group. But this results in an iommu backed group includes mdev and
physical devices, which might also be strange. Do you think it is valuable to reconsider
it?

> Thanks,
> 
> Alex

Thanks,
Yi Liu
Alex Williamson July 5, 2019, 3:55 p.m. UTC | #11
On Thu, 4 Jul 2019 09:11:02 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> Hi Alex,
> 
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Thursday, July 4, 2019 1:22 AM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > 
> > On Wed, 3 Jul 2019 08:25:25 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > Hi Alex,
> > >
> > > Thanks for the comments. Have four inline responses below. And one
> > > of them need your further help. :-)  
> 
> [...]
> 
> > > > > >  
> > > > > > > > > > used iommu_attach_device() rather than iommu_attach_group()
> > > > > > > > > > for non-aux mdev iommu_device.  Is there a requirement that
> > > > > > > > > > the mdev parent device is in a singleton iommu group?  
> > > > > > > > >
> > > > > > > > > I don't think there should have such limitation. Per my
> > > > > > > > > understanding, vfio-mdev-pci should also be able to bind to
> > > > > > > > > devices which shares iommu group with other devices. vfio-pci works  
> > well  
> > > > for such devices.  
> > > > > > > > > And since the two drivers share most of the codes, I think
> > > > > > > > > vfio-mdev-pci should naturally support it as well.  
> > > > > > > >
> > > > > > > > Yes, the difference though is that vfio.c knows when devices are
> > > > > > > > in the same group, which mdev vfio.c only knows about the
> > > > > > > > non-iommu backed group, not the group that is actually used for
> > > > > > > > the iommu backing.  So we either need to enlighten vfio.c or
> > > > > > > > further abstract those details in vfio_iommu_type1.c.  
> > > > > > >
> > > > > > > Not sure if it is necessary to introduce more changes to vfio.c or
> > > > > > > vfio_iommu_type1.c. If it's only for the scenario which two
> > > > > > > devices share an iommu_group, I guess it could be supported by
> > > > > > > using __iommu_attach_device() which has no device counting for the
> > > > > > > group. But maybe I missed something here. It would be great if you
> > > > > > > can elaborate a bit for it. :-)  
> > > > > >
> > > > > > We need to use the group semantics, there's a reason
> > > > > > __iommu_attach_device() is not exposed, it's an internal helper.  I
> > > > > > think there's no way around that we need to somewhere track the
> > > > > > actual group we're attaching to and have the smarts to re-use it for
> > > > > > other devices in the same group.  
> > > > >
> > > > > Hmmm, exposing __iommu_attach_device() is not good, let's forget it.
> > > > > :-)
> > > > >  
> > > > > > > > > > If this is a simplification, then vfio-mdev-pci should not
> > > > > > > > > > bind to devices where this is violated since there's no way
> > > > > > > > > > to use the device.  Can we support it though?  
> > > > > > > > >
> > > > > > > > > yeah, I think we need to support it.  
> > >
> > > I've already made vfio-mdev-pci driver work for non-singleton iommu
> > > group. e.g. for devices in a single iommu group, I can bind the devices
> > > to eithervfio-pci or vfio-mdev-pci and then passthru them to a VM. And
> > > it will fail if user tries to passthru a vfio-mdev-pci device via vfio-pci
> > > manner "-device vfio-pci,host=01:00.1". In other words, vfio-mdev-pci
> > > device can only passthru via
> > > "-device vfio-pci,sysfsdev=/sys/bus/mdev/devices/UUID". This is what
> > > we expect.
> > >
> > > However, I encountered a problem when trying to prevent user from
> > > passthru these devices to different VMs. I've tried in my side, and I
> > > can passthru vfio-pci device and vfio-mdev-pci device to different
> > > VMs. But actually this operation should be failed. If all the devices
> > > are bound to vfio-pci, Qemu will open iommu backed group. So
> > > Qemu can check if a given group has already been used by an
> > > AddressSpace (a.ka. VM) in vfio_get_group() thus to prevent
> > > user from passthru these devices to different VMs if the devices
> > > are in the same iommu backed group. However, here for a
> > > vfio-mdev-pci device, it has a new group and group ID, Qemu
> > > will not be able to detect if the other devices (share iommu group
> > > with vfio-mdev-pci device) are passthru to existing VMs. This is the
> > > major problem for vfio-mdev-pci to support non-singleton group
> > > in my side now. Even all devices are bound to vfio-mdev-pci driver,
> > > Qemu is still unable to check since all the vfio-mdev-pci devices
> > > have a separate mdev group.
> > >
> > > To fix it, may need Qemu to do more things. E.g. If it tries to use a
> > > non-singleton iommu backed group, it needs to check if any mdev
> > > group is created and used by an existing VM. Also it needs check if
> > > iommu backed group is passthru to an existing VM when trying to
> > > use a mdev group. For singleton iommu backed group and
> > > aux-domain enabled physical device, still allow to passthru mdev
> > > group to different VMs. To achieve these checks, Qemu may need
> > > to have knowledge whether a group is iommu backed and singleton
> > > or not. Do you think it is good to expose such info to userspace? or
> > > any other idea? :-)  
> > 
> > QEMU is never responsible for isolating a group, QEMU is just a
> > userspace driver, it's vfio's responsibility to prevent the user from
> > splitting groups in ways that are not allowed.  QEMU does not know the  
> 
> yep, also my concern.
> 
> > true group association, it only knows the "virtual" group of the mdev
> > device.  QEMU will create a container and add the mdev virtual group to
> > the container.  In the kernel, the type1 backend should actually do an
> > iommu_attach_group(), attaching the iommu_device group to the domain.
> > When QEMU processes the next device, it will have a different group,
> > but (assuming no vIOMMU) it will try to attach it to the same
> > container, which should work because the iommu_device group backing the
> > mdev virtual group is already attached to this domain.
> > If we had two separate QEMU processes, each with an mdev device from a
> > common group at the iommu_device level, the type1 backend should fail
> > to attach the group to the container for the later caller.  I'd think
> > this should fail at the iommu_attach_group() call since the group we're
> > trying to attach is already attached to another domain.  
> 
> Agree with you. At first, I want to fail it in similar way with vfio-pci devices.
> For vfio-pci devices from a common group, vfio will fail the operation around
> /dev/vfio/group_id open if user tries to assign the vfio-pci devices from common
> group to multiple QEMU processes. Meanwhile, QEMU will avoid to open a
> /dev/vfio/group_id multiple times, so current vfio/QEMU works well for 
> non- singleton group (no vIOMMU). Unfortunately, looks like we have no way
> to fail vfio-mdev-pci devices in similar mechanism as each mdev has a separate
> group. So yes, I agree with you that we may fail it around the group attach
> phase. Below is my draft idea:
> 
> In vfio_iommu_type1_attach_group(), we need to do the following checks.
> 
> if (mdev_group) {
> 	if (iommu_device group enabled aux-domain) {
> 		/*
> 		  * iommu_group enabled aux-domain means the iommu_devices
> 		  * in this group are aux-domain enabled. e.g. SIOV capable devices.
> 		  * Also, I think for aux-domain enabled group, it essentially means
> 		  * the group is a singleton group as SIOV capable devices require
> 		  * to be in a singleton group.
> 		  */
> 		 iommu_aux_attach_device();
> 	} else {
> 		/*
> 		  * needs to check the group->opened in vfio.c. Just like what
> 		  * vfio_group_fopen() does. May be a new VFIO interface required
> 		  * here since the group->opened is within vfio.c.
> 		  * vfio_iommu_device_group_opened_inc() will inc group->opened, so
> 		  * that other VM will fail when trying to open the group. And another
> 		  * VFIO interface is also required to dec group->opened when VM is
> 		  * down.
> 		  */
> 		if (vfio_iommu_device_group_opened_inc(iommu_device_group))
> 			return -EBUSY;
> 		iommu_attach_gorup(iommu_device_group);
> 	}
> }
> 
> The concern here is the two new VFIO interfaces. Any thoughts on this proposal? :-)
> 
> > It's really unfortunate that we don't have the mdev inheriting the
> > iommu group of the iommu_device so that userspace can really understand
> > this relationship.  A separate group makes sense for the aux-domain
> > case, and is (I guess) not a significant issue in the case of a
> > singleton iommu_device group, but it's pretty awkward here.  Perhaps
> > this is something we should correct in design of iommu backed mdevs.  
> 
> Yeah, for aux-domain case, it is not significant issue as aux-domain essentially
> means singleton iommu_devie group. And in early time, when designing the support
> for wrap pci as a mdev, we also considered to let vfio-mdev-pci to reuse
> iommu_device group. But this results in an iommu backed group includes mdev and
> physical devices, which might also be strange. Do you think it is valuable to reconsider
> it?

From a group perspective, the cleanest solution would seem to be that
IOMMU backed mdevs w/o aux domain support should inherit the IOMMU
group of the iommu_device, but I think the barrier here is that we have
a difficult time determining if the group is "viable" in that case.
For example a group where one devices is bound to a native host driver
and the other device bound to a vfio driver would typically be
considered non-viable as it breaks the isolation guarantees.  However
I think in this configuration, the parent device is effectively
participating in the isolation and "donating" its iommu group on behalf
of the mdev device.  I don't think we can simultaneously use that iommu
group for any other purpose.  I'm sure we could come up with a way for
vifo-core to understand this relationship and add it to the white list,
I wonder though how confusing this might be to users who now understand
the group/driver requirement to be "all endpoints bound to vfio
drivers".  This might still be the best approach regardless of this.
Thanks,

Alex
Yi Liu July 11, 2019, 12:27 p.m. UTC | #12
Hi Alex,

> From: kvm-owner@vger.kernel.org [mailto:kvm-owner@vger.kernel.org] On Behalf
> Of Alex Williamson
> Sent: Friday, July 5, 2019 11:55 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> 
> On Thu, 4 Jul 2019 09:11:02 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > Hi Alex,
> >
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Thursday, July 4, 2019 1:22 AM
> > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
[...]
> >
> > > It's really unfortunate that we don't have the mdev inheriting the
> > > iommu group of the iommu_device so that userspace can really understand
> > > this relationship.  A separate group makes sense for the aux-domain
> > > case, and is (I guess) not a significant issue in the case of a
> > > singleton iommu_device group, but it's pretty awkward here.  Perhaps
> > > this is something we should correct in design of iommu backed mdevs.
> >
> > Yeah, for aux-domain case, it is not significant issue as aux-domain essentially
> > means singleton iommu_devie group. And in early time, when designing the
> support
> > for wrap pci as a mdev, we also considered to let vfio-mdev-pci to reuse
> > iommu_device group. But this results in an iommu backed group includes mdev and
> > physical devices, which might also be strange. Do you think it is valuable to
> reconsider
> > it?
> 
> From a group perspective, the cleanest solution would seem to be that
> IOMMU backed mdevs w/o aux domain support should inherit the IOMMU
> group of the iommu_device,

A confirm here. Regards to inherit the IOMMU group of iommu_device, do
you mean mdev device should be added to the IOMMU group of iommu_device
or maintain a parent and inheritor relationship within vfio? I guess you mean the
later one? :-)

> but I think the barrier here is that we have
> a difficult time determining if the group is "viable" in that case.
> For example a group where one devices is bound to a native host driver
> and the other device bound to a vfio driver would typically be
> considered non-viable as it breaks the isolation guarantees.  However

yes, this is how vfio guarantee the isolation before allowing user to further
add a group to a vfio container and so on.

> I think in this configuration, the parent device is effectively
> participating in the isolation and "donating" its iommu group on behalf
> of the mdev device.  I don't think we can simultaneously use that iommu
> group for any other purpose. 

Agree. At least host cannot make use of the iommu group any more in such
configuration.

> I'm sure we could come up with a way for
> vifo-core to understand this relationship and add it to the white list,

The configuration is host driver still exists while we want to let mdev device
to somehow "own" the iommu backed DMA isolation capability. So one possible
way may be calling vfio_add_group_dev() which will creates a vfio_device instance
for the iommu_device in vfio.c when creating a iommu backed mdev. Then the
iommu group is fairly viable.

> I wonder though how confusing this might be to users who now understand
> the group/driver requirement to be "all endpoints bound to vfio
> drivers".  This might still be the best approach regardless of this.

Yes, another thing I'm considering is how to prevent such a host driver from
issuing DMA. If we finally get a device bound to vfio-pci and another device
wrapped as mdev and passthru them to VM, the host driver is still capable to
issue DMA. Though IOMMU can block some DMAs, but not all of them. If a
DMA issued by host driver happens to have mapping in IOMMU side, then
host is kind of doing things on behalf on VM. Though we may trust the host
driver, but it looks to be a little bit awkward to me. :-(

> Thanks,
> 
> Alex

Regards,
Yi Liu
Alex Williamson July 11, 2019, 7:08 p.m. UTC | #13
On Thu, 11 Jul 2019 12:27:26 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> Hi Alex,
> 
> > From: kvm-owner@vger.kernel.org [mailto:kvm-owner@vger.kernel.org] On Behalf
> > Of Alex Williamson
> > Sent: Friday, July 5, 2019 11:55 PM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > 
> > On Thu, 4 Jul 2019 09:11:02 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > Hi Alex,
> > >  
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Thursday, July 4, 2019 1:22 AM
> > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver  
> [...]
> > >  
> > > > It's really unfortunate that we don't have the mdev inheriting the
> > > > iommu group of the iommu_device so that userspace can really understand
> > > > this relationship.  A separate group makes sense for the aux-domain
> > > > case, and is (I guess) not a significant issue in the case of a
> > > > singleton iommu_device group, but it's pretty awkward here.  Perhaps
> > > > this is something we should correct in design of iommu backed mdevs.  
> > >
> > > Yeah, for aux-domain case, it is not significant issue as aux-domain essentially
> > > means singleton iommu_devie group. And in early time, when designing the  
> > support  
> > > for wrap pci as a mdev, we also considered to let vfio-mdev-pci to reuse
> > > iommu_device group. But this results in an iommu backed group includes mdev and
> > > physical devices, which might also be strange. Do you think it is valuable to  
> > reconsider  
> > > it?  
> > 
> > From a group perspective, the cleanest solution would seem to be that
> > IOMMU backed mdevs w/o aux domain support should inherit the IOMMU
> > group of the iommu_device,  
> 
> A confirm here. Regards to inherit the IOMMU group of iommu_device, do
> you mean mdev device should be added to the IOMMU group of iommu_device
> or maintain a parent and inheritor relationship within vfio? I guess you mean the
> later one? :-)

I was thinking the former, I'm not sure what the latter implies.  There
is no hierarchy within or between IOMMU groups, it's simply a set of
devices.  Maybe what you're getting at is that vfio needs to understand
that the mdev is a child of the endpoint device in its determination of
whether the group is viable.  That's true, but we can also have IOMMU
groups composed of SR-IOV VFs along with their parent PF if the root of
the IOMMU group is (for example) a downstream switch port above the PF.
So we can't simply look at the parent/child relationship within the
group, we somehow need to know that the parent device sharing the IOMMU
group is operating in host kernel space on behalf of the mdev.
 
> > but I think the barrier here is that we have
> > a difficult time determining if the group is "viable" in that case.
> > For example a group where one devices is bound to a native host driver
> > and the other device bound to a vfio driver would typically be
> > considered non-viable as it breaks the isolation guarantees.  However  
> 
> yes, this is how vfio guarantee the isolation before allowing user to further
> add a group to a vfio container and so on.
> 
> > I think in this configuration, the parent device is effectively
> > participating in the isolation and "donating" its iommu group on behalf
> > of the mdev device.  I don't think we can simultaneously use that iommu
> > group for any other purpose.   
> 
> Agree. At least host cannot make use of the iommu group any more in such
> configuration.
> 
> > I'm sure we could come up with a way for
> > vifo-core to understand this relationship and add it to the white list,  
> 
> The configuration is host driver still exists while we want to let mdev device
> to somehow "own" the iommu backed DMA isolation capability. So one possible
> way may be calling vfio_add_group_dev() which will creates a vfio_device instance
> for the iommu_device in vfio.c when creating a iommu backed mdev. Then the
> iommu group is fairly viable.

"fairly viable" ;)  It's a correct use of the term, it's a little funny
though as "fairly" can also mean reasonably/sufficiently/adequately as
well as I think the intended use here equivalent to justly. </tangent>

That's an interesting idea to do an implicit vfio_add_group_dev() on
the iommu_device in this case, if you've worked through how that could
play out, it'd be interesting to see.

> > I wonder though how confusing this might be to users who now understand
> > the group/driver requirement to be "all endpoints bound to vfio
> > drivers".  This might still be the best approach regardless of this.  
> 
> Yes, another thing I'm considering is how to prevent such a host driver from
> issuing DMA. If we finally get a device bound to vfio-pci and another device
> wrapped as mdev and passthru them to VM, the host driver is still capable to
> issue DMA. Though IOMMU can block some DMAs, but not all of them. If a
> DMA issued by host driver happens to have mapping in IOMMU side, then
> host is kind of doing things on behalf on VM. Though we may trust the host
> driver, but it looks to be a little bit awkward to me. :-(

vfio is allocating an iommu domain and placing the iommu_device into
that domain, the user therefore own the iova context for the parent
device, how would that not manage all DMA?   The vendor driver could
theoretically also manipulate mappings within that domain, but that
driver is a host kernel driver and therefore essentially trusted like
any other host kernel driver.  The only unique thing here is that it's
part of a channel providing access for an untrusted user, so it needs
to be particularly concerned with keeping that user access within
bounds.  Thanks,

Alex
Yi Liu July 12, 2019, 12:55 p.m. UTC | #14
Hi Alex,

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Friday, July 12, 2019 3:08 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> 
> On Thu, 11 Jul 2019 12:27:26 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > Hi Alex,
> >
> > > From: kvm-owner@vger.kernel.org [mailto:kvm-owner@vger.kernel.org] On
> Behalf
> > > Of Alex Williamson
> > > Sent: Friday, July 5, 2019 11:55 PM
> > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > >
> > > On Thu, 4 Jul 2019 09:11:02 +0000
> > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > >
> > > > Hi Alex,
> > > >
> > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > Sent: Thursday, July 4, 2019 1:22 AM
> > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > [...]
> > > >
> > > > > It's really unfortunate that we don't have the mdev inheriting the
> > > > > iommu group of the iommu_device so that userspace can really understand
> > > > > this relationship.  A separate group makes sense for the aux-domain
> > > > > case, and is (I guess) not a significant issue in the case of a
> > > > > singleton iommu_device group, but it's pretty awkward here.  Perhaps
> > > > > this is something we should correct in design of iommu backed mdevs.
> > > >
> > > > Yeah, for aux-domain case, it is not significant issue as aux-domain essentially
> > > > means singleton iommu_devie group. And in early time, when designing the
> > > support
> > > > for wrap pci as a mdev, we also considered to let vfio-mdev-pci to reuse
> > > > iommu_device group. But this results in an iommu backed group includes mdev
> and
> > > > physical devices, which might also be strange. Do you think it is valuable to
> > > reconsider
> > > > it?
> > >
> > > From a group perspective, the cleanest solution would seem to be that
> > > IOMMU backed mdevs w/o aux domain support should inherit the IOMMU
> > > group of the iommu_device,
> >
> > A confirm here. Regards to inherit the IOMMU group of iommu_device, do
> > you mean mdev device should be added to the IOMMU group of iommu_device
> > or maintain a parent and inheritor relationship within vfio? I guess you mean the
> > later one? :-)
> 
> I was thinking the former, I'm not sure what the latter implies.  There
> is no hierarchy within or between IOMMU groups, it's simply a set of
> devices.

I have a concern on adding the mdev device to the iommu_group of
iommu_device. In such configuration, a iommu backed group includes
mdev devices and physical devices. Then it might be necessary to advertise
the mdev info to the in-kernel software which want to loop all devices within
such an iommu_group. An example I can see is the virtual SVA threads in
community. e.g. for a guest pasid bind, the changes below loops all the
devices within an iommu_group, and each loop will call into vendor iommu
driver with a device structure passed in. It is quite possible that vendor
iommu driver need to get something behind a physical device (e.g.
intel_iommu structure). For a physical device, it is fine. While for mdev
device, it would be a problem if no mdev info advertised to iommu driver. :-(
Although we have agreement that PASID support should be disabled for
devices which are from non-singleton group. But I don't feel like to rely on
such assumptions when designing software flows. Also, it's just an example,
we have no idea if there will be more similar flows which require to loop all
devices in an iommu group in future. May be we want to avoid adding a mdev
to an iommu backed group. :-) More replies to you response below.

+static long vfio_iommu_type1_bind_gpasid(struct vfio_iommu *iommu,
+					    void __user *arg,
+					    struct vfio_iommu_type1_bind *bind)
+ ...
+	list_for_each_entry(domain, &iommu->domain_list, next) {
+		list_for_each_entry(group, &domain->group_list, next) {
+			ret = iommu_group_for_each_dev(group->iommu_group,
+			   &guest_bind, vfio_bind_gpasid_fn);
+			if (ret)
+				goto out_unbind;
+		}
+	}
+ ...
+}

> Maybe what you're getting at is that vfio needs to understand
> that the mdev is a child of the endpoint device in its determination of
> whether the group is viable.

Is the group here the group of iommu_device or a group of a mdev device?
:-) Actually, I think the group of a mdev device is always viable since
it has only a device and mdev_driver will add the mdev device to vfio
controlled scope to make the mdev group viable. Per my understanding,
VFIO guarantees the isolation by two major arts. First is checking if
group is viable before adding it to a container, second is preventing
multiple opens to /dev/vfio/group_id by the vfio_group->opened field
maintained in vfio.c.

Back to the configuration we are talking here (For example a group where
one devices is bound to a native host driver and the other device bound
to a vfio driver[1].), we have two groups( iommu backed one and mdev group).
I think for iommu_device which wants to "donate" its iommu_group, the
host driver should explicitly call vfio_add_group_dev() to add itself
to the vfio controlled scope. And thus make its iommu backed group be
viable. So that we can have two viable iommu groups. iommu backed group
is viable by the host driver's vfio_add_group_dev() calling, and mdev
group is naturally viable. Until now, we can passthru the devices
(vfio-pci device and a mdev device) under this configuration to VM well.
But we cannot prevent user to passthru the devices to different VMs since
the two iommu groups are both viable. If I'm still understanding vfio
correct until this line, I think we need to fail the attempt of passthru
to multiple VMs in vfio_iommu_type1_attach_group() by checking the
vfio_group->opened field which is maintained in vfio.c. e.g. let's say
for iommu backed group, we have vfio_group#1 and mdev group, we have
vfio_group#2 in vfio.c, then opening vfio_group#1 requires to inc the
vfio_group#2->opened. And vice versa.

[1] the example from the previous reply of you.

> That's true, but we can also have IOMMU
> groups composed of SR-IOV VFs along with their parent PF if the root of
> the IOMMU group is (for example) a downstream switch port above the PF.
> So we can't simply look at the parent/child relationship within the
> group, we somehow need to know that the parent device sharing the IOMMU
> group is operating in host kernel space on behalf of the mdev.

I think for such hardware configuration, we still have only two iommu
group, a iommu backed one and a mdev group. May the idea above still
applicable. :-)

> > > but I think the barrier here is that we have
> > > a difficult time determining if the group is "viable" in that case.
> > > For example a group where one devices is bound to a native host driver
> > > and the other device bound to a vfio driver would typically be
> > > considered non-viable as it breaks the isolation guarantees.  However
> >
> > yes, this is how vfio guarantee the isolation before allowing user to further
> > add a group to a vfio container and so on.
> >
> > > I think in this configuration, the parent device is effectively
> > > participating in the isolation and "donating" its iommu group on behalf
> > > of the mdev device.  I don't think we can simultaneously use that iommu
> > > group for any other purpose.
> >
> > Agree. At least host cannot make use of the iommu group any more in such
> > configuration.
> >
> > > I'm sure we could come up with a way for
> > > vifo-core to understand this relationship and add it to the white list,
> >
> > The configuration is host driver still exists while we want to let mdev device
> > to somehow "own" the iommu backed DMA isolation capability. So one possible
> > way may be calling vfio_add_group_dev() which will creates a vfio_device instance
> > for the iommu_device in vfio.c when creating a iommu backed mdev. Then the
> > iommu group is fairly viable.
> 
> "fairly viable" ;)  It's a correct use of the term, it's a little funny
> though as "fairly" can also mean reasonably/sufficiently/adequately as
> well as I think the intended use here equivalent to justly. </tangent>

Aha, a nice "lesson" for me. Honestly, I have no idea how it came to me
when trying to describe my idea with a moderate term either. Luckily,
it made me well understood. :-)

> That's an interesting idea to do an implicit vfio_add_group_dev() on
> the iommu_device in this case, if you've worked through how that could
> play out, it'd be interesting to see.

I've tried it in my vfio-mdev-pci driver probe() phase, it works well.
And this is an explicit calling. And I guess we may really want host driver
to do it explicitly instead of implicitly as host driver owns the choice
of whether "donating" group or not. While for failing the
vfio_iommu_type1_attach_group() to prevent user passthru the vfio-pci device
and vfio-mdev-pci device (share iommu backed group) to different VMs, I'm
doing some changes. If it's a correct way, I'll try to send out a new version
for your further review. :-)

> > > I wonder though how confusing this might be to users who now understand
> > > the group/driver requirement to be "all endpoints bound to vfio
> > > drivers".  This might still be the best approach regardless of this.
> >
> > Yes, another thing I'm considering is how to prevent such a host driver from
> > issuing DMA. If we finally get a device bound to vfio-pci and another device
> > wrapped as mdev and passthru them to VM, the host driver is still capable to
> > issue DMA. Though IOMMU can block some DMAs, but not all of them. If a
> > DMA issued by host driver happens to have mapping in IOMMU side, then
> > host is kind of doing things on behalf on VM. Though we may trust the host
> > driver, but it looks to be a little bit awkward to me. :-(
> 
> vfio is allocating an iommu domain and placing the iommu_device into
> that domain, the user therefore own the iova context for the parent
> device, how would that not manage all DMA?   The vendor driver could
> theoretically also manipulate mappings within that domain, but that
> driver is a host kernel driver and therefore essentially trusted like
> any other host kernel driver.  The only unique thing here is that it's
> part of a channel providing access for an untrusted user, so it needs
> to be particularly concerned with keeping that user access within
> bounds.  Thanks,

Got it, thanks for the explanation. Looks like I overplayed the concern.

> 
> Alex

Regards,
Yi Liu
Alex Williamson July 19, 2019, 8:57 p.m. UTC | #15
On Fri, 12 Jul 2019 12:55:27 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> Hi Alex,
> 
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Friday, July 12, 2019 3:08 AM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > 
> > On Thu, 11 Jul 2019 12:27:26 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > Hi Alex,
> > >  
> > > > From: kvm-owner@vger.kernel.org [mailto:kvm-owner@vger.kernel.org] On  
> > Behalf  
> > > > Of Alex Williamson
> > > > Sent: Friday, July 5, 2019 11:55 PM
> > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > > >
> > > > On Thu, 4 Jul 2019 09:11:02 +0000
> > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > > >  
> > > > > Hi Alex,
> > > > >  
> > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > Sent: Thursday, July 4, 2019 1:22 AM
> > > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver  
> > > [...]  
> > > > >  
> > > > > > It's really unfortunate that we don't have the mdev inheriting the
> > > > > > iommu group of the iommu_device so that userspace can really understand
> > > > > > this relationship.  A separate group makes sense for the aux-domain
> > > > > > case, and is (I guess) not a significant issue in the case of a
> > > > > > singleton iommu_device group, but it's pretty awkward here.  Perhaps
> > > > > > this is something we should correct in design of iommu backed mdevs.  
> > > > >
> > > > > Yeah, for aux-domain case, it is not significant issue as aux-domain essentially
> > > > > means singleton iommu_devie group. And in early time, when designing the  
> > > > support  
> > > > > for wrap pci as a mdev, we also considered to let vfio-mdev-pci to reuse
> > > > > iommu_device group. But this results in an iommu backed group includes mdev  
> > and  
> > > > > physical devices, which might also be strange. Do you think it is valuable to  
> > > > reconsider  
> > > > > it?  
> > > >
> > > > From a group perspective, the cleanest solution would seem to be that
> > > > IOMMU backed mdevs w/o aux domain support should inherit the IOMMU
> > > > group of the iommu_device,  
> > >
> > > A confirm here. Regards to inherit the IOMMU group of iommu_device, do
> > > you mean mdev device should be added to the IOMMU group of iommu_device
> > > or maintain a parent and inheritor relationship within vfio? I guess you mean the
> > > later one? :-)  
> > 
> > I was thinking the former, I'm not sure what the latter implies.  There
> > is no hierarchy within or between IOMMU groups, it's simply a set of
> > devices.  
> 
> I have a concern on adding the mdev device to the iommu_group of
> iommu_device. In such configuration, a iommu backed group includes
> mdev devices and physical devices. Then it might be necessary to advertise
> the mdev info to the in-kernel software which want to loop all devices within
> such an iommu_group. An example I can see is the virtual SVA threads in
> community. e.g. for a guest pasid bind, the changes below loops all the
> devices within an iommu_group, and each loop will call into vendor iommu
> driver with a device structure passed in. It is quite possible that vendor
> iommu driver need to get something behind a physical device (e.g.
> intel_iommu structure). For a physical device, it is fine. While for mdev
> device, it would be a problem if no mdev info advertised to iommu driver. :-(
> Although we have agreement that PASID support should be disabled for
> devices which are from non-singleton group. But I don't feel like to rely on
> such assumptions when designing software flows. Also, it's just an example,
> we have no idea if there will be more similar flows which require to loop all
> devices in an iommu group in future. May be we want to avoid adding a mdev
> to an iommu backed group. :-) More replies to you response below.
> 
> +static long vfio_iommu_type1_bind_gpasid(struct vfio_iommu *iommu,
> +					    void __user *arg,
> +					    struct vfio_iommu_type1_bind *bind)
> + ...
> +	list_for_each_entry(domain, &iommu->domain_list, next) {
> +		list_for_each_entry(group, &domain->group_list, next) {
> +			ret = iommu_group_for_each_dev(group->iommu_group,
> +			   &guest_bind, vfio_bind_gpasid_fn);
> +			if (ret)
> +				goto out_unbind;
> +		}
> +	}
> + ...
> +}

Sorry for the delayed response.

I think you're right, making the IOMMU code understand virtual devices
in an IOMMU group makes traversing the group difficult for any layer
that doesn't understand the relationship of these virtual devices.  I
guess we can't go that route.

> > Maybe what you're getting at is that vfio needs to understand
> > that the mdev is a child of the endpoint device in its determination of
> > whether the group is viable.  
> 
> Is the group here the group of iommu_device or a group of a mdev device?
> :-) Actually, I think the group of a mdev device is always viable since
> it has only a device and mdev_driver will add the mdev device to vfio
> controlled scope to make the mdev group viable. Per my understanding,
> VFIO guarantees the isolation by two major arts. First is checking if
> group is viable before adding it to a container, second is preventing
> multiple opens to /dev/vfio/group_id by the vfio_group->opened field
> maintained in vfio.c.

Yes, minor nit, an mdev needs to be bound to vfio-mdev for the group to
be vfio "viable", we expect that there will eventually be non-vfio
drivers for mdev devices.

> Back to the configuration we are talking here (For example a group where
> one devices is bound to a native host driver and the other device bound
> to a vfio driver[1].), we have two groups( iommu backed one and mdev group).
> I think for iommu_device which wants to "donate" its iommu_group, the
> host driver should explicitly call vfio_add_group_dev() to add itself
> to the vfio controlled scope. And thus make its iommu backed group be
> viable. So that we can have two viable iommu groups. iommu backed group
> is viable by the host driver's vfio_add_group_dev() calling, and mdev
> group is naturally viable. Until now, we can passthru the devices
> (vfio-pci device and a mdev device) under this configuration to VM well.
> But we cannot prevent user to passthru the devices to different VMs since
> the two iommu groups are both viable. If I'm still understanding vfio
> correct until this line, I think we need to fail the attempt of passthru
> to multiple VMs in vfio_iommu_type1_attach_group() by checking the
> vfio_group->opened field which is maintained in vfio.c. e.g. let's say
> for iommu backed group, we have vfio_group#1 and mdev group, we have
> vfio_group#2 in vfio.c, then opening vfio_group#1 requires to inc the
> vfio_group#2->opened. And vice versa.
> 
> [1] the example from the previous reply of you.

I think there's a problem with incrementing the group, the user still
needs to be able to open the group for devices within the group that
may be bound to vfio-pci, so I don't think this plan really works.
Also, who would be responsible for calling vfio_add_group_dev(), the
vendor driver is just registering an mdev parent device, it doesn't
know that those devices will be used by vfio-mdev or some other mdev
bus driver.  I think that means that vfio-mdev would need to call this
for mdevs with an iommu_device after it registers the mdev itself.  The
vfio_device_ops it registers would need to essentially be stubbed out
too, in order to prevent direct vfio access to the backing device.

I wonder if the "inheritance" of a group could be isolated to vfio in
such a case.  The vfio group file for the mdev must exist for
userspace compatibility, but I wonder if we could manage to make that be
effectively an alias for the iommu device.  Using a device from a group
without actually opening the group still seems problematic too.  I'm
also wondering how much effort we want to go to in supporting this
versus mdev could essentially fail the call to register an iommu device
for an mdev if that iommu device is not in a singleton group.  It would
limit the application of vfio-mdev-pci, but already being proposed as a
proof of concept sample driver anyway.


> > That's true, but we can also have IOMMU
> > groups composed of SR-IOV VFs along with their parent PF if the root of
> > the IOMMU group is (for example) a downstream switch port above the PF.
> > So we can't simply look at the parent/child relationship within the
> > group, we somehow need to know that the parent device sharing the IOMMU
> > group is operating in host kernel space on behalf of the mdev.  
> 
> I think for such hardware configuration, we still have only two iommu
> group, a iommu backed one and a mdev group. May the idea above still
> applicable. :-)
> 
> > > > but I think the barrier here is that we have
> > > > a difficult time determining if the group is "viable" in that case.
> > > > For example a group where one devices is bound to a native host driver
> > > > and the other device bound to a vfio driver would typically be
> > > > considered non-viable as it breaks the isolation guarantees.  However  
> > >
> > > yes, this is how vfio guarantee the isolation before allowing user to further
> > > add a group to a vfio container and so on.
> > >  
> > > > I think in this configuration, the parent device is effectively
> > > > participating in the isolation and "donating" its iommu group on behalf
> > > > of the mdev device.  I don't think we can simultaneously use that iommu
> > > > group for any other purpose.  
> > >
> > > Agree. At least host cannot make use of the iommu group any more in such
> > > configuration.
> > >  
> > > > I'm sure we could come up with a way for
> > > > vifo-core to understand this relationship and add it to the white list,  
> > >
> > > The configuration is host driver still exists while we want to let mdev device
> > > to somehow "own" the iommu backed DMA isolation capability. So one possible
> > > way may be calling vfio_add_group_dev() which will creates a vfio_device instance
> > > for the iommu_device in vfio.c when creating a iommu backed mdev. Then the
> > > iommu group is fairly viable.  
> > 
> > "fairly viable" ;)  It's a correct use of the term, it's a little funny
> > though as "fairly" can also mean reasonably/sufficiently/adequately as
> > well as I think the intended use here equivalent to justly. </tangent>  
> 
> Aha, a nice "lesson" for me. Honestly, I have no idea how it came to me
> when trying to describe my idea with a moderate term either. Luckily,
> it made me well understood. :-)
> 
> > That's an interesting idea to do an implicit vfio_add_group_dev() on
> > the iommu_device in this case, if you've worked through how that could
> > play out, it'd be interesting to see.  
> 
> I've tried it in my vfio-mdev-pci driver probe() phase, it works well.
> And this is an explicit calling. And I guess we may really want host driver
> to do it explicitly instead of implicitly as host driver owns the choice
> of whether "donating" group or not. While for failing the
> vfio_iommu_type1_attach_group() to prevent user passthru the vfio-pci device
> and vfio-mdev-pci device (share iommu backed group) to different VMs, I'm
> doing some changes. If it's a correct way, I'll try to send out a new version
> for your further review. :-)

I'm interested to see it, but as above, I have some reservations.  And
as I mention, and mdev vendor driver cannot assume the device is used
by vfio-mdev.  I know Intel vGPUs not only assume vfio-mdev, but also
KVM and fail the device open if the constraints aren't met, but I don't
think we can start introducing that sort of vfio specific dependencies
on the mdev bus interface.  Thanks,

Alex
Yi Liu July 26, 2019, 9:04 a.m. UTC | #16
Hi Alex,

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Saturday, July 20, 2019 4:58 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> 
> On Fri, 12 Jul 2019 12:55:27 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > Hi Alex,
> >
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Friday, July 12, 2019 3:08 AM
> > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > >
> > > On Thu, 11 Jul 2019 12:27:26 +0000
> > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > >
> > > > Hi Alex,
> > > >
> > > > > From: kvm-owner@vger.kernel.org [mailto:kvm-owner@vger.kernel.org] On
> > > Behalf
> > > > > Of Alex Williamson
> > > > > Sent: Friday, July 5, 2019 11:55 PM
> > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > > > >
> > > > > On Thu, 4 Jul 2019 09:11:02 +0000
> > > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > > > >
> > > > > > Hi Alex,
> > > > > >
> > > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > > Sent: Thursday, July 4, 2019 1:22 AM
> > > > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > > > [...]
[...]
> > > Maybe what you're getting at is that vfio needs to understand
> > > that the mdev is a child of the endpoint device in its determination of
> > > whether the group is viable.
> >
> > Is the group here the group of iommu_device or a group of a mdev device?
> > :-) Actually, I think the group of a mdev device is always viable since
> > it has only a device and mdev_driver will add the mdev device to vfio
> > controlled scope to make the mdev group viable. Per my understanding,
> > VFIO guarantees the isolation by two major arts. First is checking if
> > group is viable before adding it to a container, second is preventing
> > multiple opens to /dev/vfio/group_id by the vfio_group->opened field
> > maintained in vfio.c.
> 
> Yes, minor nit, an mdev needs to be bound to vfio-mdev for the group to
> be vfio "viable", we expect that there will eventually be non-vfio
> drivers for mdev devices.

Then I guess the mdev group is non-viable per vfio's mind. :-)

> > Back to the configuration we are talking here (For example a group where
> > one devices is bound to a native host driver and the other device bound
> > to a vfio driver[1].), we have two groups( iommu backed one and mdev group).
> > I think for iommu_device which wants to "donate" its iommu_group, the
> > host driver should explicitly call vfio_add_group_dev() to add itself
> > to the vfio controlled scope. And thus make its iommu backed group be
> > viable. So that we can have two viable iommu groups. iommu backed group
> > is viable by the host driver's vfio_add_group_dev() calling, and mdev
> > group is naturally viable. Until now, we can passthru the devices
> > (vfio-pci device and a mdev device) under this configuration to VM well.
> > But we cannot prevent user to passthru the devices to different VMs since
> > the two iommu groups are both viable. If I'm still understanding vfio
> > correct until this line, I think we need to fail the attempt of passthru
> > to multiple VMs in vfio_iommu_type1_attach_group() by checking the
> > vfio_group->opened field which is maintained in vfio.c. e.g. let's say
> > for iommu backed group, we have vfio_group#1 and mdev group, we have
> > vfio_group#2 in vfio.c, then opening vfio_group#1 requires to inc the
> > vfio_group#2->opened. And vice versa.
> >
> > [1] the example from the previous reply of you.
> 
> I think there's a problem with incrementing the group, the user still
> needs to be able to open the group for devices within the group that
> may be bound to vfio-pci, so I don't think this plan really works.

Perhaps I failed to make it clear. Let me explain. By incrementing the
group, vfio can prevent the usage of passthru a single iommu group
to different QEMUs (VMs). Once a QEMU opens a group. It will not
open again. e.g. current QEMU implementation checks the
vfio_group_list in vfio_get_group() before opening group for an
assigned device. Thus it avoids to open multiple times in a QEMU
process. This makes sense since kernel VFIO will attach all the devices
within a given iommu group to the allocated unmanaged domain in
vfio_iommu_type1_attach_group(). Back to my plan. :-) Say I have a
iommu group with three devices. A device bound to vfio-pci, and two
devices bound to a host driver which will wrap itself as a mdev (e.g.
vfio-mdev-pci driver). So there will be finally three groups, an iommu
backed group, two mdev groups. As I mentioned I may be able to
make the iommu backed group be vfio viable with
vfio_add_group_dev(). Then my plan is simple. Let the three groups
shares a group->open field. When any of the three groups results in
a increment of iommu back group. Also before any open of the three
groups, vfio_group_fops_open() should check the iommu backed
group first. Alternatively, this check can also be done in
vfio_iommu_type1_attach_group(). Looks like it may be better to
happen in vfio_group_fops_open() since we may need to let vfio.c
understand the " inheritance" between the three groups.

> Also, who would be responsible for calling vfio_add_group_dev(), the
> vendor driver is just registering an mdev parent device, it doesn't

I think it should be the vendor driver since I believe it's vendor driver's
duty to make this decision. This would like the vendor driver wants to
"donate" its iommu group to a mdev device.

> know that those devices will be used by vfio-mdev or some other mdev
> bus driver.  I think that means that vfio-mdev would need to call this
> for mdevs with an iommu_device after it registers the mdev itself.  The

Hmmm, it may be a trouble if letting vfio-mdev call this for mdevs. I'm
not sure if vfio-mdev can have the knowledge that the mdev is backed
by an iommu device.

> vfio_device_ops it registers would need to essentially be stubbed out
> too, in order to prevent direct vfio access to the backing device.

yes, the vfio_device_ops would be a dummy. In order to prevent
direct vfio access to the backing device, the vfio_device_ops.open()
should be implemented as always fail the open attempt. Thus no direct
vfio access will be successful.

> I wonder if the "inheritance" of a group could be isolated to vfio in
> such a case.  The vfio group file for the mdev must exist for
> userspace compatibility, but I wonder if we could manage to make that be
> effectively an alias for the iommu device.  Using a device from a group
> without actually opening the group still seems problematic too.  I'm

Yeah, the "inheritance" of iommu backed group and mdev groups should
be kind of "alias".

> also wondering how much effort we want to go to in supporting this
> versus mdev could essentially fail the call to register an iommu device
> for an mdev if that iommu device is not in a singleton group.  It would
> limit the application of vfio-mdev-pci, but already being proposed as a
> proof of concept sample driver anyway.

Let me have a try and get back to you. :-)
 
> > > That's true, but we can also have IOMMU
> > > groups composed of SR-IOV VFs along with their parent PF if the root of
> > > the IOMMU group is (for example) a downstream switch port above the PF.
> > > So we can't simply look at the parent/child relationship within the
> > > group, we somehow need to know that the parent device sharing the IOMMU
> > > group is operating in host kernel space on behalf of the mdev.
> >
> > I think for such hardware configuration, we still have only two iommu
> > group, a iommu backed one and a mdev group. May the idea above still
> > applicable. :-)
> >
> > > > > but I think the barrier here is that we have
> > > > > a difficult time determining if the group is "viable" in that case.
> > > > > For example a group where one devices is bound to a native host driver
> > > > > and the other device bound to a vfio driver would typically be
> > > > > considered non-viable as it breaks the isolation guarantees.  However
> > > >
> > > > yes, this is how vfio guarantee the isolation before allowing user to further
> > > > add a group to a vfio container and so on.
> > > >
> > > > > I think in this configuration, the parent device is effectively
> > > > > participating in the isolation and "donating" its iommu group on behalf
> > > > > of the mdev device.  I don't think we can simultaneously use that iommu
> > > > > group for any other purpose.
> > > >
> > > > Agree. At least host cannot make use of the iommu group any more in such
> > > > configuration.
> > > >
> > > > > I'm sure we could come up with a way for
> > > > > vifo-core to understand this relationship and add it to the white list,
> > > >
> > > > The configuration is host driver still exists while we want to let mdev device
> > > > to somehow "own" the iommu backed DMA isolation capability. So one
> possible
> > > > way may be calling vfio_add_group_dev() which will creates a vfio_device
> instance
> > > > for the iommu_device in vfio.c when creating a iommu backed mdev. Then the
> > > > iommu group is fairly viable.
> > >
> > > "fairly viable" ;)  It's a correct use of the term, it's a little funny
> > > though as "fairly" can also mean reasonably/sufficiently/adequately as
> > > well as I think the intended use here equivalent to justly. </tangent>
> >
> > Aha, a nice "lesson" for me. Honestly, I have no idea how it came to me
> > when trying to describe my idea with a moderate term either. Luckily,
> > it made me well understood. :-)
> >
> > > That's an interesting idea to do an implicit vfio_add_group_dev() on
> > > the iommu_device in this case, if you've worked through how that could
> > > play out, it'd be interesting to see.
> >
> > I've tried it in my vfio-mdev-pci driver probe() phase, it works well.
> > And this is an explicit calling. And I guess we may really want host driver
> > to do it explicitly instead of implicitly as host driver owns the choice
> > of whether "donating" group or not. While for failing the
> > vfio_iommu_type1_attach_group() to prevent user passthru the vfio-pci device
> > and vfio-mdev-pci device (share iommu backed group) to different VMs, I'm
> > doing some changes. If it's a correct way, I'll try to send out a new version
> > for your further review. :-)
> 
> I'm interested to see it, but as above, I have some reservations.  And
> as I mention, and mdev vendor driver cannot assume the device is used
> by vfio-mdev.  I know Intel vGPUs not only assume vfio-mdev, but also
> KVM and fail the device open if the constraints aren't met, but I don't
> think we can start introducing that sort of vfio specific dependencies
> on the mdev bus interface.  Thanks,

Yeah, it's always bad to introduce specific dependencies. But here if
letting vendor driver to call the vfio_add_group_dev(), then it is still
agnostic to vfio-mdev and other potential vfio-mdev alike mdev drivers
in future. Not sure if this is correct, pls feel free correct me. :-)

> Alex

Thanks,
Yi Liu
diff mbox series

Patch

diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index d94317a..ac118ef 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -5,4 +5,10 @@  vfio-pci-y := vfio_pci.o vfio_pci_common.o vfio_pci_intrs.o \
 vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
 vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
 
+vfio-mdev-pci-y := vfio_mdev_pci.o vfio_pci_common.o vfio_pci_intrs.o \
+			vfio_pci_rdwr.o vfio_pci_config.o
+vfio-mdev-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
+vfio-mdev-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
+
 obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
+obj-$(CONFIG_SAMPLE_VFIO_MDEV_PCI) += vfio-mdev-pci.o
diff --git a/drivers/vfio/pci/vfio_mdev_pci.c b/drivers/vfio/pci/vfio_mdev_pci.c
new file mode 100644
index 0000000..07c8067
--- /dev/null
+++ b/drivers/vfio/pci/vfio_mdev_pci.c
@@ -0,0 +1,403 @@ 
+/*
+ * Copyright © 2019 Intel Corporation.
+ *     Author: Liu, Yi L <yi.l.liu@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio_pci.c:
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/file.h>
+#include <linux/interrupt.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/notifier.h>
+#include <linux/pci.h>
+#include <linux/pm_runtime.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/vgaarb.h>
+#include <linux/nospec.h>
+#include <linux/mdev.h>
+
+#include "vfio_pci_private.h"
+
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "Liu, Yi L <yi.l.liu@intel.com>"
+#define DRIVER_DESC     "VFIO Mdev PCI - Sample driver for PCI device as a mdev"
+
+#define VFIO_MDEV_PCI_NAME  "vfio-mdev-pci"
+
+static char ids[1024] __initdata;
+module_param_string(ids, ids, sizeof(ids), 0);
+MODULE_PARM_DESC(ids, "Initial PCI IDs to add to the vfio-mdev-pci driver, format is \"vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]\" and multiple comma separated entries can be specified");
+
+static bool nointxmask;
+module_param_named(nointxmask, nointxmask, bool, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(nointxmask,
+		  "Disable support for PCI 2.3 style INTx masking.  If this resolves problems for specific devices, report lspci -vvvxxx to linux-pci@vger.kernel.org so the device can be fixed automatically via the broken_intx_masking flag.");
+
+#ifdef CONFIG_VFIO_PCI_VGA
+static bool disable_vga;
+module_param(disable_vga, bool, S_IRUGO);
+MODULE_PARM_DESC(disable_vga, "Disable VGA resource access through vfio-mdev-pci");
+#endif
+
+static bool disable_idle_d3;
+module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(disable_idle_d3,
+		 "Disable using the PCI D3 low power state for idle, unused devices");
+
+static struct pci_driver vfio_mdev_pci_driver;
+
+static ssize_t
+name_show(struct kobject *kobj, struct device *dev, char *buf)
+{
+	return sprintf(buf, "%s-type1\n", dev_name(dev));
+}
+
+MDEV_TYPE_ATTR_RO(name);
+
+static ssize_t
+available_instances_show(struct kobject *kobj, struct device *dev, char *buf)
+{
+	return sprintf(buf, "%d\n", 1);
+}
+
+MDEV_TYPE_ATTR_RO(available_instances);
+
+static ssize_t device_api_show(struct kobject *kobj, struct device *dev,
+		char *buf)
+{
+	return sprintf(buf, "%s\n", VFIO_DEVICE_API_PCI_STRING);
+}
+
+MDEV_TYPE_ATTR_RO(device_api);
+
+static struct attribute *vfio_mdev_pci_types_attrs[] = {
+	&mdev_type_attr_name.attr,
+	&mdev_type_attr_device_api.attr,
+	&mdev_type_attr_available_instances.attr,
+	NULL,
+};
+
+static struct attribute_group vfio_mdev_pci_type_group1 = {
+	.name  = "type1",
+	.attrs = vfio_mdev_pci_types_attrs,
+};
+
+struct attribute_group *vfio_mdev_pci_type_groups[] = {
+	&vfio_mdev_pci_type_group1,
+	NULL,
+};
+
+struct vfio_mdev_pci {
+	struct vfio_pci_device *vdev;
+	struct mdev_device *mdev;
+	unsigned long handle;
+};
+
+static int vfio_mdev_pci_create(struct kobject *kobj, struct mdev_device *mdev)
+{
+	struct device *pdev;
+	struct vfio_pci_device *vdev;
+	struct vfio_mdev_pci *pmdev;
+	int ret;
+
+	pdev = mdev_parent_dev(mdev);
+	vdev = dev_get_drvdata(pdev);
+	pmdev = kzalloc(sizeof(struct vfio_mdev_pci), GFP_KERNEL);
+	if (pmdev == NULL) {
+		ret = -EBUSY;
+		goto out;
+	}
+
+	pmdev->mdev = mdev;
+	pmdev->vdev = vdev;
+	mdev_set_drvdata(mdev, pmdev);
+	ret = mdev_set_iommu_device(mdev_dev(mdev), pdev);
+	if (ret) {
+		pr_info("%s, failed to config iommu isolation for mdev: %s on pf: %s\n",
+			__func__, dev_name(mdev_dev(mdev)), dev_name(pdev));
+		goto out;
+	}
+
+out:
+	return ret;
+}
+
+static int vfio_mdev_pci_remove(struct mdev_device *mdev)
+{
+	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
+
+	kfree(pmdev);
+	pr_info("%s, succeeded for mdev: %s\n", __func__,
+		     dev_name(mdev_dev(mdev)));
+
+	return 0;
+}
+
+static int vfio_mdev_pci_open(struct mdev_device *mdev)
+{
+	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
+	struct vfio_pci_device *vdev = pmdev->vdev;
+	int ret = 0;
+
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
+	mutex_lock(&vdev->reflck->lock);
+
+	if (!vdev->refcnt) {
+		ret = vfio_pci_enable(vdev);
+		if (ret)
+			goto error;
+
+		vfio_spapr_pci_eeh_open(vdev->pdev);
+	}
+	vdev->refcnt++;
+error:
+	mutex_unlock(&vdev->reflck->lock);
+	if (!ret)
+		pr_info("Succeeded to open mdev: %s on pf: %s\n",
+		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
+	else {
+		pr_info("Failed to open mdev: %s on pf: %s\n",
+		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
+		module_put(THIS_MODULE);
+	}
+	return ret;
+}
+
+static void vfio_mdev_pci_release(struct mdev_device *mdev)
+{
+	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
+	struct vfio_pci_device *vdev = pmdev->vdev;
+
+	pr_info("Release mdev: %s on pf: %s\n",
+		dev_name(mdev_dev(mdev)), dev_name(&pmdev->vdev->pdev->dev));
+
+	mutex_lock(&vdev->reflck->lock);
+
+	if (!(--vdev->refcnt)) {
+		vfio_spapr_pci_eeh_release(vdev->pdev);
+		vfio_pci_disable(vdev);
+	}
+
+	mutex_unlock(&vdev->reflck->lock);
+
+	module_put(THIS_MODULE);
+}
+
+static long vfio_mdev_pci_ioctl(struct mdev_device *mdev, unsigned int cmd,
+			     unsigned long arg)
+{
+	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
+
+	return vfio_pci_ioctl(pmdev->vdev, cmd, arg);
+}
+
+static int vfio_mdev_pci_mmap(struct mdev_device *mdev,
+				struct vm_area_struct *vma)
+{
+	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
+
+	return vfio_pci_mmap(pmdev->vdev, vma);
+}
+
+static ssize_t vfio_mdev_pci_read(struct mdev_device *mdev, char __user *buf,
+			size_t count, loff_t *ppos)
+{
+	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
+
+	return vfio_pci_read(pmdev->vdev, buf, count, ppos);
+}
+
+static ssize_t vfio_mdev_pci_write(struct mdev_device *mdev,
+				const char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
+
+	return vfio_pci_write(pmdev->vdev, (char __user *)buf, count, ppos);
+}
+
+static const struct mdev_parent_ops vfio_mdev_pci_ops = {
+	.supported_type_groups	= vfio_mdev_pci_type_groups,
+	.create			= vfio_mdev_pci_create,
+	.remove			= vfio_mdev_pci_remove,
+
+	.open			= vfio_mdev_pci_open,
+	.release		= vfio_mdev_pci_release,
+
+	.read			= vfio_mdev_pci_read,
+	.write			= vfio_mdev_pci_write,
+	.mmap			= vfio_mdev_pci_mmap,
+	.ioctl			= vfio_mdev_pci_ioctl,
+};
+
+static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev,
+				       const struct pci_device_id *id)
+{
+	struct vfio_pci_device *vdev;
+	int ret;
+
+	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
+		return -EINVAL;
+
+	/*
+	 * Prevent binding to PFs with VFs enabled, this too easily allows
+	 * userspace instance with VFs and PFs from the same device, which
+	 * cannot work.  Disabling SR-IOV here would initiate removing the
+	 * VFs, which would unbind the driver, which is prone to blocking
+	 * if that VF is also in use by vfio-pci or vfio-mdev-pci. Just
+	 * reject these PFs and let the user sort it out.
+	 */
+	if (pci_num_vf(pdev)) {
+		pci_warn(pdev, "Cannot bind to PF with SR-IOV enabled\n");
+		return -EBUSY;
+	}
+
+	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
+	if (!vdev)
+		return -ENOMEM;
+
+	vdev->pdev = pdev;
+	vdev->irq_type = VFIO_PCI_NUM_IRQS;
+	mutex_init(&vdev->igate);
+	spin_lock_init(&vdev->irqlock);
+	mutex_init(&vdev->ioeventfds_lock);
+	INIT_LIST_HEAD(&vdev->ioeventfds_list);
+	vdev->nointxmask = nointxmask;
+#ifdef CONFIG_VFIO_PCI_VGA
+	vdev->disable_vga = disable_vga;
+#endif
+	vdev->disable_idle_d3 = disable_idle_d3;
+
+	pci_set_drvdata(pdev, vdev);
+
+	ret = vfio_pci_reflck_attach(vdev);
+	if (ret) {
+		pci_set_drvdata(pdev, NULL);
+		kfree(vdev);
+		return ret;
+	}
+
+	if (vfio_pci_is_vga(pdev)) {
+		vga_client_register(pdev, vdev, NULL, vfio_pci_set_vga_decode);
+		vga_set_legacy_decoding(pdev,
+					vfio_pci_set_vga_decode(vdev, false));
+	}
+
+	vfio_pci_probe_power_state(vdev);
+
+	if (!vdev->disable_idle_d3) {
+		/*
+		 * pci-core sets the device power state to an unknown value at
+		 * bootup and after being removed from a driver.  The only
+		 * transition it allows from this unknown state is to D0, which
+		 * typically happens when a driver calls pci_enable_device().
+		 * We're not ready to enable the device yet, but we do want to
+		 * be able to get to D3.  Therefore first do a D0 transition
+		 * before going to D3.
+		 */
+		vfio_pci_set_power_state(vdev, PCI_D0);
+		vfio_pci_set_power_state(vdev, PCI_D3hot);
+	}
+
+	ret = mdev_register_device(&pdev->dev, &vfio_mdev_pci_ops);
+	if (ret)
+		pr_err("Cannot register mdev for device %s\n",
+			dev_name(&pdev->dev));
+	else
+		pr_info("Wrap device %s as a mdev\n", dev_name(&pdev->dev));
+
+	return ret;
+}
+
+static void vfio_mdev_pci_driver_remove(struct pci_dev *pdev)
+{
+	struct vfio_pci_device *vdev;
+
+	vdev = pci_get_drvdata(pdev);
+	if (!vdev)
+		return;
+
+	vfio_pci_reflck_put(vdev->reflck);
+
+	kfree(vdev->region);
+	mutex_destroy(&vdev->ioeventfds_lock);
+
+	if (!disable_idle_d3)
+		vfio_pci_set_power_state(vdev, PCI_D0);
+
+	kfree(vdev->pm_save);
+
+	if (vfio_pci_is_vga(pdev)) {
+		vga_client_register(pdev, NULL, NULL, NULL);
+		vga_set_legacy_decoding(pdev,
+				VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM |
+				VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM);
+	}
+
+	kfree(vdev);
+}
+
+static struct pci_driver vfio_mdev_pci_driver = {
+	.name		= VFIO_MDEV_PCI_NAME,
+	.id_table	= NULL, /* only dynamic ids */
+	.probe		= vfio_mdev_pci_driver_probe,
+	.remove		= vfio_mdev_pci_driver_remove,
+	.err_handler	= &vfio_err_handlers,
+};
+
+static void __exit vfio_mdev_pci_cleanup(void)
+{
+	pci_unregister_driver(&vfio_mdev_pci_driver);
+	vfio_pci_uninit_perm_bits();
+}
+
+static int __init vfio_mdev_pci_init(void)
+{
+	int ret;
+
+	/* Allocate shared config space permision data used by all devices */
+	ret = vfio_pci_init_perm_bits();
+	if (ret)
+		return ret;
+
+	/* Register and scan for devices */
+	ret = pci_register_driver(&vfio_mdev_pci_driver);
+	if (ret)
+		goto out_driver;
+
+	vfio_pci_fill_ids(ids, &vfio_mdev_pci_driver);
+
+	return 0;
+out_driver:
+	vfio_pci_uninit_perm_bits();
+	return ret;
+}
+
+module_init(vfio_mdev_pci_init);
+module_exit(vfio_mdev_pci_cleanup);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/samples/Kconfig b/samples/Kconfig
index d63cc8a..d799ccd 100644
--- a/samples/Kconfig
+++ b/samples/Kconfig
@@ -161,4 +161,15 @@  config SAMPLE_VFS
 	  as mount API and statx().  Note that this is restricted to the x86
 	  arch whilst it accesses system calls that aren't yet in all arches.
 
+config SAMPLE_VFIO_MDEV_PCI
+	tristate "Sample driver for wrapping PCI device as a mdev"
+	depends on PCI && EVENTFD && VFIO_MDEV_DEVICE
+	select VFIO_VIRQFD
+	select IRQ_BYPASS_MANAGER
+	help
+	  Sample driver for wrapping a PCI device as a mdev. Once bound to
+	  this driver, device passthru should through mdev path.
+
+	  If you don't know what to do here, say N.
+
 endif # SAMPLES