Message ID | 1532239773-15325-4-git-send-email-baolu.lu@linux.intel.com (mailing list archive)
---|---
State | New, archived
Series | vfio/mdev: IOMMU aware mediated device
> From: Lu Baolu [mailto:baolu.lu@linux.intel.com]
> Sent: Sunday, July 22, 2018 2:09 PM
>
> With the Intel IOMMU supporting PASID granularity isolation and protection, a
> mediated device could be isolated and protected by an IOMMU unit. We need to
> allocate a new group instead of a PCI group.

The existing vfio mdev framework also allocates an iommu group for a
mediated device:

mdev_probe()
  |_ mdev_attach_iommu()
      |_ iommu_group_alloc()

IMHO, this patch actually adds a wrapper around the group allocation
API. The reason is that it is not necessary to apply the group
allocation rules when the allocation is for a mediated device; instead,
we just allocate a new group for it. :)

Thanks,
Yi Liu

> Cc: Ashok Raj <ashok.raj@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Liu Yi L <yi.l.liu@intel.com>
> Signed-off-by: Sanjay Kumar <sanjay.k.kumar@intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> ---
>  drivers/iommu/intel-iommu.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 3ede34a..57ccfc4 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -5276,6 +5276,17 @@ static void intel_iommu_put_resv_regions(struct device *dev,
>  	}
>  }
>
> +static struct iommu_group *intel_iommu_device_group(struct device *dev)
> +{
> +	if (dev_is_pci(dev))
> +		return pci_device_group(dev);
> +
> +	if (dev_is_mdev(dev))
> +		return iommu_group_alloc();
> +
> +	return ERR_PTR(-EINVAL);
> +}
> +
>  #ifdef CONFIG_INTEL_IOMMU_SVM
>  int intel_iommu_enable_pasid(struct intel_iommu *iommu, struct intel_svm_dev *sdev)
>  {
> @@ -5370,7 +5381,7 @@ const struct iommu_ops intel_iommu_ops = {
>  	.remove_device		= intel_iommu_remove_device,
>  	.get_resv_regions	= intel_iommu_get_resv_regions,
>  	.put_resv_regions	= intel_iommu_put_resv_regions,
> -	.device_group		= pci_device_group,
> +	.device_group		= intel_iommu_device_group,
>  	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
>  };
>
> --
> 2.7.4
Hi,

On 07/23/2018 12:44 PM, Liu, Yi L wrote:
>> From: Lu Baolu [mailto:baolu.lu@linux.intel.com]
>> Sent: Sunday, July 22, 2018 2:09 PM
>>
>> With the Intel IOMMU supporting PASID granularity isolation and protection, a
>> mediated device could be isolated and protected by an IOMMU unit. We need to
>> allocate a new group instead of a PCI group.
> The existing vfio mdev framework also allocates an iommu group for a
> mediated device:
>
> mdev_probe()
>   |_ mdev_attach_iommu()
>       |_ iommu_group_alloc()

When an external component asks the iommu core to allocate a group for
a device, the core calls pci_device_group through the Intel IOMMU
driver's @device_group callback. In other words, the current Intel
IOMMU driver is not aware of mediated devices and treats all devices as
PCI ones. This patch extends the @device_group callback to make it
aware of mediated devices.

Best regards,
Lu Baolu

> IMHO, this patch actually adds a wrapper around the group allocation
> API. The reason is that it is not necessary to apply the group
> allocation rules when the allocation is for a mediated device; instead,
> we just allocate a new group for it. :)
>
> Thanks,
> Yi Liu
>
>> [...]
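[For context, the @device_group callback that this patch overrides is invoked from the iommu core's group lookup path, roughly paraphrased below. This is a simplification of drivers/iommu/iommu.c of that era; error handling and default-domain setup are omitted.]

/* Simplified paraphrase of iommu_group_get_for_dev(); the real
 * function also allocates a default domain and handles errors. */
struct iommu_group *iommu_group_get_for_dev(struct device *dev)
{
	const struct iommu_ops *ops = dev->bus->iommu_ops;
	struct iommu_group *group;

	group = iommu_group_get(dev);
	if (group)
		return group;

	/* This is where pci_device_group() -- or, with this patch,
	 * intel_iommu_device_group() -- gets called. */
	group = ops->device_group(dev);

	iommu_group_add_device(group, dev);
	return group;
}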
Hi Baolu,

On 24/07/18 03:22, Lu Baolu wrote:
> Hi,
>
> On 07/23/2018 12:44 PM, Liu, Yi L wrote:
>> The existing vfio mdev framework also allocates an iommu group for a
>> mediated device:
>>
>> mdev_probe()
>>   |_ mdev_attach_iommu()
>>       |_ iommu_group_alloc()
>
> When an external component asks the iommu core to allocate a group for
> a device, the core calls pci_device_group through the Intel IOMMU
> driver's @device_group callback. In other words, the current Intel
> IOMMU driver is not aware of mediated devices and treats all devices as
> PCI ones. This patch extends the @device_group callback to make it
> aware of mediated devices.

I agree that allocating two groups for an mdev seems strange, and in my
opinion we shouldn't export the notion of mdev to IOMMU drivers.
Otherwise each driver will have to add its own "dev_is_mdev()" special
cases, which will get messy in the long run. Besides, the macro is
currently private, and to be exported it should be wrapped in
symbol_get/put(mdev_bus_type).

There is another way: as we're planning to add a generic pasid_alloc()
function to the IOMMU API, the mdev module itself could allocate a
default PASID for each mdev by calling pasid_alloc() on the mdev's
parent, and then do map()/unmap() with that PASID. This way we don't
have to add IOMMU ops to the mdev bus, everything can still be done
using the ops of the parent. And IOMMU drivers "only" have to implement
PASID ops, which will be reused by drivers other than mdev.

The allocated PASID also needs to be installed into the parent device.
If the mdev module knows the PASID, we can do that by adding
set_pasid(mdev, pasid) and clear_pasid(mdev, pasid) operations to
mdev_parent_ops.

Thanks,
Jean
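[A rough sketch of the flow proposed in the mail above, from the mdev core's side. pasid_alloc() and the set_pasid() parent op are only proposals from this email, so the prototypes and signatures below are assumptions.]

#include <linux/device.h>
#include <linux/mdev.h>

/* Proposed additions from this email -- not a real kernel API */
int pasid_alloc(struct device *parent);
int set_pasid(struct mdev_device *mdev, int pasid);

/* Sketch: the mdev core gives each new mdev a default PASID taken
 * from its parent's IOMMU, instead of adding iommu ops to mdev bus. */
static int mdev_setup_default_pasid(struct mdev_device *mdev)
{
	int pasid = pasid_alloc(mdev_parent_dev(mdev));

	if (pasid < 0)
		return pasid;

	/* The parent driver installs the PASID into its hardware */
	return set_pasid(mdev, pasid);
}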
On Tue, 24 Jul 2018 12:30:35 +0100
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:

> On 24/07/18 03:22, Lu Baolu wrote:
> [...]
>
> I agree that allocating two groups for an mdev seems strange, and in my
> opinion we shouldn't export the notion of mdev to IOMMU drivers.

Yep, I was just thinking the same thing. This is too highly integrated
into VT-d and too narrowly focused on PASID being the only way that an
mdev could make use of the IOMMU. Thanks,

Alex
Hi Jean,

Pulling Kevin in. Not sure why he was removed from the cc list.

On 07/24/2018 07:30 PM, Jean-Philippe Brucker wrote:
> Hi Baolu,
>
> On 24/07/18 03:22, Lu Baolu wrote:
>> [...]
>
> I agree that allocating two groups for an mdev seems strange, and in my
> opinion we shouldn't export the notion of mdev to IOMMU drivers.
> Otherwise each driver will have to add its own "dev_is_mdev()" special
> cases, which will get messy in the long run. Besides, the macro is
> currently private, and to be exported it should be wrapped in
> symbol_get/put(mdev_bus_type).

I agree with you. It would be better if we could do this in a
driver-agnostic way.

> There is another way: as we're planning to add a generic pasid_alloc()
> function to the IOMMU API, the mdev module itself could allocate a
> default PASID for each mdev by calling pasid_alloc() on the mdev's
> parent, and then do map()/unmap() with that PASID. This way we don't
> have to add IOMMU ops to the mdev bus, everything can still be done
> using the ops of the parent. And IOMMU drivers "only" have to implement
> PASID ops, which will be reused by drivers other than mdev.

A quick question that comes to mind: how do we allocate a domain for a
mediated device that uses the default PASID for its DMA transactions?

Best regards,
Lu Baolu

> The allocated PASID also needs to be installed into the parent device.
> If the mdev module knows the PASID, we can do that by adding
> set_pasid(mdev, pasid) and clear_pasid(mdev, pasid) operations to
> mdev_parent_ops.
>
> Thanks,
> Jean
> From: Jean-Philippe Brucker
> Sent: Tuesday, July 24, 2018 7:31 PM
>
> Hi Baolu,
>
> On 24/07/18 03:22, Lu Baolu wrote:
> [...]
>
> I agree that allocating two groups for an mdev seems strange, and in my
> opinion we shouldn't export the notion of mdev to IOMMU drivers.
> Otherwise each driver will have to add its own "dev_is_mdev()" special
> cases, which will get messy in the long run. Besides, the macro is
> currently private, and to be exported it should be wrapped in
> symbol_get/put(mdev_bus_type).

There is only one group per mdev, but I agree that avoiding mdev
awareness in the IOMMU driver is definitely good... We didn't find a
good way to do that before, which is why the current RFC was
implemented this way.

> There is another way: as we're planning to add a generic pasid_alloc()
> function to the IOMMU API, the mdev module itself could allocate a
> default PASID for each mdev by calling pasid_alloc() on the mdev's
> parent, and then do map()/unmap() with that PASID. This way we don't
> have to add IOMMU ops to the mdev bus, everything can still be done
> using the ops of the parent. And IOMMU drivers "only" have to implement
> PASID ops, which will be reused by drivers other than mdev.

Yes, PASID is a more general abstraction than mdev. However, there is
one problem. Please note that with the new scalable mode, PASID is not
used just for SVA: you can do 1st-level, 2nd-level and nested
translation at PASID granularity, meaning each PASID can be bound to a
unique iommu domain. However, today the IOMMU driver allows only one
iommu_domain per device. If we simply install the allocated PASID into
the parent device, we should also remove that iommu_domain limitation.
Is that the way we want to pursue?

mdev avoids this problem as it introduces a separate device...

> The allocated PASID also needs to be installed into the parent device.
> If the mdev module knows the PASID, we can do that by adding
> set_pasid(mdev, pasid) and clear_pasid(mdev, pasid) operations to
> mdev_parent_ops.

Thanks
Kevin
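[To illustrate the limitation Kevin refers to: with the iommu core of this era, a device's group has a single attached domain at a time, so a second attach is refused rather than stacked. A minimal sketch; dev, domain_a and domain_b are placeholders.]

#include <linux/iommu.h>

/* Demonstrates the one-domain-per-device limitation: the second
 * attach is refused instead of being added alongside the first. */
static int demo_one_domain_per_device(struct device *dev,
				      struct iommu_domain *domain_a,
				      struct iommu_domain *domain_b)
{
	int ret = iommu_attach_device(domain_a, dev);	/* 0 on success */

	if (ret)
		return ret;

	/* The group already left its default domain: returns -EBUSY */
	return iommu_attach_device(domain_b, dev);
}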
Hi Jean,

> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
> Sent: Tuesday, July 24, 2018 7:31 PM
>
> [...]
>
> I agree that allocating two groups for an mdev seems strange, and in my

There will not be two groups for an mdev. Please refer to patch 08/10 of
this series, where Baolu added an iommu_ops check before doing group
allocation in mdev_attach_iommu():

[RFC PATCH 08/10] vfio/mdev: Set iommu ops for mdev bus

@@ -21,6 +21,13 @@ static int mdev_attach_iommu(struct mdev_device *mdev)
 	int ret;
 	struct iommu_group *group;
 
+	/*
+	 * If iommu_ops is set for bus, add_device() will allocate
+	 * a group and add the device in the group.
+	 */
+	if (iommu_present(mdev->dev.bus))
+		return 0;
+

> opinion we shouldn't export the notion of mdev to IOMMU drivers.

The key idea of this RFC is to tag an iommu domain with a PASID: if any
mdev is added to such a domain, it gets the PASID, which is then
configured in its parent. Thus the transactions from the mdev can be
isolated by the iommu hardware. Based on this idea, mdevs can be
managed in a flexible manner. For example, if two mdevs are assigned to
the same VM, they can share a PASID; this sharing is easily achieved by
adding them to the same domain. If we allocated a default PASID for
each mdev, it could be a waste. With the vendor-specific iommu driver
handling the mdev difference, we can largely keep the fundamental iommu
concepts of the current software implementation.

> Otherwise each driver will have to add its own "dev_is_mdev()" special
> cases, which will get messy in the long run. Besides, the macro is
> currently private, and to be exported it should be wrapped in
> symbol_get/put(mdev_bus_type).

Agreed. We should figure out a better manner.

> There is another way: as we're planning to add a generic pasid_alloc()
> function to the IOMMU API, the mdev module itself could allocate a
> default PASID for each mdev by calling pasid_alloc() on the mdev's
> parent, and then do map()/unmap() with that PASID. This way we don't

So far, map/unmap is a per-domain operation. Passing a PASID here would
make it a kind of per-device operation, which may affect too much of
the existing software implementation.

> have to add IOMMU ops to the mdev bus, everything can still be done
> using the ops of the parent. And IOMMU drivers "only" have to implement
> PASID ops, which will be reused by drivers other than mdev.
>
> The allocated PASID also needs to be installed into the parent device.
> If the mdev module knows the PASID, we can do that by adding
> set_pasid(mdev, pasid) and clear_pasid(mdev, pasid) operations to
> mdev_parent_ops.

Your idea is fascinating. Please feel free to let us know if we missed
anything from you. :)

> Thanks,
> Jean

Thanks,
Yi Liu
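[A minimal sketch of the sharing model Yi Liu describes, assuming -- as this RFC does -- that the mdev bus has iommu ops so the regular domain API applies to mdevs. The PASID tagging itself would stay hidden inside the Intel IOMMU driver.]

#include <linux/iommu.h>
#include <linux/mdev.h>

/*
 * Two mdevs assigned to the same VM join one PASID-tagged domain.
 * Sketch only: assumes the RFC's mdev-bus iommu ops are in place.
 */
static int share_pasid_between_mdevs(struct mdev_device *mdev1,
				     struct mdev_device *mdev2)
{
	struct iommu_domain *domain;
	int ret;

	domain = iommu_domain_alloc(mdev_dev(mdev1)->bus);
	if (!domain)
		return -ENOMEM;

	/* Both attaches land in the same domain, so the IOMMU driver
	 * can back them with a single PASID on the parent device. */
	ret = iommu_attach_device(domain, mdev_dev(mdev1));
	if (!ret)
		ret = iommu_attach_device(domain, mdev_dev(mdev2));
	return ret;
}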
> From: Tian, Kevin
> Sent: Wednesday, July 25, 2018 10:16 AM
>
[...]
>>
>> There is another way: as we're planning to add a generic pasid_alloc()
>> function to the IOMMU API, the mdev module itself could allocate a
>> default PASID for each mdev by calling pasid_alloc() on the mdev's
>> parent, and then do map()/unmap() with that PASID. [...]
>
> Yes, PASID is a more general abstraction than mdev. However, there is
> one problem. Please note that with the new scalable mode, PASID is not
> used just for SVA: you can do 1st-level, 2nd-level and nested
> translation at PASID granularity, meaning each PASID can be bound to a
> unique iommu domain. However, today the IOMMU driver allows only one
> iommu_domain per device. If we simply install the allocated PASID into
> the parent device, we should also remove that iommu_domain limitation.
> Is that the way we want to pursue?
>
> mdev avoids this problem as it introduces a separate device...

Another thing to think about is simplifying the caller logic. It would
be good to have consistent IOMMU APIs across PCI devices and mdevs for
common VFIO operations, e.g. map/unmap, iommu_attach_group, etc. If we
need special PASID-aware versions of those ops, it brings a burden to
VFIO and other IOMMU clients...

For an IOMMU-capable mdev, there are two use cases which have been
discussed so far:

1) mdev with PASID-granular DMA isolation
2) mdev inheriting the RID from its parent device (no sharing)

1) is what this RFC is currently for. 2) is what Alex was concerned
about, which is not covered by the RFC yet.

In my mind, an ideal design is like below:

a) VFIO provides an interface for the parent driver to specify whether
   an mdev is IOMMU capable, with a flag to indicate whether it's
   PASID-granular or RID-inherit;

b) Once an IOMMU-capable mdev is created, VFIO treats it no differently
   from normal physical devices, using exactly the same IOMMU APIs to
   operate on it (including future SVA). VFIO doesn't care whether PASID
   or RID will be used in hardware;

c) The IOMMU driver then handles the RID/PASID difference internally,
   based on its own configuration and the mdev type.

To support the above goal, I feel a new device handle is a nice fit to
represent an mdev within the IOMMU driver, with the same set of
capabilities as observed on its parent device.

Jean, maybe I didn't fully understand your proposal. Can you elaborate
on whether the above goal can be achieved with it?

Thanks
Kevin
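[A sketch of what point a) above could look like as an interface. Everything here is hypothetical: the flag names and the mdev_set_iommu_mode() helper are invented for illustration; no such interface existed at the time.]

#include <linux/mdev.h>

/* Hypothetical capability flags, matching use cases 1) and 2) above */
#define MDEV_IOMMU_PASID_GRANULAR	(1 << 0)
#define MDEV_IOMMU_RID_INHERIT		(1 << 1)

/* Hypothetical helper the parent driver would call at mdev creation */
int mdev_set_iommu_mode(struct mdev_device *mdev, unsigned int flags);

/* A parent driver creating a PASID-isolated mdev might then do: */
static int my_create(struct kobject *kobj, struct mdev_device *mdev)
{
	/* ... allocate vendor-specific state for the mdev ... */
	return mdev_set_iommu_mode(mdev, MDEV_IOMMU_PASID_GRANULAR);
}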
On 25/07/18 07:19, Tian, Kevin wrote:
>> From: Tian, Kevin
>> Sent: Wednesday, July 25, 2018 10:16 AM
>>
> [...]
>>>
>>> There is another way: as we're planning to add a generic pasid_alloc()
>>> function to the IOMMU API, [...]
>>
>> Yes, PASID is a more general abstraction than mdev. However, there is
>> one problem. Please note that with the new scalable mode, PASID is not
>> used just for SVA: you can do 1st-level, 2nd-level and nested
>> translation at PASID granularity, meaning each PASID can be bound to a
>> unique iommu domain. However, today the IOMMU driver allows only one
>> iommu_domain per device. If we simply install the allocated PASID into
>> the parent device, we should also remove that iommu_domain limitation.
>> Is that the way we want to pursue?
>>
>> mdev avoids this problem as it introduces a separate device...
>
> Another thing to think about is simplifying the caller logic. It would
> be good to have consistent IOMMU APIs across PCI devices and mdevs for
> common VFIO operations, e.g. map/unmap, iommu_attach_group, etc. If we
> need special PASID-aware versions of those ops, it brings a burden to
> VFIO and other IOMMU clients...

Yes, providing special ops is the simplest to implement (or least
invasive, I guess) but isn't particularly elegant. See the patch from
Jordan below, which I was planning to add to the SVA series (I'm
laboriously working towards the next version), and my email from last
week: https://patchwork.kernel.org/patch/10412251/

Whenever I come back to hierarchical IOMMU domains I reject it as too
complicated, but maybe that is what we need. I find it difficult to
reason about because domains currently represent both a collection of
devices and one or more address spaces. I proposed the io_mm thing to
represent a single address space, and to avoid adding special cases to
every function in the IOMMU subsystem that manipulates domains.

> For an IOMMU-capable mdev, there are two use cases which have been
> discussed so far:
>
> 1) mdev with PASID-granular DMA isolation
> 2) mdev inheriting the RID from its parent device (no sharing)

Would that be a single mdev per parent device? Otherwise I don't really
understand how it would work without all map/unmap operations going to
the parent device's driver.

> 1) is what this RFC is currently for. 2) is what Alex was concerned
> about, which is not covered by the RFC yet.
>
> In my mind, an ideal design is like below:
>
> a) VFIO provides an interface for the parent driver to specify whether
>    an mdev is IOMMU capable, with a flag to indicate whether it's
>    PASID-granular or RID-inherit;
>
> b) Once an IOMMU-capable mdev is created, VFIO treats it no differently
>    from normal physical devices, using exactly the same IOMMU APIs to
>    operate on it (including future SVA). VFIO doesn't care whether PASID
>    or RID will be used in hardware;
>
> c) The IOMMU driver then handles the RID/PASID difference internally,
>    based on its own configuration and the mdev type.
>
> To support the above goal, I feel a new device handle is a nice fit to
> represent an mdev within the IOMMU driver, with the same set of
> capabilities as observed on its parent device.

It's a good abstraction, but I'm still concerned about other users of
PASID-granular DMA isolation: for example GPU drivers wanting to
improve isolation of DMA bufs will want the same functionality without
going through the vfio-mdev module.

The IOMMU operations we care about don't take a device handle, I think,
just a domain. And VFIO itself only deals with domains when doing
map/unmap. Maybe we could add this operation to the IOMMU subsystem:

	child_domain = domain_create_child(parent_dev, parent_domain)

A child domain would be a smaller isolation granule, getting a PASID if
that's what the IOMMU implements, or another mechanism for 2). It is
automatically attached to its parent's devices, so attach/detach
operations wouldn't be necessary on the child.

Depending on the underlying architecture, the child domain can support
map/unmap on a single stage, or map/unmap for the 2nd level and
bind_pgtable for the 1st level.

I'm not sure how this works for host SVA though. I think the
sva_bind_dev() API could stay the same, but the internals will need to
change.

Thanks,
Jean

-----

I was going to send the following patch for dealing with private
PASIDs. I used it for a vfio-mdev prototype internally: VFIO would call
iommu_sva_alloc_pasid on the parent device and then sva_map/sva_unmap
on the parent domain + allocated io_mm (which redirects to
iommu_map/iommu_unmap when there is no io_mm). Since it relies on io_mm
instead of child domains, it duplicates the iommu ops, and as discussed
might not be suited to VT-d scalable mode.

Subject: [PATCH] iommu: sva: Add support for private PASIDs

Provide an API for allocating PASIDs and populating them manually. To
ease cleanup and factor allocation code, reuse the io_mm structure for
private PASIDs. A private io_mm has a NULL mm_struct pointer, and
cannot be bound to multiple devices. The mm_alloc() IOMMU op must now
check if the mm argument is NULL, in which case it should allocate
io_pgtables instead of binding to an mm.

Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 drivers/iommu/iommu-sva.c | 211 +++++++++++++++++++++++++++++++++-----
 drivers/iommu/iommu.c     |  49 ++++++---
 include/linux/iommu.h     | 118 +++++++++++++++++++--
 3 files changed, 331 insertions(+), 47 deletions(-)

diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
index 35d0f387fbef..12c9b861014d 100644
--- a/drivers/iommu/iommu-sva.c
+++ b/drivers/iommu/iommu-sva.c
@@ -15,11 +15,11 @@
 /**
  * DOC: io_mm model
  *
- * The io_mm keeps track of process address spaces shared between CPU and IOMMU.
- * The following example illustrates the relation between structures
- * iommu_domain, io_mm and iommu_bond. An iommu_bond is a link between io_mm and
- * device. A device can have multiple io_mm and an io_mm may be bound to
- * multiple devices.
+ * When used with the bind()/unbind() functions, the io_mm keeps track of
+ * process address spaces shared between CPU and IOMMU. The following example
+ * illustrates the relation between structures iommu_domain, io_mm and
+ * iommu_bond. An iommu_bond is a link between io_mm and device. A device can
+ * have multiple io_mm and an io_mm may be bound to multiple devices.
  *              ___________________________
  *             |  IOMMU domain A           |
  *             |  ________________         |
@@ -97,6 +97,12 @@
  * non-PASID translations. In this case PASID 0 is reserved and entry 0 points
  * to the io_pgtable base. On Intel IOMMU, the io_pgtable base would be held in
  * the device table and PASID 0 would be available to the allocator.
+ *
+ * The io_mm can also represent a private IOMMU address space, which isn't
+ * shared with a process. The device driver calls iommu_sva_alloc_pasid which
+ * returns an io_mm that can be populated with the iommu_sva_map/unmap
+ * functions. The principle is the same as shared io_mm, except that a private
+ * io_mm cannot be bound to multiple devices.
  */
 
 struct iommu_bond {
@@ -130,6 +136,9 @@ static DEFINE_SPINLOCK(iommu_sva_lock);
 
 static struct mmu_notifier_ops iommu_mmu_notifier;
 
+#define io_mm_is_private(io_mm) ((io_mm) != NULL && (io_mm)->mm == NULL)
+#define io_mm_is_shared(io_mm) ((io_mm) != NULL && (io_mm)->mm != NULL)
+
 static struct io_mm *
 io_mm_alloc(struct iommu_domain *domain, struct device *dev,
 	    struct mm_struct *mm, unsigned long flags)
@@ -148,19 +157,10 @@ io_mm_alloc(struct iommu_domain *domain, struct device *dev,
 	if (!io_mm)
 		return ERR_PTR(-ENOMEM);
 
-	/*
-	 * The mm must not be freed until after the driver frees the io_mm
-	 * (which may involve unpinning the CPU ASID for instance, requiring a
-	 * valid mm struct.)
-	 */
-	mmgrab(mm);
-
 	io_mm->flags		= flags;
 	io_mm->mm		= mm;
-	io_mm->notifier.ops	= &iommu_mmu_notifier;
 	io_mm->release		= domain->ops->mm_free;
 	INIT_LIST_HEAD(&io_mm->devices);
-	/* Leave kref to zero until the io_mm is fully initialized */
 
 	idr_preload(GFP_KERNEL);
 	spin_lock(&iommu_sva_lock);
@@ -175,6 +175,32 @@ io_mm_alloc(struct iommu_domain *domain, struct device *dev,
 		goto err_free_mm;
 	}
 
+	return io_mm;
+
+err_free_mm:
+	io_mm->release(io_mm);
+	return ERR_PTR(ret);
+}
+
+static struct io_mm *
+io_mm_alloc_shared(struct iommu_domain *domain, struct device *dev,
+		   struct mm_struct *mm, unsigned long flags)
+{
+	int ret;
+	struct io_mm *io_mm;
+
+	io_mm = io_mm_alloc(domain, dev, mm, flags);
+	if (IS_ERR(io_mm))
+		return io_mm;
+
+	/*
+	 * The mm must not be freed until after the driver frees the io_mm
+	 * (which may involve unpinning the CPU ASID for instance, requiring a
+	 * valid mm struct.)
+	 */
+	mmgrab(mm);
+
+	io_mm->notifier.ops = &iommu_mmu_notifier;
 	ret = mmu_notifier_register(&io_mm->notifier, mm);
 	if (ret)
 		goto err_free_pasid;
@@ -202,7 +228,6 @@ io_mm_alloc(struct iommu_domain *domain, struct device *dev,
 	idr_remove(&iommu_pasid_idr, io_mm->pasid);
 	spin_unlock(&iommu_sva_lock);
 
-err_free_mm:
 	io_mm->release(io_mm);
 	mmdrop(mm);
 
@@ -231,6 +256,11 @@ static void io_mm_release(struct kref *kref)
 	/* The PASID can now be reallocated for another mm. */
 	idr_remove(&iommu_pasid_idr, io_mm->pasid);
 
+	if (io_mm_is_private(io_mm)) {
+		io_mm->release(io_mm);
+		return;
+	}
+
 	/*
 	 * If we're being released from mm exit, the notifier callback ->release
 	 * has already been called. Otherwise we don't need ->release, the io_mm
@@ -258,7 +288,7 @@ static int io_mm_get_locked(struct io_mm *io_mm)
 	if (io_mm && kref_get_unless_zero(&io_mm->kref)) {
 		/*
 		 * kref_get_unless_zero doesn't provide ordering for reads. This
-		 * barrier pairs with the one in io_mm_alloc.
+		 * barrier pairs with the one in io_mm_alloc_shared.
 		 */
 		smp_rmb();
 		return 1;
@@ -289,7 +319,7 @@ static int io_mm_attach(struct iommu_domain *domain, struct device *dev,
 	struct iommu_sva_param *param = dev->iommu_param->sva_param;
 
 	if (!domain->ops->mm_attach || !domain->ops->mm_detach ||
-	    !domain->ops->mm_invalidate)
+	    (io_mm_is_shared(io_mm) && !domain->ops->mm_invalidate))
 		return -ENODEV;
 
 	if (pasid > param->max_pasid || pasid < param->min_pasid)
@@ -550,7 +580,7 @@ int iommu_sva_device_init(struct device *dev, unsigned long features,
 	struct iommu_sva_param *param;
 	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
 
-	if (!domain || !domain->ops->sva_device_init)
+	if (!domain)
 		return -ENODEV;
 
 	if (features & ~IOMMU_SVA_FEAT_IOPF)
@@ -584,9 +614,11 @@ int iommu_sva_device_init(struct device *dev, unsigned long features,
 	 * IOMMU driver updates the limits depending on the IOMMU and device
 	 * capabilities.
 	 */
-	ret = domain->ops->sva_device_init(dev, param);
-	if (ret)
-		goto err_free_param;
+	if (domain->ops->sva_device_init) {
+		ret = domain->ops->sva_device_init(dev, param);
+		if (ret)
+			goto err_free_param;
+	}
 
 	dev->iommu_param->sva_param = param;
 	mutex_unlock(&dev->iommu_param->lock);
@@ -688,7 +720,7 @@ int __iommu_sva_bind_device(struct device *dev, struct mm_struct *mm,
 	}
 
 	if (!io_mm) {
-		io_mm = io_mm_alloc(domain, dev, mm, flags);
+		io_mm = io_mm_alloc_shared(domain, dev, mm, flags);
 		if (IS_ERR(io_mm))
 			return PTR_ERR(io_mm);
 	}
@@ -723,6 +755,9 @@ int __iommu_sva_unbind_device(struct device *dev, int pasid)
 	/* spin_lock_irq matches the one in wait_event_lock_irq */
 	spin_lock_irq(&iommu_sva_lock);
 	list_for_each_entry(bond, &param->mm_list, dev_head) {
+		if (io_mm_is_private(bond->io_mm))
+			continue;
+
 		if (bond->io_mm->pasid == pasid) {
 			io_mm_detach_locked(bond, true);
 			ret = 0;
@@ -765,6 +800,136 @@ void __iommu_sva_unbind_dev_all(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(__iommu_sva_unbind_dev_all);
 
+/*
+ * iommu_sva_alloc_pasid - Allocate a private PASID
+ *
+ * Allocate a PASID for private map/unmap operations. Create a new I/O address
+ * space for this device, that isn't bound to any process.
+ *
+ * iommu_sva_device_init must have been called first.
+ */
+int iommu_sva_alloc_pasid(struct device *dev, struct io_mm **out)
+{
+	int ret;
+	struct io_mm *io_mm;
+	struct iommu_domain *domain;
+	struct iommu_sva_param *param = dev->iommu_param->sva_param;
+
+	if (!out || !param)
+		return -EINVAL;
+
+	domain = iommu_get_domain_for_dev(dev);
+	if (!domain)
+		return -EINVAL;
+
+	io_mm = io_mm_alloc(domain, dev, NULL, 0);
+	if (IS_ERR(io_mm))
+		return PTR_ERR(io_mm);
+
+	kref_init(&io_mm->kref);
+
+	ret = io_mm_attach(domain, dev, io_mm, NULL);
+	if (ret) {
+		io_mm_put(io_mm);
+		return ret;
+	}
+
+	*out = io_mm;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(iommu_sva_alloc_pasid);
+
+void iommu_sva_free_pasid(struct device *dev, struct io_mm *io_mm)
+{
+	struct iommu_bond *bond;
+
+	if (WARN_ON(io_mm_is_shared(io_mm)))
+		return;
+
+	spin_lock(&iommu_sva_lock);
+	list_for_each_entry(bond, &io_mm->devices, mm_head) {
+		if (bond->dev == dev) {
+			io_mm_detach_locked(bond, false);
+			break;
+		}
+	}
+	spin_unlock(&iommu_sva_lock);
+}
+EXPORT_SYMBOL_GPL(iommu_sva_free_pasid);
+
+int iommu_sva_map(struct iommu_domain *domain, struct io_mm *io_mm,
+		  unsigned long iova, phys_addr_t paddr, size_t size, int prot)
+{
+	if (WARN_ON(io_mm_is_shared(io_mm)))
+		return -ENODEV;
+
+	return __iommu_map(domain, io_mm, iova, paddr, size, prot);
+}
+EXPORT_SYMBOL_GPL(iommu_sva_map);
+
+size_t iommu_sva_map_sg(struct iommu_domain *domain, struct io_mm *io_mm,
+			unsigned long iova, struct scatterlist *sg,
+			unsigned int nents, int prot)
+{
+	if (WARN_ON(io_mm_is_shared(io_mm)))
+		return -ENODEV;
+
+	return domain->ops->map_sg(domain, io_mm, iova, sg, nents, prot);
+}
+EXPORT_SYMBOL_GPL(iommu_sva_map_sg);
+
+size_t iommu_sva_unmap(struct iommu_domain *domain, struct io_mm *io_mm,
+		       unsigned long iova, size_t size)
+{
+	if (WARN_ON(io_mm_is_shared(io_mm)))
+		return 0;
+
+	return __iommu_unmap(domain, io_mm, iova, size, true);
+}
+EXPORT_SYMBOL_GPL(iommu_sva_unmap);
+
+size_t iommu_sva_unmap_fast(struct iommu_domain *domain, struct io_mm *io_mm,
+			    unsigned long iova, size_t size)
+{
+	if (WARN_ON(io_mm_is_shared(io_mm)))
+		return 0;
+
+	return __iommu_unmap(domain, io_mm, iova, size, false);
+}
+EXPORT_SYMBOL_GPL(iommu_sva_unmap_fast);
+
+phys_addr_t iommu_sva_iova_to_phys(struct iommu_domain *domain,
+				   struct io_mm *io_mm, dma_addr_t iova)
+{
+	if (!io_mm)
+		return iommu_iova_to_phys(domain, iova);
+
+	if (WARN_ON(io_mm_is_shared(io_mm)))
+		return 0;
+
+	if (unlikely(domain->ops->sva_iova_to_phys == NULL))
+		return 0;
+
+	return domain->ops->sva_iova_to_phys(domain, io_mm, iova);
+}
+EXPORT_SYMBOL_GPL(iommu_sva_iova_to_phys);
+
+void iommu_sva_tlb_range_add(struct iommu_domain *domain, struct io_mm *io_mm,
+			     unsigned long iova, size_t size)
+{
+	if (!io_mm) {
+		iommu_tlb_range_add(domain, iova, size);
+		return;
+	}
+
+	if (WARN_ON(io_mm_is_shared(io_mm)))
+		return;
+
+	if (domain->ops->sva_iotlb_range_add != NULL)
+		domain->ops->sva_iotlb_range_add(domain, io_mm, iova, size);
+}
+EXPORT_SYMBOL_GPL(iommu_sva_tlb_range_add);
+
 /**
  * iommu_sva_find() - Find mm associated to the given PASID
  * @pasid: Process Address Space ID assigned to the mm
@@ -779,7 +944,7 @@ struct mm_struct *iommu_sva_find(int pasid)
 
 	spin_lock(&iommu_sva_lock);
 	io_mm = idr_find(&iommu_pasid_idr, pasid);
-	if (io_mm && io_mm_get_locked(io_mm)) {
+	if (io_mm_is_shared(io_mm) && io_mm_get_locked(io_mm)) {
 		if (mmget_not_zero(io_mm->mm))
 			mm = io_mm->mm;
 
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index b02e1d32c733..26ac9b2a813e 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1816,8 +1816,8 @@ static size_t iommu_pgsize(struct iommu_domain *domain,
 	return pgsize;
 }
 
-int iommu_map(struct iommu_domain *domain, unsigned long iova,
-	      phys_addr_t paddr, size_t size, int prot)
+int __iommu_map(struct iommu_domain *domain, struct io_mm *io_mm,
+		unsigned long iova, phys_addr_t paddr, size_t size, int prot)
 {
 	unsigned long orig_iova = iova;
 	unsigned int min_pagesz;
@@ -1825,7 +1825,8 @@ int iommu_map(struct iommu_domain *domain, unsigned long iova,
 	phys_addr_t orig_paddr = paddr;
 	int ret = 0;
 
-	if (unlikely(domain->ops->map == NULL ||
+	if (unlikely((!io_mm && domain->ops->map == NULL) ||
+		     (io_mm && domain->ops->sva_map == NULL) ||
 		     domain->pgsize_bitmap == 0UL))
 		return -ENODEV;
 
@@ -1854,7 +1855,12 @@ int iommu_map(struct iommu_domain *domain, unsigned long iova,
 		pr_debug("mapping: iova 0x%lx pa %pa pgsize 0x%zx\n",
 			 iova, &paddr, pgsize);
 
-		ret = domain->ops->map(domain, iova, paddr, pgsize, prot);
+		if (io_mm)
+			ret = domain->ops->sva_map(domain, io_mm, iova, paddr,
+						   pgsize, prot);
+		else
+			ret = domain->ops->map(domain, iova, paddr, pgsize,
+					       prot);
 		if (ret)
 			break;
 
@@ -1865,24 +1871,30 @@ int iommu_map(struct iommu_domain *domain, unsigned long iova,
 
 	/* unroll mapping in case something went wrong */
 	if (ret)
-		iommu_unmap(domain, orig_iova, orig_size - size);
+		__iommu_unmap(domain, io_mm, orig_iova, orig_size - size, true);
 	else
 		trace_map(orig_iova, orig_paddr, orig_size);
 
 	return ret;
 }
+
+int iommu_map(struct iommu_domain *domain, unsigned long iov,
+	      phys_addr_t paddr, size_t size, int prot)
+{
+	return __iommu_map(domain, NULL, iov, paddr, size, prot);
+}
 EXPORT_SYMBOL_GPL(iommu_map);
 
-static size_t __iommu_unmap(struct iommu_domain *domain,
-			    unsigned long iova, size_t size,
-			    bool sync)
+size_t __iommu_unmap(struct iommu_domain *domain, struct io_mm *io_mm,
+		     unsigned long iova, size_t size, bool sync)
 {
 	const struct iommu_ops *ops = domain->ops;
 	size_t unmapped_page, unmapped = 0;
 	unsigned long orig_iova = iova;
 	unsigned int min_pagesz;
 
-	if (unlikely(ops->unmap == NULL ||
+	if (unlikely((!io_mm && ops->unmap == NULL) ||
+		     (io_mm && ops->sva_unmap == NULL) ||
 		     domain->pgsize_bitmap == 0UL))
 		return 0;
 
@@ -1912,7 +1924,11 @@ static size_t __iommu_unmap(struct iommu_domain *domain,
 	while (unmapped < size) {
 		size_t pgsize = iommu_pgsize(domain, iova, size - unmapped);
 
-		unmapped_page = ops->unmap(domain, iova, pgsize);
+		if (io_mm)
+			unmapped_page = ops->sva_unmap(domain, io_mm, iova,
+						       pgsize);
+		else
+			unmapped_page = ops->unmap(domain, iova, pgsize);
 		if (!unmapped_page)
 			break;
 
@@ -1936,19 +1952,20 @@ static size_t __iommu_unmap(struct iommu_domain *domain,
 size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova, size_t size)
 {
-	return __iommu_unmap(domain, iova, size, true);
+	return __iommu_unmap(domain, NULL, iova, size, true);
 }
 EXPORT_SYMBOL_GPL(iommu_unmap);
 
 size_t iommu_unmap_fast(struct iommu_domain *domain,
 			unsigned long iova, size_t size)
 {
-	return __iommu_unmap(domain, iova, size, false);
+	return __iommu_unmap(domain, NULL, iova, size, false);
 }
 EXPORT_SYMBOL_GPL(iommu_unmap_fast);
 
-size_t default_iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
-			    struct scatterlist *sg, unsigned int nents, int prot)
+size_t default_iommu_map_sg(struct iommu_domain *domain, struct io_mm *io_mm,
+			    unsigned long iova, struct scatterlist *sg,
+			    unsigned int nents, int prot)
 {
 	struct scatterlist *s;
 	size_t mapped = 0;
@@ -1972,7 +1989,7 @@ size_t default_iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
 		if (!IS_ALIGNED(s->offset, min_pagesz))
 			goto out_err;
 
-		ret = iommu_map(domain, iova + mapped, phys, s->length, prot);
+		ret = __iommu_map(domain, io_mm, iova + mapped, phys, s->length, prot);
 		if (ret)
 			goto out_err;
 
@@ -1983,7 +2000,7 @@ size_t default_iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
 
 out_err:
 	/* undo mappings already done */
-	iommu_unmap(domain, iova, mapped);
+	__iommu_unmap(domain, io_mm, iova, mapped, true);
 
 	return 0;
 
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index b525f5636df8..1f63a8691173 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -248,13 +248,17 @@ struct iommu_sva_param {
  * @mm_invalidate: Invalidate a range of mappings for an mm
  * @map: map a physically contiguous memory region to an iommu domain
  * @unmap: unmap a physically contiguous memory region from an iommu domain
+ * @sva_map: map a physically contiguous memory region to an address space
+ * @sva_unmap: unmap a physically contiguous memory region from an address space
  * @map_sg: map a scatter-gather list of physically contiguous memory chunks
  *          to an iommu domain
  * @flush_tlb_all: Synchronously flush all hardware TLBs for this domain
  * @tlb_range_add: Add a given iova range to the flush queue for this domain
+ * @sva_iotlb_range_add: Add a given iova range to the flush queue for this mm
  * @tlb_sync: Flush all queued ranges from the hardware TLBs and empty flush
  *            queue
  * @iova_to_phys: translate iova to physical address
+ * @sva_iova_to_phys: translate iova to physical address
  * @add_device: add device to iommu grouping
  * @remove_device: remove device from iommu grouping
  * @device_group: find iommu group for a particular device
@@ -302,13 +306,24 @@ struct iommu_ops {
 		   phys_addr_t paddr, size_t size, int prot);
 	size_t (*unmap)(struct iommu_domain *domain, unsigned long iova,
 			size_t size);
-	size_t (*map_sg)(struct iommu_domain *domain, unsigned long iova,
-			 struct scatterlist *sg, unsigned int nents, int prot);
+	int (*sva_map)(struct iommu_domain *domain, struct io_mm *io_mm,
+		       unsigned long iova, phys_addr_t paddr, size_t size,
+		       int prot);
+	size_t (*sva_unmap)(struct iommu_domain *domain, struct io_mm *io_mm,
+			    unsigned long iova, size_t size);
+	size_t (*map_sg)(struct iommu_domain *domain, struct io_mm *io_mm,
+			 unsigned long iova, struct scatterlist *sg,
+			 unsigned int nents, int prot);
 	void (*flush_iotlb_all)(struct iommu_domain *domain);
 	void (*iotlb_range_add)(struct iommu_domain *domain,
 				unsigned long iova, size_t size);
+	void (*sva_iotlb_range_add)(struct iommu_domain *domain,
+				    struct io_mm *io_mm, unsigned long iova,
+				    size_t size);
 	void (*iotlb_sync)(struct iommu_domain *domain);
 	phys_addr_t (*iova_to_phys)(struct iommu_domain *domain, dma_addr_t iova);
+	phys_addr_t (*sva_iova_to_phys)(struct iommu_domain *domain,
+					struct io_mm *io_mm, dma_addr_t iova);
 	int (*add_device)(struct device *dev);
 	void (*remove_device)(struct device *dev);
 	struct iommu_group *(*device_group)(struct device *dev);
@@ -528,15 +543,20 @@ extern int iommu_sva_invalidate(struct iommu_domain *domain,
 		struct device *dev, struct tlb_invalidate_info *inv_info);
 
 extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
+extern int __iommu_map(struct iommu_domain *domain, struct io_mm *io_mm,
+		       unsigned long iova, phys_addr_t paddr, size_t size,
+		       int prot);
 extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
 		     phys_addr_t paddr, size_t size, int prot);
+extern size_t __iommu_unmap(struct iommu_domain *domain, struct io_mm *io_mm,
+			    unsigned long iova, size_t size, bool sync);
 extern size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova,
 			  size_t size);
 extern size_t iommu_unmap_fast(struct iommu_domain *domain,
 			       unsigned long iova, size_t size);
-extern size_t default_iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
-				   struct scatterlist *sg, unsigned int nents,
-				   int prot);
+extern size_t default_iommu_map_sg(struct iommu_domain *domain, struct io_mm *io_mm,
+				   unsigned long iova, struct scatterlist *sg,
+				   unsigned int nents, int prot);
 extern phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain, dma_addr_t iova);
 extern void iommu_set_fault_handler(struct iommu_domain *domain,
 			iommu_fault_handler_t handler, void *token);
@@ -622,7 +642,7 @@ static inline size_t iommu_map_sg(struct iommu_domain *domain,
 				  unsigned long iova, struct scatterlist *sg,
 				  unsigned int nents, int prot)
 {
-	return domain->ops->map_sg(domain, iova, sg, nents, prot);
+	return domain->ops->map_sg(domain, NULL, iova, sg, nents, prot);
 }
 
 /* PCI device grouping function */
@@ -710,12 +730,25 @@ static inline struct iommu_domain *iommu_get_domain_for_dev(struct device *dev)
 	return NULL;
 }
 
+static inline int __iommu_map(struct iommu_domain *domain, struct io_mm *io_mm,
+			      unsigned long iova, phys_addr_t paddr,
+			      size_t size, int prot)
+{
+	return -ENODEV;
+}
+
 static inline int iommu_map(struct iommu_domain *domain, unsigned long iova,
 			    phys_addr_t paddr, size_t size, int prot)
 {
 	return -ENODEV;
 }
 
+static inline size_t __iommu_unmap(struct iommu_domain *domain, struct io_mm *io_mm,
+				   unsigned long iova, size_t size, bool sync)
+{
+	return 0;
+}
+
 static inline size_t iommu_unmap(struct iommu_domain *domain,
 				 unsigned long iova, size_t size)
 {
@@ -729,8 +762,9 @@ static inline size_t iommu_unmap_fast(struct iommu_domain *domain,
 }
 
 static inline size_t iommu_map_sg(struct iommu_domain *domain,
-				  unsigned long iova, struct scatterlist *sg,
-				  unsigned int nents, int prot)
+				  struct io_mm *io_mm, unsigned long iova,
+				  struct scatterlist *sg, unsigned int nents,
+				  int prot)
 {
 	return 0;
 }
@@ -1017,6 +1051,22 @@ extern int __iommu_sva_bind_device(struct device *dev, struct mm_struct *mm,
 extern int __iommu_sva_unbind_device(struct device *dev, int pasid);
 extern void __iommu_sva_unbind_dev_all(struct device *dev);
 
+int iommu_sva_alloc_pasid(struct device *dev, struct io_mm **io_mm);
+void iommu_sva_free_pasid(struct device *dev, struct io_mm *io_mm);
+
+int iommu_sva_map(struct iommu_domain *domain, struct io_mm *io_mm,
+		  unsigned long iova, phys_addr_t paddr, size_t size, int prot);
+size_t iommu_sva_map_sg(struct iommu_domain *domain, struct io_mm *io_mm,
+			unsigned long iova, struct scatterlist *sg,
+			unsigned int nents, int prot);
+size_t iommu_sva_unmap(struct iommu_domain *domain,
+		       struct io_mm *io_mm, unsigned long iova, size_t size);
+size_t iommu_sva_unmap_fast(struct iommu_domain *domain, struct io_mm *io_mm,
+			    unsigned long iova, size_t size);
+phys_addr_t iommu_sva_iova_to_phys(struct iommu_domain *domain,
+				   struct io_mm *io_mm, dma_addr_t iova);
+void iommu_sva_tlb_range_add(struct iommu_domain *domain, struct io_mm *io_mm,
+			     unsigned long iova, size_t size);
+
 extern struct mm_struct *iommu_sva_find(int pasid);
 #else /* CONFIG_IOMMU_SVA */
 static inline int iommu_sva_device_init(struct device *dev,
@@ -1048,6 +1098,58 @@ static inline void __iommu_sva_unbind_dev_all(struct device *dev)
 {
 }
 
+static inline struct io_mm *
+iommu_sva_alloc_pasid(struct iommu_domain *domain, struct device *dev)
+{
+	return ERR_PTR(-ENODEV);
+}
+
+static inline void iommu_sva_free_pasid(struct io_mm *io_mm, struct device *dev)
+{
+}
+
+static inline int iommu_sva_map(struct iommu_domain *domain,
+				struct io_mm *io_mm, unsigned long iova,
+				phys_addr_t paddr, size_t size, int prot)
+{
+	return -EINVAL;
+}
+
+static inline size_t iommu_sva_map_sg(struct iommu_domain *domain,
+				      struct io_mm *io_mm, unsigned long iova,
+				      struct scatterlist *sg,
+				      unsigned int nents, int prot)
+{
+	return 0;
+}
+
+static inline size_t iommu_sva_unmap(struct iommu_domain *domain,
+				     struct io_mm *io_mm, unsigned long iova,
+				     size_t size)
+{
+	return 0;
+}
+
+static inline size_t iommu_sva_unmap_fast(struct iommu_domain *domain,
+					  struct io_mm *io_mm,
+					  unsigned long iova, size_t size)
+{
+	return 0;
+}
+
+static inline phys_addr_t iommu_sva_iova_to_phys(struct iommu_domain *domain,
+						 struct io_mm *io_mm,
+						 dma_addr_t iova)
+{
+	return 0;
+}
+
+static inline void iommu_sva_tlb_range_add(struct iommu_domain *domain,
+					   struct io_mm *io_mm,
+					   unsigned long iova, size_t size)
+{
+}
+
 static inline struct mm_struct *iommu_sva_find(int pasid)
 {
 	return NULL;
> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
> Sent: Thursday, July 26, 2018 3:20 AM
>
> On 25/07/18 07:19, Tian, Kevin wrote:
>> [...]
>>
>> Another thing to think about is simplifying the caller logic. It would
>> be good to have consistent IOMMU APIs across PCI devices and mdevs for
>> common VFIO operations, e.g. map/unmap, iommu_attach_group, etc. If we
>> need special PASID-aware versions of those ops, it brings a burden to
>> VFIO and other IOMMU clients...
>
> Yes, providing special ops is the simplest to implement (or least
> invasive, I guess) but isn't particularly elegant. See the patch from
> Jordan below, which I was planning to add to the SVA series (I'm
> laboriously working towards the next version), and my email from last
> week: https://patchwork.kernel.org/patch/10412251/
>
> Whenever I come back to hierarchical IOMMU domains I reject it as too
> complicated, but maybe that is what we need. I find it difficult to
> reason about because domains currently represent both a collection of
> devices and one or more address spaces. I proposed the io_mm thing to
> represent a single address space, and to avoid adding special cases to
> every function in the IOMMU subsystem that manipulates domains.

I suppose io_mm still sits under an iommu_domain, right? That is
different from the hierarchical iommu domain concept...

>> For an IOMMU-capable mdev, there are two use cases which have been
>> discussed so far:
>>
>> 1) mdev with PASID-granular DMA isolation
>> 2) mdev inheriting the RID from its parent device (no sharing)
>
> Would that be a single mdev per parent device? Otherwise I don't really
> understand how it would work without all map/unmap operations going to
> the parent device's driver.

Yes, that is why I said "no sharing". In that special case the host
driver itself doesn't do DMA; only the mdev does.

>> 1) is what this RFC is currently for. 2) is what Alex was concerned
>> about, which is not covered by the RFC yet.
>>
>> In my mind, an ideal design is like below:
>>
>> a) VFIO provides an interface for the parent driver to specify whether
>>    an mdev is IOMMU capable, with a flag to indicate whether it's
>>    PASID-granular or RID-inherit;
>>
>> b) Once an IOMMU-capable mdev is created, VFIO treats it no differently
>>    from normal physical devices, using exactly the same IOMMU APIs to
>>    operate on it (including future SVA). VFIO doesn't care whether PASID
>>    or RID will be used in hardware;
>>
>> c) The IOMMU driver then handles the RID/PASID difference internally,
>>    based on its own configuration and the mdev type.
>>
>> To support the above goal, I feel a new device handle is a nice fit to
>> represent an mdev within the IOMMU driver, with the same set of
>> capabilities as observed on its parent device.
>
> It's a good abstraction, but I'm still concerned about other users of
> PASID-granular DMA isolation: for example GPU drivers wanting to
> improve isolation of DMA bufs will want the same functionality without
> going through the vfio-mdev module.

For GPUs I think you mean SVA. That one anyway needs its own interface,
as we have been discussing in your and Jacob's series.

Here mdev is orthogonal to a specific capability like SVA. It is sort
of a logical representation of a subset of the parent device's
resources, on top of which we can enable IOMMU capabilities including
SVA. I'm not sure whether it is good to couple the two requirements so
closely...

> The IOMMU operations we care about don't take a device handle, I think,
> just a domain. And VFIO itself only deals with domains when doing
> map/unmap. Maybe we could add this operation to the IOMMU subsystem:
>
>	child_domain = domain_create_child(parent_dev, parent_domain)
>
> A child domain would be a smaller isolation granule, getting a PASID if
> that's what the IOMMU implements, or another mechanism for 2). It is
> automatically attached to its parent's devices, so attach/detach
> operations wouldn't be necessary on the child.
>
> Depending on the underlying architecture, the child domain can support
> map/unmap on a single stage, or map/unmap for the 2nd level and
> bind_pgtable for the 1st level.
>
> I'm not sure how this works for host SVA though. I think the
> sva_bind_dev() API could stay the same, but the internals will need to
> change.

A hierarchical domain might be the right way to go, but let's do more
thinking on any corner cases.

One open question is whether an iommu domain can fully carry all the
attributes required for an mdev. Note that today, for a physical
device, each vendor driver maintains a device structure for
device-specific information which may impact IOMMU settings (e.g.
struct device_domain_info in intel-iommu, and struct arm_smmu_device in
arm-smmu). If we want an mdev to have a different attribute from its
parent device, then a new representation might be required. But
honestly speaking I don't think that is a valid requirement now:
physically it is still the IOMMU structure of the parent device being
configured, so the mdev should just inherit the same attributes as its
parent.

If the role of the mdev representation in the current RFC is just to
connect the iommu_domain where a hierarchical domain is missing, then
we might instead just make the latter happen... Let's think more in
this direction. By the way, can you elaborate on any other complexities
you saw when you evaluated this option earlier?

Thanks
Kevin
> From: Tian, Kevin
> Sent: Thursday, July 26, 2018 11:04 AM
>
[...]
>>
>> The IOMMU operations we care about don't take a device handle, I think,
>> just a domain. And VFIO itself only deals with domains when doing
>> map/unmap. Maybe we could add this operation to the IOMMU subsystem:
>>
>>	child_domain = domain_create_child(parent_dev, parent_domain)
>>
>> [...]
>
> A hierarchical domain might be the right way to go, but let's do more
> thinking on any corner cases.

By the way, maybe we don't need to make it "hierarchical", as
maintaining a hierarchy usually brings more work. What we require is
possibly just the capability of having one device bound to multiple
iommu_domains. One domain is reserved for the parent driver's own DMA
activities (e.g. serving the DMA API), while the other domains are
auxiliary and can be tagged with a PASID (or any other identifier which
the IOMMU can use to support multiple domains). Such identifiers may be
provisioned automatically when an auxiliary domain is attached, i.e.
not requiring an explicit request from the caller. IMO it's safe to
assume that supporting multiple iommu domains anyway implies some
finer-grained capability than RID-based isolation in the underlying
IOMMU. Then there is no need for a parent/child concept.

Thoughts?

Thanks
Kevin
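[To make the model concrete, here is a rough sketch of what such an auxiliary attach could look like to a caller. Everything below is hypothetical: iommu_attach_aux_domain() and iommu_aux_get_pasid() are names invented for this illustration, not existing kernel functions.]

#include <linux/iommu.h>

/* Invented for this sketch -- not existing kernel functions */
int iommu_attach_aux_domain(struct iommu_domain *domain, struct device *dev);
int iommu_aux_get_pasid(struct iommu_domain *domain, struct device *dev);

/* Hypothetical sketch: give one mdev its own I/O address space on the
 * parent device, without touching the parent's primary domain. */
static int mdev_bind_aux_domain(struct device *parent)
{
	struct iommu_domain *aux;
	int pasid, ret;

	aux = iommu_domain_alloc(parent->bus);
	if (!aux)
		return -ENOMEM;

	/* Attach as an *extra* domain; the IOMMU driver provisions a
	 * PASID (or similar tag) for it by itself. */
	ret = iommu_attach_aux_domain(aux, parent);
	if (ret)
		return ret;

	/* Retrieve the identifier the driver picked, so the parent
	 * driver can program it into work submissions. */
	pasid = iommu_aux_get_pasid(aux, parent);
	return pasid < 0 ? pasid : 0;
}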
On 26/07/18 04:03, Tian, Kevin wrote:
>> Whenever I come back to hierarchical IOMMU domains I reject it as too
>> complicated, but maybe that is what we need. I find it difficult to
>> reason about because domains currently represent both a collection of
>> devices and one or more address spaces. I proposed the io_mm thing to
>> represent a single address space, and to avoid adding special cases to
>> every function in the IOMMU subsystem that manipulates domains.
>
> I suppose io_mm still sits under an iommu_domain, right? That is
> different from the hierarchical iommu domain concept...

Yes, my initial solution is io_mm attached to domains, but in the rest
of the mail yesterday I tried to explore a way to replace io_mm with
child domains instead. In my current patches, an io_mm used for private
PASIDs is basically a lightweight child domain. I don't know which
solution is best, or if mdev-aware IOMMU drivers are still preferable.
For the moment I'll keep the private PASID proposal that uses io_mm,
until we've thought a bit more about this.

[...]

>> It's a good abstraction, but I'm still concerned about other users of
>> PASID-granular DMA isolation: for example GPU drivers wanting to
>> improve isolation of DMA bufs will want the same functionality without
>> going through the vfio-mdev module.
>
> For GPUs I think you mean SVA. That one anyway needs its own interface,
> as we have been discussing in your and Jacob's series.

Actually I didn't mean SVA. People would like to allocate PASIDs in
kernel drivers and do iommu_map/unmap on them, without binding process
address spaces. Not counting hardware validation, I've actually seen
more demand for private PASID management than for SVA. See for example
the discussion on RFCv2 of SVA, or Jordan's series:
https://www.spinics.net/lists/arm-kernel/msg611038.html
https://lwn.net/Articles/747889/

It would be good if mdev and non-mdev could reuse most of the same code
for private PASIDs, whatever the mechanism turns out to be.

> Here mdev is orthogonal to a specific capability like SVA. It is sort
> of a logical representation of a subset of the parent device's
> resources, on top of which we can enable IOMMU capabilities including
> SVA. I'm not sure whether it is good to couple the two requirements so
> closely...
>
>> The IOMMU operations we care about don't take a device handle, I think,
>> just a domain. And VFIO itself only deals with domains when doing
>> map/unmap. Maybe we could add this operation to the IOMMU subsystem:
>>
>>	child_domain = domain_create_child(parent_dev, parent_domain)
>>
>> [...]
>>
>> I'm not sure how this works for host SVA though. I think the
>> sva_bind_dev() API could stay the same, but the internals will need to
>> change.

Thinking more about this, the SVA case seems a bit nasty. If we
replaced io_mm with iommu_domain for SVA, a child domain would
represent a single process address space, and therefore have multiple
parent domains... Not sure we should go down that road. Maybe we should
keep SVA separate, and keep io_mm as a wrapper for mm_struct.

> A hierarchical domain might be the right way to go, but let's do more
> thinking on any corner cases.
>
> One open question is whether an iommu domain can fully carry all the
> attributes required for an mdev. [...] Physically it is still the IOMMU
> structure of the parent device being configured, so the mdev should
> just inherit the same attributes as its parent.

Agreed, the mdev would inherit the properties of its parent, since
physically the IOMMU sees DMA transactions coming from the parent.

> Let's think more in this direction. By the way, can you elaborate on
> any other complexities you saw when you evaluated this option earlier?

I can't think of anything right now. My prototype worked well with
iommu_sva_alloc_pasid, but the goal was mainly for me to understand
mdev using a toy DMA engine; I didn't spend time thinking about corner
cases.

Thanks,
Jean
On 26/07/18 04:28, Tian, Kevin wrote:
>> A hierarchical domain might be the right way to go, but let's do more
>> thinking on any corner cases.
>
> By the way, maybe we don't need to make it "hierarchical", as
> maintaining a hierarchy usually brings more work. What we require is
> possibly just the capability of having one device bound to multiple
> iommu_domains. One domain is reserved for the parent driver's own DMA
> activities (e.g. serving the DMA API), while the other domains are
> auxiliary and can be tagged with a PASID (or any other identifier which
> the IOMMU can use to support multiple domains). Such identifiers may be
> provisioned automatically when an auxiliary domain is attached, i.e.
> not requiring an explicit request from the caller. IMO it's safe to
> assume that supporting multiple iommu domains anyway implies some
> finer-grained capability than RID-based isolation in the underlying
> IOMMU. Then there is no need for a parent/child concept.

Right, we probably don't need a hierarchy. I like this model (it's
actually the one I favor for supporting PASID in virtio-iommu), though
I'm hoping we can avoid changing the logic of iommu_attach/detach, and
instead introduce a new "get_child_domain", "attach_aux_domain" or
simply "clone" op.

Currently the attach_dev() op toggles the domain of a device. For
example, a device automatically gets a default domain for kernel DMA,
then VFIO attaches a new domain for assigning it to a VM. attach_dev()
disables the default domain and installs fresh page tables;
detach_dev() isn't called, and the SMMU drivers don't even implement
it. If we wanted to reuse attach_dev() for auxiliary domains, we'd need
some flag to differentiate the "add an auxiliary domain" operation from
the "detach everything and attach one new domain" one.

Thanks,
Jean
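[One possible shape for the new op Jean-Philippe mentions, sketched side by side with attach_dev(). This is entirely hypothetical: the struct and the attach_aux_dev/detach_aux_dev names are invented here to show the intended semantics, not real kernel API.]

/*
 * Hypothetical sketch: auxiliary-attach ops added next to attach_dev(),
 * so the primary domain's semantics stay untouched.
 */
struct iommu_ops_sketch {
	/* Existing semantics: replace the device's current domain */
	int (*attach_dev)(struct iommu_domain *domain, struct device *dev);

	/* New semantics: add a domain (e.g. PASID-tagged) on top of the
	 * current one, leaving the default/primary domain in place */
	int (*attach_aux_dev)(struct iommu_domain *domain,
			      struct device *dev);
	void (*detach_aux_dev)(struct iommu_domain *domain,
			       struct device *dev);
};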
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 3ede34a..57ccfc4 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -5276,6 +5276,17 @@ static void intel_iommu_put_resv_regions(struct device *dev,
 	}
 }
 
+static struct iommu_group *intel_iommu_device_group(struct device *dev)
+{
+	if (dev_is_pci(dev))
+		return pci_device_group(dev);
+
+	if (dev_is_mdev(dev))
+		return iommu_group_alloc();
+
+	return ERR_PTR(-EINVAL);
+}
+
 #ifdef CONFIG_INTEL_IOMMU_SVM
 int intel_iommu_enable_pasid(struct intel_iommu *iommu, struct intel_svm_dev *sdev)
 {
@@ -5370,7 +5381,7 @@ const struct iommu_ops intel_iommu_ops = {
 	.remove_device		= intel_iommu_remove_device,
 	.get_resv_regions	= intel_iommu_get_resv_regions,
 	.put_resv_regions	= intel_iommu_put_resv_regions,
-	.device_group		= pci_device_group,
+	.device_group		= intel_iommu_device_group,
 	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
 };