Message ID | 20180830040922.30426-1-baolu.lu@linux.intel.com (mailing list archive) |
---|---|
Series | vfio/mdev: IOMMU aware mediated device |
> From: Lu Baolu [mailto:baolu.lu@linux.intel.com]
> Sent: Thursday, August 30, 2018 12:09 PM
>
> [...]
>
> In order to distinguish the IOMMU-capable mediated devices from those
> which still need to rely on parent devices, this patch set adds a
> domain type attribute to each mdev.
>
> enum mdev_domain_type {
>         DOMAIN_TYPE_NO_IOMMU,      /* Don't need any IOMMU support.
>                                     * All isolation and protection
>                                     * are handled by the parent
>                                     * device driver with a device
>                                     * specific mechanism.
>                                     */
>         DOMAIN_TYPE_ATTACH_PARENT, /* IOMMU can isolate and protect
>                                     * the mdev, and the isolation
>                                     * domain should be attached to
>                                     * the parent device.
>                                     */
> };

ATTACH_PARENT is not a good counterpart to NO_IOMMU. What about
DOMAIN_TYPE_NO_IOMMU/DOMAIN_TYPE_IOMMU? Whether to attach the parent
device is just internal logic.

Alternatively DOMAIN_TYPE_SOFTWARE/DOMAIN_TYPE_HARDWARE, where software
means the iommu_domain is managed by software while the other means it
is managed by hardware.

One side note to Alex - with the multiple domain extension in the IOMMU
layer, this version combines two IOMMU-capable usages in VFIO:
PASID-based (as in scalable IOV) and RID-based (as in the usage of an
mdev wrapper on any device). Both cases share the common path - just
binding the domain to the parent device of the mdev. The IOMMU layer
will handle the two cases differently later.

Thanks
Kevin
On Wed, 5 Sep 2018 03:01:39 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Lu Baolu [mailto:baolu.lu@linux.intel.com]
> > Sent: Thursday, August 30, 2018 12:09 PM
> >
> [...]
> >
> > In order to distinguish the IOMMU-capable mediated devices from those
> > which still need to rely on parent devices, this patch set adds a
> > domain type attribute to each mdev.
> >
> > enum mdev_domain_type {
> >         DOMAIN_TYPE_NO_IOMMU,      /* Don't need any IOMMU support.
> >                                     * All isolation and protection
> >                                     * are handled by the parent
> >                                     * device driver with a device
> >                                     * specific mechanism.
> >                                     */
> >         DOMAIN_TYPE_ATTACH_PARENT, /* IOMMU can isolate and protect
> >                                     * the mdev, and the isolation
> >                                     * domain should be attached to
> >                                     * the parent device.
> >                                     */
> > };
>
> ATTACH_PARENT is not a good counterpart to NO_IOMMU.

Please do not use NO_IOMMU, we already have a thing called
vfio-noiommu, enabled through CONFIG_VFIO_NOIOMMU and the module
parameter enable_unsafe_noiommu_mode. This is much, much too similar
and will generate confusion.

> What about DOMAIN_TYPE_NO_IOMMU/DOMAIN_TYPE_IOMMU? Whether
> to attach the parent device is just internal logic.
>
> Alternatively DOMAIN_TYPE_SOFTWARE/DOMAIN_TYPE_HARDWARE,
> where software means the iommu_domain is managed by software while
> the other means it is managed by hardware.

I haven't gotten deep enough into the series to see how it's used, but
my gut reaction is that we don't need an enum, we just need some sort
of pointer on the mdev that points to an iommu_parent, which indicates
the root of our IOMMU-based isolation, or is NULL, which indicates we
use vendor-defined isolation as we have now.

> One side note to Alex - with the multiple domain extension in the IOMMU
> layer, this version combines two IOMMU-capable usages in VFIO:
> PASID-based (as in scalable IOV) and RID-based (as in the usage of an
> mdev wrapper on any device). Both cases share the common path - just
> binding the domain to the parent device of the mdev. The IOMMU layer
> will handle the two cases differently later.

Good, I'm glad you've considered the regular (RID) IOMMU domain and not
just the new aux domain. Thanks,

Alex
Hi,

On 09/06/2018 03:15 AM, Alex Williamson wrote:
> On Wed, 5 Sep 2018 03:01:39 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
>
>>> From: Lu Baolu [mailto:baolu.lu@linux.intel.com]
>>> Sent: Thursday, August 30, 2018 12:09 PM
>>>
>> [...]
>>>
>>> In order to distinguish the IOMMU-capable mediated devices from those
>>> which still need to rely on parent devices, this patch set adds a
>>> domain type attribute to each mdev.
>>>
>>> enum mdev_domain_type {
>>>         DOMAIN_TYPE_NO_IOMMU,      /* Don't need any IOMMU support.
>>>                                     * All isolation and protection
>>>                                     * are handled by the parent
>>>                                     * device driver with a device
>>>                                     * specific mechanism.
>>>                                     */
>>>         DOMAIN_TYPE_ATTACH_PARENT, /* IOMMU can isolate and protect
>>>                                     * the mdev, and the isolation
>>>                                     * domain should be attached to
>>>                                     * the parent device.
>>>                                     */
>>> };
>>
>> ATTACH_PARENT is not a good counterpart to NO_IOMMU.
>
> Please do not use NO_IOMMU, we already have a thing called
> vfio-noiommu, enabled through CONFIG_VFIO_NOIOMMU and the module
> parameter enable_unsafe_noiommu_mode. This is much, much too similar
> and will generate confusion.

Sure. Will remove this confusion.

>> What about DOMAIN_TYPE_NO_IOMMU/DOMAIN_TYPE_IOMMU? Whether
>> to attach the parent device is just internal logic.
>>
>> Alternatively DOMAIN_TYPE_SOFTWARE/DOMAIN_TYPE_HARDWARE,
>> where software means the iommu_domain is managed by software while
>> the other means it is managed by hardware.
>
> I haven't gotten deep enough into the series to see how it's used, but
> my gut reaction is that we don't need an enum, we just need some sort
> of pointer on the mdev that points to an iommu_parent, which indicates
> the root of our IOMMU-based isolation, or is NULL, which indicates we
> use vendor-defined isolation as we have now.

It works as long as we can distinguish IOMMU-based isolation from
vendor-defined isolation. How about making iommu_parent point to the
device structure of the device that created the mdev? If this pointer
is NOT NULL, we will bind the domain to the device it points to;
otherwise, handle it in the vendor-defined way.

Best regards,
Lu Baolu

>> One side note to Alex - with the multiple domain extension in the IOMMU
>> layer, this version combines two IOMMU-capable usages in VFIO:
>> PASID-based (as in scalable IOV) and RID-based (as in the usage of an
>> mdev wrapper on any device). Both cases share the common path - just
>> binding the domain to the parent device of the mdev. The IOMMU layer
>> will handle the two cases differently later.
>
> Good, I'm glad you've considered the regular (RID) IOMMU domain and not
> just the new aux domain. Thanks,
>
> Alex
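[Editor's note: to make the NULL-or-pointer idea above concrete, here is
a minimal sketch of how an iommu_parent field could hang off the mdev
structure. The field and helper names are hypothetical illustrations,
not taken from the posted series.]

#include <linux/device.h>
#include <linux/iommu.h>

struct mdev_device {
	struct device dev;
	/*
	 * Root of IOMMU-based isolation for this mdev, or NULL when the
	 * parent driver provides vendor-defined isolation (the status
	 * quo). This replaces the mdev_domain_type enum from the RFC.
	 */
	struct device *iommu_parent;
	/* ... existing mdev fields ... */
};

/* Bind @domain for this mdev, or tell the caller to use the vendor path. */
static int mdev_attach_domain(struct mdev_device *mdev,
			      struct iommu_domain *domain)
{
	if (mdev->iommu_parent)
		return iommu_attach_device(domain, mdev->iommu_parent);

	return -ENODEV;	/* fall back to vendor-defined isolation */
}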
Hi,

On 30/08/2018 05:09, Lu Baolu wrote:
> Below APIs are introduced in the IOMMU glue for device drivers to use
> the finer granularity translation.
>
> * iommu_capable(IOMMU_CAP_AUX_DOMAIN)
>   - Represents the ability for supporting multiple domains per device
>     (a.k.a. finer granularity translations) of the IOMMU hardware.

iommu_capable() cannot represent hardware capabilities, we need
something else for systems with multiple IOMMUs that have different
caps. How about iommu_domain_get_attr on the device's domain instead?

> * iommu_en(dis)able_aux_domain(struct device *dev)
>   - Enable/disable the multiple domains capability for a device
>     referenced by @dev.
>
> * iommu_auxiliary_id(struct iommu_domain *domain)
>   - Return the index value used for finer-granularity DMA translation.
>     The specific device driver needs to feed the hardware with this
>     value, so that the hardware device could issue the DMA transaction
>     with this value tagged.

This could also reuse iommu_domain_get_attr.


More generally I'm having trouble understanding how auxiliary domains
will be used. So VFIO allocates PASIDs like this:

* iommu_enable_aux_domain(parent_dev)
* iommu_domain_alloc() -> dom1
* iommu_domain_alloc() -> dom2
* iommu_attach_device(dom1, parent_dev)
  -> dom1 gets PASID #1
* iommu_attach_device(dom2, parent_dev)
  -> dom2 gets PASID #2

Then I'm not sure about the next steps, when userspace does
VFIO_IOMMU_MAP_DMA or VFIO_IOMMU_BIND on an mdev's container. Is the
following use accurate?

For the single translation level:
* iommu_map(dom1, ...) updates first-level/second-level pgtables for
  PASID #1
* iommu_map(dom2, ...) updates first-level/second-level pgtables for
  PASID #2

Nested translation:
* iommu_map(dom1, ...) updates second-level pgtables for PASID #1
* iommu_bind_table(dom1, ...) binds first-level pgtables, provided by
  the guest, for PASID #1
* iommu_map(dom2, ...) updates second-level pgtables for PASID #2
* iommu_bind_table(dom2, ...) binds first-level pgtables for PASID #2


I'm trying to understand how to implement this with SMMU and other
IOMMUs. It's not a clean fit since we have a single domain to hold the
second-level pgtables.

Then again, the nested case probably doesn't matter for us - we might
as well assign the parent directly, since all mdevs have the same
second-level and can only be assigned to the same VM.

Also, can non-VFIO device drivers use auxiliary domains to do map/unmap
on PASIDs? They are asking to do that and I'm proposing the private
PASID thing, but since aux domains provide a similar feature we should
probably converge somehow.

Thanks,
Jean
Hi,

On 09/11/2018 12:22 AM, Jean-Philippe Brucker wrote:
> Hi,
>
> On 30/08/2018 05:09, Lu Baolu wrote:
>> Below APIs are introduced in the IOMMU glue for device drivers to use
>> the finer granularity translation.
>>
>> * iommu_capable(IOMMU_CAP_AUX_DOMAIN)
>>   - Represents the ability for supporting multiple domains per device
>>     (a.k.a. finer granularity translations) of the IOMMU hardware.
>
> iommu_capable() cannot represent hardware capabilities, we need
> something else for systems with multiple IOMMUs that have different
> caps. How about iommu_domain_get_attr on the device's domain instead?

A domain is not a good choice for a per-IOMMU capability query. A
domain might be attached to devices belonging to different IOMMUs.

How about an API with the device structure as a parameter? A device
always belongs to a specific IOMMU. This API is supposed to be used by
the device driver.

>> * iommu_en(dis)able_aux_domain(struct device *dev)
>>   - Enable/disable the multiple domains capability for a device
>>     referenced by @dev.
>>
>> * iommu_auxiliary_id(struct iommu_domain *domain)
>>   - Return the index value used for finer-granularity DMA translation.
>>     The specific device driver needs to feed the hardware with this
>>     value, so that the hardware device could issue the DMA transaction
>>     with this value tagged.
>
> This could also reuse iommu_domain_get_attr.
>
>
> More generally I'm having trouble understanding how auxiliary domains
> will be used. So VFIO allocates PASIDs like this:

As I wrote in the cover letter, "auxiliary domain" is just a name to
ease discussion. It actually has no special meaning (we think of a
domain as an isolation boundary which could be used by the IOMMU to
isolate the DMA transactions out of a PCI device or part of it).

So drivers like vfio should see no difference when they use an
auxiliary domain. The auxiliary domain is not visible outside of the
iommu driver.

> * iommu_enable_aux_domain(parent_dev)
> * iommu_domain_alloc() -> dom1
> * iommu_domain_alloc() -> dom2
> * iommu_attach_device(dom1, parent_dev)
>   -> dom1 gets PASID #1
> * iommu_attach_device(dom2, parent_dev)
>   -> dom2 gets PASID #2
>
> Then I'm not sure about the next steps, when userspace does
> VFIO_IOMMU_MAP_DMA or VFIO_IOMMU_BIND on an mdev's container. Is the
> following use accurate?
>
> For the single translation level:
> * iommu_map(dom1, ...) updates first-level/second-level pgtables for
>   PASID #1
> * iommu_map(dom2, ...) updates first-level/second-level pgtables for
>   PASID #2
>
> Nested translation:
> * iommu_map(dom1, ...) updates second-level pgtables for PASID #1
> * iommu_bind_table(dom1, ...) binds first-level pgtables, provided by
>   the guest, for PASID #1
> * iommu_map(dom2, ...) updates second-level pgtables for PASID #2
> * iommu_bind_table(dom2, ...) binds first-level pgtables for PASID #2
>
>
> I'm trying to understand how to implement this with SMMU and other

This is proposed for architectures which support finer-granularity
second-level translation, with no impact on architectures which only
support Source ID or similar granularity.

> IOMMUs. It's not a clean fit since we have a single domain to hold the
> second-level pgtables.

Do you mind explaining why a domain holds multiple second-level
pgtables? Shouldn't that be multiple domains?

> Then again, the nested case probably doesn't matter for us - we might
> as well assign the parent directly, since all mdevs have the same
> second-level and can only be assigned to the same VM.
>
>
> Also, can non-VFIO device drivers use auxiliary domains to do map/unmap
> on PASIDs? They are asking to do that and I'm proposing the private
> PASID thing, but since aux domains provide a similar feature we should
> probably converge somehow.

Yes, any non-VFIO device driver could use an aux domain as well. The
use model is:

iommu_enable_aux_domain(dev)
  -- enables aux domain support for this device

iommu_domain_alloc(dev)
  -- allocate an iommu domain

iommu_attach_device(domain, dev)
  -- attach the domain to the device

iommu_auxiliary_id(domain)
  -- retrieve the PASID used by this domain

The device driver then does

iommu_map(domain, ...)

sets the PASID in a hardware register, and starts to do DMA.

Best regards,
Lu Baolu
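[Editor's note: as a concrete illustration of the use model above, here
is a sketch of how a driver might string these calls together.
iommu_enable_aux_domain(), iommu_auxiliary_id() and the one-argument
iommu_domain_alloc(dev) follow the RFC as sketched in this mail, not a
mainline interface; error unwinding is abbreviated.]

#include <linux/iommu.h>

static int sample_driver_setup_aux(struct device *dev, dma_addr_t iova,
				   phys_addr_t paddr, size_t size)
{
	struct iommu_domain *domain;
	int pasid, ret;

	ret = iommu_enable_aux_domain(dev);	/* RFC API: enable aux support */
	if (ret)
		return ret;

	domain = iommu_domain_alloc(dev);	/* as sketched in the mail */
	if (!domain)
		return -ENOMEM;

	ret = iommu_attach_device(domain, dev);	/* attached as an aux domain */
	if (ret)
		goto out_free;

	pasid = iommu_auxiliary_id(domain);	/* PASID backing this domain */

	ret = iommu_map(domain, iova, paddr, size, IOMMU_READ | IOMMU_WRITE);
	if (ret)
		goto out_detach;

	/* program @pasid into a device register, then start DMA... */
	return 0;

out_detach:
	iommu_detach_device(domain, dev);
out_free:
	iommu_domain_free(domain);
	return ret;
}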
On 12/09/2018 03:42, Lu Baolu wrote:
> Hi,
>
> On 09/11/2018 12:22 AM, Jean-Philippe Brucker wrote:
>> Hi,
>>
>> On 30/08/2018 05:09, Lu Baolu wrote:
>>> Below APIs are introduced in the IOMMU glue for device drivers to use
>>> the finer granularity translation.
>>>
>>> * iommu_capable(IOMMU_CAP_AUX_DOMAIN)
>>>   - Represents the ability for supporting multiple domains per device
>>>     (a.k.a. finer granularity translations) of the IOMMU hardware.
>>
>> iommu_capable() cannot represent hardware capabilities, we need
>> something else for systems with multiple IOMMUs that have different
>> caps. How about iommu_domain_get_attr on the device's domain instead?
>
> A domain is not a good choice for a per-IOMMU capability query. A
> domain might be attached to devices belonging to different IOMMUs.
>
> How about an API with the device structure as a parameter? A device
> always belongs to a specific IOMMU. This API is supposed to be used by
> the device driver.

Ah right, domain attributes won't work. Your suggestion seems more
suitable, but maybe users can simply try to enable auxiliary domains
first, and conclude that the IOMMU doesn't support it if that returns
an error.

>>> * iommu_en(dis)able_aux_domain(struct device *dev)
>>>   - Enable/disable the multiple domains capability for a device
>>>     referenced by @dev.

It strikes me now that in the IOMMU driver,
iommu_enable/disable_aux_domain() will do the same thing as
iommu_sva_device_init/shutdown()
(https://www.spinics.net/lists/arm-kernel/msg651896.html). Some IOMMU
drivers want to enable PASID and allocate PASID tables only when
requested by users, in the sva_init_device IOMMU op (see Joerg's
comment last year, https://patchwork.kernel.org/patch/9989307/#21025429).
Maybe we could simply add a flag to iommu_sva_device_init?

>>> * iommu_auxiliary_id(struct iommu_domain *domain)
>>>   - Return the index value used for finer-granularity DMA translation.
>>>     The specific device driver needs to feed the hardware with this
>>>     value, so that the hardware device could issue the DMA transaction
>>>     with this value tagged.
>>
>> This could also reuse iommu_domain_get_attr.
>>
>>
>> More generally I'm having trouble understanding how auxiliary domains
>> will be used. So VFIO allocates PASIDs like this:
>
> As I wrote in the cover letter, "auxiliary domain" is just a name to
> ease discussion. It actually has no special meaning (we think of a
> domain as an isolation boundary which could be used by the IOMMU to
> isolate the DMA transactions out of a PCI device or part of it).
>
> So drivers like vfio should see no difference when they use an
> auxiliary domain. The auxiliary domain is not visible outside of the
> iommu driver.

For an auxiliary domain, VFIO does need to retrieve the PASID and write
it to hardware. But being able to reuse iommu_map/unmap/iova_to_phys/etc
on the auxiliary domain is nice.

>> * iommu_enable_aux_domain(parent_dev)
>> * iommu_domain_alloc() -> dom1
>> * iommu_domain_alloc() -> dom2
>> * iommu_attach_device(dom1, parent_dev)
>>   -> dom1 gets PASID #1
>> * iommu_attach_device(dom2, parent_dev)
>>   -> dom2 gets PASID #2
>>
>> Then I'm not sure about the next steps, when userspace does
>> VFIO_IOMMU_MAP_DMA or VFIO_IOMMU_BIND on an mdev's container. Is the
>> following use accurate?
>>
>> For the single translation level:
>> * iommu_map(dom1, ...) updates first-level/second-level pgtables for
>>   PASID #1
>> * iommu_map(dom2, ...) updates first-level/second-level pgtables for
>>   PASID #2
>>
>> Nested translation:
>> * iommu_map(dom1, ...) updates second-level pgtables for PASID #1
>> * iommu_bind_table(dom1, ...) binds first-level pgtables, provided by
>>   the guest, for PASID #1
>> * iommu_map(dom2, ...) updates second-level pgtables for PASID #2
>> * iommu_bind_table(dom2, ...) binds first-level pgtables for PASID #2
>>
>> I'm trying to understand how to implement this with SMMU and other
>
> This is proposed for architectures which support finer-granularity
> second-level translation, with no impact on architectures which only
> support Source ID or similar granularity.

Just to be clear, in this paragraph you're only referring to the
nested/second-level translation for mdev, which is specific to vt-d
rev3? Other architectures can still do first-level translation with
PASID, to support some use-cases of IOMMU-aware mediated devices
(assigning mdevs to userspace drivers, for example).

>> IOMMUs. It's not a clean fit since we have a single domain to hold the
>> second-level pgtables.
>
> Do you mind explaining why a domain holds multiple second-level
> pgtables? Shouldn't that be multiple domains?

I didn't mean a single domain holding multiple second-level pgtables,
but a single domain holding a single set of second-level pgtables for
all mdevs. But let's ignore that, mdev and second-level isn't realistic
for arm SMMU.

>> Then again, the nested case probably doesn't matter for us - we might
>> as well assign the parent directly, since all mdevs have the same
>> second-level and can only be assigned to the same VM.
>>
>>
>> Also, can non-VFIO device drivers use auxiliary domains to do map/unmap
>> on PASIDs? They are asking to do that and I'm proposing the private
>> PASID thing, but since aux domains provide a similar feature we should
>> probably converge somehow.
>
> Yes, any non-VFIO device driver could use an aux domain as well. The
> use model is:
>
> iommu_enable_aux_domain(dev)
>   -- enables aux domain support for this device
>
> iommu_domain_alloc(dev)
>   -- allocate an iommu domain
>
> iommu_attach_device(domain, dev)
>   -- attach the domain to the device
>
> iommu_auxiliary_id(domain)
>   -- retrieve the PASID used by this domain
>
> The device driver then does
>
> iommu_map(domain, ...)
>
> sets the PASID in a hardware register, and starts to do DMA.

Sounds good, I'll drop the private PASID patch if we can figure out a
solution to the attach/detach_dev problem discussed on patch 8/10.

Thanks,
Jean
> From: Jean-Philippe Brucker
> Sent: Thursday, September 13, 2018 1:54 AM
>
> On 12/09/2018 03:42, Lu Baolu wrote:
> > Hi,
> >
> > On 09/11/2018 12:22 AM, Jean-Philippe Brucker wrote:
> >> Hi,
> >>
> >> On 30/08/2018 05:09, Lu Baolu wrote:
> >>> Below APIs are introduced in the IOMMU glue for device drivers to
> >>> use the finer granularity translation.
> >>>
> >>> * iommu_capable(IOMMU_CAP_AUX_DOMAIN)
> >>>   - Represents the ability for supporting multiple domains per
> >>>     device (a.k.a. finer granularity translations) of the IOMMU
> >>>     hardware.
> >>
> >> iommu_capable() cannot represent hardware capabilities, we need
> >> something else for systems with multiple IOMMUs that have different
> >> caps. How about iommu_domain_get_attr on the device's domain
> >> instead?
> >
> > A domain is not a good choice for a per-IOMMU capability query. A
> > domain might be attached to devices belonging to different IOMMUs.
> >
> > How about an API with the device structure as a parameter? A device
> > always belongs to a specific IOMMU. This API is supposed to be used
> > by the device driver.
>
> Ah right, domain attributes won't work. Your suggestion seems more
> suitable, but maybe users can simply try to enable auxiliary domains
> first, and conclude that the IOMMU doesn't support it if that returns
> an error.
>
> >>> * iommu_en(dis)able_aux_domain(struct device *dev)
> >>>   - Enable/disable the multiple domains capability for a device
> >>>     referenced by @dev.
>
> It strikes me now that in the IOMMU driver,
> iommu_enable/disable_aux_domain() will do the same thing as
> iommu_sva_device_init/shutdown()
> (https://www.spinics.net/lists/arm-kernel/msg651896.html). Some IOMMU
> drivers want to enable PASID and allocate PASID tables only when
> requested by users, in the sva_init_device IOMMU op (see Joerg's
> comment last year, https://patchwork.kernel.org/patch/9989307/#21025429).
> Maybe we could simply add a flag to iommu_sva_device_init?

We could combine, but definitely 'sva' should be removed :-)

> >>> * iommu_auxiliary_id(struct iommu_domain *domain)
> >>>   - Return the index value used for finer-granularity DMA
> >>>     translation. The specific device driver needs to feed the
> >>>     hardware with this value, so that the hardware device could
> >>>     issue the DMA transaction with this value tagged.
> >>
> >> This could also reuse iommu_domain_get_attr.
> >>
> >>
> >> More generally I'm having trouble understanding how auxiliary domains
> >> will be used. So VFIO allocates PASIDs like this:
> >
> > As I wrote in the cover letter, "auxiliary domain" is just a name to
> > ease discussion. It actually has no special meaning (we think of a
> > domain as an isolation boundary which could be used by the IOMMU to
> > isolate the DMA transactions out of a PCI device or part of it).
> >
> > So drivers like vfio should see no difference when they use an
> > auxiliary domain. The auxiliary domain is not visible outside of the
> > iommu driver.
>
> For an auxiliary domain, VFIO does need to retrieve the PASID and write
> it to hardware. But being able to reuse
> iommu_map/unmap/iova_to_phys/etc on the auxiliary domain is nice.
>
> >> * iommu_enable_aux_domain(parent_dev)
> >> * iommu_domain_alloc() -> dom1
> >> * iommu_domain_alloc() -> dom2
> >> * iommu_attach_device(dom1, parent_dev)
> >>   -> dom1 gets PASID #1
> >> * iommu_attach_device(dom2, parent_dev)
> >>   -> dom2 gets PASID #2
> >>
> >> Then I'm not sure about the next steps, when userspace does
> >> VFIO_IOMMU_MAP_DMA or VFIO_IOMMU_BIND on an mdev's container. Is the
> >> following use accurate?
> >>
> >> For the single translation level:
> >> * iommu_map(dom1, ...) updates first-level/second-level pgtables for
> >>   PASID #1
> >> * iommu_map(dom2, ...) updates first-level/second-level pgtables for
> >>   PASID #2
> >>
> >> Nested translation:
> >> * iommu_map(dom1, ...) updates second-level pgtables for PASID #1
> >> * iommu_bind_table(dom1, ...) binds first-level pgtables, provided by
> >>   the guest, for PASID #1
> >> * iommu_map(dom2, ...) updates second-level pgtables for PASID #2
> >> * iommu_bind_table(dom2, ...) binds first-level pgtables for PASID #2
> >>
> >> I'm trying to understand how to implement this with SMMU and other
> >
> > This is proposed for architectures which support finer-granularity
> > second-level translation, with no impact on architectures which only
> > support Source ID or similar granularity.
>
> Just to be clear, in this paragraph you're only referring to the
> nested/second-level translation for mdev, which is specific to vt-d
> rev3? Other architectures can still do first-level translation with
> PASID, to support some use-cases of IOMMU-aware mediated devices
> (assigning mdevs to userspace drivers, for example).

yes. The aux domain concept applies only to vt-d rev3, which introduces
scalable mode. Care is taken to avoid breaking usages on existing
architectures.

One note. Assigning mdevs to user space alone doesn't imply IOMMU
aware. All existing mdev usages use software or proprietary methods to
isolate DMA. There is only one potential IOMMU-aware mdev usage we
talked about that doesn't rely on vt-d rev3 scalable mode - wrapping a
random PCI device into a single mdev instance (no sharing). In that
case the mdev inherits the RID from the parent PCI device, thus is
isolated by the IOMMU at RID granularity. Our RFC supports this usage
too. In VFIO the two usages (PASID-based and RID-based) use the same
code path, i.e. always binding the domain to the parent device of the
mdev. But within the IOMMU they go different paths. PASID-based will go
to the aux-domain path, as iommu_enable_aux_domain has been called on
that device. RID-based will follow the existing unmanaged domain path,
as if it were parent device assignment.

> >> IOMMUs. It's not a clean fit since we have a single domain to hold
> >> the second-level pgtables.
> >
> > Do you mind explaining why a domain holds multiple second-level
> > pgtables? Shouldn't that be multiple domains?
>
> I didn't mean a single domain holding multiple second-level pgtables,
> but a single domain holding a single set of second-level pgtables for
> all mdevs. But let's ignore that, mdev and second-level isn't realistic
> for arm SMMU.

yes. A single second-level doesn't allow multiple mdevs (each mdev
assigned to a different user process or VM). That is why vt-d rev3
introduces scalable mode. :-)

> >> Then again, the nested case probably doesn't matter for us - we
> >> might as well assign the parent directly, since all mdevs have the
> >> same second-level and can only be assigned to the same VM.
> >>
> >>
> >> Also, can non-VFIO device drivers use auxiliary domains to do
> >> map/unmap on PASIDs? They are asking to do that and I'm proposing
> >> the private PASID thing, but since aux domains provide a similar
> >> feature we should probably converge somehow.
> >
> > Yes, any non-VFIO device driver could use an aux domain as well. The
> > use model is:
> >
> > iommu_enable_aux_domain(dev)
> >   -- enables aux domain support for this device
> >
> > iommu_domain_alloc(dev)
> >   -- allocate an iommu domain
> >
> > iommu_attach_device(domain, dev)
> >   -- attach the domain to the device
> >
> > iommu_auxiliary_id(domain)
> >   -- retrieve the PASID used by this domain
> >
> > The device driver then does
> >
> > iommu_map(domain, ...)
> >
> > sets the PASID in a hardware register, and starts to do DMA.
>
> Sounds good, I'll drop the private PASID patch if we can figure out a
> solution to the attach/detach_dev problem discussed on patch 8/10.

Can you elaborate a bit on private PASID usage? What is the high-level
flow for it?

Again, based on the earlier explanation, aux domain is specific to an
IOMMU architecture supporting a vt-d scalable-mode-like capability,
which allows separate 2nd/1st level translations per PASID. I need a
better understanding of how private PASID is relevant here.

Thanks
Kevin
On 13/09/2018 01:19, Tian, Kevin wrote:
>>> This is proposed for architectures which support finer-granularity
>>> second-level translation, with no impact on architectures which only
>>> support Source ID or similar granularity.
>>
>> Just to be clear, in this paragraph you're only referring to the
>> nested/second-level translation for mdev, which is specific to vt-d
>> rev3? Other architectures can still do first-level translation with
>> PASID, to support some use-cases of IOMMU-aware mediated devices
>> (assigning mdevs to userspace drivers, for example).
>
> yes. The aux domain concept applies only to vt-d rev3, which introduces
> scalable mode. Care is taken to avoid breaking usages on existing
> architectures.
>
> One note. Assigning mdevs to user space alone doesn't imply IOMMU
> aware. All existing mdev usages use software or proprietary methods to
> isolate DMA. There is only one potential IOMMU-aware mdev usage we
> talked about that doesn't rely on vt-d rev3 scalable mode - wrapping a
> random PCI device into a single mdev instance (no sharing). In that
> case the mdev inherits the RID from the parent PCI device, thus is
> isolated by the IOMMU at RID granularity. Our RFC supports this usage
> too. In VFIO the two usages (PASID-based and RID-based) use the same
> code path, i.e. always binding the domain to the parent device of the
> mdev. But within the IOMMU they go different paths. PASID-based will go
> to the aux-domain path, as iommu_enable_aux_domain has been called on
> that device. RID-based will follow the existing unmanaged domain path,
> as if it were parent device assignment.

For Arm SMMU we're more interested in the PASID-granular case than the
RID-granular one. It doesn't necessarily require vt-d rev3 scalable
mode; the following example can be implemented with an SMMUv3, since it
only needs PASID-granular first-level translation:

We have a PCI function that supports PASID, and can be partitioned into
multiple isolated entities, mdevs. Each mdev has an MMIO frame, an MSI
vector and a PASID.

Different processes (userspace drivers, not QEMU) each open one mdev. A
process controlling one mdev has two ways of doing DMA:

(1) Classically, the process uses a VFIO_TYPE1v2_IOMMU container. This
creates an auxiliary domain for the mdev, with PASID #35. The process
creates DMA mappings with VFIO_IOMMU_MAP_DMA. VFIO calls iommu_map on
the auxiliary domain. The IOMMU driver populates the pgtables associated
with PASID #35.

(2) SVA. One way of doing it: the process uses a new
"VFIO_TYPE1_SVA_IOMMU" type of container. VFIO binds the process address
space to the device, gets PASID #35. Simpler, but not everyone wants to
use SVA, especially not userspace drivers which need the highest
performance.

This example only needs to modify first-level translation, and works
with SMMUv3. The kernel here could be the host, in which case
second-level translation is disabled in the SMMU, or it could be the
guest, in which case second-level mappings are created by QEMU and
first-level translation is managed by assigning PASID tables to the
guest.

So (2) would use iommu_sva_bind_device(), but (1) needs something else.
Aren't auxiliary domains suitable for (1)? Why limit auxiliary domains
to second-level or nested translation? It seems silly to use a different
API for first-level, since the flow in userspace and VFIO is the same as
in your second-level case as far as the MAP_DMA ioctl goes. The
difference is that in your case the auxiliary domain supports an
additional operation which binds first-level page tables. An auxiliary
domain that only supports first-level wouldn't support this operation,
but it can still implement iommu_map/unmap/etc.

Another note: if for some reason you did want to allow userspace to
choose between first-level and second-level, you could implement the
VFIO_TYPE1_NESTING_IOMMU container. It acts like a VFIO_TYPE1v2_IOMMU,
but also sets the DOMAIN_ATTR_NESTING on the IOMMU domain. So a DMA_MAP
ioctl on a NESTING container would populate second-level, and DMA_MAP on
a normal container populates first-level. But if you're always going to
use second-level by default, the distinction isn't necessary.

>> Sounds good, I'll drop the private PASID patch if we can figure out a
>> solution to the attach/detach_dev problem discussed on patch 8/10.
>
> Can you elaborate a bit on private PASID usage? What is the high-level
> flow for it?
>
> Again, based on the earlier explanation, aux domain is specific to an
> IOMMU architecture supporting a vt-d scalable-mode-like capability,
> which allows separate 2nd/1st level translations per PASID. I need a
> better understanding of how private PASID is relevant here.

Private PASIDs are used for doing iommu_map/iommu_unmap on PASIDs
(first-level translation):
https://www.spinics.net/lists/dri-devel/msg177003.html

As above, some people don't want SVA, some can't do it, and some may
even want a few private address spaces just for their kernel driver.
They need a way to allocate PASIDs and do iommu_map/iommu_unmap on them,
without binding to a process. I was planning to add the private PASID
patch to my SVA series, but in my opinion the feature overlaps with
auxiliary domains.

Thanks,
Jean
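[Editor's note: the userspace side of case (1) above is the existing
VFIO type1 flow - nothing new is needed in the UAPI for an IOMMU-backed
mdev. A condensed sketch follows; error checks are omitted and the
group number in the /dev/vfio path is made up (in practice it comes
from the mdev's iommu_group sysfs link).]

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

int main(void)
{
	int container = open("/dev/vfio/vfio", O_RDWR);
	int group = open("/dev/vfio/42", O_RDWR);	/* hypothetical group */
	void *buf = mmap(NULL, 0x10000, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (unsigned long)buf,
		.iova  = 0x100000,	/* DMA address the mdev will use */
		.size  = 0x10000,
	};

	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
	ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU);
	/* Ends up in iommu_map() on the mdev's auxiliary domain */
	ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
	return 0;
}

[Passing VFIO_TYPE1_NESTING_IOMMU instead of VFIO_TYPE1v2_IOMMU to
VFIO_SET_IOMMU is the one-line difference Jean-Philippe describes for
steering the mappings to second-level.]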
On Thu, Sep 13, 2018 at 04:03:01PM +0100, Jean-Philippe Brucker wrote:
> On 13/09/2018 01:19, Tian, Kevin wrote:
> >>> This is proposed for architectures which support finer-granularity
> >>> second-level translation, with no impact on architectures which
> >>> only support Source ID or similar granularity.
> >>
> >> Just to be clear, in this paragraph you're only referring to the
> >> nested/second-level translation for mdev, which is specific to vt-d
> >> rev3? Other architectures can still do first-level translation with
> >> PASID, to support some use-cases of IOMMU-aware mediated devices
> >> (assigning mdevs to userspace drivers, for example).
> >
> > yes. The aux domain concept applies only to vt-d rev3, which
> > introduces scalable mode. Care is taken to avoid breaking usages on
> > existing architectures.
> >
> > One note. Assigning mdevs to user space alone doesn't imply IOMMU
> > aware. All existing mdev usages use software or proprietary methods
> > to isolate DMA. There is only one potential IOMMU-aware mdev usage we
> > talked about that doesn't rely on vt-d rev3 scalable mode - wrapping
> > a random PCI device into a single mdev instance (no sharing). In that
> > case the mdev inherits the RID from the parent PCI device, thus is
> > isolated by the IOMMU at RID granularity. Our RFC supports this usage
> > too. In VFIO the two usages (PASID-based and RID-based) use the same
> > code path, i.e. always binding the domain to the parent device of the
> > mdev. But within the IOMMU they go different paths. PASID-based will
> > go to the aux-domain path, as iommu_enable_aux_domain has been called
> > on that device. RID-based will follow the existing unmanaged domain
> > path, as if it were parent device assignment.
>
> For Arm SMMU we're more interested in the PASID-granular case than the
> RID-granular one. It doesn't necessarily require vt-d rev3 scalable
> mode; the following example can be implemented with an SMMUv3, since it
> only needs PASID-granular first-level translation:

You are right, you can simply use the first level as IOVA for every
PASID.

The only issue is when you need to assign that to a guest: you would be
required to shadow the 1st level. If you have a 2nd level per PASID,
the first level can be managed in the guest and doesn't require
shadowing.

> We have a PCI function that supports PASID, and can be partitioned into
> multiple isolated entities, mdevs. Each mdev has an MMIO frame, an MSI
> vector and a PASID.
>
> Different processes (userspace drivers, not QEMU) each open one mdev. A
> process controlling one mdev has two ways of doing DMA:
>
> (1) Classically, the process uses a VFIO_TYPE1v2_IOMMU container. This
> creates an auxiliary domain for the mdev, with PASID #35. The process
> creates DMA mappings with VFIO_IOMMU_MAP_DMA. VFIO calls iommu_map on
> the auxiliary domain. The IOMMU driver populates the pgtables
> associated with PASID #35.
>
> (2) SVA. One way of doing it: the process uses a new
> "VFIO_TYPE1_SVA_IOMMU" type of container. VFIO binds the process
> address space to the device, gets PASID #35. Simpler, but not everyone
> wants to use SVA, especially not userspace drivers which need the
> highest performance.
>
> This example only needs to modify first-level translation, and works
> with SMMUv3. The kernel here could be the host, in which case
> second-level translation is disabled in the SMMU, or it could be the
> guest, in which case second-level mappings are created by QEMU and
> first-level translation is managed by assigning PASID tables to the
> guest.
>
> So (2) would use iommu_sva_bind_device(), but (1) needs something else.
> Aren't auxiliary domains suitable for (1)? Why limit auxiliary domains
> to second-level or nested translation? It seems silly to use a
> different API for first-level, since the flow in userspace and VFIO is
> the same as in your second-level case as far as the MAP_DMA ioctl goes.
> The difference is that in your case the auxiliary domain supports an
> additional operation which binds first-level page tables. An auxiliary
> domain that only supports first-level wouldn't support this operation,
> but it can still implement iommu_map/unmap/etc.
>
> Another note: if for some reason you did want to allow userspace to
> choose between first-level and second-level, you could implement the
> VFIO_TYPE1_NESTING_IOMMU container. It acts like a VFIO_TYPE1v2_IOMMU,
> but also sets the DOMAIN_ATTR_NESTING on the IOMMU domain. So a DMA_MAP
> ioctl on a NESTING container would populate second-level, and DMA_MAP
> on a normal container populates first-level. But if you're always going
> to use second-level by default, the distinction isn't necessary.

Where is the nesting attribute specified? In vt-d2 it was part of the
context entry, so it also meant all PASIDs are nested. In vt-d3 it's
part of the PASID context. It seems unsafe to share PASIDs between
different VMs, since any request without a PASID has only one mapping.

> >> Sounds good, I'll drop the private PASID patch if we can figure out
> >> a solution to the attach/detach_dev problem discussed on patch 8/10.
> >
> > Can you elaborate a bit on private PASID usage? What is the
> > high-level flow for it?
> >
> > Again, based on the earlier explanation, aux domain is specific to an
> > IOMMU architecture supporting a vt-d scalable-mode-like capability,
> > which allows separate 2nd/1st level translations per PASID. I need a
> > better understanding of how private PASID is relevant here.
>
> Private PASIDs are used for doing iommu_map/iommu_unmap on PASIDs
> (first-level translation):
> https://www.spinics.net/lists/dri-devel/msg177003.html
>
> As above, some people don't want SVA, some can't do it, and some may
> even want a few private address spaces just for their kernel driver.
> They need a way to allocate PASIDs and do iommu_map/iommu_unmap on
> them, without binding to a process. I was planning to add the private
> PASID patch to my SVA series, but in my opinion the feature overlaps
> with auxiliary domains.

It sounds like it maps to AUX domains.
Hi,

On 09/13/2018 01:54 AM, Jean-Philippe Brucker wrote:
> On 12/09/2018 03:42, Lu Baolu wrote:
>> Hi,
>>
>> On 09/11/2018 12:22 AM, Jean-Philippe Brucker wrote:
>>> Hi,
>>>
>>> On 30/08/2018 05:09, Lu Baolu wrote:
>>>> Below APIs are introduced in the IOMMU glue for device drivers to use
>>>> the finer granularity translation.
>>>>
>>>> * iommu_capable(IOMMU_CAP_AUX_DOMAIN)
>>>>   - Represents the ability for supporting multiple domains per device
>>>>     (a.k.a. finer granularity translations) of the IOMMU hardware.
>>>
>>> iommu_capable() cannot represent hardware capabilities, we need
>>> something else for systems with multiple IOMMUs that have different
>>> caps. How about iommu_domain_get_attr on the device's domain instead?
>>
>> A domain is not a good choice for a per-IOMMU capability query. A
>> domain might be attached to devices belonging to different IOMMUs.
>>
>> How about an API with the device structure as a parameter? A device
>> always belongs to a specific IOMMU. This API is supposed to be used by
>> the device driver.
>
> Ah right, domain attributes won't work. Your suggestion seems more
> suitable, but maybe users can simply try to enable auxiliary domains
> first, and conclude that the IOMMU doesn't support it if that returns
> an error.

Some drivers might want to check whether the hardware supports
AUX_DOMAIN during the driver probe stage, but don't want to enable
AUX_DOMAIN at that time. One reasonable use case is a driver that
checks the AUX_DOMAIN capability during probe and exposes different
sysfs nodes according to whether AUX_DOMAIN is supported or not;
AUX_DOMAIN is then enabled or disabled at run time through a sysfs
node. With this consideration, we still need an API to check the
capability.

How about

* iommu_check_aux_domain(struct device *dev)
  - Check whether the iommu driver supports multiple domains on @dev.

Best regards,
Lu Baolu
> From: Lu Baolu [mailto:baolu.lu@linux.intel.com]
> Sent: Friday, September 14, 2018 10:47 AM
>
> Hi,
>
> On 09/13/2018 01:54 AM, Jean-Philippe Brucker wrote:
> > On 12/09/2018 03:42, Lu Baolu wrote:
> >> Hi,
> >>
> >> On 09/11/2018 12:22 AM, Jean-Philippe Brucker wrote:
> >>> Hi,
> >>>
> >>> On 30/08/2018 05:09, Lu Baolu wrote:
> >>>> Below APIs are introduced in the IOMMU glue for device drivers to
> >>>> use the finer granularity translation.
> >>>>
> >>>> * iommu_capable(IOMMU_CAP_AUX_DOMAIN)
> >>>>   - Represents the ability for supporting multiple domains per
> >>>>     device (a.k.a. finer granularity translations) of the IOMMU
> >>>>     hardware.
> >>> iommu_capable() cannot represent hardware capabilities, we need
> >>> something else for systems with multiple IOMMUs that have different
> >>> caps. How about iommu_domain_get_attr on the device's domain
> >>> instead?
> >> A domain is not a good choice for a per-IOMMU capability query. A
> >> domain might be attached to devices belonging to different IOMMUs.
> >>
> >> How about an API with the device structure as a parameter? A device
> >> always belongs to a specific IOMMU. This API is supposed to be used
> >> by the device driver.
> > Ah right, domain attributes won't work. Your suggestion seems more
> > suitable, but maybe users can simply try to enable auxiliary domains
> > first, and conclude that the IOMMU doesn't support it if that returns
> > an error.
>
> Some drivers might want to check whether the hardware supports
> AUX_DOMAIN during the driver probe stage, but don't want to enable
> AUX_DOMAIN at that time. One reasonable use case is a driver that
> checks the AUX_DOMAIN capability during probe and exposes different
> sysfs nodes according to whether AUX_DOMAIN is supported or not;
> AUX_DOMAIN is then enabled or disabled at run time through a sysfs
> node. With this consideration, we still need an API to check the
> capability.
>
> How about
>
> * iommu_check_aux_domain(struct device *dev)
>   - Check whether the iommu driver supports multiple domains on @dev.

Maybe generalize it as iommu_check_attr with aux_domain as a flag, in
case other IOMMU checks are introduced in the future - hinted by Jean's
comment on the iommu_sva_device_init part.

Thanks
Kevin
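[Editor's note: neither helper exists in the kernel; the following is a
sketch of the two API shapes under discussion. The names and the
attribute enum are illustrative only.]

/* Lu's proposal: a dedicated probe for aux domain support */
bool iommu_check_aux_domain(struct device *dev);

/* Kevin's generalization: one probe, parameterized by attribute */
enum iommu_dev_attr {
	IOMMU_DEV_ATTR_AUX_DOMAIN,
	/* room for future per-device capability checks */
};

bool iommu_check_attr(struct device *dev, enum iommu_dev_attr attr);

/* Probe-time usage matching the sysfs example above */
static int sample_probe(struct device *dev)
{
	if (iommu_check_attr(dev, IOMMU_DEV_ATTR_AUX_DOMAIN)) {
		/* create the sysfs node that toggles aux domains */
	}
	return 0;
}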
On 13/09/2018 17:55, Raj, Ashok wrote:
>> For Arm SMMU we're more interested in the PASID-granular case than the
>> RID-granular one. It doesn't necessarily require vt-d rev3 scalable
>> mode; the following example can be implemented with an SMMUv3, since
>> it only needs PASID-granular first-level translation:
>
> You are right, you can simply use the first level as IOVA for every
> PASID.
>
> The only issue is when you need to assign that to a guest: you would be
> required to shadow the 1st level. If you have a 2nd level per PASID,
> the first level can be managed in the guest and doesn't require
> shadowing.

Right, for us assigning a PASID-granular mdev to a guest requires
shadowing the first level.

>> Another note: if for some reason you did want to allow userspace to
>> choose between first-level and second-level, you could implement the
>> VFIO_TYPE1_NESTING_IOMMU container. It acts like a VFIO_TYPE1v2_IOMMU,
>> but also sets the DOMAIN_ATTR_NESTING on the IOMMU domain. So a
>> DMA_MAP ioctl on a NESTING container would populate second-level, and
>> DMA_MAP on a normal container populates first-level. But if you're
>> always going to use second-level by default, the distinction isn't
>> necessary.
>
> Where is the nesting attribute specified? In vt-d2 it was part of the
> context entry, so it also meant all PASIDs are nested. In vt-d3 it's
> part of the PASID context.

I don't think the nesting attribute is described in detail anywhere.
The SMMU drivers use it to know if they should create first- or
second-level mappings. At the moment QEMU always uses
VFIO_TYPE1v2_IOMMU, but Eric Auger is proposing a patch that adds
VFIO_TYPE1_NESTING_IOMMU to QEMU:
https://www.mail-archive.com/qemu-devel@nongnu.org/msg559820.html

> It seems unsafe to share PASIDs between different VMs, since any
> request without a PASID has only one mapping.

Which case are you talking about? It might be more confusing than
helpful, but here's my understanding of what we can assign to a guest:

               | no vIOMMU | vIOMMU no PASID | vIOMMU with PASID
 --------------+-----------+-----------------+-------------------
  VF           | ok        | shadow or nest  | nest
  mdev, SMMUv3 | ok        | shadow          | shadow + PV (?)
  mdev, vt-d3  | ok        | nest            | nest + PV

The first line, assigning a PCI VF to a guest, is the "basic" vfio-pci
case. Currently in QEMU it works by shadowing first-level translation.
We still have to upstream nested translation for that case. Vt-d2
didn't support nested without PASID; vt-d3 offers RID_PASID for this.
On SMMUv3 the PASID table is assigned to the guest, whereas on vt-d3
the host manages the PASID table and individual page tables are
assigned to the guest.

Assigning an mdev (here I'm talking about the PASID-granular partition
of a VF, not the whole RID-granular VF wrapped by an mdev) could be
done by shadowing first-level translation on SMMUv3. It cannot do
nested since the VF has a single set of second-level page tables, which
cannot be used when mdevs are assigned to different VMs. Vt-d3 has one
set of second-level page tables per PASID, so it can do nested.

Since the parent device has a single PASID space, allowing the guest to
use multiple PASIDs for one mdev requires paravirtual allocation of
PASIDs (last column). Vt-d3 uses the Virtual Command Registers for
that. I assume that it is safe because the host is in charge of
programming PASIDs in the parent device, so the guest couldn't use a
PASID allocated to another mdev, but I don't know what the device's
programming model would look like. Anyway I don't think guest PASID is
tackled by this series (right?) and I don't intend to work on it for
SMMUv3 (shadowing stage-1 for vSVA seems like a bad idea...).

Does this seem accurate?

Thanks,
Jean
>> This example only needs to modify first-level translation, and works
>> with SMMUv3. The kernel here could be the host, in which case
>> second-level translation is disabled in the SMMU, or it could be the
>> guest, in which case second-level mappings are created by QEMU and
>> first-level translation is managed by assigning PASID tables to the
>> guest.
>
> The former, yes, applies to the aux domain concept. The latter doesn't
> - you have only one second level per device. A whole PASID table
> managed by the guest means you assign the whole device to the guest,
> which is not the concept of aux domain here.

Right, in the latter case the host uses a "normal" domain to assign the
whole PCI function to the guest. But the guest can still use auxiliary
domains like in my example, to sub-assign the PCI function to different
guest userspace applications.

>> So (2) would use iommu_sva_bind_device(), but (1) needs something
>> else. Aren't auxiliary domains suitable for (1)? Why limit auxiliary
>> domains to second-level or nested translation? It seems silly to use a
>> different API for first-level, since the flow in userspace and VFIO is
>> the same as in your second-level case as far as the MAP_DMA ioctl
>> goes. The difference is that in your case the auxiliary domain
>> supports an additional operation which binds first-level page tables.
>> An auxiliary domain that only supports first-level wouldn't support
>> this operation, but it can still implement iommu_map/unmap/etc.
>
> Thanks for correcting me on this. You are right that aux domain
> shouldn't impose such a limitation to 2nd-level or nested only. We
> define an aux domain as a normal domain (aux takes effect only when
> attaching to a device), thus it should support all capabilities
> possible on a normal domain.
>
> btw I'm not sure whether you looked at my comment on patch 8/10. I
> explained the rationale why aux domain doesn't interfere with the
> existing default domain usage, and on quick thinking the above example
> might not make a difference. But I need your confirmation here. :-)

Yes sorry, I didn't have time to answer, will do it now.

Thanks,
Jean
On Thu, 13 Sep 2018 16:03:01 +0100
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:

> On 13/09/2018 01:19, Tian, Kevin wrote:
> >>> This is proposed for architectures which support finer-granularity
> >>> second-level translation, with no impact on architectures which
> >>> only support Source ID or similar granularity.
> >>
> >> Just to be clear, in this paragraph you're only referring to the
> >> nested/second-level translation for mdev, which is specific to vt-d
> >> rev3? Other architectures can still do first-level translation with
> >> PASID, to support some use-cases of IOMMU-aware mediated devices
> >> (assigning mdevs to userspace drivers, for example).
> >
> > yes. The aux domain concept applies only to vt-d rev3, which
> > introduces scalable mode. Care is taken to avoid breaking usages on
> > existing architectures.
> >
> > One note. Assigning mdevs to user space alone doesn't imply IOMMU
> > aware. All existing mdev usages use software or proprietary methods
> > to isolate DMA. There is only one potential IOMMU-aware mdev usage
> > we talked about that doesn't rely on vt-d rev3 scalable mode -
> > wrapping a random PCI device into a single mdev instance (no
> > sharing). In that case the mdev inherits the RID from the parent PCI
> > device, thus is isolated by the IOMMU at RID granularity. Our RFC
> > supports this usage too. In VFIO the two usages (PASID-based and
> > RID-based) use the same code path, i.e. always binding the domain to
> > the parent device of the mdev. But within the IOMMU they go
> > different paths. PASID-based will go to the aux-domain path, as
> > iommu_enable_aux_domain has been called on that device. RID-based
> > will follow the existing unmanaged domain path, as if it were parent
> > device assignment.
>
> For Arm SMMU we're more interested in the PASID-granular case than the
> RID-granular one. It doesn't necessarily require vt-d rev3 scalable
> mode; the following example can be implemented with an SMMUv3, since
> it only needs PASID-granular first-level translation:
>
> We have a PCI function that supports PASID, and can be partitioned
> into multiple isolated entities, mdevs. Each mdev has an MMIO frame,
> an MSI vector and a PASID.
>
> Different processes (userspace drivers, not QEMU) each open one mdev.
> A process controlling one mdev has two ways of doing DMA:
>
> (1) Classically, the process uses a VFIO_TYPE1v2_IOMMU container. This
> creates an auxiliary domain for the mdev, with PASID #35. The process
> creates DMA mappings with VFIO_IOMMU_MAP_DMA. VFIO calls iommu_map on
> the auxiliary domain. The IOMMU driver populates the pgtables
> associated with PASID #35.
>
> (2) SVA. One way of doing it: the process uses a new
> "VFIO_TYPE1_SVA_IOMMU" type of container. VFIO binds the process
> address space to the device, gets PASID #35. Simpler, but not
> everyone wants to use SVA, especially not userspace drivers which
> need the highest performance.
>
> This example only needs to modify first-level translation, and works
> with SMMUv3. The kernel here could be the host, in which case
> second-level translation is disabled in the SMMU, or it could be the
> guest, in which case second-level mappings are created by QEMU and
> first-level translation is managed by assigning PASID tables to the
> guest.

There is a difference in the case of guest SVA. VT-d v3 will bind the
guest PASID and guest CR3 instead of the guest PASID table, then turn
on nesting. In the case of mdev, the second level is obtained from the
aux domain which was set up for the default PASID. Or in the case of a
PCI device, the second level is harvested from RID2PASID.

> So (2) would use iommu_sva_bind_device(),

We would need something different than that for guest bind. Just to
show the two cases:

int iommu_sva_bind_device(struct device *dev, struct mm_struct *mm,
			  int *pasid, unsigned long flags, void *drvdata)

(WIP)
int sva_bind_gpasid(struct device *dev, struct gpasid_bind_data *data)

where:

/**
 * struct gpasid_bind_data - Information about device and guest PASID
 *                           binding
 * @pasid:      Process address space ID used for the guest mm
 * @addr_width: Guest address width. Paging mode can also be derived.
 * @gcr3:       Guest CR3 value from guest mm
 */
struct gpasid_bind_data {
	__u32 pasid;
	__u64 gcr3;
	__u32 addr_width;
	__u32 flags;
#define IOMMU_SVA_GPASID_SRE	BIT(0) /* supervisor request */
};

Perhaps there is room to merge with io_mm, but the life cycle
management of guest PASID and host PASID will be different if you rely
on the mm release callback rather than an FD.

> but (1) needs something else. Aren't auxiliary domains suitable for
> (1)? Why limit auxiliary domains to second-level or nested
> translation? It seems silly to use a different API for first-level,
> since the flow in userspace and VFIO is the same as in your
> second-level case as far as the MAP_DMA ioctl goes. The difference is
> that in your case the auxiliary domain supports an additional
> operation which binds first-level page tables. An auxiliary domain
> that only supports first-level wouldn't support this operation, but
> it can still implement iommu_map/unmap/etc.

I think the intention is that when an mdev is created, we don't know
whether it will be used for SVA or IOVA. So the aux domain is here to
"hold a spot" for the default PASID such that MAP_DMA calls can work as
usual, which is second level only. Later, if SVA is used on the mdev,
there will be another PASID allocated for that purpose.

Do we need to create an aux domain for each PASID? The translation can
be looked up by the combination of parent dev and PASID.

> Another note: if for some reason you did want to allow userspace to
> choose between first-level and second-level, you could implement the
> VFIO_TYPE1_NESTING_IOMMU container. It acts like a VFIO_TYPE1v2_IOMMU,
> but also sets the DOMAIN_ATTR_NESTING on the IOMMU domain. So a
> DMA_MAP ioctl on a NESTING container would populate second-level, and
> DMA_MAP on a normal container populates first-level. But if you're
> always going to use second-level by default, the distinction isn't
> necessary.

In the case of guest SVA, the second level is always there.

> >> Sounds good, I'll drop the private PASID patch if we can figure
> >> out a solution to the attach/detach_dev problem discussed on patch
> >> 8/10.
> >
> > Can you elaborate a bit on private PASID usage? What is the
> > high-level flow for it?
> >
> > Again, based on the earlier explanation, aux domain is specific to
> > an IOMMU architecture supporting a vt-d scalable-mode-like
> > capability, which allows separate 2nd/1st level translations per
> > PASID. I need a better understanding of how private PASID is
> > relevant here.
>
> Private PASIDs are used for doing iommu_map/iommu_unmap on PASIDs
> (first-level translation):
> https://www.spinics.net/lists/dri-devel/msg177003.html
>
> As above, some people don't want SVA, some can't do it, and some may
> even want a few private address spaces just for their kernel driver.
> They need a way to allocate PASIDs and do iommu_map/iommu_unmap on
> them, without binding to a process. I was planning to add the private
> PASID patch to my SVA series, but in my opinion the feature overlaps
> with auxiliary domains.
>
> Thanks,
> Jean

[Jacob Pan]
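[Editor's note: to show where the WIP sva_bind_gpasid() above would be
called from, here is a hypothetical host-side sketch, e.g. from the
handler of a VFIO bind request. It follows Jacob's draft struct only;
nothing here is a settled interface.]

/* Hypothetical caller of the WIP API sketched above. */
static int bind_guest_pasid(struct device *parent, u32 pasid, u64 gcr3,
			    u32 addr_width)
{
	struct gpasid_bind_data data = {
		.pasid      = pasid,		/* PASID used for the guest mm */
		.gcr3       = gcr3,		/* from the guest's bind request */
		.addr_width = addr_width,	/* guest paging mode derives from this */
		.flags      = 0,		/* or IOMMU_SVA_GPASID_SRE */
	};

	/* Nesting: guest CR3 as first level, aux/RID2PASID second level */
	return sva_bind_gpasid(parent, &data);
}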
On 14/09/2018 22:04, Jacob Pan wrote:
>> This example only needs to modify first-level translation, and works
>> with SMMUv3. The kernel here could be the host, in which case
>> second-level translation is disabled in the SMMU, or it could be the
>> guest, in which case second-level mappings are created by QEMU and
>> first-level translation is managed by assigning PASID tables to the
>> guest.
>
> There is a difference in the case of guest SVA. VT-d v3 will bind the
> guest PASID and guest CR3 instead of the guest PASID table, then turn
> on nesting. In the case of mdev, the second level is obtained from the
> aux domain which was set up for the default PASID. Or in the case of a
> PCI device, the second level is harvested from RID2PASID.

Right, though I wasn't talking about the host managing guest SVA here,
but a kernel binding the address space of one of its userspace drivers
to the mdev.

>> So (2) would use iommu_sva_bind_device(),
>
> We would need something different than that for guest bind. Just to
> show the two cases:
>
> int iommu_sva_bind_device(struct device *dev, struct mm_struct *mm,
>                           int *pasid, unsigned long flags, void *drvdata)
>
> (WIP)
> int sva_bind_gpasid(struct device *dev, struct gpasid_bind_data *data)
>
> where:
>
> /**
>  * struct gpasid_bind_data - Information about device and guest PASID
>  *                           binding
>  * @pasid:      Process address space ID used for the guest mm
>  * @addr_width: Guest address width. Paging mode can also be derived.
>  * @gcr3:       Guest CR3 value from guest mm
>  */
> struct gpasid_bind_data {
>         __u32 pasid;
>         __u64 gcr3;
>         __u32 addr_width;
>         __u32 flags;
> #define IOMMU_SVA_GPASID_SRE    BIT(0) /* supervisor request */
> };
>
> Perhaps there is room to merge with io_mm, but the life cycle
> management of guest PASID and host PASID will be different if you rely
> on the mm release callback rather than an FD.

I think gpasid management should stay separate from io_mm, since in
your case VFIO mechanisms are used for life cycle management of the VM,
similarly to the former bind_pasid_table proposal. For example, closing
the container fd would unbind all guest page tables. The QEMU process'
address space lifetime seems like the wrong thing to track for gpasid.

>> but (1) needs something else. Aren't auxiliary domains suitable for
>> (1)? Why limit auxiliary domains to second-level or nested
>> translation? It seems silly to use a different API for first-level,
>> since the flow in userspace and VFIO is the same as in your
>> second-level case as far as the MAP_DMA ioctl goes. The difference is
>> that in your case the auxiliary domain supports an additional
>> operation which binds first-level page tables. An auxiliary domain
>> that only supports first-level wouldn't support this operation, but
>> it can still implement iommu_map/unmap/etc.
>
> I think the intention is that when an mdev is created, we don't know
> whether it will be used for SVA or IOVA. So the aux domain is here to
> "hold a spot" for the default PASID such that MAP_DMA calls can work
> as usual, which is second level only. Later, if SVA is used on the
> mdev, there will be another PASID allocated for that purpose.
>
> Do we need to create an aux domain for each PASID? The translation can
> be looked up by the combination of parent dev and PASID.

When allocating a new PASID for the guest, I suppose you need to clone
the second-level translation config? In which case a single aux domain
for the mdev might be easier to implement in the IOMMU driver. Entirely
up to you since we don't have this case on SMMUv3.

Thanks,
Jean
> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
> Sent: Tuesday, September 18, 2018 11:47 PM
>
> On 14/09/2018 22:04, Jacob Pan wrote:
> >> This example only needs to modify first-level translation, and
> >> works with SMMUv3. The kernel here could be the host, in which case
> >> second-level translation is disabled in the SMMU, or it could be
> >> the guest, in which case second-level mappings are created by QEMU
> >> and first-level translation is managed by assigning PASID tables to
> >> the guest.
> > There is a difference in case of guest SVA. VT-d v3 will bind guest
> > PASID and guest CR3 instead of the guest PASID table. Then turn on
> > nesting. In case of mdev, the second level is obtained from the aux
> > domain which was set up for the default PASID. Or in case of a PCI
> > device, the second level is harvested from RID2PASID.
>
> Right, though I wasn't talking about the host managing guest SVA here,
> but a kernel binding the address space of one of its userspace drivers
> to the mdev.
>
> >> So (2) would use iommu_sva_bind_device(),
> > We would need something different than that for guest bind. Just to
> > show the two cases:
> >
> > int iommu_sva_bind_device(struct device *dev, struct mm_struct *mm,
> >                           int *pasid, unsigned long flags, void *drvdata)
> >
> > (WIP)
> > int sva_bind_gpasid(struct device *dev, struct gpasid_bind_data *data)
> > where:
> > /**
> >  * struct gpasid_bind_data - Information about device and guest
> >  * PASID binding
> >  * @pasid: Process address space ID used for the guest mm
> >  * @addr_width: Guest address width. Paging mode can also be derived.
> >  * @gcr3: Guest CR3 value from guest mm
> >  */
> > struct gpasid_bind_data {
> >         __u32 pasid;
> >         __u64 gcr3;
> >         __u32 addr_width;
> >         __u32 flags;
> > #define IOMMU_SVA_GPASID_SRE    BIT(0) /* supervisor request */
> > };
> > Perhaps there is room to merge with io_mm, but the life cycle
> > management of guest PASID and host PASID will be different if you
> > rely on the mm release callback rather than the FD.

let's not call it gpasid here - "gpasid" makes sense only in the
bind_pasid_table proposal, where the PASID table and thus the PASID
space are managed by the guest. In the above context it is always a
host PASID (allocated system-wide), which could point to a host cr3
(user process) or a guest cr3 (vm case).

> I think gpasid management should stay separate from io_mm, since in
> your case VFIO mechanisms are used for life cycle management of the
> VM, similarly to the former bind_pasid_table proposal. For example,
> closing the container fd would unbind all guest page tables. The QEMU
> process' address space lifetime seems like the wrong thing to track
> for gpasid.

I sort of agree (though I haven't thought through all the flows
carefully). PASIDs are allocated per iommu domain, thus release also
happens when the domain is detached (along with container fd close).

> >> but (1) needs something else. Aren't auxiliary domains suitable for
> >> (1)? Why limit auxiliary domains to second-level or nested
> >> translation? It seems silly to use a different API for first-level,
> >> since the flow in userspace and VFIO is the same as your
> >> second-level case as far as the MAP_DMA ioctl goes. The difference
> >> is that in your case the auxiliary domain supports an additional
> >> operation which binds first-level page tables. An auxiliary domain
> >> that only supports first-level wouldn't support this operation, but
> >> it can still implement iommu_map/unmap/etc.
> >>
> > I think the intention is that when an mdev is created, we don't know
> > whether it will be used for SVA or IOVA. So the aux domain is here
> > to "hold a spot" for the default PASID such that MAP_DMA calls can
> > work as usual, which is second level only. Later, if SVA is used on
> > the mdev, another PASID will be allocated for that purpose.
> > Do we need to create an aux domain for each PASID? The translation
> > can be looked up by the combination of parent dev and pasid.
>
> When allocating a new PASID for the guest, I suppose you need to clone
> the second-level translation config? In which case a single aux domain
> for the mdev might be easier to implement in the IOMMU driver.
> Entirely up to you since we don't have this case on SMMUv3.
>
One thing to highlight in related discussions (also mentioned in the
other thread): there is no new iommu domain type called 'aux'. 'aux'
matters only to a specific device, when a domain is attached to that
device with the aux capability enabled. The same domain can be attached
to another device as a normal domain. In that case multiple PASIDs
allocated on the same mdev are tied to the same aux domain, the same as
the bare metal SVA case, i.e. any domain (normal or aux) can include
one 2nd level structure and multiple 1st level structures. Jean is
correct - all PASIDs in the same domain then share the 2nd level
translation, and there are io_mm or similar tracking structures to
associate each PASID with a 1st level translation structure.

Thanks
Kevin
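As a rough illustration of the model described above - one shared 2nd
level per domain, one 1st level per PASID - the tracking structures
might conceptually look like the sketch below. These types are
hypothetical, not actual kernel structures:

/* Hypothetical sketch of the domain model described above. */
struct example_domain {
        void *second_level_pgd;     /* shared by all PASIDs in the domain */
        struct xarray first_levels; /* PASID -> struct example_first_level */
};

struct example_first_level {
        u64  pgd_or_gcr3;  /* host mm page tables, or a guest CR3 */
        bool guest;        /* true if the 1st level is guest-managed */
};

A lookup keyed on (parent device, pasid) then yields the 1st level to
nest on the domain's shared 2nd level, matching "the translation can be
looked up by the combination of parent dev and pasid" above.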
On Wed, 19 Sep 2018 02:22:03 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
> > Sent: Tuesday, September 18, 2018 11:47 PM
> >
> [...]
> > > Perhaps there is room to merge with io_mm, but the life cycle
> > > management of guest PASID and host PASID will be different if you
> > > rely on the mm release callback rather than the FD.
>
> let's not call it gpasid here - "gpasid" makes sense only in the
> bind_pasid_table proposal, where the PASID table and thus the PASID
> space are managed by the guest. In the above context it is always a
> host PASID (allocated system-wide), which could point to a host cr3
> (user process) or a guest cr3 (vm case).
>
I agree this gpasid naming is confusing; we have a system-wide PASID
name space. It is just a way to differentiate the different binds -
perhaps just a flag indicating the PASID is used by a guest would do,
i.e.:

struct pasid_bind_data {
        __u32 pasid;
        __u64 gcr3;
        __u32 addr_width;
        __u32 flags;
#define IOMMU_SVA_GPASID_SRE    BIT(0) /* supervisor request */
#define IOMMU_SVA_PASID_GUEST   BIT(1) /* host pasid used by guest */
};

> > I think gpasid management should stay separate from io_mm, since in
> > your case VFIO mechanisms are used for life cycle management of the
> > VM, similarly to the former bind_pasid_table proposal. For example,
> > closing the container fd would unbind all guest page tables. The
> > QEMU process' address space lifetime seems like the wrong thing to
> > track for gpasid.
>
> I sort of agree (though I haven't thought through all the flows
> carefully). PASIDs are allocated per iommu domain, thus release also
> happens when the domain is detached (along with container fd close).
>
I also prefer to keep gpasid separate.
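With a single flag like the one proposed above, one bind entry point
could dispatch between host and guest binds. A hypothetical sketch -
neither helper exists in the kernel:

static int example_pasid_bind(struct device *dev,
                              struct pasid_bind_data *data)
{
        if (data->flags & IOMMU_SVA_PASID_GUEST)
                /* gcr3 points to guest page tables: set up nesting */
                return example_bind_guest(dev, data);

        /* otherwise bind a host address space as in plain SVA */
        return example_bind_host(dev, data);
}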
But I don't think we need to allocate PASIDs per iommu domain for the
guest SVA case (assuming you are talking about the host IOMMU domain).
The PASID bind call is the result of a guest PASID cache flush with a
PASID previously allocated. The host just needs to put the gcr3 into
the PASID entry, then harvest the second level from the existing
domain.

> [...]
> > When allocating a new PASID for the guest, I suppose you need to
> > clone the second-level translation config? In which case a single
> > aux domain for the mdev might be easier to implement in the IOMMU
> > driver. Entirely up to you since we don't have this case on SMMUv3.
>
> One thing to highlight in related discussions (also mentioned in the
> other thread): there is no new iommu domain type called 'aux'. 'aux'
> matters only to a specific device, when a domain is attached to that
> device with the aux capability enabled. The same domain can be
> attached to another device as a normal domain. In that case multiple
> PASIDs allocated on the same mdev are tied to the same aux domain,
> the same as the bare metal SVA case, i.e. any domain (normal or aux)
> can include one 2nd level structure and multiple 1st level
> structures. Jean is correct - all PASIDs in the same domain then
> share the 2nd level translation, and there are io_mm or similar
> tracking structures to associate each PASID with a 1st level
> translation structure.
>
I think we are all talking about the same thing :) Yes, the 2nd level
is cloned from the aux domain/default PASID for an mdev, and similarly
from the DMA_MAP domain for a pdev.

> Thanks
> Kevin

[Jacob Pan]
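Putting the last points together, the guest bind described in this
thread amounts to roughly the flow below. This is an illustrative
sketch reusing the hypothetical example_* types from earlier; the real
PASID entry layout and cache flush interfaces are hardware-specific
and elided:

/* Hypothetical sketch of the guest PASID bind flow described above. */
static int example_bind_guest_pasid(struct example_domain *dom,
                                    struct device *dev,
                                    u32 pasid, u64 gcr3)
{
        struct example_first_level *fl;
        int err;

        fl = kzalloc(sizeof(*fl), GFP_KERNEL);
        if (!fl)
                return -ENOMEM;

        /* 1st level comes from the guest; the 2nd level is harvested
         * from the existing (aux) domain rather than cloned per PASID */
        fl->pgd_or_gcr3 = gcr3;
        fl->guest = true;

        err = xa_err(xa_store(&dom->first_levels, pasid, fl, GFP_KERNEL));
        if (err) {
                kfree(fl);
                return err;
        }

        /*
         * Program the PASID entry with gcr3 as 1st level and
         * dom->second_level_pgd as 2nd level, nesting enabled, then
         * flush the PASID cache - hardware-specific, elided here.
         */
        return 0;
}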