Message ID | 20180830040922.30426-1-baolu.lu@linux.intel.com (mailing list archive) |
---|---|
Series | vfio/mdev: IOMMU aware mediated device |
> From: Lu Baolu [mailto:baolu.lu@linux.intel.com]
> Sent: Thursday, August 30, 2018 12:09 PM
>
> [...]
>
> In order to distinguish the IOMMU-capable mediated devices from those
> which still need to rely on parent devices, this patch set adds a
> domain type attribute to each mdev.
>
> enum mdev_domain_type {
>         DOMAIN_TYPE_NO_IOMMU,      /* Don't need any IOMMU support.
>                                     * All isolation and protection
>                                     * are handled by the parent
>                                     * device driver with a device
>                                     * specific mechanism.
>                                     */
>         DOMAIN_TYPE_ATTACH_PARENT, /* IOMMU can isolate and protect
>                                     * the mdev, and the isolation
>                                     * domain should be attached to
>                                     * the parent device.
>                                     */
> };

ATTACH_PARENT is not a good counterpart to NO_IOMMU. What about
DOMAIN_TYPE_NO_IOMMU/DOMAIN_TYPE_IOMMU? Whether to attach the parent
device is just internal logic.

Alternatively DOMAIN_TYPE_SOFTWARE/DOMAIN_TYPE_HARDWARE, where software
means the iommu_domain is managed by software while the other means it
is managed by hardware.

One side note to Alex - with the multiple domain extension in the IOMMU
layer, this version combines two IOMMU-capable usages in VFIO:
PASID-based (as in scalable IOV) and RID-based (as in the usage of an
mdev wrapper on any device). Both cases share the common path - just
binding the domain to the parent device of the mdev. The IOMMU layer
will handle the two cases differently later.

Thanks
Kevin
On Wed, 5 Sep 2018 03:01:39 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Lu Baolu [mailto:baolu.lu@linux.intel.com]
> > Sent: Thursday, August 30, 2018 12:09 PM
> >
> [...]
> >
> > In order to distinguish the IOMMU-capable mediated devices from those
> > which still need to rely on parent devices, this patch set adds a
> > domain type attribute to each mdev.
> >
> > enum mdev_domain_type {
> >         DOMAIN_TYPE_NO_IOMMU,      /* Don't need any IOMMU support.
> >                                     * All isolation and protection
> >                                     * are handled by the parent
> >                                     * device driver with a device
> >                                     * specific mechanism.
> >                                     */
> >         DOMAIN_TYPE_ATTACH_PARENT, /* IOMMU can isolate and protect
> >                                     * the mdev, and the isolation
> >                                     * domain should be attached to
> >                                     * the parent device.
> >                                     */
> > };
>
> ATTACH_PARENT is not a good counterpart to NO_IOMMU.

Please do not use NO_IOMMU, we already have a thing called
vfio-noiommu, enabled through CONFIG_VFIO_NOIOMMU and the module
parameter enable_unsafe_noiommu_mode. This is much, much too similar
and will generate confusion.

> What about DOMAIN_TYPE_NO_IOMMU/DOMAIN_TYPE_IOMMU? Whether
> to attach the parent device is just internal logic.
>
> Alternatively DOMAIN_TYPE_SOFTWARE/DOMAIN_TYPE_HARDWARE,
> where software means the iommu_domain is managed by software while
> the other means it is managed by hardware.

I haven't gotten deep enough into the series to see how it's used, but
my gut reaction is that we don't need an enum, we just need some sort
of pointer on the mdev that points to an iommu_parent, which indicates
the root of our IOMMU-based isolation, or is NULL, which indicates we
use vendor-defined isolation as we have now.

> One side note to Alex - with the multiple domain extension in the IOMMU
> layer, this version combines two IOMMU-capable usages in VFIO:
> PASID-based (as in scalable IOV) and RID-based (as in the usage of an
> mdev wrapper on any device). Both cases share the common path - just
> binding the domain to the parent device of the mdev. The IOMMU layer
> will handle the two cases differently later.

Good, I'm glad you've considered the regular (RID) IOMMU domain and not
just the new aux domain. Thanks,

Alex
Hi,

On 09/06/2018 03:15 AM, Alex Williamson wrote:
> On Wed, 5 Sep 2018 03:01:39 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
>
>>> From: Lu Baolu [mailto:baolu.lu@linux.intel.com]
>>> Sent: Thursday, August 30, 2018 12:09 PM
>>>
>> [...]
>>>
>>> In order to distinguish the IOMMU-capable mediated devices from those
>>> which still need to rely on parent devices, this patch set adds a
>>> domain type attribute to each mdev.
>>>
>>> enum mdev_domain_type {
>>>         DOMAIN_TYPE_NO_IOMMU,      /* Don't need any IOMMU support.
>>>                                     * All isolation and protection
>>>                                     * are handled by the parent
>>>                                     * device driver with a device
>>>                                     * specific mechanism.
>>>                                     */
>>>         DOMAIN_TYPE_ATTACH_PARENT, /* IOMMU can isolate and protect
>>>                                     * the mdev, and the isolation
>>>                                     * domain should be attached to
>>>                                     * the parent device.
>>>                                     */
>>> };
>>
>> ATTACH_PARENT is not a good counterpart to NO_IOMMU.
>
> Please do not use NO_IOMMU, we already have a thing called
> vfio-noiommu, enabled through CONFIG_VFIO_NOIOMMU and the module
> parameter enable_unsafe_noiommu_mode. This is much, much too similar
> and will generate confusion.

Sure. Will remove this confusion.

>> What about DOMAIN_TYPE_NO_IOMMU/DOMAIN_TYPE_IOMMU? Whether
>> to attach the parent device is just internal logic.
>>
>> Alternatively DOMAIN_TYPE_SOFTWARE/DOMAIN_TYPE_HARDWARE,
>> where software means the iommu_domain is managed by software while
>> the other means it is managed by hardware.
>
> I haven't gotten deep enough into the series to see how it's used, but
> my gut reaction is that we don't need an enum, we just need some sort
> of pointer on the mdev that points to an iommu_parent, which indicates
> the root of our IOMMU-based isolation, or is NULL, which indicates we
> use vendor-defined isolation as we have now.

It works as long as we can distinguish IOMMU-based isolation from
vendor-defined isolation. How about making iommu_parent point to the
device structure of the device that created the mdev? If this pointer
is NOT NULL, we will bind the domain to the device it points to;
otherwise, handle it in the vendor-defined way.

Best regards,
Lu Baolu

>> One side note to Alex - with the multiple domain extension in the IOMMU
>> layer, this version combines two IOMMU-capable usages in VFIO:
>> PASID-based (as in scalable IOV) and RID-based (as in the usage of an
>> mdev wrapper on any device). Both cases share the common path - just
>> binding the domain to the parent device of the mdev. The IOMMU layer
>> will handle the two cases differently later.
>
> Good, I'm glad you've considered the regular (RID) IOMMU domain and not
> just the new aux domain. Thanks,
>
> Alex
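[Editor's note: to make the NULL-or-pointer idea above concrete, here is
a minimal sketch of how an iommu_parent field could hang off the mdev
structure. The field and helper names are hypothetical illustrations,
not taken from the posted series.]

#include <linux/device.h>
#include <linux/iommu.h>

struct mdev_device {
	struct device dev;
	/*
	 * Root of IOMMU-based isolation for this mdev, or NULL when the
	 * parent driver provides vendor-defined isolation (the status
	 * quo). This replaces the mdev_domain_type enum from the RFC.
	 */
	struct device *iommu_parent;
	/* ... existing mdev fields ... */
};

/* Bind @domain for this mdev, or tell the caller to use the vendor path. */
static int mdev_attach_domain(struct mdev_device *mdev,
			      struct iommu_domain *domain)
{
	if (mdev->iommu_parent)
		return iommu_attach_device(domain, mdev->iommu_parent);

	return -ENODEV;	/* fall back to vendor-defined isolation */
}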
Hi,

On 30/08/2018 05:09, Lu Baolu wrote:
> Below APIs are introduced in the IOMMU glue for device drivers to use
> the finer granularity translation.
>
> * iommu_capable(IOMMU_CAP_AUX_DOMAIN)
>   - Represents the ability for supporting multiple domains per device
>     (a.k.a. finer granularity translations) of the IOMMU hardware.

iommu_capable() cannot represent hardware capabilities, we need
something else for systems with multiple IOMMUs that have different
caps. How about iommu_domain_get_attr on the device's domain instead?

> * iommu_en(dis)able_aux_domain(struct device *dev)
>   - Enable/disable the multiple domains capability for a device
>     referenced by @dev.
>
> * iommu_auxiliary_id(struct iommu_domain *domain)
>   - Return the index value used for finer-granularity DMA translation.
>     The specific device driver needs to feed the hardware with this
>     value, so that the hardware device could issue the DMA transaction
>     with this value tagged.

This could also reuse iommu_domain_get_attr.


More generally I'm having trouble understanding how auxiliary domains
will be used. So VFIO allocates PASIDs like this:

* iommu_enable_aux_domain(parent_dev)
* iommu_domain_alloc() -> dom1
* iommu_domain_alloc() -> dom2
* iommu_attach_device(dom1, parent_dev)
  -> dom1 gets PASID #1
* iommu_attach_device(dom2, parent_dev)
  -> dom2 gets PASID #2

Then I'm not sure about the next steps, when userspace does
VFIO_IOMMU_MAP_DMA or VFIO_IOMMU_BIND on an mdev's container. Is the
following use accurate?

For the single translation level:
* iommu_map(dom1, ...) updates first-level/second-level pgtables for
  PASID #1
* iommu_map(dom2, ...) updates first-level/second-level pgtables for
  PASID #2

Nested translation:
* iommu_map(dom1, ...) updates second-level pgtables for PASID #1
* iommu_bind_table(dom1, ...) binds first-level pgtables, provided by
  the guest, for PASID #1
* iommu_map(dom2, ...) updates second-level pgtables for PASID #2
* iommu_bind_table(dom2, ...) binds first-level pgtables for PASID #2


I'm trying to understand how to implement this with SMMU and other
IOMMUs. It's not a clean fit since we have a single domain to hold the
second-level pgtables.

Then again, the nested case probably doesn't matter for us - we might
as well assign the parent directly, since all mdevs have the same
second-level and can only be assigned to the same VM.

Also, can non-VFIO device drivers use auxiliary domains to do map/unmap
on PASIDs? They are asking to do that and I'm proposing the private
PASID thing, but since aux domains provide a similar feature we should
probably converge somehow.

Thanks,
Jean
Hi,

On 09/11/2018 12:22 AM, Jean-Philippe Brucker wrote:
> Hi,
>
> On 30/08/2018 05:09, Lu Baolu wrote:
>> Below APIs are introduced in the IOMMU glue for device drivers to use
>> the finer granularity translation.
>>
>> * iommu_capable(IOMMU_CAP_AUX_DOMAIN)
>>   - Represents the ability for supporting multiple domains per device
>>     (a.k.a. finer granularity translations) of the IOMMU hardware.
>
> iommu_capable() cannot represent hardware capabilities, we need
> something else for systems with multiple IOMMUs that have different
> caps. How about iommu_domain_get_attr on the device's domain instead?

A domain is not a good choice for a per-IOMMU capability query. A
domain might be attached to devices belonging to different IOMMUs.

How about an API with the device structure as a parameter? A device
always belongs to a specific IOMMU. This API is supposed to be used by
the device driver.

>> * iommu_en(dis)able_aux_domain(struct device *dev)
>>   - Enable/disable the multiple domains capability for a device
>>     referenced by @dev.
>>
>> * iommu_auxiliary_id(struct iommu_domain *domain)
>>   - Return the index value used for finer-granularity DMA translation.
>>     The specific device driver needs to feed the hardware with this
>>     value, so that the hardware device could issue the DMA transaction
>>     with this value tagged.
>
> This could also reuse iommu_domain_get_attr.
>
>
> More generally I'm having trouble understanding how auxiliary domains
> will be used. So VFIO allocates PASIDs like this:

As I wrote in the cover letter, "auxiliary domain" is just a name to
ease discussion. It actually has no special meaning (we think of a
domain as an isolation boundary which could be used by the IOMMU to
isolate the DMA transactions out of a PCI device or part of it).

So drivers like vfio should see no difference when they use an
auxiliary domain. The auxiliary domain is not visible outside of the
iommu driver.

> * iommu_enable_aux_domain(parent_dev)
> * iommu_domain_alloc() -> dom1
> * iommu_domain_alloc() -> dom2
> * iommu_attach_device(dom1, parent_dev)
>   -> dom1 gets PASID #1
> * iommu_attach_device(dom2, parent_dev)
>   -> dom2 gets PASID #2
>
> Then I'm not sure about the next steps, when userspace does
> VFIO_IOMMU_MAP_DMA or VFIO_IOMMU_BIND on an mdev's container. Is the
> following use accurate?
>
> For the single translation level:
> * iommu_map(dom1, ...) updates first-level/second-level pgtables for
>   PASID #1
> * iommu_map(dom2, ...) updates first-level/second-level pgtables for
>   PASID #2
>
> Nested translation:
> * iommu_map(dom1, ...) updates second-level pgtables for PASID #1
> * iommu_bind_table(dom1, ...) binds first-level pgtables, provided by
>   the guest, for PASID #1
> * iommu_map(dom2, ...) updates second-level pgtables for PASID #2
> * iommu_bind_table(dom2, ...) binds first-level pgtables for PASID #2
>
>
> I'm trying to understand how to implement this with SMMU and other

This is proposed for architectures which support finer-granularity
second-level translation, with no impact on architectures which only
support Source ID or similar granularity.

> IOMMUs. It's not a clean fit since we have a single domain to hold the
> second-level pgtables.

Do you mind explaining why a domain holds multiple second-level
pgtables? Shouldn't that be multiple domains?

> Then again, the nested case probably doesn't matter for us - we might
> as well assign the parent directly, since all mdevs have the same
> second-level and can only be assigned to the same VM.
>
>
> Also, can non-VFIO device drivers use auxiliary domains to do map/unmap
> on PASIDs? They are asking to do that and I'm proposing the private
> PASID thing, but since aux domains provide a similar feature we should
> probably converge somehow.

Yes, any non-VFIO device driver could use an aux domain as well. The
use model is:

iommu_enable_aux_domain(dev)
  -- enables aux domain support for this device

iommu_domain_alloc(dev)
  -- allocate an iommu domain

iommu_attach_device(domain, dev)
  -- attach the domain to the device

iommu_auxiliary_id(domain)
  -- retrieve the PASID used by this domain

The device driver then does

iommu_map(domain, ...)

sets the PASID in a hardware register, and starts to do DMA.

Best regards,
Lu Baolu
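[Editor's note: as a concrete illustration of the use model above, here
is a sketch of how a driver might string these calls together.
iommu_enable_aux_domain(), iommu_auxiliary_id() and the one-argument
iommu_domain_alloc(dev) follow the RFC as sketched in this mail, not a
mainline interface; error unwinding is abbreviated.]

#include <linux/iommu.h>

static int sample_driver_setup_aux(struct device *dev, dma_addr_t iova,
				   phys_addr_t paddr, size_t size)
{
	struct iommu_domain *domain;
	int pasid, ret;

	ret = iommu_enable_aux_domain(dev);	/* RFC API: enable aux support */
	if (ret)
		return ret;

	domain = iommu_domain_alloc(dev);	/* as sketched in the mail */
	if (!domain)
		return -ENOMEM;

	ret = iommu_attach_device(domain, dev);	/* attached as an aux domain */
	if (ret)
		goto out_free;

	pasid = iommu_auxiliary_id(domain);	/* PASID backing this domain */

	ret = iommu_map(domain, iova, paddr, size, IOMMU_READ | IOMMU_WRITE);
	if (ret)
		goto out_detach;

	/* program @pasid into a device register, then start DMA... */
	return 0;

out_detach:
	iommu_detach_device(domain, dev);
out_free:
	iommu_domain_free(domain);
	return ret;
}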
On 12/09/2018 03:42, Lu Baolu wrote:
> Hi,
>
> On 09/11/2018 12:22 AM, Jean-Philippe Brucker wrote:
>> Hi,
>>
>> On 30/08/2018 05:09, Lu Baolu wrote:
>>> Below APIs are introduced in the IOMMU glue for device drivers to use
>>> the finer granularity translation.
>>>
>>> * iommu_capable(IOMMU_CAP_AUX_DOMAIN)
>>>   - Represents the ability for supporting multiple domains per device
>>>     (a.k.a. finer granularity translations) of the IOMMU hardware.
>>
>> iommu_capable() cannot represent hardware capabilities, we need
>> something else for systems with multiple IOMMUs that have different
>> caps. How about iommu_domain_get_attr on the device's domain instead?
>
> A domain is not a good choice for a per-IOMMU capability query. A
> domain might be attached to devices belonging to different IOMMUs.
>
> How about an API with the device structure as a parameter? A device
> always belongs to a specific IOMMU. This API is supposed to be used by
> the device driver.

Ah right, domain attributes won't work. Your suggestion seems more
suitable, but maybe users can simply try to enable auxiliary domains
first, and conclude that the IOMMU doesn't support it if that returns
an error.

>>> * iommu_en(dis)able_aux_domain(struct device *dev)
>>>   - Enable/disable the multiple domains capability for a device
>>>     referenced by @dev.

It strikes me now that in the IOMMU driver,
iommu_enable/disable_aux_domain() will do the same thing as
iommu_sva_device_init/shutdown()
(https://www.spinics.net/lists/arm-kernel/msg651896.html). Some IOMMU
drivers want to enable PASID and allocate PASID tables only when
requested by users, in the sva_init_device IOMMU op (see Joerg's
comment last year, https://patchwork.kernel.org/patch/9989307/#21025429).
Maybe we could simply add a flag to iommu_sva_device_init?

>>> * iommu_auxiliary_id(struct iommu_domain *domain)
>>>   - Return the index value used for finer-granularity DMA translation.
>>>     The specific device driver needs to feed the hardware with this
>>>     value, so that the hardware device could issue the DMA transaction
>>>     with this value tagged.
>>
>> This could also reuse iommu_domain_get_attr.
>>
>>
>> More generally I'm having trouble understanding how auxiliary domains
>> will be used. So VFIO allocates PASIDs like this:
>
> As I wrote in the cover letter, "auxiliary domain" is just a name to
> ease discussion. It actually has no special meaning (we think of a
> domain as an isolation boundary which could be used by the IOMMU to
> isolate the DMA transactions out of a PCI device or part of it).
>
> So drivers like vfio should see no difference when they use an
> auxiliary domain. The auxiliary domain is not visible outside of the
> iommu driver.

For an auxiliary domain, VFIO does need to retrieve the PASID and write
it to hardware. But being able to reuse iommu_map/unmap/iova_to_phys/etc
on the auxiliary domain is nice.

>> * iommu_enable_aux_domain(parent_dev)
>> * iommu_domain_alloc() -> dom1
>> * iommu_domain_alloc() -> dom2
>> * iommu_attach_device(dom1, parent_dev)
>>   -> dom1 gets PASID #1
>> * iommu_attach_device(dom2, parent_dev)
>>   -> dom2 gets PASID #2
>>
>> Then I'm not sure about the next steps, when userspace does
>> VFIO_IOMMU_MAP_DMA or VFIO_IOMMU_BIND on an mdev's container. Is the
>> following use accurate?
>>
>> For the single translation level:
>> * iommu_map(dom1, ...) updates first-level/second-level pgtables for
>>   PASID #1
>> * iommu_map(dom2, ...) updates first-level/second-level pgtables for
>>   PASID #2
>>
>> Nested translation:
>> * iommu_map(dom1, ...) updates second-level pgtables for PASID #1
>> * iommu_bind_table(dom1, ...) binds first-level pgtables, provided by
>>   the guest, for PASID #1
>> * iommu_map(dom2, ...) updates second-level pgtables for PASID #2
>> * iommu_bind_table(dom2, ...) binds first-level pgtables for PASID #2
>>
>> I'm trying to understand how to implement this with SMMU and other
>
> This is proposed for architectures which support finer-granularity
> second-level translation, with no impact on architectures which only
> support Source ID or similar granularity.

Just to be clear, in this paragraph you're only referring to the
nested/second-level translation for mdev, which is specific to vt-d
rev3? Other architectures can still do first-level translation with
PASID, to support some use-cases of IOMMU-aware mediated devices
(assigning mdevs to userspace drivers, for example).

>> IOMMUs. It's not a clean fit since we have a single domain to hold the
>> second-level pgtables.
>
> Do you mind explaining why a domain holds multiple second-level
> pgtables? Shouldn't that be multiple domains?

I didn't mean a single domain holding multiple second-level pgtables,
but a single domain holding a single set of second-level pgtables for
all mdevs. But let's ignore that, mdev and second-level isn't realistic
for arm SMMU.

>> Then again, the nested case probably doesn't matter for us - we might
>> as well assign the parent directly, since all mdevs have the same
>> second-level and can only be assigned to the same VM.
>>
>>
>> Also, can non-VFIO device drivers use auxiliary domains to do map/unmap
>> on PASIDs? They are asking to do that and I'm proposing the private
>> PASID thing, but since aux domains provide a similar feature we should
>> probably converge somehow.
>
> Yes, any non-VFIO device driver could use an aux domain as well. The
> use model is:
>
> iommu_enable_aux_domain(dev)
>   -- enables aux domain support for this device
>
> iommu_domain_alloc(dev)
>   -- allocate an iommu domain
>
> iommu_attach_device(domain, dev)
>   -- attach the domain to the device
>
> iommu_auxiliary_id(domain)
>   -- retrieve the PASID used by this domain
>
> The device driver then does
>
> iommu_map(domain, ...)
>
> sets the PASID in a hardware register, and starts to do DMA.

Sounds good, I'll drop the private PASID patch if we can figure out a
solution to the attach/detach_dev problem discussed on patch 8/10.

Thanks,
Jean
> From: Jean-Philippe Brucker
> Sent: Thursday, September 13, 2018 1:54 AM
>
> On 12/09/2018 03:42, Lu Baolu wrote:
> > Hi,
> >
> > On 09/11/2018 12:22 AM, Jean-Philippe Brucker wrote:
> >> Hi,
> >>
> >> On 30/08/2018 05:09, Lu Baolu wrote:
> >>> Below APIs are introduced in the IOMMU glue for device drivers to
> >>> use the finer granularity translation.
> >>>
> >>> * iommu_capable(IOMMU_CAP_AUX_DOMAIN)
> >>>   - Represents the ability for supporting multiple domains per
> >>>     device (a.k.a. finer granularity translations) of the IOMMU
> >>>     hardware.
> >>
> >> iommu_capable() cannot represent hardware capabilities, we need
> >> something else for systems with multiple IOMMUs that have different
> >> caps. How about iommu_domain_get_attr on the device's domain
> >> instead?
> >
> > A domain is not a good choice for a per-IOMMU capability query. A
> > domain might be attached to devices belonging to different IOMMUs.
> >
> > How about an API with the device structure as a parameter? A device
> > always belongs to a specific IOMMU. This API is supposed to be used
> > by the device driver.
>
> Ah right, domain attributes won't work. Your suggestion seems more
> suitable, but maybe users can simply try to enable auxiliary domains
> first, and conclude that the IOMMU doesn't support it if that returns
> an error.
>
> >>> * iommu_en(dis)able_aux_domain(struct device *dev)
> >>>   - Enable/disable the multiple domains capability for a device
> >>>     referenced by @dev.
>
> It strikes me now that in the IOMMU driver,
> iommu_enable/disable_aux_domain() will do the same thing as
> iommu_sva_device_init/shutdown()
> (https://www.spinics.net/lists/arm-kernel/msg651896.html). Some IOMMU
> drivers want to enable PASID and allocate PASID tables only when
> requested by users, in the sva_init_device IOMMU op (see Joerg's
> comment last year, https://patchwork.kernel.org/patch/9989307/#21025429).
> Maybe we could simply add a flag to iommu_sva_device_init?

We could combine, but definitely 'sva' should be removed :-)

> >>> * iommu_auxiliary_id(struct iommu_domain *domain)
> >>>   - Return the index value used for finer-granularity DMA
> >>>     translation. The specific device driver needs to feed the
> >>>     hardware with this value, so that the hardware device could
> >>>     issue the DMA transaction with this value tagged.
> >>
> >> This could also reuse iommu_domain_get_attr.
> >>
> >>
> >> More generally I'm having trouble understanding how auxiliary domains
> >> will be used. So VFIO allocates PASIDs like this:
> >
> > As I wrote in the cover letter, "auxiliary domain" is just a name to
> > ease discussion. It actually has no special meaning (we think of a
> > domain as an isolation boundary which could be used by the IOMMU to
> > isolate the DMA transactions out of a PCI device or part of it).
> >
> > So drivers like vfio should see no difference when they use an
> > auxiliary domain. The auxiliary domain is not visible outside of the
> > iommu driver.
>
> For an auxiliary domain, VFIO does need to retrieve the PASID and write
> it to hardware. But being able to reuse
> iommu_map/unmap/iova_to_phys/etc on the auxiliary domain is nice.
>
> >> * iommu_enable_aux_domain(parent_dev)
> >> * iommu_domain_alloc() -> dom1
> >> * iommu_domain_alloc() -> dom2
> >> * iommu_attach_device(dom1, parent_dev)
> >>   -> dom1 gets PASID #1
> >> * iommu_attach_device(dom2, parent_dev)
> >>   -> dom2 gets PASID #2
> >>
> >> Then I'm not sure about the next steps, when userspace does
> >> VFIO_IOMMU_MAP_DMA or VFIO_IOMMU_BIND on an mdev's container. Is the
> >> following use accurate?
> >>
> >> For the single translation level:
> >> * iommu_map(dom1, ...) updates first-level/second-level pgtables for
> >>   PASID #1
> >> * iommu_map(dom2, ...) updates first-level/second-level pgtables for
> >>   PASID #2
> >>
> >> Nested translation:
> >> * iommu_map(dom1, ...) updates second-level pgtables for PASID #1
> >> * iommu_bind_table(dom1, ...) binds first-level pgtables, provided by
> >>   the guest, for PASID #1
> >> * iommu_map(dom2, ...) updates second-level pgtables for PASID #2
> >> * iommu_bind_table(dom2, ...) binds first-level pgtables for PASID #2
> >>
> >> I'm trying to understand how to implement this with SMMU and other
> >
> > This is proposed for architectures which support finer-granularity
> > second-level translation, with no impact on architectures which only
> > support Source ID or similar granularity.
>
> Just to be clear, in this paragraph you're only referring to the
> nested/second-level translation for mdev, which is specific to vt-d
> rev3? Other architectures can still do first-level translation with
> PASID, to support some use-cases of IOMMU-aware mediated devices
> (assigning mdevs to userspace drivers, for example).

yes. The aux domain concept applies only to vt-d rev3, which introduces
scalable mode. Care is taken to avoid breaking usages on existing
architectures.

One note. Assigning mdevs to user space alone doesn't imply IOMMU
aware. All existing mdev usages use software or proprietary methods to
isolate DMA. There is only one potential IOMMU-aware mdev usage we
talked about that doesn't rely on vt-d rev3 scalable mode - wrapping a
random PCI device into a single mdev instance (no sharing). In that
case the mdev inherits the RID from the parent PCI device, thus is
isolated by the IOMMU at RID granularity. Our RFC supports this usage
too. In VFIO the two usages (PASID-based and RID-based) use the same
code path, i.e. always binding the domain to the parent device of the
mdev. But within the IOMMU they go different paths. PASID-based will go
to the aux-domain path, as iommu_enable_aux_domain has been called on
that device. RID-based will follow the existing unmanaged domain path,
as if it were parent device assignment.

> >> IOMMUs. It's not a clean fit since we have a single domain to hold
> >> the second-level pgtables.
> >
> > Do you mind explaining why a domain holds multiple second-level
> > pgtables? Shouldn't that be multiple domains?
>
> I didn't mean a single domain holding multiple second-level pgtables,
> but a single domain holding a single set of second-level pgtables for
> all mdevs. But let's ignore that, mdev and second-level isn't realistic
> for arm SMMU.

yes. A single second-level doesn't allow multiple mdevs (each mdev
assigned to a different user process or VM). That is why vt-d rev3
introduces scalable mode. :-)

> >> Then again, the nested case probably doesn't matter for us - we
> >> might as well assign the parent directly, since all mdevs have the
> >> same second-level and can only be assigned to the same VM.
> >>
> >>
> >> Also, can non-VFIO device drivers use auxiliary domains to do
> >> map/unmap on PASIDs? They are asking to do that and I'm proposing
> >> the private PASID thing, but since aux domains provide a similar
> >> feature we should probably converge somehow.
> >
> > Yes, any non-VFIO device driver could use an aux domain as well. The
> > use model is:
> >
> > iommu_enable_aux_domain(dev)
> >   -- enables aux domain support for this device
> >
> > iommu_domain_alloc(dev)
> >   -- allocate an iommu domain
> >
> > iommu_attach_device(domain, dev)
> >   -- attach the domain to the device
> >
> > iommu_auxiliary_id(domain)
> >   -- retrieve the PASID used by this domain
> >
> > The device driver then does
> >
> > iommu_map(domain, ...)
> >
> > sets the PASID in a hardware register, and starts to do DMA.
>
> Sounds good, I'll drop the private PASID patch if we can figure out a
> solution to the attach/detach_dev problem discussed on patch 8/10.

Can you elaborate a bit on private PASID usage? What is the high-level
flow for it?

Again, based on the earlier explanation, aux domain is specific to an
IOMMU architecture supporting a vt-d scalable-mode-like capability,
which allows separate 2nd/1st level translations per PASID. I need a
better understanding of how private PASID is relevant here.

Thanks
Kevin
On 13/09/2018 01:19, Tian, Kevin wrote:
>>> This is proposed for architectures which support finer-granularity
>>> second-level translation, with no impact on architectures which only
>>> support Source ID or similar granularity.
>>
>> Just to be clear, in this paragraph you're only referring to the
>> nested/second-level translation for mdev, which is specific to vt-d
>> rev3? Other architectures can still do first-level translation with
>> PASID, to support some use-cases of IOMMU-aware mediated devices
>> (assigning mdevs to userspace drivers, for example).
>
> yes. The aux domain concept applies only to vt-d rev3, which introduces
> scalable mode. Care is taken to avoid breaking usages on existing
> architectures.
>
> One note. Assigning mdevs to user space alone doesn't imply IOMMU
> aware. All existing mdev usages use software or proprietary methods to
> isolate DMA. There is only one potential IOMMU-aware mdev usage we
> talked about that doesn't rely on vt-d rev3 scalable mode - wrapping a
> random PCI device into a single mdev instance (no sharing). In that
> case the mdev inherits the RID from the parent PCI device, thus is
> isolated by the IOMMU at RID granularity. Our RFC supports this usage
> too. In VFIO the two usages (PASID-based and RID-based) use the same
> code path, i.e. always binding the domain to the parent device of the
> mdev. But within the IOMMU they go different paths. PASID-based will go
> to the aux-domain path, as iommu_enable_aux_domain has been called on
> that device. RID-based will follow the existing unmanaged domain path,
> as if it were parent device assignment.

For Arm SMMU we're more interested in the PASID-granular case than the
RID-granular one. It doesn't necessarily require vt-d rev3 scalable
mode; the following example can be implemented with an SMMUv3, since it
only needs PASID-granular first-level translation:

We have a PCI function that supports PASID, and can be partitioned into
multiple isolated entities, mdevs. Each mdev has an MMIO frame, an MSI
vector and a PASID.

Different processes (userspace drivers, not QEMU) each open one mdev. A
process controlling one mdev has two ways of doing DMA:

(1) Classically, the process uses a VFIO_TYPE1v2_IOMMU container. This
creates an auxiliary domain for the mdev, with PASID #35. The process
creates DMA mappings with VFIO_IOMMU_MAP_DMA. VFIO calls iommu_map on
the auxiliary domain. The IOMMU driver populates the pgtables associated
with PASID #35.

(2) SVA. One way of doing it: the process uses a new
"VFIO_TYPE1_SVA_IOMMU" type of container. VFIO binds the process address
space to the device, gets PASID #35. Simpler, but not everyone wants to
use SVA, especially not userspace drivers which need the highest
performance.

This example only needs to modify first-level translation, and works
with SMMUv3. The kernel here could be the host, in which case
second-level translation is disabled in the SMMU, or it could be the
guest, in which case second-level mappings are created by QEMU and
first-level translation is managed by assigning PASID tables to the
guest.

So (2) would use iommu_sva_bind_device(), but (1) needs something else.
Aren't auxiliary domains suitable for (1)? Why limit auxiliary domains
to second-level or nested translation? It seems silly to use a different
API for first-level, since the flow in userspace and VFIO is the same as
in your second-level case as far as the MAP_DMA ioctl goes. The
difference is that in your case the auxiliary domain supports an
additional operation which binds first-level page tables. An auxiliary
domain that only supports first-level wouldn't support this operation,
but it can still implement iommu_map/unmap/etc.

Another note: if for some reason you did want to allow userspace to
choose between first-level and second-level, you could implement the
VFIO_TYPE1_NESTING_IOMMU container. It acts like a VFIO_TYPE1v2_IOMMU,
but also sets the DOMAIN_ATTR_NESTING on the IOMMU domain. So a DMA_MAP
ioctl on a NESTING container would populate second-level, and DMA_MAP on
a normal container populates first-level. But if you're always going to
use second-level by default, the distinction isn't necessary.

>> Sounds good, I'll drop the private PASID patch if we can figure out a
>> solution to the attach/detach_dev problem discussed on patch 8/10.
>
> Can you elaborate a bit on private PASID usage? What is the high-level
> flow for it?
>
> Again, based on the earlier explanation, aux domain is specific to an
> IOMMU architecture supporting a vt-d scalable-mode-like capability,
> which allows separate 2nd/1st level translations per PASID. I need a
> better understanding of how private PASID is relevant here.

Private PASIDs are used for doing iommu_map/iommu_unmap on PASIDs
(first-level translation):
https://www.spinics.net/lists/dri-devel/msg177003.html

As above, some people don't want SVA, some can't do it, and some may
even want a few private address spaces just for their kernel driver.
They need a way to allocate PASIDs and do iommu_map/iommu_unmap on them,
without binding to a process. I was planning to add the private PASID
patch to my SVA series, but in my opinion the feature overlaps with
auxiliary domains.

Thanks,
Jean
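[Editor's note: the userspace side of case (1) above is the existing
VFIO type1 flow - nothing new is needed in the UAPI for an IOMMU-backed
mdev. A condensed sketch follows; error checks are omitted and the
group number in the /dev/vfio path is made up (in practice it comes
from the mdev's iommu_group sysfs link).]

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

int main(void)
{
	int container = open("/dev/vfio/vfio", O_RDWR);
	int group = open("/dev/vfio/42", O_RDWR);	/* hypothetical group */
	void *buf = mmap(NULL, 0x10000, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (unsigned long)buf,
		.iova  = 0x100000,	/* DMA address the mdev will use */
		.size  = 0x10000,
	};

	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
	ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU);
	/* Ends up in iommu_map() on the mdev's auxiliary domain */
	ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
	return 0;
}

[Passing VFIO_TYPE1_NESTING_IOMMU instead of VFIO_TYPE1v2_IOMMU to
VFIO_SET_IOMMU is the one-line difference Jean-Philippe describes for
steering the mappings to second-level.]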
On Thu, Sep 13, 2018 at 04:03:01PM +0100, Jean-Philippe Brucker wrote:
> On 13/09/2018 01:19, Tian, Kevin wrote:
> >>> This is proposed for architectures which support finer-granularity
> >>> second-level translation, with no impact on architectures which
> >>> only support Source ID or similar granularity.
> >>
> >> Just to be clear, in this paragraph you're only referring to the
> >> nested/second-level translation for mdev, which is specific to vt-d
> >> rev3? Other architectures can still do first-level translation with
> >> PASID, to support some use-cases of IOMMU-aware mediated devices
> >> (assigning mdevs to userspace drivers, for example).
> >
> > yes. The aux domain concept applies only to vt-d rev3, which
> > introduces scalable mode. Care is taken to avoid breaking usages on
> > existing architectures.
> >
> > One note. Assigning mdevs to user space alone doesn't imply IOMMU
> > aware. All existing mdev usages use software or proprietary methods
> > to isolate DMA. There is only one potential IOMMU-aware mdev usage we
> > talked about that doesn't rely on vt-d rev3 scalable mode - wrapping
> > a random PCI device into a single mdev instance (no sharing). In that
> > case the mdev inherits the RID from the parent PCI device, thus is
> > isolated by the IOMMU at RID granularity. Our RFC supports this usage
> > too. In VFIO the two usages (PASID-based and RID-based) use the same
> > code path, i.e. always binding the domain to the parent device of the
> > mdev. But within the IOMMU they go different paths. PASID-based will
> > go to the aux-domain path, as iommu_enable_aux_domain has been called
> > on that device. RID-based will follow the existing unmanaged domain
> > path, as if it were parent device assignment.
>
> For Arm SMMU we're more interested in the PASID-granular case than the
> RID-granular one. It doesn't necessarily require vt-d rev3 scalable
> mode; the following example can be implemented with an SMMUv3, since it
> only needs PASID-granular first-level translation:

You are right, you can simply use the first level as IOVA for every
PASID.

The only issue is when you need to assign that to a guest: you would be
required to shadow the 1st level. If you have a 2nd level per PASID,
the first level can be managed in the guest and doesn't require
shadowing.

> We have a PCI function that supports PASID, and can be partitioned into
> multiple isolated entities, mdevs. Each mdev has an MMIO frame, an MSI
> vector and a PASID.
>
> Different processes (userspace drivers, not QEMU) each open one mdev. A
> process controlling one mdev has two ways of doing DMA:
>
> (1) Classically, the process uses a VFIO_TYPE1v2_IOMMU container. This
> creates an auxiliary domain for the mdev, with PASID #35. The process
> creates DMA mappings with VFIO_IOMMU_MAP_DMA. VFIO calls iommu_map on
> the auxiliary domain. The IOMMU driver populates the pgtables
> associated with PASID #35.
>
> (2) SVA. One way of doing it: the process uses a new
> "VFIO_TYPE1_SVA_IOMMU" type of container. VFIO binds the process
> address space to the device, gets PASID #35. Simpler, but not everyone
> wants to use SVA, especially not userspace drivers which need the
> highest performance.
>
> This example only needs to modify first-level translation, and works
> with SMMUv3. The kernel here could be the host, in which case
> second-level translation is disabled in the SMMU, or it could be the
> guest, in which case second-level mappings are created by QEMU and
> first-level translation is managed by assigning PASID tables to the
> guest.
>
> So (2) would use iommu_sva_bind_device(), but (1) needs something else.
> Aren't auxiliary domains suitable for (1)? Why limit auxiliary domains
> to second-level or nested translation? It seems silly to use a
> different API for first-level, since the flow in userspace and VFIO is
> the same as in your second-level case as far as the MAP_DMA ioctl goes.
> The difference is that in your case the auxiliary domain supports an
> additional operation which binds first-level page tables. An auxiliary
> domain that only supports first-level wouldn't support this operation,
> but it can still implement iommu_map/unmap/etc.
>
> Another note: if for some reason you did want to allow userspace to
> choose between first-level and second-level, you could implement the
> VFIO_TYPE1_NESTING_IOMMU container. It acts like a VFIO_TYPE1v2_IOMMU,
> but also sets the DOMAIN_ATTR_NESTING on the IOMMU domain. So a DMA_MAP
> ioctl on a NESTING container would populate second-level, and DMA_MAP
> on a normal container populates first-level. But if you're always going
> to use second-level by default, the distinction isn't necessary.

Where is the nesting attribute specified? In vt-d2 it was part of the
context entry, so it also meant all PASIDs are nested. In vt-d3 it's
part of the PASID context. It seems unsafe to share PASIDs between
different VMs, since any request without a PASID has only one mapping.

> >> Sounds good, I'll drop the private PASID patch if we can figure out
> >> a solution to the attach/detach_dev problem discussed on patch 8/10.
> >
> > Can you elaborate a bit on private PASID usage? What is the
> > high-level flow for it?
> >
> > Again, based on the earlier explanation, aux domain is specific to an
> > IOMMU architecture supporting a vt-d scalable-mode-like capability,
> > which allows separate 2nd/1st level translations per PASID. I need a
> > better understanding of how private PASID is relevant here.
>
> Private PASIDs are used for doing iommu_map/iommu_unmap on PASIDs
> (first-level translation):
> https://www.spinics.net/lists/dri-devel/msg177003.html
>
> As above, some people don't want SVA, some can't do it, and some may
> even want a few private address spaces just for their kernel driver.
> They need a way to allocate PASIDs and do iommu_map/iommu_unmap on
> them, without binding to a process. I was planning to add the private
> PASID patch to my SVA series, but in my opinion the feature overlaps
> with auxiliary domains.

It sounds like it maps to AUX domains.
Hi,

On 09/13/2018 01:54 AM, Jean-Philippe Brucker wrote:
> On 12/09/2018 03:42, Lu Baolu wrote:
>> Hi,
>>
>> On 09/11/2018 12:22 AM, Jean-Philippe Brucker wrote:
>>> Hi,
>>>
>>> On 30/08/2018 05:09, Lu Baolu wrote:
>>>> Below APIs are introduced in the IOMMU glue for device drivers to use
>>>> the finer granularity translation.
>>>>
>>>> * iommu_capable(IOMMU_CAP_AUX_DOMAIN)
>>>>   - Represents the ability for supporting multiple domains per device
>>>>     (a.k.a. finer granularity translations) of the IOMMU hardware.
>>>
>>> iommu_capable() cannot represent hardware capabilities, we need
>>> something else for systems with multiple IOMMUs that have different
>>> caps. How about iommu_domain_get_attr on the device's domain instead?
>>
>> A domain is not a good choice for a per-IOMMU capability query. A
>> domain might be attached to devices belonging to different IOMMUs.
>>
>> How about an API with the device structure as a parameter? A device
>> always belongs to a specific IOMMU. This API is supposed to be used by
>> the device driver.
>
> Ah right, domain attributes won't work. Your suggestion seems more
> suitable, but maybe users can simply try to enable auxiliary domains
> first, and conclude that the IOMMU doesn't support it if that returns
> an error.

Some drivers might want to check whether the hardware supports
AUX_DOMAIN during the driver probe stage, but don't want to enable
AUX_DOMAIN at that time. One reasonable use case is a driver that
checks the AUX_DOMAIN capability during probe and exposes different
sysfs nodes according to whether AUX_DOMAIN is supported or not;
AUX_DOMAIN is then enabled or disabled at run time through a sysfs
node. With this consideration, we still need an API to check the
capability.

How about

* iommu_check_aux_domain(struct device *dev)
  - Check whether the iommu driver supports multiple domains on @dev.

Best regards,
Lu Baolu
> From: Lu Baolu [mailto:baolu.lu@linux.intel.com]
> Sent: Friday, September 14, 2018 10:47 AM
>
> Hi,
>
> On 09/13/2018 01:54 AM, Jean-Philippe Brucker wrote:
> > On 12/09/2018 03:42, Lu Baolu wrote:
> >> Hi,
> >>
> >> On 09/11/2018 12:22 AM, Jean-Philippe Brucker wrote:
> >>> Hi,
> >>>
> >>> On 30/08/2018 05:09, Lu Baolu wrote:
> >>>> Below APIs are introduced in the IOMMU glue for device drivers to
> >>>> use the finer granularity translation.
> >>>>
> >>>> * iommu_capable(IOMMU_CAP_AUX_DOMAIN)
> >>>>   - Represents the ability for supporting multiple domains per
> >>>>     device (a.k.a. finer granularity translations) of the IOMMU
> >>>>     hardware.
> >>> iommu_capable() cannot represent hardware capabilities, we need
> >>> something else for systems with multiple IOMMUs that have different
> >>> caps. How about iommu_domain_get_attr on the device's domain
> >>> instead?
> >> A domain is not a good choice for a per-IOMMU capability query. A
> >> domain might be attached to devices belonging to different IOMMUs.
> >>
> >> How about an API with the device structure as a parameter? A device
> >> always belongs to a specific IOMMU. This API is supposed to be used
> >> by the device driver.
> > Ah right, domain attributes won't work. Your suggestion seems more
> > suitable, but maybe users can simply try to enable auxiliary domains
> > first, and conclude that the IOMMU doesn't support it if that returns
> > an error.
>
> Some drivers might want to check whether the hardware supports
> AUX_DOMAIN during the driver probe stage, but don't want to enable
> AUX_DOMAIN at that time. One reasonable use case is a driver that
> checks the AUX_DOMAIN capability during probe and exposes different
> sysfs nodes according to whether AUX_DOMAIN is supported or not;
> AUX_DOMAIN is then enabled or disabled at run time through a sysfs
> node. With this consideration, we still need an API to check the
> capability.
>
> How about
>
> * iommu_check_aux_domain(struct device *dev)
>   - Check whether the iommu driver supports multiple domains on @dev.

Maybe generalize it as iommu_check_attr with aux_domain as a flag, in
case other IOMMU checks are introduced in the future - hinted by Jean's
comment on the iommu_sva_device_init part.

Thanks
Kevin
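[Editor's note: neither helper exists in the kernel; the following is a
sketch of the two API shapes under discussion. The names and the
attribute enum are illustrative only.]

/* Lu's proposal: a dedicated probe for aux domain support */
bool iommu_check_aux_domain(struct device *dev);

/* Kevin's generalization: one probe, parameterized by attribute */
enum iommu_dev_attr {
	IOMMU_DEV_ATTR_AUX_DOMAIN,
	/* room for future per-device capability checks */
};

bool iommu_check_attr(struct device *dev, enum iommu_dev_attr attr);

/* Probe-time usage matching the sysfs example above */
static int sample_probe(struct device *dev)
{
	if (iommu_check_attr(dev, IOMMU_DEV_ATTR_AUX_DOMAIN)) {
		/* create the sysfs node that toggles aux domains */
	}
	return 0;
}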
On 13/09/2018 17:55, Raj, Ashok wrote:
>> For Arm SMMU we're more interested in the PASID-granular case than the
>> RID-granular one. It doesn't necessarily require vt-d rev3 scalable
>> mode; the following example can be implemented with an SMMUv3, since
>> it only needs PASID-granular first-level translation:
>
> You are right, you can simply use the first level as IOVA for every
> PASID.
>
> The only issue is when you need to assign that to a guest: you would be
> required to shadow the 1st level. If you have a 2nd level per PASID,
> the first level can be managed in the guest and doesn't require
> shadowing.

Right, for us assigning a PASID-granular mdev to a guest requires
shadowing the first level.

>> Another note: if for some reason you did want to allow userspace to
>> choose between first-level and second-level, you could implement the
>> VFIO_TYPE1_NESTING_IOMMU container. It acts like a VFIO_TYPE1v2_IOMMU,
>> but also sets the DOMAIN_ATTR_NESTING on the IOMMU domain. So a
>> DMA_MAP ioctl on a NESTING container would populate second-level, and
>> DMA_MAP on a normal container populates first-level. But if you're
>> always going to use second-level by default, the distinction isn't
>> necessary.
>
> Where is the nesting attribute specified? In vt-d2 it was part of the
> context entry, so it also meant all PASIDs are nested. In vt-d3 it's
> part of the PASID context.

I don't think the nesting attribute is described in detail anywhere.
The SMMU drivers use it to know if they should create first- or
second-level mappings. At the moment QEMU always uses
VFIO_TYPE1v2_IOMMU, but Eric Auger is proposing a patch that adds
VFIO_TYPE1_NESTING_IOMMU to QEMU:
https://www.mail-archive.com/qemu-devel@nongnu.org/msg559820.html

> It seems unsafe to share PASIDs between different VMs, since any
> request without a PASID has only one mapping.

Which case are you talking about? It might be more confusing than
helpful, but here's my understanding of what we can assign to a guest:

               | no vIOMMU | vIOMMU no PASID | vIOMMU with PASID
 --------------+-----------+-----------------+-------------------
  VF           | ok        | shadow or nest  | nest
  mdev, SMMUv3 | ok        | shadow          | shadow + PV (?)
  mdev, vt-d3  | ok        | nest            | nest + PV

The first line, assigning a PCI VF to a guest, is the "basic" vfio-pci
case. Currently in QEMU it works by shadowing first-level translation.
We still have to upstream nested translation for that case. Vt-d2
didn't support nested without PASID; vt-d3 offers RID_PASID for this.
On SMMUv3 the PASID table is assigned to the guest, whereas on vt-d3
the host manages the PASID table and individual page tables are
assigned to the guest.

Assigning an mdev (here I'm talking about the PASID-granular partition
of a VF, not the whole RID-granular VF wrapped by an mdev) could be
done by shadowing first-level translation on SMMUv3. It cannot do
nested since the VF has a single set of second-level page tables, which
cannot be used when mdevs are assigned to different VMs. Vt-d3 has one
set of second-level page tables per PASID, so it can do nested.

Since the parent device has a single PASID space, allowing the guest to
use multiple PASIDs for one mdev requires paravirtual allocation of
PASIDs (last column). Vt-d3 uses the Virtual Command Registers for
that. I assume that it is safe because the host is in charge of
programming PASIDs in the parent device, so the guest couldn't use a
PASID allocated to another mdev, but I don't know what the device's
programming model would look like. Anyway I don't think guest PASID is
tackled by this series (right?) and I don't intend to work on it for
SMMUv3 (shadowing stage-1 for vSVA seems like a bad idea...).

Does this seem accurate?

Thanks,
Jean
>> This example only needs to modify first-level translation, and works
>> with SMMUv3. The kernel here could be the host, in which case
>> second-level translation is disabled in the SMMU, or it could be the
>> guest, in which case second-level mappings are created by QEMU and
>> first-level translation is managed by assigning PASID tables to the
>> guest.
>
> The former, yes, applies to the aux domain concept. The latter doesn't
> - you have only one second level per device. A whole PASID table
> managed by the guest means you assign the whole device to the guest,
> which is not the concept of aux domain here.

Right, in the latter case the host uses a "normal" domain to assign the
whole PCI function to the guest. But the guest can still use auxiliary
domains like in my example, to sub-assign the PCI function to different
guest userspace applications.

>> So (2) would use iommu_sva_bind_device(), but (1) needs something
>> else. Aren't auxiliary domains suitable for (1)? Why limit auxiliary
>> domains to second-level or nested translation? It seems silly to use a
>> different API for first-level, since the flow in userspace and VFIO is
>> the same as in your second-level case as far as the MAP_DMA ioctl
>> goes. The difference is that in your case the auxiliary domain
>> supports an additional operation which binds first-level page tables.
>> An auxiliary domain that only supports first-level wouldn't support
>> this operation, but it can still implement iommu_map/unmap/etc.
>
> Thanks for correcting me on this. You are right that aux domain
> shouldn't impose such a limitation to 2nd-level or nested only. We
> define an aux domain as a normal domain (aux takes effect only when
> attaching to a device), thus it should support all capabilities
> possible on a normal domain.
>
> btw I'm not sure whether you looked at my comment on patch 8/10. I
> explained the rationale why aux domain doesn't interfere with the
> existing default domain usage, and on quick thinking the above example
> might not make a difference. But I need your confirmation here. :-)

Yes sorry, I didn't have time to answer, will do it now.

Thanks,
Jean
On Thu, 13 Sep 2018 16:03:01 +0100
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:

> On 13/09/2018 01:19, Tian, Kevin wrote:
> >>> This is proposed for architectures which support finer-granularity
> >>> second-level translation, with no impact on architectures which
> >>> only support Source ID or similar granularity.
> >>
> >> Just to be clear, in this paragraph you're only referring to the
> >> nested/second-level translation for mdev, which is specific to vt-d
> >> rev3? Other architectures can still do first-level translation with
> >> PASID, to support some use-cases of IOMMU-aware mediated devices
> >> (assigning mdevs to userspace drivers, for example).
> >
> > yes. The aux domain concept applies only to vt-d rev3, which
> > introduces scalable mode. Care is taken to avoid breaking usages on
> > existing architectures.
> >
> > One note. Assigning mdevs to user space alone doesn't imply IOMMU
> > aware. All existing mdev usages use software or proprietary methods
> > to isolate DMA. There is only one potential IOMMU-aware mdev usage
> > we talked about that doesn't rely on vt-d rev3 scalable mode -
> > wrapping a random PCI device into a single mdev instance (no
> > sharing). In that case the mdev inherits the RID from the parent PCI
> > device, thus is isolated by the IOMMU at RID granularity. Our RFC
> > supports this usage too. In VFIO the two usages (PASID-based and
> > RID-based) use the same code path, i.e. always binding the domain to
> > the parent device of the mdev. But within the IOMMU they go
> > different paths. PASID-based will go to the aux-domain path, as
> > iommu_enable_aux_domain has been called on that device. RID-based
> > will follow the existing unmanaged domain path, as if it were parent
> > device assignment.
>
> For Arm SMMU we're more interested in the PASID-granular case than the
> RID-granular one. It doesn't necessarily require vt-d rev3 scalable
> mode; the following example can be implemented with an SMMUv3, since
> it only needs PASID-granular first-level translation:
>
> We have a PCI function that supports PASID, and can be partitioned
> into multiple isolated entities, mdevs. Each mdev has an MMIO frame,
> an MSI vector and a PASID.
>
> Different processes (userspace drivers, not QEMU) each open one mdev.
> A process controlling one mdev has two ways of doing DMA:
>
> (1) Classically, the process uses a VFIO_TYPE1v2_IOMMU container. This
> creates an auxiliary domain for the mdev, with PASID #35. The process
> creates DMA mappings with VFIO_IOMMU_MAP_DMA. VFIO calls iommu_map on
> the auxiliary domain. The IOMMU driver populates the pgtables
> associated with PASID #35.
>
> (2) SVA. One way of doing it: the process uses a new
> "VFIO_TYPE1_SVA_IOMMU" type of container. VFIO binds the process
> address space to the device, gets PASID #35. Simpler, but not
> everyone wants to use SVA, especially not userspace drivers which
> need the highest performance.
>
> This example only needs to modify first-level translation, and works
> with SMMUv3. The kernel here could be the host, in which case
> second-level translation is disabled in the SMMU, or it could be the
> guest, in which case second-level mappings are created by QEMU and
> first-level translation is managed by assigning PASID tables to the
> guest.

There is a difference in the case of guest SVA. VT-d v3 will bind the
guest PASID and guest CR3 instead of the guest PASID table, then turn
on nesting. In the case of mdev, the second level is obtained from the
aux domain which was set up for the default PASID. Or in the case of a
PCI device, the second level is harvested from RID2PASID.

> So (2) would use iommu_sva_bind_device(),

We would need something different than that for guest bind. Just to
show the two cases:

int iommu_sva_bind_device(struct device *dev, struct mm_struct *mm,
			  int *pasid, unsigned long flags, void *drvdata)

(WIP)
int sva_bind_gpasid(struct device *dev, struct gpasid_bind_data *data)

where:

/**
 * struct gpasid_bind_data - Information about device and guest PASID
 *                           binding
 * @pasid:      Process address space ID used for the guest mm
 * @addr_width: Guest address width. Paging mode can also be derived.
 * @gcr3:       Guest CR3 value from guest mm
 */
struct gpasid_bind_data {
	__u32 pasid;
	__u64 gcr3;
	__u32 addr_width;
	__u32 flags;
#define IOMMU_SVA_GPASID_SRE	BIT(0) /* supervisor request */
};

Perhaps there is room to merge with io_mm, but the life cycle
management of guest PASID and host PASID will be different if you rely
on the mm release callback rather than an FD.

> but (1) needs something else. Aren't auxiliary domains suitable for
> (1)? Why limit auxiliary domains to second-level or nested
> translation? It seems silly to use a different API for first-level,
> since the flow in userspace and VFIO is the same as in your
> second-level case as far as the MAP_DMA ioctl goes. The difference is
> that in your case the auxiliary domain supports an additional
> operation which binds first-level page tables. An auxiliary domain
> that only supports first-level wouldn't support this operation, but
> it can still implement iommu_map/unmap/etc.

I think the intention is that when an mdev is created, we don't know
whether it will be used for SVA or IOVA. So the aux domain is here to
"hold a spot" for the default PASID such that MAP_DMA calls can work as
usual, which is second level only. Later, if SVA is used on the mdev,
there will be another PASID allocated for that purpose.

Do we need to create an aux domain for each PASID? The translation can
be looked up by the combination of parent dev and PASID.

> Another note: if for some reason you did want to allow userspace to
> choose between first-level and second-level, you could implement the
> VFIO_TYPE1_NESTING_IOMMU container. It acts like a VFIO_TYPE1v2_IOMMU,
> but also sets the DOMAIN_ATTR_NESTING on the IOMMU domain. So a
> DMA_MAP ioctl on a NESTING container would populate second-level, and
> DMA_MAP on a normal container populates first-level. But if you're
> always going to use second-level by default, the distinction isn't
> necessary.

In the case of guest SVA, the second level is always there.

> >> Sounds good, I'll drop the private PASID patch if we can figure
> >> out a solution to the attach/detach_dev problem discussed on patch
> >> 8/10.
> >
> > Can you elaborate a bit on private PASID usage? What is the
> > high-level flow for it?
> >
> > Again, based on the earlier explanation, aux domain is specific to
> > an IOMMU architecture supporting a vt-d scalable-mode-like
> > capability, which allows separate 2nd/1st level translations per
> > PASID. I need a better understanding of how private PASID is
> > relevant here.
>
> Private PASIDs are used for doing iommu_map/iommu_unmap on PASIDs
> (first-level translation):
> https://www.spinics.net/lists/dri-devel/msg177003.html
>
> As above, some people don't want SVA, some can't do it, and some may
> even want a few private address spaces just for their kernel driver.
> They need a way to allocate PASIDs and do iommu_map/iommu_unmap on
> them, without binding to a process. I was planning to add the private
> PASID patch to my SVA series, but in my opinion the feature overlaps
> with auxiliary domains.
>
> Thanks,
> Jean

[Jacob Pan]
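[Editor's note: to show where the WIP sva_bind_gpasid() above would be
called from, here is a hypothetical host-side sketch, e.g. from the
handler of a VFIO bind request. It follows Jacob's draft struct only;
nothing here is a settled interface.]

/* Hypothetical caller of the WIP API sketched above. */
static int bind_guest_pasid(struct device *parent, u32 pasid, u64 gcr3,
			    u32 addr_width)
{
	struct gpasid_bind_data data = {
		.pasid      = pasid,		/* PASID used for the guest mm */
		.gcr3       = gcr3,		/* from the guest's bind request */
		.addr_width = addr_width,	/* guest paging mode derives from this */
		.flags      = 0,		/* or IOMMU_SVA_GPASID_SRE */
	};

	/* Nesting: guest CR3 as first level, aux/RID2PASID second level */
	return sva_bind_gpasid(parent, &data);
}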
On 14/09/2018 22:04, Jacob Pan wrote:
>> This example only needs to modify first-level translation, and works
>> with SMMUv3. The kernel here could be the host, in which case
>> second-level translation is disabled in the SMMU, or it could be the
>> guest, in which case second-level mappings are created by QEMU and
>> first-level translation is managed by assigning PASID tables to the
>> guest.
>
> There is a difference in the case of guest SVA. VT-d v3 will bind the
> guest PASID and guest CR3 instead of the guest PASID table, then turn
> on nesting. In the case of mdev, the second level is obtained from the
> aux domain which was set up for the default PASID. Or in the case of a
> PCI device, the second level is harvested from RID2PASID.

Right, though I wasn't talking about the host managing guest SVA here,
but a kernel binding the address space of one of its userspace drivers
to the mdev.

>> So (2) would use iommu_sva_bind_device(),
>
> We would need something different than that for guest bind. Just to
> show the two cases:
>
> int iommu_sva_bind_device(struct device *dev, struct mm_struct *mm,
>                           int *pasid, unsigned long flags, void *drvdata)
>
> (WIP)
> int sva_bind_gpasid(struct device *dev, struct gpasid_bind_data *data)
>
> where:
>
> /**
>  * struct gpasid_bind_data - Information about device and guest PASID
>  *                           binding
>  * @pasid:      Process address space ID used for the guest mm
>  * @addr_width: Guest address width. Paging mode can also be derived.
>  * @gcr3:       Guest CR3 value from guest mm
>  */
> struct gpasid_bind_data {
>         __u32 pasid;
>         __u64 gcr3;
>         __u32 addr_width;
>         __u32 flags;
> #define IOMMU_SVA_GPASID_SRE    BIT(0) /* supervisor request */
> };
>
> Perhaps there is room to merge with io_mm, but the life cycle
> management of guest PASID and host PASID will be different if you rely
> on the mm release callback rather than an FD.

I think gpasid management should stay separate from io_mm, since in
your case VFIO mechanisms are used for life cycle management of the VM,
similarly to the former bind_pasid_table proposal. For example, closing
the container fd would unbind all guest page tables. The QEMU process'
address space lifetime seems like the wrong thing to track for gpasid.

>> but (1) needs something else. Aren't auxiliary domains suitable for
>> (1)? Why limit auxiliary domains to second-level or nested
>> translation? It seems silly to use a different API for first-level,
>> since the flow in userspace and VFIO is the same as in your
>> second-level case as far as the MAP_DMA ioctl goes. The difference is
>> that in your case the auxiliary domain supports an additional
>> operation which binds first-level page tables. An auxiliary domain
>> that only supports first-level wouldn't support this operation, but
>> it can still implement iommu_map/unmap/etc.
>
> I think the intention is that when an mdev is created, we don't know
> whether it will be used for SVA or IOVA. So the aux domain is here to
> "hold a spot" for the default PASID such that MAP_DMA calls can work
> as usual, which is second level only. Later, if SVA is used on the
> mdev, there will be another PASID allocated for that purpose.
>
> Do we need to create an aux domain for each PASID? The translation can
> be looked up by the combination of parent dev and PASID.

When allocating a new PASID for the guest, I suppose you need to clone
the second-level translation config? In which case a single aux domain
for the mdev might be easier to implement in the IOMMU driver. Entirely
up to you since we don't have this case on SMMUv3.

Thanks,
Jean
> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
> Sent: Tuesday, September 18, 2018 11:47 PM
>
> On 14/09/2018 22:04, Jacob Pan wrote:
> >> This example only needs to modify first-level translation, and
> >> works with SMMUv3. The kernel here could be the host, in which case
> >> second-level translation is disabled in the SMMU, or it could be
> >> the guest, in which case second-level mappings are created by QEMU
> >> and first-level translation is managed by assigning PASID tables to
> >> the guest.
> > There is a difference in case of guest SVA. VT-d v3 will bind guest
> > PASID and guest CR3 instead of the guest PASID table. Then turn on
> > nesting. In case of mdev, the second level is obtained from the aux
> > domain which was set up for the default PASID. Or in case of a PCI
> > device, the second level is harvested from RID2PASID.
>
> Right, though I wasn't talking about the host managing guest SVA here,
> but a kernel binding the address space of one of its userspace drivers
> to the mdev.
>
> >> So (2) would use iommu_sva_bind_device(),
> > We would need something different than that for guest bind. Just to
> > show the two cases:
> >
> > int iommu_sva_bind_device(struct device *dev, struct mm_struct *mm,
> >                           int *pasid, unsigned long flags, void *drvdata)
> >
> > (WIP)
> > int sva_bind_gpasid(struct device *dev, struct gpasid_bind_data *data)
> > where:
> > /**
> >  * struct gpasid_bind_data - Information about device and guest
> >  * PASID binding
> >  * @pasid: Process address space ID used for the guest mm
> >  * @addr_width: Guest address width. Paging mode can also be derived.
> >  * @gcr3: Guest CR3 value from guest mm
> >  */
> > struct gpasid_bind_data {
> >         __u32 pasid;
> >         __u64 gcr3;
> >         __u32 addr_width;
> >         __u32 flags;
> > #define IOMMU_SVA_GPASID_SRE    BIT(0) /* supervisor request */
> > };
> > Perhaps there is room to merge with io_mm, but the life cycle
> > management of guest PASID and host PASID will be different if you
> > rely on the mm release callback rather than the FD.

let's not call it gpasid here - "gpasid" makes sense only in the
bind_pasid_table proposal, where the PASID table and thus the PASID
space are managed by the guest. In the above context it is always a
host PASID (allocated system-wide), which could point to a host cr3
(user process) or a guest cr3 (vm case).

> I think gpasid management should stay separate from io_mm, since in
> your case VFIO mechanisms are used for life cycle management of the
> VM, similarly to the former bind_pasid_table proposal. For example,
> closing the container fd would unbind all guest page tables. The QEMU
> process' address space lifetime seems like the wrong thing to track
> for gpasid.

I sort of agree (though I haven't thought through all the flows
carefully). PASIDs are allocated per iommu domain, thus release also
happens when the domain is detached (along with container fd close).

> >> but (1) needs something else. Aren't auxiliary domains suitable for
> >> (1)? Why limit auxiliary domains to second-level or nested
> >> translation? It seems silly to use a different API for first-level,
> >> since the flow in userspace and VFIO is the same as your
> >> second-level case as far as the MAP_DMA ioctl goes. The difference
> >> is that in your case the auxiliary domain supports an additional
> >> operation which binds first-level page tables. An auxiliary domain
> >> that only supports first-level wouldn't support this operation, but
> >> it can still implement iommu_map/unmap/etc.
> >>
> > I think the intention is that when an mdev is created, we don't know
> > whether it will be used for SVA or IOVA. So the aux domain is here
> > to "hold a spot" for the default PASID such that MAP_DMA calls can
> > work as usual, which is second level only. Later, if SVA is used on
> > the mdev, another PASID will be allocated for that purpose.
> > Do we need to create an aux domain for each PASID? The translation
> > can be looked up by the combination of parent dev and pasid.
>
> When allocating a new PASID for the guest, I suppose you need to clone
> the second-level translation config? In which case a single aux domain
> for the mdev might be easier to implement in the IOMMU driver.
> Entirely up to you since we don't have this case on SMMUv3.
>
One thing to highlight in related discussions (also mentioned in the
other thread): there is no new iommu domain type called 'aux'. 'aux'
matters only to a specific device, when a domain is attached to that
device with the aux capability enabled. The same domain can be attached
to another device as a normal domain. In that case multiple PASIDs
allocated on the same mdev are tied to the same aux domain, the same as
the bare metal SVA case, i.e. any domain (normal or aux) can include
one 2nd level structure and multiple 1st level structures. Jean is
correct - all PASIDs in the same domain then share the 2nd level
translation, and there are io_mm or similar tracking structures to
associate each PASID with a 1st level translation structure.

Thanks
Kevin
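As a rough illustration of the model described above - one shared 2nd
level per domain, one 1st level per PASID - the tracking structures
might conceptually look like the sketch below. These types are
hypothetical, not actual kernel structures:

/* Hypothetical sketch of the domain model described above. */
struct example_domain {
        void *second_level_pgd;     /* shared by all PASIDs in the domain */
        struct xarray first_levels; /* PASID -> struct example_first_level */
};

struct example_first_level {
        u64  pgd_or_gcr3;  /* host mm page tables, or a guest CR3 */
        bool guest;        /* true if the 1st level is guest-managed */
};

A lookup keyed on (parent device, pasid) then yields the 1st level to
nest on the domain's shared 2nd level, matching "the translation can be
looked up by the combination of parent dev and pasid" above.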
On Wed, 19 Sep 2018 02:22:03 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
> > Sent: Tuesday, September 18, 2018 11:47 PM
> >
> [...]
> > > Perhaps there is room to merge with io_mm, but the life cycle
> > > management of guest PASID and host PASID will be different if you
> > > rely on the mm release callback rather than the FD.
>
> let's not call it gpasid here - "gpasid" makes sense only in the
> bind_pasid_table proposal, where the PASID table and thus the PASID
> space are managed by the guest. In the above context it is always a
> host PASID (allocated system-wide), which could point to a host cr3
> (user process) or a guest cr3 (vm case).
>
I agree this gpasid naming is confusing; we have a system-wide PASID
name space. It is just a way to differentiate the different binds -
perhaps just a flag indicating the PASID is used by a guest would do,
i.e.:

struct pasid_bind_data {
        __u32 pasid;
        __u64 gcr3;
        __u32 addr_width;
        __u32 flags;
#define IOMMU_SVA_GPASID_SRE    BIT(0) /* supervisor request */
#define IOMMU_SVA_PASID_GUEST   BIT(1) /* host pasid used by guest */
};

> > I think gpasid management should stay separate from io_mm, since in
> > your case VFIO mechanisms are used for life cycle management of the
> > VM, similarly to the former bind_pasid_table proposal. For example,
> > closing the container fd would unbind all guest page tables. The
> > QEMU process' address space lifetime seems like the wrong thing to
> > track for gpasid.
>
> I sort of agree (though I haven't thought through all the flows
> carefully). PASIDs are allocated per iommu domain, thus release also
> happens when the domain is detached (along with container fd close).
>
I also prefer to keep gpasid separate.
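With a single flag like the one proposed above, one bind entry point
could dispatch between host and guest binds. A hypothetical sketch -
neither helper exists in the kernel:

static int example_pasid_bind(struct device *dev,
                              struct pasid_bind_data *data)
{
        if (data->flags & IOMMU_SVA_PASID_GUEST)
                /* gcr3 points to guest page tables: set up nesting */
                return example_bind_guest(dev, data);

        /* otherwise bind a host address space as in plain SVA */
        return example_bind_host(dev, data);
}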
But I don't think we need to allocate PASIDs per iommu domain for the
guest SVA case (assuming you are talking about the host IOMMU domain).
The PASID bind call is the result of a guest PASID cache flush with a
PASID previously allocated. The host just needs to put the gcr3 into
the PASID entry, then harvest the second level from the existing
domain.

> [...]
> > When allocating a new PASID for the guest, I suppose you need to
> > clone the second-level translation config? In which case a single
> > aux domain for the mdev might be easier to implement in the IOMMU
> > driver. Entirely up to you since we don't have this case on SMMUv3.
>
> One thing to highlight in related discussions (also mentioned in the
> other thread): there is no new iommu domain type called 'aux'. 'aux'
> matters only to a specific device, when a domain is attached to that
> device with the aux capability enabled. The same domain can be
> attached to another device as a normal domain. In that case multiple
> PASIDs allocated on the same mdev are tied to the same aux domain,
> the same as the bare metal SVA case, i.e. any domain (normal or aux)
> can include one 2nd level structure and multiple 1st level
> structures. Jean is correct - all PASIDs in the same domain then
> share the 2nd level translation, and there are io_mm or similar
> tracking structures to associate each PASID with a 1st level
> translation structure.
>
I think we are all talking about the same thing :) Yes, the 2nd level
is cloned from the aux domain/default PASID for an mdev, and similarly
from the DMA_MAP domain for a pdev.

> Thanks
> Kevin

[Jacob Pan]
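Putting the last points together, the guest bind described in this
thread amounts to roughly the flow below. This is an illustrative
sketch reusing the hypothetical example_* types from earlier; the real
PASID entry layout and cache flush interfaces are hardware-specific
and elided:

/* Hypothetical sketch of the guest PASID bind flow described above. */
static int example_bind_guest_pasid(struct example_domain *dom,
                                    struct device *dev,
                                    u32 pasid, u64 gcr3)
{
        struct example_first_level *fl;
        int err;

        fl = kzalloc(sizeof(*fl), GFP_KERNEL);
        if (!fl)
                return -ENOMEM;

        /* 1st level comes from the guest; the 2nd level is harvested
         * from the existing (aux) domain rather than cloned per PASID */
        fl->pgd_or_gcr3 = gcr3;
        fl->guest = true;

        err = xa_err(xa_store(&dom->first_levels, pasid, fl, GFP_KERNEL));
        if (err) {
                kfree(fl);
                return err;
        }

        /*
         * Program the PASID entry with gcr3 as 1st level and
         * dom->second_level_pgd as 2nd level, nesting enabled, then
         * flush the PASID cache - hardware-specific, elided here.
         */
        return 0;
}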