
[v7,13/13] Documentation: userspace-api: iommufd: Update vIOMMU

Message ID 7e4302064e0d02137c1b1e139342affc0485ed3f.1730836219.git.nicolinc@nvidia.com (mailing list archive)
State New
Series iommufd: Add vIOMMU infrastructure (Part-1)

Commit Message

Nicolin Chen Nov. 5, 2024, 8:04 p.m. UTC
With the introduction of the new object and its infrastructure, update the
doc to reflect that and add a new graph.

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 Documentation/userspace-api/iommufd.rst | 69 ++++++++++++++++++++++++-
 1 file changed, 68 insertions(+), 1 deletion(-)

Comments

Bagas Sanjaya Nov. 7, 2024, 12:56 a.m. UTC | #1
On Tue, Nov 05, 2024 at 12:04:29PM -0800, Nicolin Chen wrote:
> With the introduction of the new object and its infrastructure, update the
> doc to reflect that and add a new graph.
> 
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>  Documentation/userspace-api/iommufd.rst | 69 ++++++++++++++++++++++++-
>  1 file changed, 68 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/userspace-api/iommufd.rst b/Documentation/userspace-api/iommufd.rst
> index 2deba93bf159..a8b7766c2849 100644
> --- a/Documentation/userspace-api/iommufd.rst
> +++ b/Documentation/userspace-api/iommufd.rst
> @@ -63,6 +63,37 @@ Following IOMMUFD objects are exposed to userspace:
>    space usually has mappings from guest-level I/O virtual addresses to guest-
>    level physical addresses.
>  
> +- IOMMUFD_OBJ_VIOMMU, representing a slice of the physical IOMMU instance,
> +  passed to or shared with a VM. It may be some HW-accelerated virtualization
> +  features and some SW resources used by the VM. For examples:
> +  * Security namespace for guest owned ID, e.g. guest-controlled cache tags
> +  * Non-device-affiliated event reporting, e.g. invalidation queue errors
> +  * Access to a sharable nesting parent pagetable across physical IOMMUs
> +  * Virtualization of various platforms IDs, e.g. RIDs and others
> +  * Delivery of paravirtualized invalidation
> +  * Direct assigned invalidation queues
> +  * Direct assigned interrupts

The bullet list above is output by the htmldocs build as a long-running
paragraph instead.

> +  Such a vIOMMU object generally has the access to a nesting parent pagetable
> +  to support some HW-accelerated virtualization features. So, a vIOMMU object
> +  must be created given a nesting parent HWPT_PAGING object, and then it would
> +  encapsulate that HWPT_PAGING object. Therefore, a vIOMMU object can be used
> +  to allocate an HWPT_NESTED object in place of the encapsulated HWPT_PAGING.

Thanks.
Nicolin Chen Nov. 7, 2024, 1:35 a.m. UTC | #2
On Thu, Nov 07, 2024 at 07:56:31AM +0700, Bagas Sanjaya wrote:
> On Tue, Nov 05, 2024 at 12:04:29PM -0800, Nicolin Chen wrote:
> > [...]
> 
> The bullet list above is outputted in htmldocs build as long-running paragraph
> instead.

Oh, I overlooked this list.

Would the following change be okay?

-------------------------------------------------
diff --git a/Documentation/userspace-api/iommufd.rst b/Documentation/userspace-api/iommufd.rst
index 0ef22b3ca30b..011cbc71b6f5 100644
--- a/Documentation/userspace-api/iommufd.rst
+++ b/Documentation/userspace-api/iommufd.rst
@@ -68,2 +68,3 @@ Following IOMMUFD objects are exposed to userspace:
   features and some SW resources used by the VM. For examples:
+
   * Security namespace for guest owned ID, e.g. guest-controlled cache tags
@@ -75,2 +76,3 @@ Following IOMMUFD objects are exposed to userspace:
   * Direct assigned interrupts
+
   Such a vIOMMU object generally has the access to a nesting parent pagetable
-------------------------------------------------

With this change, the output HTML shows a proper bullet list.

Thanks!
Nicolin
Bagas Sanjaya Nov. 7, 2024, 3:20 a.m. UTC | #3
On Wed, Nov 06, 2024 at 05:35:45PM -0800, Nicolin Chen wrote:
> On Thu, Nov 07, 2024 at 07:56:31AM +0700, Bagas Sanjaya wrote:
> > [...]
> >
> > The bullet list above is outputted in htmldocs build as long-running paragraph
> > instead.
> 
> Oh, I overlooked this list.
> 
> Would the following change be okay?
> 
> -------------------------------------------------
> diff --git a/Documentation/userspace-api/iommufd.rst b/Documentation/userspace-api/iommufd.rst
> index 0ef22b3ca30b..011cbc71b6f5 100644
> --- a/Documentation/userspace-api/iommufd.rst
> +++ b/Documentation/userspace-api/iommufd.rst
> @@ -68,2 +68,3 @@ Following IOMMUFD objects are exposed to userspace:
>    features and some SW resources used by the VM. For examples:
> +
>    * Security namespace for guest owned ID, e.g. guest-controlled cache tags
> @@ -75,2 +76,3 @@ Following IOMMUFD objects are exposed to userspace:
>    * Direct assigned interrupts
> +
>    Such a vIOMMU object generally has the access to a nesting parent pagetable
> -------------------------------------------------
> 
> The outputted html is showing a list with this.

Yup, that's right!
Nicolin Chen Nov. 7, 2024, 4:04 a.m. UTC | #4
On Thu, Nov 07, 2024 at 10:20:49AM +0700, Bagas Sanjaya wrote:
> On Wed, Nov 06, 2024 at 05:35:45PM -0800, Nicolin Chen wrote:
> > [...]
> > 
> > The outputted html is showing a list with this.
> 
> Yup, that's right!

Thank you! Would it be possible for you to give a Reviewed-by,
on the condition that this diff gets squashed?

Likely, Jason will help squash it when taking this v7 via his
iommufd tree. So, we might not respin a v8.

Nicolin
Bagas Sanjaya Nov. 8, 2024, 9:01 a.m. UTC | #5
On Wed, Nov 06, 2024 at 08:04:09PM -0800, Nicolin Chen wrote:
> On Thu, Nov 07, 2024 at 10:20:49AM +0700, Bagas Sanjaya wrote:
> > [...]
> > 
> > Yup, that's right!
> 
> Thank you! Would it be possible for you to give a Reviewed-by,
> given the condition of squashing this diff?

Alright, here it goes...

Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Yi Liu Nov. 12, 2024, 1:15 p.m. UTC | #6
On 2024/11/6 04:04, Nicolin Chen wrote:
> [...]
>
> +5. IOMMUFD_OBJ_VIOMMU can be only manually created via the IOMMU_VIOMMU_ALLOC
> +   uAPI, provided a dev_id (for the device's physical IOMMU to back the vIOMMU)
> +   and an hwpt_id (to associate the vIOMMU to a nesting parent HWPT_PAGING). The
> +   iommufd core will link the vIOMMU object to the struct iommu_device that the
> +   struct device is behind. 

It looks reasonable to share the viommu_obj between devices behind
the same physical IOMMU. This design seems to have no enforcement for
it, so it's all up to userspace from what I can tell. :)
Nicolin Chen Nov. 14, 2024, 12:18 a.m. UTC | #7
On Tue, Nov 12, 2024 at 09:15:02PM +0800, Yi Liu wrote:
> On 2024/11/6 04:04, Nicolin Chen wrote:
> > +5. IOMMUFD_OBJ_VIOMMU can be only manually created via the IOMMU_VIOMMU_ALLOC
> > +   uAPI, provided a dev_id (for the device's physical IOMMU to back the vIOMMU)
> > +   and an hwpt_id (to associate the vIOMMU to a nesting parent HWPT_PAGING). The
> > +   iommufd core will link the vIOMMU object to the struct iommu_device that the
> > +   struct device is behind.
> 
> It looks to be reasonable to share the viommu_obj between devices behind
> the same physical IOMMU. This design seems no enforcement for it. So it's
> all up to userspace from what I got. :)

It is enforced at the vDEVICE allocation:
	if (viommu->iommu_dev != __iommu_get_iommu_dev(idev->dev))
		return -EINVAL;

Yet, assuming you are referring to creating two vIOMMUs per VM for
two devices behind the same IOMMU (?), there is no enforcement.

The suggested way for VMM, just like two devices sharing the same
s2 parent hwpt, is to share the vIOMMU object.
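
A sketch of that sharing, assuming the vDEVICE allocation uAPI from
the follow-up vDEVICE series (dev_id1/dev_id2 are hypothetical IDs of
two devices behind the same pIOMMU, bound via VFIO beforehand, and
nest_parent_hwpt_id is an assumed pre-allocated nesting parent):

-------------------------------------------------
#include <sys/ioctl.h>
#include <linux/iommufd.h>

static int share_viommu(int iommufd, __u32 dev_id1, __u32 dev_id2,
			__u32 nest_parent_hwpt_id)
{
	/* One vIOMMU object backed by the pIOMMU behind dev_id1 */
	struct iommu_viommu_alloc viommu = {
		.size = sizeof(viommu),
		.type = IOMMU_VIOMMU_TYPE_ARM_SMMUV3, /* a driver-supported type */
		.dev_id = dev_id1,
		.hwpt_id = nest_parent_hwpt_id,
	};
	if (ioctl(iommufd, IOMMU_VIOMMU_ALLOC, &viommu))
		return -1;

	/* Both devices share it via their own vDEVICEs; the second call
	 * succeeds only because dev_id2 is behind the same pIOMMU.
	 */
	struct iommu_vdevice_alloc vdev1 = {
		.size = sizeof(vdev1),
		.viommu_id = viommu.out_viommu_id,
		.dev_id = dev_id1,
		.virt_id = 0x10, /* e.g. the vRID of dev_id1 in the VM */
	};
	if (ioctl(iommufd, IOMMU_VDEVICE_ALLOC, &vdev1))
		return -1;

	struct iommu_vdevice_alloc vdev2 = {
		.size = sizeof(vdev2),
		.viommu_id = viommu.out_viommu_id,
		.dev_id = dev_id2,
		.virt_id = 0x11, /* e.g. the vRID of dev_id2 in the VM */
	};
	return ioctl(iommufd, IOMMU_VDEVICE_ALLOC, &vdev2);
}
-------------------------------------------------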

Thanks
Nic
Yi Liu Nov. 14, 2024, 3:13 a.m. UTC | #8
On 2024/11/14 08:18, Nicolin Chen wrote:
> On Tue, Nov 12, 2024 at 09:15:02PM +0800, Yi Liu wrote:
>> On 2024/11/6 04:04, Nicolin Chen wrote:
>>> +5. IOMMUFD_OBJ_VIOMMU can be only manually created via the IOMMU_VIOMMU_ALLOC
>>> +   uAPI, provided a dev_id (for the device's physical IOMMU to back the vIOMMU)
>>> +   and an hwpt_id (to associate the vIOMMU to a nesting parent HWPT_PAGING). The
>>> +   iommufd core will link the vIOMMU object to the struct iommu_device that the
>>> +   struct device is behind.
>>
>> It looks to be reasonable to share the viommu_obj between devices behind
>> the same physical IOMMU. This design seems no enforcement for it. So it's
>> all up to userspace from what I got. :)
> 
> It enforces at the vDEVICE allocation:
> 	if (viommu->iommu_dev != __iommu_get_iommu_dev(idev->dev)) {
> 		return -EINVAL;

This matches the device and the viommu.

> 
> Yet, assuming you are referring to creating two vIOMMUs per VM for
> two devices behind the same IOMMU (?), there is no enforcement..

Right, but not limited to two vIOMMUs, as the viommu_obj is not instanced
per vIOMMU.

> The suggested way for VMM, just like two devices sharing the same
> s2 parent hwpt, is to share the vIOMMU object.

So the user would try to create vDEVICEs with a given viommu_obj until
failure, then allocate another viommu_obj for the failed device.
Is that right? Sounds reasonable.
Nicolin Chen Nov. 14, 2024, 3:18 a.m. UTC | #9
On Thu, Nov 14, 2024 at 11:13:00AM +0800, Yi Liu wrote:
> On 2024/11/14 08:18, Nicolin Chen wrote:
> > On Tue, Nov 12, 2024 at 09:15:02PM +0800, Yi Liu wrote:
> > > On 2024/11/6 04:04, Nicolin Chen wrote:
> > > > +5. IOMMUFD_OBJ_VIOMMU can be only manually created via the IOMMU_VIOMMU_ALLOC
> > > > +   uAPI, provided a dev_id (for the device's physical IOMMU to back the vIOMMU)
> > > > +   and an hwpt_id (to associate the vIOMMU to a nesting parent HWPT_PAGING). The
> > > > +   iommufd core will link the vIOMMU object to the struct iommu_device that the
> > > > +   struct device is behind.
> > > 
> > > It looks to be reasonable to share the viommu_obj between devices behind
> > > the same physical IOMMU. This design seems no enforcement for it. So it's
> > > all up to userspace from what I got. :)
> > 
> > It enforces at the vDEVICE allocation:
> > 	if (viommu->iommu_dev != __iommu_get_iommu_dev(idev->dev)) {
> > 		return -EINVAL;
> 
> this matches the device and the viommu.

And the viommu has a hard relationship with the physical instance.

> > 
> > Yet, assuming you are referring to creating two vIOMMUs per VM for
> > two devices behind the same IOMMU (?), there is no enforcement..
> 
> right, but not limited to two vIOMMUs as the viommu_obj is not instanced
> per vIOMMUs.

It doesn't make a lot of sense to create two. But it doesn't seem
to be something that we can limit here. VMM can create the first
vIOMMU using the first device and create the second vIOMMU using
the second device, though both devices are behind the same IOMMU
and passed to the same VM. The thing is that the kernel doesn't
know if they are passed to the same VM.

> > The suggested way for VMM, just like two devices sharing the same
> > s2 parent hwpt, is to share the vIOMMU object.
> 
> so the user would try to create vDevices with a given viommu_obj until
> failure, then it would allocate another viommu_obj for the failed device.
> is it? sounds reasonable.

Yes. It is the same as previously dealing with a nesting parent:
test and allocate if it fails. The virtual IOMMU driver in the VMM
can keep a list of the vIOMMU objects for each device to test.
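
A sketch of such a try-and-fail helper on the VMM side (hypothetical
helper name and list; assuming the vDEVICE allocation uAPI from the
follow-up vDEVICE series):

-------------------------------------------------
#include <sys/ioctl.h>
#include <linux/iommufd.h>

/* Find a compatible vIOMMU for @dev_id among @viommu_ids, or allocate
 * a new one backed by the device's pIOMMU. Returns a vdevice_id or 0.
 */
static __u32 vdevice_alloc_auto(int iommufd, __u32 dev_id, __u64 virt_id,
				__u32 *viommu_ids, int *num_viommus,
				__u32 nest_parent_hwpt_id)
{
	struct iommu_vdevice_alloc vdev = {
		.size = sizeof(vdev),
		.dev_id = dev_id,
		.virt_id = virt_id,
	};
	int i;

	/* Test each existing vIOMMU; a pIOMMU mismatch fails with -EINVAL */
	for (i = 0; i < *num_viommus; i++) {
		vdev.viommu_id = viommu_ids[i];
		if (!ioctl(iommufd, IOMMU_VDEVICE_ALLOC, &vdev))
			return vdev.out_vdevice_id;
	}

	/* None matched: allocate a vIOMMU backed by this device's pIOMMU */
	struct iommu_viommu_alloc viommu = {
		.size = sizeof(viommu),
		.type = IOMMU_VIOMMU_TYPE_ARM_SMMUV3, /* a driver-supported type */
		.dev_id = dev_id,
		.hwpt_id = nest_parent_hwpt_id,
	};
	if (ioctl(iommufd, IOMMU_VIOMMU_ALLOC, &viommu))
		return 0;
	viommu_ids[(*num_viommus)++] = viommu.out_viommu_id;

	vdev.viommu_id = viommu.out_viommu_id;
	if (ioctl(iommufd, IOMMU_VDEVICE_ALLOC, &vdev))
		return 0;
	return vdev.out_vdevice_id;
}
-------------------------------------------------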

Thanks
Nic
Jason Gunthorpe Nov. 14, 2024, 4:20 p.m. UTC | #10
On Wed, Nov 13, 2024 at 07:18:42PM -0800, Nicolin Chen wrote:
> > so the user would try to create vDevices with a given viommu_obj until
> > failure, then it would allocate another viommu_obj for the failed device.
> > is it? sounds reasonable.
> 
> Yes. It is the same as previously dealing with a nesting parent:
> test and allocate if fails. The virtual IOMMU driver in VMM can
> keep a list of the vIOMMU objects for each device to test.

The viommu object should be tied to the VMM's vIOMMU vHW object that
it is paravirtualizing toward the VM.

So we shouldn't be creating viommu objects on demand; they should be
created when the vIOMMU is created, and presumably the qemu
command line will describe how to link vPCI/VFIO functions to vIOMMU
instances. If the kernel won't allow the user's configuration then it
should fail, IMHO.

Some try-and-fail might be interesting to auto-provision vIOMMUs and
provision vPCI functions. Though I suspect we will be providing
information in other ioctls so something like libvirt can construct
the correct configuration directly.

Jason
Nicolin Chen Nov. 15, 2024, 10:07 p.m. UTC | #11
On Thu, Nov 14, 2024 at 12:20:10PM -0400, Jason Gunthorpe wrote:
> On Wed, Nov 13, 2024 at 07:18:42PM -0800, Nicolin Chen wrote:
> > > so the user would try to create vDevices with a given viommu_obj until
> > > failure, then it would allocate another viommu_obj for the failed device.
> > > is it? sounds reasonable.
> > 
> > Yes. It is the same as previously dealing with a nesting parent:
> > test and allocate if fails. The virtual IOMMU driver in VMM can
> > keep a list of the vIOMMU objects for each device to test.
> 
> The viommu object should be tied to the VMM's vIOMMU vHW object that
> it is paravirtualizing toward the VM.
> 
> So we shouldn't be creating viommu objects on demand, it should be
> created when the vIOMMU is created, and the presumably the qemu
> command line will describe how to link vPCI/VFIO functions to vIOMMU
> instances. If they kernel won't allow the user's configuration then it
> should fail, IMHO.

Intel's virtual IOMMU in QEMU has one instance but could create
two vIOMMU objects for devices behind two different pIOMMUs. So,
in this case, it does the on-demand (or try-and-fail) approach?

One corner case that Yi reminded me of was a VMM having two
virtual IOMMUs for two devices that are behind the same pIOMMU;
these two virtual IOMMUs don't necessarily share the same
vIOMMU object, i.e. the VMM is allowed to allocate two vIOMMU objs?

> Some try-and-fail might be interesting to auto-provision vIOMMU's and
> provision vPCI functions. Though I suspect we will be providing
> information in other ioctls so something like libvirt can construct
> the correct configuration directly.

By "auto-provision", you mean libvirt assigning devices to the
correct virtual IOMMUs corresponding to the physical instances?
If so, we can just match the "iommu" sysfs node of devices with
the iommu node(s) under /sys/class/iommu/, right?
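
E.g. a minimal sketch of that matching (the "iommu" link follows the
sysfs layout referenced above; device names are illustrative):

-------------------------------------------------
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Return true if two PCI devices sit behind the same physical IOMMU,
 * by comparing the targets of their "iommu" sysfs links, which point
 * at an iommu device also listed under /sys/class/iommu/.
 */
static bool same_piommu(const char *bdf_a, const char *bdf_b)
{
	char path[256], link_a[256] = {}, link_b[256] = {};
	const char *a, *b;

	snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/iommu", bdf_a);
	if (readlink(path, link_a, sizeof(link_a) - 1) < 0)
		return false;

	snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/iommu", bdf_b);
	if (readlink(path, link_b, sizeof(link_b) - 1) < 0)
		return false;

	/* Compare the iommu device names, e.g. "smmu3.0x0000000005000000" */
	a = strrchr(link_a, '/');
	b = strrchr(link_b, '/');
	return !strcmp(a ? a + 1 : link_a, b ? b + 1 : link_b);
}
-------------------------------------------------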

Thanks
Nicolin
Jason Gunthorpe Nov. 16, 2024, 12:34 a.m. UTC | #12
On Fri, Nov 15, 2024 at 02:07:41PM -0800, Nicolin Chen wrote:
> On Thu, Nov 14, 2024 at 12:20:10PM -0400, Jason Gunthorpe wrote:
> > [...]
> > 
> > The viommu object should be tied to the VMM's vIOMMU vHW object that
> > it is paravirtualizing toward the VM.
> > 
> > So we shouldn't be creating viommu objects on demand, it should be
> > created when the vIOMMU is created, and the presumably the qemu
> > command line will describe how to link vPCI/VFIO functions to vIOMMU
> > instances. If they kernel won't allow the user's configuration then it
> > should fail, IMHO.
> 
> Intel's virtual IOMMU in QEMU has one instance but could create
> two vIOMMU objects for devices behind two different pIOMMUs. So,
> in this case, it does the on-demand (or try-and-fail) approach?

I suspect Intel doesn't need viommu at all, and if it ever does, it
will not be able to have just one instance.

> One corner case that Yi reminded me of was that VMM having two
> virtual IOMMUs for two devices that are behind the same pIOMMU,
> then these two virtual IOMMUs don't necessarily share the same
> vIOMMU object, i.e. VMM is allowed to allocate two vIOMMU objs?

Yes this is allowed
 
> > Some try-and-fail might be interesting to auto-provision vIOMMU's and
> > provision vPCI functions. Though I suspect we will be providing
> > information in other ioctls so something like libvirt can construct
> > the correct configuration directly.
> 
> By "auto-provision", you mean libvirt assigning devices to the
> correct virtual IOMMUs corresponding to the physical instances?
> If so, we can just match the "iommu" sysfs node of devices with
> the iommu node(s) under /sys/class/iommu/, right?

Yes

Jason
Yi Liu Nov. 18, 2024, 6:10 a.m. UTC | #13
On 2024/11/16 08:34, Jason Gunthorpe wrote:
> On Fri, Nov 15, 2024 at 02:07:41PM -0800, Nicolin Chen wrote:
>> [...]
>>
>> Intel's virtual IOMMU in QEMU has one instance but could create
>> two vIOMMU objects for devices behind two different pIOMMUs. So,
>> in this case, it does the on-demand (or try-and-fail) approach?
> 
> I suspect Intel does need viommu at all, and if it ever does it will
> not be able to have one instance..

Hmmm. As far as I can tell, the viommu_obj is a representation of the hw
IOMMU slice of resources used by the VM. It is hence instanced per hw
IOMMU. Based on this, one vIOMMU can have one or multiple viommu_objs.
Either should be allowed by design.

BTW, @Nic, I think the viommu_obj instance is not strictly per hw
IOMMU, e.g. two devices behind one hw IOMMU can have their own viommu_objs
as well, right? I don't see a problem with it. So the number of viommu_obj
instances is >= the number of hw IOMMUs used by the VM.

Regards,
Yi Liu

Patch

diff --git a/Documentation/userspace-api/iommufd.rst b/Documentation/userspace-api/iommufd.rst
index 2deba93bf159..a8b7766c2849 100644
--- a/Documentation/userspace-api/iommufd.rst
+++ b/Documentation/userspace-api/iommufd.rst
@@ -63,6 +63,37 @@  Following IOMMUFD objects are exposed to userspace:
   space usually has mappings from guest-level I/O virtual addresses to guest-
   level physical addresses.
 
+- IOMMUFD_OBJ_VIOMMU, representing a slice of the physical IOMMU instance,
+  passed to or shared with a VM. It may include some HW-accelerated
+  virtualization features and some SW resources used by the VM. For example:
+  * Security namespace for guest-owned IDs, e.g. guest-controlled cache tags
+  * Non-device-affiliated event reporting, e.g. invalidation queue errors
+  * Access to a sharable nesting parent pagetable across physical IOMMUs
+  * Virtualization of various platform IDs, e.g. RIDs and others
+  * Delivery of paravirtualized invalidation
+  * Directly assigned invalidation queues
+  * Directly assigned interrupts
+  Such a vIOMMU object generally has access to a nesting parent pagetable
+  to support some HW-accelerated virtualization features. So, a vIOMMU object
+  must be created given a nesting parent HWPT_PAGING object, and then it would
+  encapsulate that HWPT_PAGING object. Therefore, a vIOMMU object can be used
+  to allocate an HWPT_NESTED object in place of the encapsulated HWPT_PAGING.
+
+  .. note::
+
+     The name "vIOMMU" isn't necessarily identical to a virtualized IOMMU in a
+     VM. A VM can have one giant virtualized IOMMU running on a machine having
+     multiple physical IOMMUs, in which case the VMM will dispatch the requests
+     or configurations from this single virtualized IOMMU instance to multiple
+     vIOMMU objects created for individual slices of different physical IOMMUs.
+     In other words, a vIOMMU object is always a representation of one physical
+     IOMMU, not necessarily of a virtualized IOMMU. For VMMs that want the full
+     virtualization features from physical IOMMUs, it is suggested to build the
+     same number of virtualized IOMMUs as the number of physical IOMMUs, so the
+     passed-through devices would be connected to their own virtualized IOMMUs
+     backed by corresponding vIOMMU objects, in which case a guest OS would do
+     the "dispatch" naturally instead of VMM trappings.
+
 All user-visible objects are destroyed via the IOMMU_DESTROY uAPI.
 
 The diagrams below show relationships between user-visible objects and kernel
@@ -101,6 +132,28 @@  creating the objects and links::
            |------------>|iommu_domain|<----|iommu_domain|<----|device|
                          |____________|     |____________|     |______|
 
+  _______________________________________________________________________
+ |                      iommufd (with vIOMMU)                            |
+ |                                                                       |
+ |                             [5]                                       |
+ |                        _____________                                  |
+ |                       |             |                                 |
+ |      |----------------|    vIOMMU   |                                 |
+ |      |                |             |                                 |
+ |      |                |             |                                 |
+ |      |      [1]       |             |          [4]             [2]    |
+ |      |     ______     |             |     _____________     ________  |
+ |      |    |      |    |     [3]     |    |             |   |        | |
+ |      |    | IOAS |<---|(HWPT_PAGING)|<---| HWPT_NESTED |<--| DEVICE | |
+ |      |    |______|    |_____________|    |_____________|   |________| |
+ |      |        |              |                  |               |     |
+ |______|________|______________|__________________|_______________|_____|
+        |        |              |                  |               |
+  ______v_____   |        ______v_____       ______v_____       ___v__
+ |   struct   |  |  PFN  |  (paging)  |     |  (nested)  |     |struct|
+ |iommu_device|  |------>|iommu_domain|<----|iommu_domain|<----|device|
+ |____________|   storage|____________|     |____________|     |______|
+
 1. IOMMUFD_OBJ_IOAS is created via the IOMMU_IOAS_ALLOC uAPI. An iommufd can
    hold multiple IOAS objects. IOAS is the most generic object and does not
    expose interfaces that are specific to single IOMMU drivers. All operations
@@ -132,7 +185,8 @@  creating the objects and links::
      flag is set.
 
 4. IOMMUFD_OBJ_HWPT_NESTED can be only manually created via the IOMMU_HWPT_ALLOC
-   uAPI, provided an hwpt_id via @pt_id to associate the new HWPT_NESTED object
+   uAPI, provided via @pt_id either an hwpt_id or a viommu_id of a vIOMMU object
+   encapsulating a nesting parent HWPT_PAGING, to associate the new HWPT_NESTED object
    to the corresponding HWPT_PAGING object. The associating HWPT_PAGING object
    must be a nesting parent manually allocated via the same uAPI previously with
    an IOMMU_HWPT_ALLOC_NEST_PARENT flag, otherwise the allocation will fail. The
@@ -149,6 +203,18 @@  creating the objects and links::
       created via the same IOMMU_HWPT_ALLOC uAPI. The difference is at the type
       of the object passed in via the @pt_id field of struct iommufd_hwpt_alloc.
 
+5. IOMMUFD_OBJ_VIOMMU can be only manually created via the IOMMU_VIOMMU_ALLOC
+   uAPI, provided a dev_id (for the device's physical IOMMU to back the vIOMMU)
+   and an hwpt_id (to associate the vIOMMU to a nesting parent HWPT_PAGING). The
+   iommufd core will link the vIOMMU object to the struct iommu_device that the
+   struct device is behind. An IOMMU driver can implement a viommu_alloc op
+   to allocate its own vIOMMU data structure embedding the core-level structure
+   iommufd_viommu and some driver-specific data. If necessary, the driver can
+   also configure its HW virtualization feature for that vIOMMU (and thus for
+   the VM). Successful completion of this operation sets up the linkages between
+   the vIOMMU object and the HWPT_PAGING, after which the vIOMMU object can be
+   used as a nesting parent to allocate an HWPT_NESTED object as described above.
+
 A device can only bind to an iommufd due to DMA ownership claim and attach to at
 most one IOAS object (no support of PASID yet).
 
@@ -161,6 +227,7 @@  User visible objects are backed by following datastructures:
 - iommufd_device for IOMMUFD_OBJ_DEVICE.
 - iommufd_hwpt_paging for IOMMUFD_OBJ_HWPT_PAGING.
 - iommufd_hwpt_nested for IOMMUFD_OBJ_HWPT_NESTED.
+- iommufd_viommu for IOMMUFD_OBJ_VIOMMU.
 
 Several terminologies when looking at these datastructures:
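
To illustrate the allocation flow in items [1]-[5] above, a minimal
userspace sketch (not part of the patch): it chains IOAS -> HWPT_PAGING
-> vIOMMU -> HWPT_NESTED, assumes dev_id came from VFIO's
VFIO_DEVICE_BIND_IOMMUFD, picks IOMMU_VIOMMU_TYPE_ARM_SMMUV3 as an
example driver-supported type, trims error handling, and omits the
driver-specific HWPT_NESTED data that real allocations require:

-------------------------------------------------
#include <sys/ioctl.h>
#include <linux/iommufd.h>

static int alloc_nested_chain(int iommufd, __u32 dev_id)
{
	/* [1] Allocate an IOAS to hold the stage-2 IOVA mappings */
	struct iommu_ioas_alloc ioas = { .size = sizeof(ioas) };
	if (ioctl(iommufd, IOMMU_IOAS_ALLOC, &ioas))
		return -1;

	/* [3] Allocate a HWPT_PAGING flagged as a nesting parent */
	struct iommu_hwpt_alloc s2 = {
		.size = sizeof(s2),
		.flags = IOMMU_HWPT_ALLOC_NEST_PARENT,
		.dev_id = dev_id,
		.pt_id = ioas.out_ioas_id,
	};
	if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &s2))
		return -1;

	/* [5] Allocate a vIOMMU on top of the nesting parent; the type
	 * must be one the underlying IOMMU driver supports.
	 */
	struct iommu_viommu_alloc viommu = {
		.size = sizeof(viommu),
		.type = IOMMU_VIOMMU_TYPE_ARM_SMMUV3,
		.dev_id = dev_id,
		.hwpt_id = s2.out_hwpt_id,
	};
	if (ioctl(iommufd, IOMMU_VIOMMU_ALLOC, &viommu))
		return -1;

	/* [4] Allocate a HWPT_NESTED, passing the vIOMMU via @pt_id in
	 * place of the encapsulated HWPT_PAGING; real users must also
	 * supply driver-specific data via data_type/data_len/data_uptr.
	 */
	struct iommu_hwpt_alloc s1 = {
		.size = sizeof(s1),
		.dev_id = dev_id,
		.pt_id = viommu.out_viommu_id,
	};
	if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &s1))
		return -1;

	return s1.out_hwpt_id;
}
-------------------------------------------------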