Message ID | 7e4302064e0d02137c1b1e139342affc0485ed3f.1730836219.git.nicolinc@nvidia.com (mailing list archive) |
---|---|
State | New |
Series | iommufd: Add vIOMMU infrastructure (Part-1) |
On Tue, Nov 05, 2024 at 12:04:29PM -0800, Nicolin Chen wrote:
> With the introduction of the new object and its infrastructure, update the
> doc to reflect that and add a new graph.
>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>  Documentation/userspace-api/iommufd.rst | 69 ++++++++++++++++++++++++-
>  1 file changed, 68 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/userspace-api/iommufd.rst b/Documentation/userspace-api/iommufd.rst
> index 2deba93bf159..a8b7766c2849 100644
> --- a/Documentation/userspace-api/iommufd.rst
> +++ b/Documentation/userspace-api/iommufd.rst
> @@ -63,6 +63,37 @@ Following IOMMUFD objects are exposed to userspace:
>     space usually has mappings from guest-level I/O virtual addresses to guest-
>     level physical addresses.
>
> +- IOMMUFD_OBJ_VIOMMU, representing a slice of the physical IOMMU instance,
> +  passed to or shared with a VM. It may be some HW-accelerated virtualization
> +  features and some SW resources used by the VM. For examples:
> +  * Security namespace for guest owned ID, e.g. guest-controlled cache tags
> +  * Non-device-affiliated event reporting, e.g. invalidation queue errors
> +  * Access to a sharable nesting parent pagetable across physical IOMMUs
> +  * Virtualization of various platforms IDs, e.g. RIDs and others
> +  * Delivery of paravirtualized invalidation
> +  * Direct assigned invalidation queues
> +  * Direct assigned interrupts

The bullet list above is outputted in htmldocs build as long-running paragraph
instead.

> +  Such a vIOMMU object generally has the access to a nesting parent pagetable
> +  to support some HW-accelerated virtualization features. So, a vIOMMU object
> +  must be created given a nesting parent HWPT_PAGING object, and then it would
> +  encapsulate that HWPT_PAGING object. Therefore, a vIOMMU object can be used
> +  to allocate an HWPT_NESTED object in place of the encapsulated HWPT_PAGING.

Thanks.

On Thu, Nov 07, 2024 at 07:56:31AM +0700, Bagas Sanjaya wrote:
> On Tue, Nov 05, 2024 at 12:04:29PM -0800, Nicolin Chen wrote:
> > +- IOMMUFD_OBJ_VIOMMU, representing a slice of the physical IOMMU instance,
[...]
> > +  * Direct assigned invalidation queues
> > +  * Direct assigned interrupts
>
> The bullet list above is outputted in htmldocs build as long-running paragraph
> instead.

Oh, I overlooked this list.

Would the following change be okay?

-------------------------------------------------
diff --git a/Documentation/userspace-api/iommufd.rst b/Documentation/userspace-api/iommufd.rst
index 0ef22b3ca30b..011cbc71b6f5 100644
--- a/Documentation/userspace-api/iommufd.rst
+++ b/Documentation/userspace-api/iommufd.rst
@@ -68,2 +68,3 @@ Following IOMMUFD objects are exposed to userspace:
   features and some SW resources used by the VM. For examples:
+
   * Security namespace for guest owned ID, e.g. guest-controlled cache tags
@@ -75,2 +76,3 @@ Following IOMMUFD objects are exposed to userspace:
   * Direct assigned interrupts
+
   Such a vIOMMU object generally has the access to a nesting parent pagetable
-------------------------------------------------

The outputted html is showing a list with this.

Thanks!
Nicolin

On Wed, Nov 06, 2024 at 05:35:45PM -0800, Nicolin Chen wrote: > On Thu, Nov 07, 2024 at 07:56:31AM +0700, Bagas Sanjaya wrote: > > On Tue, Nov 05, 2024 at 12:04:29PM -0800, Nicolin Chen wrote: > > > With the introduction of the new object and its infrastructure, update the > > > doc to reflect that and add a new graph. > > > > > > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> > > > Reviewed-by: Kevin Tian <kevin.tian@intel.com> > > > Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> > > > --- > > > Documentation/userspace-api/iommufd.rst | 69 ++++++++++++++++++++++++- > > > 1 file changed, 68 insertions(+), 1 deletion(-) > > > > > > diff --git a/Documentation/userspace-api/iommufd.rst b/Documentation/userspace-api/iommufd.rst > > > index 2deba93bf159..a8b7766c2849 100644 > > > --- a/Documentation/userspace-api/iommufd.rst > > > +++ b/Documentation/userspace-api/iommufd.rst > > > @@ -63,6 +63,37 @@ Following IOMMUFD objects are exposed to userspace: > > > space usually has mappings from guest-level I/O virtual addresses to guest- > > > level physical addresses. > > > > > > +- IOMMUFD_OBJ_VIOMMU, representing a slice of the physical IOMMU instance, > > > + passed to or shared with a VM. It may be some HW-accelerated virtualization > > > + features and some SW resources used by the VM. For examples: > > > + * Security namespace for guest owned ID, e.g. guest-controlled cache tags > > > + * Non-device-affiliated event reporting, e.g. invalidation queue errors > > > + * Access to a sharable nesting parent pagetable across physical IOMMUs > > > + * Virtualization of various platforms IDs, e.g. RIDs and others > > > + * Delivery of paravirtualized invalidation > > > + * Direct assigned invalidation queues > > > + * Direct assigned interrupts > > > > The bullet list above is outputted in htmldocs build as long-running paragraph > > instead. > > Oh, I overlooked this list. > > Would the following change be okay? > > ------------------------------------------------- > diff --git a/Documentation/userspace-api/iommufd.rst b/Documentation/userspace-api/iommufd.rst > index 0ef22b3ca30b..011cbc71b6f5 100644 > --- a/Documentation/userspace-api/iommufd.rst > +++ b/Documentation/userspace-api/iommufd.rst > @@ -68,2 +68,3 @@ Following IOMMUFD objects are exposed to userspace: > features and some SW resources used by the VM. For examples: > + > * Security namespace for guest owned ID, e.g. guest-controlled cache tags > @@ -75,2 +76,3 @@ Following IOMMUFD objects are exposed to userspace: > * Direct assigned interrupts > + > Such a vIOMMU object generally has the access to a nesting parent pagetable > ------------------------------------------------- > > The outputted html is showing a list with this. Yup, that's right!
On Thu, Nov 07, 2024 at 10:20:49AM +0700, Bagas Sanjaya wrote:
> On Wed, Nov 06, 2024 at 05:35:45PM -0800, Nicolin Chen wrote:
> > Would the following change be okay?
[...]
> > The outputted html is showing a list with this.
>
> Yup, that's right!

Thank you! Would it be possible for you to give a Reviewed-by,
given the condition of squashing this diff?

Likely, Jason will help squash it when taking this v7 via his
iommufd tree. So, we might not respin a v8.

Nicolin

On Wed, Nov 06, 2024 at 08:04:09PM -0800, Nicolin Chen wrote:
> Thank you! Would it be possible for you to give a Reviewed-by,
> given the condition of squashing this diff?

Alright, here it goes...

Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>

On 2024/11/6 04:04, Nicolin Chen wrote:
[...]
> +5. IOMMUFD_OBJ_VIOMMU can be only manually created via the IOMMU_VIOMMU_ALLOC
> +   uAPI, provided a dev_id (for the device's physical IOMMU to back the vIOMMU)
> +   and an hwpt_id (to associate the vIOMMU to a nesting parent HWPT_PAGING). The
> +   iommufd core will link the vIOMMU object to the struct iommu_device that the
> +   struct device is behind.

It looks to be reasonable to share the viommu_obj between devices behind
the same physical IOMMU. This design seems no enforcement for it. So it's
all up to userspace from what I got. :)

On Tue, Nov 12, 2024 at 09:15:02PM +0800, Yi Liu wrote:
> On 2024/11/6 04:04, Nicolin Chen wrote:
> > +5. IOMMUFD_OBJ_VIOMMU can be only manually created via the IOMMU_VIOMMU_ALLOC
> > +   uAPI, provided a dev_id (for the device's physical IOMMU to back the vIOMMU)
> > +   and an hwpt_id (to associate the vIOMMU to a nesting parent HWPT_PAGING). The
> > +   iommufd core will link the vIOMMU object to the struct iommu_device that the
> > +   struct device is behind.
>
> It looks to be reasonable to share the viommu_obj between devices behind
> the same physical IOMMU. This design seems no enforcement for it. So it's
> all up to userspace from what I got. :)

It enforces at the vDEVICE allocation:

	if (viommu->iommu_dev != __iommu_get_iommu_dev(idev->dev)) {
		return -EINVAL;

Yet, assuming you are referring to creating two vIOMMUs per VM for
two devices behind the same IOMMU (?), there is no enforcement..

The suggested way for VMM, just like two devices sharing the same
s2 parent hwpt, is to share the vIOMMU object.

Thanks
Nic

On 2024/11/14 08:18, Nicolin Chen wrote:
> It enforces at the vDEVICE allocation:
>
> 	if (viommu->iommu_dev != __iommu_get_iommu_dev(idev->dev)) {
> 		return -EINVAL;

this matches the device and the viommu.

> Yet, assuming you are referring to creating two vIOMMUs per VM for
> two devices behind the same IOMMU (?), there is no enforcement..

right, but not limited to two vIOMMUs as the viommu_obj is not instanced
per vIOMMUs.

> The suggested way for VMM, just like two devices sharing the same
> s2 parent hwpt, is to share the vIOMMU object.

so the user would try to create vDevices with a given viommu_obj until
failure, then it would allocate another viommu_obj for the failed device.
is it? sounds reasonable.

On Thu, Nov 14, 2024 at 11:13:00AM +0800, Yi Liu wrote:
> On 2024/11/14 08:18, Nicolin Chen wrote:
> > It enforces at the vDEVICE allocation:
> >
> > 	if (viommu->iommu_dev != __iommu_get_iommu_dev(idev->dev)) {
> > 		return -EINVAL;
>
> this matches the device and the viommu.

And viommu has a hard relationship with physical instance.

> > Yet, assuming you are referring to creating two vIOMMUs per VM for
> > two devices behind the same IOMMU (?), there is no enforcement..
>
> right, but not limited to two vIOMMUs as the viommu_obj is not instanced
> per vIOMMUs.

It doesn't make a lot of sense to create two. But it doesn't seem to be
something that we can limit here. VMM can create the first vIOMMU using
the first device and create the second vIOMMU using the second device,
though both devices are behind the same IOMMU and passed to the same VM.
Thing is that kernel doesn't know if they are passed to the same VM.

> > The suggested way for VMM, just like two devices sharing the same
> > s2 parent hwpt, is to share the vIOMMU object.
>
> so the user would try to create vDevices with a given viommu_obj until
> failure, then it would allocate another viommu_obj for the failed device.
> is it? sounds reasonable.

Yes. It is the same as previously dealing with a nesting parent:
test and allocate if fails. The virtual IOMMU driver in VMM can
keep a list of the vIOMMU objects for each device to test.

Thanks
Nic

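As an aside, the try-and-fail bookkeeping described above could look roughly
like the sketch below. The viommu_alloc()/vdevice_alloc() helpers are
hypothetical wrappers around this series' vIOMMU/vDEVICE allocation ioctls
(not real library calls), and treating -EINVAL as "device is behind a
different physical IOMMU" follows the kernel check quoted earlier in the
thread:

-------------------------------------------------
/*
 * Sketch only: viommu_alloc() and vdevice_alloc() are assumed wrappers over
 * the iommufd vIOMMU/vDEVICE allocation ioctls, returning 0 or -errno.
 */
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

int viommu_alloc(int iommufd, uint32_t dev_id, uint32_t parent_hwpt_id,
		 uint32_t *out_viommu_id);
int vdevice_alloc(int iommufd, uint32_t viommu_id, uint32_t dev_id,
		  uint64_t virt_id, uint32_t *out_vdevice_id);

struct viommu_node {
	uint32_t viommu_id;
	struct viommu_node *next;
};

/*
 * Attach @dev_id to one of the vIOMMUs already allocated for this VM.  If
 * none of them is backed by the device's physical IOMMU (the kernel rejects
 * the vDEVICE with -EINVAL), allocate a new vIOMMU on the shared nesting
 * parent HWPT_PAGING and retry.
 */
static int provision_vdevice(int iommufd, struct viommu_node **viommus,
			     uint32_t dev_id, uint32_t parent_hwpt_id,
			     uint64_t virt_id, uint32_t *out_vdev_id)
{
	struct viommu_node *v;
	int rc;

	for (v = *viommus; v; v = v->next) {
		rc = vdevice_alloc(iommufd, v->viommu_id, dev_id, virt_id,
				   out_vdev_id);
		if (rc != -EINVAL)
			return rc;	/* success, or a real failure */
	}

	v = calloc(1, sizeof(*v));
	if (!v)
		return -ENOMEM;
	rc = viommu_alloc(iommufd, dev_id, parent_hwpt_id, &v->viommu_id);
	if (rc) {
		free(v);
		return rc;
	}
	v->next = *viommus;
	*viommus = v;
	return vdevice_alloc(iommufd, v->viommu_id, dev_id, virt_id,
			     out_vdev_id);
}
-------------------------------------------------
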
On Wed, Nov 13, 2024 at 07:18:42PM -0800, Nicolin Chen wrote:
> > so the user would try to create vDevices with a given viommu_obj until
> > failure, then it would allocate another viommu_obj for the failed device.
> > is it? sounds reasonable.
>
> Yes. It is the same as previously dealing with a nesting parent:
> test and allocate if fails. The virtual IOMMU driver in VMM can
> keep a list of the vIOMMU objects for each device to test.

The viommu object should be tied to the VMM's vIOMMU vHW object that
it is paravirtualizing toward the VM.

So we shouldn't be creating viommu objects on demand, it should be
created when the vIOMMU is created, and presumably the qemu command
line will describe how to link vPCI/VFIO functions to vIOMMU instances.
If the kernel won't allow the user's configuration then it should
fail, IMHO.

Some try-and-fail might be interesting to auto-provision vIOMMU's and
provision vPCI functions. Though I suspect we will be providing
information in other ioctls so something like libvirt can construct
the correct configuration directly.

Jason

On Thu, Nov 14, 2024 at 12:20:10PM -0400, Jason Gunthorpe wrote:
> The viommu object should be tied to the VMM's vIOMMU vHW object that
> it is paravirtualizing toward the VM.
>
> So we shouldn't be creating viommu objects on demand, it should be
> created when the vIOMMU is created, and presumably the qemu command
> line will describe how to link vPCI/VFIO functions to vIOMMU instances.
> If the kernel won't allow the user's configuration then it should
> fail, IMHO.

Intel's virtual IOMMU in QEMU has one instance but could create
two vIOMMU objects for devices behind two different pIOMMUs. So,
in this case, it does the on-demand (or try-and-fail) approach?

One corner case that Yi reminded me of was that VMM having two
virtual IOMMUs for two devices that are behind the same pIOMMU,
then these two virtual IOMMUs don't necessarily share the same
vIOMMU object, i.e. VMM is allowed to allocate two vIOMMU objs?

> Some try-and-fail might be interesting to auto-provision vIOMMU's and
> provision vPCI functions. Though I suspect we will be providing
> information in other ioctls so something like libvirt can construct
> the correct configuration directly.

By "auto-provision", you mean libvirt assigning devices to the
correct virtual IOMMUs corresponding to the physical instances?
If so, we can just match the "iommu" sysfs node of devices with
the iommu node(s) under /sys/class/iommu/, right?

Thanks
Nicolin

On Fri, Nov 15, 2024 at 02:07:41PM -0800, Nicolin Chen wrote:
> Intel's virtual IOMMU in QEMU has one instance but could create
> two vIOMMU objects for devices behind two different pIOMMUs. So,
> in this case, it does the on-demand (or try-and-fail) approach?

I suspect Intel doesn't need viommu at all, and if it ever does it will
not be able to have one instance..

> One corner case that Yi reminded me of was that VMM having two
> virtual IOMMUs for two devices that are behind the same pIOMMU,
> then these two virtual IOMMUs don't necessarily share the same
> vIOMMU object, i.e. VMM is allowed to allocate two vIOMMU objs?

Yes this is allowed

> By "auto-provision", you mean libvirt assigning devices to the
> correct virtual IOMMUs corresponding to the physical instances?
> If so, we can just match the "iommu" sysfs node of devices with
> the iommu node(s) under /sys/class/iommu/, right?

Yes

Jason

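The sysfs matching agreed on above can be done before any iommufd call.
A minimal sketch, assuming PCI devices that expose the standard "iommu"
sysfs link (the BDF strings below are only placeholders):

-------------------------------------------------
/* Devices whose "iommu" links resolve to the same node (also listed under
 * /sys/class/iommu/) sit behind the same physical IOMMU and can share one
 * vIOMMU object.  Error handling is minimal; sketch only.
 */
#include <libgen.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int phys_iommu_of(const char *bdf, char *out, size_t len)
{
	char link[PATH_MAX], target[PATH_MAX];
	ssize_t n;

	snprintf(link, sizeof(link), "/sys/bus/pci/devices/%s/iommu", bdf);
	n = readlink(link, target, sizeof(target) - 1);
	if (n < 0)
		return -1;
	target[n] = '\0';
	/* last path component names the IOMMU instance, e.g. "dmar0" */
	snprintf(out, len, "%s", basename(target));
	return 0;
}

int main(void)
{
	char a[64], b[64];

	if (phys_iommu_of("0000:01:00.0", a, sizeof(a)) ||
	    phys_iommu_of("0000:02:00.0", b, sizeof(b)))
		return 1;
	printf("same physical IOMMU: %s\n", strcmp(a, b) ? "no" : "yes");
	return 0;
}
-------------------------------------------------
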
On 2024/11/16 08:34, Jason Gunthorpe wrote:
> On Fri, Nov 15, 2024 at 02:07:41PM -0800, Nicolin Chen wrote:
> > Intel's virtual IOMMU in QEMU has one instance but could create
> > two vIOMMU objects for devices behind two different pIOMMUs. So,
> > in this case, it does the on-demand (or try-and-fail) approach?
>
> I suspect Intel doesn't need viommu at all, and if it ever does it will
> not be able to have one instance..

hmmm. As far as I got, the viommu_obj is a representative of the hw IOMMU
slice of resource used by the VM. It is hence instanced per hw IOMMU. Based
on this, one vIOMMU can have one or multiple viommu_objs. Either should be
allowed by design.

BTW. @Nic, I think the viommu_obj instance is not strictly per hw IOMMU.
e.g. two devices behind one hw IOMMU can have their own viommu_objs as well.
Is it? I didn't see a problem for it. So the number of viommu_obj instances
is >= the number of hw IOMMUs used by the VM.

Regards,
Yi Liu

diff --git a/Documentation/userspace-api/iommufd.rst b/Documentation/userspace-api/iommufd.rst
index 2deba93bf159..a8b7766c2849 100644
--- a/Documentation/userspace-api/iommufd.rst
+++ b/Documentation/userspace-api/iommufd.rst
@@ -63,6 +63,37 @@ Following IOMMUFD objects are exposed to userspace:
   space usually has mappings from guest-level I/O virtual addresses to guest-
   level physical addresses.
 
+- IOMMUFD_OBJ_VIOMMU, representing a slice of the physical IOMMU instance,
+  passed to or shared with a VM. It may be some HW-accelerated virtualization
+  features and some SW resources used by the VM. For examples:
+  * Security namespace for guest owned ID, e.g. guest-controlled cache tags
+  * Non-device-affiliated event reporting, e.g. invalidation queue errors
+  * Access to a sharable nesting parent pagetable across physical IOMMUs
+  * Virtualization of various platforms IDs, e.g. RIDs and others
+  * Delivery of paravirtualized invalidation
+  * Direct assigned invalidation queues
+  * Direct assigned interrupts
+  Such a vIOMMU object generally has the access to a nesting parent pagetable
+  to support some HW-accelerated virtualization features. So, a vIOMMU object
+  must be created given a nesting parent HWPT_PAGING object, and then it would
+  encapsulate that HWPT_PAGING object. Therefore, a vIOMMU object can be used
+  to allocate an HWPT_NESTED object in place of the encapsulated HWPT_PAGING.
+
+  .. note::
+
+     The name "vIOMMU" isn't necessarily identical to a virtualized IOMMU in a
+     VM. A VM can have one giant virtualized IOMMU running on a machine having
+     multiple physical IOMMUs, in which case the VMM will dispatch the requests
+     or configurations from this single virtualized IOMMU instance to multiple
+     vIOMMU objects created for individual slices of different physical IOMMUs.
+     In other words, a vIOMMU object is always a representation of one physical
+     IOMMU, not necessarily of a virtualized IOMMU. For VMMs that want the full
+     virtualization features from physical IOMMUs, it is suggested to build the
+     same number of virtualized IOMMUs as the number of physical IOMMUs, so the
+     passed-through devices would be connected to their own virtualized IOMMUs
+     backed by corresponding vIOMMU objects, in which case a guest OS would do
+     the "dispatch" naturally instead of VMM trappings.
+
 All user-visible objects are destroyed via the IOMMU_DESTROY uAPI.
 
 The diagrams below show relationships between user-visible objects and kernel
@@ -101,6 +132,28 @@ creating the objects and links::
  |------------>|iommu_domain|<----|iommu_domain|<----|device|
                |____________|     |____________|     |______|
 
+  _______________________________________________________________________
+ |                         iommufd (with vIOMMU)                          |
+ |                                                                        |
+ |                             [5]                                        |
+ |                        _____________                                   |
+ |                       |             |                                  |
+ |      |----------------|    vIOMMU   |                                  |
+ |      |                |             |                                  |
+ |      |                |             |                                  |
+ |      |      [1]       |             |          [4]            [2]      |
+ |      |     ______     |             |     _____________     ________   |
+ |      |    |      |    |     [3]     |    |             |   |        |  |
+ |      |    | IOAS |<---|(HWPT_PAGING)|<---| HWPT_NESTED |<--| DEVICE |  |
+ |      |    |______|    |_____________|    |_____________|   |________|  |
+ |      |       |               |                  |              |       |
+ |______|_______|_______________|__________________|______________|_______|
+        |       |               |                  |              |
+  ______v_____  |         ______v_____       ______v_____      ___v__
+ |   struct   | |  PFN   |  (paging)  |     |  (nested)  |    |struct|
+ |iommu_device| |------->|iommu_domain|<----|iommu_domain|<---|device|
+ |____________|   storage|____________|     |____________|    |______|
+
 1. IOMMUFD_OBJ_IOAS is created via the IOMMU_IOAS_ALLOC uAPI. An iommufd can
    hold multiple IOAS objects. IOAS is the most generic object and does not
    expose interfaces that are specific to single IOMMU drivers. All operations
@@ -132,7 +185,8 @@ creating the objects and links::
    flag is set.
 
 4. IOMMUFD_OBJ_HWPT_NESTED can be only manually created via the IOMMU_HWPT_ALLOC
-   uAPI, provided an hwpt_id via @pt_id to associate the new HWPT_NESTED object
+   uAPI, provided an hwpt_id or a viommu_id of a vIOMMU object encapsulating a
+   nesting parent HWPT_PAGING via @pt_id to associate the new HWPT_NESTED object
    to the corresponding HWPT_PAGING object. The associating HWPT_PAGING object
    must be a nesting parent manually allocated via the same uAPI previously with
    an IOMMU_HWPT_ALLOC_NEST_PARENT flag, otherwise the allocation will fail. The
@@ -149,6 +203,18 @@ creating the objects and links::
    created via the same IOMMU_HWPT_ALLOC uAPI. The difference is at the type
    of the object passed in via the @pt_id field of struct iommufd_hwpt_alloc.
 
+5. IOMMUFD_OBJ_VIOMMU can be only manually created via the IOMMU_VIOMMU_ALLOC
+   uAPI, provided a dev_id (for the device's physical IOMMU to back the vIOMMU)
+   and an hwpt_id (to associate the vIOMMU to a nesting parent HWPT_PAGING). The
+   iommufd core will link the vIOMMU object to the struct iommu_device that the
+   struct device is behind. And an IOMMU driver can implement a viommu_alloc op
+   to allocate its own vIOMMU data structure embedding the core-level structure
+   iommufd_viommu and some driver-specific data. If necessary, the driver can
+   also configure its HW virtualization feature for that vIOMMU (and thus for
+   the VM). Successful completion of this operation sets up the linkages between
+   the vIOMMU object and the HWPT_PAGING, then this vIOMMU object can be used
+   as a nesting parent object to allocate an HWPT_NESTED object described above.
+
 A device can only bind to an iommufd due to DMA ownership claim and attach to
 at most one IOAS object (no support of PASID yet).
 
@@ -161,6 +227,7 @@ User visible objects are backed by following datastructures:
 - iommufd_device for IOMMUFD_OBJ_DEVICE.
 - iommufd_hwpt_paging for IOMMUFD_OBJ_HWPT_PAGING.
 - iommufd_hwpt_nested for IOMMUFD_OBJ_HWPT_NESTED.
+- iommufd_viommu for IOMMUFD_OBJ_VIOMMU.
 
 Several terminologies when looking at these datastructures:

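Read together, steps [1]-[5] of the patched document suggest the following
allocation order from userspace. This is a hedged sketch: the ioctl names
(IOMMU_IOAS_ALLOC, IOMMU_HWPT_ALLOC with IOMMU_HWPT_ALLOC_NEST_PARENT,
IOMMU_VIOMMU_ALLOC) come from the document itself, but the *_alloc() helpers
and their argument lists are assumptions standing in for the real struct
layouts in the uAPI header:

-------------------------------------------------
/* Hypothetical wrappers over the iommufd ioctls named in the documentation;
 * each returns 0 on success or a negative errno.
 */
#include <stddef.h>
#include <stdint.h>

int ioas_alloc(int iommufd, uint32_t *out_ioas_id);              /* IOMMU_IOAS_ALLOC   */
int hwpt_paging_alloc(int iommufd, uint32_t dev_id, uint32_t ioas_id,
		      int nest_parent, uint32_t *out_hwpt_id);    /* IOMMU_HWPT_ALLOC   */
int viommu_alloc(int iommufd, uint32_t dev_id, uint32_t hwpt_id,
		 uint32_t *out_viommu_id);                        /* IOMMU_VIOMMU_ALLOC */
int hwpt_nested_alloc(int iommufd, uint32_t dev_id, uint32_t pt_id,
		      const void *driver_data, size_t data_len,
		      uint32_t *out_hwpt_id);                     /* IOMMU_HWPT_ALLOC   */

int setup_nesting(int iommufd, uint32_t dev_id)
{
	uint32_t ioas_id, parent_id, viommu_id, nested_id;
	int rc;

	rc = ioas_alloc(iommufd, &ioas_id);                       /* [1] IOAS */
	if (rc)
		return rc;

	/* [3] nesting parent HWPT_PAGING; nest_parent != 0 means the wrapper
	 * is assumed to set IOMMU_HWPT_ALLOC_NEST_PARENT in the flags.
	 */
	rc = hwpt_paging_alloc(iommufd, dev_id, ioas_id, 1, &parent_id);
	if (rc)
		return rc;

	/* [5] vIOMMU: a slice of the device's physical IOMMU, encapsulating
	 * the nesting parent HWPT_PAGING.
	 */
	rc = viommu_alloc(iommufd, dev_id, parent_id, &viommu_id);
	if (rc)
		return rc;

	/* [4] HWPT_NESTED: the vIOMMU id goes in via @pt_id in place of the
	 * encapsulated HWPT_PAGING id.  A real caller would also pass the
	 * IOMMU driver's stage-1 data; it is omitted here for brevity.
	 */
	return hwpt_nested_alloc(iommufd, dev_id, viommu_id, NULL, 0,
				 &nested_id);
}
-------------------------------------------------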