diff mbox series

[RFC,v16,1/9] iommu: Introduce attach/detach_pasid_table API

Message ID 20211027104428.1059740-2-eric.auger@redhat.com (mailing list archive)
State New, archived
Headers show
Series SMMUv3 Nested Stage Setup (IOMMU part) | expand

Commit Message

Eric Auger Oct. 27, 2021, 10:44 a.m. UTC
In virtualization use case, when a guest is assigned
a PCI host device, protected by a virtual IOMMU on the guest,
the physical IOMMU must be programmed to be consistent with
the guest mappings. If the physical IOMMU supports two
translation stages it makes sense to program guest mappings
onto the first stage/level (ARM/Intel terminology) while the host
owns the stage/level 2.

In that case, it is mandated to trap on guest configuration
settings and pass those to the physical iommu driver.

This patch adds a new API to the iommu subsystem that allows
to set/unset the pasid table information.

A generic iommu_pasid_table_config struct is introduced in
a new iommu.h uapi header. This is going to be used by the VFIO
user API.

Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Eric Auger <eric.auger@redhat.com>

---

v13 -> v14:
- export iommu_attach_pasid_table
- add dummy iommu_uapi_attach_pasid_table
- swap base_ptr and format in iommu_pasid_table_config

v12 -> v13:
- Fix config check

v11 -> v12:
- add argsz, name the union
---
 drivers/iommu/iommu.c      | 69 ++++++++++++++++++++++++++++++++++++++
 include/linux/iommu.h      | 27 +++++++++++++++
 include/uapi/linux/iommu.h | 54 +++++++++++++++++++++++++++++
 3 files changed, 150 insertions(+)

Comments

Joerg Roedel Dec. 6, 2021, 10:48 a.m. UTC | #1
On Wed, Oct 27, 2021 at 12:44:20PM +0200, Eric Auger wrote:
> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Eric Auger <eric.auger@redhat.com>

This Signed-of-by chain looks dubious, you are the author but the last
one in the chain?

> +int iommu_uapi_attach_pasid_table(struct iommu_domain *domain,
> +				  void __user *uinfo)
> +{

[...]

> +	if (pasid_table_data.format == IOMMU_PASID_FORMAT_SMMUV3 &&
> +	    pasid_table_data.argsz <
> +		offsetofend(struct iommu_pasid_table_config, vendor_data.smmuv3))
> +		return -EINVAL;

This check looks like it belongs in driver specific code.

Regards,

	Joerg
Eric Auger Dec. 7, 2021, 10:22 a.m. UTC | #2
Hi Joerg,

On 12/6/21 11:48 AM, Joerg Roedel wrote:
> On Wed, Oct 27, 2021 at 12:44:20PM +0200, Eric Auger wrote:
>> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
>> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
>> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> This Signed-of-by chain looks dubious, you are the author but the last
> one in the chain?
The 1st RFC in Aug 2018
(https://lists.cs.columbia.edu/pipermail/kvmarm/2018-August/032478.html)
said this was a generalization of Jacob's patch


  [PATCH v5 01/23] iommu: introduce bind_pasid_table API function


  https://lists.linuxfoundation.org/pipermail/iommu/2018-May/027647.html

So indeed Jacob should be the author. I guess the multiple rebases got
this eventually replaced at some point, which is not an excuse. Please
forgive me for that.
Now the original patch already had this list of SoB so I don't know if I
shall simplify it.


>
>> +int iommu_uapi_attach_pasid_table(struct iommu_domain *domain,
>> +				  void __user *uinfo)
>> +{
> [...]
>
>> +	if (pasid_table_data.format == IOMMU_PASID_FORMAT_SMMUV3 &&
>> +	    pasid_table_data.argsz <
>> +		offsetofend(struct iommu_pasid_table_config, vendor_data.smmuv3))
>> +		return -EINVAL;
> This check looks like it belongs in driver specific code.
Indeed, I will fix that in my next respin :-)

Thanks!

Eric
>
> Regards,
>
> 	Joerg
>
Baolu Lu Dec. 8, 2021, 2:44 a.m. UTC | #3
Hi Eric,

On 12/7/21 6:22 PM, Eric Auger wrote:
> On 12/6/21 11:48 AM, Joerg Roedel wrote:
>> On Wed, Oct 27, 2021 at 12:44:20PM +0200, Eric Auger wrote:
>>> Signed-off-by: Jean-Philippe Brucker<jean-philippe.brucker@arm.com>
>>> Signed-off-by: Liu, Yi L<yi.l.liu@linux.intel.com>
>>> Signed-off-by: Ashok Raj<ashok.raj@intel.com>
>>> Signed-off-by: Jacob Pan<jacob.jun.pan@linux.intel.com>
>>> Signed-off-by: Eric Auger<eric.auger@redhat.com>
>> This Signed-of-by chain looks dubious, you are the author but the last
>> one in the chain?
> The 1st RFC in Aug 2018
> (https://lists.cs.columbia.edu/pipermail/kvmarm/2018-August/032478.html)
> said this was a generalization of Jacob's patch
> 
> 
>    [PATCH v5 01/23] iommu: introduce bind_pasid_table API function
> 
> 
>    https://lists.linuxfoundation.org/pipermail/iommu/2018-May/027647.html
> 
> So indeed Jacob should be the author. I guess the multiple rebases got
> this eventually replaced at some point, which is not an excuse. Please
> forgive me for that.
> Now the original patch already had this list of SoB so I don't know if I
> shall simplify it.

As we have decided to move the nested mode (dual stages) implementation
onto the developing iommufd framework, what's the value of adding this
into iommu core?

Best regards,
baolu
Eric Auger Dec. 8, 2021, 7:33 a.m. UTC | #4
Hi Baolu,

On 12/8/21 3:44 AM, Lu Baolu wrote:
> Hi Eric,
>
> On 12/7/21 6:22 PM, Eric Auger wrote:
>> On 12/6/21 11:48 AM, Joerg Roedel wrote:
>>> On Wed, Oct 27, 2021 at 12:44:20PM +0200, Eric Auger wrote:
>>>> Signed-off-by: Jean-Philippe Brucker<jean-philippe.brucker@arm.com>
>>>> Signed-off-by: Liu, Yi L<yi.l.liu@linux.intel.com>
>>>> Signed-off-by: Ashok Raj<ashok.raj@intel.com>
>>>> Signed-off-by: Jacob Pan<jacob.jun.pan@linux.intel.com>
>>>> Signed-off-by: Eric Auger<eric.auger@redhat.com>
>>> This Signed-of-by chain looks dubious, you are the author but the last
>>> one in the chain?
>> The 1st RFC in Aug 2018
>> (https://lists.cs.columbia.edu/pipermail/kvmarm/2018-August/032478.html)
>> said this was a generalization of Jacob's patch
>>
>>
>>    [PATCH v5 01/23] iommu: introduce bind_pasid_table API function
>>
>>
>>   
>> https://lists.linuxfoundation.org/pipermail/iommu/2018-May/027647.html
>>
>> So indeed Jacob should be the author. I guess the multiple rebases got
>> this eventually replaced at some point, which is not an excuse. Please
>> forgive me for that.
>> Now the original patch already had this list of SoB so I don't know if I
>> shall simplify it.
>
> As we have decided to move the nested mode (dual stages) implementation
> onto the developing iommufd framework, what's the value of adding this
> into iommu core?

The iommu_uapi_attach_pasid_table uapi should disappear indeed as it is
is bound to be replaced by /dev/iommu fellow API.
However until I can rebase on /dev/iommu code I am obliged to keep it to
maintain this integration, hence the RFC.

Thanks

Eric
>
> Best regards,
> baolu
>
Jason Gunthorpe Dec. 8, 2021, 12:56 p.m. UTC | #5
On Wed, Dec 08, 2021 at 08:33:33AM +0100, Eric Auger wrote:
> Hi Baolu,
> 
> On 12/8/21 3:44 AM, Lu Baolu wrote:
> > Hi Eric,
> >
> > On 12/7/21 6:22 PM, Eric Auger wrote:
> >> On 12/6/21 11:48 AM, Joerg Roedel wrote:
> >>> On Wed, Oct 27, 2021 at 12:44:20PM +0200, Eric Auger wrote:
> >>>> Signed-off-by: Jean-Philippe Brucker<jean-philippe.brucker@arm.com>
> >>>> Signed-off-by: Liu, Yi L<yi.l.liu@linux.intel.com>
> >>>> Signed-off-by: Ashok Raj<ashok.raj@intel.com>
> >>>> Signed-off-by: Jacob Pan<jacob.jun.pan@linux.intel.com>
> >>>> Signed-off-by: Eric Auger<eric.auger@redhat.com>
> >>> This Signed-of-by chain looks dubious, you are the author but the last
> >>> one in the chain?
> >> The 1st RFC in Aug 2018
> >> (https://lists.cs.columbia.edu/pipermail/kvmarm/2018-August/032478.html)
> >> said this was a generalization of Jacob's patch
> >>
> >>
> >>    [PATCH v5 01/23] iommu: introduce bind_pasid_table API function
> >>
> >>
> >>   
> >> https://lists.linuxfoundation.org/pipermail/iommu/2018-May/027647.html
> >>
> >> So indeed Jacob should be the author. I guess the multiple rebases got
> >> this eventually replaced at some point, which is not an excuse. Please
> >> forgive me for that.
> >> Now the original patch already had this list of SoB so I don't know if I
> >> shall simplify it.
> >
> > As we have decided to move the nested mode (dual stages) implementation
> > onto the developing iommufd framework, what's the value of adding this
> > into iommu core?
> 
> The iommu_uapi_attach_pasid_table uapi should disappear indeed as it is
> is bound to be replaced by /dev/iommu fellow API.
> However until I can rebase on /dev/iommu code I am obliged to keep it to
> maintain this integration, hence the RFC.

Indeed, we are getting pretty close to having the base iommufd that we
can start adding stuff like this into. Maybe in January, you can look
at some parts of what is evolving here:

https://github.com/jgunthorpe/linux/commits/iommufd
https://github.com/LuBaolu/intel-iommu/commits/iommu-dma-ownership-v2
https://github.com/luxis1999/iommufd/commits/iommufd-v5.16-rc2

From a progress perspective I would like to start with simple 'page
tables in userspace', ie no PASID in this step.

'page tables in userspace' means an iommufd ioctl to create an
iommu_domain where the IOMMU HW is directly travesering a
device-specific page table structure in user space memory. All the HW
today implements this by using another iommu_domain to allow the IOMMU
HW DMA access to user memory - ie nesting or multi-stage or whatever.

This would come along with some ioctls to invalidate the IOTLB.

I'm imagining this step as a iommu_group->op->create_user_domain()
driver callback which will create a new kind of domain with
domain-unique ops. Ie map/unmap related should all be NULL as those
are impossible operations.

From there the usual struct device (ie RID) attach/detatch stuff needs
to take care of routing DMAs to this iommu_domain.

Step two would be to add the ability for an iommufd using driver to
request that a RID&PASID is connected to an iommu_domain. This
connection can be requested for any kind of iommu_domain, kernel owned
or user owned.

I don't quite have an answer how exactly the SMMUv3 vs Intel
difference in PASID routing should be resolved.

to get answers I'm hoping to start building some sketch RFCs for these
different things on iommufd, hopefully in January. I'm looking at user
page tables, PASID, dirty tracking and userspace IO fault handling as
the main features iommufd must tackle.

The purpose of the sketches would be to validate that the HW features
we want to exposed can work will with the choices the base is making.

Jason
Jean-Philippe Brucker Dec. 8, 2021, 5:20 p.m. UTC | #6
On Wed, Dec 08, 2021 at 08:56:16AM -0400, Jason Gunthorpe wrote:
> From a progress perspective I would like to start with simple 'page
> tables in userspace', ie no PASID in this step.
> 
> 'page tables in userspace' means an iommufd ioctl to create an
> iommu_domain where the IOMMU HW is directly travesering a
> device-specific page table structure in user space memory. All the HW
> today implements this by using another iommu_domain to allow the IOMMU
> HW DMA access to user memory - ie nesting or multi-stage or whatever.
> 
> This would come along with some ioctls to invalidate the IOTLB.
> 
> I'm imagining this step as a iommu_group->op->create_user_domain()
> driver callback which will create a new kind of domain with
> domain-unique ops. Ie map/unmap related should all be NULL as those
> are impossible operations.
> 
> From there the usual struct device (ie RID) attach/detatch stuff needs
> to take care of routing DMAs to this iommu_domain.
> 
> Step two would be to add the ability for an iommufd using driver to
> request that a RID&PASID is connected to an iommu_domain. This
> connection can be requested for any kind of iommu_domain, kernel owned
> or user owned.
> 
> I don't quite have an answer how exactly the SMMUv3 vs Intel
> difference in PASID routing should be resolved.

In SMMUv3 the user pgd is always stored in the PASID table (actually
called "context descriptor table" but I want to avoid confusion with the
VT-d "context table"). And to access the PASID table, the SMMUv3 first
translate its GPA into a PA using the stage-2 page table. For userspace to
pass individual pgds to the kernel, as opposed to passing whole PASID
tables, the host kernel needs to reserve GPA space and map it in stage-2,
so it can store the PASID table in there. Userspace manages GPA space.

This would be easy for a single pgd. In this case the PASID table has a
single entry and userspace could just pass one GPA page during
registration. However it isn't easily generalized to full PASID support,
because managing a multi-level PASID table will require runtime GPA
allocation, and that API is awkward. That's why we opted for "attach PASID
table" operation rather than "attach page table" (back then the choice was
easy since VT-d used the same concept).

So I think the simplest way to support nesting is still to have separate
modes of operations depending on the hardware.

Thanks,
Jean

> 
> to get answers I'm hoping to start building some sketch RFCs for these
> different things on iommufd, hopefully in January. I'm looking at user
> page tables, PASID, dirty tracking and userspace IO fault handling as
> the main features iommufd must tackle.
> 
> The purpose of the sketches would be to validate that the HW features
> we want to exposed can work will with the choices the base is making.
> 
> Jason
Jason Gunthorpe Dec. 8, 2021, 6:31 p.m. UTC | #7
On Wed, Dec 08, 2021 at 05:20:39PM +0000, Jean-Philippe Brucker wrote:
> On Wed, Dec 08, 2021 at 08:56:16AM -0400, Jason Gunthorpe wrote:
> > From a progress perspective I would like to start with simple 'page
> > tables in userspace', ie no PASID in this step.
> > 
> > 'page tables in userspace' means an iommufd ioctl to create an
> > iommu_domain where the IOMMU HW is directly travesering a
> > device-specific page table structure in user space memory. All the HW
> > today implements this by using another iommu_domain to allow the IOMMU
> > HW DMA access to user memory - ie nesting or multi-stage or whatever.
> > 
> > This would come along with some ioctls to invalidate the IOTLB.
> > 
> > I'm imagining this step as a iommu_group->op->create_user_domain()
> > driver callback which will create a new kind of domain with
> > domain-unique ops. Ie map/unmap related should all be NULL as those
> > are impossible operations.
> > 
> > From there the usual struct device (ie RID) attach/detatch stuff needs
> > to take care of routing DMAs to this iommu_domain.
> > 
> > Step two would be to add the ability for an iommufd using driver to
> > request that a RID&PASID is connected to an iommu_domain. This
> > connection can be requested for any kind of iommu_domain, kernel owned
> > or user owned.
> > 
> > I don't quite have an answer how exactly the SMMUv3 vs Intel
> > difference in PASID routing should be resolved.
> 
> In SMMUv3 the user pgd is always stored in the PASID table (actually
> called "context descriptor table" but I want to avoid confusion with
> the VT-d "context table"). And to access the PASID table, the SMMUv3 first
> translate its GPA into a PA using the stage-2 page table. For userspace to
> pass individual pgds to the kernel, as opposed to passing whole PASID
> tables, the host kernel needs to reserve GPA space and map it in stage-2,
> so it can store the PASID table in there. Userspace manages GPA space.

It is what I thought.. So in the SMMUv3 spec the STE is completely in
kernel memory, but it points to an S1ContextPtr that must be an IPA if
the "stage 1 translation tables" are IPA. Only via S1ContextPtr can we
decode the substream?

So in SMMUv3 land we don't really ever talk about PASID, we have a
'user page table' that is bound to an entire RID and *all* PASIDs.

While Intel would have a 'user page table' that is only bound to a RID
& PASID

Certianly it is not a difference we can hide from userspace.
 
> This would be easy for a single pgd. In this case the PASID table has a
> single entry and userspace could just pass one GPA page during
> registration. However it isn't easily generalized to full PASID support,
> because managing a multi-level PASID table will require runtime GPA
> allocation, and that API is awkward. That's why we opted for "attach PASID
> table" operation rather than "attach page table" (back then the choice was
> easy since VT-d used the same concept).

I think the entire context descriptor table should be in userspace,
and filled in by userspace, as part of the userspace page table.

The kernel API should accept the S1ContextPtr IPA and all the parts of
the STE that relate to the defining the layout of what the S1Context
points to an thats it.

We should have another mode where the kernel owns everything, and the
S1ContexPtr is a PA with Stage 2 bypassed.

That part is fine, the more open question is what does the driver
interface look like when userspace tell something like vfio-pci to
connect to this thing. At some level the attaching device needs to
authorize iommufd to take the entire PASID table and RID.

Specifically we cannot use this thing with a mdev, while the Intel
version of a userspace page table can be.

Maybe that is just some 'allow whole device' flag in an API

Jason
Tian, Kevin Dec. 9, 2021, 2:58 a.m. UTC | #8
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, December 9, 2021 2:31 AM
> 
> On Wed, Dec 08, 2021 at 05:20:39PM +0000, Jean-Philippe Brucker wrote:
> > On Wed, Dec 08, 2021 at 08:56:16AM -0400, Jason Gunthorpe wrote:
> > > From a progress perspective I would like to start with simple 'page
> > > tables in userspace', ie no PASID in this step.
> > >
> > > 'page tables in userspace' means an iommufd ioctl to create an
> > > iommu_domain where the IOMMU HW is directly travesering a
> > > device-specific page table structure in user space memory. All the HW
> > > today implements this by using another iommu_domain to allow the
> IOMMU
> > > HW DMA access to user memory - ie nesting or multi-stage or whatever.
> > >
> > > This would come along with some ioctls to invalidate the IOTLB.
> > >
> > > I'm imagining this step as a iommu_group->op->create_user_domain()
> > > driver callback which will create a new kind of domain with
> > > domain-unique ops. Ie map/unmap related should all be NULL as those
> > > are impossible operations.
> > >
> > > From there the usual struct device (ie RID) attach/detatch stuff needs
> > > to take care of routing DMAs to this iommu_domain.
> > >
> > > Step two would be to add the ability for an iommufd using driver to
> > > request that a RID&PASID is connected to an iommu_domain. This
> > > connection can be requested for any kind of iommu_domain, kernel
> owned
> > > or user owned.
> > >
> > > I don't quite have an answer how exactly the SMMUv3 vs Intel
> > > difference in PASID routing should be resolved.
> >
> > In SMMUv3 the user pgd is always stored in the PASID table (actually
> > called "context descriptor table" but I want to avoid confusion with
> > the VT-d "context table"). And to access the PASID table, the SMMUv3 first
> > translate its GPA into a PA using the stage-2 page table. For userspace to
> > pass individual pgds to the kernel, as opposed to passing whole PASID
> > tables, the host kernel needs to reserve GPA space and map it in stage-2,
> > so it can store the PASID table in there. Userspace manages GPA space.
> 
> It is what I thought.. So in the SMMUv3 spec the STE is completely in
> kernel memory, but it points to an S1ContextPtr that must be an IPA if
> the "stage 1 translation tables" are IPA. Only via S1ContextPtr can we
> decode the substream?
> 
> So in SMMUv3 land we don't really ever talk about PASID, we have a
> 'user page table' that is bound to an entire RID and *all* PASIDs.
> 
> While Intel would have a 'user page table' that is only bound to a RID
> & PASID
> 
> Certianly it is not a difference we can hide from userspace.

Concept-wise it is still a 'user page table' with vendor specific format.

Taking your earlier analog it's just for a single 84-bit address space
(20PASID+64bitVA) per RID.

So what we requires is still one unified ioctl in your step-1 to support
per-RID 'user page table'.

For ARM it's SMMU's PASID table format. There is no step-2 since PASID
is already within the address space covered by the user PASID table.

For Intel it's VT-d's 1st level page table format. When moving to step-2
then allows multiple 'user page tables' connected to RID & PASID.

> 
> > This would be easy for a single pgd. In this case the PASID table has a
> > single entry and userspace could just pass one GPA page during
> > registration. However it isn't easily generalized to full PASID support,
> > because managing a multi-level PASID table will require runtime GPA
> > allocation, and that API is awkward. That's why we opted for "attach PASID
> > table" operation rather than "attach page table" (back then the choice was
> > easy since VT-d used the same concept).
> 
> I think the entire context descriptor table should be in userspace,
> and filled in by userspace, as part of the userspace page table.
> 
> The kernel API should accept the S1ContextPtr IPA and all the parts of
> the STE that relate to the defining the layout of what the S1Context
> points to an thats it.
> 
> We should have another mode where the kernel owns everything, and the
> S1ContexPtr is a PA with Stage 2 bypassed.

I guess this is for the usage like DPDK. In that case yes we can have
unified ioctl since the kernel manages everything including the PASID
table. 

> 
> That part is fine, the more open question is what does the driver
> interface look like when userspace tell something like vfio-pci to
> connect to this thing. At some level the attaching device needs to
> authorize iommufd to take the entire PASID table and RID.

as long as smmu driver advocates only supporting step-1 ioctl,
then this authorization should be implied already.

> 
> Specifically we cannot use this thing with a mdev, while the Intel
> version of a userspace page table can be.

yes. Supporting mdev is all the reason why Intel puts the PASID
table in host physical address space to be managed by the kernel.

> 
> Maybe that is just some 'allow whole device' flag in an API
> 

As said, I feel this special flag is not required as long as the 
vendor iommu driver only supports your step-1 interface which
implies 'allow whole device' for ARM.

Thanks
Kevin
Tian, Kevin Dec. 9, 2021, 3:21 a.m. UTC | #9
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, December 8, 2021 8:56 PM
> 
> On Wed, Dec 08, 2021 at 08:33:33AM +0100, Eric Auger wrote:
> > Hi Baolu,
> >
> > On 12/8/21 3:44 AM, Lu Baolu wrote:
> > > Hi Eric,
> > >
> > > On 12/7/21 6:22 PM, Eric Auger wrote:
> > >> On 12/6/21 11:48 AM, Joerg Roedel wrote:
> > >>> On Wed, Oct 27, 2021 at 12:44:20PM +0200, Eric Auger wrote:
> > >>>> Signed-off-by: Jean-Philippe Brucker<jean-philippe.brucker@arm.com>
> > >>>> Signed-off-by: Liu, Yi L<yi.l.liu@linux.intel.com>
> > >>>> Signed-off-by: Ashok Raj<ashok.raj@intel.com>
> > >>>> Signed-off-by: Jacob Pan<jacob.jun.pan@linux.intel.com>
> > >>>> Signed-off-by: Eric Auger<eric.auger@redhat.com>
> > >>> This Signed-of-by chain looks dubious, you are the author but the last
> > >>> one in the chain?
> > >> The 1st RFC in Aug 2018
> > >> (https://lists.cs.columbia.edu/pipermail/kvmarm/2018-
> August/032478.html)
> > >> said this was a generalization of Jacob's patch
> > >>
> > >>
> > >>    [PATCH v5 01/23] iommu: introduce bind_pasid_table API function
> > >>
> > >>
> > >>
> > >> https://lists.linuxfoundation.org/pipermail/iommu/2018-
> May/027647.html
> > >>
> > >> So indeed Jacob should be the author. I guess the multiple rebases got
> > >> this eventually replaced at some point, which is not an excuse. Please
> > >> forgive me for that.
> > >> Now the original patch already had this list of SoB so I don't know if I
> > >> shall simplify it.
> > >
> > > As we have decided to move the nested mode (dual stages)
> implementation
> > > onto the developing iommufd framework, what's the value of adding this
> > > into iommu core?
> >
> > The iommu_uapi_attach_pasid_table uapi should disappear indeed as it is
> > is bound to be replaced by /dev/iommu fellow API.
> > However until I can rebase on /dev/iommu code I am obliged to keep it to
> > maintain this integration, hence the RFC.
> 
> Indeed, we are getting pretty close to having the base iommufd that we
> can start adding stuff like this into. Maybe in January, you can look
> at some parts of what is evolving here:
> 
> https://github.com/jgunthorpe/linux/commits/iommufd
> https://github.com/LuBaolu/intel-iommu/commits/iommu-dma-ownership-
> v2
> https://github.com/luxis1999/iommufd/commits/iommufd-v5.16-rc2
> 
> From a progress perspective I would like to start with simple 'page
> tables in userspace', ie no PASID in this step.
> 
> 'page tables in userspace' means an iommufd ioctl to create an
> iommu_domain where the IOMMU HW is directly travesering a
> device-specific page table structure in user space memory. All the HW
> today implements this by using another iommu_domain to allow the IOMMU
> HW DMA access to user memory - ie nesting or multi-stage or whatever.

One clarification here in case people may get confused based on the
current iommu_domain definition. Jason brainstormed with us on how
to represent 'user page table' in the IOMMU sub-system. One is to
extend iommu_domain as a general representation for any page table
instance. The other option is to create new representations for user
page tables and then link them under existing iommu_domain.

This context is based on the 1st option. and As Jason said in the bottom
we still need to sketch out whether it works as expected. 
Tian, Kevin Dec. 9, 2021, 3:59 a.m. UTC | #10
> From: Tian, Kevin
> Sent: Thursday, December 9, 2021 10:58 AM
> 
> For ARM it's SMMU's PASID table format. There is no step-2 since PASID
> is already within the address space covered by the user PASID table.
> 

One correction here. 'no step-2' is definitely wrong here as it means
more than user page table in your plan (e.g. dpdk).

To simplify it what I meant is:

iommufd reports how many 'user page tables' are supported given a device.

ARM always reports only one can be supported, and it must be created in 
PASID table format. tagged by RID.

Intel reports one in step1 (tagged by RID), and N in step2 (tagged by
RID+PASID). A special flag in attach call allows the user to specify the
additional PASID routing info for a 'user page table'.

Thanks
Kevin
Eric Auger Dec. 9, 2021, 7:50 a.m. UTC | #11
Hi Jason,

On 12/8/21 7:31 PM, Jason Gunthorpe wrote:
> On Wed, Dec 08, 2021 at 05:20:39PM +0000, Jean-Philippe Brucker wrote:
>> On Wed, Dec 08, 2021 at 08:56:16AM -0400, Jason Gunthorpe wrote:
>>> From a progress perspective I would like to start with simple 'page
>>> tables in userspace', ie no PASID in this step.
>>>
>>> 'page tables in userspace' means an iommufd ioctl to create an
>>> iommu_domain where the IOMMU HW is directly travesering a
>>> device-specific page table structure in user space memory. All the HW
>>> today implements this by using another iommu_domain to allow the IOMMU
>>> HW DMA access to user memory - ie nesting or multi-stage or whatever.
>>>
>>> This would come along with some ioctls to invalidate the IOTLB.
>>>
>>> I'm imagining this step as a iommu_group->op->create_user_domain()
>>> driver callback which will create a new kind of domain with
>>> domain-unique ops. Ie map/unmap related should all be NULL as those
>>> are impossible operations.
>>>
>>> From there the usual struct device (ie RID) attach/detatch stuff needs
>>> to take care of routing DMAs to this iommu_domain.
>>>
>>> Step two would be to add the ability for an iommufd using driver to
>>> request that a RID&PASID is connected to an iommu_domain. This
>>> connection can be requested for any kind of iommu_domain, kernel owned
>>> or user owned.
>>>
>>> I don't quite have an answer how exactly the SMMUv3 vs Intel
>>> difference in PASID routing should be resolved.
>> In SMMUv3 the user pgd is always stored in the PASID table (actually
>> called "context descriptor table" but I want to avoid confusion with
>> the VT-d "context table"). And to access the PASID table, the SMMUv3 first
>> translate its GPA into a PA using the stage-2 page table. For userspace to
>> pass individual pgds to the kernel, as opposed to passing whole PASID
>> tables, the host kernel needs to reserve GPA space and map it in stage-2,
>> so it can store the PASID table in there. Userspace manages GPA space.
> It is what I thought.. So in the SMMUv3 spec the STE is completely in
> kernel memory, but it points to an S1ContextPtr that must be an IPA if
> the "stage 1 translation tables" are IPA. Only via S1ContextPtr can we
> decode the substream?
Yes that's correct. S1ContextPtr is the IPA of the L1 Context Descriptor
Table which is then indexed by substreamID.

>
> So in SMMUv3 land we don't really ever talk about PASID, we have a
> 'user page table' that is bound to an entire RID and *all* PASIDs.
in ARM terminology substreamID matches the PASID and this is what
indexes the L1 Context Descriptor Table.

>
> While Intel would have a 'user page table' that is only bound to a RID
> & PASID
>
> Certianly it is not a difference we can hide from userspace.
>  
>> This would be easy for a single pgd. In this case the PASID table has a
>> single entry and userspace could just pass one GPA page during
>> registration. However it isn't easily generalized to full PASID support,
>> because managing a multi-level PASID table will require runtime GPA
>> allocation, and that API is awkward. That's why we opted for "attach PASID
>> table" operation rather than "attach page table" (back then the choice was
>> easy since VT-d used the same concept).
> I think the entire context descriptor table should be in userspace,
> and filled in by userspace, as part of the userspace page table.

In ARM nested mode the L1 Context Descriptor Table is fully managed by
the guest and the userspace only needs to trap its S1ContextPtr and pass
it to the host.
>
> The kernel API should accept the S1ContextPtr IPA and all the parts of
> the STE that relate to the defining the layout of what the S1Context
> points to an thats it.
Yes that's exactly what is done currently. At config time the host must
trap guest STE changes (format and S1ContextPtr) and "incorporate" those
changes into the stage2 related STE information. The STE is owned by the
host kernel as it contains the stage2 information (S2TTB).

In
https://developer.arm.com/documentation/ihi0070/latest
(ARM_IHI_0070_D_b_System_Memory_Management_Unit_Architecture_Specification.pdf)
Synthetic diagrams can be found in 3.3.2 StreamIDs to Context
Descriptors. They give the global view.

Note this series only coped with a single CD in the Context Descriptor
Table.

Thanks

Eric
>
> We should have another mode where the kernel owns everything, and the
> S1ContexPtr is a PA with Stage 2 bypassed.
>
> That part is fine, the more open question is what does the driver
> interface look like when userspace tell something like vfio-pci to
> connect to this thing. At some level the attaching device needs to
> authorize iommufd to take the entire PASID table and RID.
>
> Specifically we cannot use this thing with a mdev, while the Intel
> version of a userspace page table can be.
>
> Maybe that is just some 'allow whole device' flag in an API
>
> Jason
>
Eric Auger Dec. 9, 2021, 8:31 a.m. UTC | #12
Hi Jason,

On 12/8/21 1:56 PM, Jason Gunthorpe wrote:
> On Wed, Dec 08, 2021 at 08:33:33AM +0100, Eric Auger wrote:
>> Hi Baolu,
>>
>> On 12/8/21 3:44 AM, Lu Baolu wrote:
>>> Hi Eric,
>>>
>>> On 12/7/21 6:22 PM, Eric Auger wrote:
>>>> On 12/6/21 11:48 AM, Joerg Roedel wrote:
>>>>> On Wed, Oct 27, 2021 at 12:44:20PM +0200, Eric Auger wrote:
>>>>>> Signed-off-by: Jean-Philippe Brucker<jean-philippe.brucker@arm.com>
>>>>>> Signed-off-by: Liu, Yi L<yi.l.liu@linux.intel.com>
>>>>>> Signed-off-by: Ashok Raj<ashok.raj@intel.com>
>>>>>> Signed-off-by: Jacob Pan<jacob.jun.pan@linux.intel.com>
>>>>>> Signed-off-by: Eric Auger<eric.auger@redhat.com>
>>>>> This Signed-of-by chain looks dubious, you are the author but the last
>>>>> one in the chain?
>>>> The 1st RFC in Aug 2018
>>>> (https://lists.cs.columbia.edu/pipermail/kvmarm/2018-August/032478.html)
>>>> said this was a generalization of Jacob's patch
>>>>
>>>>
>>>>    [PATCH v5 01/23] iommu: introduce bind_pasid_table API function
>>>>
>>>>
>>>>   
>>>> https://lists.linuxfoundation.org/pipermail/iommu/2018-May/027647.html
>>>>
>>>> So indeed Jacob should be the author. I guess the multiple rebases got
>>>> this eventually replaced at some point, which is not an excuse. Please
>>>> forgive me for that.
>>>> Now the original patch already had this list of SoB so I don't know if I
>>>> shall simplify it.
>>> As we have decided to move the nested mode (dual stages) implementation
>>> onto the developing iommufd framework, what's the value of adding this
>>> into iommu core?
>> The iommu_uapi_attach_pasid_table uapi should disappear indeed as it is
>> is bound to be replaced by /dev/iommu fellow API.
>> However until I can rebase on /dev/iommu code I am obliged to keep it to
>> maintain this integration, hence the RFC.
> Indeed, we are getting pretty close to having the base iommufd that we
> can start adding stuff like this into. Maybe in January, you can look
> at some parts of what is evolving here:
>
> https://github.com/jgunthorpe/linux/commits/iommufd
> https://github.com/LuBaolu/intel-iommu/commits/iommu-dma-ownership-v2
> https://github.com/luxis1999/iommufd/commits/iommufd-v5.16-rc2
Interesting. thank you for the preview links. I will have a look asap

Eric
>
> From a progress perspective I would like to start with simple 'page
> tables in userspace', ie no PASID in this step.
>
> 'page tables in userspace' means an iommufd ioctl to create an
> iommu_domain where the IOMMU HW is directly travesering a
> device-specific page table structure in user space memory. All the HW
> today implements this by using another iommu_domain to allow the IOMMU
> HW DMA access to user memory - ie nesting or multi-stage or whatever.
>
> This would come along with some ioctls to invalidate the IOTLB.
>
> I'm imagining this step as a iommu_group->op->create_user_domain()
> driver callback which will create a new kind of domain with
> domain-unique ops. Ie map/unmap related should all be NULL as those
> are impossible operations.
>
> From there the usual struct device (ie RID) attach/detatch stuff needs
> to take care of routing DMAs to this iommu_domain.
>
> Step two would be to add the ability for an iommufd using driver to
> request that a RID&PASID is connected to an iommu_domain. This
> connection can be requested for any kind of iommu_domain, kernel owned
> or user owned.
>
> I don't quite have an answer how exactly the SMMUv3 vs Intel
> difference in PASID routing should be resolved.
>
> to get answers I'm hoping to start building some sketch RFCs for these
> different things on iommufd, hopefully in January. I'm looking at user
> page tables, PASID, dirty tracking and userspace IO fault handling as
> the main features iommufd must tackle.
>
> The purpose of the sketches would be to validate that the HW features
> we want to exposed can work will with the choices the base is making.
>
> Jason
>
Eric Auger Dec. 9, 2021, 9:44 a.m. UTC | #13
Hi Kevin,

On 12/9/21 4:21 AM, Tian, Kevin wrote:
>> From: Jason Gunthorpe <jgg@nvidia.com>
>> Sent: Wednesday, December 8, 2021 8:56 PM
>>
>> On Wed, Dec 08, 2021 at 08:33:33AM +0100, Eric Auger wrote:
>>> Hi Baolu,
>>>
>>> On 12/8/21 3:44 AM, Lu Baolu wrote:
>>>> Hi Eric,
>>>>
>>>> On 12/7/21 6:22 PM, Eric Auger wrote:
>>>>> On 12/6/21 11:48 AM, Joerg Roedel wrote:
>>>>>> On Wed, Oct 27, 2021 at 12:44:20PM +0200, Eric Auger wrote:
>>>>>>> Signed-off-by: Jean-Philippe Brucker<jean-philippe.brucker@arm.com>
>>>>>>> Signed-off-by: Liu, Yi L<yi.l.liu@linux.intel.com>
>>>>>>> Signed-off-by: Ashok Raj<ashok.raj@intel.com>
>>>>>>> Signed-off-by: Jacob Pan<jacob.jun.pan@linux.intel.com>
>>>>>>> Signed-off-by: Eric Auger<eric.auger@redhat.com>
>>>>>> This Signed-of-by chain looks dubious, you are the author but the last
>>>>>> one in the chain?
>>>>> The 1st RFC in Aug 2018
>>>>> (https://lists.cs.columbia.edu/pipermail/kvmarm/2018-
>> August/032478.html)
>>>>> said this was a generalization of Jacob's patch
>>>>>
>>>>>
>>>>>    [PATCH v5 01/23] iommu: introduce bind_pasid_table API function
>>>>>
>>>>>
>>>>>
>>>>> https://lists.linuxfoundation.org/pipermail/iommu/2018-
>> May/027647.html
>>>>> So indeed Jacob should be the author. I guess the multiple rebases got
>>>>> this eventually replaced at some point, which is not an excuse. Please
>>>>> forgive me for that.
>>>>> Now the original patch already had this list of SoB so I don't know if I
>>>>> shall simplify it.
>>>> As we have decided to move the nested mode (dual stages)
>> implementation
>>>> onto the developing iommufd framework, what's the value of adding this
>>>> into iommu core?
>>> The iommu_uapi_attach_pasid_table uapi should disappear indeed as it is
>>> is bound to be replaced by /dev/iommu fellow API.
>>> However until I can rebase on /dev/iommu code I am obliged to keep it to
>>> maintain this integration, hence the RFC.
>> Indeed, we are getting pretty close to having the base iommufd that we
>> can start adding stuff like this into. Maybe in January, you can look
>> at some parts of what is evolving here:
>>
>> https://github.com/jgunthorpe/linux/commits/iommufd
>> https://github.com/LuBaolu/intel-iommu/commits/iommu-dma-ownership-
>> v2
>> https://github.com/luxis1999/iommufd/commits/iommufd-v5.16-rc2
>>
>> From a progress perspective I would like to start with simple 'page
>> tables in userspace', ie no PASID in this step.
>>
>> 'page tables in userspace' means an iommufd ioctl to create an
>> iommu_domain where the IOMMU HW is directly travesering a
>> device-specific page table structure in user space memory. All the HW
>> today implements this by using another iommu_domain to allow the IOMMU
>> HW DMA access to user memory - ie nesting or multi-stage or whatever.
> One clarification here in case people may get confused based on the
> current iommu_domain definition. Jason brainstormed with us on how
> to represent 'user page table' in the IOMMU sub-system. One is to
> extend iommu_domain as a general representation for any page table
> instance. The other option is to create new representations for user
> page tables and then link them under existing iommu_domain.
>
> This context is based on the 1st option. and As Jason said in the bottom
> we still need to sketch out whether it works as expected. 
Jason Gunthorpe Dec. 9, 2021, 3:40 p.m. UTC | #14
On Thu, Dec 09, 2021 at 08:50:04AM +0100, Eric Auger wrote:

> > The kernel API should accept the S1ContextPtr IPA and all the parts of
> > the STE that relate to the defining the layout of what the S1Context
> > points to an thats it.

> Yes that's exactly what is done currently. At config time the host must
> trap guest STE changes (format and S1ContextPtr) and "incorporate" those
> changes into the stage2 related STE information. The STE is owned by the
> host kernel as it contains the stage2 information (S2TTB).

[..]

> Note this series only coped with a single CD in the Context Descriptor
> Table.

I'm confused, where does this limit arise?

The kernel accepts as input all the bits in the STE that describe the
layout of the CDT owned by userspace, shouldn't userspace be able to
construct all forms of CDT with any number of CDs in them?

Or do you mean this is some qemu limitation?

Jason
Jason Gunthorpe Dec. 9, 2021, 4:08 p.m. UTC | #15
On Thu, Dec 09, 2021 at 03:59:57AM +0000, Tian, Kevin wrote:
> > From: Tian, Kevin
> > Sent: Thursday, December 9, 2021 10:58 AM
> > 
> > For ARM it's SMMU's PASID table format. There is no step-2 since PASID
> > is already within the address space covered by the user PASID table.
> > 
> 
> One correction here. 'no step-2' is definitely wrong here as it means
> more than user page table in your plan (e.g. dpdk).
> 
> To simplify it what I meant is:
> 
> iommufd reports how many 'user page tables' are supported given a device.
> 
> ARM always reports only one can be supported, and it must be created in 
> PASID table format. tagged by RID.
> 
> Intel reports one in step1 (tagged by RID), and N in step2 (tagged by
> RID+PASID). A special flag in attach call allows the user to specify the
> additional PASID routing info for a 'user page table'.

I don't think 'number of user page tables' makes sense

It really is 'attach to the whole device' vs 'attach to the RID' as a
semantic that should exist 

If we imagine a userspace using kernel page tables it certainly makes
sense to assign page table A to the RID and page table B to a PASID
even in simple cases like vfio-pci.

The only case where userspace would want to capture the entire RID and
all PASIDs is something like this ARM situation - but userspace just
created a device specific object and already knows exactly what kind
of behavior it has.

So, something like vfio pci would implement three uAPI operations:
 - Attach page table to RID
 - Attach page table to PASID
 - Attach page table to RID and all PASIDs
   And here 'page table' is everything below the STE in SMMUv3

While mdev can only support:
 - Access emulated page table
 - Attach page table to PASID

It is what I've said a couple of times, the API the driver calls
toward iommufd to attach a page table must be unambiguous as to the
intention, which also means userspace must be unambiguous too.

Jason
Eric Auger Dec. 9, 2021, 4:37 p.m. UTC | #16
Hi Jason,

On 12/9/21 4:40 PM, Jason Gunthorpe wrote:
> On Thu, Dec 09, 2021 at 08:50:04AM +0100, Eric Auger wrote:
>
>>> The kernel API should accept the S1ContextPtr IPA and all the parts of
>>> the STE that relate to the defining the layout of what the S1Context
>>> points to an thats it.
>> Yes that's exactly what is done currently. At config time the host must
>> trap guest STE changes (format and S1ContextPtr) and "incorporate" those
>> changes into the stage2 related STE information. The STE is owned by the
>> host kernel as it contains the stage2 information (S2TTB).
> [..]
>
>> Note this series only coped with a single CD in the Context Descriptor
>> Table.
> I'm confused, where does this limit arise?
>
> The kernel accepts as input all the bits in the STE that describe the
> layout of the CDT owned by userspace, shouldn't userspace be able to
> construct all forms of CDT with any number of CDs in them?
>
> Or do you mean this is some qemu limitation?
The upstream vSMMUv3 emulation does not support multiple CDs at the
moment and since I have no proper means to validate the vSVA case I am
rejecting any attempt from user space to inject guest configs featuring
mutliple PASIDs. Also PASID cache invalidation must be added to this series.

Nevertheless those limitations were tackled and overcomed by others in
CC so I don't think there is any blocking issue.

Thanks

Eric
>
> Jason
>
Tian, Kevin Dec. 10, 2021, 8:56 a.m. UTC | #17
> From: Jason Gunthorpe via iommu
> Sent: Friday, December 10, 2021 12:08 AM
> 
> On Thu, Dec 09, 2021 at 03:59:57AM +0000, Tian, Kevin wrote:
> > > From: Tian, Kevin
> > > Sent: Thursday, December 9, 2021 10:58 AM
> > >
> > > For ARM it's SMMU's PASID table format. There is no step-2 since PASID
> > > is already within the address space covered by the user PASID table.
> > >
> >
> > One correction here. 'no step-2' is definitely wrong here as it means
> > more than user page table in your plan (e.g. dpdk).
> >
> > To simplify it what I meant is:
> >
> > iommufd reports how many 'user page tables' are supported given a device.
> >
> > ARM always reports only one can be supported, and it must be created in
> > PASID table format. tagged by RID.
> >
> > Intel reports one in step1 (tagged by RID), and N in step2 (tagged by
> > RID+PASID). A special flag in attach call allows the user to specify the
> > additional PASID routing info for a 'user page table'.
> 
> I don't think 'number of user page tables' makes sense
> 
> It really is 'attach to the whole device' vs 'attach to the RID' as a
> semantic that should exist
> 
> If we imagine a userspace using kernel page tables it certainly makes
> sense to assign page table A to the RID and page table B to a PASID
> even in simple cases like vfio-pci.
> 
> The only case where userspace would want to capture the entire RID and
> all PASIDs is something like this ARM situation - but userspace just
> created a device specific object and already knows exactly what kind
> of behavior it has.
> 
> So, something like vfio pci would implement three uAPI operations:
>  - Attach page table to RID
>  - Attach page table to PASID
>  - Attach page table to RID and all PASIDs
>    And here 'page table' is everything below the STE in SMMUv3
> 
> While mdev can only support:
>  - Access emulated page table
>  - Attach page table to PASID

mdev is a pci device from user p.o.v, having its vRID and vPASID. From
this angle the uAPI is no different from vfio-pci (except the ARM one):

  - (sw mdev) Attach emulated page table to vRID (no iommu domain)
  - (hw mdev) Attach page table to vRID (mapped to mdev PASID)
  - (hw mdev) Attach page table to vPASID (mapped to a fungible PASID)

> 
> It is what I've said a couple of times, the API the driver calls
> toward iommufd to attach a page table must be unambiguous as to the
> intention, which also means userspace must be unambiguous too.
> 

No question on the unambiguous part. But we also need to consider
the common semantics that can be abstracted.

From user p.o.v a vRID can be attached to at most two page tables (if
nesting is enabled). This just requires the basic attaching form for 
either one page table or two page tables:

	at_data = {
		.iommufd	= xxx;
		.pgtable_id	= yyy;
	};
	ioctl(device_fd, VFIO_DEVICE_ATTACH_PGTABLE, &at_data);

This can already cover ARM's requirement. The user page table
attached to vRID is in vendor specific format, e.g. either ARM pasid 
table format or Intel stage-1 format. For ARM pasid_table + underlying 
stage-1 page tables can be considered as a single big paging structure.

From this angle I'm not sure the benefit of making a separate uAPI 
just because it's a pasid table for ARM.

Then when PASID needs to be explicitly specified (e.g. in Intel case):

	at_data = {
		.iommufd	= xxx;
		.pgtable_id	= yyy;
		.flags 		= VFIO_ATTACH_FLAGS_PASID;
		.pasid		= zzz;
	};
	ioctl(device_fd, VFIO_DEVICE_ATTACH_PGTABLE, &at_data);

Again, I don't think what a simple flag can solve needs to be made
into a separate uAPI.

Is modeling like above considered ambiguous?

Thanks
Kevin
Jason Gunthorpe Dec. 10, 2021, 1:23 p.m. UTC | #18
On Fri, Dec 10, 2021 at 08:56:56AM +0000, Tian, Kevin wrote:
> > So, something like vfio pci would implement three uAPI operations:
> >  - Attach page table to RID
> >  - Attach page table to PASID
> >  - Attach page table to RID and all PASIDs
> >    And here 'page table' is everything below the STE in SMMUv3
> > 
> > While mdev can only support:
> >  - Access emulated page table
> >  - Attach page table to PASID
> 
> mdev is a pci device from user p.o.v, having its vRID and vPASID. From
> this angle the uAPI is no different from vfio-pci (except the ARM one):

No, it isn't. The internal operation is completely different, and it
must call different iommufd APIs than vfio-pci does.

This is user visible - mdev can never be attached to an ARM user page
table, for instance.

For iommufd there is no vRID, vPASID or any confusing stuff like
that. You'll have an easier time if you stop thinking in these terms.

We probably end up with three iommufd calls:
 int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id, unsigned int flags)
 int iommufd_device_attach_pasid(struct iommufd_device *idev, u32 *pt_id, unsigned int flags, ioasid_t *pasid)
 int iommufd_device_attach_sw_iommu(struct iommufd_device *idev, u32 pt_id);

And the uAPI from VFIO must map onto them.

vfio-pci:
  - 'VFIO_SET_CONTAINER' does
    iommufd_device_attach(idev, &pt_id, IOMMUFD_FULL_DEVICE);
    # IOMMU specific if this captures PASIDs or cause them to fail,
    # but IOMMUFD_FULL_DEVICE will prevent attaching any PASID
    # later on all iommu's.

vfio-mdev:
  - 'VFIO_SET_CONTAINER' does one of:
    iommufd_device_attach_pasid(idev, &pt_id, IOMMUFD_ASSIGN_PASID, &pasid);
    iommufd_device_attach_sw_iommu(idev, pt_id);

That is three of the cases.

Then we have new ioctls for the other cases:

vfio-pci:
  - 'bind only the RID, so we can use PASID'
    iommufd_device_attach(idev, &pt_id, 0);
  - 'bind to a specific PASID'
    iommufd_device_attach_pasid(idev, &pt_id, 0, &pasid);

vfio-mdev:
  - 'like VFIO_SET_CONTAINER but bind to a specific PASID'
    iommufd_device_attach_pasid(idev, &pt_id, 0, &pasid);

The iommu driver will block attachments that are incompatible, ie ARM
user page tables only work with:
 iommufd_device_attach(idev, &pt_id, IOMMUFD_FULL_DEVICE)
all other calls fail.

How exactly we put all of this into new ioctls, I'm not sure, but it
does seem pretty clear this is what the iommufd kAPI will need to look
like to cover the cases we know about already.

As you can see, userpace needs to understand what mode it is operating
in. If it does IOMMUFD_FULL_DEVICE and manages PASID somehow in
userspace, or it doesn't and can use the iommufd_device_attach_pasid()
paths.

> Is modeling like above considered ambiguous?

You've skipped straight to the ioctls without designing the kernel API
to meet all the requirements  :)

Jason
Tian, Kevin Dec. 11, 2021, 3:57 a.m. UTC | #19
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, December 10, 2021 9:23 PM
> 
> On Fri, Dec 10, 2021 at 08:56:56AM +0000, Tian, Kevin wrote:
> > > So, something like vfio pci would implement three uAPI operations:
> > >  - Attach page table to RID
> > >  - Attach page table to PASID
> > >  - Attach page table to RID and all PASIDs
> > >    And here 'page table' is everything below the STE in SMMUv3
> > >
> > > While mdev can only support:
> > >  - Access emulated page table
> > >  - Attach page table to PASID
> >
> > mdev is a pci device from user p.o.v, having its vRID and vPASID. From
> > this angle the uAPI is no different from vfio-pci (except the ARM one):
> 
> No, it isn't. The internal operation is completely different, and it
> must call different iommufd APIs than vfio-pci does.

Well, you mentioned "uAPI operations" thus my earlier comment 
is purely from uAPI p.o.v instead of internal iommufd APIs (not meant
I didn't think of them). I think this is the main divergence in this 
discussion as when I saw you said "while mdev can only support" 
I assume it's still about uAPI (more specifically VFIO uAPI as it carries 
the attach call to iommufd).

> 
> This is user visible - mdev can never be attached to an ARM user page
> table, for instance.

sure. the iommu driver will fail the attach request when seeing
incompatible way is used.

> 
> For iommufd there is no vRID, vPASID or any confusing stuff like
> that. You'll have an easier time if you stop thinking in these terms.

I don't have a difficulty here as from vfio uAPI p.o.v it's about
vRID and vPASID. But there is NO any confusion on iommufd which
should only deal with physical thing. This has been settled down
long time ago in high level design discussion. 
Tian, Kevin Dec. 11, 2021, 5:18 a.m. UTC | #20
> From: Tian, Kevin
> Sent: Saturday, December 11, 2021 11:58 AM
>
> This might be the only open as I still didn't see why we need an
> explicit flag to claim a 'full device' thing. From kernel p.o.v the
> ARM case is no different from Intel that both allows an user
> page table attached to vRID, just with different format and

obviously this is 'RID' to not cause further confusion since it
talks about the kernel p.o.v

> addr width (Intel is 64bit, ARM is 84bit where PASID can be
> considered a sub-handle in the 84bit address space and not
> the kernel's business).
> 
> and ARM doesn't support explicit PASID attach then those calls
> will fail for sure.

Thanks
Kevin
Jason Gunthorpe Dec. 16, 2021, 8:48 p.m. UTC | #21
On Sat, Dec 11, 2021 at 03:57:45AM +0000, Tian, Kevin wrote:

> This might be the only open as I still didn't see why we need an
> explicit flag to claim a 'full device' thing. From kernel p.o.v the
> ARM case is no different from Intel that both allows an user
> page table attached to vRID, just with different format and
> addr width (Intel is 64bit, ARM is 84bit where PASID can be
> considered a sub-handle in the 84bit address space and not
> the kernel's business).

I think the difference is intention.

In one case the kernel is saying 'attach a RID and I intend to use
PASID' in which case the kernel user can call the PASID APIs.

The second case is saying 'I will not use PASID'.

They are different things and I think it is a surprising API if the
kernel user attaches a domain, intends to use PASID and then finds out
it can't, eg because an ARM user page table was hooked up.

If you imagine the flag as 'I intend to use PASID' I think it makes a
fair amount of sense from an API design too.

We could probably do without it, at least for VFIO and qemu cases, but
it seems a little bit peculiar to me.

Jason
Tian, Kevin Jan. 4, 2022, 2:42 a.m. UTC | #22
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, December 17, 2021 4:49 AM
> 
> On Sat, Dec 11, 2021 at 03:57:45AM +0000, Tian, Kevin wrote:
> 
> > This might be the only open as I still didn't see why we need an
> > explicit flag to claim a 'full device' thing. From kernel p.o.v the
> > ARM case is no different from Intel that both allows an user
> > page table attached to vRID, just with different format and
> > addr width (Intel is 64bit, ARM is 84bit where PASID can be
> > considered a sub-handle in the 84bit address space and not
> > the kernel's business).
> 
> I think the difference is intention.
> 
> In one case the kernel is saying 'attach a RID and I intend to use
> PASID' in which case the kernel user can call the PASID APIs.
> 
> The second case is saying 'I will not use PASID'.
> 
> They are different things and I think it is a surprising API if the
> kernel user attaches a domain, intends to use PASID and then finds out
> it can't, eg because an ARM user page table was hooked up.
> 
> If you imagine the flag as 'I intend to use PASID' I think it makes a
> fair amount of sense from an API design too.
> 
> We could probably do without it, at least for VFIO and qemu cases, but
> it seems a little bit peculiar to me.
> 

ok, combining the kernel user makes the flag more sensible.

Thanks
Kevin
diff mbox series

Patch

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 3303d707bab4..6033c263c6e6 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2236,6 +2236,75 @@  int iommu_uapi_sva_unbind_gpasid(struct iommu_domain *domain, struct device *dev
 }
 EXPORT_SYMBOL_GPL(iommu_uapi_sva_unbind_gpasid);
 
+int iommu_attach_pasid_table(struct iommu_domain *domain,
+			     struct iommu_pasid_table_config *cfg)
+{
+	if (unlikely(!domain->ops->attach_pasid_table))
+		return -ENODEV;
+
+	return domain->ops->attach_pasid_table(domain, cfg);
+}
+EXPORT_SYMBOL_GPL(iommu_attach_pasid_table);
+
+int iommu_uapi_attach_pasid_table(struct iommu_domain *domain,
+				  void __user *uinfo)
+{
+	struct iommu_pasid_table_config pasid_table_data = { 0 };
+	u32 minsz;
+
+	if (unlikely(!domain->ops->attach_pasid_table))
+		return -ENODEV;
+
+	/*
+	 * No new spaces can be added before the variable sized union, the
+	 * minimum size is the offset to the union.
+	 */
+	minsz = offsetof(struct iommu_pasid_table_config, vendor_data);
+
+	/* Copy minsz from user to get flags and argsz */
+	if (copy_from_user(&pasid_table_data, uinfo, minsz))
+		return -EFAULT;
+
+	/* Fields before the variable size union are mandatory */
+	if (pasid_table_data.argsz < minsz)
+		return -EINVAL;
+
+	/* PASID and address granu require additional info beyond minsz */
+	if (pasid_table_data.version != PASID_TABLE_CFG_VERSION_1)
+		return -EINVAL;
+	if (pasid_table_data.format == IOMMU_PASID_FORMAT_SMMUV3 &&
+	    pasid_table_data.argsz <
+		offsetofend(struct iommu_pasid_table_config, vendor_data.smmuv3))
+		return -EINVAL;
+
+	/*
+	 * User might be using a newer UAPI header which has a larger data
+	 * size, we shall support the existing flags within the current
+	 * size. Copy the remaining user data _after_ minsz but not more
+	 * than the current kernel supported size.
+	 */
+	if (copy_from_user((void *)&pasid_table_data + minsz, uinfo + minsz,
+			   min_t(u32, pasid_table_data.argsz, sizeof(pasid_table_data)) - minsz))
+		return -EFAULT;
+
+	/* Now the argsz is validated, check the content */
+	if (pasid_table_data.config < IOMMU_PASID_CONFIG_TRANSLATE ||
+	    pasid_table_data.config > IOMMU_PASID_CONFIG_ABORT)
+		return -EINVAL;
+
+	return domain->ops->attach_pasid_table(domain, &pasid_table_data);
+}
+EXPORT_SYMBOL_GPL(iommu_uapi_attach_pasid_table);
+
+void iommu_detach_pasid_table(struct iommu_domain *domain)
+{
+	if (unlikely(!domain->ops->detach_pasid_table))
+		return;
+
+	domain->ops->detach_pasid_table(domain);
+}
+EXPORT_SYMBOL_GPL(iommu_detach_pasid_table);
+
 static void __iommu_detach_device(struct iommu_domain *domain,
 				  struct device *dev)
 {
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index d2f3435e7d17..e34a1b1c805b 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -232,6 +232,8 @@  struct iommu_iotlb_gather {
  * @cache_invalidate: invalidate translation caches
  * @sva_bind_gpasid: bind guest pasid and mm
  * @sva_unbind_gpasid: unbind guest pasid and mm
+ * @attach_pasid_table: attach a pasid table
+ * @detach_pasid_table: detach the pasid table
  * @def_domain_type: device default domain type, return value:
  *		- IOMMU_DOMAIN_IDENTITY: must use an identity domain
  *		- IOMMU_DOMAIN_DMA: must use a dma domain
@@ -297,6 +299,9 @@  struct iommu_ops {
 				      void *drvdata);
 	void (*sva_unbind)(struct iommu_sva *handle);
 	u32 (*sva_get_pasid)(struct iommu_sva *handle);
+	int (*attach_pasid_table)(struct iommu_domain *domain,
+				  struct iommu_pasid_table_config *cfg);
+	void (*detach_pasid_table)(struct iommu_domain *domain);
 
 	int (*page_response)(struct device *dev,
 			     struct iommu_fault_event *evt,
@@ -430,6 +435,11 @@  extern int iommu_uapi_sva_unbind_gpasid(struct iommu_domain *domain,
 					struct device *dev, void __user *udata);
 extern int iommu_sva_unbind_gpasid(struct iommu_domain *domain,
 				   struct device *dev, ioasid_t pasid);
+extern int iommu_attach_pasid_table(struct iommu_domain *domain,
+				    struct iommu_pasid_table_config *cfg);
+extern int iommu_uapi_attach_pasid_table(struct iommu_domain *domain,
+					 void __user *udata);
+extern void iommu_detach_pasid_table(struct iommu_domain *domain);
 extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
 extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
 extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
@@ -1035,6 +1045,23 @@  iommu_aux_get_pasid(struct iommu_domain *domain, struct device *dev)
 	return -ENODEV;
 }
 
+static inline
+int iommu_attach_pasid_table(struct iommu_domain *domain,
+			     struct iommu_pasid_table_config *cfg)
+{
+	return -ENODEV;
+}
+
+static inline
+int iommu_uapi_attach_pasid_table(struct iommu_domain *domain,
+				  void __user *uinfo)
+{
+	return -ENODEV;
+}
+
+static inline
+void iommu_detach_pasid_table(struct iommu_domain *domain) {}
+
 static inline struct iommu_sva *
 iommu_sva_bind_device(struct device *dev, struct mm_struct *mm, void *drvdata)
 {
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index 59178fc229ca..8c079a78dfec 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -339,4 +339,58 @@  struct iommu_gpasid_bind_data {
 	} vendor;
 };
 
+/**
+ * struct iommu_pasid_smmuv3 - ARM SMMUv3 Stream Table Entry stage 1 related
+ *     information
+ * @version: API version of this structure
+ * @s1fmt: STE s1fmt (format of the CD table: single CD, linear table
+ *         or 2-level table)
+ * @s1dss: STE s1dss (specifies the behavior when @pasid_bits != 0
+ *         and no PASID is passed along with the incoming transaction)
+ * @padding: reserved for future use (should be zero)
+ *
+ * The PASID table is referred to as the Context Descriptor (CD) table on ARM
+ * SMMUv3. Please refer to the ARM SMMU 3.x spec (ARM IHI 0070A) for full
+ * details.
+ */
+struct iommu_pasid_smmuv3 {
+#define PASID_TABLE_SMMUV3_CFG_VERSION_1 1
+	__u32	version;
+	__u8	s1fmt;
+	__u8	s1dss;
+	__u8	padding[2];
+};
+
+/**
+ * struct iommu_pasid_table_config - PASID table data used to bind guest PASID
+ *     table to the host IOMMU
+ * @argsz: User filled size of this data
+ * @version: API version to prepare for future extensions
+ * @base_ptr: guest physical address of the PASID table
+ * @format: format of the PASID table
+ * @pasid_bits: number of PASID bits used in the PASID table
+ * @config: indicates whether the guest translation stage must
+ *          be translated, bypassed or aborted.
+ * @padding: reserved for future use (should be zero)
+ * @vendor_data.smmuv3: table information when @format is
+ * %IOMMU_PASID_FORMAT_SMMUV3
+ */
+struct iommu_pasid_table_config {
+	__u32	argsz;
+#define PASID_TABLE_CFG_VERSION_1 1
+	__u32	version;
+	__u64	base_ptr;
+#define IOMMU_PASID_FORMAT_SMMUV3	1
+	__u32	format;
+	__u8	pasid_bits;
+#define IOMMU_PASID_CONFIG_TRANSLATE	1
+#define IOMMU_PASID_CONFIG_BYPASS	2
+#define IOMMU_PASID_CONFIG_ABORT	3
+	__u8	config;
+	__u8    padding[2];
+	union {
+		struct iommu_pasid_smmuv3 smmuv3;
+	} vendor_data;
+};
+
 #endif /* _UAPI_IOMMU_H */