Message ID | 20211027104428.1059740-2-eric.auger@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | SMMUv3 Nested Stage Setup (IOMMU part) | expand |
On Wed, Oct 27, 2021 at 12:44:20PM +0200, Eric Auger wrote: > Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> > Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com> > Signed-off-by: Ashok Raj <ashok.raj@intel.com> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> > Signed-off-by: Eric Auger <eric.auger@redhat.com> This Signed-of-by chain looks dubious, you are the author but the last one in the chain? > +int iommu_uapi_attach_pasid_table(struct iommu_domain *domain, > + void __user *uinfo) > +{ [...] > + if (pasid_table_data.format == IOMMU_PASID_FORMAT_SMMUV3 && > + pasid_table_data.argsz < > + offsetofend(struct iommu_pasid_table_config, vendor_data.smmuv3)) > + return -EINVAL; This check looks like it belongs in driver specific code. Regards, Joerg
Hi Joerg, On 12/6/21 11:48 AM, Joerg Roedel wrote: > On Wed, Oct 27, 2021 at 12:44:20PM +0200, Eric Auger wrote: >> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> >> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com> >> Signed-off-by: Ashok Raj <ashok.raj@intel.com> >> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> >> Signed-off-by: Eric Auger <eric.auger@redhat.com> > This Signed-of-by chain looks dubious, you are the author but the last > one in the chain? The 1st RFC in Aug 2018 (https://lists.cs.columbia.edu/pipermail/kvmarm/2018-August/032478.html) said this was a generalization of Jacob's patch [PATCH v5 01/23] iommu: introduce bind_pasid_table API function https://lists.linuxfoundation.org/pipermail/iommu/2018-May/027647.html So indeed Jacob should be the author. I guess the multiple rebases got this eventually replaced at some point, which is not an excuse. Please forgive me for that. Now the original patch already had this list of SoB so I don't know if I shall simplify it. > >> +int iommu_uapi_attach_pasid_table(struct iommu_domain *domain, >> + void __user *uinfo) >> +{ > [...] > >> + if (pasid_table_data.format == IOMMU_PASID_FORMAT_SMMUV3 && >> + pasid_table_data.argsz < >> + offsetofend(struct iommu_pasid_table_config, vendor_data.smmuv3)) >> + return -EINVAL; > This check looks like it belongs in driver specific code. Indeed, I will fix that in my next respin :-) Thanks! Eric > > Regards, > > Joerg >
Hi Eric, On 12/7/21 6:22 PM, Eric Auger wrote: > On 12/6/21 11:48 AM, Joerg Roedel wrote: >> On Wed, Oct 27, 2021 at 12:44:20PM +0200, Eric Auger wrote: >>> Signed-off-by: Jean-Philippe Brucker<jean-philippe.brucker@arm.com> >>> Signed-off-by: Liu, Yi L<yi.l.liu@linux.intel.com> >>> Signed-off-by: Ashok Raj<ashok.raj@intel.com> >>> Signed-off-by: Jacob Pan<jacob.jun.pan@linux.intel.com> >>> Signed-off-by: Eric Auger<eric.auger@redhat.com> >> This Signed-of-by chain looks dubious, you are the author but the last >> one in the chain? > The 1st RFC in Aug 2018 > (https://lists.cs.columbia.edu/pipermail/kvmarm/2018-August/032478.html) > said this was a generalization of Jacob's patch > > > [PATCH v5 01/23] iommu: introduce bind_pasid_table API function > > > https://lists.linuxfoundation.org/pipermail/iommu/2018-May/027647.html > > So indeed Jacob should be the author. I guess the multiple rebases got > this eventually replaced at some point, which is not an excuse. Please > forgive me for that. > Now the original patch already had this list of SoB so I don't know if I > shall simplify it. As we have decided to move the nested mode (dual stages) implementation onto the developing iommufd framework, what's the value of adding this into iommu core? Best regards, baolu
Hi Baolu, On 12/8/21 3:44 AM, Lu Baolu wrote: > Hi Eric, > > On 12/7/21 6:22 PM, Eric Auger wrote: >> On 12/6/21 11:48 AM, Joerg Roedel wrote: >>> On Wed, Oct 27, 2021 at 12:44:20PM +0200, Eric Auger wrote: >>>> Signed-off-by: Jean-Philippe Brucker<jean-philippe.brucker@arm.com> >>>> Signed-off-by: Liu, Yi L<yi.l.liu@linux.intel.com> >>>> Signed-off-by: Ashok Raj<ashok.raj@intel.com> >>>> Signed-off-by: Jacob Pan<jacob.jun.pan@linux.intel.com> >>>> Signed-off-by: Eric Auger<eric.auger@redhat.com> >>> This Signed-of-by chain looks dubious, you are the author but the last >>> one in the chain? >> The 1st RFC in Aug 2018 >> (https://lists.cs.columbia.edu/pipermail/kvmarm/2018-August/032478.html) >> said this was a generalization of Jacob's patch >> >> >> [PATCH v5 01/23] iommu: introduce bind_pasid_table API function >> >> >> >> https://lists.linuxfoundation.org/pipermail/iommu/2018-May/027647.html >> >> So indeed Jacob should be the author. I guess the multiple rebases got >> this eventually replaced at some point, which is not an excuse. Please >> forgive me for that. >> Now the original patch already had this list of SoB so I don't know if I >> shall simplify it. > > As we have decided to move the nested mode (dual stages) implementation > onto the developing iommufd framework, what's the value of adding this > into iommu core? The iommu_uapi_attach_pasid_table uapi should disappear indeed as it is is bound to be replaced by /dev/iommu fellow API. However until I can rebase on /dev/iommu code I am obliged to keep it to maintain this integration, hence the RFC. Thanks Eric > > Best regards, > baolu >
On Wed, Dec 08, 2021 at 08:33:33AM +0100, Eric Auger wrote: > Hi Baolu, > > On 12/8/21 3:44 AM, Lu Baolu wrote: > > Hi Eric, > > > > On 12/7/21 6:22 PM, Eric Auger wrote: > >> On 12/6/21 11:48 AM, Joerg Roedel wrote: > >>> On Wed, Oct 27, 2021 at 12:44:20PM +0200, Eric Auger wrote: > >>>> Signed-off-by: Jean-Philippe Brucker<jean-philippe.brucker@arm.com> > >>>> Signed-off-by: Liu, Yi L<yi.l.liu@linux.intel.com> > >>>> Signed-off-by: Ashok Raj<ashok.raj@intel.com> > >>>> Signed-off-by: Jacob Pan<jacob.jun.pan@linux.intel.com> > >>>> Signed-off-by: Eric Auger<eric.auger@redhat.com> > >>> This Signed-of-by chain looks dubious, you are the author but the last > >>> one in the chain? > >> The 1st RFC in Aug 2018 > >> (https://lists.cs.columbia.edu/pipermail/kvmarm/2018-August/032478.html) > >> said this was a generalization of Jacob's patch > >> > >> > >> [PATCH v5 01/23] iommu: introduce bind_pasid_table API function > >> > >> > >> > >> https://lists.linuxfoundation.org/pipermail/iommu/2018-May/027647.html > >> > >> So indeed Jacob should be the author. I guess the multiple rebases got > >> this eventually replaced at some point, which is not an excuse. Please > >> forgive me for that. > >> Now the original patch already had this list of SoB so I don't know if I > >> shall simplify it. > > > > As we have decided to move the nested mode (dual stages) implementation > > onto the developing iommufd framework, what's the value of adding this > > into iommu core? > > The iommu_uapi_attach_pasid_table uapi should disappear indeed as it is > is bound to be replaced by /dev/iommu fellow API. > However until I can rebase on /dev/iommu code I am obliged to keep it to > maintain this integration, hence the RFC. Indeed, we are getting pretty close to having the base iommufd that we can start adding stuff like this into. Maybe in January, you can look at some parts of what is evolving here: https://github.com/jgunthorpe/linux/commits/iommufd https://github.com/LuBaolu/intel-iommu/commits/iommu-dma-ownership-v2 https://github.com/luxis1999/iommufd/commits/iommufd-v5.16-rc2 From a progress perspective I would like to start with simple 'page tables in userspace', ie no PASID in this step. 'page tables in userspace' means an iommufd ioctl to create an iommu_domain where the IOMMU HW is directly travesering a device-specific page table structure in user space memory. All the HW today implements this by using another iommu_domain to allow the IOMMU HW DMA access to user memory - ie nesting or multi-stage or whatever. This would come along with some ioctls to invalidate the IOTLB. I'm imagining this step as a iommu_group->op->create_user_domain() driver callback which will create a new kind of domain with domain-unique ops. Ie map/unmap related should all be NULL as those are impossible operations. From there the usual struct device (ie RID) attach/detatch stuff needs to take care of routing DMAs to this iommu_domain. Step two would be to add the ability for an iommufd using driver to request that a RID&PASID is connected to an iommu_domain. This connection can be requested for any kind of iommu_domain, kernel owned or user owned. I don't quite have an answer how exactly the SMMUv3 vs Intel difference in PASID routing should be resolved. to get answers I'm hoping to start building some sketch RFCs for these different things on iommufd, hopefully in January. I'm looking at user page tables, PASID, dirty tracking and userspace IO fault handling as the main features iommufd must tackle. The purpose of the sketches would be to validate that the HW features we want to exposed can work will with the choices the base is making. Jason
On Wed, Dec 08, 2021 at 08:56:16AM -0400, Jason Gunthorpe wrote: > From a progress perspective I would like to start with simple 'page > tables in userspace', ie no PASID in this step. > > 'page tables in userspace' means an iommufd ioctl to create an > iommu_domain where the IOMMU HW is directly travesering a > device-specific page table structure in user space memory. All the HW > today implements this by using another iommu_domain to allow the IOMMU > HW DMA access to user memory - ie nesting or multi-stage or whatever. > > This would come along with some ioctls to invalidate the IOTLB. > > I'm imagining this step as a iommu_group->op->create_user_domain() > driver callback which will create a new kind of domain with > domain-unique ops. Ie map/unmap related should all be NULL as those > are impossible operations. > > From there the usual struct device (ie RID) attach/detatch stuff needs > to take care of routing DMAs to this iommu_domain. > > Step two would be to add the ability for an iommufd using driver to > request that a RID&PASID is connected to an iommu_domain. This > connection can be requested for any kind of iommu_domain, kernel owned > or user owned. > > I don't quite have an answer how exactly the SMMUv3 vs Intel > difference in PASID routing should be resolved. In SMMUv3 the user pgd is always stored in the PASID table (actually called "context descriptor table" but I want to avoid confusion with the VT-d "context table"). And to access the PASID table, the SMMUv3 first translate its GPA into a PA using the stage-2 page table. For userspace to pass individual pgds to the kernel, as opposed to passing whole PASID tables, the host kernel needs to reserve GPA space and map it in stage-2, so it can store the PASID table in there. Userspace manages GPA space. This would be easy for a single pgd. In this case the PASID table has a single entry and userspace could just pass one GPA page during registration. However it isn't easily generalized to full PASID support, because managing a multi-level PASID table will require runtime GPA allocation, and that API is awkward. That's why we opted for "attach PASID table" operation rather than "attach page table" (back then the choice was easy since VT-d used the same concept). So I think the simplest way to support nesting is still to have separate modes of operations depending on the hardware. Thanks, Jean > > to get answers I'm hoping to start building some sketch RFCs for these > different things on iommufd, hopefully in January. I'm looking at user > page tables, PASID, dirty tracking and userspace IO fault handling as > the main features iommufd must tackle. > > The purpose of the sketches would be to validate that the HW features > we want to exposed can work will with the choices the base is making. > > Jason
On Wed, Dec 08, 2021 at 05:20:39PM +0000, Jean-Philippe Brucker wrote: > On Wed, Dec 08, 2021 at 08:56:16AM -0400, Jason Gunthorpe wrote: > > From a progress perspective I would like to start with simple 'page > > tables in userspace', ie no PASID in this step. > > > > 'page tables in userspace' means an iommufd ioctl to create an > > iommu_domain where the IOMMU HW is directly travesering a > > device-specific page table structure in user space memory. All the HW > > today implements this by using another iommu_domain to allow the IOMMU > > HW DMA access to user memory - ie nesting or multi-stage or whatever. > > > > This would come along with some ioctls to invalidate the IOTLB. > > > > I'm imagining this step as a iommu_group->op->create_user_domain() > > driver callback which will create a new kind of domain with > > domain-unique ops. Ie map/unmap related should all be NULL as those > > are impossible operations. > > > > From there the usual struct device (ie RID) attach/detatch stuff needs > > to take care of routing DMAs to this iommu_domain. > > > > Step two would be to add the ability for an iommufd using driver to > > request that a RID&PASID is connected to an iommu_domain. This > > connection can be requested for any kind of iommu_domain, kernel owned > > or user owned. > > > > I don't quite have an answer how exactly the SMMUv3 vs Intel > > difference in PASID routing should be resolved. > > In SMMUv3 the user pgd is always stored in the PASID table (actually > called "context descriptor table" but I want to avoid confusion with > the VT-d "context table"). And to access the PASID table, the SMMUv3 first > translate its GPA into a PA using the stage-2 page table. For userspace to > pass individual pgds to the kernel, as opposed to passing whole PASID > tables, the host kernel needs to reserve GPA space and map it in stage-2, > so it can store the PASID table in there. Userspace manages GPA space. It is what I thought.. So in the SMMUv3 spec the STE is completely in kernel memory, but it points to an S1ContextPtr that must be an IPA if the "stage 1 translation tables" are IPA. Only via S1ContextPtr can we decode the substream? So in SMMUv3 land we don't really ever talk about PASID, we have a 'user page table' that is bound to an entire RID and *all* PASIDs. While Intel would have a 'user page table' that is only bound to a RID & PASID Certianly it is not a difference we can hide from userspace. > This would be easy for a single pgd. In this case the PASID table has a > single entry and userspace could just pass one GPA page during > registration. However it isn't easily generalized to full PASID support, > because managing a multi-level PASID table will require runtime GPA > allocation, and that API is awkward. That's why we opted for "attach PASID > table" operation rather than "attach page table" (back then the choice was > easy since VT-d used the same concept). I think the entire context descriptor table should be in userspace, and filled in by userspace, as part of the userspace page table. The kernel API should accept the S1ContextPtr IPA and all the parts of the STE that relate to the defining the layout of what the S1Context points to an thats it. We should have another mode where the kernel owns everything, and the S1ContexPtr is a PA with Stage 2 bypassed. That part is fine, the more open question is what does the driver interface look like when userspace tell something like vfio-pci to connect to this thing. At some level the attaching device needs to authorize iommufd to take the entire PASID table and RID. Specifically we cannot use this thing with a mdev, while the Intel version of a userspace page table can be. Maybe that is just some 'allow whole device' flag in an API Jason
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Thursday, December 9, 2021 2:31 AM > > On Wed, Dec 08, 2021 at 05:20:39PM +0000, Jean-Philippe Brucker wrote: > > On Wed, Dec 08, 2021 at 08:56:16AM -0400, Jason Gunthorpe wrote: > > > From a progress perspective I would like to start with simple 'page > > > tables in userspace', ie no PASID in this step. > > > > > > 'page tables in userspace' means an iommufd ioctl to create an > > > iommu_domain where the IOMMU HW is directly travesering a > > > device-specific page table structure in user space memory. All the HW > > > today implements this by using another iommu_domain to allow the > IOMMU > > > HW DMA access to user memory - ie nesting or multi-stage or whatever. > > > > > > This would come along with some ioctls to invalidate the IOTLB. > > > > > > I'm imagining this step as a iommu_group->op->create_user_domain() > > > driver callback which will create a new kind of domain with > > > domain-unique ops. Ie map/unmap related should all be NULL as those > > > are impossible operations. > > > > > > From there the usual struct device (ie RID) attach/detatch stuff needs > > > to take care of routing DMAs to this iommu_domain. > > > > > > Step two would be to add the ability for an iommufd using driver to > > > request that a RID&PASID is connected to an iommu_domain. This > > > connection can be requested for any kind of iommu_domain, kernel > owned > > > or user owned. > > > > > > I don't quite have an answer how exactly the SMMUv3 vs Intel > > > difference in PASID routing should be resolved. > > > > In SMMUv3 the user pgd is always stored in the PASID table (actually > > called "context descriptor table" but I want to avoid confusion with > > the VT-d "context table"). And to access the PASID table, the SMMUv3 first > > translate its GPA into a PA using the stage-2 page table. For userspace to > > pass individual pgds to the kernel, as opposed to passing whole PASID > > tables, the host kernel needs to reserve GPA space and map it in stage-2, > > so it can store the PASID table in there. Userspace manages GPA space. > > It is what I thought.. So in the SMMUv3 spec the STE is completely in > kernel memory, but it points to an S1ContextPtr that must be an IPA if > the "stage 1 translation tables" are IPA. Only via S1ContextPtr can we > decode the substream? > > So in SMMUv3 land we don't really ever talk about PASID, we have a > 'user page table' that is bound to an entire RID and *all* PASIDs. > > While Intel would have a 'user page table' that is only bound to a RID > & PASID > > Certianly it is not a difference we can hide from userspace. Concept-wise it is still a 'user page table' with vendor specific format. Taking your earlier analog it's just for a single 84-bit address space (20PASID+64bitVA) per RID. So what we requires is still one unified ioctl in your step-1 to support per-RID 'user page table'. For ARM it's SMMU's PASID table format. There is no step-2 since PASID is already within the address space covered by the user PASID table. For Intel it's VT-d's 1st level page table format. When moving to step-2 then allows multiple 'user page tables' connected to RID & PASID. > > > This would be easy for a single pgd. In this case the PASID table has a > > single entry and userspace could just pass one GPA page during > > registration. However it isn't easily generalized to full PASID support, > > because managing a multi-level PASID table will require runtime GPA > > allocation, and that API is awkward. That's why we opted for "attach PASID > > table" operation rather than "attach page table" (back then the choice was > > easy since VT-d used the same concept). > > I think the entire context descriptor table should be in userspace, > and filled in by userspace, as part of the userspace page table. > > The kernel API should accept the S1ContextPtr IPA and all the parts of > the STE that relate to the defining the layout of what the S1Context > points to an thats it. > > We should have another mode where the kernel owns everything, and the > S1ContexPtr is a PA with Stage 2 bypassed. I guess this is for the usage like DPDK. In that case yes we can have unified ioctl since the kernel manages everything including the PASID table. > > That part is fine, the more open question is what does the driver > interface look like when userspace tell something like vfio-pci to > connect to this thing. At some level the attaching device needs to > authorize iommufd to take the entire PASID table and RID. as long as smmu driver advocates only supporting step-1 ioctl, then this authorization should be implied already. > > Specifically we cannot use this thing with a mdev, while the Intel > version of a userspace page table can be. yes. Supporting mdev is all the reason why Intel puts the PASID table in host physical address space to be managed by the kernel. > > Maybe that is just some 'allow whole device' flag in an API > As said, I feel this special flag is not required as long as the vendor iommu driver only supports your step-1 interface which implies 'allow whole device' for ARM. Thanks Kevin
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Wednesday, December 8, 2021 8:56 PM > > On Wed, Dec 08, 2021 at 08:33:33AM +0100, Eric Auger wrote: > > Hi Baolu, > > > > On 12/8/21 3:44 AM, Lu Baolu wrote: > > > Hi Eric, > > > > > > On 12/7/21 6:22 PM, Eric Auger wrote: > > >> On 12/6/21 11:48 AM, Joerg Roedel wrote: > > >>> On Wed, Oct 27, 2021 at 12:44:20PM +0200, Eric Auger wrote: > > >>>> Signed-off-by: Jean-Philippe Brucker<jean-philippe.brucker@arm.com> > > >>>> Signed-off-by: Liu, Yi L<yi.l.liu@linux.intel.com> > > >>>> Signed-off-by: Ashok Raj<ashok.raj@intel.com> > > >>>> Signed-off-by: Jacob Pan<jacob.jun.pan@linux.intel.com> > > >>>> Signed-off-by: Eric Auger<eric.auger@redhat.com> > > >>> This Signed-of-by chain looks dubious, you are the author but the last > > >>> one in the chain? > > >> The 1st RFC in Aug 2018 > > >> (https://lists.cs.columbia.edu/pipermail/kvmarm/2018- > August/032478.html) > > >> said this was a generalization of Jacob's patch > > >> > > >> > > >> [PATCH v5 01/23] iommu: introduce bind_pasid_table API function > > >> > > >> > > >> > > >> https://lists.linuxfoundation.org/pipermail/iommu/2018- > May/027647.html > > >> > > >> So indeed Jacob should be the author. I guess the multiple rebases got > > >> this eventually replaced at some point, which is not an excuse. Please > > >> forgive me for that. > > >> Now the original patch already had this list of SoB so I don't know if I > > >> shall simplify it. > > > > > > As we have decided to move the nested mode (dual stages) > implementation > > > onto the developing iommufd framework, what's the value of adding this > > > into iommu core? > > > > The iommu_uapi_attach_pasid_table uapi should disappear indeed as it is > > is bound to be replaced by /dev/iommu fellow API. > > However until I can rebase on /dev/iommu code I am obliged to keep it to > > maintain this integration, hence the RFC. > > Indeed, we are getting pretty close to having the base iommufd that we > can start adding stuff like this into. Maybe in January, you can look > at some parts of what is evolving here: > > https://github.com/jgunthorpe/linux/commits/iommufd > https://github.com/LuBaolu/intel-iommu/commits/iommu-dma-ownership- > v2 > https://github.com/luxis1999/iommufd/commits/iommufd-v5.16-rc2 > > From a progress perspective I would like to start with simple 'page > tables in userspace', ie no PASID in this step. > > 'page tables in userspace' means an iommufd ioctl to create an > iommu_domain where the IOMMU HW is directly travesering a > device-specific page table structure in user space memory. All the HW > today implements this by using another iommu_domain to allow the IOMMU > HW DMA access to user memory - ie nesting or multi-stage or whatever. One clarification here in case people may get confused based on the current iommu_domain definition. Jason brainstormed with us on how to represent 'user page table' in the IOMMU sub-system. One is to extend iommu_domain as a general representation for any page table instance. The other option is to create new representations for user page tables and then link them under existing iommu_domain. This context is based on the 1st option. and As Jason said in the bottom we still need to sketch out whether it works as expected.
> From: Tian, Kevin > Sent: Thursday, December 9, 2021 10:58 AM > > For ARM it's SMMU's PASID table format. There is no step-2 since PASID > is already within the address space covered by the user PASID table. > One correction here. 'no step-2' is definitely wrong here as it means more than user page table in your plan (e.g. dpdk). To simplify it what I meant is: iommufd reports how many 'user page tables' are supported given a device. ARM always reports only one can be supported, and it must be created in PASID table format. tagged by RID. Intel reports one in step1 (tagged by RID), and N in step2 (tagged by RID+PASID). A special flag in attach call allows the user to specify the additional PASID routing info for a 'user page table'. Thanks Kevin
Hi Jason, On 12/8/21 7:31 PM, Jason Gunthorpe wrote: > On Wed, Dec 08, 2021 at 05:20:39PM +0000, Jean-Philippe Brucker wrote: >> On Wed, Dec 08, 2021 at 08:56:16AM -0400, Jason Gunthorpe wrote: >>> From a progress perspective I would like to start with simple 'page >>> tables in userspace', ie no PASID in this step. >>> >>> 'page tables in userspace' means an iommufd ioctl to create an >>> iommu_domain where the IOMMU HW is directly travesering a >>> device-specific page table structure in user space memory. All the HW >>> today implements this by using another iommu_domain to allow the IOMMU >>> HW DMA access to user memory - ie nesting or multi-stage or whatever. >>> >>> This would come along with some ioctls to invalidate the IOTLB. >>> >>> I'm imagining this step as a iommu_group->op->create_user_domain() >>> driver callback which will create a new kind of domain with >>> domain-unique ops. Ie map/unmap related should all be NULL as those >>> are impossible operations. >>> >>> From there the usual struct device (ie RID) attach/detatch stuff needs >>> to take care of routing DMAs to this iommu_domain. >>> >>> Step two would be to add the ability for an iommufd using driver to >>> request that a RID&PASID is connected to an iommu_domain. This >>> connection can be requested for any kind of iommu_domain, kernel owned >>> or user owned. >>> >>> I don't quite have an answer how exactly the SMMUv3 vs Intel >>> difference in PASID routing should be resolved. >> In SMMUv3 the user pgd is always stored in the PASID table (actually >> called "context descriptor table" but I want to avoid confusion with >> the VT-d "context table"). And to access the PASID table, the SMMUv3 first >> translate its GPA into a PA using the stage-2 page table. For userspace to >> pass individual pgds to the kernel, as opposed to passing whole PASID >> tables, the host kernel needs to reserve GPA space and map it in stage-2, >> so it can store the PASID table in there. Userspace manages GPA space. > It is what I thought.. So in the SMMUv3 spec the STE is completely in > kernel memory, but it points to an S1ContextPtr that must be an IPA if > the "stage 1 translation tables" are IPA. Only via S1ContextPtr can we > decode the substream? Yes that's correct. S1ContextPtr is the IPA of the L1 Context Descriptor Table which is then indexed by substreamID. > > So in SMMUv3 land we don't really ever talk about PASID, we have a > 'user page table' that is bound to an entire RID and *all* PASIDs. in ARM terminology substreamID matches the PASID and this is what indexes the L1 Context Descriptor Table. > > While Intel would have a 'user page table' that is only bound to a RID > & PASID > > Certianly it is not a difference we can hide from userspace. > >> This would be easy for a single pgd. In this case the PASID table has a >> single entry and userspace could just pass one GPA page during >> registration. However it isn't easily generalized to full PASID support, >> because managing a multi-level PASID table will require runtime GPA >> allocation, and that API is awkward. That's why we opted for "attach PASID >> table" operation rather than "attach page table" (back then the choice was >> easy since VT-d used the same concept). > I think the entire context descriptor table should be in userspace, > and filled in by userspace, as part of the userspace page table. In ARM nested mode the L1 Context Descriptor Table is fully managed by the guest and the userspace only needs to trap its S1ContextPtr and pass it to the host. > > The kernel API should accept the S1ContextPtr IPA and all the parts of > the STE that relate to the defining the layout of what the S1Context > points to an thats it. Yes that's exactly what is done currently. At config time the host must trap guest STE changes (format and S1ContextPtr) and "incorporate" those changes into the stage2 related STE information. The STE is owned by the host kernel as it contains the stage2 information (S2TTB). In https://developer.arm.com/documentation/ihi0070/latest (ARM_IHI_0070_D_b_System_Memory_Management_Unit_Architecture_Specification.pdf) Synthetic diagrams can be found in 3.3.2 StreamIDs to Context Descriptors. They give the global view. Note this series only coped with a single CD in the Context Descriptor Table. Thanks Eric > > We should have another mode where the kernel owns everything, and the > S1ContexPtr is a PA with Stage 2 bypassed. > > That part is fine, the more open question is what does the driver > interface look like when userspace tell something like vfio-pci to > connect to this thing. At some level the attaching device needs to > authorize iommufd to take the entire PASID table and RID. > > Specifically we cannot use this thing with a mdev, while the Intel > version of a userspace page table can be. > > Maybe that is just some 'allow whole device' flag in an API > > Jason >
Hi Jason, On 12/8/21 1:56 PM, Jason Gunthorpe wrote: > On Wed, Dec 08, 2021 at 08:33:33AM +0100, Eric Auger wrote: >> Hi Baolu, >> >> On 12/8/21 3:44 AM, Lu Baolu wrote: >>> Hi Eric, >>> >>> On 12/7/21 6:22 PM, Eric Auger wrote: >>>> On 12/6/21 11:48 AM, Joerg Roedel wrote: >>>>> On Wed, Oct 27, 2021 at 12:44:20PM +0200, Eric Auger wrote: >>>>>> Signed-off-by: Jean-Philippe Brucker<jean-philippe.brucker@arm.com> >>>>>> Signed-off-by: Liu, Yi L<yi.l.liu@linux.intel.com> >>>>>> Signed-off-by: Ashok Raj<ashok.raj@intel.com> >>>>>> Signed-off-by: Jacob Pan<jacob.jun.pan@linux.intel.com> >>>>>> Signed-off-by: Eric Auger<eric.auger@redhat.com> >>>>> This Signed-of-by chain looks dubious, you are the author but the last >>>>> one in the chain? >>>> The 1st RFC in Aug 2018 >>>> (https://lists.cs.columbia.edu/pipermail/kvmarm/2018-August/032478.html) >>>> said this was a generalization of Jacob's patch >>>> >>>> >>>> [PATCH v5 01/23] iommu: introduce bind_pasid_table API function >>>> >>>> >>>> >>>> https://lists.linuxfoundation.org/pipermail/iommu/2018-May/027647.html >>>> >>>> So indeed Jacob should be the author. I guess the multiple rebases got >>>> this eventually replaced at some point, which is not an excuse. Please >>>> forgive me for that. >>>> Now the original patch already had this list of SoB so I don't know if I >>>> shall simplify it. >>> As we have decided to move the nested mode (dual stages) implementation >>> onto the developing iommufd framework, what's the value of adding this >>> into iommu core? >> The iommu_uapi_attach_pasid_table uapi should disappear indeed as it is >> is bound to be replaced by /dev/iommu fellow API. >> However until I can rebase on /dev/iommu code I am obliged to keep it to >> maintain this integration, hence the RFC. > Indeed, we are getting pretty close to having the base iommufd that we > can start adding stuff like this into. Maybe in January, you can look > at some parts of what is evolving here: > > https://github.com/jgunthorpe/linux/commits/iommufd > https://github.com/LuBaolu/intel-iommu/commits/iommu-dma-ownership-v2 > https://github.com/luxis1999/iommufd/commits/iommufd-v5.16-rc2 Interesting. thank you for the preview links. I will have a look asap Eric > > From a progress perspective I would like to start with simple 'page > tables in userspace', ie no PASID in this step. > > 'page tables in userspace' means an iommufd ioctl to create an > iommu_domain where the IOMMU HW is directly travesering a > device-specific page table structure in user space memory. All the HW > today implements this by using another iommu_domain to allow the IOMMU > HW DMA access to user memory - ie nesting or multi-stage or whatever. > > This would come along with some ioctls to invalidate the IOTLB. > > I'm imagining this step as a iommu_group->op->create_user_domain() > driver callback which will create a new kind of domain with > domain-unique ops. Ie map/unmap related should all be NULL as those > are impossible operations. > > From there the usual struct device (ie RID) attach/detatch stuff needs > to take care of routing DMAs to this iommu_domain. > > Step two would be to add the ability for an iommufd using driver to > request that a RID&PASID is connected to an iommu_domain. This > connection can be requested for any kind of iommu_domain, kernel owned > or user owned. > > I don't quite have an answer how exactly the SMMUv3 vs Intel > difference in PASID routing should be resolved. > > to get answers I'm hoping to start building some sketch RFCs for these > different things on iommufd, hopefully in January. I'm looking at user > page tables, PASID, dirty tracking and userspace IO fault handling as > the main features iommufd must tackle. > > The purpose of the sketches would be to validate that the HW features > we want to exposed can work will with the choices the base is making. > > Jason >
Hi Kevin, On 12/9/21 4:21 AM, Tian, Kevin wrote: >> From: Jason Gunthorpe <jgg@nvidia.com> >> Sent: Wednesday, December 8, 2021 8:56 PM >> >> On Wed, Dec 08, 2021 at 08:33:33AM +0100, Eric Auger wrote: >>> Hi Baolu, >>> >>> On 12/8/21 3:44 AM, Lu Baolu wrote: >>>> Hi Eric, >>>> >>>> On 12/7/21 6:22 PM, Eric Auger wrote: >>>>> On 12/6/21 11:48 AM, Joerg Roedel wrote: >>>>>> On Wed, Oct 27, 2021 at 12:44:20PM +0200, Eric Auger wrote: >>>>>>> Signed-off-by: Jean-Philippe Brucker<jean-philippe.brucker@arm.com> >>>>>>> Signed-off-by: Liu, Yi L<yi.l.liu@linux.intel.com> >>>>>>> Signed-off-by: Ashok Raj<ashok.raj@intel.com> >>>>>>> Signed-off-by: Jacob Pan<jacob.jun.pan@linux.intel.com> >>>>>>> Signed-off-by: Eric Auger<eric.auger@redhat.com> >>>>>> This Signed-of-by chain looks dubious, you are the author but the last >>>>>> one in the chain? >>>>> The 1st RFC in Aug 2018 >>>>> (https://lists.cs.columbia.edu/pipermail/kvmarm/2018- >> August/032478.html) >>>>> said this was a generalization of Jacob's patch >>>>> >>>>> >>>>> [PATCH v5 01/23] iommu: introduce bind_pasid_table API function >>>>> >>>>> >>>>> >>>>> https://lists.linuxfoundation.org/pipermail/iommu/2018- >> May/027647.html >>>>> So indeed Jacob should be the author. I guess the multiple rebases got >>>>> this eventually replaced at some point, which is not an excuse. Please >>>>> forgive me for that. >>>>> Now the original patch already had this list of SoB so I don't know if I >>>>> shall simplify it. >>>> As we have decided to move the nested mode (dual stages) >> implementation >>>> onto the developing iommufd framework, what's the value of adding this >>>> into iommu core? >>> The iommu_uapi_attach_pasid_table uapi should disappear indeed as it is >>> is bound to be replaced by /dev/iommu fellow API. >>> However until I can rebase on /dev/iommu code I am obliged to keep it to >>> maintain this integration, hence the RFC. >> Indeed, we are getting pretty close to having the base iommufd that we >> can start adding stuff like this into. Maybe in January, you can look >> at some parts of what is evolving here: >> >> https://github.com/jgunthorpe/linux/commits/iommufd >> https://github.com/LuBaolu/intel-iommu/commits/iommu-dma-ownership- >> v2 >> https://github.com/luxis1999/iommufd/commits/iommufd-v5.16-rc2 >> >> From a progress perspective I would like to start with simple 'page >> tables in userspace', ie no PASID in this step. >> >> 'page tables in userspace' means an iommufd ioctl to create an >> iommu_domain where the IOMMU HW is directly travesering a >> device-specific page table structure in user space memory. All the HW >> today implements this by using another iommu_domain to allow the IOMMU >> HW DMA access to user memory - ie nesting or multi-stage or whatever. > One clarification here in case people may get confused based on the > current iommu_domain definition. Jason brainstormed with us on how > to represent 'user page table' in the IOMMU sub-system. One is to > extend iommu_domain as a general representation for any page table > instance. The other option is to create new representations for user > page tables and then link them under existing iommu_domain. > > This context is based on the 1st option. and As Jason said in the bottom > we still need to sketch out whether it works as expected.
On Thu, Dec 09, 2021 at 08:50:04AM +0100, Eric Auger wrote: > > The kernel API should accept the S1ContextPtr IPA and all the parts of > > the STE that relate to the defining the layout of what the S1Context > > points to an thats it. > Yes that's exactly what is done currently. At config time the host must > trap guest STE changes (format and S1ContextPtr) and "incorporate" those > changes into the stage2 related STE information. The STE is owned by the > host kernel as it contains the stage2 information (S2TTB). [..] > Note this series only coped with a single CD in the Context Descriptor > Table. I'm confused, where does this limit arise? The kernel accepts as input all the bits in the STE that describe the layout of the CDT owned by userspace, shouldn't userspace be able to construct all forms of CDT with any number of CDs in them? Or do you mean this is some qemu limitation? Jason
On Thu, Dec 09, 2021 at 03:59:57AM +0000, Tian, Kevin wrote: > > From: Tian, Kevin > > Sent: Thursday, December 9, 2021 10:58 AM > > > > For ARM it's SMMU's PASID table format. There is no step-2 since PASID > > is already within the address space covered by the user PASID table. > > > > One correction here. 'no step-2' is definitely wrong here as it means > more than user page table in your plan (e.g. dpdk). > > To simplify it what I meant is: > > iommufd reports how many 'user page tables' are supported given a device. > > ARM always reports only one can be supported, and it must be created in > PASID table format. tagged by RID. > > Intel reports one in step1 (tagged by RID), and N in step2 (tagged by > RID+PASID). A special flag in attach call allows the user to specify the > additional PASID routing info for a 'user page table'. I don't think 'number of user page tables' makes sense It really is 'attach to the whole device' vs 'attach to the RID' as a semantic that should exist If we imagine a userspace using kernel page tables it certainly makes sense to assign page table A to the RID and page table B to a PASID even in simple cases like vfio-pci. The only case where userspace would want to capture the entire RID and all PASIDs is something like this ARM situation - but userspace just created a device specific object and already knows exactly what kind of behavior it has. So, something like vfio pci would implement three uAPI operations: - Attach page table to RID - Attach page table to PASID - Attach page table to RID and all PASIDs And here 'page table' is everything below the STE in SMMUv3 While mdev can only support: - Access emulated page table - Attach page table to PASID It is what I've said a couple of times, the API the driver calls toward iommufd to attach a page table must be unambiguous as to the intention, which also means userspace must be unambiguous too. Jason
Hi Jason, On 12/9/21 4:40 PM, Jason Gunthorpe wrote: > On Thu, Dec 09, 2021 at 08:50:04AM +0100, Eric Auger wrote: > >>> The kernel API should accept the S1ContextPtr IPA and all the parts of >>> the STE that relate to the defining the layout of what the S1Context >>> points to an thats it. >> Yes that's exactly what is done currently. At config time the host must >> trap guest STE changes (format and S1ContextPtr) and "incorporate" those >> changes into the stage2 related STE information. The STE is owned by the >> host kernel as it contains the stage2 information (S2TTB). > [..] > >> Note this series only coped with a single CD in the Context Descriptor >> Table. > I'm confused, where does this limit arise? > > The kernel accepts as input all the bits in the STE that describe the > layout of the CDT owned by userspace, shouldn't userspace be able to > construct all forms of CDT with any number of CDs in them? > > Or do you mean this is some qemu limitation? The upstream vSMMUv3 emulation does not support multiple CDs at the moment and since I have no proper means to validate the vSVA case I am rejecting any attempt from user space to inject guest configs featuring mutliple PASIDs. Also PASID cache invalidation must be added to this series. Nevertheless those limitations were tackled and overcomed by others in CC so I don't think there is any blocking issue. Thanks Eric > > Jason >
> From: Jason Gunthorpe via iommu > Sent: Friday, December 10, 2021 12:08 AM > > On Thu, Dec 09, 2021 at 03:59:57AM +0000, Tian, Kevin wrote: > > > From: Tian, Kevin > > > Sent: Thursday, December 9, 2021 10:58 AM > > > > > > For ARM it's SMMU's PASID table format. There is no step-2 since PASID > > > is already within the address space covered by the user PASID table. > > > > > > > One correction here. 'no step-2' is definitely wrong here as it means > > more than user page table in your plan (e.g. dpdk). > > > > To simplify it what I meant is: > > > > iommufd reports how many 'user page tables' are supported given a device. > > > > ARM always reports only one can be supported, and it must be created in > > PASID table format. tagged by RID. > > > > Intel reports one in step1 (tagged by RID), and N in step2 (tagged by > > RID+PASID). A special flag in attach call allows the user to specify the > > additional PASID routing info for a 'user page table'. > > I don't think 'number of user page tables' makes sense > > It really is 'attach to the whole device' vs 'attach to the RID' as a > semantic that should exist > > If we imagine a userspace using kernel page tables it certainly makes > sense to assign page table A to the RID and page table B to a PASID > even in simple cases like vfio-pci. > > The only case where userspace would want to capture the entire RID and > all PASIDs is something like this ARM situation - but userspace just > created a device specific object and already knows exactly what kind > of behavior it has. > > So, something like vfio pci would implement three uAPI operations: > - Attach page table to RID > - Attach page table to PASID > - Attach page table to RID and all PASIDs > And here 'page table' is everything below the STE in SMMUv3 > > While mdev can only support: > - Access emulated page table > - Attach page table to PASID mdev is a pci device from user p.o.v, having its vRID and vPASID. From this angle the uAPI is no different from vfio-pci (except the ARM one): - (sw mdev) Attach emulated page table to vRID (no iommu domain) - (hw mdev) Attach page table to vRID (mapped to mdev PASID) - (hw mdev) Attach page table to vPASID (mapped to a fungible PASID) > > It is what I've said a couple of times, the API the driver calls > toward iommufd to attach a page table must be unambiguous as to the > intention, which also means userspace must be unambiguous too. > No question on the unambiguous part. But we also need to consider the common semantics that can be abstracted. From user p.o.v a vRID can be attached to at most two page tables (if nesting is enabled). This just requires the basic attaching form for either one page table or two page tables: at_data = { .iommufd = xxx; .pgtable_id = yyy; }; ioctl(device_fd, VFIO_DEVICE_ATTACH_PGTABLE, &at_data); This can already cover ARM's requirement. The user page table attached to vRID is in vendor specific format, e.g. either ARM pasid table format or Intel stage-1 format. For ARM pasid_table + underlying stage-1 page tables can be considered as a single big paging structure. From this angle I'm not sure the benefit of making a separate uAPI just because it's a pasid table for ARM. Then when PASID needs to be explicitly specified (e.g. in Intel case): at_data = { .iommufd = xxx; .pgtable_id = yyy; .flags = VFIO_ATTACH_FLAGS_PASID; .pasid = zzz; }; ioctl(device_fd, VFIO_DEVICE_ATTACH_PGTABLE, &at_data); Again, I don't think what a simple flag can solve needs to be made into a separate uAPI. Is modeling like above considered ambiguous? Thanks Kevin
On Fri, Dec 10, 2021 at 08:56:56AM +0000, Tian, Kevin wrote: > > So, something like vfio pci would implement three uAPI operations: > > - Attach page table to RID > > - Attach page table to PASID > > - Attach page table to RID and all PASIDs > > And here 'page table' is everything below the STE in SMMUv3 > > > > While mdev can only support: > > - Access emulated page table > > - Attach page table to PASID > > mdev is a pci device from user p.o.v, having its vRID and vPASID. From > this angle the uAPI is no different from vfio-pci (except the ARM one): No, it isn't. The internal operation is completely different, and it must call different iommufd APIs than vfio-pci does. This is user visible - mdev can never be attached to an ARM user page table, for instance. For iommufd there is no vRID, vPASID or any confusing stuff like that. You'll have an easier time if you stop thinking in these terms. We probably end up with three iommufd calls: int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id, unsigned int flags) int iommufd_device_attach_pasid(struct iommufd_device *idev, u32 *pt_id, unsigned int flags, ioasid_t *pasid) int iommufd_device_attach_sw_iommu(struct iommufd_device *idev, u32 pt_id); And the uAPI from VFIO must map onto them. vfio-pci: - 'VFIO_SET_CONTAINER' does iommufd_device_attach(idev, &pt_id, IOMMUFD_FULL_DEVICE); # IOMMU specific if this captures PASIDs or cause them to fail, # but IOMMUFD_FULL_DEVICE will prevent attaching any PASID # later on all iommu's. vfio-mdev: - 'VFIO_SET_CONTAINER' does one of: iommufd_device_attach_pasid(idev, &pt_id, IOMMUFD_ASSIGN_PASID, &pasid); iommufd_device_attach_sw_iommu(idev, pt_id); That is three of the cases. Then we have new ioctls for the other cases: vfio-pci: - 'bind only the RID, so we can use PASID' iommufd_device_attach(idev, &pt_id, 0); - 'bind to a specific PASID' iommufd_device_attach_pasid(idev, &pt_id, 0, &pasid); vfio-mdev: - 'like VFIO_SET_CONTAINER but bind to a specific PASID' iommufd_device_attach_pasid(idev, &pt_id, 0, &pasid); The iommu driver will block attachments that are incompatible, ie ARM user page tables only work with: iommufd_device_attach(idev, &pt_id, IOMMUFD_FULL_DEVICE) all other calls fail. How exactly we put all of this into new ioctls, I'm not sure, but it does seem pretty clear this is what the iommufd kAPI will need to look like to cover the cases we know about already. As you can see, userpace needs to understand what mode it is operating in. If it does IOMMUFD_FULL_DEVICE and manages PASID somehow in userspace, or it doesn't and can use the iommufd_device_attach_pasid() paths. > Is modeling like above considered ambiguous? You've skipped straight to the ioctls without designing the kernel API to meet all the requirements :) Jason
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Friday, December 10, 2021 9:23 PM > > On Fri, Dec 10, 2021 at 08:56:56AM +0000, Tian, Kevin wrote: > > > So, something like vfio pci would implement three uAPI operations: > > > - Attach page table to RID > > > - Attach page table to PASID > > > - Attach page table to RID and all PASIDs > > > And here 'page table' is everything below the STE in SMMUv3 > > > > > > While mdev can only support: > > > - Access emulated page table > > > - Attach page table to PASID > > > > mdev is a pci device from user p.o.v, having its vRID and vPASID. From > > this angle the uAPI is no different from vfio-pci (except the ARM one): > > No, it isn't. The internal operation is completely different, and it > must call different iommufd APIs than vfio-pci does. Well, you mentioned "uAPI operations" thus my earlier comment is purely from uAPI p.o.v instead of internal iommufd APIs (not meant I didn't think of them). I think this is the main divergence in this discussion as when I saw you said "while mdev can only support" I assume it's still about uAPI (more specifically VFIO uAPI as it carries the attach call to iommufd). > > This is user visible - mdev can never be attached to an ARM user page > table, for instance. sure. the iommu driver will fail the attach request when seeing incompatible way is used. > > For iommufd there is no vRID, vPASID or any confusing stuff like > that. You'll have an easier time if you stop thinking in these terms. I don't have a difficulty here as from vfio uAPI p.o.v it's about vRID and vPASID. But there is NO any confusion on iommufd which should only deal with physical thing. This has been settled down long time ago in high level design discussion.
> From: Tian, Kevin > Sent: Saturday, December 11, 2021 11:58 AM > > This might be the only open as I still didn't see why we need an > explicit flag to claim a 'full device' thing. From kernel p.o.v the > ARM case is no different from Intel that both allows an user > page table attached to vRID, just with different format and obviously this is 'RID' to not cause further confusion since it talks about the kernel p.o.v > addr width (Intel is 64bit, ARM is 84bit where PASID can be > considered a sub-handle in the 84bit address space and not > the kernel's business). > > and ARM doesn't support explicit PASID attach then those calls > will fail for sure. Thanks Kevin
On Sat, Dec 11, 2021 at 03:57:45AM +0000, Tian, Kevin wrote: > This might be the only open as I still didn't see why we need an > explicit flag to claim a 'full device' thing. From kernel p.o.v the > ARM case is no different from Intel that both allows an user > page table attached to vRID, just with different format and > addr width (Intel is 64bit, ARM is 84bit where PASID can be > considered a sub-handle in the 84bit address space and not > the kernel's business). I think the difference is intention. In one case the kernel is saying 'attach a RID and I intend to use PASID' in which case the kernel user can call the PASID APIs. The second case is saying 'I will not use PASID'. They are different things and I think it is a surprising API if the kernel user attaches a domain, intends to use PASID and then finds out it can't, eg because an ARM user page table was hooked up. If you imagine the flag as 'I intend to use PASID' I think it makes a fair amount of sense from an API design too. We could probably do without it, at least for VFIO and qemu cases, but it seems a little bit peculiar to me. Jason
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Friday, December 17, 2021 4:49 AM > > On Sat, Dec 11, 2021 at 03:57:45AM +0000, Tian, Kevin wrote: > > > This might be the only open as I still didn't see why we need an > > explicit flag to claim a 'full device' thing. From kernel p.o.v the > > ARM case is no different from Intel that both allows an user > > page table attached to vRID, just with different format and > > addr width (Intel is 64bit, ARM is 84bit where PASID can be > > considered a sub-handle in the 84bit address space and not > > the kernel's business). > > I think the difference is intention. > > In one case the kernel is saying 'attach a RID and I intend to use > PASID' in which case the kernel user can call the PASID APIs. > > The second case is saying 'I will not use PASID'. > > They are different things and I think it is a surprising API if the > kernel user attaches a domain, intends to use PASID and then finds out > it can't, eg because an ARM user page table was hooked up. > > If you imagine the flag as 'I intend to use PASID' I think it makes a > fair amount of sense from an API design too. > > We could probably do without it, at least for VFIO and qemu cases, but > it seems a little bit peculiar to me. > ok, combining the kernel user makes the flag more sensible. Thanks Kevin
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 3303d707bab4..6033c263c6e6 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -2236,6 +2236,75 @@ int iommu_uapi_sva_unbind_gpasid(struct iommu_domain *domain, struct device *dev } EXPORT_SYMBOL_GPL(iommu_uapi_sva_unbind_gpasid); +int iommu_attach_pasid_table(struct iommu_domain *domain, + struct iommu_pasid_table_config *cfg) +{ + if (unlikely(!domain->ops->attach_pasid_table)) + return -ENODEV; + + return domain->ops->attach_pasid_table(domain, cfg); +} +EXPORT_SYMBOL_GPL(iommu_attach_pasid_table); + +int iommu_uapi_attach_pasid_table(struct iommu_domain *domain, + void __user *uinfo) +{ + struct iommu_pasid_table_config pasid_table_data = { 0 }; + u32 minsz; + + if (unlikely(!domain->ops->attach_pasid_table)) + return -ENODEV; + + /* + * No new spaces can be added before the variable sized union, the + * minimum size is the offset to the union. + */ + minsz = offsetof(struct iommu_pasid_table_config, vendor_data); + + /* Copy minsz from user to get flags and argsz */ + if (copy_from_user(&pasid_table_data, uinfo, minsz)) + return -EFAULT; + + /* Fields before the variable size union are mandatory */ + if (pasid_table_data.argsz < minsz) + return -EINVAL; + + /* PASID and address granu require additional info beyond minsz */ + if (pasid_table_data.version != PASID_TABLE_CFG_VERSION_1) + return -EINVAL; + if (pasid_table_data.format == IOMMU_PASID_FORMAT_SMMUV3 && + pasid_table_data.argsz < + offsetofend(struct iommu_pasid_table_config, vendor_data.smmuv3)) + return -EINVAL; + + /* + * User might be using a newer UAPI header which has a larger data + * size, we shall support the existing flags within the current + * size. Copy the remaining user data _after_ minsz but not more + * than the current kernel supported size. + */ + if (copy_from_user((void *)&pasid_table_data + minsz, uinfo + minsz, + min_t(u32, pasid_table_data.argsz, sizeof(pasid_table_data)) - minsz)) + return -EFAULT; + + /* Now the argsz is validated, check the content */ + if (pasid_table_data.config < IOMMU_PASID_CONFIG_TRANSLATE || + pasid_table_data.config > IOMMU_PASID_CONFIG_ABORT) + return -EINVAL; + + return domain->ops->attach_pasid_table(domain, &pasid_table_data); +} +EXPORT_SYMBOL_GPL(iommu_uapi_attach_pasid_table); + +void iommu_detach_pasid_table(struct iommu_domain *domain) +{ + if (unlikely(!domain->ops->detach_pasid_table)) + return; + + domain->ops->detach_pasid_table(domain); +} +EXPORT_SYMBOL_GPL(iommu_detach_pasid_table); + static void __iommu_detach_device(struct iommu_domain *domain, struct device *dev) { diff --git a/include/linux/iommu.h b/include/linux/iommu.h index d2f3435e7d17..e34a1b1c805b 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -232,6 +232,8 @@ struct iommu_iotlb_gather { * @cache_invalidate: invalidate translation caches * @sva_bind_gpasid: bind guest pasid and mm * @sva_unbind_gpasid: unbind guest pasid and mm + * @attach_pasid_table: attach a pasid table + * @detach_pasid_table: detach the pasid table * @def_domain_type: device default domain type, return value: * - IOMMU_DOMAIN_IDENTITY: must use an identity domain * - IOMMU_DOMAIN_DMA: must use a dma domain @@ -297,6 +299,9 @@ struct iommu_ops { void *drvdata); void (*sva_unbind)(struct iommu_sva *handle); u32 (*sva_get_pasid)(struct iommu_sva *handle); + int (*attach_pasid_table)(struct iommu_domain *domain, + struct iommu_pasid_table_config *cfg); + void (*detach_pasid_table)(struct iommu_domain *domain); int (*page_response)(struct device *dev, struct iommu_fault_event *evt, @@ -430,6 +435,11 @@ extern int iommu_uapi_sva_unbind_gpasid(struct iommu_domain *domain, struct device *dev, void __user *udata); extern int iommu_sva_unbind_gpasid(struct iommu_domain *domain, struct device *dev, ioasid_t pasid); +extern int iommu_attach_pasid_table(struct iommu_domain *domain, + struct iommu_pasid_table_config *cfg); +extern int iommu_uapi_attach_pasid_table(struct iommu_domain *domain, + void __user *udata); +extern void iommu_detach_pasid_table(struct iommu_domain *domain); extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev); extern struct iommu_domain *iommu_get_dma_domain(struct device *dev); extern int iommu_map(struct iommu_domain *domain, unsigned long iova, @@ -1035,6 +1045,23 @@ iommu_aux_get_pasid(struct iommu_domain *domain, struct device *dev) return -ENODEV; } +static inline +int iommu_attach_pasid_table(struct iommu_domain *domain, + struct iommu_pasid_table_config *cfg) +{ + return -ENODEV; +} + +static inline +int iommu_uapi_attach_pasid_table(struct iommu_domain *domain, + void __user *uinfo) +{ + return -ENODEV; +} + +static inline +void iommu_detach_pasid_table(struct iommu_domain *domain) {} + static inline struct iommu_sva * iommu_sva_bind_device(struct device *dev, struct mm_struct *mm, void *drvdata) { diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h index 59178fc229ca..8c079a78dfec 100644 --- a/include/uapi/linux/iommu.h +++ b/include/uapi/linux/iommu.h @@ -339,4 +339,58 @@ struct iommu_gpasid_bind_data { } vendor; }; +/** + * struct iommu_pasid_smmuv3 - ARM SMMUv3 Stream Table Entry stage 1 related + * information + * @version: API version of this structure + * @s1fmt: STE s1fmt (format of the CD table: single CD, linear table + * or 2-level table) + * @s1dss: STE s1dss (specifies the behavior when @pasid_bits != 0 + * and no PASID is passed along with the incoming transaction) + * @padding: reserved for future use (should be zero) + * + * The PASID table is referred to as the Context Descriptor (CD) table on ARM + * SMMUv3. Please refer to the ARM SMMU 3.x spec (ARM IHI 0070A) for full + * details. + */ +struct iommu_pasid_smmuv3 { +#define PASID_TABLE_SMMUV3_CFG_VERSION_1 1 + __u32 version; + __u8 s1fmt; + __u8 s1dss; + __u8 padding[2]; +}; + +/** + * struct iommu_pasid_table_config - PASID table data used to bind guest PASID + * table to the host IOMMU + * @argsz: User filled size of this data + * @version: API version to prepare for future extensions + * @base_ptr: guest physical address of the PASID table + * @format: format of the PASID table + * @pasid_bits: number of PASID bits used in the PASID table + * @config: indicates whether the guest translation stage must + * be translated, bypassed or aborted. + * @padding: reserved for future use (should be zero) + * @vendor_data.smmuv3: table information when @format is + * %IOMMU_PASID_FORMAT_SMMUV3 + */ +struct iommu_pasid_table_config { + __u32 argsz; +#define PASID_TABLE_CFG_VERSION_1 1 + __u32 version; + __u64 base_ptr; +#define IOMMU_PASID_FORMAT_SMMUV3 1 + __u32 format; + __u8 pasid_bits; +#define IOMMU_PASID_CONFIG_TRANSLATE 1 +#define IOMMU_PASID_CONFIG_BYPASS 2 +#define IOMMU_PASID_CONFIG_ABORT 3 + __u8 config; + __u8 padding[2]; + union { + struct iommu_pasid_smmuv3 smmuv3; + } vendor_data; +}; + #endif /* _UAPI_IOMMU_H */