Message ID | 5-v4-9e99b76f3518+3a8-smmuv3_nesting_jgg@nvidia.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Initial support for SMMUv3 nested translation | expand |
On Wed, Oct 30, 2024 at 09:20:49PM -0300, Jason Gunthorpe wrote: > From: Nicolin Chen <nicolinc@nvidia.com> > > For virtualization cases the IDR/IIDR/AIDR values of the actual SMMU > instance need to be available to the VMM so it can construct an > appropriate vSMMUv3 that reflects the correct HW capabilities. > > For userspace page tables these values are required to constrain the valid > values within the CD table and the IOPTEs. > > The kernel does not sanitize these values. If building a VMM then > userspace is required to only forward bits into a VM that it knows it can > implement. Some bits will also require a VMM to detect if appropriate > kernel support is available such as for ATS and BTM. > > Start a new file and kconfig for the advanced iommufd support. This lets > it be compiled out for kernels that are not intended to support > virtualization, and allows distros to leave it disabled until they are > shipping a matching qemu too. > > Tested-by: Nicolin Chen <nicolinc@nvidia.com> > Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> > Reviewed-by: Kevin Tian <kevin.tian@intel.com> > Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> > Reviewed-by: Donald Dutile <ddutile@redhat.com> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> --->8 > diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h > index e266dfa6a38d9d..b227ac16333fe1 100644 > --- a/include/uapi/linux/iommufd.h > +++ b/include/uapi/linux/iommufd.h > @@ -488,15 +488,50 @@ struct iommu_hw_info_vtd { > __aligned_u64 ecap_reg; > }; > > +/** > + * struct iommu_hw_info_arm_smmuv3 - ARM SMMUv3 hardware information > + * (IOMMU_HW_INFO_TYPE_ARM_SMMUV3) > + * > + * @flags: Must be set to 0 > + * @__reserved: Must be 0 > + * @idr: Implemented features for ARM SMMU Non-secure programming interface > + * @iidr: Information about the implementation and implementer of ARM SMMU, > + * and architecture version supported > + * @aidr: ARM SMMU architecture version > + * > + * For the details of @idr, @iidr and @aidr, please refer to the chapters > + * from 6.3.1 to 6.3.6 in the SMMUv3 Spec. > + * > + * User space should read the underlying ARM SMMUv3 hardware information for > + * the list of supported features. > + * > + * Note that these values reflect the raw HW capability, without any insight if > + * any required kernel driver support is present. Bits may be set indicating the > + * HW has functionality that is lacking kernel software support, such as BTM. If > + * a VMM is using this information to construct emulated copies of these > + * registers it should only forward bits that it knows it can support. > + * > + * In future, presence of required kernel support will be indicated in flags. What about the case where we _know_ that some functionality is broken in the hardware? For example, we nobble BTM support on MMU 700 thanks to erratum #2812531 yet we'll still cheerfully advertise it in IDR0 here. Similarly, HTTU can be overridden by IORT, so should we update the view that we advertise for that as well? Will
On Mon, Nov 04, 2024 at 11:47:24AM +0000, Will Deacon wrote: > > +/** > > + * struct iommu_hw_info_arm_smmuv3 - ARM SMMUv3 hardware information > > + * (IOMMU_HW_INFO_TYPE_ARM_SMMUV3) > > + * > > + * @flags: Must be set to 0 > > + * @__reserved: Must be 0 > > + * @idr: Implemented features for ARM SMMU Non-secure programming interface > > + * @iidr: Information about the implementation and implementer of ARM SMMU, > > + * and architecture version supported > > + * @aidr: ARM SMMU architecture version > > + * > > + * For the details of @idr, @iidr and @aidr, please refer to the chapters > > + * from 6.3.1 to 6.3.6 in the SMMUv3 Spec. > > + * > > + * User space should read the underlying ARM SMMUv3 hardware information for > > + * the list of supported features. > > + * > > + * Note that these values reflect the raw HW capability, without any insight if > > + * any required kernel driver support is present. Bits may be set indicating the > > + * HW has functionality that is lacking kernel software support, such as BTM. If > > + * a VMM is using this information to construct emulated copies of these > > + * registers it should only forward bits that it knows it can support. > > + * > > + * In future, presence of required kernel support will be indicated in flags. > > What about the case where we _know_ that some functionality is broken in > the hardware? For example, we nobble BTM support on MMU 700 thanks to > erratum #2812531 yet we'll still cheerfully advertise it in IDR0 here. > Similarly, HTTU can be overridden by IORT, so should we update the view > that we advertise for that as well? My knee jerk answer is no, these struct fields should just report the raw HW register. A VMM should not copy these fields directly into a VM. The principle purpose is to give the VMM the same details about the HW as the kernel so it can apply erratas/etc. For instance, if we hide these fields how will the VMM/VM know to apply the various flushing errata? With vCMDQ/etc the VM is directly pushing flushes to HW, it must know the errata. For BTM/HTTU/etc - those all require kernel SW support and per-device permission in the kernel to turn on. For instance requesting a nested vSTE that needs BTM will fail today during attach. Turning on HTTU on the S2 already has an API that will fail if the IORT blocks it. Incrementally dealing with expanding the support is part of the "required kernel support will be indicated in flags." Basically, exposing the information as-is doesn't do any harm. Jason
On 2024-11-04 12:41 pm, Jason Gunthorpe wrote: > On Mon, Nov 04, 2024 at 11:47:24AM +0000, Will Deacon wrote: >>> +/** >>> + * struct iommu_hw_info_arm_smmuv3 - ARM SMMUv3 hardware information >>> + * (IOMMU_HW_INFO_TYPE_ARM_SMMUV3) >>> + * >>> + * @flags: Must be set to 0 >>> + * @__reserved: Must be 0 >>> + * @idr: Implemented features for ARM SMMU Non-secure programming interface >>> + * @iidr: Information about the implementation and implementer of ARM SMMU, >>> + * and architecture version supported >>> + * @aidr: ARM SMMU architecture version >>> + * >>> + * For the details of @idr, @iidr and @aidr, please refer to the chapters >>> + * from 6.3.1 to 6.3.6 in the SMMUv3 Spec. >>> + * >>> + * User space should read the underlying ARM SMMUv3 hardware information for >>> + * the list of supported features. >>> + * >>> + * Note that these values reflect the raw HW capability, without any insight if >>> + * any required kernel driver support is present. Bits may be set indicating the >>> + * HW has functionality that is lacking kernel software support, such as BTM. If >>> + * a VMM is using this information to construct emulated copies of these >>> + * registers it should only forward bits that it knows it can support. But how *is* a VMM supposed to know what it can support? Are they all expected to grovel the host devicetree/ACPI tables and maintain their own knowledge of implementation errata to understand what's actually usable? >>> + * >>> + * In future, presence of required kernel support will be indicated in flags. >> >> What about the case where we _know_ that some functionality is broken in >> the hardware? For example, we nobble BTM support on MMU 700 thanks to >> erratum #2812531 yet we'll still cheerfully advertise it in IDR0 here. >> Similarly, HTTU can be overridden by IORT, so should we update the view >> that we advertise for that as well? > > My knee jerk answer is no, these struct fields should just report the > raw HW register. A VMM should not copy these fields directly into a > VM. The principle purpose is to give the VMM the same details about the > HW as the kernel so it can apply erratas/etc. > > For instance, if we hide these fields how will the VMM/VM know to > apply the various flushing errata? With vCMDQ/etc the VM is directly > pushing flushes to HW, it must know the errata. That doesn't seem like a valid argument. We obviously can't abstract SMMU_IIDR, that would indeed be an invitation for trouble, but otherwise, if an erratum affects S1 operation under conditions dependent on an optional feature, then not advertising that feature would make the workaround irrelevant anyway, since as far as the VM is concerned it would be wrong to expect a non-existent feature to work in the first place. > For BTM/HTTU/etc - those all require kernel SW support and per-device > permission in the kernel to turn on. For instance requesting a nested > vSTE that needs BTM will fail today during attach. Turning on HTTU on > the S2 already has an API that will fail if the IORT blocks it. What does S2 HTTU have to do with the VM? How the host wants to maintain its S2 tables it its own business. AFAICS, unless the VMM wants to do some fiddly CD shadowing, it's going to be kinda hard to prevent the SMMU seeing a guest CD with CD.HA and/or CD.HD set if the guest expects S1 HTTU to work. I'm not sure what "vSTE that needs BTM" means. Even if the system does support BTM, the only control is the global SMMU_CR2.PTM, and a vSMMU can't usefully emulate changing that either way. Either the host set PTM=0 before enabling the SMMU, so BTM can be advertised and expected to work, or it didn't, in which case there can be no BTM, full stop. > Incrementally dealing with expanding the support is part of the > "required kernel support will be indicated in flags." > > Basically, exposing the information as-is doesn't do any harm. I would say it does. Advertising a feature when we already know it's not usable at all puts a non-trivial and unnecessary burden on the VMM and VM to then have to somehow derive that information from other sources, at the risk of being confused by unexpected behaviour if they don't. We sanitise CPU ID registers for userspace and KVM, so I see no compelling reason for SMMU ID registers to be different. Thanks, Robin.
On Wed, Nov 06, 2024 at 04:37:53PM +0000, Robin Murphy wrote: > On 2024-11-04 12:41 pm, Jason Gunthorpe wrote: > > On Mon, Nov 04, 2024 at 11:47:24AM +0000, Will Deacon wrote: > > > > +/** > > > > + * struct iommu_hw_info_arm_smmuv3 - ARM SMMUv3 hardware information > > > > + * (IOMMU_HW_INFO_TYPE_ARM_SMMUV3) > > > > + * > > > > + * @flags: Must be set to 0 > > > > + * @__reserved: Must be 0 > > > > + * @idr: Implemented features for ARM SMMU Non-secure programming interface > > > > + * @iidr: Information about the implementation and implementer of ARM SMMU, > > > > + * and architecture version supported > > > > + * @aidr: ARM SMMU architecture version > > > > + * > > > > + * For the details of @idr, @iidr and @aidr, please refer to the chapters > > > > + * from 6.3.1 to 6.3.6 in the SMMUv3 Spec. > > > > + * > > > > + * User space should read the underlying ARM SMMUv3 hardware information for > > > > + * the list of supported features. > > > > + * > > > > + * Note that these values reflect the raw HW capability, without any insight if > > > > + * any required kernel driver support is present. Bits may be set indicating the > > > > + * HW has functionality that is lacking kernel software support, such as BTM. If > > > > + * a VMM is using this information to construct emulated copies of these > > > > + * registers it should only forward bits that it knows it can support. > > But how *is* a VMM supposed to know what it can support? I answered a related question to Mostafa with an example: https://lore.kernel.org/linux-iommu/20240903235532.GJ3773488@nvidia.com/ "global" capabilities that are enabled directly from the CD entry would follow the pattern. > Are they all expected to grovel the host devicetree/ACPI tables and > maintain their own knowledge of implementation errata to understand > what's actually usable? No, VMMs are expected to only implement base line features we have working today and not blindly add new features based only HW registers reported here. Each future capability we want to enable at the VMM needs an analysis: 1) Does it require kernel SW changes, ie like BTM? Then it needs a kernel_capabilities bit to say the kernel SW exists 2) Does it require data from ACPI/DT/etc? Then it needs a kernel_capabilities bit 3) Does it need to be "turned on" per VM, ie with a VMS enablement? Then it needs a new request flag in ALLOC_VIOMMU 4) Otherwise it can be read directly from the idr[] array This is why the comment above is so stern that the VMM "should only forward bits that it knows it can support". > S2 tables it its own business. AFAICS, unless the VMM wants to do some > fiddly CD shadowing, it's going to be kinda hard to prevent the SMMU seeing > a guest CD with CD.HA and/or CD.HD set if the guest expects S1 HTTU to work. If the VMM wrongly indicates HTTU support to the VM, because it wrongly inspected those bits in the idr report, then it is just broken. > I would say it does. Advertising a feature when we already know it's not > usable at all puts a non-trivial and unnecessary burden on the VMM and VM to > then have to somehow derive that information from other sources, at the risk > of being confused by unexpected behaviour if they don't. That is not the purpose here, the register report is not to be used as "advertising features". It describes details of the raw HW that the VMM may need to use *some* of the fields. There are quite a few fields that fit #4 today: OAS, VAX, GRAN, BBML, CD2L, etc. Basically we will pass most of the bits and mask a few. If we get the masking wrong and pass something we shouldn't, then we've improved nothing compared to this proposal. I think we are likely to get the masking wrong :) > We sanitise CPU ID registers for userspace and KVM, so I see no compelling > reason for SMMU ID registers to be different. We discussed this already: https://lore.kernel.org/linux-iommu/20240904120103.GB3915968@nvidia.com It is a false comparison, for KVM the kernel is responsible to control the CPU ID registers. Reporting the registers the VM sees to the VMM makes alot of sense. For SMMU the VMM exclusively controls the VM's ID registers. If you still feel strongly about this please let me know by Friday and I will drop the idr[] array from this cycle. We can continue to discuss a solution for the next cycle. Regards, Jason
On 2024-11-06 6:05 pm, Jason Gunthorpe wrote: > On Wed, Nov 06, 2024 at 04:37:53PM +0000, Robin Murphy wrote: >> On 2024-11-04 12:41 pm, Jason Gunthorpe wrote: >>> On Mon, Nov 04, 2024 at 11:47:24AM +0000, Will Deacon wrote: >>>>> +/** >>>>> + * struct iommu_hw_info_arm_smmuv3 - ARM SMMUv3 hardware information >>>>> + * (IOMMU_HW_INFO_TYPE_ARM_SMMUV3) >>>>> + * >>>>> + * @flags: Must be set to 0 >>>>> + * @__reserved: Must be 0 >>>>> + * @idr: Implemented features for ARM SMMU Non-secure programming interface >>>>> + * @iidr: Information about the implementation and implementer of ARM SMMU, >>>>> + * and architecture version supported >>>>> + * @aidr: ARM SMMU architecture version >>>>> + * >>>>> + * For the details of @idr, @iidr and @aidr, please refer to the chapters >>>>> + * from 6.3.1 to 6.3.6 in the SMMUv3 Spec. >>>>> + * >>>>> + * User space should read the underlying ARM SMMUv3 hardware information for >>>>> + * the list of supported features. >>>>> + * >>>>> + * Note that these values reflect the raw HW capability, without any insight if >>>>> + * any required kernel driver support is present. Bits may be set indicating the >>>>> + * HW has functionality that is lacking kernel software support, such as BTM. If >>>>> + * a VMM is using this information to construct emulated copies of these >>>>> + * registers it should only forward bits that it knows it can support. >> >> But how *is* a VMM supposed to know what it can support? > > I answered a related question to Mostafa with an example: > > https://lore.kernel.org/linux-iommu/20240903235532.GJ3773488@nvidia.com/ > > "global" capabilities that are enabled directly from the CD entry > would follow the pattern. > >> Are they all expected to grovel the host devicetree/ACPI tables and >> maintain their own knowledge of implementation errata to understand >> what's actually usable? > > No, VMMs are expected to only implement base line features we have > working today and not blindly add new features based only HW registers > reported here. > > Each future capability we want to enable at the VMM needs an analysis: > > 1) Does it require kernel SW changes, ie like BTM? Then it needs a > kernel_capabilities bit to say the kernel SW exists > 2) Does it require data from ACPI/DT/etc? Then it needs a > kernel_capabilities bit > 3) Does it need to be "turned on" per VM, ie with a VMS enablement? > Then it needs a new request flag in ALLOC_VIOMMU > 4) Otherwise it can be read directly from the idr[] array > > This is why the comment above is so stern that the VMM "should only > forward bits that it knows it can support". So... you're saying this patch is in fact broken, or at least uselessly incomplete, since VMMs aren't allowed to emulate a vSMMU at all without first consulting some other interface which does not exist? Great. >> S2 tables it its own business. AFAICS, unless the VMM wants to do some >> fiddly CD shadowing, it's going to be kinda hard to prevent the SMMU seeing >> a guest CD with CD.HA and/or CD.HD set if the guest expects S1 HTTU to work. > > If the VMM wrongly indicates HTTU support to the VM, because it > wrongly inspected those bits in the idr report, then it is just > broken. What do you mean? We could have a system right now where the hardware is configured with SMMU_IDR0.HTTU=2, but it turned out that atomics were broken in the interconnect so firmware sets the IORT "HTTU override" field is set to 0. We know about that in the kernel, but all a VMM sees is iommu_hw_info_arm_smmuv3.idr[0] indicating HTTU=2. If it is "broken" to take the only information available at face value, assume HTTU is available, and reflect that in a vSMMU interface, then what is the correct thing to do, other than to not dare emulate a vSMMU at all, in fear of a sternly worded comment? >> I would say it does. Advertising a feature when we already know it's not >> usable at all puts a non-trivial and unnecessary burden on the VMM and VM to >> then have to somehow derive that information from other sources, at the risk >> of being confused by unexpected behaviour if they don't. > > That is not the purpose here, the register report is not to be used as > "advertising features". It describes details of the raw HW that the > VMM may need to use *some* of the fields. > > There are quite a few fields that fit #4 today: OAS, VAX, GRAN, BBML, > CD2L, etc. > > Basically we will pass most of the bits and mask a few. If we get the > masking wrong and pass something we shouldn't, then we've improved > nothing compared to this proposal. I think we are likely to get the > masking wrong :) Seriously? A simple inverse of the feature detection the kernel driver already does for its own needs, implemented once in the same place, is hard? Compared to maintaining the exact same information within the driver but in some new different form, and also maintaining it in the UAPI, and having every VMM ever all do the same work to put the two together, and always be up to date with the right UAPI, and never ever let any field slip through as-is, especially not all the ones which were RES0 at time of writing, enforced by a sternly worded comment? Why yes, of course I can see how that's trivially easy and carries no risk whatsoever. >> We sanitise CPU ID registers for userspace and KVM, so I see no compelling >> reason for SMMU ID registers to be different. > > We discussed this already: > > https://lore.kernel.org/linux-iommu/20240904120103.GB3915968@nvidia.com > > It is a false comparison, for KVM the kernel is responsible to control > the CPU ID registers. Reporting the registers the VM sees to the VMM > makes alot of sense. For SMMU the VMM exclusively controls the VM's ID > registers. Pointing out that two things are different is a false comparison because they are different, by virtue of your choice to make them different? Please try making sense. Your tautology still does not offer any reasoning against doing the logical thing and following the same basic pattern: the kernel uses the ID register mechanism itself to advertise the set of features it's able/willing to support, by sanitising the values it offers to the VMM, combining the notions of hardware and kernel support where the distinction is irrelevant anyway. The VMM is then still free to take those values and hide more features, or potentially add any that it is capable of emulating without the kernel's help, and advertise that final set to the VM. Obviously there are significant *implementation* differences, most notably that the latter VMM->VM part doesn't need to involve IOMMUFD at all since MMIO register emulation can stay entirely in userspace, whereas for CPU system registers the final VM-visible values need to be plugged back in to KVM for it to handle the traps. We are all asking you to explain why you think doing the kernel->VMM advertisement naturally and intuitively is somehow bad, and forcing VMMs to instead rely on a more complex, fragile, and crucially non-existent additional interface is better. You should take "We discussed this already" as more of a clue to yourself than to me - if 4 different people have all said the exact same thing in so many words, perhaps there's something in it... And in case I need to spell it out with less sarcasm, "we'll get masking wrong in the kernel" only implies "we'll get kernel_capabilities wrong in the kernel (and elsewhere)", so it's clearly not a useful argument to keep repeating. Besides, as KVM + sysfs + MRS emulation shows, we're pretty experienced at masking ID registers in the kernel. It's not hard to do it right in a robust manner, where particularly with the nature of SMMU features, the only real risk might be forgetting to expose something new once we do actually support it. > If you still feel strongly about this please let me know by Friday and > I will drop the idr[] array from this cycle. We can continue to > discuss a solution for the next cycle. It already can't work as-is, I don't see how making it even more broken would help. IMO it doesn't seem like a good idea to be merging UAPI at all while it's still clearly incomplete and by its own definition unusable. Thanks, Robin.
On Wed, Nov 06, 2024 at 09:05:26PM +0000, Robin Murphy wrote: > On 2024-11-06 6:05 pm, Jason Gunthorpe wrote: > > If you still feel strongly about this please let me know by Friday and > > I will drop the idr[] array from this cycle. We can continue to > > discuss a solution for the next cycle. > > It already can't work as-is, I don't see how making it even more broken > would help. IMO it doesn't seem like a good idea to be merging UAPI at > all while it's still clearly incomplete and by its own definition unusable. Robin, would you please give a clear suggestion for the hw_info? My takeaway is that you would want the unsupported features (per firmware overrides and errata) to be stripped from the reporting IDR array. Alternatively, we could start with some basic nesting features less those advanced ones (HTTU/PRI or so), and then add then later once we're comfortable to advertise. Does this sound okay to you? Thanks Nicolin
On Wed, Nov 06, 2024 at 09:05:26PM +0000, Robin Murphy wrote: > > Each future capability we want to enable at the VMM needs an analysis: > > > > 1) Does it require kernel SW changes, ie like BTM? Then it needs a > > kernel_capabilities bit to say the kernel SW exists > > 2) Does it require data from ACPI/DT/etc? Then it needs a > > kernel_capabilities bit > > 3) Does it need to be "turned on" per VM, ie with a VMS enablement? > > Then it needs a new request flag in ALLOC_VIOMMU > > 4) Otherwise it can be read directly from the idr[] array > > > > This is why the comment above is so stern that the VMM "should only > > forward bits that it knows it can support". > > So... you're saying this patch is in fact broken, or at least uselessly > incomplete, since VMMs aren't allowed to emulate a vSMMU at all without > first consulting some other interface which does not exist? Great. That is too far, there is nothing fundamentally broken here. This does not enable every SMMU feature in the architecture. It implements a basic baseline and no more. The VMMs are only permitted to read bits in that baseline. VMMs that want to go beyond that have to go through the 4 steps above. Fundamentally all we are arguing about here is if bits the VMM shouldn't even read should be forced to zero by the kernel. If the bits are forced to zero then perhaps they could be used as a future SW feature negotiation, and maybe that would be better, or maybe not. See below about BTM for a counter example. I agree the kdoc does not describe what the baseline actually is. > We know about that in the kernel, but all a VMM sees is > iommu_hw_info_arm_smmuv3.idr[0] indicating HTTU=2. If it is "broken" to take > the only information available at face value, assume HTTU is available, and > reflect that in a vSMMU interface, then what is the correct thing to do, > other than to not dare emulate a vSMMU at all, in fear of a sternly worded > comment? These patches support a baseline feature set. They do not support the entire SMMU architecture. They do not support vHTTU. Things are going step by step because this is a huge project. HTTU is not in the baseline, so it is definitely wrong for any VMM working on these patches to advertise support to the VM! The VMM *MUST* ignore such bits in the IDR report. The correct thing is the 4 steps I outlined above. When a VMM author wants to go beyond the baseline they have to determine what kernel and VMM software is required and implement the missing parts. > Your tautology still does not offer any reasoning against doing the logical > thing and following the same basic pattern: the kernel uses the ID register > mechanism itself to advertise the set of features it's able/willing to > support, by sanitising the values it offers to the VMM, combining the > notions of hardware and kernel support where the distinction is irrelevant > anyway. Even with your plan the VMM can not pass the kernel's IDR register into the VM unprotected. It must always sanitize it against the features that the VMM supports. Let's consider vBTM support. This requires kernel support to enable the global BTM register *AND* a new VMM ioctl flow with iommufd to link to KVM's VMID. You are suggesting that the old kernel should wire IDR BTM to 0 and the new kernel will set it to 1. However the VMM cannot just forward this bit to the VM! A old VMM that does not do the KVM flow does NOT support BTM. All VMM's must mask the BTM bit from day 0, or we can never set it to 1 in a future kernel. Which is exactly the same plan as this patch! > You should take "We discussed this already" > as more of a clue to yourself than to me - if 4 different people have all > said the exact same thing in so many words, perhaps there's something in > it... And all seemed to agree it was not a big deal after the discussion. I think Mostafa was driving in a direction that we break up the IDR into explicit fields and thus be explicit about what information the VMM is able to access. This would effectively document and enforce what the baseline is. > And in case I need to spell it out with less sarcasm, "we'll get masking > wrong in the kernel" only implies "we'll get kernel_capabilities wrong in > the kernel (and elsewhere)", Less sarcasm and hyperbole would be nice. > > If you still feel strongly about this please let me know by Friday and > > I will drop the idr[] array from this cycle. We can continue to > > discuss a solution for the next cycle. > > It already can't work as-is, As above, this is not true. > I don't see how making it even more broken > would help. IMO it doesn't seem like a good idea to be merging UAPI at all > while it's still clearly incomplete and by its own definition unusable. I agree that removing the idr would make it definitely unusable. There is quite a bit of essential information in there the VMM needs. However, the uAPI is extensible so adding a new field in the next cycle is not breaking. I'm trying to offer you an olive branch so we can have this discussion with time instead of urgently at the 11th hour and derailing the series. This exact topic has been discussed already two months ago, and is, IMHO, mostly a style choice not an urgent technical matter. If you have another option we can conclude in the next few days then I'm happy to hear it. If we focus only on the baseline functionality Nicolin can come up with a list of fields and we can write it very explicitly in the kdoc that the VMM can only read those fields. I could also reluctantly agree to permanent masking to report only the above kdoc bits in the kernel. Noting if you want this because you don't trust the VMM authors to follow an explicit kdoc, then we likely also cannot ever set some bits to 1. Eg BTM would be permanently 0. Regards, Jason
On Wed, Nov 06, 2024 at 10:35:07PM -0400, Jason Gunthorpe wrote:
> I agree the kdoc does not describe what the baseline actually is.
Nicolin worked on this, here is a more detailed kdoc:
/**
* struct iommu_hw_info_arm_smmuv3 - ARM SMMUv3 hardware information
* (IOMMU_HW_INFO_TYPE_ARM_SMMUV3)
*
* @flags: Must be set to 0
* @__reserved: Must be 0
* @idr: Implemented features for ARM SMMU Non-secure programming interface
* @iidr: Information about the implementation and implementer of ARM SMMU,
* and architecture version supported
* @aidr: ARM SMMU architecture version
*
* For the details of @idr, @iidr and @aidr, please refer to the chapters
* from 6.3.1 to 6.3.6 in the SMMUv3 Spec.
*
* This reports the raw HW capability, and not all bits are meaningful to be
* read by userspace. Only the following fields should be used:
*
* idr[0]: ST_LEVEL, TERM_MODEL, STALL_MODEL, TTENDIAN , CD2L, ASID16, TTF
* idr[1]: SIDSIZE, SSIDSIZE
* idr[3]: BBML, RIL
* idr[5]: VAX, GRAN64K, GRAN16K, GRAN4K
*
* - S1P should be assumed to be true if a NESTED HWPT can be created
* - VFIO/iommufd only support platforms with COHACC, it should be assumed to be
* true.
* - ATS is a per-device property. If the VMM describes any devices as ATS
* capable in ACPI/DT it should set the corresponding idr.
*
* This list may expand in future (eg E0PD, AIE, PBHA, D128, DS etc). It is
* important that VMMs do not read bits outside the list to allow for
* compatibility with future kernels. Several features in the SMMUv3
* architecture are not currently supported by the kernel for nesting: HTTU,
* BTM, MPAM and others.
*/
This focuses on stuff we can actually test and gives a path to test
and confirm a no-code update to future stuff.
The future list (E0PD/etc) reflects things the current kernel doesn't
use. Our naive read of the spec suggests they are probably fine to
just read the raw HW IDR. When someone implements guest kernel
support, does a detailed spec read, and crucially tests them - then we
can update the comment and have immediate support.
HTTU, BTM, and others like that will need additional bits outside the
IDR if someone wishes to enable them.
Jason
Hi guys, On Wed, Nov 06, 2024 at 10:35:06PM -0400, Jason Gunthorpe wrote: > On Wed, Nov 06, 2024 at 09:05:26PM +0000, Robin Murphy wrote: > > You should take "We discussed this already" > > as more of a clue to yourself than to me - if 4 different people have all > > said the exact same thing in so many words, perhaps there's something in > > it... > > And all seemed to agree it was not a big deal after the discussion. > > I think Mostafa was driving in a direction that we break up the IDR > into explicit fields and thus be explicit about what information the > VMM is able to access. This would effectively document and enforce > what the baseline is. As one of the four people mentioned above, I figured I'd chime in with my rationale for queuing this in case it's of any help or interest. Initially, I was reasonably sure that we should be sanitising the ID registers and being selective about what we advertise to userspace. However, after Jason's reply to my comments, mulling it over in my head and having lively conversations with Mostafa at lunchtime, I've come full circle. Is it a great interface? Not at all. It's 8 register values copied from the hardware to userspace. But is it good enough? I think it is. It's also extremely simple (i.e. easy to explain what it does and trivial to implement), which I think is a huge benefit given that the IOMMUFD work around it is still evolving. I'm firmly of the opinion that the VMM is going to need a tonne of help from other sources to expose a virtual IOMMU successfully. For example, anything relating to policy or configuration choices should be driven from userspace rather than the kernel. If we start to expose policy in the id registers for the cases where it happens to fit nicely, I fear that this will backfire and the VMM will end up second-guessing the kernel in cases where it decides it knows best. I'm not sold on the analogy with CPU ID registers as (a) we don't have big/little idiocy to deal with in the SMMU and (b) I don't think SMMU features always need support code in the host, which is quite unlike many CPU features that cannot be exposed safely without hypervisor context-switching support. On top of all that, this interface can be extended if we change our minds. If we decide we need to mask out fields, I think we could add that after the fact. Hell, if we all decide that it's a disaster in a few releases time, we can try something else. Ultimately, Jason is the one maintaining IOMMUFD and he gets to deal with the problems if his UAPI doesn't work out :) So, given where we are in the cycle, I think the pragmatic thing to do is to land this change now and enable the ongoing IOMMUFD work to continue. We still have over two months to resolve any major problems with the interface (and even unplug it entirely from the driver if we really get stuck) but for now I think it's "fine". Will
On Fri, Nov 08, 2024 at 02:53:22PM +0000, Will Deacon wrote: > So, given where we are in the cycle, I think the pragmatic thing to do > is to land this change now and enable the ongoing IOMMUFD work to > continue. We still have over two months to resolve any major problems > with the interface (and even unplug it entirely from the driver if we > really get stuck) but for now I think it's "fine". Thanks Will, this all eloquently captures my line of reasoning also. I will add that after going through the exercise with Nicolin, to write down all the fields the VMM is allowed to touch, it seems we do have a list of future fields that likely do not require any host support (E0PD, AIE, PBHA, D128, DS). We also have ones that will (BTM) and then the fwspec weirdo of HTTU. Given those ratios I feel pretty good about showing that data to the userspace, expecting that down the road we will "turn it on" for E0PD/etc without any kernel change, and BTM/HTTU will not use idr for discovery. The biggest risk is the VMM's badly muck it up. I now think it was a lazy shortcut to not enumerate all the fields in the kdoc. Now that we've done that we see that the (unreviewed) RFC qemu patches did miss some things and it is being corrected. Nicolin and I had some discussion if we should go ahead and include E0PD/etc in the documentation right now, but we both felt some uncertainty that we really understood those features deeply enough to be confident, and have no way to test. There is also the nanny option of zeroing everything not in the approved list, but I feel like it is already so hard to write a VMM vSMMU3 that is too nannyish. Hoepfully time will not show that my opinion of VMM authors is too high. :\ Jason
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig index b3aa1f5d53218b..0c9bceb1653d5f 100644 --- a/drivers/iommu/Kconfig +++ b/drivers/iommu/Kconfig @@ -415,6 +415,15 @@ config ARM_SMMU_V3_SVA Say Y here if your system supports SVA extensions such as PCIe PASID and PRI. +config ARM_SMMU_V3_IOMMUFD + bool "Enable IOMMUFD features for ARM SMMUv3 (EXPERIMENTAL)" + depends on IOMMUFD + help + Support for IOMMUFD features intended to support virtual machines + with accelerated virtual IOMMUs. + + Say Y here if you are doing development and testing on this feature. + config ARM_SMMU_V3_KUNIT_TEST tristate "KUnit tests for arm-smmu-v3 driver" if !KUNIT_ALL_TESTS depends on KUNIT diff --git a/drivers/iommu/arm/arm-smmu-v3/Makefile b/drivers/iommu/arm/arm-smmu-v3/Makefile index dc98c88b48c827..493a659cc66bb2 100644 --- a/drivers/iommu/arm/arm-smmu-v3/Makefile +++ b/drivers/iommu/arm/arm-smmu-v3/Makefile @@ -1,6 +1,7 @@ # SPDX-License-Identifier: GPL-2.0 obj-$(CONFIG_ARM_SMMU_V3) += arm_smmu_v3.o arm_smmu_v3-y := arm-smmu-v3.o +arm_smmu_v3-$(CONFIG_ARM_SMMU_V3_IOMMUFD) += arm-smmu-v3-iommufd.o arm_smmu_v3-$(CONFIG_ARM_SMMU_V3_SVA) += arm-smmu-v3-sva.o arm_smmu_v3-$(CONFIG_TEGRA241_CMDQV) += tegra241-cmdqv.o diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c new file mode 100644 index 00000000000000..3d2671031c9bb5 --- /dev/null +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c @@ -0,0 +1,31 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES + */ + +#include <uapi/linux/iommufd.h> + +#include "arm-smmu-v3.h" + +void *arm_smmu_hw_info(struct device *dev, u32 *length, u32 *type) +{ + struct arm_smmu_master *master = dev_iommu_priv_get(dev); + struct iommu_hw_info_arm_smmuv3 *info; + u32 __iomem *base_idr; + unsigned int i; + + info = kzalloc(sizeof(*info), GFP_KERNEL); + if (!info) + return ERR_PTR(-ENOMEM); + + base_idr = master->smmu->base + ARM_SMMU_IDR0; + for (i = 0; i <= 5; i++) + info->idr[i] = readl_relaxed(base_idr + i); + info->iidr = readl_relaxed(master->smmu->base + ARM_SMMU_IIDR); + info->aidr = readl_relaxed(master->smmu->base + ARM_SMMU_AIDR); + + *length = sizeof(*info); + *type = IOMMU_HW_INFO_TYPE_ARM_SMMUV3; + + return info; +} diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index 38725810c14eeb..996774d461aea2 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -3506,6 +3506,7 @@ static struct iommu_ops arm_smmu_ops = { .identity_domain = &arm_smmu_identity_domain, .blocked_domain = &arm_smmu_blocked_domain, .capable = arm_smmu_capable, + .hw_info = arm_smmu_hw_info, .domain_alloc_paging = arm_smmu_domain_alloc_paging, .domain_alloc_sva = arm_smmu_sva_domain_alloc, .domain_alloc_user = arm_smmu_domain_alloc_user, diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h index 06e3d88932df12..66261fd5bfb2d2 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h @@ -81,6 +81,8 @@ struct arm_smmu_device; #define IIDR_REVISION GENMASK(15, 12) #define IIDR_IMPLEMENTER GENMASK(11, 0) +#define ARM_SMMU_AIDR 0x1C + #define ARM_SMMU_CR0 0x20 #define CR0_ATSCHK (1 << 4) #define CR0_CMDQEN (1 << 3) @@ -956,4 +958,11 @@ tegra241_cmdqv_probe(struct arm_smmu_device *smmu) return ERR_PTR(-ENODEV); } #endif /* CONFIG_TEGRA241_CMDQV */ + +#if IS_ENABLED(CONFIG_ARM_SMMU_V3_IOMMUFD) +void *arm_smmu_hw_info(struct device *dev, u32 *length, u32 *type); +#else +#define arm_smmu_hw_info NULL +#endif /* CONFIG_ARM_SMMU_V3_IOMMUFD */ + #endif /* _ARM_SMMU_V3_H */ diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h index e266dfa6a38d9d..b227ac16333fe1 100644 --- a/include/uapi/linux/iommufd.h +++ b/include/uapi/linux/iommufd.h @@ -488,15 +488,50 @@ struct iommu_hw_info_vtd { __aligned_u64 ecap_reg; }; +/** + * struct iommu_hw_info_arm_smmuv3 - ARM SMMUv3 hardware information + * (IOMMU_HW_INFO_TYPE_ARM_SMMUV3) + * + * @flags: Must be set to 0 + * @__reserved: Must be 0 + * @idr: Implemented features for ARM SMMU Non-secure programming interface + * @iidr: Information about the implementation and implementer of ARM SMMU, + * and architecture version supported + * @aidr: ARM SMMU architecture version + * + * For the details of @idr, @iidr and @aidr, please refer to the chapters + * from 6.3.1 to 6.3.6 in the SMMUv3 Spec. + * + * User space should read the underlying ARM SMMUv3 hardware information for + * the list of supported features. + * + * Note that these values reflect the raw HW capability, without any insight if + * any required kernel driver support is present. Bits may be set indicating the + * HW has functionality that is lacking kernel software support, such as BTM. If + * a VMM is using this information to construct emulated copies of these + * registers it should only forward bits that it knows it can support. + * + * In future, presence of required kernel support will be indicated in flags. + */ +struct iommu_hw_info_arm_smmuv3 { + __u32 flags; + __u32 __reserved; + __u32 idr[6]; + __u32 iidr; + __u32 aidr; +}; + /** * enum iommu_hw_info_type - IOMMU Hardware Info Types * @IOMMU_HW_INFO_TYPE_NONE: Used by the drivers that do not report hardware * info * @IOMMU_HW_INFO_TYPE_INTEL_VTD: Intel VT-d iommu info type + * @IOMMU_HW_INFO_TYPE_ARM_SMMUV3: ARM SMMUv3 iommu info type */ enum iommu_hw_info_type { IOMMU_HW_INFO_TYPE_NONE = 0, IOMMU_HW_INFO_TYPE_INTEL_VTD = 1, + IOMMU_HW_INFO_TYPE_ARM_SMMUV3 = 2, }; /**