diff mbox series

[v4,05/12] iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct arm_smmu_hw_info

Message ID 5-v4-9e99b76f3518+3a8-smmuv3_nesting_jgg@nvidia.com (mailing list archive)
State New
Headers show
Series Initial support for SMMUv3 nested translation | expand

Commit Message

Jason Gunthorpe Oct. 31, 2024, 12:20 a.m. UTC
From: Nicolin Chen <nicolinc@nvidia.com>

For virtualization cases the IDR/IIDR/AIDR values of the actual SMMU
instance need to be available to the VMM so it can construct an
appropriate vSMMUv3 that reflects the correct HW capabilities.

For userspace page tables these values are required to constrain the valid
values within the CD table and the IOPTEs.

The kernel does not sanitize these values. If building a VMM then
userspace is required to only forward bits into a VM that it knows it can
implement. Some bits will also require a VMM to detect if appropriate
kernel support is available such as for ATS and BTM.

Start a new file and kconfig for the advanced iommufd support. This lets
it be compiled out for kernels that are not intended to support
virtualization, and allows distros to leave it disabled until they are
shipping a matching qemu too.

Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Reviewed-by: Donald Dutile <ddutile@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/Kconfig                         |  9 +++++
 drivers/iommu/arm/arm-smmu-v3/Makefile        |  1 +
 .../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c     | 31 ++++++++++++++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   |  1 +
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h   |  9 +++++
 include/uapi/linux/iommufd.h                  | 35 +++++++++++++++++++
 6 files changed, 86 insertions(+)
 create mode 100644 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c

Comments

Will Deacon Nov. 4, 2024, 11:47 a.m. UTC | #1
On Wed, Oct 30, 2024 at 09:20:49PM -0300, Jason Gunthorpe wrote:
> From: Nicolin Chen <nicolinc@nvidia.com>
> 
> For virtualization cases the IDR/IIDR/AIDR values of the actual SMMU
> instance need to be available to the VMM so it can construct an
> appropriate vSMMUv3 that reflects the correct HW capabilities.
> 
> For userspace page tables these values are required to constrain the valid
> values within the CD table and the IOPTEs.
> 
> The kernel does not sanitize these values. If building a VMM then
> userspace is required to only forward bits into a VM that it knows it can
> implement. Some bits will also require a VMM to detect if appropriate
> kernel support is available such as for ATS and BTM.
> 
> Start a new file and kconfig for the advanced iommufd support. This lets
> it be compiled out for kernels that are not intended to support
> virtualization, and allows distros to leave it disabled until they are
> shipping a matching qemu too.
> 
> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
> Reviewed-by: Donald Dutile <ddutile@redhat.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

--->8

> diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
> index e266dfa6a38d9d..b227ac16333fe1 100644
> --- a/include/uapi/linux/iommufd.h
> +++ b/include/uapi/linux/iommufd.h
> @@ -488,15 +488,50 @@ struct iommu_hw_info_vtd {
>  	__aligned_u64 ecap_reg;
>  };
>  
> +/**
> + * struct iommu_hw_info_arm_smmuv3 - ARM SMMUv3 hardware information
> + *                                   (IOMMU_HW_INFO_TYPE_ARM_SMMUV3)
> + *
> + * @flags: Must be set to 0
> + * @__reserved: Must be 0
> + * @idr: Implemented features for ARM SMMU Non-secure programming interface
> + * @iidr: Information about the implementation and implementer of ARM SMMU,
> + *        and architecture version supported
> + * @aidr: ARM SMMU architecture version
> + *
> + * For the details of @idr, @iidr and @aidr, please refer to the chapters
> + * from 6.3.1 to 6.3.6 in the SMMUv3 Spec.
> + *
> + * User space should read the underlying ARM SMMUv3 hardware information for
> + * the list of supported features.
> + *
> + * Note that these values reflect the raw HW capability, without any insight if
> + * any required kernel driver support is present. Bits may be set indicating the
> + * HW has functionality that is lacking kernel software support, such as BTM. If
> + * a VMM is using this information to construct emulated copies of these
> + * registers it should only forward bits that it knows it can support.
> + *
> + * In future, presence of required kernel support will be indicated in flags.

What about the case where we _know_ that some functionality is broken in
the hardware? For example, we nobble BTM support on MMU 700 thanks to
erratum #2812531 yet we'll still cheerfully advertise it in IDR0 here.
Similarly, HTTU can be overridden by IORT, so should we update the view
that we advertise for that as well?

Will
Jason Gunthorpe Nov. 4, 2024, 12:41 p.m. UTC | #2
On Mon, Nov 04, 2024 at 11:47:24AM +0000, Will Deacon wrote:
> > +/**
> > + * struct iommu_hw_info_arm_smmuv3 - ARM SMMUv3 hardware information
> > + *                                   (IOMMU_HW_INFO_TYPE_ARM_SMMUV3)
> > + *
> > + * @flags: Must be set to 0
> > + * @__reserved: Must be 0
> > + * @idr: Implemented features for ARM SMMU Non-secure programming interface
> > + * @iidr: Information about the implementation and implementer of ARM SMMU,
> > + *        and architecture version supported
> > + * @aidr: ARM SMMU architecture version
> > + *
> > + * For the details of @idr, @iidr and @aidr, please refer to the chapters
> > + * from 6.3.1 to 6.3.6 in the SMMUv3 Spec.
> > + *
> > + * User space should read the underlying ARM SMMUv3 hardware information for
> > + * the list of supported features.
> > + *
> > + * Note that these values reflect the raw HW capability, without any insight if
> > + * any required kernel driver support is present. Bits may be set indicating the
> > + * HW has functionality that is lacking kernel software support, such as BTM. If
> > + * a VMM is using this information to construct emulated copies of these
> > + * registers it should only forward bits that it knows it can support.
> > + *
> > + * In future, presence of required kernel support will be indicated in flags.
> 
> What about the case where we _know_ that some functionality is broken in
> the hardware? For example, we nobble BTM support on MMU 700 thanks to
> erratum #2812531 yet we'll still cheerfully advertise it in IDR0 here.
> Similarly, HTTU can be overridden by IORT, so should we update the view
> that we advertise for that as well?

My knee jerk answer is no, these struct fields should just report the
raw HW register. A VMM should not copy these fields directly into a
VM. The principle purpose is to give the VMM the same details about the
HW as the kernel so it can apply erratas/etc.

For instance, if we hide these fields how will the VMM/VM know to
apply the various flushing errata? With vCMDQ/etc the VM is directly
pushing flushes to HW, it must know the errata.

For BTM/HTTU/etc - those all require kernel SW support and per-device
permission in the kernel to turn on. For instance requesting a nested
vSTE that needs BTM will fail today during attach. Turning on HTTU on
the S2 already has an API that will fail if the IORT blocks it.

Incrementally dealing with expanding the support is part of the
"required kernel support will be indicated in flags."

Basically, exposing the information as-is doesn't do any harm.

Jason
Robin Murphy Nov. 6, 2024, 4:37 p.m. UTC | #3
On 2024-11-04 12:41 pm, Jason Gunthorpe wrote:
> On Mon, Nov 04, 2024 at 11:47:24AM +0000, Will Deacon wrote:
>>> +/**
>>> + * struct iommu_hw_info_arm_smmuv3 - ARM SMMUv3 hardware information
>>> + *                                   (IOMMU_HW_INFO_TYPE_ARM_SMMUV3)
>>> + *
>>> + * @flags: Must be set to 0
>>> + * @__reserved: Must be 0
>>> + * @idr: Implemented features for ARM SMMU Non-secure programming interface
>>> + * @iidr: Information about the implementation and implementer of ARM SMMU,
>>> + *        and architecture version supported
>>> + * @aidr: ARM SMMU architecture version
>>> + *
>>> + * For the details of @idr, @iidr and @aidr, please refer to the chapters
>>> + * from 6.3.1 to 6.3.6 in the SMMUv3 Spec.
>>> + *
>>> + * User space should read the underlying ARM SMMUv3 hardware information for
>>> + * the list of supported features.
>>> + *
>>> + * Note that these values reflect the raw HW capability, without any insight if
>>> + * any required kernel driver support is present. Bits may be set indicating the
>>> + * HW has functionality that is lacking kernel software support, such as BTM. If
>>> + * a VMM is using this information to construct emulated copies of these
>>> + * registers it should only forward bits that it knows it can support.

But how *is* a VMM supposed to know what it can support? Are they all 
expected to grovel the host devicetree/ACPI tables and maintain their 
own knowledge of implementation errata to understand what's actually usable?

>>> + *
>>> + * In future, presence of required kernel support will be indicated in flags.
>>
>> What about the case where we _know_ that some functionality is broken in
>> the hardware? For example, we nobble BTM support on MMU 700 thanks to
>> erratum #2812531 yet we'll still cheerfully advertise it in IDR0 here.
>> Similarly, HTTU can be overridden by IORT, so should we update the view
>> that we advertise for that as well?
> 
> My knee jerk answer is no, these struct fields should just report the
> raw HW register. A VMM should not copy these fields directly into a
> VM. The principle purpose is to give the VMM the same details about the
> HW as the kernel so it can apply erratas/etc.
> 
> For instance, if we hide these fields how will the VMM/VM know to
> apply the various flushing errata? With vCMDQ/etc the VM is directly
> pushing flushes to HW, it must know the errata.

That doesn't seem like a valid argument. We obviously can't abstract 
SMMU_IIDR, that would indeed be an invitation for trouble, but 
otherwise, if an erratum affects S1 operation under conditions dependent 
on an optional feature, then not advertising that feature would make the 
workaround irrelevant anyway, since as far as the VM is concerned it 
would be wrong to expect a non-existent feature to work in the first place.

> For BTM/HTTU/etc - those all require kernel SW support and per-device
> permission in the kernel to turn on. For instance requesting a nested
> vSTE that needs BTM will fail today during attach. Turning on HTTU on
> the S2 already has an API that will fail if the IORT blocks it.

What does S2 HTTU have to do with the VM? How the host wants to maintain 
its S2 tables it its own business. AFAICS, unless the VMM wants to do 
some fiddly CD shadowing, it's going to be kinda hard to prevent the 
SMMU seeing a guest CD with CD.HA and/or CD.HD set if the guest expects 
S1 HTTU to work.

I'm not sure what "vSTE that needs BTM" means. Even if the system does 
support BTM, the only control is the global SMMU_CR2.PTM, and a vSMMU 
can't usefully emulate changing that either way. Either the host set 
PTM=0 before enabling the SMMU, so BTM can be advertised and expected to 
work, or it didn't, in which case there can be no BTM, full stop.

> Incrementally dealing with expanding the support is part of the
> "required kernel support will be indicated in flags."
> 
> Basically, exposing the information as-is doesn't do any harm.

I would say it does. Advertising a feature when we already know it's not 
usable at all puts a non-trivial and unnecessary burden on the VMM and 
VM to then have to somehow derive that information from other sources, 
at the risk of being confused by unexpected behaviour if they don't.

We sanitise CPU ID registers for userspace and KVM, so I see no 
compelling reason for SMMU ID registers to be different.

Thanks,
Robin.
Jason Gunthorpe Nov. 6, 2024, 6:05 p.m. UTC | #4
On Wed, Nov 06, 2024 at 04:37:53PM +0000, Robin Murphy wrote:
> On 2024-11-04 12:41 pm, Jason Gunthorpe wrote:
> > On Mon, Nov 04, 2024 at 11:47:24AM +0000, Will Deacon wrote:
> > > > +/**
> > > > + * struct iommu_hw_info_arm_smmuv3 - ARM SMMUv3 hardware information
> > > > + *                                   (IOMMU_HW_INFO_TYPE_ARM_SMMUV3)
> > > > + *
> > > > + * @flags: Must be set to 0
> > > > + * @__reserved: Must be 0
> > > > + * @idr: Implemented features for ARM SMMU Non-secure programming interface
> > > > + * @iidr: Information about the implementation and implementer of ARM SMMU,
> > > > + *        and architecture version supported
> > > > + * @aidr: ARM SMMU architecture version
> > > > + *
> > > > + * For the details of @idr, @iidr and @aidr, please refer to the chapters
> > > > + * from 6.3.1 to 6.3.6 in the SMMUv3 Spec.
> > > > + *
> > > > + * User space should read the underlying ARM SMMUv3 hardware information for
> > > > + * the list of supported features.
> > > > + *
> > > > + * Note that these values reflect the raw HW capability, without any insight if
> > > > + * any required kernel driver support is present. Bits may be set indicating the
> > > > + * HW has functionality that is lacking kernel software support, such as BTM. If
> > > > + * a VMM is using this information to construct emulated copies of these
> > > > + * registers it should only forward bits that it knows it can support.
> 
> But how *is* a VMM supposed to know what it can support?

I answered a related question to Mostafa with an example:

https://lore.kernel.org/linux-iommu/20240903235532.GJ3773488@nvidia.com/

"global" capabilities that are enabled directly from the CD entry
would follow the pattern.

> Are they all expected to grovel the host devicetree/ACPI tables and
> maintain their own knowledge of implementation errata to understand
> what's actually usable?

No, VMMs are expected to only implement base line features we have
working today and not blindly add new features based only HW registers
reported here.

Each future capability we want to enable at the VMM needs an analysis:

 1) Does it require kernel SW changes, ie like BTM? Then it needs a
    kernel_capabilities bit to say the kernel SW exists
 2) Does it require data from ACPI/DT/etc? Then it needs a
    kernel_capabilities bit
 3) Does it need to be "turned on" per VM, ie with a VMS enablement?
    Then it needs a new request flag in ALLOC_VIOMMU
 4) Otherwise it can be read directly from the idr[] array

This is why the comment above is so stern that the VMM "should only
forward bits that it knows it can support".

> S2 tables it its own business. AFAICS, unless the VMM wants to do some
> fiddly CD shadowing, it's going to be kinda hard to prevent the SMMU seeing
> a guest CD with CD.HA and/or CD.HD set if the guest expects S1 HTTU to work.

If the VMM wrongly indicates HTTU support to the VM, because it
wrongly inspected those bits in the idr report, then it is just
broken.

> I would say it does. Advertising a feature when we already know it's not
> usable at all puts a non-trivial and unnecessary burden on the VMM and VM to
> then have to somehow derive that information from other sources, at the risk
> of being confused by unexpected behaviour if they don't.

That is not the purpose here, the register report is not to be used as
"advertising features". It describes details of the raw HW that the
VMM may need to use *some* of the fields.

There are quite a few fields that fit #4 today: OAS, VAX, GRAN, BBML,
CD2L, etc.

Basically we will pass most of the bits and mask a few. If we get the
masking wrong and pass something we shouldn't, then we've improved
nothing compared to this proposal. I think we are likely to get the
masking wrong :)

> We sanitise CPU ID registers for userspace and KVM, so I see no compelling
> reason for SMMU ID registers to be different.

We discussed this already:

https://lore.kernel.org/linux-iommu/20240904120103.GB3915968@nvidia.com

It is a false comparison, for KVM the kernel is responsible to control
the CPU ID registers. Reporting the registers the VM sees to the VMM
makes alot of sense. For SMMU the VMM exclusively controls the VM's ID
registers.

If you still feel strongly about this please let me know by Friday and
I will drop the idr[] array from this cycle. We can continue to
discuss a solution for the next cycle.

Regards,
Jason
Robin Murphy Nov. 6, 2024, 9:05 p.m. UTC | #5
On 2024-11-06 6:05 pm, Jason Gunthorpe wrote:
> On Wed, Nov 06, 2024 at 04:37:53PM +0000, Robin Murphy wrote:
>> On 2024-11-04 12:41 pm, Jason Gunthorpe wrote:
>>> On Mon, Nov 04, 2024 at 11:47:24AM +0000, Will Deacon wrote:
>>>>> +/**
>>>>> + * struct iommu_hw_info_arm_smmuv3 - ARM SMMUv3 hardware information
>>>>> + *                                   (IOMMU_HW_INFO_TYPE_ARM_SMMUV3)
>>>>> + *
>>>>> + * @flags: Must be set to 0
>>>>> + * @__reserved: Must be 0
>>>>> + * @idr: Implemented features for ARM SMMU Non-secure programming interface
>>>>> + * @iidr: Information about the implementation and implementer of ARM SMMU,
>>>>> + *        and architecture version supported
>>>>> + * @aidr: ARM SMMU architecture version
>>>>> + *
>>>>> + * For the details of @idr, @iidr and @aidr, please refer to the chapters
>>>>> + * from 6.3.1 to 6.3.6 in the SMMUv3 Spec.
>>>>> + *
>>>>> + * User space should read the underlying ARM SMMUv3 hardware information for
>>>>> + * the list of supported features.
>>>>> + *
>>>>> + * Note that these values reflect the raw HW capability, without any insight if
>>>>> + * any required kernel driver support is present. Bits may be set indicating the
>>>>> + * HW has functionality that is lacking kernel software support, such as BTM. If
>>>>> + * a VMM is using this information to construct emulated copies of these
>>>>> + * registers it should only forward bits that it knows it can support.
>>
>> But how *is* a VMM supposed to know what it can support?
> 
> I answered a related question to Mostafa with an example:
> 
> https://lore.kernel.org/linux-iommu/20240903235532.GJ3773488@nvidia.com/
> 
> "global" capabilities that are enabled directly from the CD entry
> would follow the pattern.
> 
>> Are they all expected to grovel the host devicetree/ACPI tables and
>> maintain their own knowledge of implementation errata to understand
>> what's actually usable?
> 
> No, VMMs are expected to only implement base line features we have
> working today and not blindly add new features based only HW registers
> reported here.
> 
> Each future capability we want to enable at the VMM needs an analysis:
> 
>   1) Does it require kernel SW changes, ie like BTM? Then it needs a
>      kernel_capabilities bit to say the kernel SW exists
>   2) Does it require data from ACPI/DT/etc? Then it needs a
>      kernel_capabilities bit
>   3) Does it need to be "turned on" per VM, ie with a VMS enablement?
>      Then it needs a new request flag in ALLOC_VIOMMU
>   4) Otherwise it can be read directly from the idr[] array
> 
> This is why the comment above is so stern that the VMM "should only
> forward bits that it knows it can support".

So... you're saying this patch is in fact broken, or at least uselessly 
incomplete, since VMMs aren't allowed to emulate a vSMMU at all without 
first consulting some other interface which does not exist? Great.

>> S2 tables it its own business. AFAICS, unless the VMM wants to do some
>> fiddly CD shadowing, it's going to be kinda hard to prevent the SMMU seeing
>> a guest CD with CD.HA and/or CD.HD set if the guest expects S1 HTTU to work.
> 
> If the VMM wrongly indicates HTTU support to the VM, because it
> wrongly inspected those bits in the idr report, then it is just
> broken.

What do you mean? We could have a system right now where the hardware is 
configured with SMMU_IDR0.HTTU=2, but it turned out that atomics were 
broken in the interconnect so firmware sets the IORT "HTTU override" 
field is set to 0. We know about that in the kernel, but all a VMM sees 
is iommu_hw_info_arm_smmuv3.idr[0] indicating HTTU=2. If it is "broken" 
to take the only information available at face value, assume HTTU is 
available, and reflect that in a vSMMU interface, then what is the 
correct thing to do, other than to not dare emulate a vSMMU at all, in 
fear of a sternly worded comment?

>> I would say it does. Advertising a feature when we already know it's not
>> usable at all puts a non-trivial and unnecessary burden on the VMM and VM to
>> then have to somehow derive that information from other sources, at the risk
>> of being confused by unexpected behaviour if they don't.
> 
> That is not the purpose here, the register report is not to be used as
> "advertising features". It describes details of the raw HW that the
> VMM may need to use *some* of the fields.
> 
> There are quite a few fields that fit #4 today: OAS, VAX, GRAN, BBML,
> CD2L, etc.
> 
> Basically we will pass most of the bits and mask a few. If we get the
> masking wrong and pass something we shouldn't, then we've improved
> nothing compared to this proposal. I think we are likely to get the
> masking wrong :)

Seriously? A simple inverse of the feature detection the kernel driver 
already does for its own needs, implemented once in the same place, is hard?

Compared to maintaining the exact same information within the driver but 
in some new different form, and also maintaining it in the UAPI, and 
having every VMM ever all do the same work to put the two together, and 
always be up to date with the right UAPI, and never ever let any field 
slip through as-is, especially not all the ones which were RES0 at time 
of writing, enforced by a sternly worded comment? Why yes, of course I 
can see how that's trivially easy and carries no risk whatsoever.

>> We sanitise CPU ID registers for userspace and KVM, so I see no compelling
>> reason for SMMU ID registers to be different.
> 
> We discussed this already:
> 
> https://lore.kernel.org/linux-iommu/20240904120103.GB3915968@nvidia.com
> 
> It is a false comparison, for KVM the kernel is responsible to control
> the CPU ID registers. Reporting the registers the VM sees to the VMM
> makes alot of sense. For SMMU the VMM exclusively controls the VM's ID
> registers.

Pointing out that two things are different is a false comparison because 
they are different, by virtue of your choice to make them different? 
Please try making sense.

Your tautology still does not offer any reasoning against doing the 
logical thing and following the same basic pattern: the kernel uses the 
ID register mechanism itself to advertise the set of features it's 
able/willing to support, by sanitising the values it offers to the VMM, 
combining the notions of hardware and kernel support where the 
distinction is irrelevant anyway. The VMM is then still free to take 
those values and hide more features, or potentially add any that it is 
capable of emulating without the kernel's help, and advertise that final 
set to the VM. Obviously there are significant *implementation* 
differences, most notably that the latter VMM->VM part doesn't need to 
involve IOMMUFD at all since MMIO register emulation can stay entirely 
in userspace, whereas for CPU system registers the final VM-visible 
values need to be plugged back in to KVM for it to handle the traps.

We are all asking you to explain why you think doing the kernel->VMM 
advertisement naturally and intuitively is somehow bad, and forcing VMMs 
to instead rely on a more complex, fragile, and crucially non-existent 
additional interface is better. You should take "We discussed this 
already" as more of a clue to yourself than to me - if 4 different 
people have all said the exact same thing in so many words, perhaps 
there's something in it...

And in case I need to spell it out with less sarcasm, "we'll get masking 
wrong in the kernel" only implies "we'll get kernel_capabilities wrong 
in the kernel (and elsewhere)", so it's clearly not a useful argument to 
keep repeating. Besides, as KVM + sysfs + MRS emulation shows, we're 
pretty experienced at masking ID registers in the kernel. It's not hard 
to do it right in a robust manner, where particularly with the nature of 
SMMU features, the only real risk might be forgetting to expose 
something new once we do actually support it.

> If you still feel strongly about this please let me know by Friday and
> I will drop the idr[] array from this cycle. We can continue to
> discuss a solution for the next cycle.

It already can't work as-is, I don't see how making it even more broken 
would help. IMO it doesn't seem like a good idea to be merging UAPI at 
all while it's still clearly incomplete and by its own definition unusable.

Thanks,
Robin.
Nicolin Chen Nov. 6, 2024, 9:53 p.m. UTC | #6
On Wed, Nov 06, 2024 at 09:05:26PM +0000, Robin Murphy wrote:
> On 2024-11-06 6:05 pm, Jason Gunthorpe wrote:
> > If you still feel strongly about this please let me know by Friday and
> > I will drop the idr[] array from this cycle. We can continue to
> > discuss a solution for the next cycle.
> 
> It already can't work as-is, I don't see how making it even more broken
> would help. IMO it doesn't seem like a good idea to be merging UAPI at
> all while it's still clearly incomplete and by its own definition unusable.

Robin, would you please give a clear suggestion for the hw_info?

My takeaway is that you would want the unsupported features (per
firmware overrides and errata) to be stripped from the reporting
IDR array. Alternatively, we could start with some basic nesting
features less those advanced ones (HTTU/PRI or so), and then add
then later once we're comfortable to advertise.

Does this sound okay to you?

Thanks
Nicolin
Jason Gunthorpe Nov. 7, 2024, 2:35 a.m. UTC | #7
On Wed, Nov 06, 2024 at 09:05:26PM +0000, Robin Murphy wrote:
> > Each future capability we want to enable at the VMM needs an analysis:
> > 
> >   1) Does it require kernel SW changes, ie like BTM? Then it needs a
> >      kernel_capabilities bit to say the kernel SW exists
> >   2) Does it require data from ACPI/DT/etc? Then it needs a
> >      kernel_capabilities bit
> >   3) Does it need to be "turned on" per VM, ie with a VMS enablement?
> >      Then it needs a new request flag in ALLOC_VIOMMU
> >   4) Otherwise it can be read directly from the idr[] array
> > 
> > This is why the comment above is so stern that the VMM "should only
> > forward bits that it knows it can support".
> 
> So... you're saying this patch is in fact broken, or at least uselessly
> incomplete, since VMMs aren't allowed to emulate a vSMMU at all without
> first consulting some other interface which does not exist? Great.

That is too far, there is nothing fundamentally broken here.

This does not enable every SMMU feature in the architecture. It
implements a basic baseline and no more. The VMMs are only permitted
to read bits in that baseline. VMMs that want to go beyond that have
to go through the 4 steps above.

Fundamentally all we are arguing about here is if bits the VMM
shouldn't even read should be forced to zero by the kernel.

If the bits are forced to zero then perhaps they could be used as a
future SW feature negotiation, and maybe that would be better, or
maybe not. See below about BTM for a counter example.

I agree the kdoc does not describe what the baseline actually is.

> We know about that in the kernel, but all a VMM sees is
> iommu_hw_info_arm_smmuv3.idr[0] indicating HTTU=2. If it is "broken" to take
> the only information available at face value, assume HTTU is available, and
> reflect that in a vSMMU interface, then what is the correct thing to do,
> other than to not dare emulate a vSMMU at all, in fear of a sternly worded
> comment?

These patches support a baseline feature set. They do not support the
entire SMMU architecture. They do not support vHTTU.

Things are going step by step because this is a huge project. HTTU is
not in the baseline, so it is definitely wrong for any VMM working on
these patches to advertise support to the VM! The VMM *MUST* ignore
such bits in the IDR report.

The correct thing is the 4 steps I outlined above. When a VMM author
wants to go beyond the baseline they have to determine what kernel and
VMM software is required and implement the missing parts.

> Your tautology still does not offer any reasoning against doing the logical
> thing and following the same basic pattern: the kernel uses the ID register
> mechanism itself to advertise the set of features it's able/willing to
> support, by sanitising the values it offers to the VMM, combining the
> notions of hardware and kernel support where the distinction is irrelevant
> anyway.

Even with your plan the VMM can not pass the kernel's IDR register
into the VM unprotected. It must always sanitize it against the
features that the VMM supports.

Let's consider vBTM support. This requires kernel support to enable
the global BTM register *AND* a new VMM ioctl flow with iommufd to
link to KVM's VMID.

You are suggesting that the old kernel should wire IDR BTM to 0 and
the new kernel will set it to 1. However the VMM cannot just forward
this bit to the VM! A old VMM that does not do the KVM flow does NOT
support BTM.

All VMM's must mask the BTM bit from day 0, or we can never set it to
1 in a future kernel. Which is exactly the same plan as this patch!

> You should take "We discussed this already"
> as more of a clue to yourself than to me - if 4 different people have all
> said the exact same thing in so many words, perhaps there's something in
> it...

And all seemed to agree it was not a big deal after the discussion.

I think Mostafa was driving in a direction that we break up the IDR
into explicit fields and thus be explicit about what information the
VMM is able to access. This would effectively document and enforce
what the baseline is.

> And in case I need to spell it out with less sarcasm, "we'll get masking
> wrong in the kernel" only implies "we'll get kernel_capabilities wrong in
> the kernel (and elsewhere)",

Less sarcasm and hyperbole would be nice.

> > If you still feel strongly about this please let me know by Friday and
> > I will drop the idr[] array from this cycle. We can continue to
> > discuss a solution for the next cycle.
> 
> It already can't work as-is, 

As above, this is not true.

> I don't see how making it even more broken
> would help. IMO it doesn't seem like a good idea to be merging UAPI at all
> while it's still clearly incomplete and by its own definition unusable.

I agree that removing the idr would make it definitely unusable. There
is quite a bit of essential information in there the VMM
needs. However, the uAPI is extensible so adding a new field in the
next cycle is not breaking.

I'm trying to offer you an olive branch so we can have this discussion
with time instead of urgently at the 11th hour and derailing the
series. This exact topic has been discussed already two months ago,
and is, IMHO, mostly a style choice not an urgent technical matter.

If you have another option we can conclude in the next few days then
I'm happy to hear it.

If we focus only on the baseline functionality Nicolin can come up
with a list of fields and we can write it very explicitly in the kdoc
that the VMM can only read those fields.

I could also reluctantly agree to permanent masking to report only the
above kdoc bits in the kernel. Noting if you want this because you
don't trust the VMM authors to follow an explicit kdoc, then we likely
also cannot ever set some bits to 1. Eg BTM would be permanently 0.

Regards,
Jason
Jason Gunthorpe Nov. 7, 2024, 3:51 p.m. UTC | #8
On Wed, Nov 06, 2024 at 10:35:07PM -0400, Jason Gunthorpe wrote:

> I agree the kdoc does not describe what the baseline actually is.

Nicolin worked on this, here is a more detailed kdoc:

/**
 * struct iommu_hw_info_arm_smmuv3 - ARM SMMUv3 hardware information
 *                                   (IOMMU_HW_INFO_TYPE_ARM_SMMUV3)
 *
 * @flags: Must be set to 0
 * @__reserved: Must be 0
 * @idr: Implemented features for ARM SMMU Non-secure programming interface
 * @iidr: Information about the implementation and implementer of ARM SMMU,
 *        and architecture version supported
 * @aidr: ARM SMMU architecture version
 *
 * For the details of @idr, @iidr and @aidr, please refer to the chapters
 * from 6.3.1 to 6.3.6 in the SMMUv3 Spec.
 *
 * This reports the raw HW capability, and not all bits are meaningful to be
 * read by userspace. Only the following fields should be used:
 *
 * idr[0]: ST_LEVEL, TERM_MODEL, STALL_MODEL, TTENDIAN , CD2L, ASID16, TTF
 * idr[1]: SIDSIZE, SSIDSIZE
 * idr[3]: BBML, RIL
 * idr[5]: VAX, GRAN64K, GRAN16K, GRAN4K
 *
 * - S1P should be assumed to be true if a NESTED HWPT can be created
 * - VFIO/iommufd only support platforms with COHACC, it should be assumed to be
 *   true.
 * - ATS is a per-device property. If the VMM describes any devices as ATS
 *   capable in ACPI/DT it should set the corresponding idr.
 *
 * This list may expand in future (eg E0PD, AIE, PBHA, D128, DS etc). It is
 * important that VMMs do not read bits outside the list to allow for
 * compatibility with future kernels. Several features in the SMMUv3
 * architecture are not currently supported by the kernel for nesting: HTTU,
 * BTM, MPAM and others.
 */

This focuses on stuff we can actually test and gives a path to test
and confirm a no-code update to future stuff.

The future list (E0PD/etc) reflects things the current kernel doesn't
use. Our naive read of the spec suggests they are probably fine to
just read the raw HW IDR. When someone implements guest kernel
support, does a detailed spec read, and crucially tests them - then we
can update the comment and have immediate support.

HTTU, BTM, and others like that will need additional bits outside the
IDR if someone wishes to enable them.

Jason
Will Deacon Nov. 8, 2024, 2:53 p.m. UTC | #9
Hi guys,

On Wed, Nov 06, 2024 at 10:35:06PM -0400, Jason Gunthorpe wrote:
> On Wed, Nov 06, 2024 at 09:05:26PM +0000, Robin Murphy wrote:
> > You should take "We discussed this already"
> > as more of a clue to yourself than to me - if 4 different people have all
> > said the exact same thing in so many words, perhaps there's something in
> > it...
> 
> And all seemed to agree it was not a big deal after the discussion.
> 
> I think Mostafa was driving in a direction that we break up the IDR
> into explicit fields and thus be explicit about what information the
> VMM is able to access. This would effectively document and enforce
> what the baseline is.

As one of the four people mentioned above, I figured I'd chime in with
my rationale for queuing this in case it's of any help or interest.

Initially, I was reasonably sure that we should be sanitising the ID
registers and being selective about what we advertise to userspace.
However, after Jason's reply to my comments, mulling it over in my head
and having lively conversations with Mostafa at lunchtime, I've come
full circle. Is it a great interface? Not at all. It's 8 register values
copied from the hardware to userspace. But is it good enough? I think it
is. It's also extremely simple (i.e. easy to explain what it does and
trivial to implement), which I think is a huge benefit given that the
IOMMUFD work around it is still evolving.

I'm firmly of the opinion that the VMM is going to need a tonne of help
from other sources to expose a virtual IOMMU successfully. For example,
anything relating to policy or configuration choices should be driven
from userspace rather than the kernel. If we start to expose policy in
the id registers for the cases where it happens to fit nicely, I fear
that this will backfire and the VMM will end up second-guessing the
kernel in cases where it decides it knows best. I'm not sold on the
analogy with CPU ID registers as (a) we don't have big/little idiocy to
deal with in the SMMU and (b) I don't think SMMU features always need
support code in the host, which is quite unlike many CPU features that
cannot be exposed safely without hypervisor context-switching support.

On top of all that, this interface can be extended if we change our
minds. If we decide we need to mask out fields, I think we could add
that after the fact. Hell, if we all decide that it's a disaster in a
few releases time, we can try something else. Ultimately, Jason is the
one maintaining IOMMUFD and he gets to deal with the problems if his
UAPI doesn't work out :)

So, given where we are in the cycle, I think the pragmatic thing to do
is to land this change now and enable the ongoing IOMMUFD work to
continue. We still have over two months to resolve any major problems
with the interface (and even unplug it entirely from the driver if we
really get stuck) but for now I think it's "fine".

Will
Jason Gunthorpe Nov. 8, 2024, 3:42 p.m. UTC | #10
On Fri, Nov 08, 2024 at 02:53:22PM +0000, Will Deacon wrote:

> So, given where we are in the cycle, I think the pragmatic thing to do
> is to land this change now and enable the ongoing IOMMUFD work to
> continue. We still have over two months to resolve any major problems
> with the interface (and even unplug it entirely from the driver if we
> really get stuck) but for now I think it's "fine".

Thanks Will, this all eloquently captures my line of reasoning also.

I will add that after going through the exercise with Nicolin, to
write down all the fields the VMM is allowed to touch, it seems we do
have a list of future fields that likely do not require any host
support (E0PD, AIE, PBHA, D128, DS). We also have ones that will (BTM)
and then the fwspec weirdo of HTTU.

Given those ratios I feel pretty good about showing that data to the
userspace, expecting that down the road we will "turn it on" for
E0PD/etc without any kernel change, and BTM/HTTU will not use idr for
discovery.

The biggest risk is the VMM's badly muck it up. I now think it was a
lazy shortcut to not enumerate all the fields in the kdoc. Now that
we've done that we see that the (unreviewed) RFC qemu patches did miss
some things and it is being corrected. 

Nicolin and I had some discussion if we should go ahead and include
E0PD/etc in the documentation right now, but we both felt some
uncertainty that we really understood those features deeply enough to
be confident, and have no way to test.

There is also the nanny option of zeroing everything not in the
approved list, but I feel like it is already so hard to write a VMM
vSMMU3 that is too nannyish. Hoepfully time will not show that my
opinion of VMM authors is too high. :\

Jason
diff mbox series

Patch

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index b3aa1f5d53218b..0c9bceb1653d5f 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -415,6 +415,15 @@  config ARM_SMMU_V3_SVA
 	  Say Y here if your system supports SVA extensions such as PCIe PASID
 	  and PRI.
 
+config ARM_SMMU_V3_IOMMUFD
+	bool "Enable IOMMUFD features for ARM SMMUv3 (EXPERIMENTAL)"
+	depends on IOMMUFD
+	help
+	  Support for IOMMUFD features intended to support virtual machines
+	  with accelerated virtual IOMMUs.
+
+	  Say Y here if you are doing development and testing on this feature.
+
 config ARM_SMMU_V3_KUNIT_TEST
 	tristate "KUnit tests for arm-smmu-v3 driver"  if !KUNIT_ALL_TESTS
 	depends on KUNIT
diff --git a/drivers/iommu/arm/arm-smmu-v3/Makefile b/drivers/iommu/arm/arm-smmu-v3/Makefile
index dc98c88b48c827..493a659cc66bb2 100644
--- a/drivers/iommu/arm/arm-smmu-v3/Makefile
+++ b/drivers/iommu/arm/arm-smmu-v3/Makefile
@@ -1,6 +1,7 @@ 
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_ARM_SMMU_V3) += arm_smmu_v3.o
 arm_smmu_v3-y := arm-smmu-v3.o
+arm_smmu_v3-$(CONFIG_ARM_SMMU_V3_IOMMUFD) += arm-smmu-v3-iommufd.o
 arm_smmu_v3-$(CONFIG_ARM_SMMU_V3_SVA) += arm-smmu-v3-sva.o
 arm_smmu_v3-$(CONFIG_TEGRA241_CMDQV) += tegra241-cmdqv.o
 
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
new file mode 100644
index 00000000000000..3d2671031c9bb5
--- /dev/null
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
@@ -0,0 +1,31 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES
+ */
+
+#include <uapi/linux/iommufd.h>
+
+#include "arm-smmu-v3.h"
+
+void *arm_smmu_hw_info(struct device *dev, u32 *length, u32 *type)
+{
+	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
+	struct iommu_hw_info_arm_smmuv3 *info;
+	u32 __iomem *base_idr;
+	unsigned int i;
+
+	info = kzalloc(sizeof(*info), GFP_KERNEL);
+	if (!info)
+		return ERR_PTR(-ENOMEM);
+
+	base_idr = master->smmu->base + ARM_SMMU_IDR0;
+	for (i = 0; i <= 5; i++)
+		info->idr[i] = readl_relaxed(base_idr + i);
+	info->iidr = readl_relaxed(master->smmu->base + ARM_SMMU_IIDR);
+	info->aidr = readl_relaxed(master->smmu->base + ARM_SMMU_AIDR);
+
+	*length = sizeof(*info);
+	*type = IOMMU_HW_INFO_TYPE_ARM_SMMUV3;
+
+	return info;
+}
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 38725810c14eeb..996774d461aea2 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3506,6 +3506,7 @@  static struct iommu_ops arm_smmu_ops = {
 	.identity_domain	= &arm_smmu_identity_domain,
 	.blocked_domain		= &arm_smmu_blocked_domain,
 	.capable		= arm_smmu_capable,
+	.hw_info		= arm_smmu_hw_info,
 	.domain_alloc_paging    = arm_smmu_domain_alloc_paging,
 	.domain_alloc_sva       = arm_smmu_sva_domain_alloc,
 	.domain_alloc_user	= arm_smmu_domain_alloc_user,
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 06e3d88932df12..66261fd5bfb2d2 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -81,6 +81,8 @@  struct arm_smmu_device;
 #define IIDR_REVISION			GENMASK(15, 12)
 #define IIDR_IMPLEMENTER		GENMASK(11, 0)
 
+#define ARM_SMMU_AIDR			0x1C
+
 #define ARM_SMMU_CR0			0x20
 #define CR0_ATSCHK			(1 << 4)
 #define CR0_CMDQEN			(1 << 3)
@@ -956,4 +958,11 @@  tegra241_cmdqv_probe(struct arm_smmu_device *smmu)
 	return ERR_PTR(-ENODEV);
 }
 #endif /* CONFIG_TEGRA241_CMDQV */
+
+#if IS_ENABLED(CONFIG_ARM_SMMU_V3_IOMMUFD)
+void *arm_smmu_hw_info(struct device *dev, u32 *length, u32 *type);
+#else
+#define arm_smmu_hw_info NULL
+#endif /* CONFIG_ARM_SMMU_V3_IOMMUFD */
+
 #endif /* _ARM_SMMU_V3_H */
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index e266dfa6a38d9d..b227ac16333fe1 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -488,15 +488,50 @@  struct iommu_hw_info_vtd {
 	__aligned_u64 ecap_reg;
 };
 
+/**
+ * struct iommu_hw_info_arm_smmuv3 - ARM SMMUv3 hardware information
+ *                                   (IOMMU_HW_INFO_TYPE_ARM_SMMUV3)
+ *
+ * @flags: Must be set to 0
+ * @__reserved: Must be 0
+ * @idr: Implemented features for ARM SMMU Non-secure programming interface
+ * @iidr: Information about the implementation and implementer of ARM SMMU,
+ *        and architecture version supported
+ * @aidr: ARM SMMU architecture version
+ *
+ * For the details of @idr, @iidr and @aidr, please refer to the chapters
+ * from 6.3.1 to 6.3.6 in the SMMUv3 Spec.
+ *
+ * User space should read the underlying ARM SMMUv3 hardware information for
+ * the list of supported features.
+ *
+ * Note that these values reflect the raw HW capability, without any insight if
+ * any required kernel driver support is present. Bits may be set indicating the
+ * HW has functionality that is lacking kernel software support, such as BTM. If
+ * a VMM is using this information to construct emulated copies of these
+ * registers it should only forward bits that it knows it can support.
+ *
+ * In future, presence of required kernel support will be indicated in flags.
+ */
+struct iommu_hw_info_arm_smmuv3 {
+	__u32 flags;
+	__u32 __reserved;
+	__u32 idr[6];
+	__u32 iidr;
+	__u32 aidr;
+};
+
 /**
  * enum iommu_hw_info_type - IOMMU Hardware Info Types
  * @IOMMU_HW_INFO_TYPE_NONE: Used by the drivers that do not report hardware
  *                           info
  * @IOMMU_HW_INFO_TYPE_INTEL_VTD: Intel VT-d iommu info type
+ * @IOMMU_HW_INFO_TYPE_ARM_SMMUV3: ARM SMMUv3 iommu info type
  */
 enum iommu_hw_info_type {
 	IOMMU_HW_INFO_TYPE_NONE = 0,
 	IOMMU_HW_INFO_TYPE_INTEL_VTD = 1,
+	IOMMU_HW_INFO_TYPE_ARM_SMMUV3 = 2,
 };
 
 /**