[1/3] amd-vi: use the same IOMMU page table levels for PV and HVM

Message ID 20231117094749.81091-2-roger.pau@citrix.com (mailing list archive)
State Superseded
Series x86/iommu: improve setup time of hwdom IOMMU

Commit Message

Roger Pau Monné Nov. 17, 2023, 9:47 a.m. UTC
Using different page table levels for HVM and PV guests is not helpful, and is
not in line with the IOMMU implementation used by the other architecture vendor
(VT-d).

Switch to uniformly use DEFAULT_DOMAIN_ADDRESS_WIDTH in order to set the AMD-Vi
page table levels.

Note using the max RAM address for PV was bogus anyway, as there's no guarantee
there can't be device MMIO or reserved regions past the maximum RAM region.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/drivers/passthrough/amd/pci_amd_iommu.c | 20 ++++++++------------
 1 file changed, 8 insertions(+), 12 deletions(-)
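
For readers following the discussion of page table levels, the relationship
between an address width and the number of levels can be sketched as below.
This is a minimal illustration and not the Xen implementation:
sketch_paging_levels(), sketch_levels_for_width() and the SKETCH_* constants
are hypothetical names, and the sketch assumes 4 KiB pages with 512 entries
(9 address bits) per table level.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions for this sketch: 4 KiB pages, 512 entries per table. */
#define SKETCH_PAGE_SHIFT   12
#define SKETCH_TABLE_SHIFT  9
#define SKETCH_TABLE_SIZE   (UINT64_C(1) << SKETCH_TABLE_SHIFT)
#define SKETCH_MAX_LEVELS   6

/*
 * Hypothetical helper: number of page table levels needed to map
 * max_frames page frames, or -1 if more than SKETCH_MAX_LEVELS are needed.
 */
static int sketch_paging_levels(uint64_t max_frames)
{
    int level = 1;

    assert(max_frames != 0);

    while ( max_frames > SKETCH_TABLE_SIZE )
    {
        /* Round up to a whole table, then move one level up. */
        max_frames = (max_frames >> SKETCH_TABLE_SHIFT) +
                     ((max_frames & (SKETCH_TABLE_SIZE - 1)) != 0);
        if ( ++level > SKETCH_MAX_LEVELS )
            return -1;
    }

    return level;
}

/* Hypothetical helper: levels needed to cover an address width in bits. */
static int sketch_levels_for_width(unsigned int width)
{
    assert(width > SKETCH_PAGE_SHIFT && width <= 64);

    return sketch_paging_levels(UINT64_C(1) << (width - SKETCH_PAGE_SHIFT));
}
```

Under these assumptions a 48-bit width needs 4 levels, while 3 levels cover
exactly 39 bits (512GB), so a single frame above the 512GB boundary already
requires a fourth level.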

Comments

Andrew Cooper Nov. 17, 2023, 11:55 a.m. UTC | #1
On 17/11/2023 9:47 am, Roger Pau Monne wrote:
> Using different page table levels for HVM or PV guest is not helpful, and is
> not inline with the IOMMU implementation used by the other architecture vendor
> (VT-d).
>
> Switch to uniformly use DEFAULT_DOMAIN_ADDRESS_WIDTH in order to set the AMD-Vi
> page table levels.
>
> Note using the max RAM address for PV was bogus anyway, as there's no guarantee
> there can't be device MMIO or reserved regions past the maximum RAM region.

Indeed - and the MMIO regions do matter for P2P DMA.

> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Variable-height IOMMU pagetables are not worth the security
vulnerabilities they're made of.  I regret not fighting hard enough to
kill them entirely several years ago...

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>, although...

> ---
>  xen/drivers/passthrough/amd/pci_amd_iommu.c | 20 ++++++++------------
>  1 file changed, 8 insertions(+), 12 deletions(-)
>
> diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> index 6bc73dc21052..f9e749d74da2 100644
> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> @@ -359,21 +359,17 @@ int __read_mostly amd_iommu_min_paging_mode = 1;
>  static int cf_check amd_iommu_domain_init(struct domain *d)
>  {
>      struct domain_iommu *hd = dom_iommu(d);
> +    int pgmode = amd_iommu_get_paging_mode(
> +        1UL << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT));

"paging mode" comes from the spec, but it's a very backwards way of
spelling height.

Can we at least start to improve the comprehensibility by renaming this
variable?

> +
> +    if ( pgmode < 0 )
> +        return pgmode;
>  
>      /*
> -     * Choose the number of levels for the IOMMU page tables.
> -     * - PV needs 3 or 4, depending on whether there is RAM (including hotplug
> -     *   RAM) above the 512G boundary.
> -     * - HVM could in principle use 3 or 4 depending on how much guest
> -     *   physical address space we give it, but this isn't known yet so use 4
> -     *   unilaterally.
> -     * - Unity maps may require an even higher number.
> +     * Choose the number of levels for the IOMMU page tables, taking into
> +     * account unity maps.
>       */
> -    hd->arch.amd.paging_mode = max(amd_iommu_get_paging_mode(
> -            is_hvm_domain(d)
> -            ? 1UL << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT)
> -            : get_upper_mfn_bound() + 1),
> -        amd_iommu_min_paging_mode);
> +    hd->arch.amd.paging_mode = max(pgmode, amd_iommu_min_paging_mode);

I think these min/max variables can be dropped now we're not doing
variable height IOMMU pagetables, which further simplifies this expression.

Dunno if it's something better folded into this patch, or done at some
point in the future.

~Andrew
Jan Beulich Nov. 20, 2023, 9:45 a.m. UTC | #2
On 17.11.2023 12:55, Andrew Cooper wrote:
> On 17/11/2023 9:47 am, Roger Pau Monne wrote:
>> Using different page table levels for HVM or PV guest is not helpful, and is
>> not inline with the IOMMU implementation used by the other architecture vendor
>> (VT-d).
>>
>> Switch to uniformly use DEFAULT_DOMAIN_ADDRESS_WIDTH in order to set the AMD-Vi
>> page table levels.
>>
>> Note using the max RAM address for PV was bogus anyway, as there's no guarantee
>> there can't be device MMIO or reserved regions past the maximum RAM region.
> 
> Indeed - and the MMIO regions do matter for P2P DMA.

So what about any such living above the 48-bit boundary (i.e. not covered
by DEFAULT_DOMAIN_ADDRESS_WIDTH)?

>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> Variable-height IOMMU pagetables are not worth the security
> vulnerabilities they're made of.  I regret not fighting hard enough to
> kill them entirely several years ago...
> 
> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>, although...
> 
>> ---
>>  xen/drivers/passthrough/amd/pci_amd_iommu.c | 20 ++++++++------------
>>  1 file changed, 8 insertions(+), 12 deletions(-)
>>
>> diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
>> index 6bc73dc21052..f9e749d74da2 100644
>> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
>> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
>> @@ -359,21 +359,17 @@ int __read_mostly amd_iommu_min_paging_mode = 1;
>>  static int cf_check amd_iommu_domain_init(struct domain *d)
>>  {
>>      struct domain_iommu *hd = dom_iommu(d);
>> +    int pgmode = amd_iommu_get_paging_mode(
>> +        1UL << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT));
> 
> "paging mode" comes from the spec, but it's a very backwards way of
> spelling height.
> 
> Can we at least start to improve the comprehensibility by renaming this
> variable.
> 
>> +
>> +    if ( pgmode < 0 )
>> +        return pgmode;
>>  
>>      /*
>> -     * Choose the number of levels for the IOMMU page tables.
>> -     * - PV needs 3 or 4, depending on whether there is RAM (including hotplug
>> -     *   RAM) above the 512G boundary.
>> -     * - HVM could in principle use 3 or 4 depending on how much guest
>> -     *   physical address space we give it, but this isn't known yet so use 4
>> -     *   unilaterally.
>> -     * - Unity maps may require an even higher number.
>> +     * Choose the number of levels for the IOMMU page tables, taking into
>> +     * account unity maps.
>>       */
>> -    hd->arch.amd.paging_mode = max(amd_iommu_get_paging_mode(
>> -            is_hvm_domain(d)
>> -            ? 1UL << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT)
>> -            : get_upper_mfn_bound() + 1),
>> -        amd_iommu_min_paging_mode);
>> +    hd->arch.amd.paging_mode = max(pgmode, amd_iommu_min_paging_mode);
> 
> I think these min/max variables can be dropped now we're not doing
> variable height IOMMU pagetables, which further simplifies this expression.

Did you take unity maps into account? At least $subject and the comment
don't look consistent in this regard: either unity maps need considering
specially (and then we don't uniformly use the same depth), or they don't
need mentioning in the comment (anymore).

Jan
Roger Pau Monné Nov. 20, 2023, 10:27 a.m. UTC | #3
On Mon, Nov 20, 2023 at 10:45:29AM +0100, Jan Beulich wrote:
> On 17.11.2023 12:55, Andrew Cooper wrote:
> > On 17/11/2023 9:47 am, Roger Pau Monne wrote:
> >> Using different page table levels for HVM or PV guest is not helpful, and is
> >> not inline with the IOMMU implementation used by the other architecture vendor
> >> (VT-d).
> >>
> >> Switch to uniformly use DEFAULT_DOMAIN_ADDRESS_WIDTH in order to set the AMD-Vi
> >> page table levels.
> >>
> >> Note using the max RAM address for PV was bogus anyway, as there's no guarantee
> >> there can't be device MMIO or reserved regions past the maximum RAM region.
> > 
> > Indeed - and the MMIO regions do matter for P2P DMA.
> 
> So what about any such living above the 48-bit boundary (i.e. not covered
> by DEFAULT_DOMAIN_ADDRESS_WIDTH)?

That would only work for PV guests AFAICT, as HVM guests will already
refuse to create such mappings even before getting into the IOMMU
code: p2m_pt_set_entry() will return an error as the p2m code only
deals with 4 level page tables.

> 
> >> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > 
> > Variable-height IOMMU pagetables are not worth the security
> > vulnerabilities they're made of.  I regret not fighting hard enough to
> > kill them entirely several years ago...
> > 
> > Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>, although...
> > 
> >> ---
> >>  xen/drivers/passthrough/amd/pci_amd_iommu.c | 20 ++++++++------------
> >>  1 file changed, 8 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> >> index 6bc73dc21052..f9e749d74da2 100644
> >> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> >> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> >> @@ -359,21 +359,17 @@ int __read_mostly amd_iommu_min_paging_mode = 1;
> >>  static int cf_check amd_iommu_domain_init(struct domain *d)
> >>  {
> >>      struct domain_iommu *hd = dom_iommu(d);
> >> +    int pgmode = amd_iommu_get_paging_mode(
> >> +        1UL << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT));
> > 
> > "paging mode" comes from the spec, but it's a very backwards way of
> > spelling height.
> > 
> > Can we at least start to improve the comprehensibility by renaming this
> > variable.
> > 
> >> +
> >> +    if ( pgmode < 0 )
> >> +        return pgmode;
> >>  
> >>      /*
> >> -     * Choose the number of levels for the IOMMU page tables.
> >> -     * - PV needs 3 or 4, depending on whether there is RAM (including hotplug
> >> -     *   RAM) above the 512G boundary.
> >> -     * - HVM could in principle use 3 or 4 depending on how much guest
> >> -     *   physical address space we give it, but this isn't known yet so use 4
> >> -     *   unilaterally.
> >> -     * - Unity maps may require an even higher number.
> >> +     * Choose the number of levels for the IOMMU page tables, taking into
> >> +     * account unity maps.
> >>       */
> >> -    hd->arch.amd.paging_mode = max(amd_iommu_get_paging_mode(
> >> -            is_hvm_domain(d)
> >> -            ? 1UL << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT)
> >> -            : get_upper_mfn_bound() + 1),
> >> -        amd_iommu_min_paging_mode);
> >> +    hd->arch.amd.paging_mode = max(pgmode, amd_iommu_min_paging_mode);
> > 
> > I think these min/max variables can be dropped now we're not doing
> > variable height IOMMU pagetables, which further simplifies this expression.
> 
> Did you take unity maps into account? At least $subject and comment looks
> to not be consistent in this regard: Either unity maps need considering
> specially (and then we don't uniformly use the same depth), or they don't
> need mentioning in the comment (anymore).

Unity maps that require an address width > DEFAULT_DOMAIN_ADDRESS_WIDTH
will currently only work on PV at best, as HVM p2m code is limited to
4 level page tables, so even if the IOMMU page tables support a
greater address width the call to map such regions will trigger an
error in the p2m code way before attempting to create any IOMMU
mappings.

We could do:

hd->arch.amd.paging_mode =
    is_hvm_domain(d) ? pgmode : max(pgmode, amd_iommu_min_paging_mode);

Placing IVMD/RMRR regions at addresses that require the usage of 5 level
page tables would be a very short-sighted move by vendors IMO.

The above would also put us back in a situation where PV vs HVM can get
different IOMMU page table levels, which is undesirable.  It might be better
to just assume all domains use DEFAULT_DOMAIN_ADDRESS_WIDTH and hide
devices that have IVMD/RMRR regions above that limit.

Thanks, Roger.
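
The check implied by "hide devices that have IVMD/RMRR regions above that
limit" can be sketched roughly as follows. This is only an illustration under
stated assumptions: struct sketch_unity_range and sketch_range_fits_width()
are hypothetical names, not Xen interfaces, and the sketch says nothing about
how a device would actually be hidden.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of a unity-mapped (IVMD/RMRR-style) range. */
struct sketch_unity_range {
    uint64_t base;   /* first byte of the range */
    uint64_t length; /* size in bytes */
};

/*
 * Return true if the whole range is addressable with 'width' address bits.
 * A device owning a range for which this returns false would be a candidate
 * for hiding under the approach suggested above.
 */
static bool sketch_range_fits_width(const struct sketch_unity_range *r,
                                    unsigned int width)
{
    uint64_t last;

    assert(r != NULL);

    if ( r->length == 0 )
        return false;               /* treat empty ranges as invalid */

    last = r->base + r->length - 1; /* last byte of the range */
    if ( last < r->base )
        return false;               /* range wraps around */

    if ( width >= 64 )
        return true;

    /* Highest addressable byte is 2^width - 1. */
    return last <= (UINT64_C(1) << width) - 1;
}
```

With a 48-bit width, a range ending exactly at the 48-bit boundary fits, while
one starting at that boundary does not.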
Jan Beulich Nov. 20, 2023, 10:37 a.m. UTC | #4
On 20.11.2023 11:27, Roger Pau Monné wrote:
> On Mon, Nov 20, 2023 at 10:45:29AM +0100, Jan Beulich wrote:
>> On 17.11.2023 12:55, Andrew Cooper wrote:
>>> On 17/11/2023 9:47 am, Roger Pau Monne wrote:
>>>>      /*
>>>> -     * Choose the number of levels for the IOMMU page tables.
>>>> -     * - PV needs 3 or 4, depending on whether there is RAM (including hotplug
>>>> -     *   RAM) above the 512G boundary.
>>>> -     * - HVM could in principle use 3 or 4 depending on how much guest
>>>> -     *   physical address space we give it, but this isn't known yet so use 4
>>>> -     *   unilaterally.
>>>> -     * - Unity maps may require an even higher number.
>>>> +     * Choose the number of levels for the IOMMU page tables, taking into
>>>> +     * account unity maps.
>>>>       */
>>>> -    hd->arch.amd.paging_mode = max(amd_iommu_get_paging_mode(
>>>> -            is_hvm_domain(d)
>>>> -            ? 1UL << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT)
>>>> -            : get_upper_mfn_bound() + 1),
>>>> -        amd_iommu_min_paging_mode);
>>>> +    hd->arch.amd.paging_mode = max(pgmode, amd_iommu_min_paging_mode);
>>>
>>> I think these min/max variables can be dropped now we're not doing
>>> variable height IOMMU pagetables, which further simplifies this expression.
>>
>> Did you take unity maps into account? At least $subject and comment looks
>> to not be consistent in this regard: Either unity maps need considering
>> specially (and then we don't uniformly use the same depth), or they don't
>> need mentioning in the comment (anymore).
> 
> Unity maps that require an address width > DEFAULT_DOMAIN_ADDRESS_WIDTH
> will currently only work on PV at best, as HVM p2m code is limited to
> 4 level page tables, so even if the IOMMU page tables support a
> greater address width the call to map such regions will trigger an
> error in the p2m code way before attempting to create any IOMMU
> mappings.
> 
> We could do:
> 
> hd->arch.amd.paging_mode =
>     is_hvm_domain(d) ? pgmode : max(pgmode, amd_iommu_min_paging_mode);
> 
> Putting IVMD/RMRR regions that require the usage of 5 level page
> tables would be a very short sighted move by vendors IMO.
> 
> And will put us back in a situation where PV vs HVM can get different
> IOMMU page table levels, which is undesirable.  It might be better to
> just assume all domains use DEFAULT_DOMAIN_ADDRESS_WIDTH and hide
> devices that have IVMD/RMRR regions above that limit.

That's a possible approach, yes. To be honest, I was actually hoping we'd
move in a different direction: Do away with the entirely arbitrary
DEFAULT_DOMAIN_ADDRESS_WIDTH, and use actual system properties instead.

Whether having PV and HVM have uniform depth is indeed desirable is also
not entirely obvious to me. Having looked over patch 3 now, it also
hasn't become clear to me why the change here is actually a (necessary)
prereq.

Jan
Roger Pau Monné Nov. 20, 2023, 10:50 a.m. UTC | #5
On Mon, Nov 20, 2023 at 11:37:43AM +0100, Jan Beulich wrote:
> On 20.11.2023 11:27, Roger Pau Monné wrote:
> > On Mon, Nov 20, 2023 at 10:45:29AM +0100, Jan Beulich wrote:
> >> On 17.11.2023 12:55, Andrew Cooper wrote:
> >>> On 17/11/2023 9:47 am, Roger Pau Monne wrote:
> >>>>      /*
> >>>> -     * Choose the number of levels for the IOMMU page tables.
> >>>> -     * - PV needs 3 or 4, depending on whether there is RAM (including hotplug
> >>>> -     *   RAM) above the 512G boundary.
> >>>> -     * - HVM could in principle use 3 or 4 depending on how much guest
> >>>> -     *   physical address space we give it, but this isn't known yet so use 4
> >>>> -     *   unilaterally.
> >>>> -     * - Unity maps may require an even higher number.
> >>>> +     * Choose the number of levels for the IOMMU page tables, taking into
> >>>> +     * account unity maps.
> >>>>       */
> >>>> -    hd->arch.amd.paging_mode = max(amd_iommu_get_paging_mode(
> >>>> -            is_hvm_domain(d)
> >>>> -            ? 1UL << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT)
> >>>> -            : get_upper_mfn_bound() + 1),
> >>>> -        amd_iommu_min_paging_mode);
> >>>> +    hd->arch.amd.paging_mode = max(pgmode, amd_iommu_min_paging_mode);
> >>>
> >>> I think these min/max variables can be dropped now we're not doing
> >>> variable height IOMMU pagetables, which further simplifies this expression.
> >>
> >> Did you take unity maps into account? At least $subject and comment looks
> >> to not be consistent in this regard: Either unity maps need considering
> >> specially (and then we don't uniformly use the same depth), or they don't
> >> need mentioning in the comment (anymore).
> > 
> > Unity maps that require an address width > DEFAULT_DOMAIN_ADDRESS_WIDTH
> > will currently only work on PV at best, as HVM p2m code is limited to
> > 4 level page tables, so even if the IOMMU page tables support a
> > greater address width the call to map such regions will trigger an
> > error in the p2m code way before attempting to create any IOMMU
> > mappings.
> > 
> > We could do:
> > 
> > hd->arch.amd.paging_mode =
> >     is_hvm_domain(d) ? pgmode : max(pgmode, amd_iommu_min_paging_mode);
> > 
> > Putting IVMD/RMRR regions that require the usage of 5 level page
> > tables would be a very short sighted move by vendors IMO.
> > 
> > And will put us back in a situation where PV vs HVM can get different
> > IOMMU page table levels, which is undesirable.  It might be better to
> > just assume all domains use DEFAULT_DOMAIN_ADDRESS_WIDTH and hide
> > devices that have IVMD/RMRR regions above that limit.
> 
> That's a possible approach, yes. To be honest, I was actually hoping we'd
> move in a different direction: Do away with the entirely arbitrary
> DEFAULT_DOMAIN_ADDRESS_WIDTH, and use actual system properties instead.

Hm, yes, that might be a sensible approach, but right now I don't want
to block this series on such a (likely big) piece of work.  I think we
should aim for HVM and PV to have the same IOMMU page table levels,
and that's currently limited by the p2m code only supporting 4 levels.

> Whether having PV and HVM have uniform depth is indeed desirable is also
> not entirely obvious to me. Having looked over patch 3 now, it also
> hasn't become clear to me why the change here is actually a (necessary)
> prereq.

Oh, it's a prereq because I've found AMD systems that have reserved
regions > 512GB, but no RAM past that region.  arch_iommu_hwdom_init()
would fail on those systems when patch 3/3 was applied, as then
reserved regions past the last RAM address are also mapped in
arch_iommu_hwdom_init().

Thanks, Roger.
Jan Beulich Nov. 20, 2023, 11:34 a.m. UTC | #6
On 20.11.2023 11:50, Roger Pau Monné wrote:
> On Mon, Nov 20, 2023 at 11:37:43AM +0100, Jan Beulich wrote:
>> On 20.11.2023 11:27, Roger Pau Monné wrote:
>>> On Mon, Nov 20, 2023 at 10:45:29AM +0100, Jan Beulich wrote:
>>>> On 17.11.2023 12:55, Andrew Cooper wrote:
>>>>> On 17/11/2023 9:47 am, Roger Pau Monne wrote:
>>>>>>      /*
>>>>>> -     * Choose the number of levels for the IOMMU page tables.
>>>>>> -     * - PV needs 3 or 4, depending on whether there is RAM (including hotplug
>>>>>> -     *   RAM) above the 512G boundary.
>>>>>> -     * - HVM could in principle use 3 or 4 depending on how much guest
>>>>>> -     *   physical address space we give it, but this isn't known yet so use 4
>>>>>> -     *   unilaterally.
>>>>>> -     * - Unity maps may require an even higher number.
>>>>>> +     * Choose the number of levels for the IOMMU page tables, taking into
>>>>>> +     * account unity maps.
>>>>>>       */
>>>>>> -    hd->arch.amd.paging_mode = max(amd_iommu_get_paging_mode(
>>>>>> -            is_hvm_domain(d)
>>>>>> -            ? 1UL << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT)
>>>>>> -            : get_upper_mfn_bound() + 1),
>>>>>> -        amd_iommu_min_paging_mode);
>>>>>> +    hd->arch.amd.paging_mode = max(pgmode, amd_iommu_min_paging_mode);
>>>>>
>>>>> I think these min/max variables can be dropped now we're not doing
>>>>> variable height IOMMU pagetables, which further simplifies this expression.
>>>>
>>>> Did you take unity maps into account? At least $subject and comment looks
>>>> to not be consistent in this regard: Either unity maps need considering
>>>> specially (and then we don't uniformly use the same depth), or they don't
>>>> need mentioning in the comment (anymore).
>>>
>>> Unity maps that require an address width > DEFAULT_DOMAIN_ADDRESS_WIDTH
>>> will currently only work on PV at best, as HVM p2m code is limited to
>>> 4 level page tables, so even if the IOMMU page tables support a
>>> greater address width the call to map such regions will trigger an
>>> error in the p2m code way before attempting to create any IOMMU
>>> mappings.
>>>
>>> We could do:
>>>
>>> hd->arch.amd.paging_mode =
>>>     is_hvm_domain(d) ? pgmode : max(pgmode, amd_iommu_min_paging_mode);
>>>
>>> Putting IVMD/RMRR regions that require the usage of 5 level page
>>> tables would be a very short sighted move by vendors IMO.
>>>
>>> And will put us back in a situation where PV vs HVM can get different
>>> IOMMU page table levels, which is undesirable.  It might be better to
>>> just assume all domains use DEFAULT_DOMAIN_ADDRESS_WIDTH and hide
>>> devices that have IVMD/RMRR regions above that limit.
>>
>> That's a possible approach, yes. To be honest, I was actually hoping we'd
>> move in a different direction: Do away with the entirely arbitrary
>> DEFAULT_DOMAIN_ADDRESS_WIDTH, and use actual system properties instead.
> 
> Hm, yes, that might be a sensible approach, but right now I don't want
> to block this series on such (likely big) piece of work.  I think we
> should aim for HVM and PV to have the same IOMMU page table levels,
> and that's currently limited by the p2m code only supporting 4 levels.

No, I certainly don't mean to introduce a dependency there. Yet what
you do here goes actively against that possible movement in the other
direction: What "actual system properties" are differs between PV and
HVM (host properties vs guest properties), and hence there would
continue to be a (possible) difference in depth between the two.

>> Whether having PV and HVM have uniform depth is indeed desirable is also
>> not entirely obvious to me. Having looked over patch 3 now, it also
>> hasn't become clear to me why the change here is actually a (necessary)
>> prereq.
> 
> Oh, it's a prereq because I've found AMD systems that have reserved
> regions > 512GB, but no RAM past that region.  arch_iommu_hwdom_init()
> would fail on those systems when patch 3/3 was applied, as then
> reserved regions past the last RAM address are also mapped in
> arch_iommu_hwdom_init().

Hmm, interesting. I can't bring together "would fail" and "are also
mapped" though, unless the latter was meant to say "are attempted to
also be mapped", in which case I could at least see room for failure.
Yet still this would then feel like an issue with the last patch alone,
which the change here is merely avoiding (without this being a strict
prereq). Instead I'd expect us to use 4 levels whenever there are any
kind of regions (reserved or not) above 512G. Without disallowing use
of 3 levels on other (smaller) systems.

Jan
Roger Pau Monné Nov. 20, 2023, 12:01 p.m. UTC | #7
On Mon, Nov 20, 2023 at 12:34:45PM +0100, Jan Beulich wrote:
> On 20.11.2023 11:50, Roger Pau Monné wrote:
> > On Mon, Nov 20, 2023 at 11:37:43AM +0100, Jan Beulich wrote:
> >> On 20.11.2023 11:27, Roger Pau Monné wrote:
> >>> On Mon, Nov 20, 2023 at 10:45:29AM +0100, Jan Beulich wrote:
> >>>> On 17.11.2023 12:55, Andrew Cooper wrote:
> >>>>> On 17/11/2023 9:47 am, Roger Pau Monne wrote:
> >>>>>>      /*
> >>>>>> -     * Choose the number of levels for the IOMMU page tables.
> >>>>>> -     * - PV needs 3 or 4, depending on whether there is RAM (including hotplug
> >>>>>> -     *   RAM) above the 512G boundary.
> >>>>>> -     * - HVM could in principle use 3 or 4 depending on how much guest
> >>>>>> -     *   physical address space we give it, but this isn't known yet so use 4
> >>>>>> -     *   unilaterally.
> >>>>>> -     * - Unity maps may require an even higher number.
> >>>>>> +     * Choose the number of levels for the IOMMU page tables, taking into
> >>>>>> +     * account unity maps.
> >>>>>>       */
> >>>>>> -    hd->arch.amd.paging_mode = max(amd_iommu_get_paging_mode(
> >>>>>> -            is_hvm_domain(d)
> >>>>>> -            ? 1UL << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT)
> >>>>>> -            : get_upper_mfn_bound() + 1),
> >>>>>> -        amd_iommu_min_paging_mode);
> >>>>>> +    hd->arch.amd.paging_mode = max(pgmode, amd_iommu_min_paging_mode);
> >>>>>
> >>>>> I think these min/max variables can be dropped now we're not doing
> >>>>> variable height IOMMU pagetables, which further simplifies this expression.
> >>>>
> >>>> Did you take unity maps into account? At least $subject and comment looks
> >>>> to not be consistent in this regard: Either unity maps need considering
> >>>> specially (and then we don't uniformly use the same depth), or they don't
> >>>> need mentioning in the comment (anymore).
> >>>
> >>> Unity maps that require an address width > DEFAULT_DOMAIN_ADDRESS_WIDTH
> >>> will currently only work on PV at best, as HVM p2m code is limited to
> >>> 4 level page tables, so even if the IOMMU page tables support a
> >>> greater address width the call to map such regions will trigger an
> >>> error in the p2m code way before attempting to create any IOMMU
> >>> mappings.
> >>>
> >>> We could do:
> >>>
> >>> hd->arch.amd.paging_mode =
> >>>     is_hvm_domain(d) ? pgmode : max(pgmode, amd_iommu_min_paging_mode);
> >>>
> >>> Putting IVMD/RMRR regions that require the usage of 5 level page
> >>> tables would be a very short sighted move by vendors IMO.
> >>>
> >>> And will put us back in a situation where PV vs HVM can get different
> >>> IOMMU page table levels, which is undesirable.  It might be better to
> >>> just assume all domains use DEFAULT_DOMAIN_ADDRESS_WIDTH and hide
> >>> devices that have IVMD/RMRR regions above that limit.
> >>
> >> That's a possible approach, yes. To be honest, I was actually hoping we'd
> >> move in a different direction: Do away with the entirely arbitrary
> >> DEFAULT_DOMAIN_ADDRESS_WIDTH, and use actual system properties instead.
> > 
> > Hm, yes, that might be a sensible approach, but right now I don't want
> > to block this series on such (likely big) piece of work.  I think we
> > should aim for HVM and PV to have the same IOMMU page table levels,
> > and that's currently limited by the p2m code only supporting 4 levels.
> 
> No, I certainly don't mean to introduce a dependency there. Yet what
> you do here goes actively against that possible movement in the other
> direction: What "actual system properties" are differs between PV and
> HVM (host properties vs guest properties), and hence there would
> continue to be a (possible) difference in depth between the two.

Might be.  Overall it seems like more complexity for little gain.

The simplest option would be to unconditionally use the maximum page
table levels supported by both the CPU and the IOMMU.

> >> Whether having PV and HVM have uniform depth is indeed desirable is also
> >> not entirely obvious to me. Having looked over patch 3 now, it also
> >> hasn't become clear to me why the change here is actually a (necessary)
> >> prereq.
> > 
> > Oh, it's a prereq because I've found AMD systems that have reserved
> > regions > 512GB, but no RAM past that region.  arch_iommu_hwdom_init()
> > would fail on those systems when patch 3/3 was applied, as then
> > reserved regions past the last RAM address are also mapped in
> > arch_iommu_hwdom_init().
> 
> Hmm, interesting. I can't bring together "would fail" and "are also
> mapped" though, unless the latter was meant to say "are attempted to
> also be mapped", in which case I could at least see room for failure.

Yes, "are attempted to also be mapped", and that attempt fails.  I
would assume that "would fail" was already connected to "also mapped",
but maybe it's not clear enough.

> Yet still this would then feel like an issue with the last patch alone,
> which the change here is merely avoiding (without this being a strict
> prereq). Instead I'd expect us to use 4 levels whenever there are any
> kind of regions (reserved or not) above 512G. Without disallowing use
> of 3 levels on other (smaller) systems.

While reserved regions are the ones that made me aware of this
IOMMU page table difference, what about device MMIO regions?

There's no limitation that prevents MMIO regions from living past the
last RAM address, and possibly above the 512GB mark.

If anything for PV we should limit page table levels based on the
supported paddr bits reported by the CPU, but limiting it based on the
memory map seems plain bogus.

Thanks, Roger.
Jan Beulich Nov. 20, 2023, 4:27 p.m. UTC | #8
On 20.11.2023 13:01, Roger Pau Monné wrote:
> On Mon, Nov 20, 2023 at 12:34:45PM +0100, Jan Beulich wrote:
>> Yet still this would then feel like an issue with the last patch alone,
>> which the change here is merely avoiding (without this being a strict
>> prereq). Instead I'd expect us to use 4 levels whenever there are any
>> kind of regions (reserved or not) above 512G. Without disallowing use
>> of 3 levels on other (smaller) systems.
> 
> While reserved regions are the ones that made me realize about this
> IOMMU page table difference, what about device MMIO regions?
> 
> There's no limitation that avoids MMIO regions from living past the
> last RAM address, and possibly above the 512GB mark.
> 
> If anything for PV we should limit page table levels based on the
> supported paddr bits reported by the CPU, but limiting it based on the
> memory map seems plain bogus.

Right, matches what we were discussing (really it's the paddr_bits reported
to the domain, but I guess we have little reason to alter the host value
especially for Dom0).

Jan
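
The difference between sizing the page tables from the top of RAM and sizing
them from every region in the memory map, as debated above, can be sketched
like this. All names (struct sketch_region, sketch_upper_bound(),
sketch_levels_for_bound()) are hypothetical and the layout used in the usage
note is made up for illustration; the sketch assumes 4 KiB pages and 9 address
bits per level and is not the Xen code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum sketch_region_type { SKETCH_RAM, SKETCH_RESERVED, SKETCH_MMIO };

/* Hypothetical memory map entry. */
struct sketch_region {
    uint64_t start;               /* first byte */
    uint64_t end;                 /* one past the last byte */
    enum sketch_region_type type;
};

/* Highest end address in the map; only RAM counts if ram_only is non-zero. */
static uint64_t sketch_upper_bound(const struct sketch_region *map,
                                   size_t nr, int ram_only)
{
    uint64_t top = 0;

    assert(map != NULL || nr == 0);

    for ( size_t i = 0; i < nr; i++ )
    {
        if ( ram_only && map[i].type != SKETCH_RAM )
            continue;
        if ( map[i].end > top )
            top = map[i].end;
    }

    return top;
}

/* Levels needed to cover all addresses below 'top'. */
static int sketch_levels_for_bound(uint64_t top)
{
    int levels = 1;
    unsigned int bits = 12 + 9;   /* a single level covers 21 address bits */

    if ( top == 0 )
        return levels;

    /* Add a level for as long as the last byte does not fit in 'bits'. */
    while ( bits < 64 && ((top - 1) >> bits) != 0 )
    {
        bits += 9;
        levels++;
    }

    return levels;
}
```

For a made-up map with RAM in the first 64GB and a reserved region just below
1TB, the RAM-only bound gives 3 levels while the bound over all regions gives
4, which is the kind of mismatch described earlier for systems with reserved
regions above 512GB but no RAM there.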
Patch

diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
index 6bc73dc21052..f9e749d74da2 100644
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -359,21 +359,17 @@  int __read_mostly amd_iommu_min_paging_mode = 1;
 static int cf_check amd_iommu_domain_init(struct domain *d)
 {
     struct domain_iommu *hd = dom_iommu(d);
+    int pgmode = amd_iommu_get_paging_mode(
+        1UL << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT));
+
+    if ( pgmode < 0 )
+        return pgmode;
 
     /*
-     * Choose the number of levels for the IOMMU page tables.
-     * - PV needs 3 or 4, depending on whether there is RAM (including hotplug
-     *   RAM) above the 512G boundary.
-     * - HVM could in principle use 3 or 4 depending on how much guest
-     *   physical address space we give it, but this isn't known yet so use 4
-     *   unilaterally.
-     * - Unity maps may require an even higher number.
+     * Choose the number of levels for the IOMMU page tables, taking into
+     * account unity maps.
      */
-    hd->arch.amd.paging_mode = max(amd_iommu_get_paging_mode(
-            is_hvm_domain(d)
-            ? 1UL << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT)
-            : get_upper_mfn_bound() + 1),
-        amd_iommu_min_paging_mode);
+    hd->arch.amd.paging_mode = max(pgmode, amd_iommu_min_paging_mode);
 
     return 0;
 }