[v4,06/21] IOMMU/x86: perform PV Dom0 mappings in batches

Message ID f85a5557-3483-8135-ff47-a15474aaebb4@suse.com (mailing list archive)
State Superseded
Series IOMMU: superpage support when not sharing pagetables

Commit Message

Jan Beulich April 25, 2022, 8:34 a.m. UTC
For large page mappings to be easily usable (i.e. in particular without
un-shattering of smaller page mappings) and for mapping operations to
then also be more efficient, pass batches of Dom0 memory to iommu_map().
In dom0_construct_pv() and its helpers (covering strict mode) this
additionally requires establishing the type of those pages (albeit with
zero type references).
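
In arch_iommu_hwdom_init() the batching boils down to accumulating runs of contiguous PFNs with identical permissions and issuing one iommu_map() per run; condensed from the hunk at the bottom of this patch (error reporting, the translated-mode path, and softirq processing omitted):

    for ( i = start, count = 0; i < top; )
    {
        unsigned long pfn = pdx_to_pfn(i);
        unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);

        if ( !perms )
            /* Nothing to map; a gap makes the next PFN fail the contiguity check. */;
        else if ( pfn != start + count || perms != start_perms )
        {
        commit:
            /* Map the accumulated run, then start a new one at pfn. */
            iommu_map(d, _dfn(start), _mfn(start), count, start_perms,
                      &flush_flags);
            start = pfn;
            start_perms = perms;
            count = 1;
        }
        else
            ++count;

        if ( ++i == top && count )
            goto commit;
    }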

The earlier establishing of PGT_writable_page | PGT_validated requires
the existing places where this gets done (through get_page_and_type())
to be updated: For pages which actually have a mapping, the type
refcount needs to be 1.
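
Concretely, the affected call sites in dom0_construct_pv() (see the hunks below) now also accept pages which already carry the pre-installed zero-ref writable type:

        if ( (!page->u.inuse.type_info ||
              page->u.inuse.type_info == (PGT_writable_page | PGT_validated)) &&
             !get_page_and_type(page, d, PGT_writable_page) )
            BUG();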

There is actually a related bug that gets fixed here as a side effect:
Typically the last L1 table would get marked as such only after
get_page_and_type(..., PGT_writable_page). While this is fine as far as
refcounting goes, the page did remain mapped in the IOMMU in this case
(when "iommu=dom0-strict").

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Subsequently p2m_add_identity_entry() may want to also gain an order
parameter, for arch_iommu_hwdom_init() to use. While this only affects
non-RAM regions, systems typically have 2-16Mb of reserved space
immediately below 4Gb, which hence could be mapped more efficiently.

The installing of zero-ref writable types has in fact shown (observed
while putting together the change) that despite the intention by the
XSA-288 changes (affecting DomU-s only) for Dom0 a number of
sufficiently ordinary pages (at the very least initrd and P2M ones as
well as pages that are part of the initial allocation but not part of
the initial mapping) still have been starting out as PGT_none, meaning
that they would have gained IOMMU mappings only the first time these
pages would get mapped writably. Consequently an open question is
whether iommu_memory_setup() should set the pages to PGT_writable_page
independent of need_iommu_pt_sync().

I didn't think I needed to address the bug mentioned in the description in
a separate (prereq) patch, but if others disagree I could certainly
break out that part (needing to first use iommu_legacy_unmap() then).

Note that 4k P2M pages don't get (pre-)mapped in setup_pv_physmap():
They'll end up mapped via the later get_page_and_type().

As to the way these refs get installed: I've chosen to avoid the more
expensive {get,put}_page_and_type(), favoring putting the intended
type in place directly. I guess I could be convinced to avoid this
bypassing of the actual logic; I merely think it's unnecessarily
expensive.

Note also that strictly speaking the iommu_iotlb_flush_all() here (as
well as the pre-existing one in arch_iommu_hwdom_init()) shouldn't be
needed: Actual hooking up (AMD) or enabling of translation (VT-d)
occurs only afterwards anyway, so nothing can have made it into TLBs
just yet.
---
v3: Fold iommu_map() into (the now renamed) iommu_memory_setup(). Move
    iommu_unmap() into mark_pv_pt_pages_rdonly(). Adjust (split) log
    message in arch_iommu_hwdom_init().

Comments

Roger Pau Monné May 3, 2022, 2:49 p.m. UTC | #1
On Mon, Apr 25, 2022 at 10:34:59AM +0200, Jan Beulich wrote:
> For large page mappings to be easily usable (i.e. in particular without
> un-shattering of smaller page mappings) and for mapping operations to
> then also be more efficient, pass batches of Dom0 memory to iommu_map().
> In dom0_construct_pv() and its helpers (covering strict mode) this
> additionally requires establishing the type of those pages (albeit with
> zero type references).

I think it's possible I've already asked this.  Would it make sense to
add the IOMMU mappings in alloc_domheap_pages(), maybe by passing a
specific flag?

It would seem to me that doing it that way would also allow the
mappings to get established in blocks for domUs.

And it would be less error prone than having to match memory allocations
with iommu_memory_setup() calls in order for the pages to be added to the
IOMMU page tables.

> The earlier establishing of PGT_writable_page | PGT_validated requires
> the existing places where this gets done (through get_page_and_type())
> to be updated: For pages which actually have a mapping, the type
> refcount needs to be 1.
> 
> There is actually a related bug that gets fixed here as a side effect:
> Typically the last L1 table would get marked as such only after
> get_page_and_type(..., PGT_writable_page). While this is fine as far as
> refcounting goes, the page did remain mapped in the IOMMU in this case
> (when "iommu=dom0-strict").
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> Subsequently p2m_add_identity_entry() may want to also gain an order
> parameter, for arch_iommu_hwdom_init() to use. While this only affects
> non-RAM regions, systems typically have 2-16Mb of reserved space
> immediately below 4Gb, which hence could be mapped more efficiently.

Indeed.

> The installing of zero-ref writable types has in fact shown (observed
> while putting together the change) that despite the intention by the
> XSA-288 changes (affecting DomU-s only) for Dom0 a number of
> sufficiently ordinary pages (at the very least initrd and P2M ones as
> well as pages that are part of the initial allocation but not part of
> the initial mapping) still have been starting out as PGT_none, meaning
> that they would have gained IOMMU mappings only the first time these
> pages would get mapped writably. Consequently an open question is
> whether iommu_memory_setup() should set the pages to PGT_writable_page
> independent of need_iommu_pt_sync().

I think I'm confused, doesn't the setting of PGT_writable_page happen
as a result of need_iommu_pt_sync() and having those pages added to
the IOMMU page tables? (so they can be properly tracked and IOMMU
mappings are removed if the page is also removed)

If the pages are not added here (because dom0 is not running in strict
mode) then setting PGT_writable_page is not required?

> I didn't think I needed to address the bug mentioned in the description in
> a separate (prereq) patch, but if others disagree I could certainly
> break out that part (needing to first use iommu_legacy_unmap() then).
> 
> Note that 4k P2M pages don't get (pre-)mapped in setup_pv_physmap():
> They'll end up mapped via the later get_page_and_type().
> 
> As to the way these refs get installed: I've chosen to avoid the more
> expensive {get,put}_page_and_type(), favoring putting the intended
> type in place directly. I guess I could be convinced to avoid this
> bypassing of the actual logic; I merely think it's unnecessarily
> expensive.

In a different piece of code I would have asked to avoid open-coding
the type changes.  But there are already open-coded type changes in
dom0_construct_pv(), so adding those doesn't make the current status
worse.

> Note also that strictly speaking the iommu_iotlb_flush_all() here (as
> well as the pre-existing one in arch_iommu_hwdom_init()) shouldn't be
> needed: Actual hooking up (AMD) or enabling of translation (VT-d)
> occurs only afterwards anyway, so nothing can have made it into TLBs
> just yet.

Hm, indeed. I think the one in arch_iommu_hwdom_init can surely go
away, as we must strictly do the hwdom init before enabling the iommu
itself.

The one in dom0 build I'm less convinced, just to be on the safe side
if we ever change the order of IOMMU init and memory setup.  I would
expect flushing an empty TLB to not be very expensive?

> --- a/xen/drivers/passthrough/x86/iommu.c
> +++ b/xen/drivers/passthrough/x86/iommu.c
> @@ -347,8 +347,8 @@ static unsigned int __hwdom_init hwdom_i
>  
>  void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
>  {
> -    unsigned long i, top, max_pfn;
> -    unsigned int flush_flags = 0;
> +    unsigned long i, top, max_pfn, start, count;
> +    unsigned int flush_flags = 0, start_perms = 0;
>  
>      BUG_ON(!is_hardware_domain(d));
>  
> @@ -379,9 +379,9 @@ void __hwdom_init arch_iommu_hwdom_init(
>       * First Mb will get mapped in one go by pvh_populate_p2m(). Avoid
>       * setting up potentially conflicting mappings here.
>       */
> -    i = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
> +    start = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
>  
> -    for ( ; i < top; i++ )
> +    for ( i = start, count = 0; i < top; )
>      {
>          unsigned long pfn = pdx_to_pfn(i);
>          unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
> @@ -390,20 +390,41 @@ void __hwdom_init arch_iommu_hwdom_init(
>          if ( !perms )
>              rc = 0;
>          else if ( paging_mode_translate(d) )
> +        {
>              rc = p2m_add_identity_entry(d, pfn,
>                                          perms & IOMMUF_writable ? p2m_access_rw
>                                                                  : p2m_access_r,
>                                          0);
> +            if ( rc )
> +                printk(XENLOG_WARNING
> +                       "%pd: identity mapping of %lx failed: %d\n",
> +                       d, pfn, rc);
> +        }
> +        else if ( pfn != start + count || perms != start_perms )
> +        {
> +        commit:
> +            rc = iommu_map(d, _dfn(start), _mfn(start), count, start_perms,
> +                           &flush_flags);
> +            if ( rc )
> +                printk(XENLOG_WARNING
> +                       "%pd: IOMMU identity mapping of [%lx,%lx) failed: %d\n",
> +                       d, pfn, pfn + count, rc);
> +            SWAP(start, pfn);
> +            start_perms = perms;
> +            count = 1;
> +        }
>          else
> -            rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
> -                           perms, &flush_flags);
> +        {
> +            ++count;
> +            rc = 0;

Seeing as we want to process this in blocks now, I wonder whether it
would make sense to take a different approach, and use a rangeset to
track which regions need to be mapped.  What gets added would be based
on the host e820 plus the options
iommu_hwdom_{strict,inclusive,reserved}.  We would then punch holes
based on the logic in hwdom_iommu_map() and finally we could iterate
over the regions afterwards using rangeset_consume_ranges().

Not that you strictly need to do it here, just think the end result
would be clearer.
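
Roughly what I have in mind - just a sketch, so the function names, the exact
rangeset_consume_ranges() callback signature, and the handling of per-range
permissions and flushing are all illustrative rather than a drop-in
implementation:

    static int __hwdom_init identity_map_range(unsigned long s, unsigned long e,
                                               void *data, unsigned long *c)
    {
        struct domain *d = data;
        unsigned int flush_flags = 0;

        /* Rangeset ranges are inclusive, hence e - s + 1 pages. */
        return iommu_map(d, _dfn(s), _mfn(s), e - s + 1,
                         IOMMUF_readable | IOMMUF_writable, &flush_flags);
    }

    static void __hwdom_init hwdom_iommu_map_rangeset(struct domain *d)
    {
        struct rangeset *map = rangeset_new(d, "hwdom IOMMU", 0);

        /* 1. Populate from the host E820 plus iommu_hwdom_{strict,inclusive,reserved}. */
        /* 2. Punch holes per the logic currently in hwdom_iommu_map(). */
        /* 3. Map whatever is left, one contiguous range at a time. */
        rangeset_consume_ranges(map, identity_map_range, d);

        rangeset_destroy(map);
    }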

Thanks, Roger.
Jan Beulich May 4, 2022, 9:46 a.m. UTC | #2
On 03.05.2022 16:49, Roger Pau Monné wrote:
> On Mon, Apr 25, 2022 at 10:34:59AM +0200, Jan Beulich wrote:
>> For large page mappings to be easily usable (i.e. in particular without
>> un-shattering of smaller page mappings) and for mapping operations to
>> then also be more efficient, pass batches of Dom0 memory to iommu_map().
>> In dom0_construct_pv() and its helpers (covering strict mode) this
>> additionally requires establishing the type of those pages (albeit with
>> zero type references).
> 
> I think it's possible I've already asked this.  Would it make sense to
> add the IOMMU mappings in alloc_domheap_pages(), maybe by passing a
> specific flag?

I don't think you did ask, but now that you do: This would look like a
layering violation to me. I don't think allocation should ever have
mapping (into the IOMMU or elsewhere) as a "side effect", no matter
that ...

> It would seem to me that doing it that way would also allow the
> mappings to get established in blocks for domUs.

... then this would perhaps be possible.

>> The installing of zero-ref writable types has in fact shown (observed
>> while putting together the change) that despite the intention by the
>> XSA-288 changes (affecting DomU-s only) for Dom0 a number of
>> sufficiently ordinary pages (at the very least initrd and P2M ones as
>> well as pages that are part of the initial allocation but not part of
>> the initial mapping) still have been starting out as PGT_none, meaning
>> that they would have gained IOMMU mappings only the first time these
>> pages would get mapped writably. Consequently an open question is
>> whether iommu_memory_setup() should set the pages to PGT_writable_page
>> independent of need_iommu_pt_sync().
> 
> I think I'm confused, doesn't the setting of PGT_writable_page happen
> as a result of need_iommu_pt_sync() and having those pages added to
> the IOMMU page tables? (so they can be properly tracked and IOMMU
> mappings are removed if the page is also removed)

In principle yes - in guest_physmap_add_page(). But this function isn't
called for the pages I did enumerate in the remark. XSA-288 really only
cared about getting this right for DomU-s.

> If the pages are not added here (because dom0 is not running in strict
> mode) then setting PGT_writable_page is not required?

Correct - in that case we skip fiddling with IOMMU mappings on
transitions to/from PGT_writable_page, and hence putting this type in
place would be benign (but improve consistency).

>> Note also that strictly speaking the iommu_iotlb_flush_all() here (as
>> well as the pre-existing one in arch_iommu_hwdom_init()) shouldn't be
>> needed: Actual hooking up (AMD) or enabling of translation (VT-d)
>> occurs only afterwards anyway, so nothing can have made it into TLBs
>> just yet.
> 
> Hm, indeed. I think the one in arch_iommu_hwdom_init can surely go
> away, as we must strictly do the hwdom init before enabling the iommu
> itself.

Why would that be? That's imo as much of an implementation detail as
...

> The one in dom0 build I'm less convinced, just to be on the safe side
> if we ever change the order of IOMMU init and memory setup.

... this. Just like we populate tables with the IOMMU already enabled
for DomU-s, I think the same would be valid to do for Dom0.

> I would expect flushing an empty TLB to not be very expensive?

I wouldn't "expect" this - it might be this way, but it surely depends
on whether an implementation can easily tell whether the TLB is empty,
and whether its emptiness actually makes a difference for the latency
of a flush operation.

>> --- a/xen/drivers/passthrough/x86/iommu.c
>> +++ b/xen/drivers/passthrough/x86/iommu.c
>> @@ -347,8 +347,8 @@ static unsigned int __hwdom_init hwdom_i
>>  
>>  void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
>>  {
>> -    unsigned long i, top, max_pfn;
>> -    unsigned int flush_flags = 0;
>> +    unsigned long i, top, max_pfn, start, count;
>> +    unsigned int flush_flags = 0, start_perms = 0;
>>  
>>      BUG_ON(!is_hardware_domain(d));
>>  
>> @@ -379,9 +379,9 @@ void __hwdom_init arch_iommu_hwdom_init(
>>       * First Mb will get mapped in one go by pvh_populate_p2m(). Avoid
>>       * setting up potentially conflicting mappings here.
>>       */
>> -    i = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
>> +    start = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
>>  
>> -    for ( ; i < top; i++ )
>> +    for ( i = start, count = 0; i < top; )
>>      {
>>          unsigned long pfn = pdx_to_pfn(i);
>>          unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
>> @@ -390,20 +390,41 @@ void __hwdom_init arch_iommu_hwdom_init(
>>          if ( !perms )
>>              rc = 0;
>>          else if ( paging_mode_translate(d) )
>> +        {
>>              rc = p2m_add_identity_entry(d, pfn,
>>                                          perms & IOMMUF_writable ? p2m_access_rw
>>                                                                  : p2m_access_r,
>>                                          0);
>> +            if ( rc )
>> +                printk(XENLOG_WARNING
>> +                       "%pd: identity mapping of %lx failed: %d\n",
>> +                       d, pfn, rc);
>> +        }
>> +        else if ( pfn != start + count || perms != start_perms )
>> +        {
>> +        commit:
>> +            rc = iommu_map(d, _dfn(start), _mfn(start), count, start_perms,
>> +                           &flush_flags);
>> +            if ( rc )
>> +                printk(XENLOG_WARNING
>> +                       "%pd: IOMMU identity mapping of [%lx,%lx) failed: %d\n",
>> +                       d, pfn, pfn + count, rc);
>> +            SWAP(start, pfn);
>> +            start_perms = perms;
>> +            count = 1;
>> +        }
>>          else
>> -            rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
>> -                           perms, &flush_flags);
>> +        {
>> +            ++count;
>> +            rc = 0;
> 
> Seeing as we want to process this in blocks now, I wonder whether it
> would make sense to take a different approach, and use a rangeset to
> track which regions need to be mapped.  What gets added would be based
> on the host e820 plus the options
> iommu_hwdom_{strict,inclusive,reserved}.  We would then punch holes
> based on the logic in hwdom_iommu_map() and finally we could iterate
> over the regions afterwards using rangeset_consume_ranges().
> 
> Not that you strictly need to do it here, just think the end result
> would be clearer.

The end result might indeed be, but it would be more of a change right
here. Hence I'd prefer to leave that out of the series for now.

Jan
Roger Pau Monné May 4, 2022, 11:20 a.m. UTC | #3
On Wed, May 04, 2022 at 11:46:37AM +0200, Jan Beulich wrote:
> On 03.05.2022 16:49, Roger Pau Monné wrote:
> > On Mon, Apr 25, 2022 at 10:34:59AM +0200, Jan Beulich wrote:
> >> For large page mappings to be easily usable (i.e. in particular without
> >> un-shattering of smaller page mappings) and for mapping operations to
> >> then also be more efficient, pass batches of Dom0 memory to iommu_map().
> >> In dom0_construct_pv() and its helpers (covering strict mode) this
> >> additionally requires establishing the type of those pages (albeit with
> >> zero type references).
> > 
> > I think it's possible I've already asked this.  Would it make sense to
> > add the IOMMU mappings in alloc_domheap_pages(), maybe by passing a
> > specific flag?
> 
> I don't think you did ask, but now that you do: This would look like a
> layering violation to me. I don't think allocation should ever have
> mapping (into the IOMMU or elsewhere) as a "side effect", no matter
> that ...

Hm, I'm certainly not familiar enough with PV itself to be able to
make a proper argument here.  I fully agree with you for translated
guests using a p2m.

For PV we currently establish/teardown IOMMU mappings in
_get_page_type(), which already looks like a layering violation to me,
hence also doing so in alloc_domheap_pages() wouldn't seem that bad if
it allows simplifying the resulting code overall.

> > It would seem to me that doing it that way would also allow the
> > mappings to get established in blocks for domUs.
> 
> ... then this would perhaps be possible.
> 
> >> The installing of zero-ref writable types has in fact shown (observed
> >> while putting together the change) that despite the intention by the
> >> XSA-288 changes (affecting DomU-s only) for Dom0 a number of
> >> sufficiently ordinary pages (at the very least initrd and P2M ones as
> >> well as pages that are part of the initial allocation but not part of
> >> the initial mapping) still have been starting out as PGT_none, meaning
> >> that they would have gained IOMMU mappings only the first time these
> >> pages would get mapped writably. Consequently an open question is
> >> whether iommu_memory_setup() should set the pages to PGT_writable_page
> >> independent of need_iommu_pt_sync().
> > 
> > I think I'm confused, doesn't the setting of PGT_writable_page happen
> > as a result of need_iommu_pt_sync() and having those pages added to
> > the IOMMU page tables? (so they can be properly tracked and IOMMU
> > mappings are removed if the page is also removed)
> 
> In principle yes - in guest_physmap_add_page(). But this function isn't
> called for the pages I did enumerate in the remark. XSA-288 really only
> cared about getting this right for DomU-s.

Would it make sense to change guest_physmap_add_page() to be able to
pass the page_order parameter down to iommu_map(), and then use it for
dom0 build instead of introducing iommu_memory_setup()?

I think guest_physmap_add_page() will need to be adjusted at some
point for domUs, and hence it could be unified with dom0 usage
also?

> > If the pages are not added here (because dom0 is not running in strict
> > mode) then setting PGT_writable_page is not required?
> 
> Correct - in that case we skip fiddling with IOMMU mappings on
> transitions to/from PGT_writable_page, and hence putting this type in
> place would be benign (but improve consistency).
> 
> >> Note also that strictly speaking the iommu_iotlb_flush_all() here (as
> >> well as the pre-existing one in arch_iommu_hwdom_init()) shouldn't be
> >> needed: Actual hooking up (AMD) or enabling of translation (VT-d)
> >> occurs only afterwards anyway, so nothing can have made it into TLBs
> >> just yet.
> > 
> > Hm, indeed. I think the one in arch_iommu_hwdom_init can surely go
> > away, as we must strictly do the hwdom init before enabling the iommu
> > itself.
> 
> Why would that be? That's imo as much of an implementation detail as
> ...

Well, you want to have the reserved/inclusive options applied (and
mappings created) before enabling the IOMMU, because at that point
devices have already been assigned.  So it depends more on a
combination of devices assigned & IOMMU enabled rather than just IOMMU
being enabled.

> > The one in dom0 build I'm less convinced, just to be on the safe side
> > if we ever change the order of IOMMU init and memory setup.
> 
> ... this. Just like we populate tables with the IOMMU already enabled
> for DomU-s, I think the same would be valid to do for Dom0.
> 
> > I would expect flushing an empty TLB to not be very expensive?
> 
> I wouldn't "expect" this - it might be this way, but it surely depends
> on whether an implementation can easily tell whether the TLB is empty,
> and whether its emptiness actually makes a difference for the latency
> of a flush operation.
> 
> >> --- a/xen/drivers/passthrough/x86/iommu.c
> >> +++ b/xen/drivers/passthrough/x86/iommu.c
> >> @@ -347,8 +347,8 @@ static unsigned int __hwdom_init hwdom_i
> >>  
> >>  void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
> >>  {
> >> -    unsigned long i, top, max_pfn;
> >> -    unsigned int flush_flags = 0;
> >> +    unsigned long i, top, max_pfn, start, count;
> >> +    unsigned int flush_flags = 0, start_perms = 0;
> >>  
> >>      BUG_ON(!is_hardware_domain(d));
> >>  
> >> @@ -379,9 +379,9 @@ void __hwdom_init arch_iommu_hwdom_init(
> >>       * First Mb will get mapped in one go by pvh_populate_p2m(). Avoid
> >>       * setting up potentially conflicting mappings here.
> >>       */
> >> -    i = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
> >> +    start = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
> >>  
> >> -    for ( ; i < top; i++ )
> >> +    for ( i = start, count = 0; i < top; )
> >>      {
> >>          unsigned long pfn = pdx_to_pfn(i);
> >>          unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
> >> @@ -390,20 +390,41 @@ void __hwdom_init arch_iommu_hwdom_init(
> >>          if ( !perms )
> >>              rc = 0;
> >>          else if ( paging_mode_translate(d) )
> >> +        {
> >>              rc = p2m_add_identity_entry(d, pfn,
> >>                                          perms & IOMMUF_writable ? p2m_access_rw
> >>                                                                  : p2m_access_r,
> >>                                          0);
> >> +            if ( rc )
> >> +                printk(XENLOG_WARNING
> >> +                       "%pd: identity mapping of %lx failed: %d\n",
> >> +                       d, pfn, rc);
> >> +        }
> >> +        else if ( pfn != start + count || perms != start_perms )
> >> +        {
> >> +        commit:
> >> +            rc = iommu_map(d, _dfn(start), _mfn(start), count, start_perms,
> >> +                           &flush_flags);
> >> +            if ( rc )
> >> +                printk(XENLOG_WARNING
> >> +                       "%pd: IOMMU identity mapping of [%lx,%lx) failed: %d\n",
> >> +                       d, pfn, pfn + count, rc);
> >> +            SWAP(start, pfn);
> >> +            start_perms = perms;
> >> +            count = 1;
> >> +        }
> >>          else
> >> -            rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
> >> -                           perms, &flush_flags);
> >> +        {
> >> +            ++count;
> >> +            rc = 0;
> > 
> > Seeing as we want to process this in blocks now, I wonder whether it
> > would make sense to take a different approach, and use a rangeset to
> > track which regions need to be mapped.  What gets added would be based
> > on the host e820 plus the options
> > iommu_hwdom_{strict,inclusive,reserved}.  We would then punch holes
> > based on the logic in hwdom_iommu_map() and finally we could iterate
> > over the regions afterwards using rangeset_consume_ranges().
> > 
> > Not that you strictly need to do it here, just think the end result
> > would be clearer.
> 
> The end result might indeed be, but it would be more of a change right
> here. Hence I'd prefer to leave that out of the series for now.

OK.  I think it might be nice to add a comment in that regard, mostly
because I tend to forget myself.

Thanks, Roger.
Jan Beulich May 4, 2022, 12:27 p.m. UTC | #4
On 04.05.2022 13:20, Roger Pau Monné wrote:
> On Wed, May 04, 2022 at 11:46:37AM +0200, Jan Beulich wrote:
>> On 03.05.2022 16:49, Roger Pau Monné wrote:
>>> On Mon, Apr 25, 2022 at 10:34:59AM +0200, Jan Beulich wrote:
>>>> For large page mappings to be easily usable (i.e. in particular without
>>>> un-shattering of smaller page mappings) and for mapping operations to
>>>> then also be more efficient, pass batches of Dom0 memory to iommu_map().
>>>> In dom0_construct_pv() and its helpers (covering strict mode) this
>>>> additionally requires establishing the type of those pages (albeit with
>>>> zero type references).
>>>
>>> I think it's possible I've already asked this.  Would it make sense to
>>> add the IOMMU mappings in alloc_domheap_pages(), maybe by passing a
>>> specific flag?
>>
>> I don't think you did ask, but now that you do: This would look like a
>> layering violation to me. I don't think allocation should ever have
>> mapping (into the IOMMU or elsewhere) as a "side effect", no matter
>> that ...
> 
> Hm, I'm certainly not familiar enough with PV itself to be able to
> make a proper argument here.  I fully agree with you for translated
> guests using a p2m.
> 
> For PV we currently establish/teardown IOMMU mappings in
> _get_page_type(), which already looks like a layering violation to me,
> hence also doing so in alloc_domheap_pages() wouldn't seem that bad if
> it allows simplifying the resulting code overall.

That's a layering violation too, I agree, but at least it's one central
place.

>>> It would seem to me that doing it that way would also allow the
>>> mappings to get established in blocks for domUs.
>>
>> ... then this would perhaps be possible.
>>
>>>> The installing of zero-ref writable types has in fact shown (observed
>>>> while putting together the change) that despite the intention by the
>>>> XSA-288 changes (affecting DomU-s only) for Dom0 a number of
>>>> sufficiently ordinary pages (at the very least initrd and P2M ones as
>>>> well as pages that are part of the initial allocation but not part of
>>>> the initial mapping) still have been starting out as PGT_none, meaning
>>>> that they would have gained IOMMU mappings only the first time these
>>>> pages would get mapped writably. Consequently an open question is
>>>> whether iommu_memory_setup() should set the pages to PGT_writable_page
>>>> independent of need_iommu_pt_sync().
>>>
>>> I think I'm confused, doesn't the setting of PGT_writable_page happen
>>> as a result of need_iommu_pt_sync() and having those pages added to
>>> the IOMMU page tables? (so they can be properly tracked and IOMMU
>>> mappings are removed if the page is also removed)
>>
>> In principle yes - in guest_physmap_add_page(). But this function isn't
>> called for the pages I did enumerate in the remark. XSA-288 really only
>> cared about getting this right for DomU-s.
> 
> Would it make sense to change guest_physmap_add_page() to be able to
> pass the page_order parameter down to iommu_map(), and then use it for
> dom0 build instead of introducing iommu_memory_setup()?

To be quite frank: This is something that I might have been willing to
do months ago, when this series was still fresh. If I was to start
re-doing all of this code now, it would take far more time than it
would have taken back then. Hence I'd like to avoid a full re-work here
unless entirely unacceptable in the way currently done (which largely
fits with how we have been doing Dom0 setup).

Furthermore, guest_physmap_add_page() doesn't itself call iommu_map().
What you're suggesting would require get_page_and_type() to be able to
work on higher-order pages. I view adjustments like this as well out
of scope for this series.

> I think guest_physmap_add_page() will need to be adjusted at some
> point for domUs, and hence it could be unified with dom0 usage
> also?

As an optimization - perhaps. I view it as more important to have HVM
guests work reasonably well (which includes the performance aspect of
setting them up).

>>> If the pages are not added here (because dom0 is not running in strict
>>> mode) then setting PGT_writable_page is not required?
>>
>> Correct - in that case we skip fiddling with IOMMU mappings on
>> transitions to/from PGT_writable_page, and hence putting this type in
>> place would be benign (but improve consistency).
>>
>>>> Note also that strictly speaking the iommu_iotlb_flush_all() here (as
>>>> well as the pre-existing one in arch_iommu_hwdom_init()) shouldn't be
>>>> needed: Actual hooking up (AMD) or enabling of translation (VT-d)
>>>> occurs only afterwards anyway, so nothing can have made it into TLBs
>>>> just yet.
>>>
>>> Hm, indeed. I think the one in arch_iommu_hwdom_init can surely go
>>> away, as we must strictly do the hwdom init before enabling the iommu
>>> itself.
>>
>> Why would that be? That's imo as much of an implementation detail as
>> ...
> 
> Well, you want to have the reserved/inclusive options applied (and
> mappings created) before enabling the IOMMU, because at that point
> devices have already been assigned.  So it depends more on a
> combination of devices assigned & IOMMU enabled rather than just IOMMU
> being enabled.
> 
>>> The one in dom0 build I'm less convinced, just to be on the safe side
>>> if we ever change the order of IOMMU init and memory setup.
>>
>> ... this. Just like we populate tables with the IOMMU already enabled
>> for DomU-s, I think the same would be valid to do for Dom0.
>>
>>> I would expect flushing an empty TLB to not be very expensive?
>>
>> I wouldn't "expect" this - it might be this way, but it surely depends
>> on whether an implementation can easily tell whether the TLB is empty,
>> and whether its emptiness actually makes a difference for the latency
>> of a flush operation.
>>
>>>> --- a/xen/drivers/passthrough/x86/iommu.c
>>>> +++ b/xen/drivers/passthrough/x86/iommu.c
>>>> @@ -347,8 +347,8 @@ static unsigned int __hwdom_init hwdom_i
>>>>  
>>>>  void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
>>>>  {
>>>> -    unsigned long i, top, max_pfn;
>>>> -    unsigned int flush_flags = 0;
>>>> +    unsigned long i, top, max_pfn, start, count;
>>>> +    unsigned int flush_flags = 0, start_perms = 0;
>>>>  
>>>>      BUG_ON(!is_hardware_domain(d));
>>>>  
>>>> @@ -379,9 +379,9 @@ void __hwdom_init arch_iommu_hwdom_init(
>>>>       * First Mb will get mapped in one go by pvh_populate_p2m(). Avoid
>>>>       * setting up potentially conflicting mappings here.
>>>>       */
>>>> -    i = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
>>>> +    start = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
>>>>  
>>>> -    for ( ; i < top; i++ )
>>>> +    for ( i = start, count = 0; i < top; )
>>>>      {
>>>>          unsigned long pfn = pdx_to_pfn(i);
>>>>          unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
>>>> @@ -390,20 +390,41 @@ void __hwdom_init arch_iommu_hwdom_init(
>>>>          if ( !perms )
>>>>              rc = 0;
>>>>          else if ( paging_mode_translate(d) )
>>>> +        {
>>>>              rc = p2m_add_identity_entry(d, pfn,
>>>>                                          perms & IOMMUF_writable ? p2m_access_rw
>>>>                                                                  : p2m_access_r,
>>>>                                          0);
>>>> +            if ( rc )
>>>> +                printk(XENLOG_WARNING
>>>> +                       "%pd: identity mapping of %lx failed: %d\n",
>>>> +                       d, pfn, rc);
>>>> +        }
>>>> +        else if ( pfn != start + count || perms != start_perms )
>>>> +        {
>>>> +        commit:
>>>> +            rc = iommu_map(d, _dfn(start), _mfn(start), count, start_perms,
>>>> +                           &flush_flags);
>>>> +            if ( rc )
>>>> +                printk(XENLOG_WARNING
>>>> +                       "%pd: IOMMU identity mapping of [%lx,%lx) failed: %d\n",
>>>> +                       d, pfn, pfn + count, rc);
>>>> +            SWAP(start, pfn);
>>>> +            start_perms = perms;
>>>> +            count = 1;
>>>> +        }
>>>>          else
>>>> -            rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
>>>> -                           perms, &flush_flags);
>>>> +        {
>>>> +            ++count;
>>>> +            rc = 0;
>>>
>>> Seeing as we want to process this in blocks now, I wonder whether it
>>> would make sense to take a different approach, and use a rangeset to
>>> track which regions need to be mapped.  What gets added would be based
>>> on the host e820 plus the options
>>> iommu_hwdom_{strict,inclusive,reserved}.  We would then punch holes
>>> based on the logic in hwdom_iommu_map() and finally we could iterate
>>> over the regions afterwards using rangeset_consume_ranges().
>>>
>>> Not that you strictly need to do it here, just think the end result
>>> would be clearer.
>>
>> The end result might indeed be, but it would be more of a change right
>> here. Hence I'd prefer to leave that out of the series for now.
> 
> OK.  I think it might be nice to add a comment in that regard, mostly
> because I tend to forget myself.

Sure, I've added another post-commit-message remark.

Jan
Roger Pau Monné May 4, 2022, 1:55 p.m. UTC | #5
On Wed, May 04, 2022 at 02:27:14PM +0200, Jan Beulich wrote:
> On 04.05.2022 13:20, Roger Pau Monné wrote:
> > On Wed, May 04, 2022 at 11:46:37AM +0200, Jan Beulich wrote:
> >> On 03.05.2022 16:49, Roger Pau Monné wrote:
> >>> On Mon, Apr 25, 2022 at 10:34:59AM +0200, Jan Beulich wrote:
> >>> It would seem to me that doing it that way would also allow the
> >>> mappings to get established in blocks for domUs.
> >>
> >> ... then this would perhaps be possible.
> >>
> >>>> The installing of zero-ref writable types has in fact shown (observed
> >>>> while putting together the change) that despite the intention by the
> >>>> XSA-288 changes (affecting DomU-s only) for Dom0 a number of
> >>>> sufficiently ordinary pages (at the very least initrd and P2M ones as
> >>>> well as pages that are part of the initial allocation but not part of
> >>>> the initial mapping) still have been starting out as PGT_none, meaning
> >>>> that they would have gained IOMMU mappings only the first time these
> >>>> pages would get mapped writably. Consequently an open question is
> >>>> whether iommu_memory_setup() should set the pages to PGT_writable_page
> >>>> independent of need_iommu_pt_sync().
> >>>
> >>> I think I'm confused, doesn't the setting of PGT_writable_page happen
> >>> as a result of need_iommu_pt_sync() and having those pages added to
> >>> the IOMMU page tables? (so they can be properly tracked and IOMMU
> >>> mappings are removed if the page is also removed)
> >>
> >> In principle yes - in guest_physmap_add_page(). But this function isn't
> >> called for the pages I did enumerate in the remark. XSA-288 really only
> >> cared about getting this right for DomU-s.
> > 
> > Would it make sense to change guest_physmap_add_page() to be able to
> > pass the page_order parameter down to iommu_map(), and then use it for
> > dom0 build instead of introducing iommu_memory_setup()?
> 
> To be quite frank: This is something that I might have been willing to
> do months ago, when this series was still fresh. If I was to start
> re-doing all of this code now, it would take far more time than it
> would have taken back then. Hence I'd like to avoid a full re-work here
> unless entirely unacceptable in the way currently done (which largely
> fits with how we have been doing Dom0 setup).

Sorry, I would have really liked to be more on time with reviews of
this, but there's always something that comes up.

> Furthermore, guest_physmap_add_page() doesn't itself call iommu_map().
> What you're suggesting would require get_page_and_type() to be able to
> work on higher-order pages. I view adjustments like this as well out
> of scope for this series.

Well, my initial thinking was to do something similar to what you
currently have in iommu_memory_setup: a direct call to iommu_map and
adjust the page types manually, but I think this will only work for
dom0 because pages are fresh at that point.  For domUs we must use
get_page_and_type so any previous mapping is also removed.

> > I think guest_physmap_add_page() will need to be adjusted at some
> > point for domUs, and hence it could be unified with dom0 usage
> > also?
> 
> As an optimization - perhaps. I view it as more important to have HVM
> guests work reasonably well (which includes the performance aspect of
> setting them up).

OK, I'm fine with focusing on HVM.

> >>>> --- a/xen/drivers/passthrough/x86/iommu.c
> >>>> +++ b/xen/drivers/passthrough/x86/iommu.c
> >>>> @@ -347,8 +347,8 @@ static unsigned int __hwdom_init hwdom_i
> >>>>  
> >>>>  void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
> >>>>  {
> >>>> -    unsigned long i, top, max_pfn;
> >>>> -    unsigned int flush_flags = 0;
> >>>> +    unsigned long i, top, max_pfn, start, count;
> >>>> +    unsigned int flush_flags = 0, start_perms = 0;
> >>>>  
> >>>>      BUG_ON(!is_hardware_domain(d));
> >>>>  
> >>>> @@ -379,9 +379,9 @@ void __hwdom_init arch_iommu_hwdom_init(
> >>>>       * First Mb will get mapped in one go by pvh_populate_p2m(). Avoid
> >>>>       * setting up potentially conflicting mappings here.
> >>>>       */
> >>>> -    i = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
> >>>> +    start = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
> >>>>  
> >>>> -    for ( ; i < top; i++ )
> >>>> +    for ( i = start, count = 0; i < top; )
> >>>>      {
> >>>>          unsigned long pfn = pdx_to_pfn(i);
> >>>>          unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
> >>>> @@ -390,20 +390,41 @@ void __hwdom_init arch_iommu_hwdom_init(
> >>>>          if ( !perms )
> >>>>              rc = 0;
> >>>>          else if ( paging_mode_translate(d) )
> >>>> +        {
> >>>>              rc = p2m_add_identity_entry(d, pfn,
> >>>>                                          perms & IOMMUF_writable ? p2m_access_rw
> >>>>                                                                  : p2m_access_r,
> >>>>                                          0);
> >>>> +            if ( rc )
> >>>> +                printk(XENLOG_WARNING
> >>>> +                       "%pd: identity mapping of %lx failed: %d\n",
> >>>> +                       d, pfn, rc);
> >>>> +        }
> >>>> +        else if ( pfn != start + count || perms != start_perms )
> >>>> +        {
> >>>> +        commit:
> >>>> +            rc = iommu_map(d, _dfn(start), _mfn(start), count, start_perms,
> >>>> +                           &flush_flags);
> >>>> +            if ( rc )
> >>>> +                printk(XENLOG_WARNING
> >>>> +                       "%pd: IOMMU identity mapping of [%lx,%lx) failed: %d\n",
> >>>> +                       d, pfn, pfn + count, rc);
> >>>> +            SWAP(start, pfn);
> >>>> +            start_perms = perms;
> >>>> +            count = 1;
> >>>> +        }
> >>>>          else
> >>>> -            rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
> >>>> -                           perms, &flush_flags);
> >>>> +        {
> >>>> +            ++count;
> >>>> +            rc = 0;
> >>>
> >>> Seeing as we want to process this in blocks now, I wonder whether it
> >>> would make sense to take a different approach, and use a rangeset to
> >>> track which regions need to be mapped.  What gets added would be based
> >>> on the host e820 plus the options
> >>> iommu_hwdom_{strict,inclusive,reserved}.  We would then punch holes
> >>> based on the logic in hwdom_iommu_map() and finally we could iterate
> >>> over the regions afterwards using rangeset_consume_ranges().
> >>>
> >>> Not that you strictly need to do it here, just think the end result
> >>> would be clearer.
> >>
> >> The end result might indeed be, but it would be more of a change right
> >> here. Hence I'd prefer to leave that out of the series for now.
> > 
> > OK.  I think it might be nice to add a comment in that regard, mostly
> > because I tend to forget myself.
> 
> Sure, I've added another post-commit-message remark.

Sorry for being confused, but are those reflected in the final commit
message, or in the code itself?

Thanks, Roger.
Jan Beulich May 4, 2022, 2:26 p.m. UTC | #6
On 04.05.2022 15:55, Roger Pau Monné wrote:
> On Wed, May 04, 2022 at 02:27:14PM +0200, Jan Beulich wrote:
>> On 04.05.2022 13:20, Roger Pau Monné wrote:
>>> On Wed, May 04, 2022 at 11:46:37AM +0200, Jan Beulich wrote:
>>>> On 03.05.2022 16:49, Roger Pau Monné wrote:
>>>>> On Mon, Apr 25, 2022 at 10:34:59AM +0200, Jan Beulich wrote:
>>>>>> @@ -390,20 +390,41 @@ void __hwdom_init arch_iommu_hwdom_init(
>>>>>>          if ( !perms )
>>>>>>              rc = 0;
>>>>>>          else if ( paging_mode_translate(d) )
>>>>>> +        {
>>>>>>              rc = p2m_add_identity_entry(d, pfn,
>>>>>>                                          perms & IOMMUF_writable ? p2m_access_rw
>>>>>>                                                                  : p2m_access_r,
>>>>>>                                          0);
>>>>>> +            if ( rc )
>>>>>> +                printk(XENLOG_WARNING
>>>>>> +                       "%pd: identity mapping of %lx failed: %d\n",
>>>>>> +                       d, pfn, rc);
>>>>>> +        }
>>>>>> +        else if ( pfn != start + count || perms != start_perms )
>>>>>> +        {
>>>>>> +        commit:
>>>>>> +            rc = iommu_map(d, _dfn(start), _mfn(start), count, start_perms,
>>>>>> +                           &flush_flags);
>>>>>> +            if ( rc )
>>>>>> +                printk(XENLOG_WARNING
>>>>>> +                       "%pd: IOMMU identity mapping of [%lx,%lx) failed: %d\n",
>>>>>> +                       d, pfn, pfn + count, rc);
>>>>>> +            SWAP(start, pfn);
>>>>>> +            start_perms = perms;
>>>>>> +            count = 1;
>>>>>> +        }
>>>>>>          else
>>>>>> -            rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
>>>>>> -                           perms, &flush_flags);
>>>>>> +        {
>>>>>> +            ++count;
>>>>>> +            rc = 0;
>>>>>
>>>>> Seeing as we want to process this in blocks now, I wonder whether it
>>>>> would make sense to take a different approach, and use a rangeset to
>>>>> track which regions need to be mapped.  What gets added would be based
>>>>> on the host e820 plus the options
>>>>> iommu_hwdom_{strict,inclusive,reserved}.  We would then punch holes
>>>>> based on the logic in hwdom_iommu_map() and finally we could iterate
>>>>> over the regions afterwards using rangeset_consume_ranges().
>>>>>
>>>>> Not that you strictly need to do it here, just think the end result
>>>>> would be clearer.
>>>>
>>>> The end result might indeed be, but it would be more of a change right
>>>> here. Hence I'd prefer to leave that out of the series for now.
>>>
>>> OK.  I think it might be nice to add a comment in that regard, mostly
>>> because I tend to forget myself.
>>
>> Sure, I've added another post-commit-message remark.
> 
> Sorry for being confused, but are those reflected in the final commit
> message, or in the code itself?

Neither - I didn't think we had any code comments anywhere which outline
future plans, including the reasons for not doing so right away. When writing
that new remark I didn't even think it would belong in the commit message.

Jan

Patch

--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -46,7 +46,8 @@  void __init dom0_update_physmap(bool com
 static __init void mark_pv_pt_pages_rdonly(struct domain *d,
                                            l4_pgentry_t *l4start,
                                            unsigned long vpt_start,
-                                           unsigned long nr_pt_pages)
+                                           unsigned long nr_pt_pages,
+                                           unsigned int *flush_flags)
 {
     unsigned long count;
     struct page_info *page;
@@ -71,6 +72,14 @@  static __init void mark_pv_pt_pages_rdon
         ASSERT((page->u.inuse.type_info & PGT_type_mask) <= PGT_root_page_table);
         ASSERT(!(page->u.inuse.type_info & ~(PGT_type_mask | PGT_pae_xen_l2)));
 
+        /*
+         * Page table pages need to be removed from the IOMMU again in case
+         * iommu_memory_setup() ended up mapping them.
+         */
+        if ( need_iommu_pt_sync(d) &&
+             iommu_unmap(d, _dfn(mfn_x(page_to_mfn(page))), 1, flush_flags) )
+            BUG();
+
         /* Read-only mapping + PGC_allocated + page-table page. */
         page->count_info         = PGC_allocated | 3;
         page->u.inuse.type_info |= PGT_validated | 1;
@@ -107,11 +116,43 @@  static __init void mark_pv_pt_pages_rdon
     unmap_domain_page(pl3e);
 }
 
+static void __init iommu_memory_setup(struct domain *d, const char *what,
+                                      struct page_info *page, unsigned long nr,
+                                      unsigned int *flush_flags)
+{
+    int rc;
+    mfn_t mfn = page_to_mfn(page);
+
+    if ( !need_iommu_pt_sync(d) )
+        return;
+
+    rc = iommu_map(d, _dfn(mfn_x(mfn)), mfn, nr,
+                   IOMMUF_readable | IOMMUF_writable, flush_flags);
+    if ( rc )
+    {
+        printk(XENLOG_ERR "pre-mapping %s MFN [%lx,%lx) into IOMMU failed: %d\n",
+               what, mfn_x(mfn), mfn_x(mfn) + nr, rc);
+        return;
+    }
+
+    /*
+     * For successfully established IOMMU mappings the type of the page(s)
+     * needs to match (for _get_page_type() to unmap upon type change). Set
+     * the page(s) to writable with no type ref.
+     */
+    for ( ; nr--; ++page )
+    {
+        ASSERT(!page->u.inuse.type_info);
+        page->u.inuse.type_info = PGT_writable_page | PGT_validated;
+    }
+}
+
 static __init void setup_pv_physmap(struct domain *d, unsigned long pgtbl_pfn,
                                     unsigned long v_start, unsigned long v_end,
                                     unsigned long vphysmap_start,
                                     unsigned long vphysmap_end,
-                                    unsigned long nr_pages)
+                                    unsigned long nr_pages,
+                                    unsigned int *flush_flags)
 {
     struct page_info *page = NULL;
     l4_pgentry_t *pl4e, *l4start = map_domain_page(_mfn(pgtbl_pfn));
@@ -177,6 +218,10 @@  static __init void setup_pv_physmap(stru
                                              L3_PAGETABLE_SHIFT - PAGE_SHIFT,
                                              MEMF_no_scrub)) != NULL )
             {
+                iommu_memory_setup(d, "P2M 1G", page,
+                                   SUPERPAGE_PAGES * SUPERPAGE_PAGES,
+                                   flush_flags);
+
                 *pl3e = l3e_from_page(page, L1_PROT|_PAGE_DIRTY|_PAGE_PSE);
                 vphysmap_start += 1UL << L3_PAGETABLE_SHIFT;
                 continue;
@@ -203,6 +248,9 @@  static __init void setup_pv_physmap(stru
                                              L2_PAGETABLE_SHIFT - PAGE_SHIFT,
                                              MEMF_no_scrub)) != NULL )
             {
+                iommu_memory_setup(d, "P2M 2M", page, SUPERPAGE_PAGES,
+                                   flush_flags);
+
                 *pl2e = l2e_from_page(page, L1_PROT|_PAGE_DIRTY|_PAGE_PSE);
                 vphysmap_start += 1UL << L2_PAGETABLE_SHIFT;
                 continue;
@@ -311,6 +359,7 @@  int __init dom0_construct_pv(struct doma
     unsigned long initrd_pfn = -1, initrd_mfn = 0;
     unsigned long count;
     struct page_info *page = NULL;
+    unsigned int flush_flags = 0;
     start_info_t *si;
     struct vcpu *v = d->vcpu[0];
     void *image_base = bootstrap_map(image);
@@ -573,6 +622,9 @@  int __init dom0_construct_pv(struct doma
                     BUG();
         }
         initrd->mod_end = 0;
+
+        iommu_memory_setup(d, "initrd", mfn_to_page(_mfn(initrd_mfn)),
+                           PFN_UP(initrd_len), &flush_flags);
     }
 
     printk("PHYSICAL MEMORY ARRANGEMENT:\n"
@@ -606,6 +658,13 @@  int __init dom0_construct_pv(struct doma
 
     process_pending_softirqs();
 
+    /*
+     * Map the full range here and then punch holes for page tables
+     * alongside marking them as such in mark_pv_pt_pages_rdonly().
+     */
+    iommu_memory_setup(d, "init-alloc", mfn_to_page(_mfn(alloc_spfn)),
+                       alloc_epfn - alloc_spfn, &flush_flags);
+
     mpt_alloc = (vpt_start - v_start) + pfn_to_paddr(alloc_spfn);
     if ( vinitrd_start )
         mpt_alloc -= PAGE_ALIGN(initrd_len);
@@ -690,7 +749,8 @@  int __init dom0_construct_pv(struct doma
         l1tab++;
 
         page = mfn_to_page(_mfn(mfn));
-        if ( !page->u.inuse.type_info &&
+        if ( (!page->u.inuse.type_info ||
+              page->u.inuse.type_info == (PGT_writable_page | PGT_validated)) &&
              !get_page_and_type(page, d, PGT_writable_page) )
             BUG();
     }
@@ -719,7 +779,7 @@  int __init dom0_construct_pv(struct doma
     }
 
     /* Pages that are part of page tables must be read only. */
-    mark_pv_pt_pages_rdonly(d, l4start, vpt_start, nr_pt_pages);
+    mark_pv_pt_pages_rdonly(d, l4start, vpt_start, nr_pt_pages, &flush_flags);
 
     /* Mask all upcalls... */
     for ( i = 0; i < XEN_LEGACY_MAX_VCPUS; i++ )
@@ -794,7 +854,7 @@  int __init dom0_construct_pv(struct doma
     {
         pfn = pagetable_get_pfn(v->arch.guest_table);
         setup_pv_physmap(d, pfn, v_start, v_end, vphysmap_start, vphysmap_end,
-                         nr_pages);
+                         nr_pages, &flush_flags);
     }
 
     /* Write the phys->machine and machine->phys table entries. */
@@ -825,7 +885,9 @@  int __init dom0_construct_pv(struct doma
         if ( get_gpfn_from_mfn(mfn) >= count )
         {
             BUG_ON(compat);
-            if ( !page->u.inuse.type_info &&
+            if ( (!page->u.inuse.type_info ||
+                  page->u.inuse.type_info == (PGT_writable_page |
+                                              PGT_validated)) &&
                  !get_page_and_type(page, d, PGT_writable_page) )
                 BUG();
 
@@ -841,8 +903,12 @@  int __init dom0_construct_pv(struct doma
 #endif
     while ( pfn < nr_pages )
     {
-        if ( (page = alloc_chunk(d, nr_pages - domain_tot_pages(d))) == NULL )
+        count = domain_tot_pages(d);
+        if ( (page = alloc_chunk(d, nr_pages - count)) == NULL )
             panic("Not enough RAM for DOM0 reservation\n");
+
+        iommu_memory_setup(d, "chunk", page, domain_tot_pages(d) - count,
+                           &flush_flags);
         while ( pfn < domain_tot_pages(d) )
         {
             mfn = mfn_x(page_to_mfn(page));
@@ -857,6 +923,10 @@  int __init dom0_construct_pv(struct doma
         }
     }
 
+    /* Use while() to avoid compiler warning. */
+    while ( iommu_iotlb_flush_all(d, flush_flags) )
+        break;
+
     if ( initrd_len != 0 )
     {
         si->mod_start = vinitrd_start ?: initrd_pfn;
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -347,8 +347,8 @@  static unsigned int __hwdom_init hwdom_i
 
 void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
 {
-    unsigned long i, top, max_pfn;
-    unsigned int flush_flags = 0;
+    unsigned long i, top, max_pfn, start, count;
+    unsigned int flush_flags = 0, start_perms = 0;
 
     BUG_ON(!is_hardware_domain(d));
 
@@ -379,9 +379,9 @@  void __hwdom_init arch_iommu_hwdom_init(
      * First Mb will get mapped in one go by pvh_populate_p2m(). Avoid
      * setting up potentially conflicting mappings here.
      */
-    i = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
+    start = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
 
-    for ( ; i < top; i++ )
+    for ( i = start, count = 0; i < top; )
     {
         unsigned long pfn = pdx_to_pfn(i);
         unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
@@ -390,20 +390,41 @@  void __hwdom_init arch_iommu_hwdom_init(
         if ( !perms )
             rc = 0;
         else if ( paging_mode_translate(d) )
+        {
             rc = p2m_add_identity_entry(d, pfn,
                                         perms & IOMMUF_writable ? p2m_access_rw
                                                                 : p2m_access_r,
                                         0);
+            if ( rc )
+                printk(XENLOG_WARNING
+                       "%pd: identity mapping of %lx failed: %d\n",
+                       d, pfn, rc);
+        }
+        else if ( pfn != start + count || perms != start_perms )
+        {
+        commit:
+            rc = iommu_map(d, _dfn(start), _mfn(start), count, start_perms,
+                           &flush_flags);
+            if ( rc )
+                printk(XENLOG_WARNING
+                       "%pd: IOMMU identity mapping of [%lx,%lx) failed: %d\n",
+                       d, pfn, pfn + count, rc);
+            SWAP(start, pfn);
+            start_perms = perms;
+            count = 1;
+        }
         else
-            rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
-                           perms, &flush_flags);
+        {
+            ++count;
+            rc = 0;
+        }
 
-        if ( rc )
-            printk(XENLOG_WARNING "%pd: identity %smapping of %lx failed: %d\n",
-                   d, !paging_mode_translate(d) ? "IOMMU " : "", pfn, rc);
 
-        if (!(i & 0xfffff))
+        if ( !(++i & 0xfffff) )
             process_pending_softirqs();
+
+        if ( i == top && count )
+            goto commit;
     }
 
     /* Use if to avoid compiler warning */