
[for-4.15,v2,1/5] xen/x86: p2m: Don't map the special pages in the IOMMU page-tables

Message ID 20210209152816.15792-2-julien@xen.org (mailing list archive)
State Superseded
Series xen/iommu: Collection of bug fixes for IOMMU teardown

Commit Message

Julien Grall Feb. 9, 2021, 3:28 p.m. UTC
From: Julien Grall <jgrall@amazon.com>

Currently, the IOMMU page-tables will be populated early in the domain
creation if the hardware is able to virtualize the local APIC. However,
the IOMMU page tables will not be freed during early failure and will
result in a leak.

An assigned device should not need to DMA into the vLAPIC page, so we
can avoid mapping the page in the IOMMU page-tables.

This statement is also true for any special pages (the vLAPIC page is
one of them). So take the opportunity to prevent the mapping for all
of them.

Note that:
    - This matches the existing behavior with PV guests
    - This doesn't change the behavior when the P2M is shared with the
    IOMMU. IOW, the special pages will still be accessible by the
    device.

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Julien Grall <jgrall@amazon.com>

---
    Changes in v2:
        - New patch
---
 xen/include/asm-x86/p2m.h | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Paul Durrant Feb. 9, 2021, 8:19 p.m. UTC | #1
> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Julien Grall
> Sent: 09 February 2021 15:28
> To: xen-devel@lists.xenproject.org
> Cc: hongyxia@amazon.co.uk; iwj@xenproject.org; Julien Grall <jgrall@amazon.com>; Jan Beulich
> <jbeulich@suse.com>; Andrew Cooper <andrew.cooper3@citrix.com>; Roger Pau Monné
> <roger.pau@citrix.com>; Wei Liu <wl@xen.org>
> Subject: [for-4.15][PATCH v2 1/5] xen/x86: p2m: Don't map the special pages in the IOMMU page-tables
> 
> From: Julien Grall <jgrall@amazon.com>
> 
> Currently, the IOMMU page-tables will be populated early in the domain
> creation if the hardware is able to virtualize the local APIC. However,
> the IOMMU page tables will not be freed during early failure and will
> result in a leak.
> 
> An assigned device should not need to DMA into the vLAPIC page, so we
> can avoid mapping the page in the IOMMU page-tables.
> 
> This statement is also true for any special pages (the vLAPIC page is
> one of them). So take the opportunity to prevent the mapping for all
> of them.
> 
> Note that:
>     - This matches the existing behavior with PV guests
>     - This doesn't change the behavior when the P2M is shared with the
>     IOMMU. IOW, the special pages will still be accessible by the
>     device.
> 
> Suggested-by: Jan Beulich <jbeulich@suse.com>
> Signed-off-by: Julien Grall <jgrall@amazon.com>
> 

Reviewed-by: Paul Durrant <paul@xen.org>
Roger Pau Monné Feb. 10, 2021, 8:29 a.m. UTC | #2
On Tue, Feb 09, 2021 at 03:28:12PM +0000, Julien Grall wrote:
> From: Julien Grall <jgrall@amazon.com>
> 
> Currently, the IOMMU page-tables will be populated early in the domain
> creation if the hardware is able to virtualize the local APIC. However,
> the IOMMU page tables will not be freed during early failure and will
> result in a leak.
> 
> An assigned device should not need to DMA into the vLAPIC page, so we
> can avoid mapping the page in the IOMMU page-tables.
> 
> This statement is also true for any special pages (the vLAPIC page is
> one of them). So take the opportunity to prevent the mapping for all
> of them.

Hm, OK, while I assume special pages are unlikely to be the target
of DMA operations, it's not easy to spot which pages are special.

> Note that:
>     - This matches the existing behavior with PV guests

You might make HVM guests not sharing page-tables 'match' PV
behavior, but you are making behavior between HVM guests themselves
diverge.


>     - This doesn't change the behavior when the P2M is shared with the
>     IOMMU. IOW, the special pages will still be accessible by the
>     device.

I have to admit I don't like this part at all. Having diverging device
mappings depending on whether the page tables are shared or not is
bad IMO, as there might be subtle bugs affecting one of the two
modes.

I get the feeling this is just papering over an existing issue instead
of actually fixing it: IOMMU page tables need to be properly freed
during early failure.

> Suggested-by: Jan Beulich <jbeulich@suse.com>
> Signed-off-by: Julien Grall <jgrall@amazon.com>
> 
> ---
>     Changes in v2:
>         - New patch
> ---
>  xen/include/asm-x86/p2m.h | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
> index 7d63f5787e62..1802545969b3 100644
> --- a/xen/include/asm-x86/p2m.h
> +++ b/xen/include/asm-x86/p2m.h
> @@ -919,6 +919,10 @@ static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt, mfn_t mfn)
>  {
>      unsigned int flags;
>  
> +    /* Don't map special pages in the IOMMU page-tables. */

I think this should explain why special pages don't require IOMMU
mappings, or even just note that special pages cannot be added to the
IOMMU page tables due to failure to free them afterwards and that this
is a bodge for it.
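
For instance, the comment could spell this out along these lines (the
wording is only a sketch of mine, not taken from the patch):

    /*
     * Don't map special pages in the IOMMU page-tables: they are
     * Xen-allocated pages (e.g. grant table, vCPU info, APIC access
     * page) and are not expected DMA targets. Mapping them would also
     * leak IOMMU page-tables on early domain-creation failure, as
     * there is currently no path freeing them.
     */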

Thanks, Roger.
Julien Grall Feb. 10, 2021, 8:50 a.m. UTC | #3
Hi Roger,

On 10/02/2021 08:29, Roger Pau Monné wrote:
> On Tue, Feb 09, 2021 at 03:28:12PM +0000, Julien Grall wrote:
>> From: Julien Grall <jgrall@amazon.com>
>>
>> Currently, the IOMMU page-tables will be populated early in the domain
>> creation if the hardware is able to virtualize the local APIC. However,
>> the IOMMU page tables will not be freed during early failure and will
>> result in a leak.
>>
>> An assigned device should not need to DMA into the vLAPIC page, so we
>> can avoid mapping the page in the IOMMU page-tables.
>>
>> This statement is also true for any special pages (the vLAPIC page is
>> one of them). So take the opportunity to prevent the mapping for all
>> of them.
> 
> Hm, OK, while I assume special pages are unlikely to be the target
> of DMA operations, it's not easy to spot which pages are special.

Special pages are allocated by Xen for the grant table, vCPU info...
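
For reference, the predicate is roughly the following (a sketch of
xen/include/xen/mm.h from memory; the exact tree state may differ):

    static inline bool is_special_page(const struct page_info *page)
    {
        /* Xen-heap pages and PGC_extra pages are considered special. */
        return is_xen_heap_page(page) || (page->count_info & PGC_extra);
    }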

> 
>> Note that:
>>      - This matches the existing behavior with PV guests
> 
> You might make HVM guests not sharing page-tables 'match' PV
> behavior, but you are making behavior between HVM guests themselves
> diverge.
> 
> 
>>      - This doesn't change the behavior when the P2M is shared with the
>>      IOMMU. IOW, the special pages will still be accessible by the
>>      device.
> 
> I have to admit I don't like this part at all. Having diverging device
> mappings depending on whether the page tables are shared or not is
> bad IMO, as there might be subtle bugs affecting one of the two
> modes.
> 
> I get the feeling this is just papering over an existing issue instead
> of actually fixing it: IOMMU page tables need to be properly freed
> during early failure.

My initial approach was to free the IOMMU page tables during early 
failure (see [1] and [2]). But Jan said the special pages should really 
not be mapped in the IOMMU ([3]) and he wasn't very happy with freeing 
the IOMMU page tables on early failure.

I don't particularly care about the approach as long as we don't leak 
IOMMU page-tables at the end.

So please try to find common ground with Jan here.

Cheers,

[1] <20201222154338.9459-1-julien@xen.org>
[2] <20201222154338.9459-5-julien@xen.org>
Julien Grall Feb. 10, 2021, 9:34 a.m. UTC | #4
On 10/02/2021 08:50, Julien Grall wrote:
> Hi Roger,
> 
> On 10/02/2021 08:29, Roger Pau Monné wrote:
>> On Tue, Feb 09, 2021 at 03:28:12PM +0000, Julien Grall wrote:
>>> From: Julien Grall <jgrall@amazon.com>
>>>
>>> Currently, the IOMMU page-tables will be populated early in the domain
>>> creation if the hardware is able to virtualize the local APIC. However,
>>> the IOMMU page tables will not be freed during early failure and will
>>> result in a leak.
>>>
>>> An assigned device should not need to DMA into the vLAPIC page, so we
>>> can avoid mapping the page in the IOMMU page-tables.
>>>
>>> This statement is also true for any special pages (the vLAPIC page is
>>> one of them). So take the opportunity to prevent the mapping for all
>>> of them.
>>
>> Hm, OK, while I assume special pages are unlikely to be the target
>> of DMA operations, it's not easy to spot which pages are special.
> 
> Special pages are allocated by Xen for the grant table, vCPU info...
> 
>>
>>> Note that:
>>>      - This matches the existing behavior with PV guests
>>
>> You might make HVM guests not sharing page-tables 'match' PV
>> behavior, but you are making behavior between HVM guests themselves
>> diverge.
>>
>>
>>>      - This doesn't change the behavior when the P2M is shared with the
>>>      IOMMU. IOW, the special pages will still be accessible by the
>>>      device.
>>
>> I have to admit I don't like this part at all. Having diverging device
>> mappings depending on whether the page tables are shared or not is
>> bad IMO, as there might be subtle bugs affecting one of the two
>> modes.
>>
>> I get the feeling this is just papering over an existing issue instead
>> of actually fixing it: IOMMU page tables need to be properly freed
>> during early failure.
> 
> My initial approach was to free the IOMMU page tables during early 
> failure (see [1] and [2]). But Jan said the special pages should really 
> not be mapped in the IOMMU ([3]) and he wasn't very happy with freeing 
> the IOMMU page tables on early failure.
> 
> I don't particularly care about the approach as long as we don't leak 
> IOMMU page-tables at the end.
> 
> So please try to find common ground with Jan here.
> 
> Cheers,
> 
> [1] <20201222154338.9459-1-julien@xen.org>
> [2] <20201222154338.9459-5-julien@xen.org>

Roger pointed out on IRC that I forgot to add a link for [3]. So here we go:

[3] <a22f7364-518f-ea5f-3b87-5c0462cfc193@suse.com>

Cheers,
Jan Beulich Feb. 10, 2021, 11:10 a.m. UTC | #5
On 10.02.2021 09:29, Roger Pau Monné wrote:
> On Tue, Feb 09, 2021 at 03:28:12PM +0000, Julien Grall wrote:
>> From: Julien Grall <jgrall@amazon.com>
>>
>> Currently, the IOMMU page-tables will be populated early in the domain
>> creation if the hardware is able to virtualize the local APIC. However,
>> the IOMMU page tables will not be freed during early failure and will
>> result in a leak.
>>
>> An assigned device should not need to DMA into the vLAPIC page, so we
>> can avoid mapping the page in the IOMMU page-tables.
>>
>> This statement is also true for any special pages (the vLAPIC page is
>> one of them). So take the opportunity to prevent the mapping for all
>> of them.
> 
> Hm, OK, while I assume special pages are unlikely to be the target
> of DMA operations, it's not easy to spot which pages are special.
> 
>> Note that:
>>     - This matches the existing behavior with PV guests
> 
> You might make HVM guests not sharing page-tables 'match' PV
> behavior, but you are making behavior between HVM guests themselves
> diverge.
> 
> 
>>     - This doesn't change the behavior when the P2M is shared with the
>>     IOMMU. IOW, the special pages will still be accessible by the
>>     device.
> 
> I have to admit I don't like this part at all. Having diverging device
> mappings depending on whether the page tables are shared or not is
> bad IMO, as there might be subtle bugs affecting one of the two
> modes.

This is one way to look at things, yes. But if you take the
other perspective that special pages shouldn't be
IOMMU-mapped, then the divergence is the price to pay for
being able to share pages (and it's not Julien introducing
bad behavior here).

Additionally it may be possible to utilize the divergence to
our advantage: If one way of setting up things works and the
other doesn't, we have a reasonable clue where to look. In
fact the aspect above may, together with possible future
findings, end up being a reason to not default to or even
disallow (like for AMD) page table sharing.

> I get the feeling this is just papering over an existing issue instead
> of actually fixing it: IOMMU page tables need to be properly freed
> during early failure.

I take a different perspective: IOMMU page tables shouldn't
get created (yet) at all in the course of
XEN_DOMCTL_createdomain - this op is supposed to produce an
empty container for a VM.

Jan
Jan Beulich Feb. 10, 2021, 11:26 a.m. UTC | #6
On 09.02.2021 16:28, Julien Grall wrote:
> From: Julien Grall <jgrall@amazon.com>
> 
> Currently, the IOMMU page-tables will be populated early in the domain
> creation if the hardware is able to virtualize the local APIC. However,
> the IOMMU page tables will not be freed during early failure and will
> result in a leak.
> 
> An assigned device should not need to DMA into the vLAPIC page, so we
> can avoid mapping the page in the IOMMU page-tables.

Here and below, may I ask that you use the correct term "APIC
access page", as there are other pages involved in vLAPIC
handling (in particular the virtual APIC page, which is where
accesses that translate to the APIC access page in EPT actually
go).

> This statement is also true for any special pages (the vLAPIC page is
> one of them). So take the opportunity to prevent the mapping for all
> of them.

I probably should have realized this earlier, but there is a
downside to this: A guest wanting to core dump itself may want
to dump e.g. shared info and vcpu info pages. Hence ...

> --- a/xen/include/asm-x86/p2m.h
> +++ b/xen/include/asm-x86/p2m.h
> @@ -919,6 +919,10 @@ static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt, mfn_t mfn)
>  {
>      unsigned int flags;
>  
> +    /* Don't map special pages in the IOMMU page-tables. */
> +    if ( mfn_valid(mfn) && is_special_page(mfn_to_page(mfn)) )
> +        return 0;

... instead of is_special_page() I think you want to limit the
check here to seeing whether PGC_extra is set.
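
An illustrative sketch of that narrower check (untested, for discussion
only):

    /* Don't map Xen-allocated "extra" pages in the IOMMU page-tables. */
    if ( mfn_valid(mfn) && (mfn_to_page(mfn)->count_info & PGC_extra) )
        return 0;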

But as said on irc, since this crude way of setting up the APIC
access page is now firmly a problem, I intend to try to redo it.
I can't tell yet whether this will still take a PGC_extra page
of some form (nor if my plans will work out in the first place).
Irrespective of this I think we indeed want to exclude PGC_extra
pages from getting mapped. However, the APIC access page is, I
think, an outlier here - we wouldn't insert other PGC_extra
pages into the p2m, so for all other cases the above addition
would be effectively dead code.

Jan
Roger Pau Monné Feb. 10, 2021, 11:34 a.m. UTC | #7
On Wed, Feb 10, 2021 at 12:10:09PM +0100, Jan Beulich wrote:
> On 10.02.2021 09:29, Roger Pau Monné wrote:
> > On Tue, Feb 09, 2021 at 03:28:12PM +0000, Julien Grall wrote:
> >> From: Julien Grall <jgrall@amazon.com>
> >>
> >> Currently, the IOMMU page-tables will be populated early in the domain
> >> creation if the hardware is able to virtualize the local APIC. However,
> >> the IOMMU page tables will not be freed during early failure and will
> >> result in a leak.
> >>
> >> An assigned device should not need to DMA into the vLAPIC page, so we
> >> can avoid mapping the page in the IOMMU page-tables.
> >>
> >> This statement is also true for any special pages (the vLAPIC page is
> >> one of them). So take the opportunity to prevent the mapping for all
> >> of them.
> > 
> > Hm, OK, while I assume special pages are unlikely to be the target
> > of DMA operations, it's not easy to spot which pages are special.
> > 
> >> Note that:
> >>     - This matches the existing behavior with PV guests
> > 
> > You might make HVM guests not sharing page-tables 'match' PV
> > behavior, but you are making behavior between HVM guests themselves
> > diverge.
> > 
> > 
> >>     - This doesn't change the behavior when the P2M is shared with the
> >>     IOMMU. IOW, the special pages will still be accessible by the
> >>     device.
> > 
> > I have to admit I don't like this part at all. Having diverging device
> > mappings depending on whether the page tables are shared or not is
> > bad IMO, as there might be subtle bugs affecting one of the two
> > modes.
> 
> This is one way to look at things, yes. But if you take the
> other perspective that special pages shouldn't be
> IOMMU-mapped, then the divergence is the price to pay for
> being able to share pages (and it's not Julien introducing
> bad behavior here).

Since when sharing we have no option but to make whatever is
accessible to the CPU also available to devices, it would seem
reasonable to also do it when not sharing.

> Additionally it may be possible to utilize the divergence to
> our advantage: If one way of setting up things works and the
> other doesn't, we have a reasonable clue where to look. In
> fact the aspect above may, together with possible future
> findings, end up being a reason to not default to or even
> disallow (like for AMD) page table sharing.

I think such an approach is risky: we likely don't test VT-d without
sharing that much (or at all?), and now that IOMMUs support the same
page sizes as EPT, bugs are likely to go unnoticed.

> > I get the feeling this is just papering over an existing issue instead
> > of actually fixing it: IOMMU page tables need to be properly freed
> > during early failure.
> 
> I take a different perspective: IOMMU page tables shouldn't
> get created (yet) at all in the course of
> XEN_DOMCTL_createdomain - this op is supposed to produce an
> empty container for a VM.

The same would apply for CPU page-tables then, yet they seem to be
created and populating them (ie: adding the lapic access page) doesn't
leak such entries, which points at an asymmetry. Either we setup both
tables and handle freeing them properly, or we set none of them.

Roger.
Jan Beulich Feb. 10, 2021, 11:38 a.m. UTC | #8
On 10.02.2021 12:34, Roger Pau Monné wrote:
> On Wed, Feb 10, 2021 at 12:10:09PM +0100, Jan Beulich wrote:
>> On 10.02.2021 09:29, Roger Pau Monné wrote:
>>> I get the feeling this is just papering over an existing issue instead
>>> of actually fixing it: IOMMU page tables need to be properly freed
>>> during early failure.
>>
>> I take a different perspective: IOMMU page tables shouldn't
>> get created (yet) at all in the course of
>> XEN_DOMCTL_createdomain - this op is supposed to produce an
>> empty container for a VM.
> 
> The same would apply for CPU page-tables then, yet they seem to be
> created and populating them (ie: adding the lapic access page) doesn't
> leak such entries, which points at an asymmetry. Either we setup both
> tables and handle freeing them properly, or we set none of them.

Where would CPU page tables get created from at this early stage?
There's no memory assigned to the guest yet, so there's nothing to
map afaics.

Jan
Julien Grall Feb. 10, 2021, 11:40 a.m. UTC | #9
On 10/02/2021 11:38, Jan Beulich wrote:
> On 10.02.2021 12:34, Roger Pau Monné wrote:
>> On Wed, Feb 10, 2021 at 12:10:09PM +0100, Jan Beulich wrote:
>>> On 10.02.2021 09:29, Roger Pau Monné wrote:
>>>> I get the feeling this is just papering over an existing issue instead
>>>> of actually fixing it: IOMMU page tables need to be properly freed
>>>> during early failure.
>>>
>>> I take a different perspective: IOMMU page tables shouldn't
>>> get created (yet) at all in the course of
>>> XEN_DOMCTL_createdomain - this op is supposed to produce an
>>> empty container for a VM.
>>
>> The same would apply for CPU page-tables then, yet they seem to be
>> created and populating them (ie: adding the lapic access page) doesn't
>> leak such entries, which points at an asymmetry. Either we setup both
>> tables and handle freeing them properly, or we set none of them.
> 
> Where would CPU page tables get created from at this early stage?

When mapping the APIC page in the P2M. I don't think you can get away 
with removing it completely.

Cheers,
Jan Beulich Feb. 10, 2021, 11:45 a.m. UTC | #10
On 10.02.2021 12:40, Julien Grall wrote:
> On 10/02/2021 11:38, Jan Beulich wrote:
>> On 10.02.2021 12:34, Roger Pau Monné wrote:
>>> On Wed, Feb 10, 2021 at 12:10:09PM +0100, Jan Beulich wrote:
>>>> On 10.02.2021 09:29, Roger Pau Monné wrote:
>>>>> I get the feeling this is just papering over an existing issue instead
>>>>> of actually fixing it: IOMMU page tables need to be properly freed
>>>>> during early failure.
>>>>
>>>> I take a different perspective: IOMMU page tables shouldn't
>>>> get created (yet) at all in the course of
>>>> XEN_DOMCTL_createdomain - this op is supposed to produce an
>>>> empty container for a VM.
>>>
>>> The same would apply for CPU page-tables then, yet they seem to be
>>> created and populating them (ie: adding the lapic access page) doesn't
>>> leak such entries, which points at an asymmetry. Either we setup both
>>> tables and handle freeing them properly, or we set none of them.
>>
>> Where would CPU page tables get created from at this early stage?
> 
> When mapping the APIC page in the P2M. I don't think you can get away 
> with removing it completely.

It doesn't need putting in the p2m this early. It would be quite
fine to defer this until e.g. the first vCPU gets created.

Jan
Julien Grall Feb. 10, 2021, 11:48 a.m. UTC | #11
On 10/02/2021 11:45, Jan Beulich wrote:
> On 10.02.2021 12:40, Julien Grall wrote:
>> On 10/02/2021 11:38, Jan Beulich wrote:
>>> On 10.02.2021 12:34, Roger Pau Monné wrote:
>>>> On Wed, Feb 10, 2021 at 12:10:09PM +0100, Jan Beulich wrote:
>>>>> On 10.02.2021 09:29, Roger Pau Monné wrote:
>>>>>> I get the feeling this is just papering over an existing issue instead
>>>>>> of actually fixing it: IOMMU page tables need to be properly freed
>>>>>> during early failure.
>>>>>
>>>>> I take a different perspective: IOMMU page tables shouldn't
>>>>> get created (yet) at all in the course of
>>>>> XEN_DOMCTL_createdomain - this op is supposed to produce an
>>>>> empty container for a VM.
>>>>
>>>> The same would apply for CPU page-tables then, yet they seem to be
>>>> created and populating them (ie: adding the lapic access page) doesn't
>>>> leak such entries, which points at an asymmetry. Either we setup both
>>>> tables and handle freeing them properly, or we set none of them.
>>>
>>> Where would CPU page tables get created from at this early stage?
>>
>> When mapping the APIC page in the P2M. I don't think you can get away
>> with removing it completely.
> 
> It doesn't need putting in the p2m this early. It would be quite
> fine to defer this until e.g. the first vCPU gets created.

It feels wrong to me to set up a per-domain mapping when initializing the 
first vCPU.

But, I was under the impression that there is a plan to remove 
XEN_DOMCTL_max_vcpus. So it would only buy a bit of time...

Cheers,
Roger Pau Monné Feb. 10, 2021, 11:54 a.m. UTC | #12
On Wed, Feb 10, 2021 at 11:48:40AM +0000, Julien Grall wrote:
> 
> 
> On 10/02/2021 11:45, Jan Beulich wrote:
> > On 10.02.2021 12:40, Julien Grall wrote:
> > > On 10/02/2021 11:38, Jan Beulich wrote:
> > > > On 10.02.2021 12:34, Roger Pau Monné wrote:
> > > > > On Wed, Feb 10, 2021 at 12:10:09PM +0100, Jan Beulich wrote:
> > > > > > On 10.02.2021 09:29, Roger Pau Monné wrote:
> > > > > > > I get the feeling this is just papering over an existing issue instead
> > > > > > > of actually fixing it: IOMMU page tables need to be properly freed
> > > > > > > during early failure.
> > > > > > 
> > > > > > I take a different perspective: IOMMU page tables shouldn't
> > > > > > get created (yet) at all in the course of
> > > > > > XEN_DOMCTL_createdomain - this op is supposed to produce an
> > > > > > empty container for a VM.
> > > > > 
> > > > > The same would apply for CPU page-tables then, yet they seem to be
> > > > > created and populating them (ie: adding the lapic access page) doesn't
> > > > > leak such entries, which points at an asymmetry. Either we setup both
> > > > > tables and handle freeing them properly, or we set none of them.
> > > > 
> > > > Where would CPU page tables get created from at this early stage?
> > > 
> > > When mapping the APIC page in the P2M. I don't think you can get away
> > > with removing it completely.
> > 
> > It doesn't need putting in the p2m this early. It would be quite
> > fine to defer this until e.g. the first vCPU gets created.
> 
> It feels wrong to me to set up a per-domain mapping when initializing the
> first vCPU.
> 
> But, I was under the impression that there is a plan to remove
> XEN_DOMCTL_max_vcpus. So it would only buy a bit of time...

I was also under that impression. We could set up the lapic access page
at vlapic_init for the BSP (which is part of XEN_DOMCTL_max_vcpus
ATM).

But then I think there should be some kind of check to prevent
populating either the CPU or the IOMMU page tables at the domain creation
hypercall, and so the logic to free CPU page tables on failure could
be removed.
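
One hypothetical shape for such a check (a sketch only; keying on whether
any vCPU exists yet, and the error code, are assumptions of mine):

    /* Sketch: refuse to populate p2m/IOMMU page tables while the domain
     * is still an empty container, i.e. before any vCPU was created. */
    if ( !d->vcpu || !d->vcpu[0] )
        return -EINVAL;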

Roger.
Jan Beulich Feb. 10, 2021, 1:08 p.m. UTC | #13
On 10.02.2021 12:48, Julien Grall wrote:
> 
> 
> On 10/02/2021 11:45, Jan Beulich wrote:
>> On 10.02.2021 12:40, Julien Grall wrote:
>>> On 10/02/2021 11:38, Jan Beulich wrote:
>>>> On 10.02.2021 12:34, Roger Pau Monné wrote:
>>>>> On Wed, Feb 10, 2021 at 12:10:09PM +0100, Jan Beulich wrote:
>>>>>> On 10.02.2021 09:29, Roger Pau Monné wrote:
>>>>>>> I get the feeling this is just papering over an existing issue instead
>>>>>>> of actually fixing it: IOMMU page tables need to be properly freed
>>>>>>> during early failure.
>>>>>>
>>>>>> I take a different perspective: IOMMU page tables shouldn't
>>>>>> get created (yet) at all in the course of
>>>>>> XEN_DOMCTL_createdomain - this op is supposed to produce an
>>>>>> empty container for a VM.
>>>>>
>>>>> The same would apply for CPU page-tables then, yet they seem to be
>>>>> created and populating them (ie: adding the lapic access page) doesn't
>>>>> leak such entries, which points at an asymmetry. Either we setup both
>>>>> tables and handle freeing them properly, or we set none of them.
>>>>
>>>> Where would CPU page tables get created from at this early stage?
>>>
>>> When mapping the APIC page in the P2M. I don't think you can get away
>>> with removing it completely.
>>
>> It doesn't need putting in the p2m this early. It would be quite
>> fine to defer this until e.g. the first vCPU gets created.
> 
> It feels wrong to me to setup a per-domain mapping when initializing the 
> first vCPU.

Then we could do it even later. Any time up to when the domain
would first get unpaused would do.

Jan
Jan Beulich Feb. 10, 2021, 1:12 p.m. UTC | #14
On 10.02.2021 12:54, Roger Pau Monné wrote:
> On Wed, Feb 10, 2021 at 11:48:40AM +0000, Julien Grall wrote:
>> It feels wrong to me to set up a per-domain mapping when initializing the
>> first vCPU.
>>
>> But, I was under the impression that there is a plan to remove
>> XEN_DOMCTL_max_vcpus. So it would only buy a bit of time...
> 
> I was also under that impression. We could set up the lapic access page
> at vlapic_init for the BSP (which is part of XEN_DOMCTL_max_vcpus
> ATM).
> 
> But then I think there should be some kind of check to prevent
> populating either the CPU or the IOMMU page tables at the domain creation
> hypercall, and so the logic to free CPU page tables on failure could
> be removed.

I can spot paging_final_teardown() on an error path there, but I
don't suppose that's what you mean? I guess I'm not looking in
the right place (there are quite a few after all) ...

Jan
Roger Pau Monné Feb. 10, 2021, 3:24 p.m. UTC | #15
On Wed, Feb 10, 2021 at 02:12:38PM +0100, Jan Beulich wrote:
> On 10.02.2021 12:54, Roger Pau Monné wrote:
> > On Wed, Feb 10, 2021 at 11:48:40AM +0000, Julien Grall wrote:
> >> It feels wrong to me to set up a per-domain mapping when initializing the
> >> first vCPU.
> >>
> >> But, I was under the impression that there is a plan to remove
> >> XEN_DOMCTL_max_vcpus. So it would only buy a bit of time...
> > 
> > I was also under that impression. We could set up the lapic access page
> > at vlapic_init for the BSP (which is part of XEN_DOMCTL_max_vcpus
> > ATM).
> > 
> > But then I think there should be some kind of check to prevent
> > populating either the CPU or the IOMMU page tables at the domain creation
> > hypercall, and so the logic to free CPU page tables on failure could
> > be removed.
> 
> I can spot paging_final_teardown() on an error path there, but I
> don't suppose that's what you mean? I guess I'm not looking in
> the right place (there are quite a few after all) ...

Well, I assume some freeing of the EPT page tables must happen on
error paths, or else we would be leaking them like IOMMU tables are
leaked currently?

Maybe I'm not correctly understanding the issue here.

Roger.
Jan Beulich Feb. 10, 2021, 3:53 p.m. UTC | #16
On 10.02.2021 16:24, Roger Pau Monné wrote:
> On Wed, Feb 10, 2021 at 02:12:38PM +0100, Jan Beulich wrote:
>> On 10.02.2021 12:54, Roger Pau Monné wrote:
>>> On Wed, Feb 10, 2021 at 11:48:40AM +0000, Julien Grall wrote:
>>>> It feels wrong to me to set up a per-domain mapping when initializing the
>>>> first vCPU.
>>>>
>>>> But, I was under the impression that there is a plan to remove
>>>> XEN_DOMCTL_max_vcpus. So it would only buy a bit of time...
>>>
>>> I was also under that impression. We could set up the lapic access page
>>> at vlapic_init for the BSP (which is part of XEN_DOMCTL_max_vcpus
>>> ATM).
>>>
>>> But then I think there should be some kind of check to prevent
>>> populating either the CPU or the IOMMU page tables at the domain creation
>>> hypercall, and so the logic to free CPU page tables on failure could
>>> be removed.
>>
>> I can spot paging_final_teardown() on an error path there, but I
>> don't suppose that's what you mean? I guess I'm not looking in
>> the right place (there are quite a few after all) ...
> 
> Well, I assume some freeing of the EPT page tables must happen on
> error paths, or else we would be leaking them like IOMMU tables are
> leaked currently?

Well, you can't eliminate paging_final_teardown() from that
error path because it frees internal structures. It _also_
sets HAP's / shadow's allocation to zero, so has the side
effect of freeing what may have been CPU page tables.

Jan
Julien Grall Feb. 15, 2021, 11:38 a.m. UTC | #17
Hi Jan,

On 10/02/2021 11:26, Jan Beulich wrote:
> On 09.02.2021 16:28, Julien Grall wrote:
>> From: Julien Grall <jgrall@amazon.com>
>>
>> Currently, the IOMMU page-tables will be populated early in the domain
>> creation if the hardware is able to virtualize the local APIC. However,
>> the IOMMU page tables will not be freed during early failure and will
>> result in a leak.
>>
>> An assigned device should not need to DMA into the vLAPIC page, so we
>> can avoid mapping the page in the IOMMU page-tables.
> 
> Here and below, may I ask that you use the correct term "APIC
> access page", as there are other pages involved in vLAPIC
> handling (in particular the virtual APIC page, which is where
> accesses that translate to the APIC access page in EPT actually
> go).
> 
>> This statement is also true for any special pages (the vLAPIC page is
>> one of them). So take the opportunity to prevent the mapping for all
>> of them.
> 
> I probably should have realized this earlier, but there is a
> downside to this: A guest wanting to core dump itself may want
> to dump e.g. shared info and vcpu info pages. Hence ...
> 
>> --- a/xen/include/asm-x86/p2m.h
>> +++ b/xen/include/asm-x86/p2m.h
>> @@ -919,6 +919,10 @@ static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt, mfn_t mfn)
>>   {
>>       unsigned int flags;
>>   
>> +    /* Don't map special pages in the IOMMU page-tables. */
>> +    if ( mfn_valid(mfn) && is_special_page(mfn_to_page(mfn)) )
>> +        return 0;
> 
> ... instead of is_special_page() I think you want to limit the
> check here to seeing whether PGC_extra is set.
> 
> But as said on irc, since this crude way of setting up the APIC
> access page is now firmly a problem, I intend to try to redo it.

Given this series needs to go in 4.15 (we would introduce a 0-day 
otherwise), could you clarify whether your patch [1] is intended to 
replace this one in 4.15?

Cheers,

[1] <1b6a411b-84e7-bfb1-647e-511a13df838c@suse.com>
Jan Beulich Feb. 15, 2021, 12:36 p.m. UTC | #18
On 15.02.2021 12:38, Julien Grall wrote:
> On 10/02/2021 11:26, Jan Beulich wrote:
>> On 09.02.2021 16:28, Julien Grall wrote:
>>> From: Julien Grall <jgrall@amazon.com>
>>>
>>> Currently, the IOMMU page-tables will be populated early in the domain
>>> creation if the hardware is able to virtualize the local APIC. However,
>>> the IOMMU page tables will not be freed during early failure and will
>>> result in a leak.
>>>
>>> An assigned device should not need to DMA into the vLAPIC page, so we
>>> can avoid mapping the page in the IOMMU page-tables.
>>
>> Here and below, may I ask that you use the correct term "APIC
>> access page", as there are other pages involved in vLAPIC
>> handling (in particular the virtual APIC page, which is where
>> accesses that translate to the APIC access page in EPT actually
>> go).
>>
>>> This statement is also true for any special pages (the vLAPIC page is
>>> one of them). So take the opportunity to prevent the mapping for all
>>> of them.
>>
>> I probably should have realized this earlier, but there is a
>> downside to this: A guest wanting to core dump itself may want
>> to dump e.g. shared info and vcpu info pages. Hence ...
>>
>>> --- a/xen/include/asm-x86/p2m.h
>>> +++ b/xen/include/asm-x86/p2m.h
>>> @@ -919,6 +919,10 @@ static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt, mfn_t mfn)
>>>   {
>>>       unsigned int flags;
>>>   
>>> +    /* Don't map special pages in the IOMMU page-tables. */
>>> +    if ( mfn_valid(mfn) && is_special_page(mfn_to_page(mfn)) )
>>> +        return 0;
>>
>> ... instead of is_special_page() I think you want to limit the
>> check here to seeing whether PGC_extra is set.
>>
>> But as said on irc, since this crude way of setting up the APIC
>> access page is now firmly a problem, I intend to try to redo it.
> 
> Given this series needs to go in 4.15 (we would introduce a 0-day 
> otherwise), could you clarify whether your patch [1] is intended to 
> replace this one in 4.15?

Yes, that or a cut down variant (simply moving the invocation of
set_mmio_p2m_entry()), all the more so given that the controversy there
continued regarding the adjustment to p2m_get_iommu_flags(). I
did indicate there that I've dropped it for v2.

> [1] <1b6a411b-84e7-bfb1-647e-511a13df838c@suse.com>

Given the context I was able to guess what mail you refer to, but
I would very much like to ask you and anyone else to provide links
rather than mail IDs as references. Not every mail UI allows
looking up by ID.

Jan
Julien Grall Feb. 15, 2021, 12:54 p.m. UTC | #19
Hi Jan,

On 15/02/2021 12:36, Jan Beulich wrote:
> On 15.02.2021 12:38, Julien Grall wrote:
>> On 10/02/2021 11:26, Jan Beulich wrote:
>>> On 09.02.2021 16:28, Julien Grall wrote:
>>>> From: Julien Grall <jgrall@amazon.com>
>>>>
>>>> Currently, the IOMMU page-tables will be populated early in the domain
>>>> creation if the hardware is able to virtualize the local APIC. However,
>>>> the IOMMU page tables will not be freed during early failure and will
>>>> result in a leak.
>>>>
>>>> An assigned device should not need to DMA into the vLAPIC page, so we
>>>> can avoid mapping the page in the IOMMU page-tables.
>>>
>>> Here and below, may I ask that you use the correct term "APIC
>>> access page", as there are other pages involved in vLAPIC
>>> handling (in particular the virtual APIC page, which is where
>>> accesses that translate to the APIC access page in EPT actually
>>> go).
>>>
>>>> This statement is also true for any special pages (the vLAPIC page is
>>>> one of them). So take the opportunity to prevent the mapping for all
>>>> of them.
>>>
>>> I probably should have realized this earlier, but there is a
>>> downside to this: A guest wanting to core dump itself may want
>>> to dump e.g. shared info and vcpu info pages. Hence ...
>>>
>>>> --- a/xen/include/asm-x86/p2m.h
>>>> +++ b/xen/include/asm-x86/p2m.h
>>>> @@ -919,6 +919,10 @@ static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt, mfn_t mfn)
>>>>    {
>>>>        unsigned int flags;
>>>>    
>>>> +    /* Don't map special pages in the IOMMU page-tables. */
>>>> +    if ( mfn_valid(mfn) && is_special_page(mfn_to_page(mfn)) )
>>>> +        return 0;
>>>
>>> ... instead of is_special_page() I think you want to limit the
>>> check here to seeing whether PGC_extra is set.
>>>
>>> But as said on irc, since this crude way of setting up the APIC
>>> access page is now firmly a problem, I intend to try to redo it.
>>
>> Given this series needs to go in 4.15 (we would introduce a 0-day
>> otherwise), could you clarify whether your patch [1] is intended to
>> replace this one in 4.15?
> 
> Yes, that or a cut down variant (simply moving the invocation of
> set_mmio_p2m_entry()), all the more so given that the controversy there
> continued regarding the adjustment to p2m_get_iommu_flags(). I
> did indicate there that I've dropped it for v2.

Do you have a link to v2? I would like to try with my series.

> 
>> [1] <1b6a411b-84e7-bfb1-647e-511a13df838c@suse.com>
> 
> Given the context I was able to guess what mail you refer to, but
> I would very much like to ask you and anyone else to provide links
> rather than mail IDs as references. Not every mail UI allows
> looking up by ID.
It is pretty trivial nowadays to search for a message by ID online:

https://lore.kernel.org/xen-devel/<message-id>

Cheers,
Jan Beulich Feb. 15, 2021, 1:14 p.m. UTC | #20
On 15.02.2021 13:54, Julien Grall wrote:
> On 15/02/2021 12:36, Jan Beulich wrote:
>> On 15.02.2021 12:38, Julien Grall wrote:
>>> Given this series needs to go in 4.15 (we would introduce a 0-day
>>> otherwise), could you clarify whether your patch [1] is intended to
>>> replace this one in 4.15?
>>
>> Yes, that or a cut down variant (simply moving the invocation of
>> set_mmio_p2m_entry()), all the more so given that the controversy there
>> continued regarding the adjustment to p2m_get_iommu_flags(). I
>> did indicate there that I've dropped it for v2.
> 
> Do you have a link to v2? I would like to try with my series.

I didn't post it yet, as I didn't consider the v1 discussion
settled so far. The intermediate version I have at present is
below.

Jan

VMX: use a single, global APIC access page

The address of this page is used by the CPU only to recognize when to
access the virtual APIC page instead. No accesses would ever go to this
page. It only needs to be present in the (CPU) page tables so that
address translation will produce its address as result for respective
accesses.

By making this page global, we also eliminate the need to refcount it,
or to assign it to any domain in the first place.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: Avoid insertion when !has_vlapic(). Split off change to
    p2m_get_iommu_flags().
---
Hooking p2m insertion onto arch_domain_creation_finished() isn't very
nice, but I couldn't find any better hook (nor a good place to
introduce a new one). In particular there look to be no hvm_funcs hooks
being used on a domain-wide basis (except for init/destroy of course).
I did consider connecting this to the setting of HVM_PARAM_IDENT_PT, but
considered this no better, especially as the tool stack could be smarter
and avoid setting that param when not needed.

I did further consider not allocating any real page at all, but just
using the address of some unpopulated space (which would require
announcing this page as reserved to Dom0, so it wouldn't put any PCI
MMIO BARs there). But I thought this would be too controversial, because
of the possible risks associated with this.

--- unstable.orig/xen/arch/x86/domain.c
+++ unstable/xen/arch/x86/domain.c
@@ -1005,6 +1005,8 @@ int arch_domain_soft_reset(struct domain
 
 void arch_domain_creation_finished(struct domain *d)
 {
+    if ( is_hvm_domain(d) )
+        hvm_domain_creation_finished(d);
 }
 
 #define xen_vcpu_guest_context vcpu_guest_context
--- unstable.orig/xen/arch/x86/hvm/vmx/vmx.c
+++ unstable/xen/arch/x86/hvm/vmx/vmx.c
@@ -66,8 +66,7 @@ boolean_param("force-ept", opt_force_ept
 static void vmx_ctxt_switch_from(struct vcpu *v);
 static void vmx_ctxt_switch_to(struct vcpu *v);
 
-static int  vmx_alloc_vlapic_mapping(struct domain *d);
-static void vmx_free_vlapic_mapping(struct domain *d);
+static int alloc_vlapic_mapping(void);
 static void vmx_install_vlapic_mapping(struct vcpu *v);
 static void vmx_update_guest_cr(struct vcpu *v, unsigned int cr,
                                 unsigned int flags);
@@ -78,6 +77,8 @@ static int vmx_msr_read_intercept(unsign
 static int vmx_msr_write_intercept(unsigned int msr, uint64_t msr_content);
 static void vmx_invlpg(struct vcpu *v, unsigned long linear);
 
+static mfn_t __read_mostly apic_access_mfn;
+
 /* Values for domain's ->arch.hvm_domain.pi_ops.flags. */
 #define PI_CSW_FROM (1u << 0)
 #define PI_CSW_TO   (1u << 1)
@@ -401,7 +402,6 @@ static int vmx_domain_initialise(struct
         .to   = vmx_ctxt_switch_to,
         .tail = vmx_do_resume,
     };
-    int rc;
 
     d->arch.ctxt_switch = &csw;
 
@@ -411,21 +411,16 @@ static int vmx_domain_initialise(struct
      */
     d->arch.hvm.vmx.exec_sp = is_hardware_domain(d) || opt_ept_exec_sp;
 
-    if ( !has_vlapic(d) )
-        return 0;
-
-    if ( (rc = vmx_alloc_vlapic_mapping(d)) != 0 )
-        return rc;
-
     return 0;
 }
 
-static void vmx_domain_relinquish_resources(struct domain *d)
+static void domain_creation_finished(struct domain *d)
 {
-    if ( !has_vlapic(d) )
-        return;
 
-    vmx_free_vlapic_mapping(d);
+    if ( has_vlapic(d) && !mfn_eq(apic_access_mfn, _mfn(0)) &&
+         set_mmio_p2m_entry(d, gaddr_to_gfn(APIC_DEFAULT_PHYS_BASE),
+                            apic_access_mfn, PAGE_ORDER_4K) )
+        domain_crash(d);
 }
 
 static int vmx_vcpu_initialise(struct vcpu *v)
@@ -2264,7 +2259,7 @@ static struct hvm_function_table __initd
     .cpu_up_prepare       = vmx_cpu_up_prepare,
     .cpu_dead             = vmx_cpu_dead,
     .domain_initialise    = vmx_domain_initialise,
-    .domain_relinquish_resources = vmx_domain_relinquish_resources,
+    .domain_creation_finished = domain_creation_finished,
     .vcpu_initialise      = vmx_vcpu_initialise,
     .vcpu_destroy         = vmx_vcpu_destroy,
     .save_cpu_ctxt        = vmx_save_vmcs_ctxt,
@@ -2503,7 +2498,7 @@ const struct hvm_function_table * __init
 {
     set_in_cr4(X86_CR4_VMXE);
 
-    if ( vmx_vmcs_init() )
+    if ( vmx_vmcs_init() || alloc_vlapic_mapping() )
     {
         printk("VMX: failed to initialise.\n");
         return NULL;
@@ -3064,7 +3059,7 @@ gp_fault:
     return X86EMUL_EXCEPTION;
 }
 
-static int vmx_alloc_vlapic_mapping(struct domain *d)
+static int __init alloc_vlapic_mapping(void)
 {
     struct page_info *pg;
     mfn_t mfn;
@@ -3072,53 +3067,28 @@ static int vmx_alloc_vlapic_mapping(stru
     if ( !cpu_has_vmx_virtualize_apic_accesses )
         return 0;
 
-    pg = alloc_domheap_page(d, MEMF_no_refcount);
+    pg = alloc_domheap_page(NULL, 0);
     if ( !pg )
         return -ENOMEM;
 
-    if ( !get_page_and_type(pg, d, PGT_writable_page) )
-    {
-        /*
-         * The domain can't possibly know about this page yet, so failure
-         * here is a clear indication of something fishy going on.
-         */
-        domain_crash(d);
-        return -ENODATA;
-    }
-
     mfn = page_to_mfn(pg);
     clear_domain_page(mfn);
-    d->arch.hvm.vmx.apic_access_mfn = mfn;
+    apic_access_mfn = mfn;
 
-    return set_mmio_p2m_entry(d, gaddr_to_gfn(APIC_DEFAULT_PHYS_BASE), mfn,
-                              PAGE_ORDER_4K);
-}
-
-static void vmx_free_vlapic_mapping(struct domain *d)
-{
-    mfn_t mfn = d->arch.hvm.vmx.apic_access_mfn;
-
-    d->arch.hvm.vmx.apic_access_mfn = _mfn(0);
-    if ( !mfn_eq(mfn, _mfn(0)) )
-    {
-        struct page_info *pg = mfn_to_page(mfn);
-
-        put_page_alloc_ref(pg);
-        put_page_and_type(pg);
-    }
+    return 0;
 }
 
 static void vmx_install_vlapic_mapping(struct vcpu *v)
 {
     paddr_t virt_page_ma, apic_page_ma;
 
-    if ( mfn_eq(v->domain->arch.hvm.vmx.apic_access_mfn, _mfn(0)) )
+    if ( !has_vlapic(v->domain) || mfn_eq(apic_access_mfn, _mfn(0)) )
         return;
 
     ASSERT(cpu_has_vmx_virtualize_apic_accesses);
 
     virt_page_ma = page_to_maddr(vcpu_vlapic(v)->regs_page);
-    apic_page_ma = mfn_to_maddr(v->domain->arch.hvm.vmx.apic_access_mfn);
+    apic_page_ma = mfn_to_maddr(apic_access_mfn);
 
     vmx_vmcs_enter(v);
     __vmwrite(VIRTUAL_APIC_PAGE_ADDR, virt_page_ma);
--- unstable.orig/xen/include/asm-x86/hvm/hvm.h
+++ unstable/xen/include/asm-x86/hvm/hvm.h
@@ -106,6 +106,7 @@ struct hvm_function_table {
      * Initialise/destroy HVM domain/vcpu resources
      */
     int  (*domain_initialise)(struct domain *d);
+    void (*domain_creation_finished)(struct domain *d);
     void (*domain_relinquish_resources)(struct domain *d);
     void (*domain_destroy)(struct domain *d);
     int  (*vcpu_initialise)(struct vcpu *v);
@@ -383,6 +384,12 @@ static inline bool hvm_has_set_descripto
     return hvm_funcs.set_descriptor_access_exiting;
 }
 
+static inline void hvm_domain_creation_finished(struct domain *d)
+{
+    if ( hvm_funcs.domain_creation_finished )
+        alternative_vcall(hvm_funcs.domain_creation_finished, d);
+}
+
 static inline int
 hvm_guest_x86_mode(struct vcpu *v)
 {
@@ -715,6 +722,11 @@ static inline void hvm_invlpg(const stru
 {
     ASSERT_UNREACHABLE();
 }
+
+static inline void hvm_domain_creation_finished(struct domain *d)
+{
+    ASSERT_UNREACHABLE();
+}
 
 /*
  * Shadow code needs further cleanup to eliminate some HVM-only paths. For
--- unstable.orig/xen/include/asm-x86/hvm/vmx/vmcs.h
+++ unstable/xen/include/asm-x86/hvm/vmx/vmcs.h
@@ -58,7 +58,6 @@ struct ept_data {
 #define _VMX_DOMAIN_PML_ENABLED    0
 #define VMX_DOMAIN_PML_ENABLED     (1ul << _VMX_DOMAIN_PML_ENABLED)
 struct vmx_domain {
-    mfn_t apic_access_mfn;
     /* VMX_DOMAIN_* */
     unsigned int status;
Julien Grall Feb. 17, 2021, 11:21 a.m. UTC | #21
On 15/02/2021 13:14, Jan Beulich wrote:
> On 15.02.2021 13:54, Julien Grall wrote:
>> On 15/02/2021 12:36, Jan Beulich wrote:
>>> On 15.02.2021 12:38, Julien Grall wrote:
>>>> Given this series needs to go in 4.15 (we would introduce a 0-day
>>>> otherwise), could you clarify whether your patch [1] is intended to
>>>> replace this one in 4.15?
>>>
>>> Yes, that or a cut down variant (simply moving the invocation of
>>> set_mmio_p2m_entry()). The more that there the controversy
>>> continued regarding the adjustment to p2m_get_iommu_flags(). I
>>> did indicate there that I've dropped it for v2.
>>
>> Do you have a link to v2? I would like to try with my series.
> 
> I didn't post it yet, as I didn't consider the v1 discussion
> settled so far. The intermediate version I have at present is
> below.

Thanks! I will drop patch #1 from my series and resend it.

Cheers,

Patch

diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
index 7d63f5787e62..1802545969b3 100644
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -919,6 +919,10 @@  static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt, mfn_t mfn)
 {
     unsigned int flags;
 
+    /* Don't map special pages in the IOMMU page-tables. */
+    if ( mfn_valid(mfn) && is_special_page(mfn_to_page(mfn)) )
+        return 0;
+
     switch( p2mt )
     {
     case p2m_ram_rw: