Message ID | 20210209152816.15792-2-julien@xen.org (mailing list archive)
---|---
State | Superseded
Series | xen/iommu: Collection of bug fixes for IOMMU teardown
> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Julien Grall
> Sent: 09 February 2021 15:28
> To: xen-devel@lists.xenproject.org
> Cc: hongyxia@amazon.co.uk; iwj@xenproject.org; Julien Grall <jgrall@amazon.com>;
> Jan Beulich <jbeulich@suse.com>; Andrew Cooper <andrew.cooper3@citrix.com>;
> Roger Pau Monné <roger.pau@citrix.com>; Wei Liu <wl@xen.org>
> Subject: [for-4.15][PATCH v2 1/5] xen/x86: p2m: Don't map the special pages
> in the IOMMU page-tables
>
> From: Julien Grall <jgrall@amazon.com>
>
> Currently, the IOMMU page-tables will be populated early in the domain
> creation if the hardware is able to virtualize the local APIC. However,
> the IOMMU page tables will not be freed during early failure and will
> result in a leak.
>
> An assigned device should not need to DMA into the vLAPIC page, so we
> can avoid mapping the page in the IOMMU page-tables.
>
> This statement is also true for any special pages (the vLAPIC page is
> one of them). So take the opportunity to prevent the mapping for all
> of them.
>
> Note that:
> - This matches the existing behavior with PV guests.
> - This doesn't change the behavior when the P2M is shared with the
>   IOMMU. IOW, the special pages will still be accessible by the
>   device.
>
> Suggested-by: Jan Beulich <jbeulich@suse.com>
> Signed-off-by: Julien Grall <jgrall@amazon.com>

Reviewed-by: Paul Durrant <paul@xen.org>
On Tue, Feb 09, 2021 at 03:28:12PM +0000, Julien Grall wrote:
> From: Julien Grall <jgrall@amazon.com>
>
> Currently, the IOMMU page-tables will be populated early in the domain
> creation if the hardware is able to virtualize the local APIC. However,
> the IOMMU page tables will not be freed during early failure and will
> result in a leak.
>
> An assigned device should not need to DMA into the vLAPIC page, so we
> can avoid mapping the page in the IOMMU page-tables.
>
> This statement is also true for any special pages (the vLAPIC page is
> one of them). So take the opportunity to prevent the mapping for all
> of them.

Hm, OK, while I assume it's likely for special pages not to be the
target of DMA operations, it's not easy to spot which pages are
special.

> Note that:
> - This matches the existing behavior with PV guests.

You might make HVM guests not sharing page-tables 'match' PV behavior,
but you are making behavior between HVM guests themselves diverge.

> - This doesn't change the behavior when the P2M is shared with the
>   IOMMU. IOW, the special pages will still be accessible by the
>   device.

I have to admit I don't like this part at all. Having diverging device
mappings depending on whether the page tables are shared or not is bad
IMO, as there might be subtle bugs affecting one of the two modes.

I get the feeling this is just papering over an existing issue instead
of actually fixing it: IOMMU page tables need to be properly freed
during early failure.

> Suggested-by: Jan Beulich <jbeulich@suse.com>
> Signed-off-by: Julien Grall <jgrall@amazon.com>
>
> ---
> Changes in v2:
>     - New patch
> ---
>  xen/include/asm-x86/p2m.h | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
> index 7d63f5787e62..1802545969b3 100644
> --- a/xen/include/asm-x86/p2m.h
> +++ b/xen/include/asm-x86/p2m.h
> @@ -919,6 +919,10 @@ static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt, mfn_t mfn)
>  {
>      unsigned int flags;
>
> +    /* Don't map special pages in the IOMMU page-tables. */

I think this should explain why special pages don't require IOMMU
mappings, or even just note that special pages cannot be added to the
IOMMU page tables due to failure to free them afterwards and that this
is a bodge for it.

Thanks, Roger.
Hi Roger,

On 10/02/2021 08:29, Roger Pau Monné wrote:
> On Tue, Feb 09, 2021 at 03:28:12PM +0000, Julien Grall wrote:
>> From: Julien Grall <jgrall@amazon.com>
>>
>> Currently, the IOMMU page-tables will be populated early in the domain
>> creation if the hardware is able to virtualize the local APIC. However,
>> the IOMMU page tables will not be freed during early failure and will
>> result in a leak.
>>
>> An assigned device should not need to DMA into the vLAPIC page, so we
>> can avoid mapping the page in the IOMMU page-tables.
>>
>> This statement is also true for any special pages (the vLAPIC page is
>> one of them). So take the opportunity to prevent the mapping for all
>> of them.
>
> Hm, OK, while I assume it's likely for special pages not to be the
> target of DMA operations, it's not easy to spot which pages are
> special.

Special pages are allocated by Xen for the grant table, vCPU info...

>> Note that:
>> - This matches the existing behavior with PV guests.
>
> You might make HVM guests not sharing page-tables 'match' PV behavior,
> but you are making behavior between HVM guests themselves diverge.
>
>> - This doesn't change the behavior when the P2M is shared with the
>>   IOMMU. IOW, the special pages will still be accessible by the
>>   device.
>
> I have to admit I don't like this part at all. Having diverging device
> mappings depending on whether the page tables are shared or not is bad
> IMO, as there might be subtle bugs affecting one of the two modes.
>
> I get the feeling this is just papering over an existing issue instead
> of actually fixing it: IOMMU page tables need to be properly freed
> during early failure.

My initial approach was to free the IOMMU page tables during early
failure (see [1] and [2]). But Jan said the special pages should really
not be mapped in the IOMMU ([3]) and he wasn't very happy with freeing
the IOMMU page tables for early failure.

I don't particularly care about the approach as long as we don't leak
IOMMU page-tables at the end.

So please try to find a common ground with Jan here.

Cheers,

[1] <20201222154338.9459-1-julien@xen.org>
[2] <20201222154338.9459-5-julien@xen.org>
On 10/02/2021 08:50, Julien Grall wrote:
> Hi Roger,
>
> On 10/02/2021 08:29, Roger Pau Monné wrote:
>> On Tue, Feb 09, 2021 at 03:28:12PM +0000, Julien Grall wrote:
>>> From: Julien Grall <jgrall@amazon.com>
>>>
>>> Currently, the IOMMU page-tables will be populated early in the domain
>>> creation if the hardware is able to virtualize the local APIC. However,
>>> the IOMMU page tables will not be freed during early failure and will
>>> result in a leak.
>>>
>>> An assigned device should not need to DMA into the vLAPIC page, so we
>>> can avoid mapping the page in the IOMMU page-tables.
>>>
>>> This statement is also true for any special pages (the vLAPIC page is
>>> one of them). So take the opportunity to prevent the mapping for all
>>> of them.
>>
>> Hm, OK, while I assume it's likely for special pages not to be the
>> target of DMA operations, it's not easy to spot which pages are
>> special.
>
> Special pages are allocated by Xen for the grant table, vCPU info...
>
>>> Note that:
>>> - This matches the existing behavior with PV guests.
>>
>> You might make HVM guests not sharing page-tables 'match' PV behavior,
>> but you are making behavior between HVM guests themselves diverge.
>>
>>> - This doesn't change the behavior when the P2M is shared with the
>>>   IOMMU. IOW, the special pages will still be accessible by the
>>>   device.
>>
>> I have to admit I don't like this part at all. Having diverging device
>> mappings depending on whether the page tables are shared or not is bad
>> IMO, as there might be subtle bugs affecting one of the two modes.
>>
>> I get the feeling this is just papering over an existing issue instead
>> of actually fixing it: IOMMU page tables need to be properly freed
>> during early failure.
>
> My initial approach was to free the IOMMU page tables during early
> failure (see [1] and [2]). But Jan said the special pages should really
> not be mapped in the IOMMU ([3]) and he wasn't very happy with freeing
> the IOMMU page tables for early failure.
>
> I don't particularly care about the approach as long as we don't leak
> IOMMU page-tables at the end.
>
> So please try to find a common ground with Jan here.
>
> Cheers,
>
> [1] <20201222154338.9459-1-julien@xen.org>
> [2] <20201222154338.9459-5-julien@xen.org>

Roger pointed out on IRC that I forgot to add a link for [3]. So here
we go:

[3] <a22f7364-518f-ea5f-3b87-5c0462cfc193@suse.com>

Cheers,
On 10.02.2021 09:29, Roger Pau Monné wrote:
> On Tue, Feb 09, 2021 at 03:28:12PM +0000, Julien Grall wrote:
>> From: Julien Grall <jgrall@amazon.com>
>>
>> Currently, the IOMMU page-tables will be populated early in the domain
>> creation if the hardware is able to virtualize the local APIC. However,
>> the IOMMU page tables will not be freed during early failure and will
>> result in a leak.
>>
>> An assigned device should not need to DMA into the vLAPIC page, so we
>> can avoid mapping the page in the IOMMU page-tables.
>>
>> This statement is also true for any special pages (the vLAPIC page is
>> one of them). So take the opportunity to prevent the mapping for all
>> of them.
>
> Hm, OK, while I assume it's likely for special pages not to be the
> target of DMA operations, it's not easy to spot which pages are
> special.
>
>> Note that:
>> - This matches the existing behavior with PV guests.
>
> You might make HVM guests not sharing page-tables 'match' PV behavior,
> but you are making behavior between HVM guests themselves diverge.
>
>> - This doesn't change the behavior when the P2M is shared with the
>>   IOMMU. IOW, the special pages will still be accessible by the
>>   device.
>
> I have to admit I don't like this part at all. Having diverging device
> mappings depending on whether the page tables are shared or not is bad
> IMO, as there might be subtle bugs affecting one of the two modes.

This is one way to look at things, yes. But if you take the other
perspective - that special pages shouldn't be IOMMU-mapped - then the
divergence is the price to pay for being able to share pages (and it's
not Julien introducing bad behavior here).

Additionally, it may be possible to utilize the divergence to our
advantage: if one way of setting up things works and the other doesn't,
we have a reasonable clue where to look. In fact the aspect above may,
together with possible future findings, end up being a reason to not
default to, or even disallow (like for AMD), page table sharing.

> I get the feeling this is just papering over an existing issue instead
> of actually fixing it: IOMMU page tables need to be properly freed
> during early failure.

I take a different perspective: IOMMU page tables shouldn't get created
(yet) at all in the course of XEN_DOMCTL_createdomain - this op is
supposed to produce an empty container for a VM.

Jan
On 09.02.2021 16:28, Julien Grall wrote:
> From: Julien Grall <jgrall@amazon.com>
>
> Currently, the IOMMU page-tables will be populated early in the domain
> creation if the hardware is able to virtualize the local APIC. However,
> the IOMMU page tables will not be freed during early failure and will
> result in a leak.
>
> An assigned device should not need to DMA into the vLAPIC page, so we
> can avoid mapping the page in the IOMMU page-tables.

Here and below, may I ask that you use the correct term "APIC access
page", as there are other pages involved in vLAPIC handling (in
particular the virtual APIC page, which is where accesses that
translate to the APIC access page in EPT actually go).

> This statement is also true for any special pages (the vLAPIC page is
> one of them). So take the opportunity to prevent the mapping for all
> of them.

I probably should have realized this earlier, but there is a downside
to this: a guest wanting to core dump itself may want to dump e.g. the
shared info and vcpu info pages. Hence ...

> --- a/xen/include/asm-x86/p2m.h
> +++ b/xen/include/asm-x86/p2m.h
> @@ -919,6 +919,10 @@ static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt, mfn_t mfn)
>  {
>      unsigned int flags;
>
> +    /* Don't map special pages in the IOMMU page-tables. */
> +    if ( mfn_valid(mfn) && is_special_page(mfn_to_page(mfn)) )
> +        return 0;

... instead of is_special_page() I think you want to limit the check
here to seeing whether PGC_extra is set.

But as said on irc, since this crude way of setting up the APIC access
page is now firmly a problem, I intend to try to redo it. I can't tell
yet whether this will still take a PGC_extra page of some form (nor if
my plans will work out in the first place). Irrespective of this I
think we indeed want to exclude PGC_extra pages from getting mapped.
However, the APIC access page is, I think, an outlier here - we
wouldn't insert other PGC_extra pages into the p2m, so for all other
cases the above addition would be effectively dead code.

Jan
On Wed, Feb 10, 2021 at 12:10:09PM +0100, Jan Beulich wrote:
> On 10.02.2021 09:29, Roger Pau Monné wrote:
> > On Tue, Feb 09, 2021 at 03:28:12PM +0000, Julien Grall wrote:
> >> From: Julien Grall <jgrall@amazon.com>
> >>
> >> Currently, the IOMMU page-tables will be populated early in the domain
> >> creation if the hardware is able to virtualize the local APIC. However,
> >> the IOMMU page tables will not be freed during early failure and will
> >> result in a leak.
> >>
> >> An assigned device should not need to DMA into the vLAPIC page, so we
> >> can avoid mapping the page in the IOMMU page-tables.
> >>
> >> This statement is also true for any special pages (the vLAPIC page is
> >> one of them). So take the opportunity to prevent the mapping for all
> >> of them.
> >
> > Hm, OK, while I assume it's likely for special pages not to be the
> > target of DMA operations, it's not easy to spot which pages are
> > special.
> >
> >> Note that:
> >> - This matches the existing behavior with PV guests.
> >
> > You might make HVM guests not sharing page-tables 'match' PV behavior,
> > but you are making behavior between HVM guests themselves diverge.
> >
> >> - This doesn't change the behavior when the P2M is shared with the
> >>   IOMMU. IOW, the special pages will still be accessible by the
> >>   device.
> >
> > I have to admit I don't like this part at all. Having diverging device
> > mappings depending on whether the page tables are shared or not is
> > bad IMO, as there might be subtle bugs affecting one of the two
> > modes.
>
> This is one way to look at things, yes. But if you take the
> other perspective that special pages shouldn't be
> IOMMU-mapped, then the divergence is the price to pay for
> being able to share pages (and it's not Julien introducing
> bad behavior here).

Since when sharing we have no option but to make whatever is accessible
to the CPU also available to devices, it would seem reasonable to also
do it when not sharing.

> Additionally it may be possible to utilize the divergence to
> our advantage: If one way of setting up things works and the
> other doesn't, we have a reasonable clue where to look. In
> fact the aspect above may, together with possible future
> findings, end up being a reason to not default to or even
> disallow (like for AMD) page table sharing.

I think such an approach is risky: we likely don't test VT-d without
sharing much (or at all?), now that IOMMUs support the same page sizes
as EPT, hence it's likely for bugs to go unnoticed.

> > I get the feeling this is just papering over an existing issue instead
> > of actually fixing it: IOMMU page tables need to be properly freed
> > during early failure.
>
> I take a different perspective: IOMMU page tables shouldn't
> get created (yet) at all in the course of
> XEN_DOMCTL_createdomain - this op is supposed to produce an
> empty container for a VM.

The same would apply for CPU page-tables then, yet they seem to be
created, and populating them (ie: adding the lapic access page) doesn't
leak such entries, which points at an asymmetry. Either we set up both
tables and handle freeing them properly, or we set up none of them.

Roger.
On 10.02.2021 12:34, Roger Pau Monné wrote:
> On Wed, Feb 10, 2021 at 12:10:09PM +0100, Jan Beulich wrote:
>> On 10.02.2021 09:29, Roger Pau Monné wrote:
>>> I get the feeling this is just papering over an existing issue
>>> instead of actually fixing it: IOMMU page tables need to be
>>> properly freed during early failure.
>>
>> I take a different perspective: IOMMU page tables shouldn't
>> get created (yet) at all in the course of
>> XEN_DOMCTL_createdomain - this op is supposed to produce an
>> empty container for a VM.
>
> The same would apply for CPU page-tables then, yet they seem to be
> created, and populating them (ie: adding the lapic access page)
> doesn't leak such entries, which points at an asymmetry. Either we
> set up both tables and handle freeing them properly, or we set up
> none of them.

Where would CPU page tables get created from at this early stage?
There's no memory assigned to the guest yet, so there's nothing to map
afaics.

Jan
On 10/02/2021 11:38, Jan Beulich wrote:
> On 10.02.2021 12:34, Roger Pau Monné wrote:
>> On Wed, Feb 10, 2021 at 12:10:09PM +0100, Jan Beulich wrote:
>>> On 10.02.2021 09:29, Roger Pau Monné wrote:
>>>> I get the feeling this is just papering over an existing issue
>>>> instead of actually fixing it: IOMMU page tables need to be
>>>> properly freed during early failure.
>>>
>>> I take a different perspective: IOMMU page tables shouldn't
>>> get created (yet) at all in the course of
>>> XEN_DOMCTL_createdomain - this op is supposed to produce an
>>> empty container for a VM.
>>
>> The same would apply for CPU page-tables then, yet they seem to be
>> created, and populating them (ie: adding the lapic access page)
>> doesn't leak such entries, which points at an asymmetry. Either we
>> set up both tables and handle freeing them properly, or we set up
>> none of them.
>
> Where would CPU page tables get created from at this early stage?

When mapping the APIC page in the P2M. I don't think you can get away
with removing it completely.

Cheers,
On 10.02.2021 12:40, Julien Grall wrote:
> On 10/02/2021 11:38, Jan Beulich wrote:
>> On 10.02.2021 12:34, Roger Pau Monné wrote:
>>> On Wed, Feb 10, 2021 at 12:10:09PM +0100, Jan Beulich wrote:
>>>> On 10.02.2021 09:29, Roger Pau Monné wrote:
>>>>> I get the feeling this is just papering over an existing issue
>>>>> instead of actually fixing it: IOMMU page tables need to be
>>>>> properly freed during early failure.
>>>>
>>>> I take a different perspective: IOMMU page tables shouldn't
>>>> get created (yet) at all in the course of
>>>> XEN_DOMCTL_createdomain - this op is supposed to produce an
>>>> empty container for a VM.
>>>
>>> The same would apply for CPU page-tables then, yet they seem to be
>>> created, and populating them (ie: adding the lapic access page)
>>> doesn't leak such entries, which points at an asymmetry. Either we
>>> set up both tables and handle freeing them properly, or we set up
>>> none of them.
>>
>> Where would CPU page tables get created from at this early stage?
>
> When mapping the APIC page in the P2M. I don't think you can get away
> with removing it completely.

It doesn't need putting in the p2m this early. It would be quite fine
to defer this until e.g. the first vCPU gets created.

Jan
On 10/02/2021 11:45, Jan Beulich wrote:
> On 10.02.2021 12:40, Julien Grall wrote:
>> On 10/02/2021 11:38, Jan Beulich wrote:
>>> On 10.02.2021 12:34, Roger Pau Monné wrote:
>>>> On Wed, Feb 10, 2021 at 12:10:09PM +0100, Jan Beulich wrote:
>>>>> On 10.02.2021 09:29, Roger Pau Monné wrote:
>>>>>> I get the feeling this is just papering over an existing issue
>>>>>> instead of actually fixing it: IOMMU page tables need to be
>>>>>> properly freed during early failure.
>>>>>
>>>>> I take a different perspective: IOMMU page tables shouldn't
>>>>> get created (yet) at all in the course of
>>>>> XEN_DOMCTL_createdomain - this op is supposed to produce an
>>>>> empty container for a VM.
>>>>
>>>> The same would apply for CPU page-tables then, yet they seem to be
>>>> created, and populating them (ie: adding the lapic access page)
>>>> doesn't leak such entries, which points at an asymmetry. Either we
>>>> set up both tables and handle freeing them properly, or we set up
>>>> none of them.
>>>
>>> Where would CPU page tables get created from at this early stage?
>>
>> When mapping the APIC page in the P2M. I don't think you can get away
>> with removing it completely.
>
> It doesn't need putting in the p2m this early. It would be quite
> fine to defer this until e.g. the first vCPU gets created.

It feels wrong to me to set up a per-domain mapping when initializing
the first vCPU.

But I was under the impression that there are plans to remove
XEN_DOMCTL_max_vcpus. So it would only buy a bit of time...

Cheers,
On Wed, Feb 10, 2021 at 11:48:40AM +0000, Julien Grall wrote:
> On 10/02/2021 11:45, Jan Beulich wrote:
> > On 10.02.2021 12:40, Julien Grall wrote:
> > > On 10/02/2021 11:38, Jan Beulich wrote:
> > > > On 10.02.2021 12:34, Roger Pau Monné wrote:
> > > > > On Wed, Feb 10, 2021 at 12:10:09PM +0100, Jan Beulich wrote:
> > > > > > On 10.02.2021 09:29, Roger Pau Monné wrote:
> > > > > > > I get the feeling this is just papering over an existing
> > > > > > > issue instead of actually fixing it: IOMMU page tables
> > > > > > > need to be properly freed during early failure.
> > > > > >
> > > > > > I take a different perspective: IOMMU page tables shouldn't
> > > > > > get created (yet) at all in the course of
> > > > > > XEN_DOMCTL_createdomain - this op is supposed to produce an
> > > > > > empty container for a VM.
> > > > >
> > > > > The same would apply for CPU page-tables then, yet they seem
> > > > > to be created, and populating them (ie: adding the lapic
> > > > > access page) doesn't leak such entries, which points at an
> > > > > asymmetry. Either we set up both tables and handle freeing
> > > > > them properly, or we set up none of them.
> > > >
> > > > Where would CPU page tables get created from at this early stage?
> > >
> > > When mapping the APIC page in the P2M. I don't think you can get
> > > away with removing it completely.
> >
> > It doesn't need putting in the p2m this early. It would be quite
> > fine to defer this until e.g. the first vCPU gets created.
>
> It feels wrong to me to set up a per-domain mapping when initializing
> the first vCPU.
>
> But I was under the impression that there are plans to remove
> XEN_DOMCTL_max_vcpus. So it would only buy a bit of time...

I was also under that impression. We could set up the lapic access
page at vlapic_init for the BSP (which is part of XEN_DOMCTL_max_vcpus
ATM).

But then I think there should be some kind of check to prevent
populating either the CPU or the IOMMU page tables at the domain
creation hypercall, and so the logic to free CPU page tables on
failure could be removed.

Roger.
On 10.02.2021 12:48, Julien Grall wrote:
> On 10/02/2021 11:45, Jan Beulich wrote:
>> On 10.02.2021 12:40, Julien Grall wrote:
>>> On 10/02/2021 11:38, Jan Beulich wrote:
>>>> On 10.02.2021 12:34, Roger Pau Monné wrote:
>>>>> On Wed, Feb 10, 2021 at 12:10:09PM +0100, Jan Beulich wrote:
>>>>>> On 10.02.2021 09:29, Roger Pau Monné wrote:
>>>>>>> I get the feeling this is just papering over an existing issue
>>>>>>> instead of actually fixing it: IOMMU page tables need to be
>>>>>>> properly freed during early failure.
>>>>>>
>>>>>> I take a different perspective: IOMMU page tables shouldn't
>>>>>> get created (yet) at all in the course of
>>>>>> XEN_DOMCTL_createdomain - this op is supposed to produce an
>>>>>> empty container for a VM.
>>>>>
>>>>> The same would apply for CPU page-tables then, yet they seem to be
>>>>> created, and populating them (ie: adding the lapic access page)
>>>>> doesn't leak such entries, which points at an asymmetry. Either we
>>>>> set up both tables and handle freeing them properly, or we set up
>>>>> none of them.
>>>>
>>>> Where would CPU page tables get created from at this early stage?
>>>
>>> When mapping the APIC page in the P2M. I don't think you can get away
>>> with removing it completely.
>>
>> It doesn't need putting in the p2m this early. It would be quite
>> fine to defer this until e.g. the first vCPU gets created.
>
> It feels wrong to me to set up a per-domain mapping when initializing
> the first vCPU.

Then we could do it even later. Any time up to when the domain would
first get unpaused would do.

Jan
On 10.02.2021 12:54, Roger Pau Monné wrote:
> On Wed, Feb 10, 2021 at 11:48:40AM +0000, Julien Grall wrote:
>> It feels wrong to me to set up a per-domain mapping when initializing
>> the first vCPU.
>>
>> But I was under the impression that there are plans to remove
>> XEN_DOMCTL_max_vcpus. So it would only buy a bit of time...
>
> I was also under that impression. We could set up the lapic access
> page at vlapic_init for the BSP (which is part of XEN_DOMCTL_max_vcpus
> ATM).
>
> But then I think there should be some kind of check to prevent
> populating either the CPU or the IOMMU page tables at the domain
> creation hypercall, and so the logic to free CPU page tables on
> failure could be removed.

I can spot paging_final_teardown() on an error path there, but I don't
suppose that's what you mean? I guess I'm not looking in the right
place (there are quite a few, after all) ...

Jan
On Wed, Feb 10, 2021 at 02:12:38PM +0100, Jan Beulich wrote:
> On 10.02.2021 12:54, Roger Pau Monné wrote:
> > On Wed, Feb 10, 2021 at 11:48:40AM +0000, Julien Grall wrote:
> >> It feels wrong to me to set up a per-domain mapping when
> >> initializing the first vCPU.
> >>
> >> But I was under the impression that there are plans to remove
> >> XEN_DOMCTL_max_vcpus. So it would only buy a bit of time...
> >
> > I was also under that impression. We could set up the lapic access
> > page at vlapic_init for the BSP (which is part of
> > XEN_DOMCTL_max_vcpus ATM).
> >
> > But then I think there should be some kind of check to prevent
> > populating either the CPU or the IOMMU page tables at the domain
> > creation hypercall, and so the logic to free CPU page tables on
> > failure could be removed.
>
> I can spot paging_final_teardown() on an error path there, but I
> don't suppose that's what you mean? I guess I'm not looking in
> the right place (there are quite a few, after all) ...

Well, I assume some freeing of the EPT page tables must happen on
error paths, or else we would be leaking them like the IOMMU tables
are leaked currently?

Maybe I've not correctly understood the issue here.

Roger.
On 10.02.2021 16:24, Roger Pau Monné wrote:
> On Wed, Feb 10, 2021 at 02:12:38PM +0100, Jan Beulich wrote:
>> On 10.02.2021 12:54, Roger Pau Monné wrote:
>>> On Wed, Feb 10, 2021 at 11:48:40AM +0000, Julien Grall wrote:
>>>> It feels wrong to me to set up a per-domain mapping when
>>>> initializing the first vCPU.
>>>>
>>>> But I was under the impression that there are plans to remove
>>>> XEN_DOMCTL_max_vcpus. So it would only buy a bit of time...
>>>
>>> I was also under that impression. We could set up the lapic access
>>> page at vlapic_init for the BSP (which is part of
>>> XEN_DOMCTL_max_vcpus ATM).
>>>
>>> But then I think there should be some kind of check to prevent
>>> populating either the CPU or the IOMMU page tables at the domain
>>> creation hypercall, and so the logic to free CPU page tables on
>>> failure could be removed.
>>
>> I can spot paging_final_teardown() on an error path there, but I
>> don't suppose that's what you mean? I guess I'm not looking in
>> the right place (there are quite a few, after all) ...
>
> Well, I assume some freeing of the EPT page tables must happen on
> error paths, or else we would be leaking them like the IOMMU tables
> are leaked currently?

Well, you can't eliminate paging_final_teardown() from that error
path, because it frees internal structures. It _also_ sets HAP's /
shadow's allocation to zero, so has the side effect of freeing what
may have been CPU page tables.

Jan
Hi Jan,

On 10/02/2021 11:26, Jan Beulich wrote:
> On 09.02.2021 16:28, Julien Grall wrote:
>> From: Julien Grall <jgrall@amazon.com>
>>
>> Currently, the IOMMU page-tables will be populated early in the domain
>> creation if the hardware is able to virtualize the local APIC. However,
>> the IOMMU page tables will not be freed during early failure and will
>> result in a leak.
>>
>> An assigned device should not need to DMA into the vLAPIC page, so we
>> can avoid mapping the page in the IOMMU page-tables.
>
> Here and below, may I ask that you use the correct term "APIC access
> page", as there are other pages involved in vLAPIC handling (in
> particular the virtual APIC page, which is where accesses that
> translate to the APIC access page in EPT actually go).
>
>> This statement is also true for any special pages (the vLAPIC page is
>> one of them). So take the opportunity to prevent the mapping for all
>> of them.
>
> I probably should have realized this earlier, but there is a downside
> to this: a guest wanting to core dump itself may want to dump e.g. the
> shared info and vcpu info pages. Hence ...
>
>> --- a/xen/include/asm-x86/p2m.h
>> +++ b/xen/include/asm-x86/p2m.h
>> @@ -919,6 +919,10 @@ static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt, mfn_t mfn)
>>  {
>>      unsigned int flags;
>>
>> +    /* Don't map special pages in the IOMMU page-tables. */
>> +    if ( mfn_valid(mfn) && is_special_page(mfn_to_page(mfn)) )
>> +        return 0;
>
> ... instead of is_special_page() I think you want to limit the check
> here to seeing whether PGC_extra is set.
>
> But as said on irc, since this crude way of setting up the APIC access
> page is now firmly a problem, I intend to try to redo it.

Given this series needs to go in 4.15 (we would introduce a 0-day
otherwise), could you clarify whether your patch [1] is intended to
replace this one in 4.15?

Cheers,

[1] <1b6a411b-84e7-bfb1-647e-511a13df838c@suse.com>
On 15.02.2021 12:38, Julien Grall wrote:
> On 10/02/2021 11:26, Jan Beulich wrote:
>> On 09.02.2021 16:28, Julien Grall wrote:
>>> From: Julien Grall <jgrall@amazon.com>
>>>
>>> Currently, the IOMMU page-tables will be populated early in the domain
>>> creation if the hardware is able to virtualize the local APIC. However,
>>> the IOMMU page tables will not be freed during early failure and will
>>> result in a leak.
>>>
>>> An assigned device should not need to DMA into the vLAPIC page, so we
>>> can avoid mapping the page in the IOMMU page-tables.
>>
>> Here and below, may I ask that you use the correct term "APIC access
>> page", as there are other pages involved in vLAPIC handling (in
>> particular the virtual APIC page, which is where accesses that
>> translate to the APIC access page in EPT actually go).
>>
>>> This statement is also true for any special pages (the vLAPIC page is
>>> one of them). So take the opportunity to prevent the mapping for all
>>> of them.
>>
>> I probably should have realized this earlier, but there is a downside
>> to this: a guest wanting to core dump itself may want to dump e.g. the
>> shared info and vcpu info pages. Hence ...
>>
>>> --- a/xen/include/asm-x86/p2m.h
>>> +++ b/xen/include/asm-x86/p2m.h
>>> @@ -919,6 +919,10 @@ static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt, mfn_t mfn)
>>>  {
>>>      unsigned int flags;
>>>
>>> +    /* Don't map special pages in the IOMMU page-tables. */
>>> +    if ( mfn_valid(mfn) && is_special_page(mfn_to_page(mfn)) )
>>> +        return 0;
>>
>> ... instead of is_special_page() I think you want to limit the check
>> here to seeing whether PGC_extra is set.
>>
>> But as said on irc, since this crude way of setting up the APIC access
>> page is now firmly a problem, I intend to try to redo it.
>
> Given this series needs to go in 4.15 (we would introduce a 0-day
> otherwise), could you clarify whether your patch [1] is intended to
> replace this one in 4.15?

Yes, that or a cut-down variant (simply moving the invocation of
set_mmio_p2m_entry()). All the more so since the controversy there
continued regarding the adjustment to p2m_get_iommu_flags(). I did
indicate there that I've dropped it for v2.

> [1] <1b6a411b-84e7-bfb1-647e-511a13df838c@suse.com>

Given the context I was able to guess what mail you refer to, but I
would very much like to ask you and anyone else to provide links
rather than mail IDs as references. Not every mail UI allows looking
up by ID.

Jan
Hi Jan,

On 15/02/2021 12:36, Jan Beulich wrote:
> On 15.02.2021 12:38, Julien Grall wrote:
>> On 10/02/2021 11:26, Jan Beulich wrote:
>>> On 09.02.2021 16:28, Julien Grall wrote:
>>>> From: Julien Grall <jgrall@amazon.com>
>>>>
>>>> Currently, the IOMMU page-tables will be populated early in the domain
>>>> creation if the hardware is able to virtualize the local APIC. However,
>>>> the IOMMU page tables will not be freed during early failure and will
>>>> result to a leak.
>>>>
>>>> An assigned device should not need to DMA into the vLAPIC page, so we
>>>> can avoid to map the page in the IOMMU page-tables.
>>>
>>> Here and below, may I ask that you use the correct term "APIC
>>> access page", as there are other pages involved in vLAPIC
>>> handling (in particular the virtual APIC page, which is where
>>> accesses actually go to that translate to the APIC access page
>>> in EPT).
>>>
>>>> This statement is also true for any special pages (the vLAPIC page is
>>>> one of them). So to take the opportunity to prevent the mapping for all
>>>> of them.
>>>
>>> I probably should have realized this earlier, but there is a
>>> downside to this: A guest wanting to core dump itself may want
>>> to dump e.g. shared info and vcpu info pages. Hence ...
>>>
>>>> --- a/xen/include/asm-x86/p2m.h
>>>> +++ b/xen/include/asm-x86/p2m.h
>>>> @@ -919,6 +919,10 @@ static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt, mfn_t mfn)
>>>>  {
>>>>      unsigned int flags;
>>>>
>>>> +    /* Don't map special pages in the IOMMU page-tables. */
>>>> +    if ( mfn_valid(mfn) && is_special_page(mfn_to_page(mfn)) )
>>>> +        return 0;
>>>
>>> ... instead of is_special_page() I think you want to limit the
>>> check here to seeing whether PGC_extra is set.
>>>
>>> But as said on irc, since this crude way of setting up the APIC
>>> access page is now firmly a problem, I intend to try to redo it.
>>
>> Given this series needs to go in 4.15 (we would introduce a 0-day
>> otherwise), could you clarify whether your patch [1] is intended to
>> replace this one in 4.15?
>
> Yes, that or a cut down variant (simply moving the invocation of
> set_mmio_p2m_entry()). The more that there the controversy
> continued regarding the adjustment to p2m_get_iommu_flags(). I
> did indicate there that I've dropped it for v2.

Do you have a link to v2? I would like to try with my series.

>
>> [1] <1b6a411b-84e7-bfb1-647e-511a13df838c@suse.com>
>
> Given the context I was able to guess what mail you refer to, but
> I would very much like to ask you and anyone else to provide links
> rather than mail IDs as references. Not every mail UI allows
> looking up by ID.

It is pretty trivial nowadays to search for a message by ID online:

https://lore.kernel.org/xen-devel/<message-id>

Cheers,
On 15.02.2021 13:54, Julien Grall wrote:
> On 15/02/2021 12:36, Jan Beulich wrote:
>> On 15.02.2021 12:38, Julien Grall wrote:
>>> Given this series needs to go in 4.15 (we would introduce a 0-day
>>> otherwise), could you clarify whether your patch [1] is intended to
>>> replace this one in 4.15?
>>
>> Yes, that or a cut down variant (simply moving the invocation of
>> set_mmio_p2m_entry()). The more that there the controversy
>> continued regarding the adjustment to p2m_get_iommu_flags(). I
>> did indicate there that I've dropped it for v2.
>
> Do you have a link to v2? I would like to try with my series.

I didn't post it yet, as I didn't consider the v1 discussion settled
so far. The intermediate version I have at present is below.

Jan

VMX: use a single, global APIC access page

The address of this page is used by the CPU only to recognize when to
access the virtual APIC page instead. No accesses would ever go to this
page. It only needs to be present in the (CPU) page tables so that
address translation will produce its address as result for respective
accesses.

By making this page global, we also eliminate the need to refcount it,
or to assign it to any domain in the first place.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: Avoid insertion when !has_vlapic(). Split off change to
    p2m_get_iommu_flags().
---
Hooking p2m insertion onto arch_domain_creation_finished() isn't very
nice, but I couldn't find any better hook (nor a good place where to
introduce a new one). In particular there look to be no hvm_funcs hooks
being used on a domain-wide basis (except for init/destroy of course).
I did consider connecting this to the setting of HVM_PARAM_IDENT_PT,
but considered this no better, the more that the tool stack could be
smarter and avoid setting that param when not needed.
I did further consider not allocating any real page at all, but just
using the address of some unpopulated space (which would require
announcing this page as reserved to Dom0, so it wouldn't put any PCI
MMIO BARs there). But I thought this would be too controversial, because
of the possible risks associated with this.

--- unstable.orig/xen/arch/x86/domain.c
+++ unstable/xen/arch/x86/domain.c
@@ -1005,6 +1005,8 @@ int arch_domain_soft_reset(struct domain
 
 void arch_domain_creation_finished(struct domain *d)
 {
+    if ( is_hvm_domain(d) )
+        hvm_domain_creation_finished(d);
 }
 
 #define xen_vcpu_guest_context vcpu_guest_context
--- unstable.orig/xen/arch/x86/hvm/vmx/vmx.c
+++ unstable/xen/arch/x86/hvm/vmx/vmx.c
@@ -66,8 +66,7 @@ boolean_param("force-ept", opt_force_ept
 static void vmx_ctxt_switch_from(struct vcpu *v);
 static void vmx_ctxt_switch_to(struct vcpu *v);
 
-static int vmx_alloc_vlapic_mapping(struct domain *d);
-static void vmx_free_vlapic_mapping(struct domain *d);
+static int alloc_vlapic_mapping(void);
 static void vmx_install_vlapic_mapping(struct vcpu *v);
 static void vmx_update_guest_cr(struct vcpu *v, unsigned int cr,
                                 unsigned int flags);
@@ -78,6 +77,8 @@ static int vmx_msr_read_intercept(unsign
 static int vmx_msr_write_intercept(unsigned int msr, uint64_t msr_content);
 static void vmx_invlpg(struct vcpu *v, unsigned long linear);
 
+static mfn_t __read_mostly apic_access_mfn;
+
 /* Values for domain's ->arch.hvm_domain.pi_ops.flags. */
 #define PI_CSW_FROM (1u << 0)
 #define PI_CSW_TO   (1u << 1)
@@ -401,7 +402,6 @@ static int vmx_domain_initialise(struct
         .to   = vmx_ctxt_switch_to,
         .tail = vmx_do_resume,
     };
-    int rc;
 
     d->arch.ctxt_switch = &csw;
 
@@ -411,21 +411,16 @@ static int vmx_domain_initialise(struct
      */
     d->arch.hvm.vmx.exec_sp = is_hardware_domain(d) || opt_ept_exec_sp;
 
-    if ( !has_vlapic(d) )
-        return 0;
-
-    if ( (rc = vmx_alloc_vlapic_mapping(d)) != 0 )
-        return rc;
-
     return 0;
 }
 
-static void vmx_domain_relinquish_resources(struct domain *d)
+static void domain_creation_finished(struct domain *d)
 {
-    if ( !has_vlapic(d) )
-        return;
-
-    vmx_free_vlapic_mapping(d);
+    if ( has_vlapic(d) && !mfn_eq(apic_access_mfn, _mfn(0)) &&
+         set_mmio_p2m_entry(d, gaddr_to_gfn(APIC_DEFAULT_PHYS_BASE),
+                            apic_access_mfn, PAGE_ORDER_4K) )
+        domain_crash(d);
 }
 
 static int vmx_vcpu_initialise(struct vcpu *v)
@@ -2264,7 +2259,7 @@ static struct hvm_function_table __initd
     .cpu_up_prepare       = vmx_cpu_up_prepare,
     .cpu_dead             = vmx_cpu_dead,
     .domain_initialise    = vmx_domain_initialise,
-    .domain_relinquish_resources = vmx_domain_relinquish_resources,
+    .domain_creation_finished = domain_creation_finished,
     .vcpu_initialise      = vmx_vcpu_initialise,
     .vcpu_destroy         = vmx_vcpu_destroy,
     .save_cpu_ctxt        = vmx_save_vmcs_ctxt,
@@ -2503,7 +2498,7 @@ const struct hvm_function_table * __init
 {
     set_in_cr4(X86_CR4_VMXE);
 
-    if ( vmx_vmcs_init() )
+    if ( vmx_vmcs_init() || alloc_vlapic_mapping() )
     {
         printk("VMX: failed to initialise.\n");
         return NULL;
@@ -3064,7 +3059,7 @@ gp_fault:
     return X86EMUL_EXCEPTION;
 }
 
-static int vmx_alloc_vlapic_mapping(struct domain *d)
+static int __init alloc_vlapic_mapping(void)
 {
     struct page_info *pg;
     mfn_t mfn;
@@ -3072,53 +3067,28 @@ static int vmx_alloc_vlapic_mapping(stru
     if ( !cpu_has_vmx_virtualize_apic_accesses )
         return 0;
 
-    pg = alloc_domheap_page(d, MEMF_no_refcount);
+    pg = alloc_domheap_page(NULL, 0);
     if ( !pg )
         return -ENOMEM;
 
-    if ( !get_page_and_type(pg, d, PGT_writable_page) )
-    {
-        /*
-         * The domain can't possibly know about this page yet, so failure
-         * here is a clear indication of something fishy going on.
-         */
-        domain_crash(d);
-        return -ENODATA;
-    }
-
     mfn = page_to_mfn(pg);
     clear_domain_page(mfn);
-    d->arch.hvm.vmx.apic_access_mfn = mfn;
+    apic_access_mfn = mfn;
 
-    return set_mmio_p2m_entry(d, gaddr_to_gfn(APIC_DEFAULT_PHYS_BASE), mfn,
-                              PAGE_ORDER_4K);
-}
-
-static void vmx_free_vlapic_mapping(struct domain *d)
-{
-    mfn_t mfn = d->arch.hvm.vmx.apic_access_mfn;
-
-    d->arch.hvm.vmx.apic_access_mfn = _mfn(0);
-    if ( !mfn_eq(mfn, _mfn(0)) )
-    {
-        struct page_info *pg = mfn_to_page(mfn);
-
-        put_page_alloc_ref(pg);
-        put_page_and_type(pg);
-    }
+    return 0;
 }
 
 static void vmx_install_vlapic_mapping(struct vcpu *v)
 {
     paddr_t virt_page_ma, apic_page_ma;
 
-    if ( mfn_eq(v->domain->arch.hvm.vmx.apic_access_mfn, _mfn(0)) )
+    if ( !has_vlapic(v->domain) || mfn_eq(apic_access_mfn, _mfn(0)) )
         return;
 
     ASSERT(cpu_has_vmx_virtualize_apic_accesses);
 
     virt_page_ma = page_to_maddr(vcpu_vlapic(v)->regs_page);
-    apic_page_ma = mfn_to_maddr(v->domain->arch.hvm.vmx.apic_access_mfn);
+    apic_page_ma = mfn_to_maddr(apic_access_mfn);
 
     vmx_vmcs_enter(v);
     __vmwrite(VIRTUAL_APIC_PAGE_ADDR, virt_page_ma);
--- unstable.orig/xen/include/asm-x86/hvm/hvm.h
+++ unstable/xen/include/asm-x86/hvm/hvm.h
@@ -106,6 +106,7 @@ struct hvm_function_table {
      * Initialise/destroy HVM domain/vcpu resources
      */
     int  (*domain_initialise)(struct domain *d);
+    void (*domain_creation_finished)(struct domain *d);
     void (*domain_relinquish_resources)(struct domain *d);
     void (*domain_destroy)(struct domain *d);
     int  (*vcpu_initialise)(struct vcpu *v);
@@ -383,6 +384,12 @@ static inline bool hvm_has_set_descripto
     return hvm_funcs.set_descriptor_access_exiting;
 }
 
+static inline void hvm_domain_creation_finished(struct domain *d)
+{
+    if ( hvm_funcs.domain_creation_finished )
+        alternative_vcall(hvm_funcs.domain_creation_finished, d);
+}
+
 static inline int hvm_guest_x86_mode(struct vcpu *v)
 {
@@ -715,6 +722,11 @@ static inline void hvm_invlpg(const stru
 {
     ASSERT_UNREACHABLE();
 }
+
+static inline void hvm_domain_creation_finished(struct domain *d)
+{
+    ASSERT_UNREACHABLE();
+}
 
 /*
  * Shadow code needs further cleanup to eliminate some HVM-only paths. For
--- unstable.orig/xen/include/asm-x86/hvm/vmx/vmcs.h
+++ unstable/xen/include/asm-x86/hvm/vmx/vmcs.h
@@ -58,7 +58,6 @@ struct ept_data {
 #define _VMX_DOMAIN_PML_ENABLED    0
 #define VMX_DOMAIN_PML_ENABLED     (1ul << _VMX_DOMAIN_PML_ENABLED)
 struct vmx_domain {
-    mfn_t apic_access_mfn;
     /* VMX_DOMAIN_* */
     unsigned int status;
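The draft patch wires its new hook through the usual guarded hvm_funcs dispatch: the generic layer only calls the hook if the vendor code installed one. A minimal standalone model of that control flow, using plain function pointers instead of alternative_vcall() and deliberately simplified structs (none of this is the real Xen code):

```c
#include <stdbool.h>

/* Toy stand-in for struct domain: just enough state to observe the hook. */
struct domain { bool hvm; bool apic_mapped; };

/* Simplified stand-in for struct hvm_function_table. */
struct hvm_function_table {
    void (*domain_creation_finished)(struct domain *d);
};

/* Vendor (VMX-like) implementation: in the real patch this inserts the
 * single global APIC access page into the domain's p2m. */
static void vmx_creation_finished(struct domain *d)
{
    d->apic_mapped = true;
}

static struct hvm_function_table hvm_funcs = {
    .domain_creation_finished = vmx_creation_finished,
};

/* Optional hook: only dispatch if the vendor code installed one. */
static void hvm_domain_creation_finished(struct domain *d)
{
    if ( hvm_funcs.domain_creation_finished )
        hvm_funcs.domain_creation_finished(d);
}

/* Generic arch layer: PV domains never reach the HVM hook. */
static void arch_domain_creation_finished(struct domain *d)
{
    if ( d->hvm )
        hvm_domain_creation_finished(d);
}
```

The guard on the function pointer is what lets SVM simply leave the new hvm_function_table member NULL without any per-call-site checks.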
On 15/02/2021 13:14, Jan Beulich wrote:
> On 15.02.2021 13:54, Julien Grall wrote:
>> On 15/02/2021 12:36, Jan Beulich wrote:
>>> On 15.02.2021 12:38, Julien Grall wrote:
>>>> Given this series needs to go in 4.15 (we would introduce a 0-day
>>>> otherwise), could you clarify whether your patch [1] is intended to
>>>> replace this one in 4.15?
>>>
>>> Yes, that or a cut down variant (simply moving the invocation of
>>> set_mmio_p2m_entry()). The more that there the controversy
>>> continued regarding the adjustment to p2m_get_iommu_flags(). I
>>> did indicate there that I've dropped it for v2.
>>
>> Do you have a link to v2? I would like to try with my series.
>
> I didn't post it yet, as I didn't consider the v1 discussion
> settled so far. The intermediate version I have at present is
> below.

Thanks! I will drop patch #1 from my series and resend it.

Cheers,
diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
index 7d63f5787e62..1802545969b3 100644
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -919,6 +919,10 @@ static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt, mfn_t mfn)
 {
     unsigned int flags;
 
+    /* Don't map special pages in the IOMMU page-tables. */
+    if ( mfn_valid(mfn) && is_special_page(mfn_to_page(mfn)) )
+        return 0;
+
     switch( p2mt )
     {
     case p2m_ram_rw:
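The effect of the check in the hunk above can be modelled standalone: returning 0 from the flags helper means the page simply never gets an entry in the IOMMU page-tables, while non-special RAM keeps its type-derived access flags. The types, flag values, and helpers below are illustrative stand-ins for the real p2m_get_iommu_flags(), not Xen's definitions:

```c
#include <stdbool.h>

/* Illustrative IOMMU permission bits. */
#define IOMMUF_readable (1u << 0)
#define IOMMUF_writable (1u << 1)

/* A few p2m types, mirroring the switch in the real helper. */
typedef enum { p2m_ram_rw, p2m_ram_ro, p2m_invalid } p2m_type_t;

/* Hypothetical stand-in for mfn_valid() + is_special_page(). */
struct page_info { bool special; };

static bool page_is_special(const struct page_info *pg)
{
    return pg && pg->special;
}

/* Sketch of the patched helper: special pages get no IOMMU mapping
 * (flags == 0); everything else derives flags from the p2m type. */
static unsigned int get_iommu_flags(p2m_type_t t, const struct page_info *pg)
{
    if ( page_is_special(pg) )
        return 0;

    switch ( t )
    {
    case p2m_ram_rw:
        return IOMMUF_readable | IOMMUF_writable;
    case p2m_ram_ro:
        return IOMMUF_readable;
    default:
        return 0;
    }
}
```

Note the early return runs before the type switch, which is why even p2m_ram_rw special pages end up unmapped for devices — the behaviour the review discussion turns on.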