diff mbox series

[for-4.17] x86/pci: allow BARs to be positioned on e820 reserved regions

Message ID 20221004153645.3686-1-roger.pau@citrix.com (mailing list archive)
State New, archived
Headers show
Series [for-4.17] x86/pci: allow BARs to be positioned on e820 reserved regions | expand

Commit Message

Roger Pau Monné Oct. 4, 2022, 3:36 p.m. UTC
The EFI memory map contains two memory types (EfiMemoryMappedIO and
EfiMemoryMappedIOPortSpace) used to describe IO memory areas used by
EFI firmware.

The current parsing of the EFI memory map is translating
EfiMemoryMappedIO to E820_RESERVED on x86.  This causes issues on some
boxes as the firmware is relying on using those regions to position
the BARs of devices being used (possibly during runtime) for the
firmware.

Xen will disable memory decoding on any device that has BARs
positioned over any regions on the e820 memory map, hence the firmware
will malfunction after Xen turning off memory decoding for the
device(s) that have BARs mapped in EfiMemoryMappedIO regions.

The system under which this was observed has:

EFI memory map:
[...]
 00000fd000000-00000fe7fffff type=11 attr=800000000000100d
[...]
0000:00:1f.5 disabled: BAR [0xfe010, 0xfe010] overlaps with memory map

The device behind this BAR is:

00:1f.5 Serial bus controller [0c80]: Intel Corporation Lewisburg SPI Controller (rev 09)
	Subsystem: Super Micro Computer Inc Device 091c
	Flags: fast devsel
	Memory at fe010000 (32-bit, non-prefetchable) [size=4K]well

For the record, the symptom observed in that machine was a hard freeze
when attempting to set an EFI variable (XEN_EFI_set_variable).

Fix by allowing BARs of PCI devices to be positioned over reserved
memory regions, but print a warning message about such overlap.

Fixes: 75cc460a1b ('xen/pci: detect when BARs are not suitably positioned')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/arm/include/asm/pci.h |  2 --
 xen/arch/x86/include/asm/pci.h | 10 ----------
 xen/arch/x86/pci.c             | 27 +++++++++++++++++++++++++++
 xen/include/xen/pci.h          |  1 +
 4 files changed, 28 insertions(+), 12 deletions(-)

Comments

Jan Beulich Oct. 4, 2022, 4:11 p.m. UTC | #1
On 04.10.2022 17:36, Roger Pau Monne wrote:
> The EFI memory map contains two memory types (EfiMemoryMappedIO and
> EfiMemoryMappedIOPortSpace) used to describe IO memory areas used by
> EFI firmware.
> 
> The current parsing of the EFI memory map is translating
> EfiMemoryMappedIO to E820_RESERVED on x86.  This causes issues on some
> boxes as the firmware is relying on using those regions to position
> the BARs of devices being used (possibly during runtime) for the
> firmware.
> 
> Xen will disable memory decoding on any device that has BARs
> positioned over any regions on the e820 memory map, hence the firmware
> will malfunction after Xen turning off memory decoding for the
> device(s) that have BARs mapped in EfiMemoryMappedIO regions.
> 
> The system under which this was observed has:
> 
> EFI memory map:
> [...]
>  00000fd000000-00000fe7fffff type=11 attr=800000000000100d
> [...]
> 0000:00:1f.5 disabled: BAR [0xfe010, 0xfe010] overlaps with memory map
> 
> The device behind this BAR is:
> 
> 00:1f.5 Serial bus controller [0c80]: Intel Corporation Lewisburg SPI Controller (rev 09)
> 	Subsystem: Super Micro Computer Inc Device 091c
> 	Flags: fast devsel
> 	Memory at fe010000 (32-bit, non-prefetchable) [size=4K]well
> 
> For the record, the symptom observed in that machine was a hard freeze
> when attempting to set an EFI variable (XEN_EFI_set_variable).
> 
> Fix by allowing BARs of PCI devices to be positioned over reserved
> memory regions, but print a warning message about such overlap.

Somewhat hesitantly I might ack this change, but I'd like to give
others (Andrew in particular) some time to voice their views. As
said during the earlier discussion - I think we're relaxing things
too much by going this route.

> --- a/xen/arch/x86/pci.c
> +++ b/xen/arch/x86/pci.c
> @@ -98,3 +98,30 @@ int pci_conf_write_intercept(unsigned int seg, unsigned int bdf,
>  
>      return rc;
>  }
> +
> +bool pci_check_bar(const struct pci_dev *pdev, mfn_t start, mfn_t end)
> +{
> +    unsigned long mfn;
> +
> +    /*
> +     * Check if BAR is not overlapping with any memory region defined
> +     * in the memory map.
> +     */
> +    if ( is_memory_hole(start, end) )
> +        return true;
> +
> +    /*
> +     * Also allow BARs placed on reserved regions in order to deal with EFI
> +     * firmware using EfiMemoryMappedIO regions to place the BARs of devices
> +     * that can be used during runtime.  But print a warning when doing so.
> +     */
> +    for ( mfn = mfn_x(start); mfn <= mfn_x(end); mfn++ )
> +        if ( !page_is_ram_type(mfn, RAM_TYPE_RESERVED) )
> +            return false;

We don't need to be arch-independent here and hence instead of this
(potentially long) loop can't we use a single call to e820_all_mapped()?

Jan
Roger Pau Monné Oct. 5, 2022, 8:37 a.m. UTC | #2
On Tue, Oct 04, 2022 at 06:11:50PM +0200, Jan Beulich wrote:
> On 04.10.2022 17:36, Roger Pau Monne wrote:
> > The EFI memory map contains two memory types (EfiMemoryMappedIO and
> > EfiMemoryMappedIOPortSpace) used to describe IO memory areas used by
> > EFI firmware.
> > 
> > The current parsing of the EFI memory map is translating
> > EfiMemoryMappedIO to E820_RESERVED on x86.  This causes issues on some
> > boxes as the firmware is relying on using those regions to position
> > the BARs of devices being used (possibly during runtime) for the
> > firmware.
> > 
> > Xen will disable memory decoding on any device that has BARs
> > positioned over any regions on the e820 memory map, hence the firmware
> > will malfunction after Xen turning off memory decoding for the
> > device(s) that have BARs mapped in EfiMemoryMappedIO regions.
> > 
> > The system under which this was observed has:
> > 
> > EFI memory map:
> > [...]
> >  00000fd000000-00000fe7fffff type=11 attr=800000000000100d
> > [...]
> > 0000:00:1f.5 disabled: BAR [0xfe010, 0xfe010] overlaps with memory map
> > 
> > The device behind this BAR is:
> > 
> > 00:1f.5 Serial bus controller [0c80]: Intel Corporation Lewisburg SPI Controller (rev 09)
> > 	Subsystem: Super Micro Computer Inc Device 091c
> > 	Flags: fast devsel
> > 	Memory at fe010000 (32-bit, non-prefetchable) [size=4K]well
> > 
> > For the record, the symptom observed in that machine was a hard freeze
> > when attempting to set an EFI variable (XEN_EFI_set_variable).
> > 
> > Fix by allowing BARs of PCI devices to be positioned over reserved
> > memory regions, but print a warning message about such overlap.
> 
> Somewhat hesitantly I might ack this change, but I'd like to give
> others (Andrew in particular) some time to voice their views. As
> said during the earlier discussion - I think we're relaxing things
> too much by going this route.

Another option would be to explicitly check in efi_memmap for
EfiMemoryMappedIO regions and BAR overlap and only allow those.  That
would be more cumbersome code wise AFAICT.

> > --- a/xen/arch/x86/pci.c
> > +++ b/xen/arch/x86/pci.c
> > @@ -98,3 +98,30 @@ int pci_conf_write_intercept(unsigned int seg, unsigned int bdf,
> >  
> >      return rc;
> >  }
> > +
> > +bool pci_check_bar(const struct pci_dev *pdev, mfn_t start, mfn_t end)
> > +{
> > +    unsigned long mfn;
> > +
> > +    /*
> > +     * Check if BAR is not overlapping with any memory region defined
> > +     * in the memory map.
> > +     */
> > +    if ( is_memory_hole(start, end) )
> > +        return true;
> > +
> > +    /*
> > +     * Also allow BARs placed on reserved regions in order to deal with EFI
> > +     * firmware using EfiMemoryMappedIO regions to place the BARs of devices
> > +     * that can be used during runtime.  But print a warning when doing so.
> > +     */
> > +    for ( mfn = mfn_x(start); mfn <= mfn_x(end); mfn++ )
> > +        if ( !page_is_ram_type(mfn, RAM_TYPE_RESERVED) )
> > +            return false;
> 
> We don't need to be arch-independent here and hence instead of this
> (potentially long) loop can't we use a single call to e820_all_mapped()?

Sure, was searching for a range checking functions but wrongly looked
into mm.c instead of e820.  I would have to make the function
non-init, but I think that's fine.

Thanks, Roger.
Jan Beulich Oct. 5, 2022, 8:53 a.m. UTC | #3
On 05.10.2022 10:37, Roger Pau Monné wrote:
> On Tue, Oct 04, 2022 at 06:11:50PM +0200, Jan Beulich wrote:
>> On 04.10.2022 17:36, Roger Pau Monne wrote:
>>> The EFI memory map contains two memory types (EfiMemoryMappedIO and
>>> EfiMemoryMappedIOPortSpace) used to describe IO memory areas used by
>>> EFI firmware.
>>>
>>> The current parsing of the EFI memory map is translating
>>> EfiMemoryMappedIO to E820_RESERVED on x86.  This causes issues on some
>>> boxes as the firmware is relying on using those regions to position
>>> the BARs of devices being used (possibly during runtime) for the
>>> firmware.
>>>
>>> Xen will disable memory decoding on any device that has BARs
>>> positioned over any regions on the e820 memory map, hence the firmware
>>> will malfunction after Xen turning off memory decoding for the
>>> device(s) that have BARs mapped in EfiMemoryMappedIO regions.
>>>
>>> The system under which this was observed has:
>>>
>>> EFI memory map:
>>> [...]
>>>  00000fd000000-00000fe7fffff type=11 attr=800000000000100d
>>> [...]
>>> 0000:00:1f.5 disabled: BAR [0xfe010, 0xfe010] overlaps with memory map
>>>
>>> The device behind this BAR is:
>>>
>>> 00:1f.5 Serial bus controller [0c80]: Intel Corporation Lewisburg SPI Controller (rev 09)
>>> 	Subsystem: Super Micro Computer Inc Device 091c
>>> 	Flags: fast devsel
>>> 	Memory at fe010000 (32-bit, non-prefetchable) [size=4K]well
>>>
>>> For the record, the symptom observed in that machine was a hard freeze
>>> when attempting to set an EFI variable (XEN_EFI_set_variable).
>>>
>>> Fix by allowing BARs of PCI devices to be positioned over reserved
>>> memory regions, but print a warning message about such overlap.
>>
>> Somewhat hesitantly I might ack this change, but I'd like to give
>> others (Andrew in particular) some time to voice their views. As
>> said during the earlier discussion - I think we're relaxing things
>> too much by going this route.
> 
> Another option would be to explicitly check in efi_memmap for
> EfiMemoryMappedIO regions and BAR overlap and only allow those.  That
> would be more cumbersome code wise AFAICT.

Indeed there's a question of balancing well here, between two outcomes
neither of which is really desirable.

Jan
Roger Pau Monné Oct. 5, 2022, 2:12 p.m. UTC | #4
On Wed, Oct 05, 2022 at 10:53:38AM +0200, Jan Beulich wrote:
> On 05.10.2022 10:37, Roger Pau Monné wrote:
> > On Tue, Oct 04, 2022 at 06:11:50PM +0200, Jan Beulich wrote:
> >> On 04.10.2022 17:36, Roger Pau Monne wrote:
> >>> The EFI memory map contains two memory types (EfiMemoryMappedIO and
> >>> EfiMemoryMappedIOPortSpace) used to describe IO memory areas used by
> >>> EFI firmware.
> >>>
> >>> The current parsing of the EFI memory map is translating
> >>> EfiMemoryMappedIO to E820_RESERVED on x86.  This causes issues on some
> >>> boxes as the firmware is relying on using those regions to position
> >>> the BARs of devices being used (possibly during runtime) for the
> >>> firmware.
> >>>
> >>> Xen will disable memory decoding on any device that has BARs
> >>> positioned over any regions on the e820 memory map, hence the firmware
> >>> will malfunction after Xen turning off memory decoding for the
> >>> device(s) that have BARs mapped in EfiMemoryMappedIO regions.
> >>>
> >>> The system under which this was observed has:
> >>>
> >>> EFI memory map:
> >>> [...]
> >>>  00000fd000000-00000fe7fffff type=11 attr=800000000000100d
> >>> [...]
> >>> 0000:00:1f.5 disabled: BAR [0xfe010, 0xfe010] overlaps with memory map
> >>>
> >>> The device behind this BAR is:
> >>>
> >>> 00:1f.5 Serial bus controller [0c80]: Intel Corporation Lewisburg SPI Controller (rev 09)
> >>> 	Subsystem: Super Micro Computer Inc Device 091c
> >>> 	Flags: fast devsel
> >>> 	Memory at fe010000 (32-bit, non-prefetchable) [size=4K]well
> >>>
> >>> For the record, the symptom observed in that machine was a hard freeze
> >>> when attempting to set an EFI variable (XEN_EFI_set_variable).
> >>>
> >>> Fix by allowing BARs of PCI devices to be positioned over reserved
> >>> memory regions, but print a warning message about such overlap.
> >>
> >> Somewhat hesitantly I might ack this change, but I'd like to give
> >> others (Andrew in particular) some time to voice their views. As
> >> said during the earlier discussion - I think we're relaxing things
> >> too much by going this route.
> > 
> > Another option would be to explicitly check in efi_memmap for
> > EfiMemoryMappedIO regions and BAR overlap and only allow those.  That
> > would be more cumbersome code wise AFAICT.
> 
> Indeed there's a question of balancing well here, between two outcomes
> neither of which is really desirable.

Hm, I have the following diff attached below which is more limited,
and not so bad I think.  Initially I wanted to introduce an
efi_all_mapped() helper, but that would require exposing EFI_MEMORY_TYPE
which is quite intrusive.

Let me know if you think the proposal below is better and I will
formally send a patch with it.

Thanks, Roger.
---
diff --git a/xen/arch/x86/include/asm/pci.h b/xen/arch/x86/include/asm/pci.h
index f4a58c8acf..c8e1a9ecdb 100644
--- a/xen/arch/x86/include/asm/pci.h
+++ b/xen/arch/x86/include/asm/pci.h
@@ -57,14 +57,4 @@ static always_inline bool is_pci_passthrough_enabled(void)
 
 void arch_pci_init_pdev(struct pci_dev *pdev);
 
-static inline bool pci_check_bar(const struct pci_dev *pdev,
-                                 mfn_t start, mfn_t end)
-{
-    /*
-     * Check if BAR is not overlapping with any memory region defined
-     * in the memory map.
-     */
-    return is_memory_hole(start, end);
-}
-
 #endif /* __X86_PCI_H__ */
diff --git a/xen/arch/x86/pci.c b/xen/arch/x86/pci.c
index 97b792e578..c3737e226d 100644
--- a/xen/arch/x86/pci.c
+++ b/xen/arch/x86/pci.c
@@ -4,6 +4,7 @@
  * Architecture-dependent PCI access functions.
  */
 
+#include <xen/efi.h>
 #include <xen/spinlock.h>
 #include <xen/pci.h>
 #include <asm/io.h>
@@ -98,3 +99,28 @@ int pci_conf_write_intercept(unsigned int seg, unsigned int bdf,
 
     return rc;
 }
+
+bool pci_check_bar(const struct pci_dev *pdev, mfn_t start, mfn_t end)
+{
+    /*
+     * Check if BAR is not overlapping with any memory region defined
+     * in the memory map.
+     */
+    if ( is_memory_hole(start, end) )
+        return true;
+
+    /*
+     * Also allow BARs placed on EfiMemoryMappedIO regions in order to deal
+     * with EFI firmware using those regions to place the BARs of devices that
+     * can be used during runtime.  But print a warning when doing so.
+     */
+    if ( !efi_all_runtime_mmio(mfn_to_maddr(start),
+                               mfn_to_maddr(mfn_add(end, 1))) )
+        return false;
+
+    printk(XENLOG_WARNING
+           "%pp: BAR [%#" PRI_mfn ", %#" PRI_mfn "] overlaps reserved region\n",
+           &pdev->sbdf, mfn_x(start), mfn_x(end));
+
+    return true;
+}
diff --git a/xen/common/efi/runtime.c b/xen/common/efi/runtime.c
index 13b0975866..b69c710ce3 100644
--- a/xen/common/efi/runtime.c
+++ b/xen/common/efi/runtime.c
@@ -78,6 +78,30 @@ bool efi_enabled(unsigned int feature)
     return test_bit(feature, &efi_flags);
 }
 
+/*
+ * This function checks if the entire range [start,end) is contained inside of
+ * a single EfiMemoryMappedIO descriptor with the runtime attribute set.
+ */
+bool efi_all_runtime_mmio(uint64_t start, uint64_t end)
+{
+    unsigned int i;
+
+    for ( i = 0; i < efi_memmap_size; i += efi_mdesc_size )
+    {
+        EFI_MEMORY_DESCRIPTOR *desc = efi_memmap + i;
+        uint64_t len = desc->NumberOfPages << EFI_PAGE_SHIFT;
+
+        if ( desc->Type != EfiMemoryMappedIO ||
+             !(desc->Attribute & EFI_MEMORY_RUNTIME) )
+            continue;
+
+        if ( start >= desc->PhysicalStart && end <= desc->PhysicalStart + len )
+            return true;
+    }
+
+    return false;
+}
+
 #ifndef CONFIG_ARM /* TODO - disabled until implemented on ARM */
 
 struct efi_rs_state efi_rs_enter(void)
diff --git a/xen/include/xen/efi.h b/xen/include/xen/efi.h
index 94a7e547f9..f29ea1a320 100644
--- a/xen/include/xen/efi.h
+++ b/xen/include/xen/efi.h
@@ -45,6 +45,8 @@ int efi_runtime_call(struct xenpf_efi_runtime_call *);
 int efi_compat_get_info(uint32_t idx, union compat_pf_efi_info *);
 int efi_compat_runtime_call(struct compat_pf_efi_runtime_call *);
 
+bool efi_all_runtime_mmio(uint64_t start, uint64_t end);
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __XEN_EFI_H__ */
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 5975ca2f30..64995fc68d 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -211,6 +211,7 @@ unsigned int pci_size_mem_bar(pci_sbdf_t sbdf, unsigned int pos,
 
 void pci_intx(const struct pci_dev *, bool enable);
 bool_t pcie_aer_get_firmware_first(const struct pci_dev *);
+bool pci_check_bar(const struct pci_dev *pdev, mfn_t start, mfn_t end);
 
 struct pirq;
 int msixtbl_pt_register(struct domain *, struct pirq *, uint64_t gtable);
Jan Beulich Oct. 5, 2022, 3:42 p.m. UTC | #5
On 05.10.2022 16:12, Roger Pau Monné wrote:
> Hm, I have the following diff attached below which is more limited,
> and not so bad I think.  Initially I wanted to introduce an
> efi_all_mapped() helper, but that would require exposing EFI_MEMORY_TYPE
> which is quite intrusive.
> 
> Let me know if you think the proposal below is better and I will
> formally send a patch with it.

Hmm, personally I like this slightly better for, as you say, its more
limited effect. Objectively, however, I'm still unconvinced of making
this an EFI special case. How would non-EFI firmware go about
communicating that it is going to access a device at runtime (which,
as said before, I consider a no-go in the first place)? Likely by
putting its BAR range(s) in an E820_RESERVED region.

Plus the MMIO range covered on the system in question is pretty large.
That way we're still allowing pretty wide an abuse by the firmware.
Furthermore the MCFG range would also be covered by an
EfiMemoryMappedIO descriptor (in fact that's the only use of the type
I had been aware of so far). IOW the change specifically permits an
overlap of a BAR with an MCFG range.

Who's the manufacturer of the system? Or put in different words - how
likely is it that we could first gain understanding on their
intentions with this region? You did say the system hangs hard without
some kind of workaround, but that doesn't clarify to me in how far a
use of the device by the firmware was actually involved there.

Have you considered other routes towards dealing with the issue? One
approach coming to mind would build on top of what you've been doing
so far (either variant): Besides avoiding the turning off of memory
decode, also invoke pci_ro_device(), thus protecting it from having
its BARs relocated, and also preventing any driver in Dom0 to gain
control of it, thus avoiding two parties competing for the device.

Relocating the BAR outside of the reserved region would be nice, but
will likely not resolve the hang.

In any event I'm still hoping to have a 3rd view on the situation as a
whole, irrespective of specific ideas towards possible workarounds ...

Independent of the above a couple of purely cosmetic comments:

> @@ -98,3 +99,28 @@ int pci_conf_write_intercept(unsigned int seg, unsigned int bdf,
>  
>      return rc;
>  }
> +
> +bool pci_check_bar(const struct pci_dev *pdev, mfn_t start, mfn_t end)
> +{
> +    /*
> +     * Check if BAR is not overlapping with any memory region defined
> +     * in the memory map.
> +     */
> +    if ( is_memory_hole(start, end) )
> +        return true;
> +
> +    /*
> +     * Also allow BARs placed on EfiMemoryMappedIO regions in order to deal
> +     * with EFI firmware using those regions to place the BARs of devices that
> +     * can be used during runtime.  But print a warning when doing so.
> +     */
> +    if ( !efi_all_runtime_mmio(mfn_to_maddr(start),
> +                               mfn_to_maddr(mfn_add(end, 1))) )
> +        return false;
> +
> +    printk(XENLOG_WARNING
> +           "%pp: BAR [%#" PRI_mfn ", %#" PRI_mfn "] overlaps reserved region\n",
> +           &pdev->sbdf, mfn_x(start), mfn_x(end));

Perhaps re-word the message now that the check is a different one?

> --- a/xen/common/efi/runtime.c
> +++ b/xen/common/efi/runtime.c
> @@ -78,6 +78,30 @@ bool efi_enabled(unsigned int feature)
>      return test_bit(feature, &efi_flags);
>  }
>  
> +/*
> + * This function checks if the entire range [start,end) is contained inside of
> + * a single EfiMemoryMappedIO descriptor with the runtime attribute set.
> + */
> +bool efi_all_runtime_mmio(uint64_t start, uint64_t end)
> +{
> +    unsigned int i;
> +
> +    for ( i = 0; i < efi_memmap_size; i += efi_mdesc_size )
> +    {
> +        EFI_MEMORY_DESCRIPTOR *desc = efi_memmap + i;

const?

> +        uint64_t len = desc->NumberOfPages << EFI_PAGE_SHIFT;
> +
> +        if ( desc->Type != EfiMemoryMappedIO ||
> +             !(desc->Attribute & EFI_MEMORY_RUNTIME) )
> +            continue;
> +
> +        if ( start >= desc->PhysicalStart && end <= desc->PhysicalStart + len )
> +            return true;
> +    }
> +
> +    return false;
> +}
> +
>  #ifndef CONFIG_ARM /* TODO - disabled until implemented on ARM */

Perhaps put the function inside this #ifdef?

Jan
Roger Pau Monné Oct. 6, 2022, 8:48 a.m. UTC | #6
On Wed, Oct 05, 2022 at 05:42:08PM +0200, Jan Beulich wrote:
> On 05.10.2022 16:12, Roger Pau Monné wrote:
> > Hm, I have the following diff attached below which is more limited,
> > and not so bad I think.  Initially I wanted to introduce an
> > efi_all_mapped() helper, but that would require exposing EFI_MEMORY_TYPE
> > which is quite intrusive.
> > 
> > Let me know if you think the proposal below is better and I will
> > formally send a patch with it.
> 
> Hmm, personally I like this slightly better for, as you say, its more
> limited effect. Objectively, however, I'm still unconvinced of making
> this an EFI special case. How would non-EFI firmware go about
> communicating that it is going to access a device at runtime (which,
> as said before, I consider a no-go in the first place)? Likely by
> putting its BAR range(s) in an E820_RESERVED region.

I've never encountered a non-EFI firmware that attempts to access the
BAR of a device so far (or at least none that caused problems with
Xen I should say), so I would worry about such use-case when we find
one.

> Plus the MMIO range covered on the system in question is pretty large.
> That way we're still allowing pretty wide an abuse by the firmware.
> Furthermore the MCFG range would also be covered by an
> EfiMemoryMappedIO descriptor (in fact that's the only use of the type
> I had been aware of so far). IOW the change specifically permits an
> overlap of a BAR with an MCFG range.

Additionally I could check if the device overlaps with any known MCFG
regions in pci_check_bar(), again not sure if that's more fine grained
than needed.

> 
> Who's the manufacturer of the system?

It's from Supermicro: "Supermicro X11DPU BIOS"

> Or put in different words - how
> likely is it that we could first gain understanding on their
> intentions with this region? You did say the system hangs hard without
> some kind of workaround, but that doesn't clarify to me in how far a
> use of the device by the firmware was actually involved there.

It's a black box to me, so I have no idea what the firmware attempts
to do.  It looks like that BAR is used to store some information
related to the boot sequence, the hang happened when trying to create
a new variable bootnum using efibootmgr (that's just a guess
really).

> Have you considered other routes towards dealing with the issue? One
> approach coming to mind would build on top of what you've been doing
> so far (either variant): Besides avoiding the turning off of memory
> decode, also invoke pci_ro_device(), thus protecting it from having
> its BARs relocated, and also preventing any driver in Dom0 to gain
> control of it, thus avoiding two parties competing for the device.

IMO using pci_ro_device() would be going too far here - it's not Xen
the entity using the device, so Xen doesn't really know whether
there's already some interface between the firmware and the OS driver
(or just OS) in order to signal that the device is being used by the
firmware.

> Relocating the BAR outside of the reserved region would be nice, but
> will likely not resolve the hang.

Not an option for Xen due to not having enough information about the
memory layout, and it's risky IMO anyway.

> In any event I'm still hoping to have a 3rd view on the situation as a
> whole, irrespective of specific ideas towards possible workarounds ...

Thanks for the detailed feedback.

Roger.
Jan Beulich Oct. 6, 2022, 9:02 a.m. UTC | #7
On 06.10.2022 10:48, Roger Pau Monné wrote:
> On Wed, Oct 05, 2022 at 05:42:08PM +0200, Jan Beulich wrote:
>> On 05.10.2022 16:12, Roger Pau Monné wrote:
>>> Hm, I have the following diff attached below which is more limited,
>>> and not so bad I think.  Initially I wanted to introduce an
>>> efi_all_mapped() helper, but that would require exposing EFI_MEMORY_TYPE
>>> which is quite intrusive.
>>>
>>> Let me know if you think the proposal below is better and I will
>>> formally send a patch with it.
>>
>> Hmm, personally I like this slightly better for, as you say, its more
>> limited effect. Objectively, however, I'm still unconvinced of making
>> this an EFI special case. How would non-EFI firmware go about
>> communicating that it is going to access a device at runtime (which,
>> as said before, I consider a no-go in the first place)? Likely by
>> putting its BAR range(s) in an E820_RESERVED region.
> 
> I've never encountered a non-EFI firmware that attempts to access the
> BAR of a device so far (or at least none that caused problems with
> Xen I should say), so I would worry about such use-case when we find
> one.
> 
>> Plus the MMIO range covered on the system in question is pretty large.
>> That way we're still allowing pretty wide an abuse by the firmware.
>> Furthermore the MCFG range would also be covered by an
>> EfiMemoryMappedIO descriptor (in fact that's the only use of the type
>> I had been aware of so far). IOW the change specifically permits an
>> overlap of a BAR with an MCFG range.
> 
> Additionally I could check if the device overlaps with any known MCFG
> regions in pci_check_bar(), again not sure if that's more fine grained
> than needed.
> 
>>
>> Who's the manufacturer of the system?
> 
> It's from Supermicro: "Supermicro X11DPU BIOS"
> 
>> Or put in different words - how
>> likely is it that we could first gain understanding on their
>> intentions with this region? You did say the system hangs hard without
>> some kind of workaround, but that doesn't clarify to me in how far a
>> use of the device by the firmware was actually involved there.
> 
> It's a black box to me, so I have no idea what the firmware attempts
> to do.

Right - I don't expect there's a realistic chance of getting hold of a
firmware person of theirs.

>  It looks like that BAR is used to store some information
> related to the boot sequence, the hang happened when trying to create
> a new variable bootnum using efibootmgr (that's just a guess
> really).
> 
>> Have you considered other routes towards dealing with the issue? One
>> approach coming to mind would build on top of what you've been doing
>> so far (either variant): Besides avoiding the turning off of memory
>> decode, also invoke pci_ro_device(), thus protecting it from having
>> its BARs relocated, and also preventing any driver in Dom0 to gain
>> control of it, thus avoiding two parties competing for the device.
> 
> IMO using pci_ro_device() would be going too far here - it's not Xen
> the entity using the device, so Xen doesn't really know whether
> there's already some interface between the firmware and the OS driver
> (or just OS) in order to signal that the device is being used by the
> firmware.

Fair point (I did think of that, but ended up not spelling it out).

Jan
Roger Pau Monné Oct. 6, 2022, 9:43 a.m. UTC | #8
On Thu, Oct 06, 2022 at 11:02:32AM +0200, Jan Beulich wrote:
> On 06.10.2022 10:48, Roger Pau Monné wrote:
> > On Wed, Oct 05, 2022 at 05:42:08PM +0200, Jan Beulich wrote:
> >> Or put in different words - how
> >> likely is it that we could first gain understanding on their
> >> intentions with this region? You did say the system hangs hard without
> >> some kind of workaround, but that doesn't clarify to me in how far a
> >> use of the device by the firmware was actually involved there.
> > 
> > It's a black box to me, so I have no idea what the firmware attempts
> > to do.
> 
> Right - I don't expect there's a realistic chance of getting hold of a
> firmware person of theirs.

Even if somehow I could manage to do that (which I think it's
unlikely), it's not going to happen before the release, and there are
production boxes out there that have this behavior and we need to deal
with them.

Thanks, Roger.
diff mbox series

Patch

diff --git a/xen/arch/arm/include/asm/pci.h b/xen/arch/arm/include/asm/pci.h
index 8cb46f6b71..80a2431804 100644
--- a/xen/arch/arm/include/asm/pci.h
+++ b/xen/arch/arm/include/asm/pci.h
@@ -126,8 +126,6 @@  int pci_host_iterate_bridges_and_count(struct domain *d,
 
 int pci_host_bridge_mappings(struct domain *d);
 
-bool pci_check_bar(const struct pci_dev *pdev, mfn_t start, mfn_t end);
-
 #else   /*!CONFIG_HAS_PCI*/
 
 struct arch_pci_dev { };
diff --git a/xen/arch/x86/include/asm/pci.h b/xen/arch/x86/include/asm/pci.h
index f4a58c8acf..c8e1a9ecdb 100644
--- a/xen/arch/x86/include/asm/pci.h
+++ b/xen/arch/x86/include/asm/pci.h
@@ -57,14 +57,4 @@  static always_inline bool is_pci_passthrough_enabled(void)
 
 void arch_pci_init_pdev(struct pci_dev *pdev);
 
-static inline bool pci_check_bar(const struct pci_dev *pdev,
-                                 mfn_t start, mfn_t end)
-{
-    /*
-     * Check if BAR is not overlapping with any memory region defined
-     * in the memory map.
-     */
-    return is_memory_hole(start, end);
-}
-
 #endif /* __X86_PCI_H__ */
diff --git a/xen/arch/x86/pci.c b/xen/arch/x86/pci.c
index 97b792e578..6920bf2168 100644
--- a/xen/arch/x86/pci.c
+++ b/xen/arch/x86/pci.c
@@ -98,3 +98,30 @@  int pci_conf_write_intercept(unsigned int seg, unsigned int bdf,
 
     return rc;
 }
+
+bool pci_check_bar(const struct pci_dev *pdev, mfn_t start, mfn_t end)
+{
+    unsigned long mfn;
+
+    /*
+     * Check if BAR is not overlapping with any memory region defined
+     * in the memory map.
+     */
+    if ( is_memory_hole(start, end) )
+        return true;
+
+    /*
+     * Also allow BARs placed on reserved regions in order to deal with EFI
+     * firmware using EfiMemoryMappedIO regions to place the BARs of devices
+     * that can be used during runtime.  But print a warning when doing so.
+     */
+    for ( mfn = mfn_x(start); mfn <= mfn_x(end); mfn++ )
+        if ( !page_is_ram_type(mfn, RAM_TYPE_RESERVED) )
+            return false;
+
+    printk(XENLOG_WARNING
+           "%pp: BAR [%#" PRI_mfn ", %#" PRI_mfn "] overlaps reserved region\n",
+           &pdev->sbdf, mfn_x(start), mfn_x(end));
+
+    return true;
+}
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 5975ca2f30..64995fc68d 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -211,6 +211,7 @@  unsigned int pci_size_mem_bar(pci_sbdf_t sbdf, unsigned int pos,
 
 void pci_intx(const struct pci_dev *, bool enable);
 bool_t pcie_aer_get_firmware_first(const struct pci_dev *);
+bool pci_check_bar(const struct pci_dev *pdev, mfn_t start, mfn_t end);
 
 struct pirq;
 int msixtbl_pt_register(struct domain *, struct pirq *, uint64_t gtable);