hvmloader: pass PCI MMIO layout to OVMF as an info table

Message ID 1610340812-24397-1-git-send-email-igor.druzhinin@citrix.com (mailing list archive)
State New, archived

Commit Message

Igor Druzhinin Jan. 11, 2021, 4:53 a.m. UTC
We faced a problem with passing through a PCI device with 64GB BAR to
UEFI guest. The BAR is expectedly programmed into 64-bit PCI aperture at
64G address which pushes physical address space to 37 bits. OVMF uses
address width early in PEI phase to make DXE identity pages covering
the whole addressable space so it needs to know the last address it needs
to cover but at the same time not overdo the mappings.

As there is seemingly no other way to pass or get this information in
OVMF at this early phase (ACPI is not yet available, PCI is not yet enumerated,
xenstore is not yet initialized) - extend the info structure with a new
table. Since the structure was initially created to be extendable -
the change is backward compatible.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
---

Companion change in OVMF:
https://lists.xenproject.org/archives/html/xen-devel/2021-01/msg00516.html

---
 tools/firmware/hvmloader/ovmf.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)
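
As a worked illustration of the arithmetic in the commit message (a sketch only; `addr_width_for_end` is a hypothetical helper, not code from hvmloader or OVMF): a 64GB BAR placed at the 64G boundary ends at 128G, so the highest address in use needs 37 bits.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: smallest number of address bits needed to address
 * every byte below `end` (exclusive upper bound of the highest aperture).
 */
static unsigned int addr_width_for_end(uint64_t end)
{
    unsigned int bits = 0;
    uint64_t last;

    assert(end != 0);
    last = end - 1;          /* highest address actually in use */
    while (last != 0) {
        bits++;
        last >>= 1;
    }
    return bits;
}
```

With `end = 128ULL << 30` the helper yields 37, while an aperture ending exactly at 64G still fits in 36 bits.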

Comments

Jan Beulich Jan. 11, 2021, 9:27 a.m. UTC | #1
On 11.01.2021 05:53, Igor Druzhinin wrote:
> We faced a problem with passing through a PCI device with 64GB BAR to
> UEFI guest. The BAR is expectedly programmed into 64-bit PCI aperture at
> 64G address which pushes physical address space to 37 bits. OVMF uses
> address width early in PEI phase to make DXE identity pages covering
> the whole addressable space so it needs to know the last address it needs
> to cover but at the same time not overdo the mappings.
> 
> As there is seemingly no other way to pass or get this information in
> OVMF at this early phase (ACPI is not yet available, PCI is not yet enumerated,
> xenstore is not yet initialized) - extend the info structure with a new
> table. Since the structure was initially created to be extendable -
> the change is backward compatible.

How does UEFI handle the same situation on baremetal? I'd guess it is
in even more trouble there, as it couldn't even read addresses from
BARs, but would first need to assign them (or at least calculate
their intended positions).

> --- a/tools/firmware/hvmloader/ovmf.c
> +++ b/tools/firmware/hvmloader/ovmf.c
> @@ -61,6 +61,14 @@ struct ovmf_info {
>      uint32_t e820_nr;
>  } __attribute__ ((packed));
>  
> +#define OVMF_INFO_PCI_TABLE 0
> +struct ovmf_pci_info {
> +    uint64_t low_start;
> +    uint64_t low_end;
> +    uint64_t hi_start;
> +    uint64_t hi_end;
> +} __attribute__ ((packed));

Forming part of ABI, I believe this belongs in a public header,
which consumers could at least in principle use verbatim if
they wanted to.

> @@ -74,9 +82,21 @@ static void ovmf_setup_bios_info(void)
>  static void ovmf_finish_bios_info(void)
>  {
>      struct ovmf_info *info = (void *)OVMF_INFO_PHYSICAL_ADDRESS;
> +    struct ovmf_pci_info *pci_info;
> +    uint64_t *tables = scratch_alloc(sizeof(uint64_t)*OVMF_INFO_MAX_TABLES, 0);

I wasn't able to locate OVMF_INFO_MAX_TABLES in either
xen/include/public/ or tools/firmware/. Where does it get
defined?

Also (nit) missing blanks around * .

>      uint32_t i;
>      uint8_t checksum;
>  
> +    pci_info = scratch_alloc(sizeof(struct ovmf_pci_info), 0);

Is "scratch" correct here and above? I guess intended usage /
scope will want spelling out somewhere.

> +    pci_info->low_start = pci_mem_start;
> +    pci_info->low_end = pci_mem_end;
> +    pci_info->hi_start = pci_hi_mem_start;
> +    pci_info->hi_end = pci_hi_mem_end;
> +
> +    tables[OVMF_INFO_PCI_TABLE] = (uint32_t)pci_info;
> +    info->tables = (uint32_t)tables;
> +    info->tables_nr = 1;

In how far is this problem (and hence solution / workaround) OVMF
specific? IOW don't we need a more generic approach here?

Jan
Igor Druzhinin Jan. 11, 2021, 2 p.m. UTC | #2
On 11/01/2021 09:27, Jan Beulich wrote:
> On 11.01.2021 05:53, Igor Druzhinin wrote:
>> We faced a problem with passing through a PCI device with 64GB BAR to
>> UEFI guest. The BAR is expectedly programmed into 64-bit PCI aperture at
>> 64G address which pushes physical address space to 37 bits. OVMF uses
>> address width early in PEI phase to make DXE identity pages covering
>> the whole addressable space so it needs to know the last address it needs
>> to cover but at the same time not overdo the mappings.
>>
>> As there is seemingly no other way to pass or get this information in
>> OVMF at this early phase (ACPI is not yet available, PCI is not yet enumerated,
>> xenstore is not yet initialized) - extend the info structure with a new
>> table. Since the structure was initially created to be extendable -
>> the change is backward compatible.
> 
> How does UEFI handle the same situation on baremetal? I'd guess it is
> in even more trouble there, as it couldn't even read addresses from
> BARs, but would first need to assign them (or at least calculate
> their intended positions).

Maybe Laszlo or Anthony could answer this question quickly while I'm investigating?

>> --- a/tools/firmware/hvmloader/ovmf.c
>> +++ b/tools/firmware/hvmloader/ovmf.c
>> @@ -61,6 +61,14 @@ struct ovmf_info {
>>      uint32_t e820_nr;
>>  } __attribute__ ((packed));
>>  
>> +#define OVMF_INFO_PCI_TABLE 0
>> +struct ovmf_pci_info {
>> +    uint64_t low_start;
>> +    uint64_t low_end;
>> +    uint64_t hi_start;
>> +    uint64_t hi_end;
>> +} __attribute__ ((packed));
> 
> Forming part of ABI, I believe this belongs in a public header,
> which consumers could at least in principle use verbatim if
> they wanted to.

It probably does, but if we'd want to move all of hand-over structures
wholesale that would include seabios as well. I'd stick with the current
approach to avoid code churn in various repos. Besides the structures
are not the only bits of ABI that are implicitly shared with BIOS images.

>> @@ -74,9 +82,21 @@ static void ovmf_setup_bios_info(void)
>>  static void ovmf_finish_bios_info(void)
>>  {
>>      struct ovmf_info *info = (void *)OVMF_INFO_PHYSICAL_ADDRESS;
>> +    struct ovmf_pci_info *pci_info;
>> +    uint64_t *tables = scratch_alloc(sizeof(uint64_t)*OVMF_INFO_MAX_TABLES, 0);
> 
> I wasn't able to locate OVMF_INFO_MAX_TABLES in either
> xen/include/public/ or tools/firmware/. Where does it get
> defined?

I expect it to be unlimited from OVMF side. It just expects an array of 
tables_nr elements.

> Also (nit) missing blanks around * .
> 
>>      uint32_t i;
>>      uint8_t checksum;
>>  
>> +    pci_info = scratch_alloc(sizeof(struct ovmf_pci_info), 0);
> 
> Is "scratch" correct here and above? I guess intended usage /
> scope will want spelling out somewhere.

Again, scratch_alloc is used universally for handing over info between hvmloader
and BIOS images. Where would you want it to be spelled out?

>> +    pci_info->low_start = pci_mem_start;
>> +    pci_info->low_end = pci_mem_end;
>> +    pci_info->hi_start = pci_hi_mem_start;
>> +    pci_info->hi_end = pci_hi_mem_end;
>> +
>> +    tables[OVMF_INFO_PCI_TABLE] = (uint32_t)pci_info;
>> +    info->tables = (uint32_t)tables;
>> +    info->tables_nr = 1;
> 
> In how far is this problem (and hence solution / workaround) OVMF
> specific? IOW don't we need a more generic approach here?

I believe it's very OVMF specific given only OVMF constructs identity page
tables for the whole address space - that's how it was designed. Seabios to
the best of my knowledge only has access to lower 4G.

Igor
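
The "array of tables_nr elements" contract mentioned above could be consumed roughly as follows. This is only a sketch: `ovmf_info_table` is a hypothetical helper, and treating a missing slot as "no table provided" is an assumption of the sketch rather than something stated in this thread. Only `OVMF_INFO_PCI_TABLE` is taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OVMF_INFO_PCI_TABLE 0   /* slot index, as defined in the patch */

/*
 * Hypothetical consumer-side lookup: return the address stored in slot
 * `index`, or 0 if the producer supplied fewer than index + 1 tables.
 * An older hvmloader that sets tables_nr to 0 is thereby handled too.
 */
static uint64_t ovmf_info_table(const uint64_t *tables, uint32_t tables_nr,
                                uint32_t index)
{
    assert(tables != NULL || tables_nr == 0);
    if (index >= tables_nr)
        return 0;
    return tables[index];
}
```

Bounding the lookup by `tables_nr` is what keeps the extension backward compatible in this sketch: a consumer asking for a slot the producer never filled simply sees 0.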
Jan Beulich Jan. 11, 2021, 2:14 p.m. UTC | #3
On 11.01.2021 15:00, Igor Druzhinin wrote:
> On 11/01/2021 09:27, Jan Beulich wrote:
>> On 11.01.2021 05:53, Igor Druzhinin wrote:
>>> --- a/tools/firmware/hvmloader/ovmf.c
>>> +++ b/tools/firmware/hvmloader/ovmf.c
>>> @@ -61,6 +61,14 @@ struct ovmf_info {
>>>      uint32_t e820_nr;
>>>  } __attribute__ ((packed));
>>>  
>>> +#define OVMF_INFO_PCI_TABLE 0
>>> +struct ovmf_pci_info {
>>> +    uint64_t low_start;
>>> +    uint64_t low_end;
>>> +    uint64_t hi_start;
>>> +    uint64_t hi_end;
>>> +} __attribute__ ((packed));
>>
>> Forming part of ABI, I believe this belongs in a public header,
>> which consumers could at least in principle use verbatim if
>> they wanted to.
> 
> It probably does, but if we'd want to move all of hand-over structures
> wholesale that would include seabios as well. I'd stick with the current
> approach to avoid code churn in various repos. Besides the structures
> are not the only bits of ABI that are implicitly shared with BIOS images.

Well, so be it then for the time being. I'm going to be
hesitant though ack-ing such, no matter that there are (bad)
precedents. What I'd like to ask for as a minimum is to have
a comment here clarifying this struct can't be changed
arbitrarily because of being part of an ABI.

>>> @@ -74,9 +82,21 @@ static void ovmf_setup_bios_info(void)
>>>  static void ovmf_finish_bios_info(void)
>>>  {
>>>      struct ovmf_info *info = (void *)OVMF_INFO_PHYSICAL_ADDRESS;
>>> +    struct ovmf_pci_info *pci_info;
>>> +    uint64_t *tables = scratch_alloc(sizeof(uint64_t)*OVMF_INFO_MAX_TABLES, 0);
>>
>> I wasn't able to locate OVMF_INFO_MAX_TABLES in either
>> xen/include/public/ or tools/firmware/. Where does it get
>> defined?
> 
> I expect it to be unlimited from OVMF side. It just expects an array of 
> tables_nr elements.

That wasn't the (primary) question. Me not being able to locate
the place where this constant gets #define-d means I wonder how
this code builds.

>> Also (nit) missing blanks around * .
>>
>>>      uint32_t i;
>>>      uint8_t checksum;
>>>  
>>> +    pci_info = scratch_alloc(sizeof(struct ovmf_pci_info), 0);
>>
>> Is "scratch" correct here and above? I guess intended usage /
>> scope will want spelling out somewhere.
> 
> Again, scratch_alloc is used universally for handing over info between hvmloader
> and BIOS images. Where would you want it to be spelled out?

Next to where all the involved structures get declared.
Consumers need to be aware they may need to take precautions to
avoid clobbering the contents before consuming it. But as per
above there doesn't look to be such a central place (yet).

>>> +    pci_info->low_start = pci_mem_start;
>>> +    pci_info->low_end = pci_mem_end;
>>> +    pci_info->hi_start = pci_hi_mem_start;
>>> +    pci_info->hi_end = pci_hi_mem_end;
>>> +
>>> +    tables[OVMF_INFO_PCI_TABLE] = (uint32_t)pci_info;
>>> +    info->tables = (uint32_t)tables;
>>> +    info->tables_nr = 1;
>>
>> In how far is this problem (and hence solution / workaround) OVMF
>> specific? IOW don't we need a more generic approach here?
> 
> I believe it's very OVMF specific given only OVMF constructs identity page
> tables for the whole address space - that's how it was designed. Seabios to
> the best of my knowledge only has access to lower 4G.

Quite likely, yet how would SeaBIOS access such a huge frame
buffer then? They can't possibly place it below 4G. Do systems
with such video cards get penalized by e.g. not surfacing VESA
mode changing functionality?

In general I think any BIOS should be eligible to receive
information one BIOS finds necessary to receive. They're all
fine to ignore what they get handed. But yes, moving this a
layer up can certainly also be done later.

Jan
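
The ABI comment requested above might look something like the sketch below. The struct layout is copied from the patch; the comment wording and the compile-time checks are illustrative assumptions (they need a C11 compiler, which the hvmloader build is not confirmed here to use).

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

/*
 * ABI: this layout is shared with the OVMF image consuming it.
 * Do not reorder, resize or remove fields; only extend via new tables.
 */
struct ovmf_pci_info {
    uint64_t low_start;
    uint64_t low_end;
    uint64_t hi_start;
    uint64_t hi_end;
} __attribute__ ((packed));

/* Illustrative guards: fail the build if the layout changes by accident. */
static_assert(sizeof(struct ovmf_pci_info) == 32,
              "ovmf_pci_info is part of the hvmloader/OVMF ABI");
static_assert(offsetof(struct ovmf_pci_info, hi_start) == 16,
              "ovmf_pci_info field offsets are part of the ABI");
```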
Igor Druzhinin Jan. 11, 2021, 2:43 p.m. UTC | #4
On 11/01/2021 14:14, Jan Beulich wrote:
> On 11.01.2021 15:00, Igor Druzhinin wrote:
>> On 11/01/2021 09:27, Jan Beulich wrote:
>>> On 11.01.2021 05:53, Igor Druzhinin wrote:
>>>> --- a/tools/firmware/hvmloader/ovmf.c
>>>> +++ b/tools/firmware/hvmloader/ovmf.c
>>>> @@ -61,6 +61,14 @@ struct ovmf_info {
>>>>      uint32_t e820_nr;
>>>>  } __attribute__ ((packed));
>>>>  
>>>> +#define OVMF_INFO_PCI_TABLE 0
>>>> +struct ovmf_pci_info {
>>>> +    uint64_t low_start;
>>>> +    uint64_t low_end;
>>>> +    uint64_t hi_start;
>>>> +    uint64_t hi_end;
>>>> +} __attribute__ ((packed));
>>>
>>> Forming part of ABI, I believe this belongs in a public header,
>>> which consumers could at least in principle use verbatim if
>>> they wanted to.
>>
>> It probably does, but if we'd want to move all of hand-over structures
>> wholesale that would include seabios as well. I'd stick with the current
>> approach to avoid code churn in various repos. Besides the structures
>> are not the only bits of ABI that are implicitly shared with BIOS images.
> 
> Well, so be it then for the time being. I'm going to be
> hesitant though ack-ing such, no matter that there are (bad)
> precedents. What I'd like to ask for as a minimum is to have
> a comment here clarifying this struct can't be changed
> arbitrarily because of being part of an ABI.

Ok, I will improve information in comments in an additional commit.

>>>> @@ -74,9 +82,21 @@ static void ovmf_setup_bios_info(void)
>>>>  static void ovmf_finish_bios_info(void)
>>>>  {
>>>>      struct ovmf_info *info = (void *)OVMF_INFO_PHYSICAL_ADDRESS;
>>>> +    struct ovmf_pci_info *pci_info;
>>>> +    uint64_t *tables = scratch_alloc(sizeof(uint64_t)*OVMF_INFO_MAX_TABLES, 0);
>>>
>>> I wasn't able to locate OVMF_INFO_MAX_TABLES in either
>>> xen/include/public/ or tools/firmware/. Where does it get
>>> defined?
>>
>> I expect it to be unlimited from OVMF side. It just expects an array of 
>> tables_nr elements.
> 
> That wasn't the (primary) question. Me not being able to locate
> the place where this constant gets #define-d means I wonder how
> this code builds.

It's right up there in the same file.

>>> Also (nit) missing blanks around * .
>>>
>>>>      uint32_t i;
>>>>      uint8_t checksum;
>>>>  
>>>> +    pci_info = scratch_alloc(sizeof(struct ovmf_pci_info), 0);
>>>
>>> Is "scratch" correct here and above? I guess intended usage /
>>> scope will want spelling out somewhere.
>>
>> Again, scratch_alloc is used universally for handing over info between hvmloader
>> and BIOS images. Where would you want it to be spelled out?
> 
> Next to where all the involved structures get declared.
> Consumers need to be aware they may need to take precautions to
> avoid clobbering the contents before consuming it. But as per
> above there doesn't look to be such a central place (yet).

I will duplicate the comments for now in all places involved.
The struct checksum, I believe, serves exactly the purpose you described -
to catch that sort of bug early.

>>>> +    pci_info->low_start = pci_mem_start;
>>>> +    pci_info->low_end = pci_mem_end;
>>>> +    pci_info->hi_start = pci_hi_mem_start;
>>>> +    pci_info->hi_end = pci_hi_mem_end;
>>>> +
>>>> +    tables[OVMF_INFO_PCI_TABLE] = (uint32_t)pci_info;
>>>> +    info->tables = (uint32_t)tables;
>>>> +    info->tables_nr = 1;
>>>
>>> In how far is this problem (and hence solution / workaround) OVMF
>>> specific? IOW don't we need a more generic approach here?
>>
>> I believe it's very OVMF specific given only OVMF constructs identity page
>> tables for the whole address space - that's how it was designed. Seabios to
>> the best of my knowledge only has access to lower 4G.
> 
> Quite likely, yet how would SeaBIOS access such a huge frame
> buffer then? They can't possibly place it below 4G. Do systems
> with such video cards get penalized by e.g. not surfacing VESA
> mode changing functionality?

Yes, the VESA FB pointer is 32-bit only.
In my experience the framebuffer itself is located in a separate, smaller BAR
on real cards. It therefore usually lands below 4G, which masks the problem
in most scenarios.

Igor
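
For reference, a sum-to-zero byte checksum of the kind mentioned above can be sketched as follows. The helper names are hypothetical and the snippet is not taken from hvmloader; it only illustrates the general technique. Such a checksum detects corruption only within the byte range it covers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum of all bytes in the buffer, modulo 256. */
static uint8_t byte_sum(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint8_t sum = 0;
    size_t i;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++)
        sum = (uint8_t)(sum + p[i]);
    return sum;
}

/*
 * Value to store in the checksum byte (which must hold 0 while this is
 * computed) so that the whole buffer afterwards sums to 0 modulo 256.
 */
static uint8_t checksum_for(const void *buf, size_t len)
{
    return (uint8_t)-byte_sum(buf, len);
}
```

A consumer would recompute `byte_sum` over the same range and reject the structure if the result is non-zero.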
Laszlo Ersek Jan. 11, 2021, 2:49 p.m. UTC | #5
On 01/11/21 15:00, Igor Druzhinin wrote:
> On 11/01/2021 09:27, Jan Beulich wrote:
>> On 11.01.2021 05:53, Igor Druzhinin wrote:
>>> We faced a problem with passing through a PCI device with 64GB BAR to
>>> UEFI guest. The BAR is expectedly programmed into 64-bit PCI aperture at
>>> 64G address which pushes physical address space to 37 bits. OVMF uses
>>> address width early in PEI phase to make DXE identity pages covering
>>> the whole addressable space so it needs to know the last address it needs
>>> to cover but at the same time not overdo the mappings.
>>>
>>> As there is seemingly no other way to pass or get this information in
>>> OVMF at this early phase (ACPI is not yet available, PCI is not yet enumerated,
>>> xenstore is not yet initialized) - extend the info structure with a new
>>> table. Since the structure was initially created to be extendable -
>>> the change is backward compatible.
>>
>> How does UEFI handle the same situation on baremetal? I'd guess it is
>> in even more trouble there, as it couldn't even read addresses from
>> BARs, but would first need to assign them (or at least calculate
>> their intended positions).
> 
> Maybe Laszlo or Anthony could answer this question quickly while I'm investigating?

On the bare metal, the phys address width of the processor is known.

OVMF does the whole calculation in reverse because there's no way for it
to know the physical address width of the physical (= host) CPU.
"Overdoing" the mappings doesn't only waste resources, it breaks hard
with EPT -- access to a GPA that is inexpressible with the phys address
width of the host CPU (= not mappable successfully with the nested page
tables) will behave super bad. I don't recall the exact symptoms, but it
prevents booting the guest OS.

This is why the most conservative 36-bit width is assumed by default.

> 
>>> --- a/tools/firmware/hvmloader/ovmf.c
>>> +++ b/tools/firmware/hvmloader/ovmf.c
>>> @@ -61,6 +61,14 @@ struct ovmf_info {
>>>      uint32_t e820_nr;
>>>  } __attribute__ ((packed));
>>>  
>>> +#define OVMF_INFO_PCI_TABLE 0
>>> +struct ovmf_pci_info {
>>> +    uint64_t low_start;
>>> +    uint64_t low_end;
>>> +    uint64_t hi_start;
>>> +    uint64_t hi_end;
>>> +} __attribute__ ((packed));
>>
>> Forming part of ABI, I believe this belongs in a public header,
>> which consumers could at least in principle use verbatim if
>> they wanted to.

(In OVMF I strongly prefer hand-coded structures, due to the particular
coding style edk2 employs. Although Xen headers have been imported and
fixed up in the past, and so further importing would not be without
precedent for Xen in OVMF, those imported headers continue to stick out
like a sore thumb, due to their different coding style. That's not to
say the Xen coding style is "wrong" or anything; just that esp. when
those structs are *used* in code, they look quite out of place.)

Thanks,
Laszlo

> 
> It probably does, but if we'd want to move all of hand-over structures
> wholesale that would include seabios as well. I'd stick with the current
> approach to avoid code churn in various repos. Besides the structures
> are not the only bits of ABI that are implicitly shared with BIOS images.
> 
>>> @@ -74,9 +82,21 @@ static void ovmf_setup_bios_info(void)
>>>  static void ovmf_finish_bios_info(void)
>>>  {
>>>      struct ovmf_info *info = (void *)OVMF_INFO_PHYSICAL_ADDRESS;
>>> +    struct ovmf_pci_info *pci_info;
>>> +    uint64_t *tables = scratch_alloc(sizeof(uint64_t)*OVMF_INFO_MAX_TABLES, 0);
>>
>> I wasn't able to locate OVMF_INFO_MAX_TABLES in either
>> xen/include/public/ or tools/firmware/. Where does it get
>> defined?
> 
> I expect it to be unlimited from OVMF side. It just expects an array of 
> tables_nr elements.
> 
>> Also (nit) missing blanks around * .
>>
>>>      uint32_t i;
>>>      uint8_t checksum;
>>>  
>>> +    pci_info = scratch_alloc(sizeof(struct ovmf_pci_info), 0);
>>
>> Is "scratch" correct here and above? I guess intended usage /
>> scope will want spelling out somewhere.
> 
> Again, scratch_alloc is used universally for handing over info between hvmloader
> and BIOS images. Where would you want it to be spelled out?
> 
>>> +    pci_info->low_start = pci_mem_start;
>>> +    pci_info->low_end = pci_mem_end;
>>> +    pci_info->hi_start = pci_hi_mem_start;
>>> +    pci_info->hi_end = pci_hi_mem_end;
>>> +
>>> +    tables[OVMF_INFO_PCI_TABLE] = (uint32_t)pci_info;
>>> +    info->tables = (uint32_t)tables;
>>> +    info->tables_nr = 1;
>>
>> In how far is this problem (and hence solution / workaround) OVMF
>> specific? IOW don't we need a more generic approach here?
> 
> I believe it's very OVMF specific given only OVMF constructs identity page
> tables for the whole address space - that's how it was designed. Seabios to
> the best of my knowledge only has access to lower 4G.
> 
> Igor
>
Jan Beulich Jan. 11, 2021, 3:21 p.m. UTC | #6
On 11.01.2021 15:49, Laszlo Ersek wrote:
> On 01/11/21 15:00, Igor Druzhinin wrote:
>> On 11/01/2021 09:27, Jan Beulich wrote:
>>> On 11.01.2021 05:53, Igor Druzhinin wrote:
>>>> We faced a problem with passing through a PCI device with 64GB BAR to
>>>> UEFI guest. The BAR is expectedly programmed into 64-bit PCI aperture at
>>>> 64G address which pushes physical address space to 37 bits. OVMF uses
>>>> address width early in PEI phase to make DXE identity pages covering
>>>> the whole addressable space so it needs to know the last address it needs
>>>> to cover but at the same time not overdo the mappings.
>>>>
>>>> As there is seemingly no other way to pass or get this information in
>>>> OVMF at this early phase (ACPI is not yet available, PCI is not yet enumerated,
>>>> xenstore is not yet initialized) - extend the info structure with a new
>>>> table. Since the structure was initially created to be extendable -
>>>> the change is backward compatible.
>>>
>>> How does UEFI handle the same situation on baremetal? I'd guess it is
>>> in even more trouble there, as it couldn't even read addresses from
>>> BARs, but would first need to assign them (or at least calculate
>>> their intended positions).
>>
>> Maybe Laszlo or Anthony could answer this question quickly while I'm investigating?
> 
> On the bare metal, the phys address width of the processor is known.

From CPUID I suppose.

> OVMF does the whole calculation in reverse because there's no way for it
> to know the physical address width of the physical (= host) CPU.
> "Overdoing" the mappings doesn't only waste resources, it breaks hard
> with EPT -- access to a GPA that is inexpressible with the phys address
> width of the host CPU (= not mappable successfully with the nested page
> tables) will behave super bad. I don't recall the exact symptoms, but it
> prevents booting the guest OS.
> 
> This is why the most conservative 36-bit width is assumed by default.

IOW you don't trust virtualized CPUID output?

Jan
Igor Druzhinin Jan. 11, 2021, 3:26 p.m. UTC | #7
On 11/01/2021 15:21, Jan Beulich wrote:
> On 11.01.2021 15:49, Laszlo Ersek wrote:
>> On 01/11/21 15:00, Igor Druzhinin wrote:
>>> On 11/01/2021 09:27, Jan Beulich wrote:
>>>> On 11.01.2021 05:53, Igor Druzhinin wrote:
>>>>> We faced a problem with passing through a PCI device with 64GB BAR to
>>>>> UEFI guest. The BAR is expectedly programmed into 64-bit PCI aperture at
>>>>> 64G address which pushes physical address space to 37 bits. OVMF uses
>>>>> address width early in PEI phase to make DXE identity pages covering
>>>>> the whole addressable space so it needs to know the last address it needs
>>>>> to cover but at the same time not overdo the mappings.
>>>>>
>>>>> As there is seemingly no other way to pass or get this information in
>>>>> OVMF at this early phase (ACPI is not yet available, PCI is not yet enumerated,
>>>>> xenstore is not yet initialized) - extend the info structure with a new
>>>>> table. Since the structure was initially created to be extendable -
>>>>> the change is backward compatible.
>>>>
>>>> How does UEFI handle the same situation on baremetal? I'd guess it is
>>>> in even more trouble there, as it couldn't even read addresses from
>>>> BARs, but would first need to assign them (or at least calculate
>>>> their intended positions).
>>>
>>> Maybe Laszlo or Anthony could answer this question quickly while I'm investigating?
>>
>> On the bare metal, the phys address width of the processor is known.
> 
> From CPUID I suppose.
> 
>> OVMF does the whole calculation in reverse because there's no way for it
>> to know the physical address width of the physical (= host) CPU.
>> "Overdoing" the mappings doesn't only waste resources, it breaks hard
>> with EPT -- access to a GPA that is inexpressible with the phys address
>> width of the host CPU (= not mappable successfully with the nested page
>> tables) will behave super bad. I don't recall the exact symptoms, but it
>> prevents booting the guest OS.
>>
>> This is why the most conservative 36-bit width is assumed by default.
> 
> IOW you don't trust virtualized CPUID output?

I'm discussing this with Andrew and it appears we're certainly more lax than
KVM in wiring the physical address width from hardware directly into the
guest.

Another problem that I faced while experimenting is that creating page
tables for 46-bits (that CPUID returned in my case) of address space takes
about a minute on a modern CPU.

Igor
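
A rough count shows why the page size matters for a 46-bit identity map. The sketch below is an illustration only: the helper names are hypothetical, it assumes 8-byte x86-64 paging entries, and it counts only the leaf entries while ignoring the upper paging levels.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_SIZE       8u    /* bytes per x86-64 paging-structure entry */
#define PAGE_SHIFT_2M  21u   /* 2 MiB pages */
#define PAGE_SHIFT_1G  30u   /* 1 GiB pages */

/* Leaf entries needed to map 2^addr_bits bytes with 2^page_shift pages. */
static uint64_t leaf_entries(unsigned int addr_bits, unsigned int page_shift)
{
    assert(addr_bits >= page_shift && addr_bits <= 52);
    return 1ULL << (addr_bits - page_shift);
}

/* Memory taken by those leaf entries alone (upper levels not counted). */
static uint64_t leaf_table_bytes(unsigned int addr_bits,
                                 unsigned int page_shift)
{
    return leaf_entries(addr_bits, page_shift) * PTE_SIZE;
}
```

For 46 bits this gives 33,554,432 entries (256 MiB of tables) with 2 MiB pages, against 65,536 entries (512 KiB) with 1 GiB pages.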
Laszlo Ersek Jan. 11, 2021, 3:30 p.m. UTC | #8
On 01/11/21 16:21, Jan Beulich wrote:
> On 11.01.2021 15:49, Laszlo Ersek wrote:
>> On 01/11/21 15:00, Igor Druzhinin wrote:
>>> On 11/01/2021 09:27, Jan Beulich wrote:
>>>> On 11.01.2021 05:53, Igor Druzhinin wrote:
>>>>> We faced a problem with passing through a PCI device with 64GB BAR to
>>>>> UEFI guest. The BAR is expectedly programmed into 64-bit PCI aperture at
>>>>> 64G address which pushes physical address space to 37 bits. OVMF uses
>>>>> address width early in PEI phase to make DXE identity pages covering
>>>>> the whole addressable space so it needs to know the last address it needs
>>>>> to cover but at the same time not overdo the mappings.
>>>>>
>>>>> As there is seemingly no other way to pass or get this information in
>>>>> OVMF at this early phase (ACPI is not yet available, PCI is not yet enumerated,
>>>>> xenstore is not yet initialized) - extend the info structure with a new
>>>>> table. Since the structure was initially created to be extendable -
>>>>> the change is backward compatible.
>>>>
>>>> How does UEFI handle the same situation on baremetal? I'd guess it is
>>>> in even more trouble there, as it couldn't even read addresses from
>>>> BARs, but would first need to assign them (or at least calculate
>>>> their intended positions).
>>>
>>> Maybe Laszlo or Anthony could answer this question quickly while I'm investigating?
>>
>> On the bare metal, the phys address width of the processor is known.
> 
> From CPUID I suppose.
> 
>> OVMF does the whole calculation in reverse because there's no way for it
>> to know the physical address width of the physical (= host) CPU.
>> "Overdoing" the mappings doesn't only waste resources, it breaks hard
>> with EPT -- access to a GPA that is inexpressible with the phys address
>> width of the host CPU (= not mappable successfully with the nested page
>> tables) will behave super bad. I don't recall the exact symptoms, but it
>> prevents booting the guest OS.
>>
>> This is why the most conservative 36-bit width is assumed by default.
> 
> IOW you don't trust virtualized CPUID output?

That's correct; it's not trustworthy / reliable.

One of the discussions (of the many) is here:

https://lists.gnu.org/archive/html/qemu-devel/2020-06/msg04716.html

Thanks
Laszlo
Jan Beulich Jan. 11, 2021, 3:31 p.m. UTC | #9
On 11.01.2021 16:26, Igor Druzhinin wrote:
> Another problem that I faced while experimenting is that creating page
> tables for 46-bits (that CPUID returned in my case) of address space takes
> about a minute on a modern CPU.

Which probably isn't fundamentally different from bare metal?

Jan
Laszlo Ersek Jan. 11, 2021, 3:35 p.m. UTC | #10
On 01/11/21 16:26, Igor Druzhinin wrote:
> On 11/01/2021 15:21, Jan Beulich wrote:
>> On 11.01.2021 15:49, Laszlo Ersek wrote:
>>> On 01/11/21 15:00, Igor Druzhinin wrote:
>>>> On 11/01/2021 09:27, Jan Beulich wrote:
>>>>> On 11.01.2021 05:53, Igor Druzhinin wrote:
>>>>>> We faced a problem with passing through a PCI device with 64GB BAR to
>>>>>> UEFI guest. The BAR is expectedly programmed into 64-bit PCI aperture at
>>>>>> 64G address which pushes physical address space to 37 bits. OVMF uses
>>>>>> address width early in PEI phase to make DXE identity pages covering
>>>>>> the whole addressable space so it needs to know the last address it needs
>>>>>> to cover but at the same time not overdo the mappings.
>>>>>>
>>>>>> As there is seemingly no other way to pass or get this information in
>>>>>> OVMF at this early phase (ACPI is not yet available, PCI is not yet enumerated,
>>>>>> xenstore is not yet initialized) - extend the info structure with a new
>>>>>> table. Since the structure was initially created to be extendable -
>>>>>> the change is backward compatible.
>>>>>
>>>>> How does UEFI handle the same situation on baremetal? I'd guess it is
>>>>> in even more trouble there, as it couldn't even read addresses from
>>>>> BARs, but would first need to assign them (or at least calculate
>>>>> their intended positions).
>>>>
>>>> Maybe Laszlo or Anthony could answer this question quickly while I'm investigating?
>>>
>>> On the bare metal, the phys address width of the processor is known.
>>
>> From CPUID I suppose.
>>
>>> OVMF does the whole calculation in reverse because there's no way for it
>>> to know the physical address width of the physical (= host) CPU.
>>> "Overdoing" the mappings doesn't only waste resources, it breaks hard
>>> with EPT -- access to a GPA that is inexpressible with the phys address
>>> width of the host CPU (= not mappable successfully with the nested page
>>> tables) will behave super bad. I don't recall the exact symptoms, but it
>>> prevents booting the guest OS.
>>>
>>> This is why the most conservative 36-bit width is assumed by default.
>>
>> IOW you don't trust virtualized CPUID output?
> 
> I'm discussing this with Andrew and it appears we're certainly more lax in
> wiring physical address width into the guest from hardware directly rather
> than KVM.
> 
> Another problem that I faced while experimenting is that creating page
> tables for 46-bits (that CPUID returned in my case) of address space takes
> about a minute on a modern CPU.

Even if you enable 1GiB pages?

(In the libvirt domain XML, it's expressed as

    <feature policy='require' name='pdpe1gb'/>
)

... I'm not doubtful, just curious. I guess that, when the physical
address width is so large, a physical UEFI platform firmware will limit
itself to a lesser width -- it could even offer some knobs in the setup TUI.

Thanks,
Laszlo

Igor Druzhinin Jan. 11, 2021, 4:31 p.m. UTC | #11
On 11/01/2021 15:35, Laszlo Ersek wrote:
> On 01/11/21 16:26, Igor Druzhinin wrote:
>> On 11/01/2021 15:21, Jan Beulich wrote:
>>> On 11.01.2021 15:49, Laszlo Ersek wrote:
>>>> On 01/11/21 15:00, Igor Druzhinin wrote:
>>>>> On 11/01/2021 09:27, Jan Beulich wrote:
>>>>>> On 11.01.2021 05:53, Igor Druzhinin wrote:
>>>>>>> We faced a problem with passing through a PCI device with 64GB BAR to
>>>>>>> UEFI guest. The BAR is expectedly programmed into 64-bit PCI aperture at
>>>>>>> 64G address which pushes physical address space to 37 bits. OVMF uses
>>>>>>> address width early in PEI phase to make DXE identity pages covering
>>>>>>> the whole addressable space so it needs to know the last address it needs
>>>>>>> to cover but at the same time not overdo the mappings.
>>>>>>>
>>>>>>> As there is seemingly no other way to pass or get this information in
>>>>>>> OVMF at this early phase (ACPI is not yet available, PCI is not yet enumerated,
>>>>>>> xenstore is not yet initialized) - extend the info structure with a new
>>>>>>> table. Since the structure was initially created to be extendable -
>>>>>>> the change is backward compatible.
>>>>>>
>>>>>> How does UEFI handle the same situation on baremetal? I'd guess it is
>>>>>> in even more trouble there, as it couldn't even read addresses from
>>>>>> BARs, but would first need to assign them (or at least calculate
>>>>>> their intended positions).
>>>>>
>>>>> Maybe Laszlo or Anthony could answer this question quickly while I'm investigating?
>>>>
>>>> On the bare metal, the phys address width of the processor is known.
>>>
>>> From CPUID I suppose.
>>>
>>>> OVMF does the whole calculation in reverse because there's no way for it
>>>> to know the physical address width of the physical (= host) CPU.
>>>> "Overdoing" the mappings doesn't only waste resources, it breaks hard
>>>> with EPT -- access to a GPA that is inexpressible with the phys address
>>>> width of the host CPU (= not mappable successfully with the nested page
>>>> tables) will behave super bad. I don't recall the exact symptoms, but it
>>>> prevents booting the guest OS.
>>>>
>>>> This is why the most conservative 36-bit width is assumed by default.
>>>
>>> IOW you don't trust virtualized CPUID output?
>>
>> I'm discussing this with Andrew and it appears we're certainly more lax in
>> wiring physical address width into the guest from hardware directly rather
>> than KVM.
>>
>> Another problem that I faced while experimenting is that creating page
>> tables for 46-bits (that CPUID returned in my case) of address space takes
>> about a minute on a modern CPU.
> 
> Even if you enable 1GiB pages?
> 
> (In the libvirt domain XML, it's expressed as
> 
>     <feature policy='require' name='pdpe1gb'/>
> )
> 
> ... I'm not doubtful, just curious. I guess that, when the physical
> address width is so large, a physical UEFI platform firmware will limit
> itself to a lesser width -- it could even offer some knobs in the setup TUI.

So the culprit wasn't the feature bit, which we expose by default in Xen, but the OVMF
configuration, which had 1G pages disabled for that use. I enabled them and the guest now
boots in reasonable time even with 46 bits.
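
For a sense of scale, here is a rough sketch of how many 4 KiB page-table pages an
identity map needs. It assumes conventional x86-64 4-level paging with 512 entries per
table; the function name and parameters are hypothetical and this is not the OVMF code.

```c
#include <stdint.h>

/* Assumption: every paging structure holds 512 eight-byte entries (4 KiB). */
#define ENTRIES_PER_TABLE 512ULL

static uint64_t div_round_up(uint64_t n, uint64_t d)
{
    return (n + d - 1) / d;
}

/*
 * Count the 4 KiB page-table pages needed to identity-map 2^addr_bits bytes
 * with 4-level paging.
 *
 * leaf_shift is the log2 of the leaf page size: 21 for 2 MiB pages,
 * 30 for 1 GiB pages. The caller must keep leaf_shift <= addr_bits <= 48.
 */
uint64_t identity_map_table_pages(unsigned addr_bits, unsigned leaf_shift)
{
    uint64_t pages = 0;
    uint64_t entries = 1ULL << (addr_bits - leaf_shift); /* leaf entries */
    unsigned shift;

    /* Walk upwards: each level adds 9 bits of address coverage per entry. */
    for (shift = leaf_shift; shift < 48; shift += 9) {
        uint64_t tables = div_round_up(entries, ENTRIES_PER_TABLE);

        pages += tables;
        entries = tables; /* one entry in the next level up per table */
    }
    return pages;
}
```

Under these assumptions, 46 bits with 2 MiB leaves needs 65665 table pages (about 256 MiB
of page tables to fill in), while 1 GiB leaves need only 129, which is consistent with the
difference in boot time seen above.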

Given that we're not that sensitive in Xen to the physical address width being different,
and prefer to control that on a different level, I'd like to abandon the ABI change approach
(does anyone have any objections?) and instead take the physical address width directly from
CPUID, which we already do in hvmloader. The change would be local to the Xen platform.
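
As a minimal sketch of what taking the width from CPUID looks like: the physical address
width is reported in EAX bits 7:0 of CPUID leaf 0x80000008. The helper names and the 36-bit
fallback below are assumptions for illustration, and the code uses GCC's <cpuid.h> rather
than anything from hvmloader or OVMF.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h> /* GCC/Clang helper providing __get_cpuid() */
#endif

/* Assumption: fall back to the conservative 36 bits if the leaf is missing. */
#define FALLBACK_PHYS_BITS 36u

/* Extract the physical address width from EAX of CPUID leaf 0x80000008. */
unsigned phys_bits_from_eax(uint32_t eax)
{
    return eax & 0xffu; /* bits 7:0 */
}

/* Query the CPU (or the virtual CPU, when run inside a guest). */
unsigned query_phys_bits(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid() returns 0 when the requested leaf is not supported. */
    if (__get_cpuid(0x80000008u, &eax, &ebx, &ecx, &edx))
        return phys_bits_from_eax(eax);
#endif
    return FALLBACK_PHYS_BITS;
}
```

Inside a guest this returns whatever width the hypervisor chooses to expose, which is why
it only makes sense where the virtualized CPUID output can be trusted.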

Igor
Laszlo Ersek Jan. 11, 2021, 4:35 p.m. UTC | #12
On 01/11/21 17:31, Igor Druzhinin wrote:
> On 11/01/2021 15:35, Laszlo Ersek wrote:
>> On 01/11/21 16:26, Igor Druzhinin wrote:
>>> On 11/01/2021 15:21, Jan Beulich wrote:
>>>> On 11.01.2021 15:49, Laszlo Ersek wrote:
>>>>> On 01/11/21 15:00, Igor Druzhinin wrote:
>>>>>> On 11/01/2021 09:27, Jan Beulich wrote:
>>>>>>> On 11.01.2021 05:53, Igor Druzhinin wrote:
>>>>>>>> We faced a problem with passing through a PCI device with 64GB BAR to
>>>>>>>> UEFI guest. The BAR is expectedly programmed into 64-bit PCI aperture at
>>>>>>>> 64G address which pushes physical address space to 37 bits. OVMF uses
>>>>>>>> address width early in PEI phase to make DXE identity pages covering
>>>>>>>> the whole addressable space so it needs to know the last address it needs
>>>>>>>> to cover but at the same time not overdo the mappings.
>>>>>>>>
>>>>>>>> As there is seemingly no other way to pass or get this information in
>>>>>>>> OVMF at this early phase (ACPI is not yet available, PCI is not yet enumerated,
>>>>>>>> xenstore is not yet initialized) - extend the info structure with a new
>>>>>>>> table. Since the structure was initially created to be extendable -
>>>>>>>> the change is backward compatible.
>>>>>>>
>>>>>>> How does UEFI handle the same situation on baremetal? I'd guess it is
>>>>>>> in even more trouble there, as it couldn't even read addresses from
>>>>>>> BARs, but would first need to assign them (or at least calculate
>>>>>>> their intended positions).
>>>>>>
>>>>>> Maybe Laszlo or Anthony could answer this question quickly while I'm investigating?
>>>>>
>>>>> On the bare metal, the phys address width of the processor is known.
>>>>
>>>> From CPUID I suppose.
>>>>
>>>>> OVMF does the whole calculation in reverse because there's no way for it
>>>>> to know the physical address width of the physical (= host) CPU.
>>>>> "Overdoing" the mappings doesn't only waste resources, it breaks hard
>>>>> with EPT -- access to a GPA that is inexpressible with the phys address
>>>>> width of the host CPU (= not mappable successfully with the nested page
>>>>> tables) will behave super bad. I don't recall the exact symptoms, but it
>>>>> prevents booting the guest OS.
>>>>>
>>>>> This is why the most conservative 36-bit width is assumed by default.
>>>>
>>>> IOW you don't trust virtualized CPUID output?
>>>
>>> I'm discussing this with Andrew and it appears we're certainly more lax in
>>> wiring physical address width into the guest from hardware directly rather
>>> than KVM.
>>>
>>> Another problem that I faced while experimenting is that creating page
>>> tables for 46-bits (that CPUID returned in my case) of address space takes
>>> about a minute on a modern CPU.
>>
>> Even if you enable 1GiB pages?
>>
>> (In the libvirt domain XML, it's expressed as
>>
>>     <feature policy='require' name='pdpe1gb'/>
>> )
>>
>> ... I'm not doubtful, just curious. I guess that, when the physical
>> address width is so large, a physical UEFI platform firmware will limit
>> itself to a lesser width -- it could even offer some knobs in the setup TUI.
> 
> So the culprit wasn't the feature bit, which we expose by default in Xen, but the OVMF
> configuration, which had 1G pages disabled for that use. I enabled them and the guest now
> boots in reasonable time even with 46 bits.
> 
> Given that we're not that sensitive in Xen to the physical address width being different,
> and prefer to control that on a different level, I'd like to abandon the ABI change approach
> (does anyone have any objections?) and instead take the physical address width directly from
> CPUID, which we already do in hvmloader. The change would be local to the Xen platform.

Yes, as long as you limit the approach to "OvmfPkg/XenPlatformPei" (or,
more generally, to the "OvmfPkg/OvmfXen.dsc" platform), it makes perfect
sense.

Thanks!
Laszlo

Patch

diff --git a/tools/firmware/hvmloader/ovmf.c b/tools/firmware/hvmloader/ovmf.c
index 23610a0..9bfe274 100644
--- a/tools/firmware/hvmloader/ovmf.c
+++ b/tools/firmware/hvmloader/ovmf.c
@@ -61,6 +61,14 @@  struct ovmf_info {
     uint32_t e820_nr;
 } __attribute__ ((packed));
 
+#define OVMF_INFO_PCI_TABLE 0
+struct ovmf_pci_info {
+    uint64_t low_start;
+    uint64_t low_end;
+    uint64_t hi_start;
+    uint64_t hi_end;
+} __attribute__ ((packed));
+
 static void ovmf_setup_bios_info(void)
 {
     struct ovmf_info *info = (void *)OVMF_INFO_PHYSICAL_ADDRESS;
@@ -74,9 +82,21 @@  static void ovmf_setup_bios_info(void)
 static void ovmf_finish_bios_info(void)
 {
     struct ovmf_info *info = (void *)OVMF_INFO_PHYSICAL_ADDRESS;
+    struct ovmf_pci_info *pci_info;
+    uint64_t *tables = scratch_alloc(sizeof(uint64_t)*OVMF_INFO_MAX_TABLES, 0);
     uint32_t i;
     uint8_t checksum;
 
+    pci_info = scratch_alloc(sizeof(struct ovmf_pci_info), 0);
+    pci_info->low_start = pci_mem_start;
+    pci_info->low_end = pci_mem_end;
+    pci_info->hi_start = pci_hi_mem_start;
+    pci_info->hi_end = pci_hi_mem_end;
+
+    tables[OVMF_INFO_PCI_TABLE] = (uint32_t)pci_info;
+    info->tables = (uint32_t)tables;
+    info->tables_nr = 1;
+
     checksum = 0;
     for ( i = 0; i < info->length; i++ )
         checksum += ((uint8_t *)(info))[i];