diff mbox series

OvmfPkg/XenPlatformPei: Use CPUID to get physical address width on Xen

Message ID 1610509335-23314-1-git-send-email-igor.druzhinin@citrix.com (mailing list archive)
State New, archived
Headers show
Series OvmfPkg/XenPlatformPei: Use CPUID to get physical address width on Xen | expand

Commit Message

Igor Druzhinin Jan. 13, 2021, 3:42 a.m. UTC
We faced a problem with passing through a PCI device with 64GB BAR to
UEFI guest. The BAR is expectedly programmed into 64-bit PCI aperture at
64G address which pushes physical address space to 37 bits. That is above
36-bit width that OVMF exposes currently to a guest without tweaking
PcdPciMmio64Size knob.

The reverse calculation using this knob was inhereted from QEMU-KVM platform
code where it serves the purpose of finding max accessible physical
address without necessary trusting emulated CPUID physbits value (that could
be different from host physbits). On Xen we expect to use CPUID policy
to level the data correctly to prevent situations with guest physbits >
host physbits e.g. across migrations.

The next aspect raising concern - resource consumption for DXE IPL page tables
and time required to map the whole address space in case of using CPUID
bits directly. That could be mitigated by enabling support for 1G pages
in DXE IPL configuration. 1G pages are available on most CPUs produced in
the last 10 years and those without don't have many phys bits.

Remove all the redundant code now (including PcdPciMmio64.. handling that's
not used on Xen anyway) and grab physbits directly from CPUID that should
be what baremetal UEFI systems do.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
---
 OvmfPkg/OvmfXen.dsc                |   3 +
 OvmfPkg/XenPlatformPei/MemDetect.c | 166 +++----------------------------------
 2 files changed, 15 insertions(+), 154 deletions(-)

Comments

Laszlo Ersek Jan. 13, 2021, 3:55 a.m. UTC | #1
On 01/13/21 04:42, Igor Druzhinin wrote:
> We faced a problem with passing through a PCI device with 64GB BAR to
> UEFI guest. The BAR is expectedly programmed into 64-bit PCI aperture at
> 64G address which pushes physical address space to 37 bits. That is above
> 36-bit width that OVMF exposes currently to a guest without tweaking
> PcdPciMmio64Size knob.
> 
> The reverse calculation using this knob was inhereted from QEMU-KVM platform
> code where it serves the purpose of finding max accessible physical
> address without necessary trusting emulated CPUID physbits value (that could
> be different from host physbits). On Xen we expect to use CPUID policy
> to level the data correctly to prevent situations with guest physbits >
> host physbits e.g. across migrations.
> 
> The next aspect raising concern - resource consumption for DXE IPL page tables
> and time required to map the whole address space in case of using CPUID
> bits directly. That could be mitigated by enabling support for 1G pages
> in DXE IPL configuration. 1G pages are available on most CPUs produced in
> the last 10 years and those without don't have many phys bits.
> 
> Remove all the redundant code now (including PcdPciMmio64.. handling that's
> not used on Xen anyway) and grab physbits directly from CPUID that should
> be what baremetal UEFI systems do.
> 
> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
> ---
>  OvmfPkg/OvmfXen.dsc                |   3 +
>  OvmfPkg/XenPlatformPei/MemDetect.c | 166 +++----------------------------------
>  2 files changed, 15 insertions(+), 154 deletions(-)

Acked-by: Laszlo Ersek <lersek@redhat.com>

Anthony and/or Julien should also approve this patch (and anyone from
the Xen community is welcome to comment of course, as always).

Thanks
Laszlo


> diff --git a/OvmfPkg/OvmfXen.dsc b/OvmfPkg/OvmfXen.dsc
> index 7d31e88..8ae6ed0 100644
> --- a/OvmfPkg/OvmfXen.dsc
> +++ b/OvmfPkg/OvmfXen.dsc
> @@ -444,6 +444,9 @@
>    ## Xen vlapic's frequence is 100 MHz
>    gEfiMdePkgTokenSpaceGuid.PcdFSBClock|100000000
>  
> +  # We populate DXE IPL tables with 1G pages preferably on Xen
> +  gEfiMdeModulePkgTokenSpaceGuid.PcdUse1GPageTable|TRUE
> +
>  ################################################################################
>  #
>  # Pcd Dynamic Section - list of all EDK II PCD Entries defined by this Platform
> diff --git a/OvmfPkg/XenPlatformPei/MemDetect.c b/OvmfPkg/XenPlatformPei/MemDetect.c
> index 1f81eee..1970b63 100644
> --- a/OvmfPkg/XenPlatformPei/MemDetect.c
> +++ b/OvmfPkg/XenPlatformPei/MemDetect.c
> @@ -172,175 +172,33 @@ GetSystemMemorySizeBelow4gb (
>    return (UINT32) (((UINTN)((Cmos0x35 << 8) + Cmos0x34) << 16) + SIZE_16MB);
>  }
>  
> -
> -STATIC
> -UINT64
> -GetSystemMemorySizeAbove4gb (
> -  )
> -{
> -  UINT32 Size;
> -  UINTN  CmosIndex;
> -
> -  //
> -  // In PVH case, there is no CMOS, we have to calculate the memory size
> -  // from parsing the E820
> -  //
> -  if (XenPvhDetected ()) {
> -    UINT64  HighestAddress;
> -
> -    HighestAddress = GetHighestSystemMemoryAddress (FALSE);
> -    ASSERT (HighestAddress == 0 || HighestAddress >= BASE_4GB);
> -
> -    if (HighestAddress >= BASE_4GB) {
> -      HighestAddress -= BASE_4GB;
> -    }
> -
> -    return HighestAddress;
> -  }
> -
> -  //
> -  // CMOS 0x5b-0x5d specifies the system memory above 4GB MB.
> -  // * CMOS(0x5d) is the most significant size byte
> -  // * CMOS(0x5c) is the middle size byte
> -  // * CMOS(0x5b) is the least significant size byte
> -  // * The size is specified in 64kb chunks
> -  //
> -
> -  Size = 0;
> -  for (CmosIndex = 0x5d; CmosIndex >= 0x5b; CmosIndex--) {
> -    Size = (UINT32) (Size << 8) + (UINT32) CmosRead8 (CmosIndex);
> -  }
> -
> -  return LShiftU64 (Size, 16);
> -}
> -
> -
> -/**
> -  Return the highest address that DXE could possibly use, plus one.
> -**/
> -STATIC
> -UINT64
> -GetFirstNonAddress (
> -  VOID
> -  )
> -{
> -  UINT64               FirstNonAddress;
> -  UINT64               Pci64Base, Pci64Size;
> -  RETURN_STATUS        PcdStatus;
> -
> -  FirstNonAddress = BASE_4GB + GetSystemMemorySizeAbove4gb ();
> -
> -  //
> -  // If DXE is 32-bit, then we're done; PciBusDxe will degrade 64-bit MMIO
> -  // resources to 32-bit anyway. See DegradeResource() in
> -  // "PciResourceSupport.c".
> -  //
> -#ifdef MDE_CPU_IA32
> -  if (!FeaturePcdGet (PcdDxeIplSwitchToLongMode)) {
> -    return FirstNonAddress;
> -  }
> -#endif
> -
> -  //
> -  // Otherwise, in order to calculate the highest address plus one, we must
> -  // consider the 64-bit PCI host aperture too. Fetch the default size.
> -  //
> -  Pci64Size = PcdGet64 (PcdPciMmio64Size);
> -
> -  if (Pci64Size == 0) {
> -    if (mBootMode != BOOT_ON_S3_RESUME) {
> -      DEBUG ((DEBUG_INFO, "%a: disabling 64-bit PCI host aperture\n",
> -        __FUNCTION__));
> -      PcdStatus = PcdSet64S (PcdPciMmio64Size, 0);
> -      ASSERT_RETURN_ERROR (PcdStatus);
> -    }
> -
> -    //
> -    // There's nothing more to do; the amount of memory above 4GB fully
> -    // determines the highest address plus one. The memory hotplug area (see
> -    // below) plays no role for the firmware in this case.
> -    //
> -    return FirstNonAddress;
> -  }
> -
> -  //
> -  // SeaBIOS aligns both boundaries of the 64-bit PCI host aperture to 1GB, so
> -  // that the host can map it with 1GB hugepages. Follow suit.
> -  //
> -  Pci64Base = ALIGN_VALUE (FirstNonAddress, (UINT64)SIZE_1GB);
> -  Pci64Size = ALIGN_VALUE (Pci64Size, (UINT64)SIZE_1GB);
> -
> -  //
> -  // The 64-bit PCI host aperture should also be "naturally" aligned. The
> -  // alignment is determined by rounding the size of the aperture down to the
> -  // next smaller or equal power of two. That is, align the aperture by the
> -  // largest BAR size that can fit into it.
> -  //
> -  Pci64Base = ALIGN_VALUE (Pci64Base, GetPowerOfTwo64 (Pci64Size));
> -
> -  if (mBootMode != BOOT_ON_S3_RESUME) {
> -    //
> -    // The core PciHostBridgeDxe driver will automatically add this range to
> -    // the GCD memory space map through our PciHostBridgeLib instance; here we
> -    // only need to set the PCDs.
> -    //
> -    PcdStatus = PcdSet64S (PcdPciMmio64Base, Pci64Base);
> -    ASSERT_RETURN_ERROR (PcdStatus);
> -    PcdStatus = PcdSet64S (PcdPciMmio64Size, Pci64Size);
> -    ASSERT_RETURN_ERROR (PcdStatus);
> -
> -    DEBUG ((DEBUG_INFO, "%a: Pci64Base=0x%Lx Pci64Size=0x%Lx\n",
> -      __FUNCTION__, Pci64Base, Pci64Size));
> -  }
> -
> -  //
> -  // The useful address space ends with the 64-bit PCI host aperture.
> -  //
> -  FirstNonAddress = Pci64Base + Pci64Size;
> -  return FirstNonAddress;
> -}
> -
> -
>  /**
> -  Initialize the mPhysMemAddressWidth variable, based on guest RAM size.
> +  Initialize the mPhysMemAddressWidth variable, based on CPUID data.
>  **/
>  VOID
>  AddressWidthInitialization (
>    VOID
>    )
>  {
> -  UINT64 FirstNonAddress;
> +  UINT32 RegEax;
>  
> -  //
> -  // As guest-physical memory size grows, the permanent PEI RAM requirements
> -  // are dominated by the identity-mapping page tables built by the DXE IPL.
> -  // The DXL IPL keys off of the physical address bits advertized in the CPU
> -  // HOB. To conserve memory, we calculate the minimum address width here.
> -  //
> -  FirstNonAddress      = GetFirstNonAddress ();
> -  mPhysMemAddressWidth = (UINT8)HighBitSet64 (FirstNonAddress);
> -
> -  //
> -  // If FirstNonAddress is not an integral power of two, then we need an
> -  // additional bit.
> -  //
> -  if ((FirstNonAddress & (FirstNonAddress - 1)) != 0) {
> -    ++mPhysMemAddressWidth;
> +  AsmCpuid (0x80000000, &RegEax, NULL, NULL, NULL);
> +  if (RegEax >= 0x80000008) {
> +    AsmCpuid (0x80000008, &RegEax, NULL, NULL, NULL);
> +    mPhysMemAddressWidth = (UINT8) RegEax;
> +  } else {
> +    mPhysMemAddressWidth = 36;
>    }
>  
>    //
> -  // The minimum address width is 36 (covers up to and excluding 64 GB, which
> -  // is the maximum for Ia32 + PAE). The theoretical architecture maximum for
> -  // X64 long mode is 52 bits, but the DXE IPL clamps that down to 48 bits. We
> -  // can simply assert that here, since 48 bits are good enough for 256 TB.
> +  // IA-32e paging translates 48-bit linear addresses to 52-bit physical addresses.
>    //
> -  if (mPhysMemAddressWidth <= 36) {
> -    mPhysMemAddressWidth = 36;
> +  ASSERT (mPhysMemAddressWidth <= 52);
> +  if (mPhysMemAddressWidth > 48) {
> +    mPhysMemAddressWidth = 48;
>    }
> -  ASSERT (mPhysMemAddressWidth <= 48);
>  }
>  
> -
>  /**
>    Calculate the cap for the permanent PEI memory.
>  **/
>
Julien Grall Jan. 19, 2021, 9:37 a.m. UTC | #2
Hi Igor,

On 13/01/2021 03:42, Igor Druzhinin wrote:
> We faced a problem with passing through a PCI device with 64GB BAR to
> UEFI guest. The BAR is expectedly programmed into 64-bit PCI aperture at
> 64G address which pushes physical address space to 37 bits. That is above
> 36-bit width that OVMF exposes currently to a guest without tweaking
> PcdPciMmio64Size knob.
> 
> The reverse calculation using this knob was inhereted from QEMU-KVM platform
> code where it serves the purpose of finding max accessible physical
> address without necessary trusting emulated CPUID physbits value (that could
> be different from host physbits). On Xen we expect to use CPUID policy
> to level the data correctly to prevent situations with guest physbits >
> host physbits e.g. across migrations.
> 
> The next aspect raising concern - resource consumption for DXE IPL page tables
> and time required to map the whole address space in case of using CPUID
> bits directly. That could be mitigated by enabling support for 1G pages
> in DXE IPL configuration. 1G pages are available on most CPUs produced in
> the last 10 years and those without don't have many phys bits.
> 
> Remove all the redundant code now (including PcdPciMmio64.. handling that's
> not used on Xen anyway) and grab physbits directly from CPUID that should
> be what baremetal UEFI systems do.
> 
> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>

Reviewed-by: Julien Grall <julien@xen.org>

Cheers,

> ---
>   OvmfPkg/OvmfXen.dsc                |   3 +
>   OvmfPkg/XenPlatformPei/MemDetect.c | 166 +++----------------------------------
>   2 files changed, 15 insertions(+), 154 deletions(-)
> 
> diff --git a/OvmfPkg/OvmfXen.dsc b/OvmfPkg/OvmfXen.dsc
> index 7d31e88..8ae6ed0 100644
> --- a/OvmfPkg/OvmfXen.dsc
> +++ b/OvmfPkg/OvmfXen.dsc
> @@ -444,6 +444,9 @@
>     ## Xen vlapic's frequence is 100 MHz
>     gEfiMdePkgTokenSpaceGuid.PcdFSBClock|100000000
>   
> +  # We populate DXE IPL tables with 1G pages preferably on Xen
> +  gEfiMdeModulePkgTokenSpaceGuid.PcdUse1GPageTable|TRUE
> +
>   ################################################################################
>   #
>   # Pcd Dynamic Section - list of all EDK II PCD Entries defined by this Platform
> diff --git a/OvmfPkg/XenPlatformPei/MemDetect.c b/OvmfPkg/XenPlatformPei/MemDetect.c
> index 1f81eee..1970b63 100644
> --- a/OvmfPkg/XenPlatformPei/MemDetect.c
> +++ b/OvmfPkg/XenPlatformPei/MemDetect.c
> @@ -172,175 +172,33 @@ GetSystemMemorySizeBelow4gb (
>     return (UINT32) (((UINTN)((Cmos0x35 << 8) + Cmos0x34) << 16) + SIZE_16MB);
>   }
>   
> -
> -STATIC
> -UINT64
> -GetSystemMemorySizeAbove4gb (
> -  )
> -{
> -  UINT32 Size;
> -  UINTN  CmosIndex;
> -
> -  //
> -  // In PVH case, there is no CMOS, we have to calculate the memory size
> -  // from parsing the E820
> -  //
> -  if (XenPvhDetected ()) {
> -    UINT64  HighestAddress;
> -
> -    HighestAddress = GetHighestSystemMemoryAddress (FALSE);
> -    ASSERT (HighestAddress == 0 || HighestAddress >= BASE_4GB);
> -
> -    if (HighestAddress >= BASE_4GB) {
> -      HighestAddress -= BASE_4GB;
> -    }
> -
> -    return HighestAddress;
> -  }
> -
> -  //
> -  // CMOS 0x5b-0x5d specifies the system memory above 4GB MB.
> -  // * CMOS(0x5d) is the most significant size byte
> -  // * CMOS(0x5c) is the middle size byte
> -  // * CMOS(0x5b) is the least significant size byte
> -  // * The size is specified in 64kb chunks
> -  //
> -
> -  Size = 0;
> -  for (CmosIndex = 0x5d; CmosIndex >= 0x5b; CmosIndex--) {
> -    Size = (UINT32) (Size << 8) + (UINT32) CmosRead8 (CmosIndex);
> -  }
> -
> -  return LShiftU64 (Size, 16);
> -}
> -
> -
> -/**
> -  Return the highest address that DXE could possibly use, plus one.
> -**/
> -STATIC
> -UINT64
> -GetFirstNonAddress (
> -  VOID
> -  )
> -{
> -  UINT64               FirstNonAddress;
> -  UINT64               Pci64Base, Pci64Size;
> -  RETURN_STATUS        PcdStatus;
> -
> -  FirstNonAddress = BASE_4GB + GetSystemMemorySizeAbove4gb ();
> -
> -  //
> -  // If DXE is 32-bit, then we're done; PciBusDxe will degrade 64-bit MMIO
> -  // resources to 32-bit anyway. See DegradeResource() in
> -  // "PciResourceSupport.c".
> -  //
> -#ifdef MDE_CPU_IA32
> -  if (!FeaturePcdGet (PcdDxeIplSwitchToLongMode)) {
> -    return FirstNonAddress;
> -  }
> -#endif
> -
> -  //
> -  // Otherwise, in order to calculate the highest address plus one, we must
> -  // consider the 64-bit PCI host aperture too. Fetch the default size.
> -  //
> -  Pci64Size = PcdGet64 (PcdPciMmio64Size);
> -
> -  if (Pci64Size == 0) {
> -    if (mBootMode != BOOT_ON_S3_RESUME) {
> -      DEBUG ((DEBUG_INFO, "%a: disabling 64-bit PCI host aperture\n",
> -        __FUNCTION__));
> -      PcdStatus = PcdSet64S (PcdPciMmio64Size, 0);
> -      ASSERT_RETURN_ERROR (PcdStatus);
> -    }
> -
> -    //
> -    // There's nothing more to do; the amount of memory above 4GB fully
> -    // determines the highest address plus one. The memory hotplug area (see
> -    // below) plays no role for the firmware in this case.
> -    //
> -    return FirstNonAddress;
> -  }
> -
> -  //
> -  // SeaBIOS aligns both boundaries of the 64-bit PCI host aperture to 1GB, so
> -  // that the host can map it with 1GB hugepages. Follow suit.
> -  //
> -  Pci64Base = ALIGN_VALUE (FirstNonAddress, (UINT64)SIZE_1GB);
> -  Pci64Size = ALIGN_VALUE (Pci64Size, (UINT64)SIZE_1GB);
> -
> -  //
> -  // The 64-bit PCI host aperture should also be "naturally" aligned. The
> -  // alignment is determined by rounding the size of the aperture down to the
> -  // next smaller or equal power of two. That is, align the aperture by the
> -  // largest BAR size that can fit into it.
> -  //
> -  Pci64Base = ALIGN_VALUE (Pci64Base, GetPowerOfTwo64 (Pci64Size));
> -
> -  if (mBootMode != BOOT_ON_S3_RESUME) {
> -    //
> -    // The core PciHostBridgeDxe driver will automatically add this range to
> -    // the GCD memory space map through our PciHostBridgeLib instance; here we
> -    // only need to set the PCDs.
> -    //
> -    PcdStatus = PcdSet64S (PcdPciMmio64Base, Pci64Base);
> -    ASSERT_RETURN_ERROR (PcdStatus);
> -    PcdStatus = PcdSet64S (PcdPciMmio64Size, Pci64Size);
> -    ASSERT_RETURN_ERROR (PcdStatus);
> -
> -    DEBUG ((DEBUG_INFO, "%a: Pci64Base=0x%Lx Pci64Size=0x%Lx\n",
> -      __FUNCTION__, Pci64Base, Pci64Size));
> -  }
> -
> -  //
> -  // The useful address space ends with the 64-bit PCI host aperture.
> -  //
> -  FirstNonAddress = Pci64Base + Pci64Size;
> -  return FirstNonAddress;
> -}
> -
> -
>   /**
> -  Initialize the mPhysMemAddressWidth variable, based on guest RAM size.
> +  Initialize the mPhysMemAddressWidth variable, based on CPUID data.
>   **/
>   VOID
>   AddressWidthInitialization (
>     VOID
>     )
>   {
> -  UINT64 FirstNonAddress;
> +  UINT32 RegEax;
>   
> -  //
> -  // As guest-physical memory size grows, the permanent PEI RAM requirements
> -  // are dominated by the identity-mapping page tables built by the DXE IPL.
> -  // The DXL IPL keys off of the physical address bits advertized in the CPU
> -  // HOB. To conserve memory, we calculate the minimum address width here.
> -  //
> -  FirstNonAddress      = GetFirstNonAddress ();
> -  mPhysMemAddressWidth = (UINT8)HighBitSet64 (FirstNonAddress);
> -
> -  //
> -  // If FirstNonAddress is not an integral power of two, then we need an
> -  // additional bit.
> -  //
> -  if ((FirstNonAddress & (FirstNonAddress - 1)) != 0) {
> -    ++mPhysMemAddressWidth;
> +  AsmCpuid (0x80000000, &RegEax, NULL, NULL, NULL);
> +  if (RegEax >= 0x80000008) {
> +    AsmCpuid (0x80000008, &RegEax, NULL, NULL, NULL);
> +    mPhysMemAddressWidth = (UINT8) RegEax;
> +  } else {
> +    mPhysMemAddressWidth = 36;
>     }
>   
>     //
> -  // The minimum address width is 36 (covers up to and excluding 64 GB, which
> -  // is the maximum for Ia32 + PAE). The theoretical architecture maximum for
> -  // X64 long mode is 52 bits, but the DXE IPL clamps that down to 48 bits. We
> -  // can simply assert that here, since 48 bits are good enough for 256 TB.
> +  // IA-32e paging translates 48-bit linear addresses to 52-bit physical addresses.
>     //
> -  if (mPhysMemAddressWidth <= 36) {
> -    mPhysMemAddressWidth = 36;
> +  ASSERT (mPhysMemAddressWidth <= 52);
> +  if (mPhysMemAddressWidth > 48) {
> +    mPhysMemAddressWidth = 48;
>     }
> -  ASSERT (mPhysMemAddressWidth <= 48);
>   }
>   
> -
>   /**
>     Calculate the cap for the permanent PEI memory.
>   **/
>
Laszlo Ersek Jan. 19, 2021, 2:49 p.m. UTC | #3
On 01/19/21 10:37, Julien Grall wrote:
> Hi Igor,
> 
> On 13/01/2021 03:42, Igor Druzhinin wrote:
>> We faced a problem with passing through a PCI device with 64GB BAR to
>> UEFI guest. The BAR is expectedly programmed into 64-bit PCI aperture at
>> 64G address which pushes physical address space to 37 bits. That is above
>> 36-bit width that OVMF exposes currently to a guest without tweaking
>> PcdPciMmio64Size knob.
>>
>> The reverse calculation using this knob was inhereted from QEMU-KVM
>> platform
>> code where it serves the purpose of finding max accessible physical
>> address without necessary trusting emulated CPUID physbits value (that
>> could
>> be different from host physbits). On Xen we expect to use CPUID policy
>> to level the data correctly to prevent situations with guest physbits >
>> host physbits e.g. across migrations.
>>
>> The next aspect raising concern - resource consumption for DXE IPL
>> page tables
>> and time required to map the whole address space in case of using CPUID
>> bits directly. That could be mitigated by enabling support for 1G pages
>> in DXE IPL configuration. 1G pages are available on most CPUs produced in
>> the last 10 years and those without don't have many phys bits.
>>
>> Remove all the redundant code now (including PcdPciMmio64.. handling
>> that's
>> not used on Xen anyway) and grab physbits directly from CPUID that should
>> be what baremetal UEFI systems do.
>>
>> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
> 
> Reviewed-by: Julien Grall <julien@xen.org>

I'm going to hold this patch until Anthony responds in this thread as well:

http://mid.mail-archive.com/63cf6053-9787-d8cf-db18-982ebcab1780@citrix.com
https://edk2.groups.io/g/devel/message/70541

Thanks
Laszlo

> 
> Cheers,
> 
>> ---
>>   OvmfPkg/OvmfXen.dsc                |   3 +
>>   OvmfPkg/XenPlatformPei/MemDetect.c | 166
>> +++----------------------------------
>>   2 files changed, 15 insertions(+), 154 deletions(-)
>>
>> diff --git a/OvmfPkg/OvmfXen.dsc b/OvmfPkg/OvmfXen.dsc
>> index 7d31e88..8ae6ed0 100644
>> --- a/OvmfPkg/OvmfXen.dsc
>> +++ b/OvmfPkg/OvmfXen.dsc
>> @@ -444,6 +444,9 @@
>>     ## Xen vlapic's frequence is 100 MHz
>>     gEfiMdePkgTokenSpaceGuid.PcdFSBClock|100000000
>>   +  # We populate DXE IPL tables with 1G pages preferably on Xen
>> +  gEfiMdeModulePkgTokenSpaceGuid.PcdUse1GPageTable|TRUE
>> +
>>  
>> ################################################################################
>>
>>   #
>>   # Pcd Dynamic Section - list of all EDK II PCD Entries defined by
>> this Platform
>> diff --git a/OvmfPkg/XenPlatformPei/MemDetect.c
>> b/OvmfPkg/XenPlatformPei/MemDetect.c
>> index 1f81eee..1970b63 100644
>> --- a/OvmfPkg/XenPlatformPei/MemDetect.c
>> +++ b/OvmfPkg/XenPlatformPei/MemDetect.c
>> @@ -172,175 +172,33 @@ GetSystemMemorySizeBelow4gb (
>>     return (UINT32) (((UINTN)((Cmos0x35 << 8) + Cmos0x34) << 16) +
>> SIZE_16MB);
>>   }
>>   -
>> -STATIC
>> -UINT64
>> -GetSystemMemorySizeAbove4gb (
>> -  )
>> -{
>> -  UINT32 Size;
>> -  UINTN  CmosIndex;
>> -
>> -  //
>> -  // In PVH case, there is no CMOS, we have to calculate the memory size
>> -  // from parsing the E820
>> -  //
>> -  if (XenPvhDetected ()) {
>> -    UINT64  HighestAddress;
>> -
>> -    HighestAddress = GetHighestSystemMemoryAddress (FALSE);
>> -    ASSERT (HighestAddress == 0 || HighestAddress >= BASE_4GB);
>> -
>> -    if (HighestAddress >= BASE_4GB) {
>> -      HighestAddress -= BASE_4GB;
>> -    }
>> -
>> -    return HighestAddress;
>> -  }
>> -
>> -  //
>> -  // CMOS 0x5b-0x5d specifies the system memory above 4GB MB.
>> -  // * CMOS(0x5d) is the most significant size byte
>> -  // * CMOS(0x5c) is the middle size byte
>> -  // * CMOS(0x5b) is the least significant size byte
>> -  // * The size is specified in 64kb chunks
>> -  //
>> -
>> -  Size = 0;
>> -  for (CmosIndex = 0x5d; CmosIndex >= 0x5b; CmosIndex--) {
>> -    Size = (UINT32) (Size << 8) + (UINT32) CmosRead8 (CmosIndex);
>> -  }
>> -
>> -  return LShiftU64 (Size, 16);
>> -}
>> -
>> -
>> -/**
>> -  Return the highest address that DXE could possibly use, plus one.
>> -**/
>> -STATIC
>> -UINT64
>> -GetFirstNonAddress (
>> -  VOID
>> -  )
>> -{
>> -  UINT64               FirstNonAddress;
>> -  UINT64               Pci64Base, Pci64Size;
>> -  RETURN_STATUS        PcdStatus;
>> -
>> -  FirstNonAddress = BASE_4GB + GetSystemMemorySizeAbove4gb ();
>> -
>> -  //
>> -  // If DXE is 32-bit, then we're done; PciBusDxe will degrade 64-bit
>> MMIO
>> -  // resources to 32-bit anyway. See DegradeResource() in
>> -  // "PciResourceSupport.c".
>> -  //
>> -#ifdef MDE_CPU_IA32
>> -  if (!FeaturePcdGet (PcdDxeIplSwitchToLongMode)) {
>> -    return FirstNonAddress;
>> -  }
>> -#endif
>> -
>> -  //
>> -  // Otherwise, in order to calculate the highest address plus one,
>> we must
>> -  // consider the 64-bit PCI host aperture too. Fetch the default size.
>> -  //
>> -  Pci64Size = PcdGet64 (PcdPciMmio64Size);
>> -
>> -  if (Pci64Size == 0) {
>> -    if (mBootMode != BOOT_ON_S3_RESUME) {
>> -      DEBUG ((DEBUG_INFO, "%a: disabling 64-bit PCI host aperture\n",
>> -        __FUNCTION__));
>> -      PcdStatus = PcdSet64S (PcdPciMmio64Size, 0);
>> -      ASSERT_RETURN_ERROR (PcdStatus);
>> -    }
>> -
>> -    //
>> -    // There's nothing more to do; the amount of memory above 4GB fully
>> -    // determines the highest address plus one. The memory hotplug
>> area (see
>> -    // below) plays no role for the firmware in this case.
>> -    //
>> -    return FirstNonAddress;
>> -  }
>> -
>> -  //
>> -  // SeaBIOS aligns both boundaries of the 64-bit PCI host aperture
>> to 1GB, so
>> -  // that the host can map it with 1GB hugepages. Follow suit.
>> -  //
>> -  Pci64Base = ALIGN_VALUE (FirstNonAddress, (UINT64)SIZE_1GB);
>> -  Pci64Size = ALIGN_VALUE (Pci64Size, (UINT64)SIZE_1GB);
>> -
>> -  //
>> -  // The 64-bit PCI host aperture should also be "naturally" aligned.
>> The
>> -  // alignment is determined by rounding the size of the aperture
>> down to the
>> -  // next smaller or equal power of two. That is, align the aperture
>> by the
>> -  // largest BAR size that can fit into it.
>> -  //
>> -  Pci64Base = ALIGN_VALUE (Pci64Base, GetPowerOfTwo64 (Pci64Size));
>> -
>> -  if (mBootMode != BOOT_ON_S3_RESUME) {
>> -    //
>> -    // The core PciHostBridgeDxe driver will automatically add this
>> range to
>> -    // the GCD memory space map through our PciHostBridgeLib
>> instance; here we
>> -    // only need to set the PCDs.
>> -    //
>> -    PcdStatus = PcdSet64S (PcdPciMmio64Base, Pci64Base);
>> -    ASSERT_RETURN_ERROR (PcdStatus);
>> -    PcdStatus = PcdSet64S (PcdPciMmio64Size, Pci64Size);
>> -    ASSERT_RETURN_ERROR (PcdStatus);
>> -
>> -    DEBUG ((DEBUG_INFO, "%a: Pci64Base=0x%Lx Pci64Size=0x%Lx\n",
>> -      __FUNCTION__, Pci64Base, Pci64Size));
>> -  }
>> -
>> -  //
>> -  // The useful address space ends with the 64-bit PCI host aperture.
>> -  //
>> -  FirstNonAddress = Pci64Base + Pci64Size;
>> -  return FirstNonAddress;
>> -}
>> -
>> -
>>   /**
>> -  Initialize the mPhysMemAddressWidth variable, based on guest RAM size.
>> +  Initialize the mPhysMemAddressWidth variable, based on CPUID data.
>>   **/
>>   VOID
>>   AddressWidthInitialization (
>>     VOID
>>     )
>>   {
>> -  UINT64 FirstNonAddress;
>> +  UINT32 RegEax;
>>   -  //
>> -  // As guest-physical memory size grows, the permanent PEI RAM
>> requirements
>> -  // are dominated by the identity-mapping page tables built by the
>> DXE IPL.
>> -  // The DXL IPL keys off of the physical address bits advertized in
>> the CPU
>> -  // HOB. To conserve memory, we calculate the minimum address width
>> here.
>> -  //
>> -  FirstNonAddress      = GetFirstNonAddress ();
>> -  mPhysMemAddressWidth = (UINT8)HighBitSet64 (FirstNonAddress);
>> -
>> -  //
>> -  // If FirstNonAddress is not an integral power of two, then we need an
>> -  // additional bit.
>> -  //
>> -  if ((FirstNonAddress & (FirstNonAddress - 1)) != 0) {
>> -    ++mPhysMemAddressWidth;
>> +  AsmCpuid (0x80000000, &RegEax, NULL, NULL, NULL);
>> +  if (RegEax >= 0x80000008) {
>> +    AsmCpuid (0x80000008, &RegEax, NULL, NULL, NULL);
>> +    mPhysMemAddressWidth = (UINT8) RegEax;
>> +  } else {
>> +    mPhysMemAddressWidth = 36;
>>     }
>>       //
>> -  // The minimum address width is 36 (covers up to and excluding 64
>> GB, which
>> -  // is the maximum for Ia32 + PAE). The theoretical architecture
>> maximum for
>> -  // X64 long mode is 52 bits, but the DXE IPL clamps that down to 48
>> bits. We
>> -  // can simply assert that here, since 48 bits are good enough for
>> 256 TB.
>> +  // IA-32e paging translates 48-bit linear addresses to 52-bit
>> physical addresses.
>>     //
>> -  if (mPhysMemAddressWidth <= 36) {
>> -    mPhysMemAddressWidth = 36;
>> +  ASSERT (mPhysMemAddressWidth <= 52);
>> +  if (mPhysMemAddressWidth > 48) {
>> +    mPhysMemAddressWidth = 48;
>>     }
>> -  ASSERT (mPhysMemAddressWidth <= 48);
>>   }
>>   -
>>   /**
>>     Calculate the cap for the permanent PEI memory.
>>   **/
>>
>
Anthony PERARD Jan. 19, 2021, 3:52 p.m. UTC | #4
On Tue, Jan 19, 2021 at 03:49:34PM +0100, Laszlo Ersek wrote:
> On 01/19/21 10:37, Julien Grall wrote:
> > Hi Igor,
> > 
> > On 13/01/2021 03:42, Igor Druzhinin wrote:
> >> We faced a problem with passing through a PCI device with 64GB BAR to
> >> UEFI guest. The BAR is expectedly programmed into 64-bit PCI aperture at
> >> 64G address which pushes physical address space to 37 bits. That is above
> >> 36-bit width that OVMF exposes currently to a guest without tweaking
> >> PcdPciMmio64Size knob.
> >>
> >> The reverse calculation using this knob was inhereted from QEMU-KVM
> >> platform
> >> code where it serves the purpose of finding max accessible physical
> >> address without necessary trusting emulated CPUID physbits value (that
> >> could
> >> be different from host physbits). On Xen we expect to use CPUID policy
> >> to level the data correctly to prevent situations with guest physbits >
> >> host physbits e.g. across migrations.
> >>
> >> The next aspect raising concern - resource consumption for DXE IPL
> >> page tables
> >> and time required to map the whole address space in case of using CPUID
> >> bits directly. That could be mitigated by enabling support for 1G pages
> >> in DXE IPL configuration. 1G pages are available on most CPUs produced in
> >> the last 10 years and those without don't have many phys bits.
> >>
> >> Remove all the redundant code now (including PcdPciMmio64.. handling
> >> that's
> >> not used on Xen anyway) and grab physbits directly from CPUID that should
> >> be what baremetal UEFI systems do.
> >>
> >> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
> > 
> > Reviewed-by: Julien Grall <julien@xen.org>
> 
> I'm going to hold this patch until Anthony responds in this thread as well:

This new patch is fine, I didn't notice that it was a replacement for
the previous one.

Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>

Thanks,
Laszlo Ersek Jan. 19, 2021, 5:23 p.m. UTC | #5
On 01/19/21 16:52, Anthony PERARD via groups.io wrote:
> On Tue, Jan 19, 2021 at 03:49:34PM +0100, Laszlo Ersek wrote:
>> On 01/19/21 10:37, Julien Grall wrote:
>>> Hi Igor,
>>>
>>> On 13/01/2021 03:42, Igor Druzhinin wrote:
>>>> We faced a problem with passing through a PCI device with 64GB BAR to
>>>> UEFI guest. The BAR is expectedly programmed into 64-bit PCI aperture at
>>>> 64G address which pushes physical address space to 37 bits. That is above
>>>> 36-bit width that OVMF exposes currently to a guest without tweaking
>>>> PcdPciMmio64Size knob.
>>>>
>>>> The reverse calculation using this knob was inhereted from QEMU-KVM
>>>> platform
>>>> code where it serves the purpose of finding max accessible physical
>>>> address without necessary trusting emulated CPUID physbits value (that
>>>> could
>>>> be different from host physbits). On Xen we expect to use CPUID policy
>>>> to level the data correctly to prevent situations with guest physbits >
>>>> host physbits e.g. across migrations.
>>>>
>>>> The next aspect raising concern - resource consumption for DXE IPL
>>>> page tables
>>>> and time required to map the whole address space in case of using CPUID
>>>> bits directly. That could be mitigated by enabling support for 1G pages
>>>> in DXE IPL configuration. 1G pages are available on most CPUs produced in
>>>> the last 10 years and those without don't have many phys bits.
>>>>
>>>> Remove all the redundant code now (including PcdPciMmio64.. handling
>>>> that's
>>>> not used on Xen anyway) and grab physbits directly from CPUID that should
>>>> be what baremetal UEFI systems do.
>>>>
>>>> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
>>>
>>> Reviewed-by: Julien Grall <julien@xen.org>
>>
>> I'm going to hold this patch until Anthony responds in this thread as well:
> 
> This new patch is fine, I didn't notice that it was a replacement for
> the previous one.
> 
> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>

[lersek@redhat.com: fix up authorship from groups.io-mangled From line]
[lersek@redhat.com: wrap commit message at 74 characters]

Merged via <https://github.com/tianocore/edk2/pull/1369>, as commit
e68c2a22caae.

Thanks
Laszlo
diff mbox series

Patch

diff --git a/OvmfPkg/OvmfXen.dsc b/OvmfPkg/OvmfXen.dsc
index 7d31e88..8ae6ed0 100644
--- a/OvmfPkg/OvmfXen.dsc
+++ b/OvmfPkg/OvmfXen.dsc
@@ -444,6 +444,9 @@ 
   ## Xen vlapic's frequence is 100 MHz
   gEfiMdePkgTokenSpaceGuid.PcdFSBClock|100000000
 
+  # We populate DXE IPL tables with 1G pages preferably on Xen
+  gEfiMdeModulePkgTokenSpaceGuid.PcdUse1GPageTable|TRUE
+
 ################################################################################
 #
 # Pcd Dynamic Section - list of all EDK II PCD Entries defined by this Platform
diff --git a/OvmfPkg/XenPlatformPei/MemDetect.c b/OvmfPkg/XenPlatformPei/MemDetect.c
index 1f81eee..1970b63 100644
--- a/OvmfPkg/XenPlatformPei/MemDetect.c
+++ b/OvmfPkg/XenPlatformPei/MemDetect.c
@@ -172,175 +172,33 @@  GetSystemMemorySizeBelow4gb (
   return (UINT32) (((UINTN)((Cmos0x35 << 8) + Cmos0x34) << 16) + SIZE_16MB);
 }
 
-
-STATIC
-UINT64
-GetSystemMemorySizeAbove4gb (
-  )
-{
-  UINT32 Size;
-  UINTN  CmosIndex;
-
-  //
-  // In PVH case, there is no CMOS, we have to calculate the memory size
-  // from parsing the E820
-  //
-  if (XenPvhDetected ()) {
-    UINT64  HighestAddress;
-
-    HighestAddress = GetHighestSystemMemoryAddress (FALSE);
-    ASSERT (HighestAddress == 0 || HighestAddress >= BASE_4GB);
-
-    if (HighestAddress >= BASE_4GB) {
-      HighestAddress -= BASE_4GB;
-    }
-
-    return HighestAddress;
-  }
-
-  //
-  // CMOS 0x5b-0x5d specifies the system memory above 4GB MB.
-  // * CMOS(0x5d) is the most significant size byte
-  // * CMOS(0x5c) is the middle size byte
-  // * CMOS(0x5b) is the least significant size byte
-  // * The size is specified in 64kb chunks
-  //
-
-  Size = 0;
-  for (CmosIndex = 0x5d; CmosIndex >= 0x5b; CmosIndex--) {
-    Size = (UINT32) (Size << 8) + (UINT32) CmosRead8 (CmosIndex);
-  }
-
-  return LShiftU64 (Size, 16);
-}
-
-
-/**
-  Return the highest address that DXE could possibly use, plus one.
-**/
-STATIC
-UINT64
-GetFirstNonAddress (
-  VOID
-  )
-{
-  UINT64               FirstNonAddress;
-  UINT64               Pci64Base, Pci64Size;
-  RETURN_STATUS        PcdStatus;
-
-  FirstNonAddress = BASE_4GB + GetSystemMemorySizeAbove4gb ();
-
-  //
-  // If DXE is 32-bit, then we're done; PciBusDxe will degrade 64-bit MMIO
-  // resources to 32-bit anyway. See DegradeResource() in
-  // "PciResourceSupport.c".
-  //
-#ifdef MDE_CPU_IA32
-  if (!FeaturePcdGet (PcdDxeIplSwitchToLongMode)) {
-    return FirstNonAddress;
-  }
-#endif
-
-  //
-  // Otherwise, in order to calculate the highest address plus one, we must
-  // consider the 64-bit PCI host aperture too. Fetch the default size.
-  //
-  Pci64Size = PcdGet64 (PcdPciMmio64Size);
-
-  if (Pci64Size == 0) {
-    if (mBootMode != BOOT_ON_S3_RESUME) {
-      DEBUG ((DEBUG_INFO, "%a: disabling 64-bit PCI host aperture\n",
-        __FUNCTION__));
-      PcdStatus = PcdSet64S (PcdPciMmio64Size, 0);
-      ASSERT_RETURN_ERROR (PcdStatus);
-    }
-
-    //
-    // There's nothing more to do; the amount of memory above 4GB fully
-    // determines the highest address plus one. The memory hotplug area (see
-    // below) plays no role for the firmware in this case.
-    //
-    return FirstNonAddress;
-  }
-
-  //
-  // SeaBIOS aligns both boundaries of the 64-bit PCI host aperture to 1GB, so
-  // that the host can map it with 1GB hugepages. Follow suit.
-  //
-  Pci64Base = ALIGN_VALUE (FirstNonAddress, (UINT64)SIZE_1GB);
-  Pci64Size = ALIGN_VALUE (Pci64Size, (UINT64)SIZE_1GB);
-
-  //
-  // The 64-bit PCI host aperture should also be "naturally" aligned. The
-  // alignment is determined by rounding the size of the aperture down to the
-  // next smaller or equal power of two. That is, align the aperture by the
-  // largest BAR size that can fit into it.
-  //
-  Pci64Base = ALIGN_VALUE (Pci64Base, GetPowerOfTwo64 (Pci64Size));
-
-  if (mBootMode != BOOT_ON_S3_RESUME) {
-    //
-    // The core PciHostBridgeDxe driver will automatically add this range to
-    // the GCD memory space map through our PciHostBridgeLib instance; here we
-    // only need to set the PCDs.
-    //
-    PcdStatus = PcdSet64S (PcdPciMmio64Base, Pci64Base);
-    ASSERT_RETURN_ERROR (PcdStatus);
-    PcdStatus = PcdSet64S (PcdPciMmio64Size, Pci64Size);
-    ASSERT_RETURN_ERROR (PcdStatus);
-
-    DEBUG ((DEBUG_INFO, "%a: Pci64Base=0x%Lx Pci64Size=0x%Lx\n",
-      __FUNCTION__, Pci64Base, Pci64Size));
-  }
-
-  //
-  // The useful address space ends with the 64-bit PCI host aperture.
-  //
-  FirstNonAddress = Pci64Base + Pci64Size;
-  return FirstNonAddress;
-}
-
-
 /**
-  Initialize the mPhysMemAddressWidth variable, based on guest RAM size.
+  Initialize the mPhysMemAddressWidth variable, based on CPUID data.
 **/
 VOID
 AddressWidthInitialization (
   VOID
   )
 {
-  UINT64 FirstNonAddress;
+  UINT32 RegEax;
 
-  //
-  // As guest-physical memory size grows, the permanent PEI RAM requirements
-  // are dominated by the identity-mapping page tables built by the DXE IPL.
-  // The DXL IPL keys off of the physical address bits advertized in the CPU
-  // HOB. To conserve memory, we calculate the minimum address width here.
-  //
-  FirstNonAddress      = GetFirstNonAddress ();
-  mPhysMemAddressWidth = (UINT8)HighBitSet64 (FirstNonAddress);
-
-  //
-  // If FirstNonAddress is not an integral power of two, then we need an
-  // additional bit.
-  //
-  if ((FirstNonAddress & (FirstNonAddress - 1)) != 0) {
-    ++mPhysMemAddressWidth;
+  AsmCpuid (0x80000000, &RegEax, NULL, NULL, NULL);
+  if (RegEax >= 0x80000008) {
+    AsmCpuid (0x80000008, &RegEax, NULL, NULL, NULL);
+    mPhysMemAddressWidth = (UINT8) RegEax;
+  } else {
+    mPhysMemAddressWidth = 36;
   }
 
   //
-  // The minimum address width is 36 (covers up to and excluding 64 GB, which
-  // is the maximum for Ia32 + PAE). The theoretical architecture maximum for
-  // X64 long mode is 52 bits, but the DXE IPL clamps that down to 48 bits. We
-  // can simply assert that here, since 48 bits are good enough for 256 TB.
+  // IA-32e paging translates 48-bit linear addresses to 52-bit physical addresses.
   //
-  if (mPhysMemAddressWidth <= 36) {
-    mPhysMemAddressWidth = 36;
+  ASSERT (mPhysMemAddressWidth <= 52);
+  if (mPhysMemAddressWidth > 48) {
+    mPhysMemAddressWidth = 48;
   }
-  ASSERT (mPhysMemAddressWidth <= 48);
 }
 
-
 /**
   Calculate the cap for the permanent PEI memory.
 **/