mbox series

[RFC,0/5] ARM: reduce the memory consumed when mapping UEFI flash images

Message ID 20201116104216.439650-1-david.edmondson@oracle.com (mailing list archive)
Headers show
Series ARM: reduce the memory consumed when mapping UEFI flash images | expand

Message

David Edmondson Nov. 16, 2020, 10:42 a.m. UTC
Currently ARM UEFI images are typically built as 2MB/768kB flash
images for code and variables respectively. These images are both then
padded out to 64MB before being loaded by QEMU.

Because the images are 64MB each, QEMU allocates 128MB of memory to
read them, and then proceeds to read all 128MB from disk (dirtying the
memory). Of this 128MB less than 3MB is useful - the rest is zero
padding.

On a machine with 100 VMs this wastes over 12GB of memory.

This set of patches aims to reclaim the wasted memory by allowing QEMU
to respect the size of the flash images and allocate only the
necessary memory. This will, of course, require that the flash build
process be modified to avoid padding the images to 64MB.

Because existing machine types expect the full 128MB reserved for
flash to be occupied, do so for machine types older than virt-5.2. The
changes are beneficial even in this case, because while the full 128MB
of memory is allocated, only that required to actually load the flash
images from disk is touched.

David Edmondson (5):
  hw/block: blk_check_size_and_read_all should report backend name
  hw/block: Flash images can be smaller than the device
  hw/arm: Convert assertions about flash image size to error_report
  hw/arm: Flash image mapping follows image size
  hw/arm: Only minimise flash size on older machines

 hw/arm/trace-events      |  2 +
 hw/arm/virt-acpi-build.c | 30 ++++++++------
 hw/arm/virt.c            | 86 +++++++++++++++++++++++++++++-----------
 hw/block/block.c         | 26 ++++++------
 include/hw/arm/virt.h    |  2 +
 5 files changed, 97 insertions(+), 49 deletions(-)

Comments

Philippe Mathieu-Daudé Nov. 16, 2020, 11:39 a.m. UTC | #1
Hi David,

On 11/16/20 11:42 AM, David Edmondson wrote:
> Currently ARM UEFI images are typically built as 2MB/768kB flash
> images for code and variables respectively. These images are both then
> padded out to 64MB before being loaded by QEMU.
> 
> Because the images are 64MB each, QEMU allocates 128MB of memory to
> read them, and then proceeds to read all 128MB from disk (dirtying the
> memory). Of this 128MB less than 3MB is useful - the rest is zero
> padding.

2 years ago I commented the same problem, and suggested to access the
underlying storage by "block", as this is a "block storage".

Back then the response was "do not try to fix something that works".
This is why we choose the big hammer "do not accept image size not
matching device size" way.

While your series seems to help, it only postpone the same
implementation problem. If what you want is use the least memory
required, I still believe accessing the device by block is the
best approach.

Regards,

Phil.

> 
> On a machine with 100 VMs this wastes over 12GB of memory.
> 
> This set of patches aims to reclaim the wasted memory by allowing QEMU
> to respect the size of the flash images and allocate only the
> necessary memory. This will, of course, require that the flash build
> process be modified to avoid padding the images to 64MB.
> 
> Because existing machine types expect the full 128MB reserved for
> flash to be occupied, do so for machine types older than virt-5.2. The
> changes are beneficial even in this case, because while the full 128MB
> of memory is allocated, only that required to actually load the flash
> images from disk is touched.
> 
> David Edmondson (5):
>   hw/block: blk_check_size_and_read_all should report backend name
>   hw/block: Flash images can be smaller than the device
>   hw/arm: Convert assertions about flash image size to error_report
>   hw/arm: Flash image mapping follows image size
>   hw/arm: Only minimise flash size on older machines
> 
>  hw/arm/trace-events      |  2 +
>  hw/arm/virt-acpi-build.c | 30 ++++++++------
>  hw/arm/virt.c            | 86 +++++++++++++++++++++++++++++-----------
>  hw/block/block.c         | 26 ++++++------
>  include/hw/arm/virt.h    |  2 +
>  5 files changed, 97 insertions(+), 49 deletions(-)
>
David Edmondson Nov. 16, 2020, 1:43 p.m. UTC | #2
On Monday, 2020-11-16 at 12:39:46 +01, Philippe Mathieu-Daudé wrote:

> Hi David,
>
> On 11/16/20 11:42 AM, David Edmondson wrote:
>> Currently ARM UEFI images are typically built as 2MB/768kB flash
>> images for code and variables respectively. These images are both then
>> padded out to 64MB before being loaded by QEMU.
>> 
>> Because the images are 64MB each, QEMU allocates 128MB of memory to
>> read them, and then proceeds to read all 128MB from disk (dirtying the
>> memory). Of this 128MB less than 3MB is useful - the rest is zero
>> padding.
>
> 2 years ago I commented the same problem, and suggested to access the
> underlying storage by "block", as this is a "block storage".
>
> Back then the response was "do not try to fix something that works".
> This is why we choose the big hammer "do not accept image size not
> matching device size" way.
>
> While your series seems to help, it only postpone the same
> implementation problem. If what you want is use the least memory
> required, I still believe accessing the device by block is the
> best approach.

I agree that would reduce the size further, but it feels like it may be
a case of diminishing returns.

Currently the consumption for a single guest is 128MB. This change will
bring it down under 3MB, with the block approach perhaps reducing that
to zero (there will be some buffer block usage presumably, and perhaps a
small cache would make sense, so it won't really be zero).

Scaling that up for 100 guests, we're going from 12.5GB now down to
under 300MB with the changes, and again something around zero with the
block based approach.

I guess that it would mean that the machine model wouldn't have to
change, as we could return zeros for reads outside the underlying device
extent. This seems like the most interesting part - are there likely to
be any consequential *benefits* from reducing the amount of guest
address space consumed by the flash regions?

> Regards,
>
> Phil.
>
>> 
>> On a machine with 100 VMs this wastes over 12GB of memory.
>> 
>> This set of patches aims to reclaim the wasted memory by allowing QEMU
>> to respect the size of the flash images and allocate only the
>> necessary memory. This will, of course, require that the flash build
>> process be modified to avoid padding the images to 64MB.
>> 
>> Because existing machine types expect the full 128MB reserved for
>> flash to be occupied, do so for machine types older than virt-5.2. The
>> changes are beneficial even in this case, because while the full 128MB
>> of memory is allocated, only that required to actually load the flash
>> images from disk is touched.
>> 
>> David Edmondson (5):
>>   hw/block: blk_check_size_and_read_all should report backend name
>>   hw/block: Flash images can be smaller than the device
>>   hw/arm: Convert assertions about flash image size to error_report
>>   hw/arm: Flash image mapping follows image size
>>   hw/arm: Only minimise flash size on older machines
>> 
>>  hw/arm/trace-events      |  2 +
>>  hw/arm/virt-acpi-build.c | 30 ++++++++------
>>  hw/arm/virt.c            | 86 +++++++++++++++++++++++++++++-----------
>>  hw/block/block.c         | 26 ++++++------
>>  include/hw/arm/virt.h    |  2 +
>>  5 files changed, 97 insertions(+), 49 deletions(-)
>> 

dme.
Markus Armbruster Nov. 16, 2020, 1:48 p.m. UTC | #3
Philippe Mathieu-Daudé <philmd@redhat.com> writes:

> Hi David,
>
> On 11/16/20 11:42 AM, David Edmondson wrote:
>> Currently ARM UEFI images are typically built as 2MB/768kB flash
>> images for code and variables respectively. These images are both then
>> padded out to 64MB before being loaded by QEMU.
>> 
>> Because the images are 64MB each, QEMU allocates 128MB of memory to
>> read them, and then proceeds to read all 128MB from disk (dirtying the
>> memory). Of this 128MB less than 3MB is useful - the rest is zero
>> padding.
>
> 2 years ago I commented the same problem, and suggested to access the
> underlying storage by "block", as this is a "block storage".
>
> Back then the response was "do not try to fix something that works".
> This is why we choose the big hammer "do not accept image size not
> matching device size" way.
>
> While your series seems to help, it only postpone the same
> implementation problem. If what you want is use the least memory
> required, I still believe accessing the device by block is the
> best approach.

"Do not try to fix something that works" is hard to disagree with.
However, at least some users seem to disagree with "this works".  Enough
to overcome the resistance to change?
Philippe Mathieu-Daudé Nov. 19, 2020, 6:09 a.m. UTC | #4
On 11/16/20 2:48 PM, Markus Armbruster wrote:
> Philippe Mathieu-Daudé <philmd@redhat.com> writes:
> 
>> Hi David,
>>
>> On 11/16/20 11:42 AM, David Edmondson wrote:
>>> Currently ARM UEFI images are typically built as 2MB/768kB flash
>>> images for code and variables respectively. These images are both then
>>> padded out to 64MB before being loaded by QEMU.
>>>
>>> Because the images are 64MB each, QEMU allocates 128MB of memory to
>>> read them, and then proceeds to read all 128MB from disk (dirtying the
>>> memory). Of this 128MB less than 3MB is useful - the rest is zero
>>> padding.
>>
>> 2 years ago I commented the same problem, and suggested to access the
>> underlying storage by "block", as this is a "block storage".
>>
>> Back then the response was "do not try to fix something that works".
>> This is why we choose the big hammer "do not accept image size not
>> matching device size" way.
>>
>> While your series seems to help, it only postpone the same
>> implementation problem. If what you want is use the least memory
>> required, I still believe accessing the device by block is the
>> best approach.
> 
> "Do not try to fix something that works" is hard to disagree with.
> However, at least some users seem to disagree with "this works".  Enough
> to overcome the resistance to change?

Yeah, at least 3 big users so far:

- Huawei
https://www.mail-archive.com/qemu-devel@nongnu.org/msg607292.html
- Tencent
https://www.mail-archive.com/qemu-devel@nongnu.org/msg742066.html
- Oracle
(this series).

Then Huawei tried the MicroVM approach:
https://www.mail-archive.com/qemu-devel@nongnu.org/msg680103.html

I simply wanted to save David time by remembering this other approach,
since Peter already commented on David's one (see Huawei link).

Regards,

Phil.
Alex Bennée Nov. 19, 2020, 11:45 a.m. UTC | #5
Philippe Mathieu-Daudé <philmd@redhat.com> writes:

> On 11/16/20 2:48 PM, Markus Armbruster wrote:
>> Philippe Mathieu-Daudé <philmd@redhat.com> writes:
>> 
>>> Hi David,
>>>
>>> On 11/16/20 11:42 AM, David Edmondson wrote:
>>>> Currently ARM UEFI images are typically built as 2MB/768kB flash
>>>> images for code and variables respectively. These images are both then
>>>> padded out to 64MB before being loaded by QEMU.
>>>>
>>>> Because the images are 64MB each, QEMU allocates 128MB of memory to
>>>> read them, and then proceeds to read all 128MB from disk (dirtying the
>>>> memory). Of this 128MB less than 3MB is useful - the rest is zero
>>>> padding.
>>>
>>> 2 years ago I commented the same problem, and suggested to access the
>>> underlying storage by "block", as this is a "block storage".
>>>
>>> Back then the response was "do not try to fix something that works".
>>> This is why we choose the big hammer "do not accept image size not
>>> matching device size" way.
>>>
>>> While your series seems to help, it only postpone the same
>>> implementation problem. If what you want is use the least memory
>>> required, I still believe accessing the device by block is the
>>> best approach.
>> 
>> "Do not try to fix something that works" is hard to disagree with.
>> However, at least some users seem to disagree with "this works".  Enough
>> to overcome the resistance to change?
>
> Yeah, at least 3 big users so far:
>
> - Huawei
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg607292.html
> - Tencent
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg742066.html
> - Oracle
> (this series).
>
> Then Huawei tried the MicroVM approach:
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg680103.html
>
> I simply wanted to save David time by remembering this other approach,
> since Peter already commented on David's one (see Huawei link).

IIRC the two questions that came up were:

  - what would reading memory not covered by a file look like (0's or
    something more like real HW, 7f?).

  - what happens when the guest writes beyond the bounds of a backing
    file?

I'm guessing for these cloudy type applications no one cares about
persistence of EFI variables? Maybe we just need a formulation for the
second pflash which is explicit about writes being ephemeral while also
being accepted?

>
> Regards,
>
> Phil.
Philippe Mathieu-Daudé Nov. 19, 2020, 11:57 a.m. UTC | #6
+Ard & Leif for EDK2.

On 11/19/20 12:45 PM, Alex Bennée wrote:
> Philippe Mathieu-Daudé <philmd@redhat.com> writes:
>> On 11/16/20 2:48 PM, Markus Armbruster wrote:
>>> Philippe Mathieu-Daudé <philmd@redhat.com> writes:
>>>
>>>> Hi David,
>>>>
>>>> On 11/16/20 11:42 AM, David Edmondson wrote:
>>>>> Currently ARM UEFI images are typically built as 2MB/768kB flash
>>>>> images for code and variables respectively. These images are both then
>>>>> padded out to 64MB before being loaded by QEMU.
>>>>>
>>>>> Because the images are 64MB each, QEMU allocates 128MB of memory to
>>>>> read them, and then proceeds to read all 128MB from disk (dirtying the
>>>>> memory). Of this 128MB less than 3MB is useful - the rest is zero
>>>>> padding.
>>>>
>>>> 2 years ago I commented the same problem, and suggested to access the
>>>> underlying storage by "block", as this is a "block storage".
>>>>
>>>> Back then the response was "do not try to fix something that works".
>>>> This is why we choose the big hammer "do not accept image size not
>>>> matching device size" way.
>>>>
>>>> While your series seems to help, it only postpone the same
>>>> implementation problem. If what you want is use the least memory
>>>> required, I still believe accessing the device by block is the
>>>> best approach.
>>>
>>> "Do not try to fix something that works" is hard to disagree with.
>>> However, at least some users seem to disagree with "this works".  Enough
>>> to overcome the resistance to change?
>>
>> Yeah, at least 3 big users so far:
>>
>> - Huawei
>> https://www.mail-archive.com/qemu-devel@nongnu.org/msg607292.html
>> - Tencent
>> https://www.mail-archive.com/qemu-devel@nongnu.org/msg742066.html
>> - Oracle
>> (this series).
>>
>> Then Huawei tried the MicroVM approach:
>> https://www.mail-archive.com/qemu-devel@nongnu.org/msg680103.html
>>
>> I simply wanted to save David time by remembering this other approach,
>> since Peter already commented on David's one (see Huawei link).
> 
> IIRC the two questions that came up were:
> 
>   - what would reading memory not covered by a file look like (0's or
>     something more like real HW, 7f?).

For NOR flashes erased bit is high, programmed bit is low, so: 0xff.

> 
>   - what happens when the guest writes beyond the bounds of a backing
>     file?

Report an hardware error, so guest firmware have a chance to do
do something (not sure what, beside rebooting...).

> 
> I'm guessing for these cloudy type applications no one cares about
> persistence of EFI variables? Maybe we just need a formulation for the
> second pflash which is explicit about writes being ephemeral while also
> being accepted?

Someone suggested adding a new machine type to QEMU to be able to use
smaller flash for cloud usage (but I don't remember who). Then EDK2
could be built with for this new flash size.

Regards,

Phil.
David Edmondson Dec. 7, 2020, 12:07 p.m. UTC | #7
Sorry for the delay in replying - holiday then distractions.

On Thursday, 2020-11-19 at 11:45:11 GMT, Alex Bennée wrote:

> Philippe Mathieu-Daudé <philmd@redhat.com> writes:
>
>> On 11/16/20 2:48 PM, Markus Armbruster wrote:
>>> Philippe Mathieu-Daudé <philmd@redhat.com> writes:
>>> 
>>>> Hi David,
>>>>
>>>> On 11/16/20 11:42 AM, David Edmondson wrote:
>>>>> Currently ARM UEFI images are typically built as 2MB/768kB flash
>>>>> images for code and variables respectively. These images are both then
>>>>> padded out to 64MB before being loaded by QEMU.
>>>>>
>>>>> Because the images are 64MB each, QEMU allocates 128MB of memory to
>>>>> read them, and then proceeds to read all 128MB from disk (dirtying the
>>>>> memory). Of this 128MB less than 3MB is useful - the rest is zero
>>>>> padding.
>>>>
>>>> 2 years ago I commented the same problem, and suggested to access the
>>>> underlying storage by "block", as this is a "block storage".
>>>>
>>>> Back then the response was "do not try to fix something that works".
>>>> This is why we choose the big hammer "do not accept image size not
>>>> matching device size" way.
>>>>
>>>> While your series seems to help, it only postpone the same
>>>> implementation problem. If what you want is use the least memory
>>>> required, I still believe accessing the device by block is the
>>>> best approach.
>>> 
>>> "Do not try to fix something that works" is hard to disagree with.
>>> However, at least some users seem to disagree with "this works".  Enough
>>> to overcome the resistance to change?
>>
>> Yeah, at least 3 big users so far:
>>
>> - Huawei
>> https://www.mail-archive.com/qemu-devel@nongnu.org/msg607292.html
>> - Tencent
>> https://www.mail-archive.com/qemu-devel@nongnu.org/msg742066.html
>> - Oracle
>> (this series).
>>
>> Then Huawei tried the MicroVM approach:
>> https://www.mail-archive.com/qemu-devel@nongnu.org/msg680103.html
>>
>> I simply wanted to save David time by remembering this other approach,
>> since Peter already commented on David's one (see Huawei link).
>
> IIRC the two questions that came up were:
>
>   - what would reading memory not covered by a file look like (0's or
>     something more like real HW, 7f?).
>
>   - what happens when the guest writes beyond the bounds of a backing
>     file?

In the non-backward-compatible-case (pre virt-5.2 in the patches, but
obviously not actually in 5.2, where the memory range is declared to the
guest as the extent of the file) neither of these issues should arise.

Reading/writing outside of the declared range should generate a fault,
much like any other non-existent region.

In the backward-compatible case the patches are obviously flawed - a
write outside of the extent of the backing file would have nowhere to
go.

In order to figure out how to move forward, I think that the summary of
the previous conversations is that either:
- it's not possible to reduce the size of the memory region declared to
  the guest from 64M+64M because of $reasons, or
- we don't want to reduce the size of the memory region declared to the
  guest from 64M+64M because of $reasons.

If that's right, I'd be curious to better understand $reasons, but have
no basis on which to argue.

Presuming that it's not acceptable to reduce the declared extent of the
flash regions, I will probably look to implement Philippe's suggestion:

> If what you want is use the least memory required, I still believe
> accessing the device by block is the best approach.

...which I interpret to mean that the pflash drivers should not allocate
memory and read the images (later writing back modified blocks), but
handle each IO request from the caller by reading and writing blocks
from the underlying device as necessary.

If this interpretation is wrong, please let me know :-)

Looking at doing that, it seems that a new variant of
memory_region_init_rom_device() that does not allocate memory will be
required, unless there anything already available that performs a
similar function?

Booting a VM with AAVMF and writing some variables to the flash doesn't
exercise much of the pflash functionality. In particular it would seem
like the right time to fix the batch write omission in the current
code. Is there another consumer of the pflash drivers that is likely to
exercise more? (I can write tests that run in the guest, of course,
presuming that the Linux driver can be pushed into using those paths.)

> I'm guessing for these cloudy type applications no one cares about
> persistence of EFI variables? Maybe we just need a formulation for the
> second pflash which is explicit about writes being ephemeral while also
> being accepted?

Our current deployments on x86 do not have writable flash for VARS,
though I believe that was problematic on arm64 until a recent
change. It's my suspicion that assuming we can continue to present
read-only VARS generally is going to be proven wrong before too long.

dme.