[RFC,0/3] hw/pflash_cfi01: Reduce memory consumption when flash image is smaller than region

Message ID 20210216142721.1985543-1-david.edmondson@oracle.com

Message

David Edmondson Feb. 16, 2021, 2:27 p.m. UTC
As described in
https://lore.kernel.org/r/20201116104216.439650-1-david.edmondson@oracle.com,
I'd like to reduce the amount of memory consumed by QEMU mapping UEFI
images on aarch64.

To recap:

> Currently ARM UEFI images are typically built as 2MB/768kB flash
> images for code and variables respectively. These images are both
> then padded out to 64MB before being loaded by QEMU.
>
> Because the images are 64MB each, QEMU allocates 128MB of memory to
> read them, and then proceeds to read all 128MB from disk (dirtying
> the memory). Of this 128MB less than 3MB is useful - the rest is
> zero padding.
>
> On a machine with 100 VMs this wastes over 12GB of memory.

There were objections to my previous patch because it changed the size
of the regions reported to the guest via the memory map (the reported
size depended on the size of the image).

This is a smaller patch which only helps with read-only flash images.
It does so by changing the memory region that covers the entire region
to be IO rather than RAM, and loading the flash image into a smaller
sub-region that is the more traditional mixed IO/ROMD type.

All read/write operations to areas outside of the underlying block
device are handled directly (reads return 0, writes fail (which is
okay, because this path only supports read-only devices)).
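
As a rough illustration of the shape of this - just a sketch, not the
actual patch, and the "unbacked"/"covered" names and fields here are
invented:

/* Accesses that fall outside the extent covered by the block backend
 * land on the container's own ops: reads return 0 and writes are
 * rejected (fine, since this path only supports read-only devices). */
static uint64_t pflash_unbacked_read(void *opaque, hwaddr addr,
                                     unsigned size)
{
    return 0;
}

static void pflash_unbacked_write(void *opaque, hwaddr addr,
                                  uint64_t value, unsigned size)
{
    qemu_log_mask(LOG_GUEST_ERROR,
                  "pflash: write to unbacked offset 0x%" HWADDR_PRIx "\n",
                  addr);
}

static const MemoryRegionOps pflash_unbacked_ops = {
    .read = pflash_unbacked_read,
    .write = pflash_unbacked_write,
    .endianness = DEVICE_NATIVE_ENDIAN,
};

/* In realize: the container spanning the whole 64MB device is IO-only;
 * only the extent actually covered by the backing image gets the usual
 * mixed IO/ROMD region, added as a subregion at offset 0. */
memory_region_init_io(&pfl->mem, OBJECT(dev), &pflash_unbacked_ops,
                      pfl, "pflash", total_len);
memory_region_init_rom_device(&pfl->covered_mem, OBJECT(dev),
                              &pflash_cfi01_ops, pfl, "pflash-covered",
                              backing_len, errp);
memory_region_add_subregion(&pfl->mem, 0, &pfl->covered_mem);

The point being that only the covered subregion is RAM-backed and read
from disk; nothing the size of the full 64MB region is ever allocated.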

This reduces the memory consumption for the read-only AAVMF code image
from 64MB to around 2MB (presuming that the UEFI image is adjusted
accordingly). It does nothing to improve the memory consumption caused
by the read-write AAVMF vars image.

There was a suggestion in a previous thread that perhaps the pflash
driver could be re-worked to use the block IO interfaces to access the
underlying device "on demand" rather than reading in the entire image
at startup (at least, that's how I understood the comment).

I looked at implementing this and struggled to get it to work for all
of the required use cases. Specifically, there are several code paths
that expect to retrieve a pointer to the flat memory image of the
pflash device and manipulate it directly (examples include the Malta
board and encrypted memory support on x86), or write the entire image
to storage (after migration).
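
For reference, the pattern those code paths rely on is roughly the
following, which only works while the flash region is RAM/ROMD backed
and therefore has a host pointer:

MemoryRegion *mr = pflash_cfi01_get_memory(pfl);
uint8_t *data = memory_region_get_ram_ptr(mr);
/* ... read or modify the flash contents directly via 'data' ... */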

My implementation was based around mapping the flash region only for
IO, which meant that every read or write had to be handled directly by
the pflash driver (there was no ROMD style operation), which also made
booting an aarch64 VM noticeably slower - getting through the firmware
went from under 1 second to around 10 seconds.

Improving the writeable device support requires some more general
infrastructure, I think, but I'm not familiar with everything that
QEMU currently provides, and would be very happy to learn otherwise.

David Edmondson (3):
  hw/pflash_cfi*: Replace DPRINTF with trace events
  hw/pflash_cfi01: Correct the type of PFlashCFI01.ro
  hw/pflash_cfi01: Allow read-only devices to have a smaller backing
    device

 hw/block/pflash_cfi01.c | 197 +++++++++++++++++++++++++---------------
 hw/block/pflash_cfi02.c |  75 ++++++---------
 hw/block/trace-events   |  42 +++++++--
 3 files changed, 186 insertions(+), 128 deletions(-)

Comments

Philippe Mathieu-Daudé Feb. 16, 2021, 3:03 p.m. UTC | #1
On 2/16/21 3:27 PM, David Edmondson wrote:
> As described in
> https://lore.kernel.org/r/20201116104216.439650-1-david.edmondson@oracle.com,
> I'd like to reduce the amount of memory consumed by QEMU mapping UEFI
> images on aarch64.
> 
> To recap:
> 
>> Currently ARM UEFI images are typically built as 2MB/768kB flash
>> images for code and variables respectively. These images are both
>> then padded out to 64MB before being loaded by QEMU.
>>
>> Because the images are 64MB each, QEMU allocates 128MB of memory to
>> read them, and then proceeds to read all 128MB from disk (dirtying
>> the memory). Of this 128MB less than 3MB is useful - the rest is
>> zero padding.
>>
>> On a machine with 100 VMs this wastes over 12GB of memory.
> 
> There were objections to my previous patch because it changed the size
> of the regions reported to the guest via the memory map (the reported
> size depended on the size of the image).
> 
> This is a smaller patch which only helps with read-only flash images.
> It does so by changing the memory region that covers the entire region
> to be IO rather than RAM, and loading the flash image into a smaller
> sub-region that is the more traditional mixed IO/ROMD type.
> 
> All read/write operations to areas outside of the underlying block
> device are handled directly (reads return 0, writes fail (which is
> okay, because this path only supports read-only devices)).
> 
> This reduces the memory consumption for the read-only AAVMF code image
> from 64MB to around 2MB (presuming that the UEFI image is adjusted
> accordingly). It does nothing to improve the memory consumption caused
> by the read-write AAVMF vars image.

So for each VM this changes from 64 + 64 to 2 + 64 MiB.

100 VMs now use about 6.6GB instead of 12.8GB. Quite an improvement already :)

> There was a suggestion in a previous thread that perhaps the pflash
> driver could be re-worked to use the block IO interfaces to access the
> underlying device "on demand" rather than reading in the entire image
> at startup (at least, that's how I understood the comment).
> 
> I looked at implementing this and struggled to get it to work for all
> of the required use cases. Specifically, there are several code paths
> that expect to retrieve a pointer to the flat memory image of the
> pflash device and manipulate it directly (examples include the Malta
> board and encrypted memory support on x86), or write the entire image
> to storage (after migration).

IIUC these are specific uses when the machine is paused. For Malta we
can map a ROM instead.

I don't know about encrypted x86 machines.

> My implementation was based around mapping the flash region only for
> IO, which meant that every read or write had to be handled directly by
> the pflash driver (there was no ROMD style operation), which also made
> booting an aarch64 VM noticeably slower - getting through the firmware
> went from under 1 second to around 10 seconds.
> 
> Improving the writeable device support requires some more general
> infrastructure, I think, but I'm not familiar with everything that
> QEMU currently provides, and would be very happy to learn otherwise.

I am not a block expert, but I wonder if something like this could
be used:

- create a raw (parent) block image of 64MiB

- add a raw (child) block with your 768kB of VARS file

- add a null-co (child) block of 63MiB + 256KiB

- pass the parent block to the pflash device

Regards,

Phil.
David Edmondson Feb. 16, 2021, 3:22 p.m. UTC | #2
On Tuesday, 2021-02-16 at 16:03:05 +01, Philippe Mathieu-Daudé wrote:

> I am not a block expert, but I wonder if something like this could
> be used:
>
> - create a raw (parent) block image of 64MiB
>
> - add a raw (child) block with your 768kB of VARS file
>
> - add a null-co (child) block of 63MiB + 256KiB
>
> - pass the parent block to the pflash device

I'm not clear how this would behave if there is a write to the device at
(say) 1MiB.

More philosophically, how should it behave?

My expectation was that if the machine says that there is 64MiB of
writable flash, we have to allow writes throughout the full 64MiB and
(significantly) persist them to the backing block device.

Just because the backing block device started out 768KiB big doesn't
mean that we should not write to the remaining extent if that's what the
VM attempts.

Would the above approach achieve that? (It doesn't sound like it.)

dme.
Philippe Mathieu-Daudé Feb. 16, 2021, 3:44 p.m. UTC | #3
On 2/16/21 4:22 PM, David Edmondson wrote:
> On Tuesday, 2021-02-16 at 16:03:05 +01, Philippe Mathieu-Daudé wrote:
> 
>> I am not a block expert, but I wonder if something like this could
>> be used:
>>
>> - create a raw (parent) block image of 64MiB
>>
>> - add a raw (child) block with your 768kB of VARS file
>>
>> - add a null-co (child) block of 63MiB + 256KiB
>>
>> - pass the parent block to the pflash device
> 
> I'm not clear how this would behave if there is a write to the device at
> (say) 1MiB.

Discarded.

> More philosophically, how should it behave?

null-co: reads return zeroes, writes are discarded.

> My expectation was that if the machine says that there is 64MiB of
> writable flash, we have to allow writes throughout the full 64MiB and
> (significantly) persist them to the backing block device.
> 
> Just because the backing block device started out 768KiB big doesn't
> mean that we should not write to the remaining extent if that's what the
> VM attempts.
> 
> Would the above approach achieve that? (It doesn't sound like it.)

Well this was a simple suggestion if you know your guest won't access
anything besides these 768KiB (IIRC AAVMF "consumes" the flash devices
and doesn't expose them to the guest via ACPI/other).

If you are into memory optimization, this is worth considering.
Else it doesn't make sense.

AAVMF is designed for the virtual world. Is the virtual world interested
in doing firmware upgrades on the CODE flash? Unlikely: if you want to
upgrade AAVMF you'd do it on the host. Why not mount the CODE flash as a
ROM? I suppose because AAVMF does CFI MMIO accesses to detect the flash,
but I wonder what the point is if this flash won't ever be written...

Regards,

Phil.
David Edmondson Feb. 16, 2021, 3:53 p.m. UTC | #4
On Tuesday, 2021-02-16 at 16:44:58 +01, Philippe Mathieu-Daudé wrote:

> On 2/16/21 4:22 PM, David Edmondson wrote:
>> On Tuesday, 2021-02-16 at 16:03:05 +01, Philippe Mathieu-Daudé wrote:
>> 
>>> I am not a block expert, but I wonder if something like this could
>>> be used:
>>>
>>> - create a raw (parent) block image of 64MiB
>>>
>>> - add a raw (child) block with your 768kB of VARS file
>>>
>>> - add a null-co (child) block of 63MiB + 256KiB
>>>
>>> - pass the parent block to the pflash device
>> 
>> I'm not clear how this would behave if there is a write to the device at
>> (say) 1MiB.
>
> Discarded.
>
>> More philosophically, how should it behave?
>
> null-co: reads return zeroes, writes are discarded.
>
>> My expectation was that if the machine says that there is 64MiB of
>> writable flash, we have to allow writes throughout the full 64MiB and
>> (significantly) persist them to the backing block device.
>> 
>> Just because the backing block device started out 768KiB big doesn't
>> mean that we should not write to the remaining extent if that's what the
>> VM attempts.
>> 
>> Would the above approach achieve that? (It doesn't sound like it.)
>
> Well this was a simple suggestion if you know your guest won't access
> anything besides these 768KiB (IIRC AAVMF "consumes" the flash devices
> and doesn't expose them to the guest via ACPI/other).

If that's the case, mirroring the approach in the patch that I sent
should work - we run the 768KiB as a subregion with IO/ROMD behaviour,
fail or discard writes to the rest and read as zero.

> If you are into memory optimization, this is worth considering.
> Else it doesn't make sense.
>
> AAVMF is designed for the virtual world. Is the virtual world interested
> in doing firmware upgrades on the CODE flash? Unlikely: if you want to
> upgrade AAVMF you'd do it on the host. Why not mount the CODE flash as a
> ROM? I suppose because AAVMF does CFI MMIO accesses to detect the flash,
> but I wonder what the point is if this flash won't ever be written...

I don't expect that the CODE flash will ever be written by the guest, no
(it's generally presented to the guest as read-only), and currently the
VARS flash is often presented read-only as well, but I don't think
that is a situation that we can maintain given increasing use (DBX
updates, for example).

If it's acceptable to limit flash writes to the extent of the underlying
block device then things are straightforward. It's not clear to me that
this is the case.

dme.
David Edmondson Feb. 18, 2021, 10:34 a.m. UTC | #5
On Tuesday, 2021-02-16 at 15:53:48 GMT, David Edmondson wrote:

> On Tuesday, 2021-02-16 at 16:44:58 +01, Philippe Mathieu-Daudé wrote:
>
>> On 2/16/21 4:22 PM, David Edmondson wrote:
>>> On Tuesday, 2021-02-16 at 16:03:05 +01, Philippe Mathieu-Daudé wrote:
>>> 
>>>> I am not a block expert, but I wonder if something like this could
>>>> be used:
>>>>
>>>> - create a raw (parent) block image of 64MiB
>>>>
>>>> - add a raw (child) block with your 768kB of VARS file
>>>>
>>>> - add a null-co (child) block of 63MiB + 256KiB
>>>>
>>>> - pass the parent block to the pflash device
>>> 
>>> I'm not clear how this would behave if there is a write to the device at
>>> (say) 1MiB.
>>
>> Discarded.
>>
>>> More philosophically, how should it behave?
>>
>> null-co: reads return zeroes, writes are discarded.
>>
>>> My expectation was that if the machine says that there is 64MiB of
>>> writable flash, we have to allow writes throughout the full 64MiB and
>>> (significantly) persist them to the backing block device.
>>> 
>>> Just because the backing block device started out 768KiB big doesn't
>>> mean that we should not write to the remaining extent if that's what the
>>> VM attempts.
>>> 
>>> Would the above approach achieve that? (It doesn't sound like it.)
>>
>> Well this was a simple suggestion if you know your guest won't access
>> anything besides these 768KiB (IIRC AAVMF "consumes" the flash devices
>> and doesn't expose them to the guest via ACPI/other).
>
> If that's the case, mirroring the approach in the patch that I sent
> should work - we run the 768KiB as a subregion with IO/ROMD behaviour,
> fail or discard writes to the rest and read as zero.
>
>> If you are into memory optimization, this is worth considering.
>> Else it doesn't make sense.
>>
>> AAVMF is designed for the virtual world. Is the virtual world interested
>> in doing firmware upgrades on the CODE flash? Unlikely: if you want to
>> upgrade AAVMF you'd do it on the host. Why not mount the CODE flash as a
>> ROM? I suppose because AAVMF does CFI MMIO accesses to detect the flash,
>> but I wonder what the point is if this flash won't ever be written...
>
> I don't expect that the CODE flash will ever be written by the guest, no
> (it's generally presented to the guest as read-only), and currently the
> VARS flash is often presented read-only as well, but I don't think
> that is a situation that we can maintain given increasing use (DBX
> updates, for example).
>
> If it's acceptable to limit flash writes to the extent of the underlying
> block device then things are straightforward. It's not clear to me that
> this is the case.

Looking at the AAVMF implementation, it seems to me that if the initial
VARS image is prepared as being 768KiB in size (which it is), none of
AAVMF itself will attempt to write outside of that extent.

If that is correct, it would support the idea of allowing writes only to
the sub-region of the 64MiB that is covered by the backing block device.

This would allow the same approach as used for read-only devices to be
used for read-write devices - it's mostly a matter of deleting some code
from the patch that I sent. The additional patch on top of the previous
series would be:
https://p.dme.org/0004-hw-pflash_cfi01-Allow-devices-to-have-a-smaller-back.patch.html
(Presumably this would just be squashed into the previous patch in
reality.)

If AAVMF is not going to try to use any of the region outside of the
VARS, do we need to persist writes to the rest of the region?

If so, would it be acceptable for those writes to persist only for the
lifetime of a QEMU instance? (Memory could be allocated "on demand" so
that a read after write would behave, but the content would be thrown
away when QEMU exits.)
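
For what it's worth, the "on demand" allocation might look something
like the sketch below - again with invented field names, glossing over
endianness, and not something from the posted series:

/* Lazily allocate a shadow buffer for the part of the region that is
 * not covered by the block backend, so that read-after-write behaves;
 * the contents are simply lost when QEMU exits. */
static void pflash_shadow_write(void *opaque, hwaddr addr,
                                uint64_t value, unsigned size)
{
    PFlashCFI01 *pfl = opaque;
    /* The container ops only see accesses beyond the covered extent. */
    hwaddr off = addr - pfl->backing_len;

    if (!pfl->unbacked_shadow) {
        pfl->unbacked_shadow = g_malloc0(pfl->total_len - pfl->backing_len);
    }
    stn_le_p(pfl->unbacked_shadow + off, size, value);
}

static uint64_t pflash_shadow_read(void *opaque, hwaddr addr,
                                   unsigned size)
{
    PFlashCFI01 *pfl = opaque;

    if (!pfl->unbacked_shadow) {
        return 0; /* never written, so reads as zero */
    }
    return ldn_le_p(pfl->unbacked_shadow + (addr - pfl->backing_len), size);
}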

dme.