mbox series

[v4,0/1] hw/pflash_cfi01: Allow an administrator to reduce the memory consumption of flash devices

Message ID 20210810134050.396747-1-david.edmondson@oracle.com (mailing list archive)
Headers show
Series hw/pflash_cfi01: Allow an administrator to reduce the memory consumption of flash devices | expand

Message

David Edmondson Aug. 10, 2021, 1:40 p.m. UTC
As described in
https://lore.kernel.org/r/20201116104216.439650-1-david.edmondson@oracle.com
and
https://lore.kernel.org/r/20210222174757.2329740-1-david.edmondson@oracle.com
I'd like to reduce the amount of memory consumed by QEMU mapping UEFI
images on aarch64.

To recap:

> Currently ARM UEFI images are typically built as 2MB/768kB flash
> images for code and variables respectively. These images are both
> then padded out to 64MB before being loaded by QEMU.
>
> Because the images are 64MB each, QEMU allocates 128MB of memory to
> read them, and then proceeds to read all 128MB from disk (dirtying
> the memory). Of this 128MB less than 3MB is useful - the rest is
> zero padding.
>
> On a machine with 100 VMs this wastes over 12GB of memory.

Some of the cleanups in the previous patches were incorporated, but
the patch that reduced memory consumption was not accepted. This is
essentially that patch rebased after some unrelated changes. Having
investigated alternatives, I think that the patch here is useful as it
stands.

All read/write operations to areas outside of the underlying block
device are handled directly. Reads return 0, writes either fail
(read-only devices) or are discarded (writable devices).

This reduces the memory consumption for the AAVMF code image from
64MiB to around 2MB and that for the AAVMF vars from 64MiB to 768KiB
(presuming that the UEFI images are adjusted accordingly).

For read-only devices (such as the AAVMF code) this seems completely
safe.

For writable devices there is a change in behaviour - previously it
was possible to write anywhere in the extent of the flash device, read
back the data written and have that data persist through a restart of
QEMU. This is no longer the case - writes outside of the extent of the
underlying backing block device will be discarded. That is, a read
after a write will *not* return the written data, either immediately
or after a QEMU restart - it will return zeros.

Looking at the AAVMF implementation, it seems to me that if the
initial VARS image is prepared as being 768KiB in size (which it is),
none of AAVMF itself will attempt to write outside of that extent, and
so I believe that this is an acceptable compromise.

It would be relatively straightforward to allow writes outside of the
backing device to persist for the lifetime of a particular QEMU by
allocating memory on demand (i.e. when there is a write to the
relevant region). This would allow a read to return the relevant data,
but only until a QEMU restart, at which point the data would be lost.

It may be possible to persist writes by extending the underlying
backing device to accommodate a new extent. This would definitely add
complication, as ideally the size of the memory sub-region would also
be updated. I have not investigated this further.

There was a suggestion in a previous thread that perhaps the pflash
driver could be re-worked to use the block IO interfaces to access the
underlying device "on demand" rather than reading in the entire image
at startup (at least, that's how I understood the comment).

An implementation of this based around mapping the flash region only
for IO, which meant that every read or write had to be handled
directly by the pflash driver (there was no ROMD style operation),
made booting an aarch64 VM significantly slower - getting through the
firmware went from under 1 second to around 10 seconds. It's possible
that this could be improved by caching blocks or some other mechanism,
but I have not pursued it further.

Philippe implemented a suggestion to use mmap() to avoid the need to
allocate (and dirty) memory for read-only pflash images in
https://lore.kernel.org/qemu-devel/20210301115329.411762-1-philmd@redhat.com/.

This solution was, I believe, considered incomplete, as:
- it does not handle the case where the image underlying a pflash
  device is changed via QAPI,
- it does not handle writable devices.

There is also an assumption that multiple QEMU instances on a single
host will share the same AAVMF code image (to benefit from a shared
mapping) - this is not the case in the environment that I am looking
to support.

If using mmap() for read-only device is particularly valuable, it
could be combined with the patches here - the benefit would be
cumulative.

The only drawback that I see with this patch is the change in
behaviour for writes beyond the extent of an underlying image. Unless
the AAVMF build process is modified to generate smaller images (768kB
for the variables, for example), this will never be a problem in
reality, as the underlying image will match the size of the device.

Only when a deliberate decision is taken to use an image smaller than
the device does this drawback come to the fore, which is a tradeoff
that an administrator can choose to make if they wish.

v2:
- Unify the approach for both read-only and writable devices, saving
  another 63MiB per QEMU instance.

v3:
- Add Reviewed-by: for two changes (Philippe).
- Call blk_pread() directly rather than using
  blk_check_size_and_read_all(), given that we know how much we want
  to read.

v4:
- Remove changes already upstream.
- Rebase.

David Edmondson (1):
  hw/pflash_cfi01: Allow backing devices to be smaller than memory
    region

 hw/block/pflash_cfi01.c | 105 ++++++++++++++++++++++++++++++++--------
 hw/block/trace-events   |   3 ++
 2 files changed, 87 insertions(+), 21 deletions(-)

Comments

Philippe Mathieu-Daudé Aug. 20, 2021, 6:19 p.m. UTC | #1
Hi David,

On 8/10/21 3:40 PM, David Edmondson wrote:
> As described in
> https://lore.kernel.org/r/20201116104216.439650-1-david.edmondson@oracle.com
> and
> https://lore.kernel.org/r/20210222174757.2329740-1-david.edmondson@oracle.com
> I'd like to reduce the amount of memory consumed by QEMU mapping UEFI
> images on aarch64.

> v4:
> - Rebase.

Sorry I haven't forgotten about your patch, I can't make room
to think quietly about your problem. I reserved a dedicated
slot of few hours next week.