Message ID: 20201116104216.439650-1-david.edmondson@oracle.com (mailing list archive)
Series: ARM: reduce the memory consumed when mapping UEFI flash images
Hi David,

On 11/16/20 11:42 AM, David Edmondson wrote:
> Currently ARM UEFI images are typically built as 2MB/768kB flash
> images for code and variables respectively. These images are both then
> padded out to 64MB before being loaded by QEMU.
>
> Because the images are 64MB each, QEMU allocates 128MB of memory to
> read them, and then proceeds to read all 128MB from disk (dirtying the
> memory). Of this 128MB less than 3MB is useful - the rest is zero
> padding.

Two years ago I raised the same problem, and suggested accessing the
underlying storage by "block", as this is block storage.

Back then the response was "do not try to fix something that works".
That is why we chose the big-hammer "do not accept an image size that
does not match the device size" approach.

While your series seems to help, it only postpones the same
implementation problem. If what you want is to use the least memory
required, I still believe accessing the device by block is the best
approach.

Regards,

Phil.

> On a machine with 100 VMs this wastes over 12GB of memory.
>
> This set of patches aims to reclaim the wasted memory by allowing QEMU
> to respect the size of the flash images and allocate only the
> necessary memory. This will, of course, require that the flash build
> process be modified to avoid padding the images to 64MB.
>
> Because existing machine types expect the full 128MB reserved for
> flash to be occupied, do so for machine types older than virt-5.2. The
> changes are beneficial even in this case, because while the full 128MB
> of memory is allocated, only that required to actually load the flash
> images from disk is touched.
>
> David Edmondson (5):
>   hw/block: blk_check_size_and_read_all should report backend name
>   hw/block: Flash images can be smaller than the device
>   hw/arm: Convert assertions about flash image size to error_report
>   hw/arm: Flash image mapping follows image size
>   hw/arm: Only minimise flash size on older machines
>
>  hw/arm/trace-events      |  2 +
>  hw/arm/virt-acpi-build.c | 30 ++++++++------
>  hw/arm/virt.c            | 86 +++++++++++++++++++++++++++++-----------
>  hw/block/block.c         | 26 ++++++------
>  include/hw/arm/virt.h    |  2 +
>  5 files changed, 97 insertions(+), 49 deletions(-)
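To put numbers on the figures in the cover letter above, here is a quick back-of-the-envelope calculation (a sketch using only the quoted values: 2MB code image, 768kB variable image, two flash devices each padded to 64MB):

```python
MiB = 1024 * 1024

code_image = 2 * MiB      # typical UEFI code flash image
vars_image = 768 * 1024   # typical UEFI variable store image
allocated = 2 * 64 * MiB  # both images padded to 64MB before QEMU loads them

useful = code_image + vars_image
wasted = allocated - useful

print(f"useful:  {useful / MiB:.2f} MiB")  # 2.75 MiB, i.e. "less than 3MB"
print(f"wasted:  {wasted / MiB:.2f} MiB per guest")
print(f"100 VMs: {100 * wasted / (1024 * MiB):.1f} GiB wasted")  # "over 12GB"
```

This also matches the later figures in the thread: 100 guests at about 2.75MiB each is under 300MB, versus roughly 12.5GB today.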
On Monday, 2020-11-16 at 12:39:46 +01, Philippe Mathieu-Daudé wrote:
> Hi David,
>
> On 11/16/20 11:42 AM, David Edmondson wrote:
>> Currently ARM UEFI images are typically built as 2MB/768kB flash
>> images for code and variables respectively. [...]
>
> Two years ago I raised the same problem, and suggested accessing the
> underlying storage by "block", as this is block storage.
>
> Back then the response was "do not try to fix something that works".
> That is why we chose the big-hammer "do not accept an image size that
> does not match the device size" approach.
>
> While your series seems to help, it only postpones the same
> implementation problem. If what you want is to use the least memory
> required, I still believe accessing the device by block is the best
> approach.

I agree that would reduce the size further, but it feels like it may be
a case of diminishing returns.

Currently the consumption for a single guest is 128MB. This change will
bring it down to under 3MB, with the block approach perhaps reducing
that to zero (there will be some buffer block usage presumably, and
perhaps a small cache would make sense, so it won't really be zero).

Scaling that up for 100 guests, we're going from 12.5GB now down to
under 300MB with these changes, and again to something around zero with
the block-based approach.

I guess that it would mean that the machine model wouldn't have to
change, as we could return zeroes for reads outside the underlying
device extent.

This seems like the most interesting part - are there likely to be any
consequential *benefits* from reducing the amount of guest address
space consumed by the flash regions?

> Regards,
>
> Phil.

dme.
Philippe Mathieu-Daudé <philmd@redhat.com> writes:

> Hi David,
>
> On 11/16/20 11:42 AM, David Edmondson wrote:
>> Currently ARM UEFI images are typically built as 2MB/768kB flash
>> images for code and variables respectively. [...]
>
> Back then the response was "do not try to fix something that works".
> [...] I still believe accessing the device by block is the best
> approach.

"Do not try to fix something that works" is hard to disagree with.
However, at least some users seem to disagree with "this works". Enough
to overcome the resistance to change?
On 11/16/20 2:48 PM, Markus Armbruster wrote:
> Philippe Mathieu-Daudé <philmd@redhat.com> writes:
>> [...]
>> While your series seems to help, it only postpones the same
>> implementation problem. If what you want is to use the least memory
>> required, I still believe accessing the device by block is the best
>> approach.
>
> "Do not try to fix something that works" is hard to disagree with.
> However, at least some users seem to disagree with "this works". Enough
> to overcome the resistance to change?

Yeah, at least 3 big users so far:

- Huawei
  https://www.mail-archive.com/qemu-devel@nongnu.org/msg607292.html
- Tencent
  https://www.mail-archive.com/qemu-devel@nongnu.org/msg742066.html
- Oracle
  (this series).

Then Huawei tried the MicroVM approach:
https://www.mail-archive.com/qemu-devel@nongnu.org/msg680103.html

I simply wanted to save David time by recalling this other approach,
since Peter already commented on David's one (see the Huawei link).

Regards,

Phil.
Philippe Mathieu-Daudé <philmd@redhat.com> writes:

> On 11/16/20 2:48 PM, Markus Armbruster wrote:
>> [...]
>> "Do not try to fix something that works" is hard to disagree with.
>> However, at least some users seem to disagree with "this works".
>> Enough to overcome the resistance to change?
>
> Yeah, at least 3 big users so far:
>
> - Huawei
>   https://www.mail-archive.com/qemu-devel@nongnu.org/msg607292.html
> - Tencent
>   https://www.mail-archive.com/qemu-devel@nongnu.org/msg742066.html
> - Oracle
>   (this series).
>
> Then Huawei tried the MicroVM approach:
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg680103.html
>
> I simply wanted to save David time by recalling this other approach,
> since Peter already commented on David's one (see the Huawei link).

IIRC the two questions that came up were:

 - what would reading memory not covered by a file look like (0's, or
   something more like real HW, 0x7f?)

 - what happens when the guest writes beyond the bounds of a backing
   file?

I'm guessing for these cloudy type applications no one cares about
persistence of EFI variables? Maybe we just need a formulation for the
second pflash which is explicit about writes being ephemeral while also
being accepted?
+Ard & Leif for EDK2.

On 11/19/20 12:45 PM, Alex Bennée wrote:
> Philippe Mathieu-Daudé <philmd@redhat.com> writes:
>> [...]
>
> IIRC the two questions that came up were:
>
>  - what would reading memory not covered by a file look like (0's, or
>    something more like real HW, 0x7f?)

For NOR flashes the erased bit is high and a programmed bit is low,
so: 0xff.

>  - what happens when the guest writes beyond the bounds of a backing
>    file?

Report a hardware error, so the guest firmware has a chance to do
something (not sure what, besides rebooting...).

> I'm guessing for these cloudy type applications no one cares about
> persistence of EFI variables? Maybe we just need a formulation for the
> second pflash which is explicit about writes being ephemeral while
> also being accepted?

Someone suggested adding a new machine type to QEMU to be able to use a
smaller flash for cloud usage (but I don't remember who). Then EDK2
could be built for this new flash size.

Regards,

Phil.
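The semantics described above can be illustrated with a toy model (plain Python for illustration only, not QEMU code; the `ToyPflash` class and its methods are invented for this sketch): reads beyond the end of the backing image return 0xff, as for erased NOR flash, and writes beyond it raise an error in lieu of reporting a hardware error to the guest.

```python
class ToyPflash:
    """Toy model of a flash device backed by an image smaller than the device."""

    ERASED = 0xFF  # NOR flash: erased bits read back as 1, so erased bytes are 0xff

    def __init__(self, image: bytes, device_size: int):
        assert len(image) <= device_size
        self.image = bytearray(image)
        self.device_size = device_size

    def read(self, offset: int, length: int) -> bytes:
        assert offset + length <= self.device_size
        data = bytes(self.image[offset:offset + length])
        # Anything past the end of the backing image reads as erased flash.
        return data + bytes([self.ERASED] * (length - len(data)))

    def write(self, offset: int, data: bytes) -> None:
        if offset + len(data) > len(self.image):
            # Nowhere to persist this: stand-in for reporting a hardware error.
            raise IOError("flash write beyond the bounds of the backing image")
        self.image[offset:offset + len(data)] = data


flash = ToyPflash(b"\x00" * 16, device_size=64)
print(flash.read(8, 16).hex())  # 8 bytes from the image, then 8 bytes of ff
```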
Sorry for the delay in replying - holiday then distractions.

On Thursday, 2020-11-19 at 11:45:11 GMT, Alex Bennée wrote:
> Philippe Mathieu-Daudé <philmd@redhat.com> writes:
>> [...]
>> I simply wanted to save David time by recalling this other approach,
>> since Peter already commented on David's one (see the Huawei link).
>
> IIRC the two questions that came up were:
>
>  - what would reading memory not covered by a file look like (0's, or
>    something more like real HW, 0x7f?)
>
>  - what happens when the guest writes beyond the bounds of a backing
>    file?

In the non-backward-compatible case (pre virt-5.2 in the patches, but
obviously not actually in 5.2, where the memory range is declared to
the guest as the extent of the file), neither of these issues should
arise. Reading or writing outside of the declared range should generate
a fault, much like any other non-existent region.

In the backward-compatible case the patches are obviously flawed - a
write outside of the extent of the backing file would have nowhere to
go.

In order to figure out how to move forward, I think that the summary of
the previous conversations is that either:

 - it's not possible to reduce the size of the memory region declared
   to the guest from 64M+64M because of $reasons, or

 - we don't want to reduce the size of the memory region declared to
   the guest from 64M+64M because of $reasons.

If that's right, I'd be curious to better understand $reasons, but have
no basis on which to argue.

Presuming that it's not acceptable to reduce the declared extent of the
flash regions, I will probably look to implement Philippe's suggestion:

> If what you want is to use the least memory required, I still believe
> accessing the device by block is the best approach.

...which I interpret to mean that the pflash drivers should not
allocate memory and read in the images (later writing back modified
blocks), but should instead handle each IO request from the caller by
reading and writing blocks from the underlying device as necessary. If
this interpretation is wrong, please let me know :-)

Looking at doing that, it seems that a new variant of
memory_region_init_rom_device() that does not allocate memory will be
required, unless there is something already available that performs a
similar function?

Booting a VM with AAVMF and writing some variables to the flash doesn't
exercise much of the pflash functionality. In particular, it would seem
like the right time to fix the batch write omission in the current
code. Is there another consumer of the pflash drivers that is likely to
exercise more? (I can write tests that run in the guest, of course,
presuming that the Linux driver can be pushed into using those paths.)

> I'm guessing for these cloudy type applications no one cares about
> persistence of EFI variables? Maybe we just need a formulation for the
> second pflash which is explicit about writes being ephemeral while
> also being accepted?

Our current deployments on x86 do not have writable flash for VARS,
though I believe that was problematic on arm64 until a recent change.
It's my suspicion that assuming we can continue to present read-only
VARS generally is going to be proven wrong before too long.

dme.
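The block-by-block handling described above can be sketched as follows. This is an illustrative model in plain Python, not the C a real QEMU implementation would need (which would go through BlockBackend and a non-allocating MemoryRegion variant); the `BlockBackedFlash` class and its parameters are invented for the sketch. The point is the access pattern: each guest read is served from the backing file on demand, so resident memory is bounded by the cache size rather than the 64MB device size.

```python
from collections import OrderedDict
import tempfile

class BlockBackedFlash:
    """Serve flash reads block by block from a backing file, via a tiny LRU cache.

    Resident memory is bounded by cache_blocks * block_size, not the device size.
    """

    def __init__(self, path: str, block_size: int = 4096, cache_blocks: int = 8):
        self.f = open(path, "rb")
        self.block_size = block_size
        self.cache_blocks = cache_blocks
        self.cache: "OrderedDict[int, bytes]" = OrderedDict()

    def _block(self, index: int) -> bytes:
        if index in self.cache:
            self.cache.move_to_end(index)  # mark most recently used
            return self.cache[index]
        self.f.seek(index * self.block_size)
        # Reads past the end of the image come back as erased NOR flash (0xff).
        data = self.f.read(self.block_size).ljust(self.block_size, b"\xff")
        self.cache[index] = data
        if len(self.cache) > self.cache_blocks:
            self.cache.popitem(last=False)  # evict the least recently used block
        return data

    def read(self, offset: int, length: int) -> bytes:
        out = bytearray()
        while length > 0:
            index, within = divmod(offset, self.block_size)
            chunk = self._block(index)[within:within + length]
            out += chunk
            offset += len(chunk)
            length -= len(chunk)
        return bytes(out)


with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x01" * 100)  # a 100-byte "flash image"
flash = BlockBackedFlash(tmp.name, block_size=64)
print(flash.read(90, 20))  # 10 bytes of image data, then 0xff padding
```

Writes would follow the same pattern in reverse (read-modify-write of the affected block, then flush), which is where the batch-write path mentioned above would matter.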