Message ID | 20210216142721.1985543-1-david.edmondson@oracle.com (mailing list archive) |
---|---|
Headers | show |
Series | hw/pflash_cfi01: Reduce memory consumption when flash image is smaller than region | expand |
On 2/16/21 3:27 PM, David Edmondson wrote: > As described in > https://lore.kernel.org/r/20201116104216.439650-1-david.edmondson@oracle.com, > I'd like to reduce the amount of memory consumed by QEMU mapping UEFI > images on aarch64. > > To recap: > >> Currently ARM UEFI images are typically built as 2MB/768kB flash >> images for code and variables respectively. These images are both >> then padded out to 64MB before being loaded by QEMU. >> >> Because the images are 64MB each, QEMU allocates 128MB of memory to >> read them, and then proceeds to read all 128MB from disk (dirtying >> the memory). Of this 128MB less than 3MB is useful - the rest is >> zero padding. >> >> On a machine with 100 VMs this wastes over 12GB of memory. > > There were objections to my previous patch because it changed the size > of the regions reported to the guest via the memory map (the reported > size depended on the size of the image). > > This is a smaller patch which only helps with read-only flash images, > as it does so by changing the memory region that covers the entire > region to be IO rather than RAM, and loads the flash image into a > smaller sub-region that is the more traditional mixed IO/ROMD type. > > All read/write operations to areas outside of the underlying block > device are handled directly (reads return 0, writes fail (which is > okay, because this path only supports read-only devices)). > > This reduces the memory consumption for the read-only AAVMF code image > from 64MB to around 2MB (presuming that the UEFI image is adjusted > accordingly). It does nothing to improve the memory consumption caused > by the read-write AAVMF vars image. So for each VM this changes from 64 + 64 to 2 + 64 MiB. 100 VMs now use 6.5GB instead of 400MB. Quite an improvement already :) > There was a suggestion in a previous thread that perhaps the pflash > driver could be re-worked to use the block IO interfaces to access the > underlying device "on demand" rather than reading in the entire image > at startup (at least, that's how I understood the comment). > > I looked at implementing this and struggled to get it to work for all > of the required use cases. Specifically, there are several code paths > that expect to retrieve a pointer to the flat memory image of the > pflash device and manipulate it directly (examples include the Malta > board and encrypted memory support on x86), or write the entire image > to storage (after migration). IIUC these are specific uses when the machine is paused. For Malta we can map a ROM instead. I don't know about encrypted x86 machines. > My implementation was based around mapping the flash region only for > IO, which meant that every read or write had to be handled directly by > the pflash driver (there was no ROMD style operation), which also made > booting an aarch64 VM noticeably slower - getting through the firmware > went from under 1 second to around 10 seconds. > > Improving the writeable device support requires some more general > infrastructure, I think, but I'm not familiar with everything that > QEMU currently provides, and would be very happy to learn otherwise. I am not a block expert, but I wonder if something like this could be used: - create a raw (parent) block image of 64MiB - add a raw (child) block with your 768kB of VARS file - add a null-co (child) block of 63Mib + 256kiB - pass the parent block to the pflash device Regards, Phil.
On Tuesday, 2021-02-16 at 16:03:05 +01, Philippe Mathieu-Daudé wrote: > I am not a block expert, but I wonder if something like this could > be used: > > - create a raw (parent) block image of 64MiB > > - add a raw (child) block with your 768kB of VARS file > > - add a null-co (child) block of 63Mib + 256kiB > > - pass the parent block to the pflash device I'm not clear how this would behave if there is a write to the device at (say) 1MiB. More philosophically, how should it behave? My expectation was that if the machine says that there is 64MiB of writable flash, we have to allow writes throughout the full 64MiB and (significantly) persist them to the backing block device. Just because the backing block device started out 768KiB big doesn't mean that we should not write to the remaining extent if that's what the VM attempts. Would the above approach achieve that? (It doesn't sound like it.) dme.
On 2/16/21 4:22 PM, David Edmondson wrote: > On Tuesday, 2021-02-16 at 16:03:05 +01, Philippe Mathieu-Daudé wrote: > >> I am not a block expert, but I wonder if something like this could >> be used: >> >> - create a raw (parent) block image of 64MiB >> >> - add a raw (child) block with your 768kB of VARS file >> >> - add a null-co (child) block of 63Mib + 256kiB >> >> - pass the parent block to the pflash device > > I'm not clear how this would behave if there is a write to the device at > (say) 1MiB. Discarded. > More philosophically, how should it behave? null-co: reads return zeroes, writes are discarded. > My expectation was that if the machine says that there is 64MiB of > writable flash, we have to allow writes throughout the full 64MiB and > (significantly) persist them to the backing block device. > > Just because the backing block device started out 768KiB big doesn't > mean that we should not write to the remaining extent if that's what the > VM attempts. > > Would the above approach achieve that? (It doesn't sound like it.) Well this was a simple suggestion if you know your guest won't access anything beside these 768KiB (IIRC AAVMF "consumes" the flash devices and doesn't expose them to the guest via ACPI/other). If you are into memory optimization, this is worth considering. Else it doesn't make sense. AAVMF is designed for virtual world. Is the virtual world interested in doing firmware upgrade on the CODE flash? Unlikely, if you want to upgrade AAVMF you'd do it on the host. Why not mount the CODE flash as ROM? I suppose because AAVMF does CFI MMIO accesses to detect the flash, but I wonder what is the point if this flash will be ever written... Regards, Phil.
On Tuesday, 2021-02-16 at 16:44:58 +01, Philippe Mathieu-Daudé wrote: > On 2/16/21 4:22 PM, David Edmondson wrote: >> On Tuesday, 2021-02-16 at 16:03:05 +01, Philippe Mathieu-Daudé wrote: >> >>> I am not a block expert, but I wonder if something like this could >>> be used: >>> >>> - create a raw (parent) block image of 64MiB >>> >>> - add a raw (child) block with your 768kB of VARS file >>> >>> - add a null-co (child) block of 63Mib + 256kiB >>> >>> - pass the parent block to the pflash device >> >> I'm not clear how this would behave if there is a write to the device at >> (say) 1MiB. > > Discarded. > >> More philosophically, how should it behave? > > null-co: reads return zeroes, writes are discarded. > >> My expectation was that if the machine says that there is 64MiB of >> writable flash, we have to allow writes throughout the full 64MiB and >> (significantly) persist them to the backing block device. >> >> Just because the backing block device started out 768KiB big doesn't >> mean that we should not write to the remaining extent if that's what the >> VM attempts. >> >> Would the above approach achieve that? (It doesn't sound like it.) > > Well this was a simple suggestion if you know your guest won't access > anything beside these 768KiB (IIRC AAVMF "consumes" the flash devices > and doesn't expose them to the guest via ACPI/other). If that's the case, mirroring the approach in the patch that I sent should work - we run the 768kiB as a subregion with IO/ROMD behaviour, fail or discard writes to the rest and read as zero. > If you are into memory optimization, this is worth considering. > Else it doesn't make sense. > > AAVMF is designed for virtual world. Is the virtual world interested in > doing firmware upgrade on the CODE flash? Unlikely, if you want to > upgrade AAVMF you'd do it on the host. Why not mount the CODE flash as > ROM? I suppose because AAVMF does CFI MMIO accesses to detect the flash, > but I wonder what is the point if this flash will be ever written... I don't expect that the CODE flash will ever be written by the guest, no (it's generally presented to the guest as read-only), and currently the VARS flash is also often presented read-only as well, but I don't think that is a situation that we can maintain given increasing use (DBX updates, for example). If it's acceptable to limit flash writes to the extent of the underlying block device then things are straightforward. It's not clear to me that this is the case. dme.
On Tuesday, 2021-02-16 at 15:53:48 GMT, David Edmondson wrote: > On Tuesday, 2021-02-16 at 16:44:58 +01, Philippe Mathieu-Daudé wrote: > >> On 2/16/21 4:22 PM, David Edmondson wrote: >>> On Tuesday, 2021-02-16 at 16:03:05 +01, Philippe Mathieu-Daudé wrote: >>> >>>> I am not a block expert, but I wonder if something like this could >>>> be used: >>>> >>>> - create a raw (parent) block image of 64MiB >>>> >>>> - add a raw (child) block with your 768kB of VARS file >>>> >>>> - add a null-co (child) block of 63Mib + 256kiB >>>> >>>> - pass the parent block to the pflash device >>> >>> I'm not clear how this would behave if there is a write to the device at >>> (say) 1MiB. >> >> Discarded. >> >>> More philosophically, how should it behave? >> >> null-co: reads return zeroes, writes are discarded. >> >>> My expectation was that if the machine says that there is 64MiB of >>> writable flash, we have to allow writes throughout the full 64MiB and >>> (significantly) persist them to the backing block device. >>> >>> Just because the backing block device started out 768KiB big doesn't >>> mean that we should not write to the remaining extent if that's what the >>> VM attempts. >>> >>> Would the above approach achieve that? (It doesn't sound like it.) >> >> Well this was a simple suggestion if you know your guest won't access >> anything beside these 768KiB (IIRC AAVMF "consumes" the flash devices >> and doesn't expose them to the guest via ACPI/other). > > If that's the case, mirroring the approach in the patch that I sent > should work - we run the 768kiB as a subregion with IO/ROMD behaviour, > fail or discard writes to the rest and read as zero. > >> If you are into memory optimization, this is worth considering. >> Else it doesn't make sense. >> >> AAVMF is designed for virtual world. Is the virtual world interested in >> doing firmware upgrade on the CODE flash? Unlikely, if you want to >> upgrade AAVMF you'd do it on the host. Why not mount the CODE flash as >> ROM? I suppose because AAVMF does CFI MMIO accesses to detect the flash, >> but I wonder what is the point if this flash will be ever written... > > I don't expect that the CODE flash will ever be written by the guest, no > (it's generally presented to the guest as read-only), and currently the > VARS flash is also often presented read-only as well, but I don't think > that is a situation that we can maintain given increasing use (DBX > updates, for example). > > If it's acceptable to limit flash writes to the extent of the underlying > block device then things are straightforward. It's not clear to me that > this is the case. Looking at the AAVMF implementation, it seems to me that if the initial VARS image is prepared as being 768KiB in size (which it is), none of AAVMF itself will attempt to write outside of that extent. If that is correct, it would support the idea of allowing writes only to the sub-region of the 64MiB that is covered by the backing block device. This would allow the same approach as used for read-only devices to be used for read-write devices - it's mostly a matter of deleting some code from the patch that I sent. The additional patch on top of the previous series would be: https://p.dme.org/0004-hw-pflash_cfi01-Allow-devices-to-have-a-smaller-back.patch.html (Presumably this would just be squashed into the previous patch in reality.) If AAVMF is not going to try to use any of the region outside of the VARS, do we need to persist writes to the rest of the region? If so, would it be acceptable for those writes to persist only for the lifetime of a QEMU instance? (Memory could be allocated "on demand" so that a read after write would behave, but the content would be thrown away when QEMU exits.) dme.