Message ID | 20221220084246.1984871-1-kraxel@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Series | [v2] pflash: Only read non-zero parts of backend image |
On Tue, Dec 20, 2022 at 09:42:46AM +0100, Gerd Hoffmann wrote:
> From: Xiang Zheng <zhengxiang9@huawei.com>
>
> Currently we fill the VIRT_FLASH memory space with two 64MB NOR images
> when using persistent UEFI variables on virt board. Actually we only use
> a very small(non-zero) part of the memory while the rest significant
> large(zero) part of memory is wasted.
>
> So this patch checks the block status and only writes the non-zero part
> into memory. This requires pflash devices to use sparse files for
> backends.
>
> Signed-off-by: Xiang Zheng <zhengxiang9@huawei.com>
>
> [ kraxel: rebased to latest master ]
>
> Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
> ---
>  hw/block/block.c | 36 +++++++++++++++++++++++++++++++++++-
>  1 file changed, 35 insertions(+), 1 deletion(-)

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>

With regards,
Daniel
[Extending to people using UEFI VARStore on Virt machines]

On 20/12/22 09:42, Gerd Hoffmann wrote:
> From: Xiang Zheng <zhengxiang9@huawei.com>
>
> Currently we fill the VIRT_FLASH memory space with two 64MB NOR images
> when using persistent UEFI variables on virt board. Actually we only use
> a very small(non-zero) part of the memory while the rest significant
> large(zero) part of memory is wasted.
>
> So this patch checks the block status and only writes the non-zero part
> into memory. This requires pflash devices to use sparse files for
> backends.

I like the idea, but I'm not sure how to relate with NOR flash devices.

From the block layer, we get BDRV_BLOCK_ZERO when a block is fully
filled by zeroes ('\0').

We don't want to waste host memory, I get it.

Now what "sees" the guest? Is the UEFI VARStore filled with zeroes?

If so, is it a EDK2 specific case for all virt machines? This would
be a virtualization optimization and in that case, this patch would
work.

On hardware the NOR flash "erased state" is filled of '\xff'. If EDK2
requires a 64MiB VARStore on NOR flash, I'd expect the non-used area
to be filled with \xff, at least up to the sector size. Otherwise it
is sub-optimal use of persistent storage on hardware.

But instead of keeping insisting on that, I'd like to step back a
little and discuss. What is the use case?

* Either you want to test UEFI on real hardware and a NOR flash
  makes sense,

* or you are trying to optimize paravirtualized guests.

In that case why insist with emulated NOR devices? Why not have EDK2
directly use a paravirtualized block driver which we can optimize /
tune without interfering with emulated models?

Keeping insisting on optimizing guests using the QEMU pflash device
seems wrong to me. I'm pretty sure we can do better optimizing clouds
payloads.

> Signed-off-by: Xiang Zheng <zhengxiang9@huawei.com>
>
> [ kraxel: rebased to latest master ]
>
> Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
> ---
>  hw/block/block.c | 36 +++++++++++++++++++++++++++++++++++-
>  1 file changed, 35 insertions(+), 1 deletion(-)
>
> diff --git a/hw/block/block.c b/hw/block/block.c
> index f9c4fe67673b..142ebe4267e4 100644
> --- a/hw/block/block.c
> +++ b/hw/block/block.c
> @@ -14,6 +14,40 @@
>  #include "qapi/error.h"
>  #include "qapi/qapi-types-block.h"
>
> +/*
> + * Read the non-zeroes parts of @blk into @buf
> + * Reading all of the @blk is expensive if the zeroes parts of @blk
> + * is large enough. Therefore check the block status and only write
> + * the non-zeroes block into @buf.
> + *
> + * Return 0 on success, non-zero on error.
> + */
> +static int blk_pread_nonzeroes(BlockBackend *blk, hwaddr size, void *buf)
> +{
> +    int ret;
> +    int64_t bytes, offset = 0;
> +    BlockDriverState *bs = blk_bs(blk);
> +
> +    for (;;) {
> +        bytes = MIN(size - offset, BDRV_REQUEST_MAX_SECTORS);
> +        if (bytes <= 0) {
> +            return 0;
> +        }
> +        ret = bdrv_block_status(bs, offset, bytes, &bytes, NULL, NULL);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +        if (!(ret & BDRV_BLOCK_ZERO)) {
> +            ret = bdrv_pread(bs->file, offset, bytes,
> +                             (uint8_t *) buf + offset, 0);
> +            if (ret < 0) {
> +                return ret;
> +            }
> +        }
> +        offset += bytes;
> +    }
> +}
> +
>  /*
>   * Read the entire contents of @blk into @buf.
>   * @blk's contents must be @size bytes, and @size must be at most
> @@ -53,7 +87,7 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
>       * block device and read only on demand.
>       */
>      assert(size <= BDRV_REQUEST_MAX_BYTES);
> -    ret = blk_pread(blk, 0, size, buf, 0);
> +    ret = blk_pread_nonzeroes(blk, size, buf);
>      if (ret < 0) {
>          error_setg_errno(errp, -ret, "can't read block backend");
>          return false;
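
The distinction raised in this reply, between a region that reads as zeroes and a NOR region in its erased 0xff state, can be illustrated with a small standalone C sketch. This is not QEMU code: `region_fill` and `classify_region` are hypothetical names introduced only for illustration. The point is that an erased 0xff region is ordinary non-zero data, so a zero-skipping reader would still have to read it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical classification of a flash region; not a QEMU type. */
enum region_fill {
    REGION_ALL_ZERO, /* every byte is 0x00: a zero-skipping reader may skip it */
    REGION_ERASED,   /* every byte is 0xff: the NOR "erased state" */
    REGION_DATA      /* any other content */
};

/*
 * Classify len bytes at buf.
 * An empty region is reported as REGION_ALL_ZERO (a choice made for
 * this sketch, since there is nothing to read in that case).
 */
static enum region_fill classify_region(const uint8_t *buf, size_t len)
{
    int all_zero = 1;
    int all_ff = 1;

    assert(buf != NULL || len == 0);

    for (size_t i = 0; i < len; i++) {
        if (buf[i] != 0x00) {
            all_zero = 0;
        }
        if (buf[i] != 0xff) {
            all_ff = 0;
        }
        if (!all_zero && !all_ff) {
            return REGION_DATA; /* mixed content, no need to scan further */
        }
    }
    if (all_zero) {
        return REGION_ALL_ZERO;
    }
    return REGION_ERASED;
}
```

Only `REGION_ALL_ZERO` corresponds to what the block layer can report as zero; a region in the erased state would be read like any other data.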
On Tue, Dec 20, 2022 at 10:30:43AM +0100, Philippe Mathieu-Daudé wrote:
> [Extending to people using UEFI VARStore on Virt machines]
>
> On 20/12/22 09:42, Gerd Hoffmann wrote:
> > From: Xiang Zheng <zhengxiang9@huawei.com>
> >
> > Currently we fill the VIRT_FLASH memory space with two 64MB NOR images
> > when using persistent UEFI variables on virt board. Actually we only use
> > a very small(non-zero) part of the memory while the rest significant
> > large(zero) part of memory is wasted.
> >
> > So this patch checks the block status and only writes the non-zero part
> > into memory. This requires pflash devices to use sparse files for
> > backends.
>
> I like the idea, but I'm not sure how to relate with NOR flash devices.
>
> From the block layer, we get BDRV_BLOCK_ZERO when a block is fully
> filled by zeroes ('\0').
>
> We don't want to waste host memory, I get it.
>
> Now what "sees" the guest? Is the UEFI VARStore filled with zeroes?

The varstore is filled with 0xff. It's 768k in size. The padding
following (63M plus a bit) is 0x00. To be exact:

kraxel@sirius ~# hex /usr/share/edk2/aarch64/vars-template-pflash.raw
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000010 8d 2b f1 ff 96 76 8b 4c a9 85 27 47 07 5b 4f 50 .+...v.L..'G.[OP
00000020 00 00 0c 00 00 00 00 00 5f 46 56 48 ff fe 04 00 ........_FVH....
00000030 48 00 28 09 00 00 00 02 03 00 00 00 00 00 04 00 H.(.............
00000040 00 00 00 00 00 00 00 00 78 2c f3 aa 7b 94 9a 43 ........x,..{..C
00000050 a1 80 2e 14 4e c3 77 92 b8 ff 03 00 5a fe 00 00 ....N.w.....Z...
00000060 00 00 00 00 ff ff ff ff ff ff ff ff ff ff ff ff ................
00000070 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
*
00040000 2b 29 58 9e 68 7c 7d 49 a0 ce 65 00 fd 9f 1b 95 +)X.h|}I..e.....
00040010 5b e7 c6 86 fe ff ff ff e0 ff 03 00 00 00 00 00 [...............
00040020 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
*
000c0000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
*

> If so, is it a EDK2 specific case for all virt machines? This would
> be a virtualization optimization and in that case, this patch would
> work.

vars-template-pflash.raw (padded image) is simply QEMU_VARS.fd (unpadded
image) with 'truncate --size 64M' applied.

Yes, that's a pure virtual machine thing. On physical hardware you
would probably just flash the first 768k and leave the remaining flash
capacity untouched.

> * or you are trying to optimize paravirtualized guests.

This. Ideally without putting everything upside-down.

> In that case why insist with emulated NOR devices? Why not have EDK2
> directly use a paravirtualized block driver which we can optimize /
> tune without interfering with emulated models?

While that probably would work for the variable store (I think we could
very well do with variable store not being mapped and requiring explicit
read/write requests) that idea is not going to work very well for the
firmware code which must be mapped into the address space. pflash is
almost the only device we have which serves that need. The only other
option I can see would be a rom (the code is usually mapped r/o anyway),
but that has pretty much the same problem space. We would likewise want
a big enough fixed size ROM, to avoid live migration problems and all
that, and we want the unused space not waste memory.

> Keeping insisting on optimizing guests using the QEMU pflash device
> seems wrong to me. I'm pretty sure we can do better optimizing clouds
> payloads.

Moving away from pflash for efi variable storage would cause a lot of
churn through the whole stack. firmware, qemu, libvirt, upper
management, all affected. Is that worth the trouble? Using pflash
isn't that much of a problem IMHO.
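
The sizes in this reply can be checked with a short C sketch. The dump shows the zero padding starting at offset 0x000c0000, which is 768 KiB, and a 64 MiB image leaves 63.25 MiB after it ("63M plus a bit"). The helper names `padding_bytes` and `pad_image` are hypothetical; `pad_image` only mimics the effect of `truncate --size 64M` using the POSIX `ftruncate()` call, and whether the extended tail is actually stored as a hole depends on the host filesystem.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define KiB 1024LL
#define MiB (1024LL * KiB)

/* Sizes quoted in the thread. */
static const int64_t VARSTORE_SIZE = 768 * KiB; /* 0x000c0000 in the dump */
static const int64_t FLASH_SIZE = 64 * MiB;     /* padded image size */

/* Number of 0x00 padding bytes that follow the varstore. */
static int64_t padding_bytes(int64_t flash_size, int64_t varstore_size)
{
    assert(varstore_size >= 0 && varstore_size <= flash_size);
    return flash_size - varstore_size;
}

/*
 * Extend the file at path to size bytes, similar in effect to
 * 'truncate --size'. The added tail reads back as zeroes.
 * Returns 0 on success, -1 on error (errno is set).
 */
static int pad_image(const char *path, off_t size)
{
    int fd = open(path, O_WRONLY);
    int ret;

    if (fd < 0) {
        return -1;
    }
    ret = ftruncate(fd, size);
    if (close(fd) < 0 && ret == 0) {
        ret = -1;
    }
    return ret;
}
```

With the values above, `padding_bytes(FLASH_SIZE, VARSTORE_SIZE)` is 66322432 bytes, so roughly 99% of the padded image is zero padding.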
On Tue, 20 Dec 2022 at 16:33, Gerd Hoffmann <kraxel@redhat.com> wrote:
>
> On Tue, Dec 20, 2022 at 10:30:43AM +0100, Philippe Mathieu-Daudé wrote:
> > [Extending to people using UEFI VARStore on Virt machines]
> >
> > On 20/12/22 09:42, Gerd Hoffmann wrote:
> > > From: Xiang Zheng <zhengxiang9@huawei.com>
> > >
> > > Currently we fill the VIRT_FLASH memory space with two 64MB NOR images
> > > when using persistent UEFI variables on virt board. Actually we only use
> > > a very small(non-zero) part of the memory while the rest significant
> > > large(zero) part of memory is wasted.
> > >
> > > So this patch checks the block status and only writes the non-zero part
> > > into memory. This requires pflash devices to use sparse files for
> > > backends.
> >
> > I like the idea, but I'm not sure how to relate with NOR flash devices.
> >
> > From the block layer, we get BDRV_BLOCK_ZERO when a block is fully
> > filled by zeroes ('\0').
> >
> > We don't want to waste host memory, I get it.
> >
> > Now what "sees" the guest? Is the UEFI VARStore filled with zeroes?
>
> The varstore is filled with 0xff. It's 768k in size. The padding
> following (63M plus a bit) is 0x00. To be exact:
>
> kraxel@sirius ~# hex /usr/share/edk2/aarch64/vars-template-pflash.raw
> 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> 00000010 8d 2b f1 ff 96 76 8b 4c a9 85 27 47 07 5b 4f 50 .+...v.L..'G.[OP
> 00000020 00 00 0c 00 00 00 00 00 5f 46 56 48 ff fe 04 00 ........_FVH....
> 00000030 48 00 28 09 00 00 00 02 03 00 00 00 00 00 04 00 H.(.............
> 00000040 00 00 00 00 00 00 00 00 78 2c f3 aa 7b 94 9a 43 ........x,..{..C
> 00000050 a1 80 2e 14 4e c3 77 92 b8 ff 03 00 5a fe 00 00 ....N.w.....Z...
> 00000060 00 00 00 00 ff ff ff ff ff ff ff ff ff ff ff ff ................
> 00000070 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
> *
> 00040000 2b 29 58 9e 68 7c 7d 49 a0 ce 65 00 fd 9f 1b 95 +)X.h|}I..e.....
> 00040010 5b e7 c6 86 fe ff ff ff e0 ff 03 00 00 00 00 00 [...............
> 00040020 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
> *
> 000c0000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> *
>
> > If so, is it a EDK2 specific case for all virt machines? This would
> > be a virtualization optimization and in that case, this patch would
> > work.
>
> vars-template-pflash.raw (padded image) is simply QEMU_VARS.fd (unpadded
> image) with 'truncate --size 64M' applied.
>
> Yes, that's a pure virtual machine thing. On physical hardware you
> would probably just flash the first 768k and leave the remaining flash
> capacity untouched.
>
> > * or you are trying to optimize paravirtualized guests.
>
> This. Ideally without putting everything upside-down.
>
> > In that case why insist with emulated NOR devices? Why not have EDK2
> > directly use a paravirtualized block driver which we can optimize /
> > tune without interfering with emulated models?
>
> While that probably would work for the variable store (I think we could
> very well do with variable store not being mapped and requiring explicit
> read/write requests) that idea is not going to work very well for the
> firmware code which must be mapped into the address space. pflash is
> almost the only device we have which serves that need. The only other
> option I can see would be a rom (the code is usually mapped r/o anyway),
> but that has pretty much the same problem space. We would likewise want
> a big enough fixed size ROM, to avoid live migration problems and all
> that, and we want the unused space not waste memory.
>
> > Keeping insisting on optimizing guests using the QEMU pflash device
> > seems wrong to me. I'm pretty sure we can do better optimizing clouds
> > payloads.
>
> Moving away from pflash for efi variable storage would cause a lot of
> churn through the whole stack. firmware, qemu, libvirt, upper
> management, all affected. Is that worth the trouble? Using pflash
> isn't that much of a problem IMHO.
>

Agreed. pflash is a bit clunky but not a huge problem atm (although
setting up and tearing down the r/o memslot for every read resp. write
results in some performance issues under kvm/arm64)

*If* we decide to replace it, I would suggest an emulated ROM for the
executable image (without any emulated programming facility
whatsoever) and a paravirtualized get/setvariable interface which can
be used in a sane way to virtualize secure boot without having to
emulate SMM or other secure world firmware interfaces.
Hi,

> > Moving away from pflash for efi variable storage would cause a lot of
> > churn through the whole stack. firmware, qemu, libvirt, upper
> > management, all affected. Is that worth the trouble? Using pflash
> > isn't that much of a problem IMHO.
>
> Agreed. pflash is a bit clunky but not a huge problem atm (although
> setting up and tearing down the r/o memslot for every read resp. write
> results in some performance issues under kvm/arm64)
>
> *If* we decide to replace it, I would suggest an emulated ROM for the
> executable image (without any emulated programming facility
> whatsoever)

Sure.

> and a paravirtualized get/setvariable interface which can
> be used in a sane way to virtualize secure boot without having to
> emulate SMM or other secure world firmware interfaces.

Suggestions how to do that best?

The only option I can see is moving the variable policy processing to
the host, so any variable update requests are checked even in case the
guest OS bypasses the firmware (which it can easily do when we don't
have SMM mode to restrict access to the paravirtual efi variable
service device).

take care,
Gerd
On 20.12.2022 at 09:42, Gerd Hoffmann wrote:
> From: Xiang Zheng <zhengxiang9@huawei.com>
>
> Currently we fill the VIRT_FLASH memory space with two 64MB NOR images
> when using persistent UEFI variables on virt board. Actually we only use
> a very small(non-zero) part of the memory while the rest significant
> large(zero) part of memory is wasted.
>
> So this patch checks the block status and only writes the non-zero part
> into memory. This requires pflash devices to use sparse files for
> backends.
>
> Signed-off-by: Xiang Zheng <zhengxiang9@huawei.com>
>
> [ kraxel: rebased to latest master ]
>
> Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>

Thanks, applied to the block branch.

Even though discussion is ongoing about using alternative devices, it
seems to me that this is a simple optimisation that doesn't change the
behaviour as seen by the guest and that we want to have either way.

If anyone objects and wants me to drop the patch again, let me know.

Kevin
diff --git a/hw/block/block.c b/hw/block/block.c
index f9c4fe67673b..142ebe4267e4 100644
--- a/hw/block/block.c
+++ b/hw/block/block.c
@@ -14,6 +14,40 @@
 #include "qapi/error.h"
 #include "qapi/qapi-types-block.h"
 
+/*
+ * Read the non-zeroes parts of @blk into @buf
+ * Reading all of the @blk is expensive if the zeroes parts of @blk
+ * is large enough. Therefore check the block status and only write
+ * the non-zeroes block into @buf.
+ *
+ * Return 0 on success, non-zero on error.
+ */
+static int blk_pread_nonzeroes(BlockBackend *blk, hwaddr size, void *buf)
+{
+    int ret;
+    int64_t bytes, offset = 0;
+    BlockDriverState *bs = blk_bs(blk);
+
+    for (;;) {
+        bytes = MIN(size - offset, BDRV_REQUEST_MAX_SECTORS);
+        if (bytes <= 0) {
+            return 0;
+        }
+        ret = bdrv_block_status(bs, offset, bytes, &bytes, NULL, NULL);
+        if (ret < 0) {
+            return ret;
+        }
+        if (!(ret & BDRV_BLOCK_ZERO)) {
+            ret = bdrv_pread(bs->file, offset, bytes,
+                             (uint8_t *) buf + offset, 0);
+            if (ret < 0) {
+                return ret;
+            }
+        }
+        offset += bytes;
+    }
+}
+
 /*
  * Read the entire contents of @blk into @buf.
  * @blk's contents must be @size bytes, and @size must be at most
@@ -53,7 +87,7 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
      * block device and read only on demand.
      */
     assert(size <= BDRV_REQUEST_MAX_BYTES);
-    ret = blk_pread(blk, 0, size, buf, 0);
+    ret = blk_pread_nonzeroes(blk, size, buf);
     if (ret < 0) {
         error_setg_errno(errp, -ret, "can't read block backend");
         return false;
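
For readers unfamiliar with the technique, the idea of reading only the allocated parts of a sparse backing file can be sketched outside QEMU with plain Linux system calls. The following is a user-space analogue under stated assumptions, not the patch's implementation: it uses `lseek()` with `SEEK_DATA` and `SEEK_HOLE` instead of `bdrv_block_status()`, and the function names `read_nonzero_extents` and `load_sparse_image` are hypothetical. Like the patch, it leaves the destination untouched for skipped ranges, so it assumes the caller passes a buffer that is already zero-filled (for example a fresh anonymous mapping).

```c
#define _GNU_SOURCE /* for SEEK_DATA / SEEK_HOLE on Linux */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy only the data extents of fd into buf, which must hold size bytes
 * and must already be zero-filled. Holes are skipped, so the matching
 * parts of buf are never written.
 * Returns 0 on success, a negative errno value on failure.
 */
static int read_nonzero_extents(int fd, uint8_t *buf, off_t size)
{
    off_t offset = 0;

    assert(buf != NULL && size >= 0);

    while (offset < size) {
        off_t data = lseek(fd, offset, SEEK_DATA);
        off_t hole, pos;

        if (data < 0) {
            /* ENXIO means there is no data at or after offset. */
            return errno == ENXIO ? 0 : -errno;
        }
        if (data >= size) {
            return 0;
        }
        hole = lseek(fd, data, SEEK_HOLE);
        if (hole < 0) {
            return -errno;
        }
        if (hole > size) {
            hole = size;
        }
        /* Read the extent [data, hole), retrying on short reads. */
        for (pos = data; pos < hole; ) {
            ssize_t n = pread(fd, buf + pos, (size_t)(hole - pos), pos);
            if (n < 0) {
                if (errno == EINTR) {
                    continue;
                }
                return -errno;
            }
            if (n == 0) {
                return -EIO; /* unexpected end of file */
            }
            pos += n;
        }
        offset = hole;
    }
    return 0;
}

/* Open path read-only and load its data extents into a zeroed buf. */
static int load_sparse_image(const char *path, uint8_t *buf, off_t size)
{
    int fd = open(path, O_RDONLY);
    int ret;

    if (fd < 0) {
        return -errno;
    }
    ret = read_nonzero_extents(fd, buf, size);
    close(fd);
    return ret;
}
```

On filesystems without hole tracking, `lseek()` reports the whole file as data, so the sketch degrades to a full read while still producing the same buffer contents.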