[V2] hw/riscv: virt: Remove size restriction for pflash

Message ID	20221107130217.2243815-1-sunilvl@ventanamicro.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org> From: Sunil V L <sunilvl@ventanamicro.com> To: Palmer Dabbelt <palmer@dabbelt.com>, Alistair Francis <alistair.francis@wdc.com>, Bin Meng <bin.meng@windriver.com>, Gerd Hoffmann <kraxel@redhat.com> Cc: qemu-riscv@nongnu.org, qemu-devel@nongnu.org, Sunil V L <sunilvl@ventanamicro.com> Subject: [PATCH V2] hw/riscv: virt: Remove size restriction for pflash Date: Mon, 7 Nov 2022 18:32:17 +0530 Message-Id: <20221107130217.2243815-1-sunilvl@ventanamicro.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Received-SPF: pass client-ip=2607:f8b0:4864:20::52e; envelope-from=sunilvl@ventanamicro.com; helo=mail-pg1-x52e.google.com X-Spam_score_int: -20 X-Spam_score: -2.1 X-Spam_bar: -- X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action Precedence: list Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org
Series	[V2] hw/riscv: virt: Remove size restriction for pflash \| expand [V2] hw/riscv: virt: Remove size restriction for pflash

Sunil V L Nov. 7, 2022, 1:02 p.m. UTC

The pflash implementation currently assumes fixed size of the
backend storage. Due to this, the backend storage file needs to be
exactly of size 32M. Otherwise, there will be an error like below.

"device requires 33554432 bytes, block backend provides 4194304 bytes"

Fix this issue by using the actual size of the backing store.

Signed-off-by: Sunil V L <sunilvl@ventanamicro.com>
---

Changes since V1:
	1) Handle the case when blk_getlength() returns failure.
	2) Added assertion to check actual size doesn't exceed the limit
	3) Updated virt_machine_done() to find the flash base dynamically
	4) Added code comments as suggested by Drew

 hw/riscv/virt.c | 59 ++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 48 insertions(+), 11 deletions(-)

Peter Maydell Nov. 7, 2022, 1:06 p.m. UTC | #1

On Mon, 7 Nov 2022 at 13:03, Sunil V L <sunilvl@ventanamicro.com> wrote:
>
> The pflash implementation currently assumes fixed size of the
> backend storage. Due to this, the backend storage file needs to be
> exactly of size 32M. Otherwise, there will be an error like below.
>
> "device requires 33554432 bytes, block backend provides 4194304 bytes"
>
> Fix this issue by using the actual size of the backing store.
>
> Signed-off-by: Sunil V L <sunilvl@ventanamicro.com>
> ---

Do you really want the flash device size presented to the guest
to be variable depending on what the user passed as a block backend?
I don't think this is how we handle flash devices on other boards...

thanks
-- PMM

Philippe Mathieu-Daudé Nov. 7, 2022, 2:08 p.m. UTC | #2

On 7/11/22 14:06, Peter Maydell wrote:
> On Mon, 7 Nov 2022 at 13:03, Sunil V L <sunilvl@ventanamicro.com> wrote:
>>
>> The pflash implementation currently assumes fixed size of the
>> backend storage. Due to this, the backend storage file needs to be
>> exactly of size 32M. Otherwise, there will be an error like below.
>>
>> "device requires 33554432 bytes, block backend provides 4194304 bytes"
>>
>> Fix this issue by using the actual size of the backing store.
>>
>> Signed-off-by: Sunil V L <sunilvl@ventanamicro.com>
>> ---
> 
> Do you really want the flash device size presented to the guest
> to be variable depending on what the user passed as a block backend?
> I don't think this is how we handle flash devices on other boards...

Ideally handling smaller/bigger backend size should be transparent for
machine frontend, but we never agreed on what are user expectations and
how to deal with such cases.

Long term I'd go for:

- if flash is read-only

   a/ bigger backend: display a warning and ignore extra backend data.

   b/ smaller backend: assume flash block is in erased state and fill
      missing gap with -1 (the default erase value), displaying a warning
      on startup.

- if flash is read-write

   a/ bigger backend: display a warning and ignore extra backend data.

   b/ smaller backend: add a property to pflash device to handle missing
      gap as erased data. If this flag is not set, display a hint and
      exit with an error.

In Sunil particular case, I suppose the issue comes from commit
334c388f25 ("hw/block/pflash_cfi0{1, 2}: Error out if device length
isn't a power of two") which I'm going to revert because the code
base is not ready for such check:

https://lore.kernel.org/qemu-devel/78b914c5-ce7e-1d4a-0a67-450f286eb869@linaro.org/

Regards,

Phil.

Sunil V L Nov. 7, 2022, 2:08 p.m. UTC | #3

On Mon, Nov 07, 2022 at 01:06:38PM +0000, Peter Maydell wrote:
> On Mon, 7 Nov 2022 at 13:03, Sunil V L <sunilvl@ventanamicro.com> wrote:
> >
> > The pflash implementation currently assumes fixed size of the
> > backend storage. Due to this, the backend storage file needs to be
> > exactly of size 32M. Otherwise, there will be an error like below.
> >
> > "device requires 33554432 bytes, block backend provides 4194304 bytes"
> >
> > Fix this issue by using the actual size of the backing store.
> >
> > Signed-off-by: Sunil V L <sunilvl@ventanamicro.com>
> > ---
> 
> Do you really want the flash device size presented to the guest
> to be variable depending on what the user passed as a block backend?
> I don't think this is how we handle flash devices on other boards...
> 

Hi Peter,

x86 appears to support variable flash but arm doesn't. What is
the reason for not supporting variable size flash in arm?

Thanks
Sunil

Alex Bennée Nov. 7, 2022, 3:50 p.m. UTC | #4

Sunil V L <sunilvl@ventanamicro.com> writes:

> On Mon, Nov 07, 2022 at 01:06:38PM +0000, Peter Maydell wrote:
>> On Mon, 7 Nov 2022 at 13:03, Sunil V L <sunilvl@ventanamicro.com> wrote:
>> >
>> > The pflash implementation currently assumes fixed size of the
>> > backend storage. Due to this, the backend storage file needs to be
>> > exactly of size 32M. Otherwise, there will be an error like below.
>> >
>> > "device requires 33554432 bytes, block backend provides 4194304 bytes"
>> >
>> > Fix this issue by using the actual size of the backing store.
>> >
>> > Signed-off-by: Sunil V L <sunilvl@ventanamicro.com>
>> > ---
>> 
>> Do you really want the flash device size presented to the guest
>> to be variable depending on what the user passed as a block backend?
>> I don't think this is how we handle flash devices on other boards...
>> 
>
> Hi Peter,
>
> x86 appears to support variable flash but arm doesn't. What is
> the reason for not supporting variable size flash in arm?

If I recall from the last time we went around this is was the question
of what you should pad it with.

  https://patchew.org/QEMU/20190307093723.655-1-armbru@redhat.com/20190307093723.655-3-armbru@redhat.com/

>
> Thanks
> Sunil

Markus Armbruster Nov. 7, 2022, 4:07 p.m. UTC | #5

Philippe Mathieu-Daudé <philmd@linaro.org> writes:

> On 7/11/22 14:06, Peter Maydell wrote:
>> On Mon, 7 Nov 2022 at 13:03, Sunil V L <sunilvl@ventanamicro.com> wrote:
>>>
>>> The pflash implementation currently assumes fixed size of the
>>> backend storage.

Intentional.

commit 06f1521795207359a395996c253c306f4ab7586e
Author: Markus Armbruster <armbru@redhat.com>
Date:   Tue Mar 19 17:35:50 2019 +0100

    pflash: Require backend size to match device, improve errors
    
    We reject undersized backends with a rather enigmatic "failed to read
    the initial flash content" error.  For instance:
    
        $ qemu-system-ppc64 -S -display none -M sam460ex -drive if=pflash,format=raw,file=eins.img
        qemu-system-ppc64: Initialization of device cfi.pflash02 failed: failed to read the initial flash content
    
    We happily accept oversized images, ignoring their tail.  Throwing
    away parts of firmware that way is pretty much certain to end in an
    even more enigmatic failure to boot.
    
    Require the backend's size to match the device's size exactly.  Report
    mismatch like this:
    
        qemu-system-ppc64: Initialization of device cfi.pflash01 failed: device requires 1048576 bytes, block backend provides 512 bytes
    
    Improve the error for actual read failures to "can't read block
    backend".
    
    To avoid duplicating even more code between the two pflash device
    models, do all that in new helper blk_check_size_and_read_all().
    
    The error reporting can still be confusing.  For instance:
    
        qemu-system-ppc64 -S -display none -M taihu -drive if=pflash,format=raw,file=eins.img  -drive if=pflash,unit=1,format=raw,file=zwei.img
        qemu-system-ppc64: Initialization of device cfi.pflash02 failed: device requires 2097152 bytes, block backend provides 512 bytes
    
    Leaves the user guessing which of the two -drive is wrong.  Mention
    the issue in a TODO comment.
    
    Suggested-by: Alex Bennée <alex.bennee@linaro.org>
    Signed-off-by: Markus Armbruster <armbru@redhat.com>
    Message-Id: <20190319163551.32499-2-armbru@redhat.com>
    Reviewed-by: Laszlo Ersek <lersek@redhat.com>
    Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
    Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>

>>>                  Due to this, the backend storage file needs to be
>>> exactly of size 32M. Otherwise, there will be an error like below.
>>>
>>> "device requires 33554432 bytes, block backend provides 4194304 bytes"

Why is that a problem?  Genuine question!

>>> Fix this issue by using the actual size of the backing store.
>>>
>>> Signed-off-by: Sunil V L <sunilvl@ventanamicro.com>
>>> ---
>> Do you really want the flash device size presented to the guest
>> to be variable depending on what the user passed as a block backend?
>> I don't think this is how we handle flash devices on other boards...

Flash device is generally a property of the machine type.  Similar to
physical machines.  Not an accident.

> Ideally handling smaller/bigger backend size should be transparent for
> machine frontend, but we never agreed on what are user expectations and
> how to deal with such cases.
>
> Long term I'd go for:
>
> - if flash is read-only
>
>   a/ bigger backend: display a warning and ignore extra backend data.

Truncating images seems unlikely to be useful.

>   b/ smaller backend: assume flash block is in erased state and fill
>      missing gap with -1 (the default erase value), displaying a warning
>      on startup.

Padding has a better chance to work.  But is it worth the trouble?

>
> - if flash is read-write
>
>   a/ bigger backend: display a warning and ignore extra backend data.
>
>   b/ smaller backend: add a property to pflash device to handle missing
>      gap as erased data. If this flag is not set, display a hint and
>      exit with an error.

What happens when the guest writes to the part that isn't backed by the
backend?

Is this worth the trouble?

> In Sunil particular case, I suppose the issue comes from commit
> 334c388f25 ("hw/block/pflash_cfi0{1, 2}: Error out if device length
> isn't a power of two") which I'm going to revert because the code
> base is not ready for such check:
>
> https://lore.kernel.org/qemu-devel/78b914c5-ce7e-1d4a-0a67-450f286eb869@linaro.org/
>
> Regards,
>
> Phil.

Peter Maydell Nov. 7, 2022, 4:08 p.m. UTC | #6

On Mon, 7 Nov 2022 at 14:08, Sunil V L <sunilvl@ventanamicro.com> wrote:
>
> On Mon, Nov 07, 2022 at 01:06:38PM +0000, Peter Maydell wrote:
> > On Mon, 7 Nov 2022 at 13:03, Sunil V L <sunilvl@ventanamicro.com> wrote:
> > >
> > > The pflash implementation currently assumes fixed size of the
> > > backend storage. Due to this, the backend storage file needs to be
> > > exactly of size 32M. Otherwise, there will be an error like below.
> > >
> > > "device requires 33554432 bytes, block backend provides 4194304 bytes"
> > >
> > > Fix this issue by using the actual size of the backing store.
> > >
> > > Signed-off-by: Sunil V L <sunilvl@ventanamicro.com>
> > > ---
> >
> > Do you really want the flash device size presented to the guest
> > to be variable depending on what the user passed as a block backend?
> > I don't think this is how we handle flash devices on other boards...
> >

> x86 appears to support variable flash but arm doesn't. What is
> the reason for not supporting variable size flash in arm?

Mostly, that that's the straightforward thing to code.
Partly, that it avoids weird cases (eg you can have a backing
file that's an odd number of bytes but you can't have a
flash that size).

If x86 has a standard way of handling this then I'm all
in favour of being consistent across platforms. What's
the x86 board doing there?

thanks
-- PMM

Daniel P. Berrangé Nov. 7, 2022, 4:19 p.m. UTC | #7

On Mon, Nov 07, 2022 at 03:50:44PM +0000, Alex Bennée wrote:
> 
> Sunil V L <sunilvl@ventanamicro.com> writes:
> 
> > On Mon, Nov 07, 2022 at 01:06:38PM +0000, Peter Maydell wrote:
> >> On Mon, 7 Nov 2022 at 13:03, Sunil V L <sunilvl@ventanamicro.com> wrote:
> >> >
> >> > The pflash implementation currently assumes fixed size of the
> >> > backend storage. Due to this, the backend storage file needs to be
> >> > exactly of size 32M. Otherwise, there will be an error like below.
> >> >
> >> > "device requires 33554432 bytes, block backend provides 4194304 bytes"
> >> >
> >> > Fix this issue by using the actual size of the backing store.
> >> >
> >> > Signed-off-by: Sunil V L <sunilvl@ventanamicro.com>
> >> > ---
> >> 
> >> Do you really want the flash device size presented to the guest
> >> to be variable depending on what the user passed as a block backend?
> >> I don't think this is how we handle flash devices on other boards...
> >> 
> >
> > Hi Peter,
> >
> > x86 appears to support variable flash but arm doesn't. What is
> > the reason for not supporting variable size flash in arm?
> 
> If I recall from the last time we went around this is was the question
> of what you should pad it with.

Padding is a very good thing from the POV of upgrades. Firmware has shown
a tendancy to change (grow) over time, and the size has an impact of the
guest ABI/live migration state.

To be able to live migrate, or save/restore to/from files, then the machine
firmware size needs to be sufficient to cope with future size changes of
the firmware. The best way to deal with this is to not use the firmware
binaries' minimum compiled size, but instead to pad it upto a higher
boundary.

Enforcing such padding is a decent way to prevent users from inadvertantly
painting themselves into a corner with a very specific firmware binary
size at initial boot.

With regards,
Daniel

Andrew Jones Nov. 7, 2022, 5:32 p.m. UTC | #8

On Mon, Nov 07, 2022 at 04:19:10PM +0000, Daniel P. Berrangé wrote:
> On Mon, Nov 07, 2022 at 03:50:44PM +0000, Alex Bennée wrote:
> > 
> > Sunil V L <sunilvl@ventanamicro.com> writes:
> > 
> > > On Mon, Nov 07, 2022 at 01:06:38PM +0000, Peter Maydell wrote:
> > >> On Mon, 7 Nov 2022 at 13:03, Sunil V L <sunilvl@ventanamicro.com> wrote:
> > >> >
> > >> > The pflash implementation currently assumes fixed size of the
> > >> > backend storage. Due to this, the backend storage file needs to be
> > >> > exactly of size 32M. Otherwise, there will be an error like below.
> > >> >
> > >> > "device requires 33554432 bytes, block backend provides 4194304 bytes"
> > >> >
> > >> > Fix this issue by using the actual size of the backing store.
> > >> >
> > >> > Signed-off-by: Sunil V L <sunilvl@ventanamicro.com>
> > >> > ---
> > >> 
> > >> Do you really want the flash device size presented to the guest
> > >> to be variable depending on what the user passed as a block backend?
> > >> I don't think this is how we handle flash devices on other boards...
> > >> 
> > >
> > > Hi Peter,
> > >
> > > x86 appears to support variable flash but arm doesn't. What is
> > > the reason for not supporting variable size flash in arm?
> > 
> > If I recall from the last time we went around this is was the question
> > of what you should pad it with.
> 
> Padding is a very good thing from the POV of upgrades. Firmware has shown
> a tendancy to change (grow) over time, and the size has an impact of the
> guest ABI/live migration state.
> 
> To be able to live migrate, or save/restore to/from files, then the machine
> firmware size needs to be sufficient to cope with future size changes of
> the firmware. The best way to deal with this is to not use the firmware
> binaries' minimum compiled size, but instead to pad it upto a higher
> boundary.
> 
> Enforcing such padding is a decent way to prevent users from inadvertantly
> painting themselves into a corner with a very specific firmware binary
> size at initial boot.

Padding is a good idea, but too much causes other problems. When building
lightweight VMs which may pull the firmware image from a network,
AArch64 VMs require 64MB of mostly zeros to be transferred first, which
can become a substantial amount of the overall boot time[*]. Being able to
create images smaller than the total flash device size, but still add some
pad for later growth, seems like the happy-medium to shoot for.

[*] My web searching skills are failing me at the moment, but I recall
seeing a BZ or gitlab/github issue specifically pointing out the AArch64
64MB firmware size as a pain point.

Thanks,
drew

Daniel P. Berrangé Nov. 7, 2022, 5:34 p.m. UTC | #9

On Mon, Nov 07, 2022 at 06:32:01PM +0100, Andrew Jones wrote:
> On Mon, Nov 07, 2022 at 04:19:10PM +0000, Daniel P. Berrangé wrote:
> > On Mon, Nov 07, 2022 at 03:50:44PM +0000, Alex Bennée wrote:
> > > 
> > > Sunil V L <sunilvl@ventanamicro.com> writes:
> > > 
> > > > On Mon, Nov 07, 2022 at 01:06:38PM +0000, Peter Maydell wrote:
> > > >> On Mon, 7 Nov 2022 at 13:03, Sunil V L <sunilvl@ventanamicro.com> wrote:
> > > >> >
> > > >> > The pflash implementation currently assumes fixed size of the
> > > >> > backend storage. Due to this, the backend storage file needs to be
> > > >> > exactly of size 32M. Otherwise, there will be an error like below.
> > > >> >
> > > >> > "device requires 33554432 bytes, block backend provides 4194304 bytes"
> > > >> >
> > > >> > Fix this issue by using the actual size of the backing store.
> > > >> >
> > > >> > Signed-off-by: Sunil V L <sunilvl@ventanamicro.com>
> > > >> > ---
> > > >> 
> > > >> Do you really want the flash device size presented to the guest
> > > >> to be variable depending on what the user passed as a block backend?
> > > >> I don't think this is how we handle flash devices on other boards...
> > > >> 
> > > >
> > > > Hi Peter,
> > > >
> > > > x86 appears to support variable flash but arm doesn't. What is
> > > > the reason for not supporting variable size flash in arm?
> > > 
> > > If I recall from the last time we went around this is was the question
> > > of what you should pad it with.
> > 
> > Padding is a very good thing from the POV of upgrades. Firmware has shown
> > a tendancy to change (grow) over time, and the size has an impact of the
> > guest ABI/live migration state.
> > 
> > To be able to live migrate, or save/restore to/from files, then the machine
> > firmware size needs to be sufficient to cope with future size changes of
> > the firmware. The best way to deal with this is to not use the firmware
> > binaries' minimum compiled size, but instead to pad it upto a higher
> > boundary.
> > 
> > Enforcing such padding is a decent way to prevent users from inadvertantly
> > painting themselves into a corner with a very specific firmware binary
> > size at initial boot.
> 
> Padding is a good idea, but too much causes other problems. When building
> lightweight VMs which may pull the firmware image from a network,
> AArch64 VMs require 64MB of mostly zeros to be transferred first, which
> can become a substantial amount of the overall boot time[*]. Being able to
> create images smaller than the total flash device size, but still add some
> pad for later growth, seems like the happy-medium to shoot for.

QEMU configures the firmware using -blockdev, so can use any file
format that QEMU supports at the block layer.  IOW, you can store
the firmware in a qcow2 file and thus you will never fetch any
of the padding zeros to be transferred.  That said I'm not sure
that libvirt supports anything other than a raw file today. 

With regards,
Daniel

Philippe Mathieu-Daudé Nov. 8, 2022, 2:12 p.m. UTC | #10

On 7/11/22 18:34, Daniel P. Berrangé wrote:
> On Mon, Nov 07, 2022 at 06:32:01PM +0100, Andrew Jones wrote:
>> On Mon, Nov 07, 2022 at 04:19:10PM +0000, Daniel P. Berrangé wrote:
>>> On Mon, Nov 07, 2022 at 03:50:44PM +0000, Alex Bennée wrote:
>>>>
>>>> Sunil V L <sunilvl@ventanamicro.com> writes:
>>>>
>>>>> On Mon, Nov 07, 2022 at 01:06:38PM +0000, Peter Maydell wrote:
>>>>>> On Mon, 7 Nov 2022 at 13:03, Sunil V L <sunilvl@ventanamicro.com> wrote:
>>>>>>>
>>>>>>> The pflash implementation currently assumes fixed size of the
>>>>>>> backend storage. Due to this, the backend storage file needs to be
>>>>>>> exactly of size 32M. Otherwise, there will be an error like below.
>>>>>>>
>>>>>>> "device requires 33554432 bytes, block backend provides 4194304 bytes"
>>>>>>>
>>>>>>> Fix this issue by using the actual size of the backing store.
>>>>>>>
>>>>>>> Signed-off-by: Sunil V L <sunilvl@ventanamicro.com>
>>>>>>> ---
>>>>>>
>>>>>> Do you really want the flash device size presented to the guest
>>>>>> to be variable depending on what the user passed as a block backend?
>>>>>> I don't think this is how we handle flash devices on other boards...
>>>>>>
>>>>>
>>>>> Hi Peter,
>>>>>
>>>>> x86 appears to support variable flash but arm doesn't. What is
>>>>> the reason for not supporting variable size flash in arm?
>>>>
>>>> If I recall from the last time we went around this is was the question
>>>> of what you should pad it with.
>>>
>>> Padding is a very good thing from the POV of upgrades. Firmware has shown
>>> a tendancy to change (grow) over time, and the size has an impact of the
>>> guest ABI/live migration state.
>>>
>>> To be able to live migrate, or save/restore to/from files, then the machine
>>> firmware size needs to be sufficient to cope with future size changes of
>>> the firmware. The best way to deal with this is to not use the firmware
>>> binaries' minimum compiled size, but instead to pad it upto a higher
>>> boundary.
>>>
>>> Enforcing such padding is a decent way to prevent users from inadvertantly
>>> painting themselves into a corner with a very specific firmware binary
>>> size at initial boot.
>>
>> Padding is a good idea, but too much causes other problems. When building
>> lightweight VMs which may pull the firmware image from a network,
>> AArch64 VMs require 64MB of mostly zeros to be transferred first, which
>> can become a substantial amount of the overall boot time[*]. Being able to
>> create images smaller than the total flash device size, but still add some
>> pad for later growth, seems like the happy-medium to shoot for.
> 
> QEMU configures the firmware using -blockdev, so can use any file
> format that QEMU supports at the block layer.  IOW, you can store
> the firmware in a qcow2 file and thus you will never fetch any
> of the padding zeros to be transferred.  That said I'm not sure
> that libvirt supports anything other than a raw file today.

Drew might be referring to:

https://lore.kernel.org/qemu-devel/20210810134050.396747-1-david.edmondson@oracle.com/

  > Currently ARM UEFI images are typically built as 2MB/768kB flash
  > images for code and variables respectively. These images are both
  > then padded out to 64MB before being loaded by QEMU.
  >
  > Because the images are 64MB each, QEMU allocates 128MB of memory to
  > read them, and then proceeds to read all 128MB from disk (dirtying
  > the memory). Of this 128MB less than 3MB is useful - the rest is
  > zero padding.
  >
  > On a machine with 100 VMs this wastes over 12GB of memory.

See previous attempts:
- Huawei
https://www.mail-archive.com/qemu-devel@nongnu.org/msg607292.html
- Tencent
https://www.mail-archive.com/qemu-devel@nongnu.org/msg742066.html
- Oracle
https://www.mail-archive.com/qemu-devel@nongnu.org/msg760065.html
- Red Hat
https://www.mail-archive.com/qemu-block@nongnu.org/msg81714.html

Andrew Jones Nov. 8, 2022, 2:49 p.m. UTC | #11

On Tue, Nov 08, 2022 at 03:12:42PM +0100, Philippe Mathieu-Daudé wrote:
> On 7/11/22 18:34, Daniel P. Berrangé wrote:
> > On Mon, Nov 07, 2022 at 06:32:01PM +0100, Andrew Jones wrote:
> > > On Mon, Nov 07, 2022 at 04:19:10PM +0000, Daniel P. Berrangé wrote:
> > > > On Mon, Nov 07, 2022 at 03:50:44PM +0000, Alex Bennée wrote:
> > > > > 
> > > > > Sunil V L <sunilvl@ventanamicro.com> writes:
> > > > > 
> > > > > > On Mon, Nov 07, 2022 at 01:06:38PM +0000, Peter Maydell wrote:
> > > > > > > On Mon, 7 Nov 2022 at 13:03, Sunil V L <sunilvl@ventanamicro.com> wrote:
> > > > > > > > 
> > > > > > > > The pflash implementation currently assumes fixed size of the
> > > > > > > > backend storage. Due to this, the backend storage file needs to be
> > > > > > > > exactly of size 32M. Otherwise, there will be an error like below.
> > > > > > > > 
> > > > > > > > "device requires 33554432 bytes, block backend provides 4194304 bytes"
> > > > > > > > 
> > > > > > > > Fix this issue by using the actual size of the backing store.
> > > > > > > > 
> > > > > > > > Signed-off-by: Sunil V L <sunilvl@ventanamicro.com>
> > > > > > > > ---
> > > > > > > 
> > > > > > > Do you really want the flash device size presented to the guest
> > > > > > > to be variable depending on what the user passed as a block backend?
> > > > > > > I don't think this is how we handle flash devices on other boards...
> > > > > > > 
> > > > > > 
> > > > > > Hi Peter,
> > > > > > 
> > > > > > x86 appears to support variable flash but arm doesn't. What is
> > > > > > the reason for not supporting variable size flash in arm?
> > > > > 
> > > > > If I recall from the last time we went around this is was the question
> > > > > of what you should pad it with.
> > > > 
> > > > Padding is a very good thing from the POV of upgrades. Firmware has shown
> > > > a tendancy to change (grow) over time, and the size has an impact of the
> > > > guest ABI/live migration state.
> > > > 
> > > > To be able to live migrate, or save/restore to/from files, then the machine
> > > > firmware size needs to be sufficient to cope with future size changes of
> > > > the firmware. The best way to deal with this is to not use the firmware
> > > > binaries' minimum compiled size, but instead to pad it upto a higher
> > > > boundary.
> > > > 
> > > > Enforcing such padding is a decent way to prevent users from inadvertantly
> > > > painting themselves into a corner with a very specific firmware binary
> > > > size at initial boot.
> > > 
> > > Padding is a good idea, but too much causes other problems. When building
> > > lightweight VMs which may pull the firmware image from a network,
> > > AArch64 VMs require 64MB of mostly zeros to be transferred first, which
> > > can become a substantial amount of the overall boot time[*]. Being able to
> > > create images smaller than the total flash device size, but still add some
> > > pad for later growth, seems like the happy-medium to shoot for.
> > 
> > QEMU configures the firmware using -blockdev, so can use any file
> > format that QEMU supports at the block layer.  IOW, you can store
> > the firmware in a qcow2 file and thus you will never fetch any
> > of the padding zeros to be transferred.  That said I'm not sure
> > that libvirt supports anything other than a raw file today.
> 
> Drew might be referring to:
> 
> https://lore.kernel.org/qemu-devel/20210810134050.396747-1-david.edmondson@oracle.com/

I was thinking of a different one, but thanks for the reference to this
one. Daniel's qcow2 proposal sounds like something worth investigating,
since some padding makes sense, but the guess for how much will likely
be wrong one direction or the other.

Thanks,
drew

> 
>  > Currently ARM UEFI images are typically built as 2MB/768kB flash
>  > images for code and variables respectively. These images are both
>  > then padded out to 64MB before being loaded by QEMU.
>  >
>  > Because the images are 64MB each, QEMU allocates 128MB of memory to
>  > read them, and then proceeds to read all 128MB from disk (dirtying
>  > the memory). Of this 128MB less than 3MB is useful - the rest is
>  > zero padding.
>  >
>  > On a machine with 100 VMs this wastes over 12GB of memory.
> 
> See previous attempts:
> - Huawei
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg607292.html
> - Tencent
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg742066.html
> - Oracle
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg760065.html
> - Red Hat
> https://www.mail-archive.com/qemu-block@nongnu.org/msg81714.html
>

Daniel P. Berrangé Nov. 8, 2022, 3:03 p.m. UTC | #12

On Tue, Nov 08, 2022 at 03:12:42PM +0100, Philippe Mathieu-Daudé wrote:
> On 7/11/22 18:34, Daniel P. Berrangé wrote:
> > On Mon, Nov 07, 2022 at 06:32:01PM +0100, Andrew Jones wrote:
> > > On Mon, Nov 07, 2022 at 04:19:10PM +0000, Daniel P. Berrangé wrote:
> > > > On Mon, Nov 07, 2022 at 03:50:44PM +0000, Alex Bennée wrote:
> > > > > 
> > > > > Sunil V L <sunilvl@ventanamicro.com> writes:
> > > > > 
> > > > > > On Mon, Nov 07, 2022 at 01:06:38PM +0000, Peter Maydell wrote:
> > > > > > > On Mon, 7 Nov 2022 at 13:03, Sunil V L <sunilvl@ventanamicro.com> wrote:
> > > > > > > > 
> > > > > > > > The pflash implementation currently assumes fixed size of the
> > > > > > > > backend storage. Due to this, the backend storage file needs to be
> > > > > > > > exactly of size 32M. Otherwise, there will be an error like below.
> > > > > > > > 
> > > > > > > > "device requires 33554432 bytes, block backend provides 4194304 bytes"
> > > > > > > > 
> > > > > > > > Fix this issue by using the actual size of the backing store.
> > > > > > > > 
> > > > > > > > Signed-off-by: Sunil V L <sunilvl@ventanamicro.com>
> > > > > > > > ---
> > > > > > > 
> > > > > > > Do you really want the flash device size presented to the guest
> > > > > > > to be variable depending on what the user passed as a block backend?
> > > > > > > I don't think this is how we handle flash devices on other boards...
> > > > > > > 
> > > > > > 
> > > > > > Hi Peter,
> > > > > > 
> > > > > > x86 appears to support variable flash but arm doesn't. What is
> > > > > > the reason for not supporting variable size flash in arm?
> > > > > 
> > > > > If I recall from the last time we went around this is was the question
> > > > > of what you should pad it with.
> > > > 
> > > > Padding is a very good thing from the POV of upgrades. Firmware has shown
> > > > a tendancy to change (grow) over time, and the size has an impact of the
> > > > guest ABI/live migration state.
> > > > 
> > > > To be able to live migrate, or save/restore to/from files, then the machine
> > > > firmware size needs to be sufficient to cope with future size changes of
> > > > the firmware. The best way to deal with this is to not use the firmware
> > > > binaries' minimum compiled size, but instead to pad it upto a higher
> > > > boundary.
> > > > 
> > > > Enforcing such padding is a decent way to prevent users from inadvertantly
> > > > painting themselves into a corner with a very specific firmware binary
> > > > size at initial boot.
> > > 
> > > Padding is a good idea, but too much causes other problems. When building
> > > lightweight VMs which may pull the firmware image from a network,
> > > AArch64 VMs require 64MB of mostly zeros to be transferred first, which
> > > can become a substantial amount of the overall boot time[*]. Being able to
> > > create images smaller than the total flash device size, but still add some
> > > pad for later growth, seems like the happy-medium to shoot for.
> > 
> > QEMU configures the firmware using -blockdev, so can use any file
> > format that QEMU supports at the block layer.  IOW, you can store
> > the firmware in a qcow2 file and thus you will never fetch any
> > of the padding zeros to be transferred.  That said I'm not sure
> > that libvirt supports anything other than a raw file today.
> 
> Drew might be referring to:
> 
> https://lore.kernel.org/qemu-devel/20210810134050.396747-1-david.edmondson@oracle.com/
> 
>  > Currently ARM UEFI images are typically built as 2MB/768kB flash
>  > images for code and variables respectively. These images are both
>  > then padded out to 64MB before being loaded by QEMU.
>  >
>  > Because the images are 64MB each, QEMU allocates 128MB of memory to
>  > read them, and then proceeds to read all 128MB from disk (dirtying
>  > the memory). Of this 128MB less than 3MB is useful - the rest is
>  > zero padding.
>  >
>  > On a machine with 100 VMs this wastes over 12GB of memory.
> 
> See previous attempts:
> - Huawei
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg607292.html
> - Tencent
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg742066.html
> - Oracle
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg760065.html
> - Red Hat
> https://www.mail-archive.com/qemu-block@nongnu.org/msg81714.html

I've not looked at the patch series above in detail, but I'm thinking
that even if the file is 64 MB in size, it should never read all 64 MB
of data. The block layer APIs have ability to detect and report holes,
so it ought to be possible to only read the data which is non-zero,
and thus avoid dirtying more than 3 MB of useful

With regards,
Daniel

Markus Armbruster Nov. 8, 2022, 4:01 p.m. UTC | #13

Peter Maydell <peter.maydell@linaro.org> writes:

> On Mon, 7 Nov 2022 at 14:08, Sunil V L <sunilvl@ventanamicro.com> wrote:
>>
>> On Mon, Nov 07, 2022 at 01:06:38PM +0000, Peter Maydell wrote:
>> > On Mon, 7 Nov 2022 at 13:03, Sunil V L <sunilvl@ventanamicro.com> wrote:
>> > >
>> > > The pflash implementation currently assumes fixed size of the
>> > > backend storage. Due to this, the backend storage file needs to be
>> > > exactly of size 32M. Otherwise, there will be an error like below.
>> > >
>> > > "device requires 33554432 bytes, block backend provides 4194304 bytes"
>> > >
>> > > Fix this issue by using the actual size of the backing store.
>> > >
>> > > Signed-off-by: Sunil V L <sunilvl@ventanamicro.com>
>> > > ---
>> >
>> > Do you really want the flash device size presented to the guest
>> > to be variable depending on what the user passed as a block backend?
>> > I don't think this is how we handle flash devices on other boards...
>> >
>
>> x86 appears to support variable flash but arm doesn't. What is
>> the reason for not supporting variable size flash in arm?
>
> Mostly, that that's the straightforward thing to code.
> Partly, that it avoids weird cases (eg you can have a backing
> file that's an odd number of bytes but you can't have a
> flash that size).
>
> If x86 has a standard way of handling this then I'm all
> in favour of being consistent across platforms. What's
> the x86 board doing there?

I'm hardly the expert here, but I messed with it at some time... I guess
I can try to answer.

It's complicated.  More often than not, long development history + need
for backward compatibility = complicated.  Which makes it (in my
opinion) not a useful example to follow.

We use either ROM or flash device(s) to hold the BIOS.

The flash option exists since v1.1.

The user interface for picking one or the other, and configure contents
has a long and complicated history.  I'll spare you the details.

ROM contents comes from an image file.  Configurable with -bios.
Default depends on the machine type.

ROM size is the image file size.  It's mapped so it ends right before
address 2^32.

It's mapped a second time so it ends right before 2^20.  If the image
file is larger than 128KiB, only the last 128KiB are mapped there.

Flash contents comes from a block backend.  Configurable with machine
properties "pflash0" and "pflash1" (or legacy -drive if=pflash).  There
is no default.  If you don't configure flash contents, you get ROM
instead.

Flash device size is the block backend size.  Must be a multiple of
4KiB.

If "pflash0" is configured, we create a flash device so it ends right
before address 2^32.  If "pflash1" is also configured, we create a
second flash device so it ends right before the first one (no gap).
Combined size must not exceed a limit that depends on the machine type.

Aside: why two flash devices?  A physical machine has just one...  Well,
we need a read-only part for the code, and a read-write part for
persistent configuration ("varstore" in UEFI parlance).  Physical flash
devices let you write-protect partially, but our device models don't
implement that, it's all or nothing.  So we use two.  Kludge.

Again, there's a second mapping that ends right before 2^20, limited to
128KiB.

In my opinion, setting flash or ROM size to the size of the block
backend or image file is a bad idea.  It tangles up configuration of
frontend and backend.  We used to do this a lot (e.g. -drive).
Untangling (e.g. -device and -blockdev) was a lot of work.  With the
tangle kept around for compatibility.

Doug McIlroy's quip applies: "KISS became MICAHI: make it complicated
and hide it."

A physical machine has a fixed flash memory size.

A virtual machine models a physical machine.  It has a fixed flash
memory size, too.

If we want a machine type to model multiple variations of a physical
machine, say multiple flash sizes, the configuration knob really belongs
to the machine type.  Hiding it somewhere on the file system instead is
a bad idea.

Sunil V L Nov. 9, 2022, 10:07 a.m. UTC | #14

Thanks a lot for the feedback and history. I understand now there are good
reasons to mandate fixed size flash. Will drop this patch.

Thanks
Sunil

Markus Armbruster Nov. 9, 2022, 3:26 p.m. UTC | #15

Daniel P. Berrangé <berrange@redhat.com> writes:

> On Mon, Nov 07, 2022 at 06:32:01PM +0100, Andrew Jones wrote:

[...]

>> Padding is a good idea, but too much causes other problems. When building
>> lightweight VMs which may pull the firmware image from a network,
>> AArch64 VMs require 64MB of mostly zeros to be transferred first, which
>> can become a substantial amount of the overall boot time[*]. Being able to
>> create images smaller than the total flash device size, but still add some
>> pad for later growth, seems like the happy-medium to shoot for.
>
> QEMU configures the firmware using -blockdev,

Yes, even though the devices in question are not block devices.

>                                               so can use any file
> format that QEMU supports at the block layer.  IOW, you can store
> the firmware in a qcow2 file and thus you will never fetch any
> of the padding zeros to be transferred.  That said I'm not sure
> that libvirt supports anything other than a raw file today. 

Here's another idea.  The "raw" format supports exposing a slice of the
underlying block node (options @offset and @size).  It could support
padding.  Writing to the padding should then grow the underlying node.

Taking a step back to look at the bigger picture...  there are three
issues, I think:

(A) Storing padding on disk is wasteful.

    Use a file system that supports sparse files, or an image format
    that can represent the padding efficiently.

(B) Reading padding into memory is wasteful.

    Matters mostly when a network is involved.  Use an image format that
    can represent the padding efficiently.

(C) Dirtying memory for padding is wasteful.

    I figure KSM could turn zero-padding into holes.

    We could play with mmap() & friends.

    Other ideas?

Any solution needs to work both for read-only and read/write padding.
Throwing away data written to the padding on cold restart is not what
I'd regard as "works".

Daniel P. Berrangé Nov. 9, 2022, 3:30 p.m. UTC | #16

On Wed, Nov 09, 2022 at 04:26:53PM +0100, Markus Armbruster wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > On Mon, Nov 07, 2022 at 06:32:01PM +0100, Andrew Jones wrote:
> 
> [...]
> 
> >> Padding is a good idea, but too much causes other problems. When building
> >> lightweight VMs which may pull the firmware image from a network,
> >> AArch64 VMs require 64MB of mostly zeros to be transferred first, which
> >> can become a substantial amount of the overall boot time[*]. Being able to
> >> create images smaller than the total flash device size, but still add some
> >> pad for later growth, seems like the happy-medium to shoot for.
> >
> > QEMU configures the firmware using -blockdev,
> 
> Yes, even though the devices in question are not block devices.
> 
> >                                               so can use any file
> > format that QEMU supports at the block layer.  IOW, you can store
> > the firmware in a qcow2 file and thus you will never fetch any
> > of the padding zeros to be transferred.  That said I'm not sure
> > that libvirt supports anything other than a raw file today. 
> 
> Here's another idea.  The "raw" format supports exposing a slice of the
> underlying block node (options @offset and @size).  It could support
> padding.  Writing to the padding should then grow the underlying node.
> 
> Taking a step back to look at the bigger picture...  there are three
> issues, I think:
> 
> (A) Storing padding on disk is wasteful.
> 
>     Use a file system that supports sparse files, or an image format
>     that can represent the padding efficiently.
> 
> (B) Reading padding into memory is wasteful.
> 
>     Matters mostly when a network is involved.  Use an image format that
>     can represent the padding efficiently.
> 
> (C) Dirtying memory for padding is wasteful.
> 
>     I figure KSM could turn zero-padding into holes.
> 
>     We could play with mmap() & friends.
> 
>     Other ideas?

Is (C) actually a separate issue ?  I thought it was simply the
result of (B) ?  ie if we skip reading the zero padding, we won't
be dirtying the memory with lots of zeros. we'll have mmap'd the
full 64 MB, but most won't be paged in since we wouldn't write
the zeros to it. Only if the guest writes to those areas do we
need to then flush it back out.

With regards,
Daniel

Markus Armbruster Nov. 9, 2022, 3:45 p.m. UTC | #17

Daniel P. Berrangé <berrange@redhat.com> writes:

> On Wed, Nov 09, 2022 at 04:26:53PM +0100, Markus Armbruster wrote:
>> Daniel P. Berrangé <berrange@redhat.com> writes:
>> 
>> > On Mon, Nov 07, 2022 at 06:32:01PM +0100, Andrew Jones wrote:
>> 
>> [...]
>> 
>> >> Padding is a good idea, but too much causes other problems. When building
>> >> lightweight VMs which may pull the firmware image from a network,
>> >> AArch64 VMs require 64MB of mostly zeros to be transferred first, which
>> >> can become a substantial amount of the overall boot time[*]. Being able to
>> >> create images smaller than the total flash device size, but still add some
>> >> pad for later growth, seems like the happy-medium to shoot for.
>> >
>> > QEMU configures the firmware using -blockdev,
>> 
>> Yes, even though the devices in question are not block devices.
>> 
>> >                                               so can use any file
>> > format that QEMU supports at the block layer.  IOW, you can store
>> > the firmware in a qcow2 file and thus you will never fetch any
>> > of the padding zeros to be transferred.  That said I'm not sure
>> > that libvirt supports anything other than a raw file today. 
>> 
>> Here's another idea.  The "raw" format supports exposing a slice of the
>> underlying block node (options @offset and @size).  It could support
>> padding.  Writing to the padding should then grow the underlying node.
>> 
>> Taking a step back to look at the bigger picture...  there are three
>> issues, I think:
>> 
>> (A) Storing padding on disk is wasteful.
>> 
>>     Use a file system that supports sparse files, or an image format
>>     that can represent the padding efficiently.
>> 
>> (B) Reading padding into memory is wasteful.
>> 
>>     Matters mostly when a network is involved.  Use an image format that
>>     can represent the padding efficiently.
>> 
>> (C) Dirtying memory for padding is wasteful.
>> 
>>     I figure KSM could turn zero-padding into holes.
>> 
>>     We could play with mmap() & friends.
>> 
>>     Other ideas?
>
> Is (C) actually a separate issue ?  I thought it was simply the
> result of (B) ?  ie if we skip reading the zero padding, we won't
> be dirtying the memory with lots of zeros. we'll have mmap'd the
> full 64 MB, but most won't be paged in since we wouldn't write
> the zeros to it. Only if the guest writes to those areas do we
> need to then flush it back out.

I expressed myself poorly.  All three are related, but there's still a
distinction between each of them in my thinking.

Say we use an image format that compresses data.  Represents the padding
efficiently.  Storage on disk is efficient (A), and so is reading it
(B).  Trouble is decompressing it to memory dirties the memory unless we
take care not to write all-zero pages (C).

Clearer now?

Daniel P. Berrangé Nov. 9, 2022, 3:51 p.m. UTC | #18

On Wed, Nov 09, 2022 at 04:45:18PM +0100, Markus Armbruster wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > On Wed, Nov 09, 2022 at 04:26:53PM +0100, Markus Armbruster wrote:
> >> Daniel P. Berrangé <berrange@redhat.com> writes:
> >> 
> >> > On Mon, Nov 07, 2022 at 06:32:01PM +0100, Andrew Jones wrote:
> >> 
> >> [...]
> >> 
> >> >> Padding is a good idea, but too much causes other problems. When building
> >> >> lightweight VMs which may pull the firmware image from a network,
> >> >> AArch64 VMs require 64MB of mostly zeros to be transferred first, which
> >> >> can become a substantial amount of the overall boot time[*]. Being able to
> >> >> create images smaller than the total flash device size, but still add some
> >> >> pad for later growth, seems like the happy-medium to shoot for.
> >> >
> >> > QEMU configures the firmware using -blockdev,
> >> 
> >> Yes, even though the devices in question are not block devices.
> >> 
> >> >                                               so can use any file
> >> > format that QEMU supports at the block layer.  IOW, you can store
> >> > the firmware in a qcow2 file and thus you will never fetch any
> >> > of the padding zeros to be transferred.  That said I'm not sure
> >> > that libvirt supports anything other than a raw file today. 
> >> 
> >> Here's another idea.  The "raw" format supports exposing a slice of the
> >> underlying block node (options @offset and @size).  It could support
> >> padding.  Writing to the padding should then grow the underlying node.
> >> 
> >> Taking a step back to look at the bigger picture...  there are three
> >> issues, I think:
> >> 
> >> (A) Storing padding on disk is wasteful.
> >> 
> >>     Use a file system that supports sparse files, or an image format
> >>     that can represent the padding efficiently.
> >> 
> >> (B) Reading padding into memory is wasteful.
> >> 
> >>     Matters mostly when a network is involved.  Use an image format that
> >>     can represent the padding efficiently.
> >> 
> >> (C) Dirtying memory for padding is wasteful.
> >> 
> >>     I figure KSM could turn zero-padding into holes.
> >> 
> >>     We could play with mmap() & friends.
> >> 
> >>     Other ideas?
> >
> > Is (C) actually a separate issue ?  I thought it was simply the
> > result of (B) ?  ie if we skip reading the zero padding, we won't
> > be dirtying the memory with lots of zeros. we'll have mmap'd the
> > full 64 MB, but most won't be paged in since we wouldn't write
> > the zeros to it. Only if the guest writes to those areas do we
> > need to then flush it back out.
> 
> I expressed myself poorly.  All three are related, but there's still a
> distinction between each of them in my thinking.
> 
> Say we use an image format that compresses data.  Represents the padding
> efficiently.  Storage on disk is efficient (A), and so is reading it
> (B).  Trouble is decompressing it to memory dirties the memory unless we
> take care not to write all-zero pages (C).
> 
> Clearer now?

Ok yeah, so reading can be efficient, but if the reader doesn't pay
attention to where the holes are, it'll dirty memory anyway.

With regards,
Daniel

[V2] hw/riscv: virt: Remove size restriction for pflash

Commit Message

Comments

Patch