diff mbox series

[RFC,1/1] block/rbd: increase dynamically the image size

Message ID 20190411105025.97397-2-sgarzare@redhat.com (mailing list archive)
State New, archived
Headers show
Series block/rbd: increase dynamically the image size | expand

Commit Message

Stefano Garzarella April 11, 2019, 10:50 a.m. UTC
RBD APIs don't allow us to write more than the size set with rbd_create()
or rbd_resize().
In order to support growing images (eg. qcow2), we resize the image
before RW operations that exceed the current size.

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1171007
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
---
 block/rbd.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

Comments

Jason Dillaman April 11, 2019, 12:35 p.m. UTC | #1
On Thu, Apr 11, 2019 at 7:00 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
>
> RBD APIs don't allow us to write more than the size set with rbd_create()
> or rbd_resize().
> In order to support growing images (eg. qcow2), we resize the image
> before RW operations that exceed the current size.

What's the use-case for storing qcow2 images within a RBD image? RBD
images are already thinly provisioned, they support snapshots, they
can form a parent/child linked image hierarchy.

> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1171007
> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
> ---
>  block/rbd.c | 25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)
>
> diff --git a/block/rbd.c b/block/rbd.c
> index 0c549c9935..228658e20a 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -102,6 +102,7 @@ typedef struct BDRVRBDState {
>      rbd_image_t image;
>      char *image_name;
>      char *snap;
> +    uint64_t image_size;
>  } BDRVRBDState;
>
>  static int qemu_rbd_connect(rados_t *cluster, rados_ioctx_t *io_ctx,
> @@ -777,6 +778,14 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
>          goto failed_open;
>      }
>
> +    r = rbd_get_size(s->image, &s->image_size);
> +    if (r < 0) {
> +        error_setg_errno(errp, -r, "error reading image size from %s",
> +                         s->image_name);
> +        rbd_close(s->image);
> +        goto failed_open;
> +    }
> +
>      /* If we are using an rbd snapshot, we must be r/o, otherwise
>       * leave as-is */
>      if (s->snap != NULL) {
> @@ -921,6 +930,20 @@ static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
>          rcb->buf = acb->bounce;
>      }
>
> +    /*
> +     * RBD APIs don't allow us to write more than actual size, so in order
> +     * to support growing images, we resize the image before RW operations
> +     * that exceed the current size.
> +     */
> +    if (s->image_size < off + size) {
> +        r = rbd_resize(s->image, off + size);
> +        if (r < 0) {
> +            goto failed;
> +        }
> +
> +        s->image_size = off + size;
> +    }
> +
>      acb->ret = 0;
>      acb->error = 0;
>      acb->s = s;
> @@ -1066,6 +1089,8 @@ static int coroutine_fn qemu_rbd_co_truncate(BlockDriverState *bs,
>          return r;
>      }
>
> +    s->image_size = offset;
> +
>      return 0;
>  }
>
> --
> 2.20.1
>
>
Stefano Garzarella April 11, 2019, 1:02 p.m. UTC | #2
On Thu, Apr 11, 2019 at 08:35:44AM -0400, Jason Dillaman wrote:
> On Thu, Apr 11, 2019 at 7:00 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
> >
> > RBD APIs don't allow us to write more than the size set with rbd_create()
> > or rbd_resize().
> > In order to support growing images (eg. qcow2), we resize the image
> > before RW operations that exceed the current size.
> 
> What's the use-case for storing qcow2 images within a RBD image? RBD
> images are already thinly provisioned, they support snapshots, they
> can form a parent/child linked image hierarchy.
> 

Hi Jason,
I understand your point of view, maybe one use case could be if you have
a qcow2 image and you want to put it in the rdb pool without convert it.

I'm going through this BZ [1] and I'll ask if they have other
use cases in mind.

Anyway, if we want to support only raw format on RBD driver, maybe we
should return something different from current behaviour, also avoiding
to create the image:

$ qemu-img create -f qcow2 rbd:test/qcow.img 1G
  qemu-img: rbd:test/qcow.img: Could not write qcow2 header: Invalid argument

What do you think?

Thanks,
Stefano

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1171007
Jason Dillaman April 11, 2019, 5:06 p.m. UTC | #3
On Thu, Apr 11, 2019 at 9:02 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
>
> On Thu, Apr 11, 2019 at 08:35:44AM -0400, Jason Dillaman wrote:
> > On Thu, Apr 11, 2019 at 7:00 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
> > >
> > > RBD APIs don't allow us to write more than the size set with rbd_create()
> > > or rbd_resize().
> > > In order to support growing images (eg. qcow2), we resize the image
> > > before RW operations that exceed the current size.
> >
> > What's the use-case for storing qcow2 images within a RBD image? RBD
> > images are already thinly provisioned, they support snapshots, they
> > can form a parent/child linked image hierarchy.
> >
>
> Hi Jason,
> I understand your point of view, maybe one use case could be if you have
> a qcow2 image and you want to put it in the rdb pool without convert it.
>
> I'm going through this BZ [1] and I'll ask if they have other
> use cases in mind.

Assuming no good use-cases, perhaps it would just be better to make
the qemu-img error messages more clear.

> Anyway, if we want to support only raw format on RBD driver, maybe we
> should return something different from current behaviour, also avoiding
> to create the image:
>
> $ qemu-img create -f qcow2 rbd:test/qcow.img 1G
>   qemu-img: rbd:test/qcow.img: Could not write qcow2 header: Invalid argument
>
> What do you think?
>
> Thanks,
> Stefano
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1171007
Stefano Garzarella April 14, 2019, 1:20 p.m. UTC | #4
On Thu, Apr 11, 2019 at 01:06:49PM -0400, Jason Dillaman wrote:
> On Thu, Apr 11, 2019 at 9:02 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
> >
> > On Thu, Apr 11, 2019 at 08:35:44AM -0400, Jason Dillaman wrote:
> > > On Thu, Apr 11, 2019 at 7:00 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
> > > >
> > > > RBD APIs don't allow us to write more than the size set with rbd_create()
> > > > or rbd_resize().
> > > > In order to support growing images (eg. qcow2), we resize the image
> > > > before RW operations that exceed the current size.
> > >
> > > What's the use-case for storing qcow2 images within a RBD image? RBD
> > > images are already thinly provisioned, they support snapshots, they
> > > can form a parent/child linked image hierarchy.
> > >
> >
> > Hi Jason,
> > I understand your point of view, maybe one use case could be if you have
> > a qcow2 image and you want to put it in the rdb pool without convert it.
> >
> > I'm going through this BZ [1] and I'll ask if they have other
> > use cases in mind.
> 
> Assuming no good use-cases, perhaps it would just be better to make
> the qemu-img error messages more clear.
> 

Hi Jason,
I asked about use-cases and they want to use qcow2 on rbd in order to
take advantage of these qcow2 features [1]: external snapshots,
Copy-on-write, and optional compression and encryption.

Maybe the more interesting are external snapshots and Copy-on-write,
since encryption can be achieved with LUKS and rbd should support
compression for a specified pool.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1171007#c13

Cheers,
Stefano
Jason Dillaman April 14, 2019, 3:14 p.m. UTC | #5
On Sun, Apr 14, 2019 at 9:20 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
>
> On Thu, Apr 11, 2019 at 01:06:49PM -0400, Jason Dillaman wrote:
> > On Thu, Apr 11, 2019 at 9:02 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
> > >
> > > On Thu, Apr 11, 2019 at 08:35:44AM -0400, Jason Dillaman wrote:
> > > > On Thu, Apr 11, 2019 at 7:00 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
> > > > >
> > > > > RBD APIs don't allow us to write more than the size set with rbd_create()
> > > > > or rbd_resize().
> > > > > In order to support growing images (eg. qcow2), we resize the image
> > > > > before RW operations that exceed the current size.
> > > >
> > > > What's the use-case for storing qcow2 images within a RBD image? RBD
> > > > images are already thinly provisioned, they support snapshots, they
> > > > can form a parent/child linked image hierarchy.
> > > >
> > >
> > > Hi Jason,
> > > I understand your point of view, maybe one use case could be if you have
> > > a qcow2 image and you want to put it in the rdb pool without convert it.
> > >
> > > I'm going through this BZ [1] and I'll ask if they have other
> > > use cases in mind.
> >
> > Assuming no good use-cases, perhaps it would just be better to make
> > the qemu-img error messages more clear.
> >
>
> Hi Jason,
> I asked about use-cases and they want to use qcow2 on rbd in order to
> take advantage of these qcow2 features [1]: external snapshots,
> Copy-on-write, and optional compression and encryption.
>
> Maybe the more interesting are external snapshots and Copy-on-write,

Copy-on-write is natively supported by RBD. The concept of external
snapshots seems similar to just automating the process of creating a
new copy-on-write image. Compression is also supported by Ceph on the
cluster side by recent releases.

> since encryption can be achieved with LUKS and rbd should support
> compression for a specified pool.
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1171007#c13
>
> Cheers,
> Stefano
Kevin Wolf April 15, 2019, 8:04 a.m. UTC | #6
Am 14.04.2019 um 17:14 hat Jason Dillaman geschrieben:
> On Sun, Apr 14, 2019 at 9:20 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
> >
> > On Thu, Apr 11, 2019 at 01:06:49PM -0400, Jason Dillaman wrote:
> > > On Thu, Apr 11, 2019 at 9:02 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
> > > >
> > > > On Thu, Apr 11, 2019 at 08:35:44AM -0400, Jason Dillaman wrote:
> > > > > On Thu, Apr 11, 2019 at 7:00 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
> > > > > >
> > > > > > RBD APIs don't allow us to write more than the size set with rbd_create()
> > > > > > or rbd_resize().
> > > > > > In order to support growing images (eg. qcow2), we resize the image
> > > > > > before RW operations that exceed the current size.
> > > > >
> > > > > What's the use-case for storing qcow2 images within a RBD image? RBD
> > > > > images are already thinly provisioned, they support snapshots, they
> > > > > can form a parent/child linked image hierarchy.
> > > > >
> > > >
> > > > Hi Jason,
> > > > I understand your point of view, maybe one use case could be if you have
> > > > a qcow2 image and you want to put it in the rdb pool without convert it.
> > > >
> > > > I'm going through this BZ [1] and I'll ask if they have other
> > > > use cases in mind.
> > >
> > > Assuming no good use-cases, perhaps it would just be better to make
> > > the qemu-img error messages more clear.
> > >
> >
> > Hi Jason,
> > I asked about use-cases and they want to use qcow2 on rbd in order to
> > take advantage of these qcow2 features [1]: external snapshots,
> > Copy-on-write, and optional compression and encryption.
> >
> > Maybe the more interesting are external snapshots and Copy-on-write,
> 
> Copy-on-write is natively supported by RBD. The concept of external
> snapshots seems similar to just automating the process of creating a
> new copy-on-write image. Compression is also supported by Ceph on the
> cluster side by recent releases.

I think a potential actual use case could be persistent dirty bitmaps
for incremental backup. Though maybe that would be better served by
using the rbd image just as a raw external data file and keeping the
qcow2 metadata on a filesystem.

How fast is rbd_resize()? Does automatically resizing for every write
request actually work reasonably well in practice? If it does, there is
probably little reason not to allow it, even if the use cases are rather
obscure.

Kevin
Stefano Garzarella April 17, 2019, 7:34 a.m. UTC | #7
On Mon, Apr 15, 2019 at 10:04:52AM +0200, Kevin Wolf wrote:
> 
> I think a potential actual use case could be persistent dirty bitmaps
> for incremental backup. Though maybe that would be better served by
> using the rbd image just as a raw external data file and keeping the
> qcow2 metadata on a filesystem.

Thanks to point it out! I'll take a look to understand how to keep
metadata separated from the data.

> 
> How fast is rbd_resize()? Does automatically resizing for every write
> request actually work reasonably well in practice? If it does, there is
> probably little reason not to allow it, even if the use cases are rather
> obscure.

I'll try to measure the percentage of the time spent in the rbd_resize.

Another solution could be to pass to the rbd driver the virtual size of
the image and resize it only one time also if the preallocation is
disabled, because RBD will not allocate blocks but IIUC it only set the max
size.

Do you think make sense? Is it feasible?

Thanks,
Stefano
Kevin Wolf April 17, 2019, 8:04 a.m. UTC | #8
Am 17.04.2019 um 09:34 hat Stefano Garzarella geschrieben:
> On Mon, Apr 15, 2019 at 10:04:52AM +0200, Kevin Wolf wrote:
> > 
> > I think a potential actual use case could be persistent dirty bitmaps
> > for incremental backup. Though maybe that would be better served by
> > using the rbd image just as a raw external data file and keeping the
> > qcow2 metadata on a filesystem.
> 
> Thanks to point it out! I'll take a look to understand how to keep
> metadata separated from the data.

I'd consider the feature still experimental, but for local files, it
works like this:

    qemu-img create -f qcow2 -o data_file=test.raw test.qcow2 4G

And then just use test.qcow2. As long as you can put everything you need
into an rbd URL, the same approach should work. Otherwise, you may need
to use QMP blockdev-create on creation and possibly the data-file option
of the qcow2 driver for opening.

> > How fast is rbd_resize()? Does automatically resizing for every write
> > request actually work reasonably well in practice? If it does, there is
> > probably little reason not to allow it, even if the use cases are rather
> > obscure.
> 
> I'll try to measure the percentage of the time spent in the rbd_resize.
> 
> Another solution could be to pass to the rbd driver the virtual size of
> the image and resize it only one time also if the preallocation is
> disabled, because RBD will not allocate blocks but IIUC it only set the max
> size.
> 
> Do you think make sense? Is it feasible?

In theory yes, though it requires modification of every driver that
should be usable together with rbd (i.e. ideally all of the drivers). If
automatic resize works good enough, I'd prefer that.

Kevin
Stefano Garzarella April 19, 2019, 12:23 p.m. UTC | #9
Hi Kevin,

On Wed, Apr 17, 2019 at 10:04:43AM +0200, Kevin Wolf wrote:
> Am 17.04.2019 um 09:34 hat Stefano Garzarella geschrieben:
> > On Mon, Apr 15, 2019 at 10:04:52AM +0200, Kevin Wolf wrote:
> > > 
> > > I think a potential actual use case could be persistent dirty bitmaps
> > > for incremental backup. Though maybe that would be better served by
> > > using the rbd image just as a raw external data file and keeping the
> > > qcow2 metadata on a filesystem.
> > 
> > Thanks to point it out! I'll take a look to understand how to keep
> > metadata separated from the data.
> 
> I'd consider the feature still experimental, but for local files, it
> works like this:
> 
>     qemu-img create -f qcow2 -o data_file=test.raw test.qcow2 4G
> 
> And then just use test.qcow2. As long as you can put everything you need
> into an rbd URL, the same approach should work. Otherwise, you may need
> to use QMP blockdev-create on creation and possibly the data-file option
> of the qcow2 driver for opening.
>

Very interesting, I'll try to add this support also in the rbd driver.

> > > How fast is rbd_resize()? Does automatically resizing for every write
> > > request actually work reasonably well in practice? If it does, there is
> > > probably little reason not to allow it, even if the use cases are rather
> > > obscure.
> > 
> > I'll try to measure the percentage of the time spent in the rbd_resize.
> > 
> > Another solution could be to pass to the rbd driver the virtual size of
> > the image and resize it only one time also if the preallocation is
> > disabled, because RBD will not allocate blocks but IIUC it only set the max
> > size.
> > 
> > Do you think make sense? Is it feasible?
> 
> In theory yes, though it requires modification of every driver that
> should be usable together with rbd (i.e. ideally all of the drivers). If
> automatic resize works good enough, I'd prefer that

I did some tests and it seems that the cost of rbd_resize() is
negligible. IIUC it only updates the metadata without allocating any
blocks (if we are growing, like that case).

Anyway the automatic resize will not affect the current use-case (raw
images on rbd), where the file size is set during the creation, so I
think there should not be side effects with this patch.

I'm also adding the support for preallocation (i.e. full) in the rbd
driver that can be useful for qcow2 images.

If you prefer I can resend this patch with the preallocation series.

Thanks,
Stefano
Kevin Wolf April 23, 2019, 7:56 a.m. UTC | #10
Am 19.04.2019 um 14:23 hat Stefano Garzarella geschrieben:
> Hi Kevin,
> 
> On Wed, Apr 17, 2019 at 10:04:43AM +0200, Kevin Wolf wrote:
> > Am 17.04.2019 um 09:34 hat Stefano Garzarella geschrieben:
> > > On Mon, Apr 15, 2019 at 10:04:52AM +0200, Kevin Wolf wrote:
> > > > 
> > > > I think a potential actual use case could be persistent dirty bitmaps
> > > > for incremental backup. Though maybe that would be better served by
> > > > using the rbd image just as a raw external data file and keeping the
> > > > qcow2 metadata on a filesystem.
> > > 
> > > Thanks to point it out! I'll take a look to understand how to keep
> > > metadata separated from the data.
> > 
> > I'd consider the feature still experimental, but for local files, it
> > works like this:
> > 
> >     qemu-img create -f qcow2 -o data_file=test.raw test.qcow2 4G
> > 
> > And then just use test.qcow2. As long as you can put everything you need
> > into an rbd URL, the same approach should work. Otherwise, you may need
> > to use QMP blockdev-create on creation and possibly the data-file option
> > of the qcow2 driver for opening.
> 
> Very interesting, I'll try to add this support also in the rbd driver.

I don't understand - what is the thing you want to add to the rbd driver?
qcow2 doesn't need special protocol driver support for doing this, and I
don't think the QEMU rbd driver has any metadata that could be split off.

> > > > How fast is rbd_resize()? Does automatically resizing for every write
> > > > request actually work reasonably well in practice? If it does, there is
> > > > probably little reason not to allow it, even if the use cases are rather
> > > > obscure.
> > > 
> > > I'll try to measure the percentage of the time spent in the rbd_resize.
> > > 
> > > Another solution could be to pass to the rbd driver the virtual size of
> > > the image and resize it only one time also if the preallocation is
> > > disabled, because RBD will not allocate blocks but IIUC it only set the max
> > > size.
> > > 
> > > Do you think make sense? Is it feasible?
> > 
> > In theory yes, though it requires modification of every driver that
> > should be usable together with rbd (i.e. ideally all of the drivers). If
> > automatic resize works good enough, I'd prefer that
> 
> I did some tests and it seems that the cost of rbd_resize() is
> negligible. IIUC it only updates the metadata without allocating any
> blocks (if we are growing, like that case).
> 
> Anyway the automatic resize will not affect the current use-case (raw
> images on rbd), where the file size is set during the creation, so I
> think there should not be side effects with this patch.

Okay, sounds good.

> I'm also adding the support for preallocation (i.e. full) in the rbd
> driver that can be useful for qcow2 images.
> 
> If you prefer I can resend this patch with the preallocation series.

Let's keep seperate things separate. Huge patch series are always harder
to handle.

Kevin
Stefano Garzarella April 23, 2019, 8:26 a.m. UTC | #11
On Tue, Apr 23, 2019 at 09:56:19AM +0200, Kevin Wolf wrote:
> Am 19.04.2019 um 14:23 hat Stefano Garzarella geschrieben:
> > Hi Kevin,
> > 
> > On Wed, Apr 17, 2019 at 10:04:43AM +0200, Kevin Wolf wrote:
> > > Am 17.04.2019 um 09:34 hat Stefano Garzarella geschrieben:
> > > > On Mon, Apr 15, 2019 at 10:04:52AM +0200, Kevin Wolf wrote:
> > > > > 
> > > > > I think a potential actual use case could be persistent dirty bitmaps
> > > > > for incremental backup. Though maybe that would be better served by
> > > > > using the rbd image just as a raw external data file and keeping the
> > > > > qcow2 metadata on a filesystem.
> > > > 
> > > > Thanks to point it out! I'll take a look to understand how to keep
> > > > metadata separated from the data.
> > > 
> > > I'd consider the feature still experimental, but for local files, it
> > > works like this:
> > > 
> > >     qemu-img create -f qcow2 -o data_file=test.raw test.qcow2 4G
> > > 
> > > And then just use test.qcow2. As long as you can put everything you need
> > > into an rbd URL, the same approach should work. Otherwise, you may need
> > > to use QMP blockdev-create on creation and possibly the data-file option
> > > of the qcow2 driver for opening.
> > 
> > Very interesting, I'll try to add this support also in the rbd driver.
> 
> I don't understand - what is the thing you want to add to the rbd driver?
> qcow2 doesn't need special protocol driver support for doing this, and I
> don't think the QEMU rbd driver has any metadata that could be split off.
> 

Oh sorry, I didn't understand that was completely independent from the
protocol.

> > > > > How fast is rbd_resize()? Does automatically resizing for every write
> > > > > request actually work reasonably well in practice? If it does, there is
> > > > > probably little reason not to allow it, even if the use cases are rather
> > > > > obscure.
> > > > 
> > > > I'll try to measure the percentage of the time spent in the rbd_resize.
> > > > 
> > > > Another solution could be to pass to the rbd driver the virtual size of
> > > > the image and resize it only one time also if the preallocation is
> > > > disabled, because RBD will not allocate blocks but IIUC it only set the max
> > > > size.
> > > > 
> > > > Do you think make sense? Is it feasible?
> > > 
> > > In theory yes, though it requires modification of every driver that
> > > should be usable together with rbd (i.e. ideally all of the drivers). If
> > > automatic resize works good enough, I'd prefer that
> > 
> > I did some tests and it seems that the cost of rbd_resize() is
> > negligible. IIUC it only updates the metadata without allocating any
> > blocks (if we are growing, like that case).
> > 
> > Anyway the automatic resize will not affect the current use-case (raw
> > images on rbd), where the file size is set during the creation, so I
> > think there should not be side effects with this patch.
> 
> Okay, sounds good.
> 
> > I'm also adding the support for preallocation (i.e. full) in the rbd
> > driver that can be useful for qcow2 images.
> > 
> > If you prefer I can resend this patch with the preallocation series.
> 
> Let's keep seperate things separate. Huge patch series are always harder
> to handle.

Okay, thanks for the suggestion!

Should this patch go through your tree?

Thanks,
Stefano
Kevin Wolf April 23, 2019, 8:38 a.m. UTC | #12
Am 23.04.2019 um 10:26 hat Stefano Garzarella geschrieben:
> On Tue, Apr 23, 2019 at 09:56:19AM +0200, Kevin Wolf wrote:
> > Am 19.04.2019 um 14:23 hat Stefano Garzarella geschrieben:
> > > Hi Kevin,
> > > 
> > > On Wed, Apr 17, 2019 at 10:04:43AM +0200, Kevin Wolf wrote:
> > > > Am 17.04.2019 um 09:34 hat Stefano Garzarella geschrieben:
> > > > > On Mon, Apr 15, 2019 at 10:04:52AM +0200, Kevin Wolf wrote:
> > > > > > 
> > > > > > I think a potential actual use case could be persistent dirty bitmaps
> > > > > > for incremental backup. Though maybe that would be better served by
> > > > > > using the rbd image just as a raw external data file and keeping the
> > > > > > qcow2 metadata on a filesystem.
> > > > > 
> > > > > Thanks to point it out! I'll take a look to understand how to keep
> > > > > metadata separated from the data.
> > > > 
> > > > I'd consider the feature still experimental, but for local files, it
> > > > works like this:
> > > > 
> > > >     qemu-img create -f qcow2 -o data_file=test.raw test.qcow2 4G
> > > > 
> > > > And then just use test.qcow2. As long as you can put everything you need
> > > > into an rbd URL, the same approach should work. Otherwise, you may need
> > > > to use QMP blockdev-create on creation and possibly the data-file option
> > > > of the qcow2 driver for opening.
> > > 
> > > Very interesting, I'll try to add this support also in the rbd driver.
> > 
> > I don't understand - what is the thing you want to add to the rbd driver?
> > qcow2 doesn't need special protocol driver support for doing this, and I
> > don't think the QEMU rbd driver has any metadata that could be split off.
> > 
> 
> Oh sorry, I didn't understand that was completely independent from the
> protocol.
> 
> > > > > > How fast is rbd_resize()? Does automatically resizing for every write
> > > > > > request actually work reasonably well in practice? If it does, there is
> > > > > > probably little reason not to allow it, even if the use cases are rather
> > > > > > obscure.
> > > > > 
> > > > > I'll try to measure the percentage of the time spent in the rbd_resize.
> > > > > 
> > > > > Another solution could be to pass to the rbd driver the virtual size of
> > > > > the image and resize it only one time also if the preallocation is
> > > > > disabled, because RBD will not allocate blocks but IIUC it only set the max
> > > > > size.
> > > > > 
> > > > > Do you think make sense? Is it feasible?
> > > > 
> > > > In theory yes, though it requires modification of every driver that
> > > > should be usable together with rbd (i.e. ideally all of the drivers). If
> > > > automatic resize works good enough, I'd prefer that
> > > 
> > > I did some tests and it seems that the cost of rbd_resize() is
> > > negligible. IIUC it only updates the metadata without allocating any
> > > blocks (if we are growing, like that case).
> > > 
> > > Anyway the automatic resize will not affect the current use-case (raw
> > > images on rbd), where the file size is set during the creation, so I
> > > think there should not be side effects with this patch.
> > 
> > Okay, sounds good.
> > 
> > > I'm also adding the support for preallocation (i.e. full) in the rbd
> > > driver that can be useful for qcow2 images.
> > > 
> > > If you prefer I can resend this patch with the preallocation series.
> > 
> > Let's keep seperate things separate. Huge patch series are always harder
> > to handle.
> 
> Okay, thanks for the suggestion!
> 
> Should this patch go through your tree?

I think so, yes.

Kevin
Kevin Wolf April 29, 2019, 9:58 a.m. UTC | #13
Am 23.04.2019 um 10:38 hat Kevin Wolf geschrieben:
> Am 23.04.2019 um 10:26 hat Stefano Garzarella geschrieben:
> > On Tue, Apr 23, 2019 at 09:56:19AM +0200, Kevin Wolf wrote:
> > > Am 19.04.2019 um 14:23 hat Stefano Garzarella geschrieben:
> > > > Hi Kevin,
> > > > 
> > > > On Wed, Apr 17, 2019 at 10:04:43AM +0200, Kevin Wolf wrote:
> > > > > Am 17.04.2019 um 09:34 hat Stefano Garzarella geschrieben:
> > > > > > On Mon, Apr 15, 2019 at 10:04:52AM +0200, Kevin Wolf wrote:
> > > > > > > 
> > > > > > > I think a potential actual use case could be persistent dirty bitmaps
> > > > > > > for incremental backup. Though maybe that would be better served by
> > > > > > > using the rbd image just as a raw external data file and keeping the
> > > > > > > qcow2 metadata on a filesystem.
> > > > > > 
> > > > > > Thanks to point it out! I'll take a look to understand how to keep
> > > > > > metadata separated from the data.
> > > > > 
> > > > > I'd consider the feature still experimental, but for local files, it
> > > > > works like this:
> > > > > 
> > > > >     qemu-img create -f qcow2 -o data_file=test.raw test.qcow2 4G
> > > > > 
> > > > > And then just use test.qcow2. As long as you can put everything you need
> > > > > into an rbd URL, the same approach should work. Otherwise, you may need
> > > > > to use QMP blockdev-create on creation and possibly the data-file option
> > > > > of the qcow2 driver for opening.
> > > > 
> > > > Very interesting, I'll try to add this support also in the rbd driver.
> > > 
> > > I don't understand - what is the thing you want to add to the rbd driver?
> > > qcow2 doesn't need special protocol driver support for doing this, and I
> > > don't think the QEMU rbd driver has any metadata that could be split off.
> > > 
> > 
> > Oh sorry, I didn't understand that was completely independent from the
> > protocol.
> > 
> > > > > > > How fast is rbd_resize()? Does automatically resizing for every write
> > > > > > > request actually work reasonably well in practice? If it does, there is
> > > > > > > probably little reason not to allow it, even if the use cases are rather
> > > > > > > obscure.
> > > > > > 
> > > > > > I'll try to measure the percentage of the time spent in the rbd_resize.
> > > > > > 
> > > > > > Another solution could be to pass to the rbd driver the virtual size of
> > > > > > the image and resize it only one time also if the preallocation is
> > > > > > disabled, because RBD will not allocate blocks but IIUC it only set the max
> > > > > > size.
> > > > > > 
> > > > > > Do you think make sense? Is it feasible?
> > > > > 
> > > > > In theory yes, though it requires modification of every driver that
> > > > > should be usable together with rbd (i.e. ideally all of the drivers). If
> > > > > automatic resize works good enough, I'd prefer that
> > > > 
> > > > I did some tests and it seems that the cost of rbd_resize() is
> > > > negligible. IIUC it only updates the metadata without allocating any
> > > > blocks (if we are growing, like that case).
> > > > 
> > > > Anyway the automatic resize will not affect the current use-case (raw
> > > > images on rbd), where the file size is set during the creation, so I
> > > > think there should not be side effects with this patch.
> > > 
> > > Okay, sounds good.
> > > 
> > > > I'm also adding the support for preallocation (i.e. full) in the rbd
> > > > driver that can be useful for qcow2 images.
> > > > 
> > > > If you prefer I can resend this patch with the preallocation series.
> > > 
> > > Let's keep seperate things separate. Huge patch series are always harder
> > > to handle.
> > 
> > Okay, thanks for the suggestion!
> > 
> > Should this patch go through your tree?
> 
> I think so, yes.

Hm, this is an RFC patch, which suggests that it wasn't originally meant
to be merged as it is. Are you going to send a new version, or did it
turn out to be exactly what we want and the "RFC" tag was a mistake?

Kevin
Stefano Garzarella April 29, 2019, 10:11 a.m. UTC | #14
On Mon, Apr 29, 2019 at 11:58:55AM +0200, Kevin Wolf wrote:
> 
> Hm, this is an RFC patch, which suggests that it wasn't originally meant
> to be merged as it is. Are you going to send a new version, or did it
> turn out to be exactly what we want and the "RFC" tag was a mistake?

I put the "RFC" tag because I was not sure about this patch, but after
your comments and some tests on the rbd_resize() impact, I think that
it turns out exactly what we want.

If you agree, you can queue it as is.

Thanks,
Stefano
Kevin Wolf April 29, 2019, 10:25 a.m. UTC | #15
Am 11.04.2019 um 12:50 hat Stefano Garzarella geschrieben:
> RBD APIs don't allow us to write more than the size set with rbd_create()
> or rbd_resize().
> In order to support growing images (eg. qcow2), we resize the image
> before RW operations that exceed the current size.
> 
> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1171007
> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
> ---
>  block/rbd.c | 25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)
> 
> diff --git a/block/rbd.c b/block/rbd.c
> index 0c549c9935..228658e20a 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -102,6 +102,7 @@ typedef struct BDRVRBDState {
>      rbd_image_t image;
>      char *image_name;
>      char *snap;
> +    uint64_t image_size;
>  } BDRVRBDState;

Can't we use bs->total_sectors instead of adding a new image_size field?

>  static int qemu_rbd_connect(rados_t *cluster, rados_ioctx_t *io_ctx,
> @@ -777,6 +778,14 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
>          goto failed_open;
>      }
>  
> +    r = rbd_get_size(s->image, &s->image_size);
> +    if (r < 0) {
> +        error_setg_errno(errp, -r, "error reading image size from %s",
> +                         s->image_name);
> +        rbd_close(s->image);
> +        goto failed_open;
> +    }
> +
>      /* If we are using an rbd snapshot, we must be r/o, otherwise
>       * leave as-is */
>      if (s->snap != NULL) {
> @@ -921,6 +930,20 @@ static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
>          rcb->buf = acb->bounce;
>      }
>  
> +    /*
> +     * RBD APIs don't allow us to write more than actual size, so in order
> +     * to support growing images, we resize the image before RW operations
> +     * that exceed the current size.
> +     */
> +    if (s->image_size < off + size) {
> +        r = rbd_resize(s->image, off + size);
> +        if (r < 0) {
> +            goto failed;
> +        }
> +
> +        s->image_size = off + size;
> +    }

This doesn't check the request type, so it's actually not limited to RW
operations, but even reads will try to resize the image. This is at
least surprising. For regular files, file-posix extends the file for
write requests, but for reads it returns a zeroed buffer without
actually changing the file size.

Kevin
Stefano Garzarella April 29, 2019, 2:04 p.m. UTC | #16
On Mon, Apr 29, 2019 at 12:25:10PM +0200, Kevin Wolf wrote:
> Am 11.04.2019 um 12:50 hat Stefano Garzarella geschrieben:
> > RBD APIs don't allow us to write more than the size set with rbd_create()
> > or rbd_resize().
> > In order to support growing images (eg. qcow2), we resize the image
> > before RW operations that exceed the current size.
> > 
> > Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1171007
> > Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
> > ---
> >  block/rbd.c | 25 +++++++++++++++++++++++++
> >  1 file changed, 25 insertions(+)
> > 
> > diff --git a/block/rbd.c b/block/rbd.c
> > index 0c549c9935..228658e20a 100644
> > --- a/block/rbd.c
> > +++ b/block/rbd.c
> > @@ -102,6 +102,7 @@ typedef struct BDRVRBDState {
> >      rbd_image_t image;
> >      char *image_name;
> >      char *snap;
> > +    uint64_t image_size;
> >  } BDRVRBDState;
> 
> Can't we use bs->total_sectors instead of adding a new image_size field?

I'm not sure we can use bs->total_sectors. IIUC, for example, it doesn't
take care of bytes used by QCOW2 metadata.

> 
> >  static int qemu_rbd_connect(rados_t *cluster, rados_ioctx_t *io_ctx,
> > @@ -777,6 +778,14 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
> >          goto failed_open;
> >      }
> >  
> > +    r = rbd_get_size(s->image, &s->image_size);
> > +    if (r < 0) {
> > +        error_setg_errno(errp, -r, "error reading image size from %s",
> > +                         s->image_name);
> > +        rbd_close(s->image);
> > +        goto failed_open;
> > +    }
> > +
> >      /* If we are using an rbd snapshot, we must be r/o, otherwise
> >       * leave as-is */
> >      if (s->snap != NULL) {
> > @@ -921,6 +930,20 @@ static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
> >          rcb->buf = acb->bounce;
> >      }
> >  
> > +    /*
> > +     * RBD APIs don't allow us to write more than actual size, so in order
> > +     * to support growing images, we resize the image before RW operations
> > +     * that exceed the current size.
> > +     */
> > +    if (s->image_size < off + size) {
> > +        r = rbd_resize(s->image, off + size);
> > +        if (r < 0) {
> > +            goto failed;
> > +        }
> > +
> > +        s->image_size = off + size;
> > +    }
> 
> This doesn't check the request type, so it's actually not limited to RW
> operations, but even reads will try to resize the image. This is at
> least surprising. For regular files, file-posix extends the file for
> write requests, but for reads it returns a zeroed buffer without
> actually changing the file size.

Yes, I'll change the behaviour in the v2.

I did some tries (i.e. using qemu-io and reading more than bytes used) and
the RBD driver didn't receive 'read' requests that exceed the current
size, maybe because it is managed in the QCOW2 protocol, but of course
I'll handle also in the RBD driver.

Thanks,
Stefano
Kevin Wolf April 29, 2019, 2:30 p.m. UTC | #17
Am 29.04.2019 um 16:04 hat Stefano Garzarella geschrieben:
> On Mon, Apr 29, 2019 at 12:25:10PM +0200, Kevin Wolf wrote:
> > Am 11.04.2019 um 12:50 hat Stefano Garzarella geschrieben:
> > > RBD APIs don't allow us to write more than the size set with rbd_create()
> > > or rbd_resize().
> > > In order to support growing images (eg. qcow2), we resize the image
> > > before RW operations that exceed the current size.
> > > 
> > > Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1171007
> > > Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
> > > ---
> > >  block/rbd.c | 25 +++++++++++++++++++++++++
> > >  1 file changed, 25 insertions(+)
> > > 
> > > diff --git a/block/rbd.c b/block/rbd.c
> > > index 0c549c9935..228658e20a 100644
> > > --- a/block/rbd.c
> > > +++ b/block/rbd.c
> > > @@ -102,6 +102,7 @@ typedef struct BDRVRBDState {
> > >      rbd_image_t image;
> > >      char *image_name;
> > >      char *snap;
> > > +    uint64_t image_size;
> > >  } BDRVRBDState;
> > 
> > Can't we use bs->total_sectors instead of adding a new image_size field?
> 
> I'm not sure we can use bs->total_sectors. IIUC, for example, it doesn't
> take care of bytes used by QCOW2 metadata.

bs->total_sectors for the rbd BLockDriverState is the image file size,
not the virtual disk size.

The only reason not to use it would be if we need byte granularity
rather than 512 byte granularity. But I don't think it's a problem to
round up offsets to the next 512 bytes (BDRV_SECTOR_SIZE) boundary.

> > >  static int qemu_rbd_connect(rados_t *cluster, rados_ioctx_t *io_ctx,
> > > @@ -777,6 +778,14 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
> > >          goto failed_open;
> > >      }
> > >  
> > > +    r = rbd_get_size(s->image, &s->image_size);
> > > +    if (r < 0) {
> > > +        error_setg_errno(errp, -r, "error reading image size from %s",
> > > +                         s->image_name);
> > > +        rbd_close(s->image);
> > > +        goto failed_open;
> > > +    }
> > > +
> > >      /* If we are using an rbd snapshot, we must be r/o, otherwise
> > >       * leave as-is */
> > >      if (s->snap != NULL) {
> > > @@ -921,6 +930,20 @@ static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
> > >          rcb->buf = acb->bounce;
> > >      }
> > >  
> > > +    /*
> > > +     * RBD APIs don't allow us to write more than actual size, so in order
> > > +     * to support growing images, we resize the image before RW operations
> > > +     * that exceed the current size.
> > > +     */
> > > +    if (s->image_size < off + size) {
> > > +        r = rbd_resize(s->image, off + size);
> > > +        if (r < 0) {
> > > +            goto failed;
> > > +        }
> > > +
> > > +        s->image_size = off + size;
> > > +    }
> > 
> > This doesn't check the request type, so it's actually not limited to RW
> > operations, but even reads will try to resize the image. This is at
> > least surprising. For regular files, file-posix extends the file for
> > write requests, but for reads it returns a zeroed buffer without
> > actually changing the file size.
> 
> Yes, I'll change the behaviour in the v2.
> 
> I did some tries (i.e. using qemu-io and reading more than bytes used) and
> the RBD driver didn't receive 'read' requests that exceed the current
> size, maybe because it is managed in the QCOW2 protocol, but of course
> I'll handle also in the RBD driver.

I don't remember the exact scenario where it happened, but I know I
implemented it for file-posix to fix a bug. Maybe it actually doesn't
happen any more because we have made other changes in the meantime, but
I'm not sure.

Kevin
Stefano Garzarella April 29, 2019, 3:55 p.m. UTC | #18
On Mon, Apr 29, 2019 at 04:30:14PM +0200, Kevin Wolf wrote:
> Am 29.04.2019 um 16:04 hat Stefano Garzarella geschrieben:
> > On Mon, Apr 29, 2019 at 12:25:10PM +0200, Kevin Wolf wrote:
> > > Am 11.04.2019 um 12:50 hat Stefano Garzarella geschrieben:
> > > > RBD APIs don't allow us to write more than the size set with rbd_create()
> > > > or rbd_resize().
> > > > In order to support growing images (eg. qcow2), we resize the image
> > > > before RW operations that exceed the current size.
> > > > 
> > > > Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1171007
> > > > Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
> > > > ---
> > > >  block/rbd.c | 25 +++++++++++++++++++++++++
> > > >  1 file changed, 25 insertions(+)
> > > > 
> > > > diff --git a/block/rbd.c b/block/rbd.c
> > > > index 0c549c9935..228658e20a 100644
> > > > --- a/block/rbd.c
> > > > +++ b/block/rbd.c
> > > > @@ -102,6 +102,7 @@ typedef struct BDRVRBDState {
> > > >      rbd_image_t image;
> > > >      char *image_name;
> > > >      char *snap;
> > > > +    uint64_t image_size;
> > > >  } BDRVRBDState;
> > > 
> > > Can't we use bs->total_sectors instead of adding a new image_size field?
> > 
> > I'm not sure we can use bs->total_sectors. IIUC, for example, it doesn't
> > take care of bytes used by QCOW2 metadata.
> 
> bs->total_sectors for the rbd BLockDriverState is the image file size,
> not the virtual disk size.
> 
> The only reason not to use it would be if we need byte granularity
> rather than 512 byte granularity. But I don't think it's a problem to
> round up offsets to the next 512 bytes (BDRV_SECTOR_SIZE) boundary.
> 

I tried and it works as you told me :) I'll remove the image_size in the
v2.

> > > >  static int qemu_rbd_connect(rados_t *cluster, rados_ioctx_t *io_ctx,
> > > > @@ -777,6 +778,14 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
> > > >          goto failed_open;
> > > >      }
> > > >  
> > > > +    r = rbd_get_size(s->image, &s->image_size);
> > > > +    if (r < 0) {
> > > > +        error_setg_errno(errp, -r, "error reading image size from %s",
> > > > +                         s->image_name);
> > > > +        rbd_close(s->image);
> > > > +        goto failed_open;
> > > > +    }
> > > > +
> > > >      /* If we are using an rbd snapshot, we must be r/o, otherwise
> > > >       * leave as-is */
> > > >      if (s->snap != NULL) {
> > > > @@ -921,6 +930,20 @@ static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
> > > >          rcb->buf = acb->bounce;
> > > >      }
> > > >  
> > > > +    /*
> > > > +     * RBD APIs don't allow us to write more than actual size, so in order
> > > > +     * to support growing images, we resize the image before RW operations
> > > > +     * that exceed the current size.
> > > > +     */
> > > > +    if (s->image_size < off + size) {
> > > > +        r = rbd_resize(s->image, off + size);
> > > > +        if (r < 0) {
> > > > +            goto failed;
> > > > +        }
> > > > +
> > > > +        s->image_size = off + size;
> > > > +    }
> > > 
> > > This doesn't check the request type, so it's actually not limited to RW
> > > operations, but even reads will try to resize the image. This is at
> > > least surprising. For regular files, file-posix extends the file for
> > > write requests, but for reads it returns a zeroed buffer without
> > > actually changing the file size.
> > 
> > Yes, I'll change the behaviour in the v2.
> > 
> > I did some tries (i.e. using qemu-io and reading more than bytes used) and
> > the RBD driver didn't receive 'read' requests that exceed the current
> > size, maybe because it is managed in the QCOW2 protocol, but of course
> > I'll handle also in the RBD driver.
> 
> I don't remember the exact scenario where it happened, but I know I
> implemented it for file-posix to fix a bug. Maybe it actually doesn't
> happen any more because we have made other changes in the meantime, but
> I'm not sure.
> 

Thanks for the details, I'll check better if we can avoid it, otherwise
I'll take care of this in the RBD driver.

Cheers,
Stefano
diff mbox series

Patch

diff --git a/block/rbd.c b/block/rbd.c
index 0c549c9935..228658e20a 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -102,6 +102,7 @@  typedef struct BDRVRBDState {
     rbd_image_t image;
     char *image_name;
     char *snap;
+    uint64_t image_size;
 } BDRVRBDState;
 
 static int qemu_rbd_connect(rados_t *cluster, rados_ioctx_t *io_ctx,
@@ -777,6 +778,14 @@  static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
         goto failed_open;
     }
 
+    r = rbd_get_size(s->image, &s->image_size);
+    if (r < 0) {
+        error_setg_errno(errp, -r, "error reading image size from %s",
+                         s->image_name);
+        rbd_close(s->image);
+        goto failed_open;
+    }
+
     /* If we are using an rbd snapshot, we must be r/o, otherwise
      * leave as-is */
     if (s->snap != NULL) {
@@ -921,6 +930,20 @@  static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
         rcb->buf = acb->bounce;
     }
 
+    /*
+     * RBD APIs don't allow us to write more than actual size, so in order
+     * to support growing images, we resize the image before RW operations
+     * that exceed the current size.
+     */
+    if (s->image_size < off + size) {
+        r = rbd_resize(s->image, off + size);
+        if (r < 0) {
+            goto failed;
+        }
+
+        s->image_size = off + size;
+    }
+
     acb->ret = 0;
     acb->error = 0;
     acb->s = s;
@@ -1066,6 +1089,8 @@  static int coroutine_fn qemu_rbd_co_truncate(BlockDriverState *bs,
         return r;
     }
 
+    s->image_size = offset;
+
     return 0;
 }