[v2,04/11] block: Add new bdrv_co_is_all_zeroes() function

Message ID 20250417184133.105746-17-eblake@redhat.com
State New
Series Make blockdev-mirror dest sparse in more cases

Commit Message

Eric Blake April 17, 2025, 6:39 p.m. UTC
There are some optimizations that require knowing if an image starts
out as reading all zeroes, such as making blockdev-mirror faster by
skipping the copying of source zeroes to the destination.  The
existing bdrv_co_is_zero_fast() is a good building block for answering
this question, but it tends to give an answer of 0 for a file we just
created via QMP 'blockdev-create' or similar (such as 'qemu-img create
-f raw').  Why?  Because file-posix.c insists on allocating a tiny
header to any file rather than leaving it 100% sparse, due to some
filesystems that are unable to answer alignment probes on a hole.  But
teaching file-posix.c to read the tiny header doesn't scale - the
problem of a small header is also visible when libvirt sets up an NBD
client to a just-created file on a migration destination host.

So, we need a wrapper function that handles a bit more complexity in a
common manner for all block devices - when the BDS is mostly a hole,
but has a small non-hole header, it is still worth the time to read
that header and check if it reads as all zeroes before giving up and
returning a pessimistic answer.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 include/block/block-io.h |  2 ++
 block/io.c               | 58 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 60 insertions(+)
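
For readers following the thread, the decision flow described in the commit message can be sketched outside QEMU as a toy model. This is only an illustration under stated assumptions: ToyExtent, ToyImage, and the toy_* functions are hypothetical names invented here, not QEMU APIs, and the extent list merely stands in for what a block status query would report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one block-status extent; not a QEMU type. */
typedef struct {
    int64_t length;   /* bytes covered by this extent */
    bool known_zero;  /* status already reports this extent as zeroes */
} ToyExtent;

/* Hypothetical in-memory image: its extents plus the first extent's bytes. */
typedef struct {
    const ToyExtent *extents;
    size_t n_extents;
    const uint8_t *head;  /* contents of extent 0; unused if it is known zero */
} ToyImage;

/* Return true if every extent from index 'first' onward is known zero. */
static bool toy_rest_is_zero(const ToyImage *img, size_t first)
{
    for (size_t i = first; i < img->n_extents; i++) {
        if (!img->extents[i].known_zero) {
            return false;
        }
    }
    return true;
}

/* Return true if the first 'len' bytes of 'buf' are all zero. */
static bool toy_buffer_is_zero(const uint8_t *buf, int64_t len)
{
    for (int64_t i = 0; i < len; i++) {
        if (buf[i]) {
            return false;
        }
    }
    return true;
}

/*
 * Return 1 if the toy image certainly reads as zeroes, 0 if that is
 * unknown.  'max_head' is the largest non-hole head worth reading.
 */
static int toy_is_all_zeroes(const ToyImage *img, int64_t max_head)
{
    assert(max_head >= 0);
    if (img->n_extents == 0) {
        return 1;  /* an empty image trivially reads as zeroes */
    }
    const ToyExtent *first = &img->extents[0];
    if (first->known_zero) {
        /* Fast path: only the remaining extents need checking. */
        return toy_rest_is_zero(img, 1);
    }
    if (first->length > max_head) {
        return 0;  /* head too large to be worth reading */
    }
    if (!toy_rest_is_zero(img, 1)) {
        return 0;  /* something after the head is not a known hole */
    }
    /* Only the small head is unknown, so inspect its bytes directly. */
    return toy_buffer_is_zero(img->head, first->length);
}
```

In this model, a 4 KiB allocated head of zero bytes followed by a known-zero extent yields 1, while the same layout with any non-zero byte in the head, or with a head larger than max_head, yields 0.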

Comments

Stefan Hajnoczi April 17, 2025, 8:35 p.m. UTC | #1
On Thu, Apr 17, 2025 at 01:39:09PM -0500, Eric Blake wrote:
> There are some optimizations that require knowing if an image starts
> out as reading all zeroes, such as making blockdev-mirror faster by
> skipping the copying of source zeroes to the destination.  The
> existing bdrv_co_is_zero_fast() is a good building block for answering
> this question, but it tends to give an answer of 0 for a file we just
> created via QMP 'blockdev-create' or similar (such as 'qemu-img create
> -f raw').  Why?  Because file-posix.c insists on allocating a tiny
> header to any file rather than leaving it 100% sparse, due to some
> filesystems that are unable to answer alignment probes on a hole.  But
> teaching file-posix.c to read the tiny header doesn't scale - the
> problem of a small header is also visible when libvirt sets up an NBD
> client to a just-created file on a migration destination host.
> 
> So, we need a wrapper function that handles a bit more complexity in a
> common manner for all block devices - when the BDS is mostly a hole,
> but has a small non-hole header, it is still worth the time to read
> that header and check if it reads as all zeroes before giving up and
> returning a pessimistic answer.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>  include/block/block-io.h |  2 ++
>  block/io.c               | 58 ++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 60 insertions(+)
> 
> diff --git a/include/block/block-io.h b/include/block/block-io.h
> index b49e0537dd4..b99cc98d265 100644
> --- a/include/block/block-io.h
> +++ b/include/block/block-io.h
> @@ -161,6 +161,8 @@ bdrv_is_allocated_above(BlockDriverState *bs, BlockDriverState *base,
> 
>  int coroutine_fn GRAPH_RDLOCK
>  bdrv_co_is_zero_fast(BlockDriverState *bs, int64_t offset, int64_t bytes);
> +int coroutine_fn GRAPH_RDLOCK
> +bdrv_co_is_all_zeroes(BlockDriverState *bs);
> 
>  int GRAPH_RDLOCK
>  bdrv_apply_auto_read_only(BlockDriverState *bs, const char *errmsg,
> diff --git a/block/io.c b/block/io.c
> index 6ef78070915..dc1341e4029 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -2778,6 +2778,64 @@ int coroutine_fn bdrv_co_is_zero_fast(BlockDriverState *bs, int64_t offset,
>      return 1;
>  }
> 
> +/*
> + * Check @bs (and its backing chain) to see if the entire image is known
> + * to read as zeroes.
> + * Return 1 if that is the case, 0 otherwise and -errno on error.
> + * This test is meant to be fast rather than accurate so returning 0
> + * does not guarantee non-zero data; however, it can report 1 in more

False negatives are possible; let's also document that false positives
are not possible:

  This test is meant to be fast rather than accurate so returning 0 does
  not guarantee non-zero data, but returning 1 does guarantee all zero
  data; ...

> + * cases than bdrv_co_is_zero_fast.
> + */
> +int coroutine_fn bdrv_co_is_all_zeroes(BlockDriverState *bs)
> +{
> +    int ret;
> +    int64_t pnum, bytes;
> +    char *buf;
> +    QEMUIOVector local_qiov;
> +    IO_CODE();
> +
> +    bytes = bdrv_co_getlength(bs);
> +    if (bytes < 0) {
> +        return bytes;
> +    }
> +
> +    /* First probe - see if the entire image reads as zero */
> +    ret = bdrv_co_common_block_status_above(bs, NULL, false, BDRV_BSTAT_ZERO,
> +                                            0, bytes, &pnum, NULL, NULL,
> +                                            NULL);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +    if (ret & BDRV_BLOCK_ZERO) {
> +        return bdrv_co_is_zero_fast(bs, pnum, bytes - pnum);
> +    }
> +
> +    /*
> +     * Because of the way 'blockdev-create' works, raw files tend to
> +     * be created with a non-sparse region at the front to make
> +     * alignment probing easier.  If the block starts with only a
> +     * small allocated region, it is still worth the effort to see if
> +     * the rest of the image is still sparse, coupled with manually
> +     * reading the first region to see if it reads zero after all.
> +     */
> +    if (pnum > qemu_real_host_page_size()) {

Probably not worth it for the corner case, but replacing
qemu_real_host_page_size() with 128 KiB would allow this to work on
images created on different CPU architectures (4 KiB vs 64 KiB page
sizes).

> +        return 0;
> +    }
> +    ret = bdrv_co_is_zero_fast(bs, pnum, bytes - pnum);
> +    if (ret <= 0) {
> +        return ret;
> +    }
> +    /* Only the head of the image is unknown, and it's small.  Read it.  */
> +    buf = qemu_blockalign(bs, pnum);
> +    qemu_iovec_init_buf(&local_qiov, buf, pnum);
> +    ret = bdrv_driver_preadv(bs, 0, pnum, &local_qiov, 0, 0);
> +    if (ret >= 0) {
> +        ret = buffer_is_zero(buf, pnum);
> +    }
> +    qemu_vfree(buf);
> +    return ret;
> +}
> +
>  int coroutine_fn bdrv_co_is_allocated(BlockDriverState *bs, int64_t offset,
>                                        int64_t bytes, int64_t *pnum)
>  {
> -- 
> 2.49.0
> 
>
Eric Blake April 18, 2025, 7:07 p.m. UTC | #2
On Thu, Apr 17, 2025 at 04:35:33PM -0400, Stefan Hajnoczi wrote:
> On Thu, Apr 17, 2025 at 01:39:09PM -0500, Eric Blake wrote:
> > There are some optimizations that require knowing if an image starts
> > out as reading all zeroes, such as making blockdev-mirror faster by
> > skipping the copying of source zeroes to the destination.  The
> > existing bdrv_co_is_zero_fast() is a good building block for answering
> > this question, but it tends to give an answer of 0 for a file we just
> > created via QMP 'blockdev-create' or similar (such as 'qemu-img create
> > -f raw').  Why?  Because file-posix.c insists on allocating a tiny
> > header to any file rather than leaving it 100% sparse, due to some
> > filesystems that are unable to answer alignment probes on a hole.  But
> > teaching file-posix.c to read the tiny header doesn't scale - the
> > problem of a small header is also visible when libvirt sets up an NBD
> > client to a just-created file on a migration destination host.
> > 
> > So, we need a wrapper function that handles a bit more complexity in a
> > common manner for all block devices - when the BDS is mostly a hole,
> > but has a small non-hole header, it is still worth the time to read
> > that header and check if it reads as all zeroes before giving up and
> > returning a pessimistic answer.
> > 
> > Signed-off-by: Eric Blake <eblake@redhat.com>
> > ---
> >  include/block/block-io.h |  2 ++
> >  block/io.c               | 58 ++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 60 insertions(+)
> > 
> > diff --git a/include/block/block-io.h b/include/block/block-io.h
> > index b49e0537dd4..b99cc98d265 100644
> > --- a/include/block/block-io.h
> > +++ b/include/block/block-io.h
> > @@ -161,6 +161,8 @@ bdrv_is_allocated_above(BlockDriverState *bs, BlockDriverState *base,
> > 
> >  int coroutine_fn GRAPH_RDLOCK
> >  bdrv_co_is_zero_fast(BlockDriverState *bs, int64_t offset, int64_t bytes);
> > +int coroutine_fn GRAPH_RDLOCK
> > +bdrv_co_is_all_zeroes(BlockDriverState *bs);
> > 
> >  int GRAPH_RDLOCK
> >  bdrv_apply_auto_read_only(BlockDriverState *bs, const char *errmsg,
> > diff --git a/block/io.c b/block/io.c
> > index 6ef78070915..dc1341e4029 100644
> > --- a/block/io.c
> > +++ b/block/io.c
> > @@ -2778,6 +2778,64 @@ int coroutine_fn bdrv_co_is_zero_fast(BlockDriverState *bs, int64_t offset,
> >      return 1;
> >  }
> > 
> > +/*
> > + * Check @bs (and its backing chain) to see if the entire image is known
> > + * to read as zeroes.
> > + * Return 1 if that is the case, 0 otherwise and -errno on error.
> > + * This test is meant to be fast rather than accurate so returning 0
> > + * does not guarantee non-zero data; however, it can report 1 in more
> 
> False negatives are possible; let's also document that false positives
> are not possible:
> 
>   This test is meant to be fast rather than accurate so returning 0 does
>   not guarantee non-zero data, but returning 1 does guarantee all zero
>   data; ...

The wording was copied from bdrv_co_is_zero_fast(); that comment can
use a similar treatment.

> 
> > + * cases than bdrv_co_is_zero_fast.
> > + */
> > +int coroutine_fn bdrv_co_is_all_zeroes(BlockDriverState *bs)
> > +{
> > +    int ret;
> > +    int64_t pnum, bytes;
> > +    char *buf;
> > +    QEMUIOVector local_qiov;
> > +    IO_CODE();
> > +
> > +    bytes = bdrv_co_getlength(bs);
> > +    if (bytes < 0) {
> > +        return bytes;
> > +    }
> > +
> > +    /* First probe - see if the entire image reads as zero */
> > +    ret = bdrv_co_common_block_status_above(bs, NULL, false, BDRV_BSTAT_ZERO,
> > +                                            0, bytes, &pnum, NULL, NULL,
> > +                                            NULL);
> > +    if (ret < 0) {
> > +        return ret;
> > +    }
> > +    if (ret & BDRV_BLOCK_ZERO) {
> > +        return bdrv_co_is_zero_fast(bs, pnum, bytes - pnum);
> > +    }
> > +
> > +    /*
> > +     * Because of the way 'blockdev-create' works, raw files tend to
> > +     * be created with a non-sparse region at the front to make
> > +     * alignment probing easier.  If the block starts with only a
> > +     * small allocated region, it is still worth the effort to see if
> > +     * the rest of the image is still sparse, coupled with manually
> > +     * reading the first region to see if it reads zero after all.
> > +     */
> > +    if (pnum > qemu_real_host_page_size()) {
> 
> Probably not worth it for the corner case, but replacing
> qemu_real_host_page_size() with 128 KiB would allow this to work on
> images created on different CPU architectures (4 KiB vs 64 KiB page
> sizes).

I picked the original value of qemu_real_host_page_size() based on
file-posix.c's allocate_first_block(), but I agree that picking a
constant 64k or even 128k for all platforms (rather than tying it to
the host's page size) won't hurt.  The key point remains that it
should be large enough to account for whatever file-posix.c does, yet
small enough that we aren't negating any potential optimization by the
time spent probing if the image reads as zeroes.
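
The threshold question raised above can be illustrated with a small sketch. It assumes, as the thread suggests, that the allocated head of a freshly created raw file tracks the page size of the host that created it. TOY_MAX_ZERO_CHECK and toy_head_worth_reading() are hypothetical names used only for this example, not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_KIB 1024
/* Hypothetical fixed limit, independent of the local host's page size. */
#define TOY_MAX_ZERO_CHECK (128 * TOY_KIB)

/*
 * Return true if an allocated head of 'pnum' bytes is small enough,
 * relative to 'limit', to be read and compared against zero.
 */
static bool toy_head_worth_reading(int64_t pnum, int64_t limit)
{
    assert(pnum >= 0 && limit >= 0);
    return pnum <= limit;
}
```

Under that assumption, a 64 KiB head examined on a host with 4 KiB pages is rejected when the limit is the local page size, but accepted with the fixed 128 KiB limit, which is the cross-architecture case the review comment describes.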

Patch

diff --git a/include/block/block-io.h b/include/block/block-io.h
index b49e0537dd4..b99cc98d265 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -161,6 +161,8 @@  bdrv_is_allocated_above(BlockDriverState *bs, BlockDriverState *base,

 int coroutine_fn GRAPH_RDLOCK
 bdrv_co_is_zero_fast(BlockDriverState *bs, int64_t offset, int64_t bytes);
+int coroutine_fn GRAPH_RDLOCK
+bdrv_co_is_all_zeroes(BlockDriverState *bs);

 int GRAPH_RDLOCK
 bdrv_apply_auto_read_only(BlockDriverState *bs, const char *errmsg,
diff --git a/block/io.c b/block/io.c
index 6ef78070915..dc1341e4029 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2778,6 +2778,64 @@  int coroutine_fn bdrv_co_is_zero_fast(BlockDriverState *bs, int64_t offset,
     return 1;
 }

+/*
+ * Check @bs (and its backing chain) to see if the entire image is known
+ * to read as zeroes.
+ * Return 1 if that is the case, 0 otherwise and -errno on error.
+ * This test is meant to be fast rather than accurate so returning 0
+ * does not guarantee non-zero data; however, it can report 1 in more
+ * cases than bdrv_co_is_zero_fast.
+ */
+int coroutine_fn bdrv_co_is_all_zeroes(BlockDriverState *bs)
+{
+    int ret;
+    int64_t pnum, bytes;
+    char *buf;
+    QEMUIOVector local_qiov;
+    IO_CODE();
+
+    bytes = bdrv_co_getlength(bs);
+    if (bytes < 0) {
+        return bytes;
+    }
+
+    /* First probe - see if the entire image reads as zero */
+    ret = bdrv_co_common_block_status_above(bs, NULL, false, BDRV_BSTAT_ZERO,
+                                            0, bytes, &pnum, NULL, NULL,
+                                            NULL);
+    if (ret < 0) {
+        return ret;
+    }
+    if (ret & BDRV_BLOCK_ZERO) {
+        return bdrv_co_is_zero_fast(bs, pnum, bytes - pnum);
+    }
+
+    /*
+     * Because of the way 'blockdev-create' works, raw files tend to
+     * be created with a non-sparse region at the front to make
+     * alignment probing easier.  If the block starts with only a
+     * small allocated region, it is still worth the effort to see if
+     * the rest of the image is still sparse, coupled with manually
+     * reading the first region to see if it reads zero after all.
+     */
+    if (pnum > qemu_real_host_page_size()) {
+        return 0;
+    }
+    ret = bdrv_co_is_zero_fast(bs, pnum, bytes - pnum);
+    if (ret <= 0) {
+        return ret;
+    }
+    /* Only the head of the image is unknown, and it's small.  Read it.  */
+    buf = qemu_blockalign(bs, pnum);
+    qemu_iovec_init_buf(&local_qiov, buf, pnum);
+    ret = bdrv_driver_preadv(bs, 0, pnum, &local_qiov, 0, 0);
+    if (ret >= 0) {
+        ret = buffer_is_zero(buf, pnum);
+    }
+    qemu_vfree(buf);
+    return ret;
+}
+
 int coroutine_fn bdrv_co_is_allocated(BlockDriverState *bs, int64_t offset,
                                       int64_t bytes, int64_t *pnum)
 {