Message ID | 20250417184133.105746-17-eblake@redhat.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Make blockdev-mirror dest sparse in more cases | expand |
On Thu, Apr 17, 2025 at 01:39:09PM -0500, Eric Blake wrote: > There are some optimizations that require knowing if an image starts > out as reading all zeroes, such as making blockdev-mirror faster by > skipping the copying of source zeroes to the destination. The > existing bdrv_co_is_zero_fast() is a good building block for answering > this question, but it tends to give an answer of 0 for a file we just > created via QMP 'blockdev-create' or similar (such as 'qemu-img create > -f raw'). Why? Because file-posix.c insists on allocating a tiny > header to any file rather than leaving it 100% sparse, due to some > filesystems that are unable to answer alignment probes on a hole. But > teaching file-posix.c to read the tiny header doesn't scale - the > problem of a small header is also visible when libvirt sets up an NBD > client to a just-created file on a migration destination host. > > So, we need a wrapper function that handles a bit more complexity in a > common manner for all block devices - when the BDS is mostly a hole, > but has a small non-hole header, it is still worth the time to read > that header and check if it reads as all zeroes before giving up and > returning a pessimistic answer. > > Signed-off-by: Eric Blake <eblake@redhat.com> > --- > include/block/block-io.h | 2 ++ > block/io.c | 58 ++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 60 insertions(+) > > diff --git a/include/block/block-io.h b/include/block/block-io.h > index b49e0537dd4..b99cc98d265 100644 > --- a/include/block/block-io.h > +++ b/include/block/block-io.h > @@ -161,6 +161,8 @@ bdrv_is_allocated_above(BlockDriverState *bs, BlockDriverState *base, > > int coroutine_fn GRAPH_RDLOCK > bdrv_co_is_zero_fast(BlockDriverState *bs, int64_t offset, int64_t bytes); > +int coroutine_fn GRAPH_RDLOCK > +bdrv_co_is_all_zeroes(BlockDriverState *bs); > > int GRAPH_RDLOCK > bdrv_apply_auto_read_only(BlockDriverState *bs, const char *errmsg, > diff --git a/block/io.c b/block/io.c > index 6ef78070915..dc1341e4029 100644 > --- a/block/io.c > +++ b/block/io.c > @@ -2778,6 +2778,64 @@ int coroutine_fn bdrv_co_is_zero_fast(BlockDriverState *bs, int64_t offset, > return 1; > } > > +/* > + * Check @bs (and its backing chain) to see if the entire image is known > + * to read as zeroes. > + * Return 1 if that is the case, 0 otherwise and -errno on error. > + * This test is meant to be fast rather than accurate so returning 0 > + * does not guarantee non-zero data; however, it can report 1 in more False negatives are possible, let's also document that false positives are not possible: This test is mean to be fast rather than accurate so returning 0 does not guarantee non-zero data, but returning 1 does guarantee all zero data; ... > + * cases than bdrv_co_is_zero_fast. > + */ > +int coroutine_fn bdrv_co_is_all_zeroes(BlockDriverState *bs) > +{ > + int ret; > + int64_t pnum, bytes; > + char *buf; > + QEMUIOVector local_qiov; > + IO_CODE(); > + > + bytes = bdrv_co_getlength(bs); > + if (bytes < 0) { > + return bytes; > + } > + > + /* First probe - see if the entire image reads as zero */ > + ret = bdrv_co_common_block_status_above(bs, NULL, false, BDRV_BSTAT_ZERO, > + 0, bytes, &pnum, NULL, NULL, > + NULL); > + if (ret < 0) { > + return ret; > + } > + if (ret & BDRV_BLOCK_ZERO) { > + return bdrv_co_is_zero_fast(bs, pnum, bytes - pnum); > + } > + > + /* > + * Because of the way 'blockdev-create' works, raw files tend to > + * be created with a non-sparse region at the front to make > + * alignment probing easier. If the block starts with only a > + * small allocated region, it is still worth the effort to see if > + * the rest of the image is still sparse, coupled with manually > + * reading the first region to see if it reads zero after all. > + */ > + if (pnum > qemu_real_host_page_size()) { Probably not worth it for the corner case, but replacing qemu_real_host_page_size() with 128 KiB would allow this to work on images created on different CPU architectures (4 KiB vs 64 KiB page sizes). > + return 0; > + } > + ret = bdrv_co_is_zero_fast(bs, pnum, bytes - pnum); > + if (ret <= 0) { > + return ret; > + } > + /* Only the head of the image is unknown, and it's small. Read it. */ > + buf = qemu_blockalign(bs, pnum); > + qemu_iovec_init_buf(&local_qiov, buf, pnum); > + ret = bdrv_driver_preadv(bs, 0, pnum, &local_qiov, 0, 0); > + if (ret >= 0) { > + ret = buffer_is_zero(buf, pnum); > + } > + qemu_vfree(buf); > + return ret; > +} > + > int coroutine_fn bdrv_co_is_allocated(BlockDriverState *bs, int64_t offset, > int64_t bytes, int64_t *pnum) > { > -- > 2.49.0 > >
On Thu, Apr 17, 2025 at 04:35:33PM -0400, Stefan Hajnoczi wrote: > On Thu, Apr 17, 2025 at 01:39:09PM -0500, Eric Blake wrote: > > There are some optimizations that require knowing if an image starts > > out as reading all zeroes, such as making blockdev-mirror faster by > > skipping the copying of source zeroes to the destination. The > > existing bdrv_co_is_zero_fast() is a good building block for answering > > this question, but it tends to give an answer of 0 for a file we just > > created via QMP 'blockdev-create' or similar (such as 'qemu-img create > > -f raw'). Why? Because file-posix.c insists on allocating a tiny > > header to any file rather than leaving it 100% sparse, due to some > > filesystems that are unable to answer alignment probes on a hole. But > > teaching file-posix.c to read the tiny header doesn't scale - the > > problem of a small header is also visible when libvirt sets up an NBD > > client to a just-created file on a migration destination host. > > > > So, we need a wrapper function that handles a bit more complexity in a > > common manner for all block devices - when the BDS is mostly a hole, > > but has a small non-hole header, it is still worth the time to read > > that header and check if it reads as all zeroes before giving up and > > returning a pessimistic answer. > > > > Signed-off-by: Eric Blake <eblake@redhat.com> > > --- > > include/block/block-io.h | 2 ++ > > block/io.c | 58 ++++++++++++++++++++++++++++++++++++++++ > > 2 files changed, 60 insertions(+) > > > > diff --git a/include/block/block-io.h b/include/block/block-io.h > > index b49e0537dd4..b99cc98d265 100644 > > --- a/include/block/block-io.h > > +++ b/include/block/block-io.h > > @@ -161,6 +161,8 @@ bdrv_is_allocated_above(BlockDriverState *bs, BlockDriverState *base, > > > > int coroutine_fn GRAPH_RDLOCK > > bdrv_co_is_zero_fast(BlockDriverState *bs, int64_t offset, int64_t bytes); > > +int coroutine_fn GRAPH_RDLOCK > > +bdrv_co_is_all_zeroes(BlockDriverState *bs); > > > > int GRAPH_RDLOCK > > bdrv_apply_auto_read_only(BlockDriverState *bs, const char *errmsg, > > diff --git a/block/io.c b/block/io.c > > index 6ef78070915..dc1341e4029 100644 > > --- a/block/io.c > > +++ b/block/io.c > > @@ -2778,6 +2778,64 @@ int coroutine_fn bdrv_co_is_zero_fast(BlockDriverState *bs, int64_t offset, > > return 1; > > } > > > > +/* > > + * Check @bs (and its backing chain) to see if the entire image is known > > + * to read as zeroes. > > + * Return 1 if that is the case, 0 otherwise and -errno on error. > > + * This test is meant to be fast rather than accurate so returning 0 > > + * does not guarantee non-zero data; however, it can report 1 in more > > False negatives are possible, let's also document that false positives > are not possible: > > This test is mean to be fast rather than accurate so returning 0 does > not guarantee non-zero data, but returning 1 does guarantee all zero > data; ... Copied from bdrv_co_is_zero_fast, but that wording can use a similar treatment. > > > + * cases than bdrv_co_is_zero_fast. > > + */ > > +int coroutine_fn bdrv_co_is_all_zeroes(BlockDriverState *bs) > > +{ > > + int ret; > > + int64_t pnum, bytes; > > + char *buf; > > + QEMUIOVector local_qiov; > > + IO_CODE(); > > + > > + bytes = bdrv_co_getlength(bs); > > + if (bytes < 0) { > > + return bytes; > > + } > > + > > + /* First probe - see if the entire image reads as zero */ > > + ret = bdrv_co_common_block_status_above(bs, NULL, false, BDRV_BSTAT_ZERO, > > + 0, bytes, &pnum, NULL, NULL, > > + NULL); > > + if (ret < 0) { > > + return ret; > > + } > > + if (ret & BDRV_BLOCK_ZERO) { > > + return bdrv_co_is_zero_fast(bs, pnum, bytes - pnum); > > + } > > + > > + /* > > + * Because of the way 'blockdev-create' works, raw files tend to > > + * be created with a non-sparse region at the front to make > > + * alignment probing easier. If the block starts with only a > > + * small allocated region, it is still worth the effort to see if > > + * the rest of the image is still sparse, coupled with manually > > + * reading the first region to see if it reads zero after all. > > + */ > > + if (pnum > qemu_real_host_page_size()) { > > Probably not worth it for the corner case, but replacing > qemu_real_host_page_size() with 128 KiB would allow this to work on > images created on different CPU architectures (4 KiB vs 64 KiB page > sizes). I picked the original value of qemu_real_host_page_size() based on file-posix.c's allocate_first_block(); but agree that picking a constant 64k or even 128k for all platforms (rather than tying it to the host's page size) won't hurt. The key point remains that it should be large enough to account for whatever file-posix.c does, yet small enough that we aren't negating any potential optimization by the time spent probing if the image reads as zeroes.
diff --git a/include/block/block-io.h b/include/block/block-io.h index b49e0537dd4..b99cc98d265 100644 --- a/include/block/block-io.h +++ b/include/block/block-io.h @@ -161,6 +161,8 @@ bdrv_is_allocated_above(BlockDriverState *bs, BlockDriverState *base, int coroutine_fn GRAPH_RDLOCK bdrv_co_is_zero_fast(BlockDriverState *bs, int64_t offset, int64_t bytes); +int coroutine_fn GRAPH_RDLOCK +bdrv_co_is_all_zeroes(BlockDriverState *bs); int GRAPH_RDLOCK bdrv_apply_auto_read_only(BlockDriverState *bs, const char *errmsg, diff --git a/block/io.c b/block/io.c index 6ef78070915..dc1341e4029 100644 --- a/block/io.c +++ b/block/io.c @@ -2778,6 +2778,64 @@ int coroutine_fn bdrv_co_is_zero_fast(BlockDriverState *bs, int64_t offset, return 1; } +/* + * Check @bs (and its backing chain) to see if the entire image is known + * to read as zeroes. + * Return 1 if that is the case, 0 otherwise and -errno on error. + * This test is meant to be fast rather than accurate so returning 0 + * does not guarantee non-zero data; however, it can report 1 in more + * cases than bdrv_co_is_zero_fast. + */ +int coroutine_fn bdrv_co_is_all_zeroes(BlockDriverState *bs) +{ + int ret; + int64_t pnum, bytes; + char *buf; + QEMUIOVector local_qiov; + IO_CODE(); + + bytes = bdrv_co_getlength(bs); + if (bytes < 0) { + return bytes; + } + + /* First probe - see if the entire image reads as zero */ + ret = bdrv_co_common_block_status_above(bs, NULL, false, BDRV_BSTAT_ZERO, + 0, bytes, &pnum, NULL, NULL, + NULL); + if (ret < 0) { + return ret; + } + if (ret & BDRV_BLOCK_ZERO) { + return bdrv_co_is_zero_fast(bs, pnum, bytes - pnum); + } + + /* + * Because of the way 'blockdev-create' works, raw files tend to + * be created with a non-sparse region at the front to make + * alignment probing easier. If the block starts with only a + * small allocated region, it is still worth the effort to see if + * the rest of the image is still sparse, coupled with manually + * reading the first region to see if it reads zero after all. + */ + if (pnum > qemu_real_host_page_size()) { + return 0; + } + ret = bdrv_co_is_zero_fast(bs, pnum, bytes - pnum); + if (ret <= 0) { + return ret; + } + /* Only the head of the image is unknown, and it's small. Read it. */ + buf = qemu_blockalign(bs, pnum); + qemu_iovec_init_buf(&local_qiov, buf, pnum); + ret = bdrv_driver_preadv(bs, 0, pnum, &local_qiov, 0, 0); + if (ret >= 0) { + ret = buffer_is_zero(buf, pnum); + } + qemu_vfree(buf); + return ret; +} + int coroutine_fn bdrv_co_is_allocated(BlockDriverState *bs, int64_t offset, int64_t bytes, int64_t *pnum) {
There are some optimizations that require knowing if an image starts out as reading all zeroes, such as making blockdev-mirror faster by skipping the copying of source zeroes to the destination. The existing bdrv_co_is_zero_fast() is a good building block for answering this question, but it tends to give an answer of 0 for a file we just created via QMP 'blockdev-create' or similar (such as 'qemu-img create -f raw'). Why? Because file-posix.c insists on allocating a tiny header to any file rather than leaving it 100% sparse, due to some filesystems that are unable to answer alignment probes on a hole. But teaching file-posix.c to read the tiny header doesn't scale - the problem of a small header is also visible when libvirt sets up an NBD client to a just-created file on a migration destination host. So, we need a wrapper function that handles a bit more complexity in a common manner for all block devices - when the BDS is mostly a hole, but has a small non-hole header, it is still worth the time to read that header and check if it reads as all zeroes before giving up and returning a pessimistic answer. Signed-off-by: Eric Blake <eblake@redhat.com> --- include/block/block-io.h | 2 ++ block/io.c | 58 ++++++++++++++++++++++++++++++++++++++++ 2 files changed, 60 insertions(+)