[4/6] qemu-img: rebase: avoid unnecessary COW operations

Message ID	20230601192836.598602-5-andrey.drobyshev@virtuozzo.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org> To: qemu-block@nongnu.org Cc: qemu-devel@nongnu.org, kwolf@redhat.com, hreitz@redhat.com, andrey.drobyshev@virtuozzo.com, den@virtuozzo.com Subject: [PATCH 4/6] qemu-img: rebase: avoid unnecessary COW operations Date: Thu, 1 Jun 2023 22:28:34 +0300 Message-Id: <20230601192836.598602-5-andrey.drobyshev@virtuozzo.com> In-Reply-To: <20230601192836.598602-1-andrey.drobyshev@virtuozzo.com> References: <20230601192836.598602-1-andrey.drobyshev@virtuozzo.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Received-SPF: pass client-ip=130.117.225.111; envelope-from=andrey.drobyshev@virtuozzo.com; helo=relay.virtuozzo.com X-Spam_score_int: -18 X-Spam_score: -1.9 X-Spam_bar: - X-Spam_report: (-1.9 / 5.0 requ) BAYES_00=-1.9, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action Precedence: list Reply-to: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com> From: Andrey Drobyshev via <qemu-devel@nongnu.org> Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org
Series	qemu-img: rebase: add compression support \| expand [0/6] qemu-img: rebase: add compression support [1/6] qemu-img: rebase: stop when reaching EOF of old backing file [2/6] qemu-iotests: 024: add rebasing test case for overlay_size > backing_size [3/6] qemu-img: rebase: use backing files' BlockBackend for buffer alignment [4/6] qemu-img: rebase: avoid unnecessary COW operations [5/6] qemu-img: add compression option to rebase subcommand [6/6] iotests: add test 314 for "qemu-img rebase" with compression

Message ID

20230601192836.598602-5-andrey.drobyshev@virtuozzo.com (mailing list archive)

State

New, archived

Headers

To: qemu-block@nongnu.org
Cc: qemu-devel@nongnu.org, kwolf@redhat.com, hreitz@redhat.com,
 andrey.drobyshev@virtuozzo.com, den@virtuozzo.com
Subject: [PATCH 4/6] qemu-img: rebase: avoid unnecessary COW operations
Date: Thu,  1 Jun 2023 22:28:34 +0300
Message-Id: <20230601192836.598602-5-andrey.drobyshev@virtuozzo.com>
In-Reply-To: <20230601192836.598602-1-andrey.drobyshev@virtuozzo.com>
References: <20230601192836.598602-1-andrey.drobyshev@virtuozzo.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Received-SPF: pass client-ip=130.117.225.111;
 envelope-from=andrey.drobyshev@virtuozzo.com; helo=relay.virtuozzo.com
X-Spam_score_int: -18
X-Spam_score: -1.9
X-Spam_bar: -
X-Spam_report: (-1.9 / 5.0 requ) BAYES_00=-1.9, SPF_HELO_NONE=0.001,
 SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no
X-Spam_action: no action
X-BeenThere: qemu-devel@nongnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
 <mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <https://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
 <mailto:qemu-devel-request@nongnu.org?subject=subscribe>
Reply-to: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
From: Andrey Drobyshev via <qemu-devel@nongnu.org>
Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org
Sender: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org

Series

qemu-img: rebase: add compression support | expand

Commit Message

Andrey Drobyshev June 1, 2023, 7:28 p.m. UTC

When rebasing an image from one backing file to another, we need to
compare data from old and new backings.  If the diff between that data
happens to be unaligned to the target cluster size, we might end up
doing partial writes, which would lead to copy-on-write and additional IO.

Consider the following simple case (virtual_size == cluster_size == 64K):

base <-- inc1 <-- inc2

qemu-io -c "write -P 0xaa 0 32K" base.qcow2
qemu-io -c "write -P 0xcc 32K 32K" base.qcow2
qemu-io -c "write -P 0xbb 0 32K" inc1.qcow2
qemu-io -c "write -P 0xcc 32K 32K" inc1.qcow2
qemu-img rebase -f qcow2 -b base.qcow2 -F qcow2 inc2.qcow2

While doing rebase, we'll write a half of the cluster to inc2, and block
layer will have to read the 2nd half of the same cluster from the base image
inc1 while doing this write operation, although the whole cluster is already
read earlier to perform data comparison.

In order to avoid these unnecessary IO cycles, let's make sure every
write request is aligned to the overlay cluster size.

Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
---
 qemu-img.c | 72 +++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 52 insertions(+), 20 deletions(-)

Comments

Denis V. Lunev June 21, 2023, 6:53 p.m. UTC | #1

On 6/1/23 21:28, Andrey Drobyshev wrote:
> When rebasing an image from one backing file to another, we need to
> compare data from old and new backings.  If the diff between that data
> happens to be unaligned to the target cluster size, we might end up
> doing partial writes, which would lead to copy-on-write and additional IO.
>
> Consider the following simple case (virtual_size == cluster_size == 64K):
>
> base <-- inc1 <-- inc2
>
> qemu-io -c "write -P 0xaa 0 32K" base.qcow2
> qemu-io -c "write -P 0xcc 32K 32K" base.qcow2
> qemu-io -c "write -P 0xbb 0 32K" inc1.qcow2
> qemu-io -c "write -P 0xcc 32K 32K" inc1.qcow2
> qemu-img rebase -f qcow2 -b base.qcow2 -F qcow2 inc2.qcow2
>
> While doing rebase, we'll write a half of the cluster to inc2, and block
> layer will have to read the 2nd half of the same cluster from the base image
> inc1 while doing this write operation, although the whole cluster is already
> read earlier to perform data comparison.
>
> In order to avoid these unnecessary IO cycles, let's make sure every
> write request is aligned to the overlay cluster size.
>
> Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
> ---
>   qemu-img.c | 72 +++++++++++++++++++++++++++++++++++++++---------------
>   1 file changed, 52 insertions(+), 20 deletions(-)
>
> diff --git a/qemu-img.c b/qemu-img.c
> index 60f4c06487..9a469cd609 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -3513,6 +3513,7 @@ static int img_rebase(int argc, char **argv)
>       uint8_t *buf_new = NULL;
>       BlockDriverState *bs = NULL, *prefix_chain_bs = NULL;
>       BlockDriverState *unfiltered_bs;
> +    BlockDriverInfo bdi = {0};
>       char *filename;
>       const char *fmt, *cache, *src_cache, *out_basefmt, *out_baseimg;
>       int c, flags, src_flags, ret;
> @@ -3646,6 +3647,15 @@ static int img_rebase(int argc, char **argv)
>           }
>       }
>   
> +    /* We need overlay cluster size to make sure write requests are aligned */
> +    ret = bdrv_get_info(unfiltered_bs, &bdi);
> +    if (ret < 0) {
> +        error_report("could not get block driver info");
> +        goto out;
> +    } else if (bdi.cluster_size == 0) {
> +        bdi.cluster_size = 1;
> +    }
> +
>       /* For safe rebasing we need to compare old and new backing file */
>       if (!unsafe) {
>           QDict *options = NULL;
> @@ -3744,6 +3754,7 @@ static int img_rebase(int argc, char **argv)
>           int64_t new_backing_size = 0;
>           uint64_t offset;
>           int64_t n;
> +        int64_t n_old = 0, n_new = 0;
>           float local_progress = 0;
>   
>           buf_old = blk_blockalign(blk_old_backing, IO_BUF_SIZE);
> @@ -3784,7 +3795,7 @@ static int img_rebase(int argc, char **argv)
>           }
>   
>           for (offset = 0; offset < size; offset += n) {
> -            bool buf_old_is_zero = false;
> +            bool old_backing_eof = false;
>   
>               /* How many bytes can we handle with the next read? */
>               n = MIN(IO_BUF_SIZE, size - offset);
> @@ -3829,33 +3840,38 @@ static int img_rebase(int argc, char **argv)
>                   }
>               }
>   
> +            /* At this point n must be aligned to the target cluster size. */
Minor: except last non-aligned cluster as stated by 'if' :)
> +            if (offset + n < size) {
> +                assert(n % bdi.cluster_size == 0);
> +            }
> +
> +            /*
> +             * Much like the with the target image, we'll try to read as much
> +             * of the old and new backings as we can.
> +             */
> +            n_old = MIN(n, MAX(0, old_backing_size - (int64_t) offset));
> +            if (blk_new_backing) {
> +                n_new = MIN(n, MAX(0, new_backing_size - (int64_t) offset));
> +            }
> +
>               /*
>                * Read old and new backing file and take into consideration that
>                * backing files may be smaller than the COW image.
>                */
> -            if (offset >= old_backing_size) {
> -                memset(buf_old, 0, n);
> -                buf_old_is_zero = true;
> +            memset(buf_old + n_old, 0, n - n_old);
> +            if (!n_old) {
> +                old_backing_eof = true;
>               } else {
> -                if (offset + n > old_backing_size) {
> -                    n = old_backing_size - offset;
> -                }
> -
> -                ret = blk_pread(blk_old_backing, offset, n, buf_old, 0);
> +                ret = blk_pread(blk_old_backing, offset, n_old, buf_old, 0);
>                   if (ret < 0) {
>                       error_report("error while reading from old backing file");
>                       goto out;
>                   }
>               }
>   
> -            if (offset >= new_backing_size || !blk_new_backing) {
> -                memset(buf_new, 0, n);
> -            } else {
> -                if (offset + n > new_backing_size) {
> -                    n = new_backing_size - offset;
> -                }
> -
> -                ret = blk_pread(blk_new_backing, offset, n, buf_new, 0);
> +            memset(buf_new + n_new, 0, n - n_new);
> +            if (blk_new_backing && n_new) {
> +                ret = blk_pread(blk_new_backing, offset, n_new, buf_new, 0);
>                   if (ret < 0) {
>                       error_report("error while reading from new backing file");
>                       goto out;
> @@ -3867,15 +3883,28 @@ static int img_rebase(int argc, char **argv)
>   
>               while (written < n) {
>                   int64_t pnum;
> +                int64_t start, end;
>   
>                   if (compare_buffers(buf_old + written, buf_new + written,
>                                       n - written, &pnum))
>                   {
> -                    if (buf_old_is_zero) {
> +                    if (old_backing_eof) {
>                           ret = blk_pwrite_zeroes(blk, offset + written, pnum, 0);
>                       } else {
> -                        ret = blk_pwrite(blk, offset + written, pnum,
> -                                         buf_old + written, 0);
> +                        /*
> +                         * If we've got to this point, it means the cluster
> +                         * we're dealing with is unallocated, and any partial
> +                         * write will cause COW.  To avoid that, we make sure
> +                         * request is aligned to cluster size.
> +                         */
> +                        start = QEMU_ALIGN_DOWN(offset + written,
> +                                                bdi.cluster_size);
> +                        end = QEMU_ALIGN_UP(offset + written + pnum,
> +                                            bdi.cluster_size);
> +                        end = end > size ? size : end;
Minor (?): end = MIN(size, QEMU_ALIGN_UP(offset + written + pnum, 
bdi.cluster_size))
> +                        ret = blk_pwrite(blk, start, end - start,
> +                                         buf_old + (start - offset), 0);
> +                        pnum = end - (offset + written);
>                       }
>                       if (ret < 0) {
>                           error_report("Error while writing to COW image: %s",
> @@ -3885,6 +3914,9 @@ static int img_rebase(int argc, char **argv)
>                   }
>   
>                   written += pnum;
> +                if (offset + written >= old_backing_size) {
> +                    old_backing_eof = true;
> +                }
>               }
>               qemu_progress_print(local_progress, 100);
>           }
With or without minors:
Reviewed-by: Denis V. Lunev <den@openvz.org>

Hanna Czenczek Aug. 25, 2023, 3 p.m. UTC | #2

On 01.06.23 21:28, Andrey Drobyshev via wrote:
> When rebasing an image from one backing file to another, we need to
> compare data from old and new backings.  If the diff between that data
> happens to be unaligned to the target cluster size, we might end up
> doing partial writes, which would lead to copy-on-write and additional IO.
>
> Consider the following simple case (virtual_size == cluster_size == 64K):
>
> base <-- inc1 <-- inc2
>
> qemu-io -c "write -P 0xaa 0 32K" base.qcow2
> qemu-io -c "write -P 0xcc 32K 32K" base.qcow2
> qemu-io -c "write -P 0xbb 0 32K" inc1.qcow2
> qemu-io -c "write -P 0xcc 32K 32K" inc1.qcow2
> qemu-img rebase -f qcow2 -b base.qcow2 -F qcow2 inc2.qcow2
>
> While doing rebase, we'll write a half of the cluster to inc2, and block
> layer will have to read the 2nd half of the same cluster from the base image
> inc1 while doing this write operation, although the whole cluster is already
> read earlier to perform data comparison.
>
> In order to avoid these unnecessary IO cycles, let's make sure every
> write request is aligned to the overlay cluster size.
>
> Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
> ---
>   qemu-img.c | 72 +++++++++++++++++++++++++++++++++++++++---------------
>   1 file changed, 52 insertions(+), 20 deletions(-)
>
> diff --git a/qemu-img.c b/qemu-img.c
> index 60f4c06487..9a469cd609 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -3513,6 +3513,7 @@ static int img_rebase(int argc, char **argv)
>       uint8_t *buf_new = NULL;
>       BlockDriverState *bs = NULL, *prefix_chain_bs = NULL;
>       BlockDriverState *unfiltered_bs;
> +    BlockDriverInfo bdi = {0};
>       char *filename;
>       const char *fmt, *cache, *src_cache, *out_basefmt, *out_baseimg;
>       int c, flags, src_flags, ret;
> @@ -3646,6 +3647,15 @@ static int img_rebase(int argc, char **argv)
>           }
>       }
>   
> +    /* We need overlay cluster size to make sure write requests are aligned */
> +    ret = bdrv_get_info(unfiltered_bs, &bdi);
> +    if (ret < 0) {
> +        error_report("could not get block driver info");
> +        goto out;
> +    } else if (bdi.cluster_size == 0) {
> +        bdi.cluster_size = 1;
> +    }
> +
>       /* For safe rebasing we need to compare old and new backing file */
>       if (!unsafe) {
>           QDict *options = NULL;
> @@ -3744,6 +3754,7 @@ static int img_rebase(int argc, char **argv)
>           int64_t new_backing_size = 0;
>           uint64_t offset;
>           int64_t n;
> +        int64_t n_old = 0, n_new = 0;
>           float local_progress = 0;
>   
>           buf_old = blk_blockalign(blk_old_backing, IO_BUF_SIZE);
> @@ -3784,7 +3795,7 @@ static int img_rebase(int argc, char **argv)
>           }
>   
>           for (offset = 0; offset < size; offset += n) {
> -            bool buf_old_is_zero = false;
> +            bool old_backing_eof = false;
>   
>               /* How many bytes can we handle with the next read? */
>               n = MIN(IO_BUF_SIZE, size - offset);
> @@ -3829,33 +3840,38 @@ static int img_rebase(int argc, char **argv)
>                   }
>               }
>   
> +            /* At this point n must be aligned to the target cluster size. */
> +            if (offset + n < size) {
> +                assert(n % bdi.cluster_size == 0);

This is not correct.  First, bdrv_is_allocated_above() operates not on 
the top image, but on images in the backing chain, which may have 
different cluster sizes and so may lead to `n`s that are not aligned to 
the top image’s cluster size:

$ ./qemu-img create -f qcow2 base.qcow2 64M
$ ./qemu-img create -f qcow2 -b base.qcow2 -F qcow2 mid.qcow2 64M
$ ./qemu-img create -f qcow2 -o cluster_size=2M -b mid.qcow2 -F qcow2 
top.qcow2 64M
$ ./qemu-io -c 'write 64k 64k' mid.qcow2
$ ./qemu-img rebase -b base.qcow2 top.qcow2
qemu-img: ../qemu-img.c:3845: img_rebase: Assertion `n % 
bdi.cluster_size == 0' failed.
[1]    636690 IOT instruction (core dumped)  ./qemu-img rebase -b 
base.qcow2 top.qcow2

Second, and this is a more theoretical thing, it would also be broken 
for images with cluster sizes greater than IO_BUF_SIZE.  Now, 
IO_BUF_SIZE is 2 MB, which happens to be precisely the maximum cluster 
size we support for qcow2, and for vmdk we always create images with 64 
kB clusters (I believe), but the vmdk code seems happy to open 
pre-existing images with cluster sizes up to 512 MB. Still, even for 
qcow2, we could easily increase the limit from 2 MB at any point, and 
there is no explicit correlation why IO_BUF_SIZE happens to be exactly 
what the current maximum cluster size for qcow2 is.  One way to get 
around this would be to use MAX(IO_BUF_SIZE, bdi.cluster_size) for the 
buffer size, which would give such an explicit correlation.

> +            }
> +
> +            /*
> +             * Much like the with the target image, we'll try to read as much
> +             * of the old and new backings as we can.
> +             */
> +            n_old = MIN(n, MAX(0, old_backing_size - (int64_t) offset));
> +            if (blk_new_backing) {
> +                n_new = MIN(n, MAX(0, new_backing_size - (int64_t) offset));
> +            }
> +
>               /*
>                * Read old and new backing file and take into consideration that
>                * backing files may be smaller than the COW image.
>                */
> -            if (offset >= old_backing_size) {
> -                memset(buf_old, 0, n);
> -                buf_old_is_zero = true;
> +            memset(buf_old + n_old, 0, n - n_old);
> +            if (!n_old) {
> +                old_backing_eof = true;
>               } else {
> -                if (offset + n > old_backing_size) {
> -                    n = old_backing_size - offset;
> -                }
> -
> -                ret = blk_pread(blk_old_backing, offset, n, buf_old, 0);
> +                ret = blk_pread(blk_old_backing, offset, n_old, buf_old, 0);
>                   if (ret < 0) {
>                       error_report("error while reading from old backing file");
>                       goto out;
>                   }
>               }
>   
> -            if (offset >= new_backing_size || !blk_new_backing) {
> -                memset(buf_new, 0, n);
> -            } else {
> -                if (offset + n > new_backing_size) {
> -                    n = new_backing_size - offset;
> -                }
> -
> -                ret = blk_pread(blk_new_backing, offset, n, buf_new, 0);
> +            memset(buf_new + n_new, 0, n - n_new);
> +            if (blk_new_backing && n_new) {
> +                ret = blk_pread(blk_new_backing, offset, n_new, buf_new, 0);
>                   if (ret < 0) {
>                       error_report("error while reading from new backing file");
>                       goto out;
> @@ -3867,15 +3883,28 @@ static int img_rebase(int argc, char **argv)
>   
>               while (written < n) {
>                   int64_t pnum;
> +                int64_t start, end;
>   
>                   if (compare_buffers(buf_old + written, buf_new + written,
>                                       n - written, &pnum))
>                   {
> -                    if (buf_old_is_zero) {
> +                    if (old_backing_eof) {
>                           ret = blk_pwrite_zeroes(blk, offset + written, pnum, 0);
>                       } else {
> -                        ret = blk_pwrite(blk, offset + written, pnum,
> -                                         buf_old + written, 0);
> +                        /*
> +                         * If we've got to this point, it means the cluster
> +                         * we're dealing with is unallocated, and any partial
> +                         * write will cause COW.  To avoid that, we make sure
> +                         * request is aligned to cluster size.
> +                         */
> +                        start = QEMU_ALIGN_DOWN(offset + written,
> +                                                bdi.cluster_size);

Please add an assertion here that `start >= offset`.  I would rather 
have qemu-img crash than to write out-of-bounds memory data to disk.

I understand the idea is that this is given anyway because `offset` 
starts at 0 and we always check that `n`, by which we increment 
`offset`, is aligned, but it is absolutely critical that we don’t do an 
out-of-bounds access, so I feel an explicit assertion here is warranted.

> +                        end = QEMU_ALIGN_UP(offset + written + pnum,
> +                                            bdi.cluster_size);

Similarly here, please assert that `end - offset` this does not exceed 
the buffer’s bounds.  I know the reasoning is the same, we ensured that 
`n` is aligned, so we can always safely align up `written + pnum`, but 
still.

Hanna

> +                        end = end > size ? size : end;
> +                        ret = blk_pwrite(blk, start, end - start,
> +                                         buf_old + (start - offset), 0);
> +                        pnum = end - (offset + written);
>                       }
>                       if (ret < 0) {
>                           error_report("Error while writing to COW image: %s",
> @@ -3885,6 +3914,9 @@ static int img_rebase(int argc, char **argv)
>                   }
>   
>                   written += pnum;
> +                if (offset + written >= old_backing_size) {
> +                    old_backing_eof = true;
> +                }
>               }
>               qemu_progress_print(local_progress, 100);
>           }

Andrey Drobyshev Aug. 29, 2023, 1:27 p.m. UTC | #3

On 8/25/23 18:00, Hanna Czenczek wrote:
> On 01.06.23 21:28, Andrey Drobyshev via wrote:
>> When rebasing an image from one backing file to another, we need to
>> compare data from old and new backings.  If the diff between that data
>> happens to be unaligned to the target cluster size, we might end up
>> doing partial writes, which would lead to copy-on-write and additional
>> IO.
>>
>> Consider the following simple case (virtual_size == cluster_size == 64K):
>>
>> base <-- inc1 <-- inc2
>>
>> qemu-io -c "write -P 0xaa 0 32K" base.qcow2
>> qemu-io -c "write -P 0xcc 32K 32K" base.qcow2
>> qemu-io -c "write -P 0xbb 0 32K" inc1.qcow2
>> qemu-io -c "write -P 0xcc 32K 32K" inc1.qcow2
>> qemu-img rebase -f qcow2 -b base.qcow2 -F qcow2 inc2.qcow2
>>
>> While doing rebase, we'll write a half of the cluster to inc2, and block
>> layer will have to read the 2nd half of the same cluster from the base
>> image
>> inc1 while doing this write operation, although the whole cluster is
>> already
>> read earlier to perform data comparison.
>>
>> In order to avoid these unnecessary IO cycles, let's make sure every
>> write request is aligned to the overlay cluster size.
>>
>> Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
>> ---
>>   qemu-img.c | 72 +++++++++++++++++++++++++++++++++++++++---------------
>>   1 file changed, 52 insertions(+), 20 deletions(-)
>>
>> diff --git a/qemu-img.c b/qemu-img.c
>> index 60f4c06487..9a469cd609 100644
>> --- a/qemu-img.c
>> +++ b/qemu-img.c
>>>> [...]
>>>>               }
>>   +            /* At this point n must be aligned to the target
>> cluster size. */
>> +            if (offset + n < size) {
>> +                assert(n % bdi.cluster_size == 0);
> 
> This is not correct.  First, bdrv_is_allocated_above() operates not on
> the top image, but on images in the backing chain, which may have
> different cluster sizes and so may lead to `n`s that are not aligned to
> the top image’s cluster size:
> 
> $ ./qemu-img create -f qcow2 base.qcow2 64M
> $ ./qemu-img create -f qcow2 -b base.qcow2 -F qcow2 mid.qcow2 64M
> $ ./qemu-img create -f qcow2 -o cluster_size=2M -b mid.qcow2 -F qcow2
> top.qcow2 64M
> $ ./qemu-io -c 'write 64k 64k' mid.qcow2
> $ ./qemu-img rebase -b base.qcow2 top.qcow2
> qemu-img: ../qemu-img.c:3845: img_rebase: Assertion `n %
> bdi.cluster_size == 0' failed.
> [1]    636690 IOT instruction (core dumped)  ./qemu-img rebase -b
> base.qcow2 top.qcow2
> 
> Second, and this is a more theoretical thing, it would also be broken
> for images with cluster sizes greater than IO_BUF_SIZE.  Now,
> IO_BUF_SIZE is 2 MB, which happens to be precisely the maximum cluster
> size we support for qcow2, and for vmdk we always create images with 64
> kB clusters (I believe), but the vmdk code seems happy to open
> pre-existing images with cluster sizes up to 512 MB. Still, even for
> qcow2, we could easily increase the limit from 2 MB at any point, and
> there is no explicit correlation why IO_BUF_SIZE happens to be exactly
> what the current maximum cluster size for qcow2 is.  One way to get
> around this would be to use MAX(IO_BUF_SIZE, bdi.cluster_size) for the
> buffer size, which would give such an explicit correlation.
> 

I'm not sure whether blunt allocation of buffers up to 512M is the right
thing to do.  Since we need our buffers to be equal in size, we'd have
to take MAX(old backing cluster size, new backing cluster size, target
cluster size).  As for potential increase of qcow2 cluster size, I'd
simply increase IO_BUF_SIZE accordingly once it happens.

Overall, your first point is enough to drop simply drop that assert.

However, your remark gave me another idea that there's actually the 3rd
case when it gets broken, and that is images with subclusters, since in
this case bdrv_is_allocated_above() will align n to the subcluster size.
 While looking into what exactly qcow2_co_block_status() reports I
realized that my patch breaks the following:

> qemu-img create -f qcow2 -o cluster_size=1M base.qcow2 1M
> qemu-img create -f qcow2 -b base.qcow2 -F qcow2 -o cluster_size=1M,extended_l2=on inc1.qcow2 1M
> qemu-img create -f qcow2 -b inc1.qcow2 -F qcow2 -o cluster_size=1M inc2.qcow2 1M
> qemu-io -c 'write -P 0xaa 0 32K' -c 'write -P 0xbb 64K 32K' inc1.qcow2
> qemu-img rebase -b base.qcow2 -F qcow2 inc2.qcow2
> qemu-io -c "read -P 0xaa 0 32K" -c "read -P 0xbb 64K 32K" inc2.qcow2
>> read 32768/32768 bytes at offset 0
> 32 KiB, 1 ops; 00.00 sec (78.511 MiB/sec and 2512.3671 ops/sec)
> Pattern verification failed at offset 65536, 32768 bytes
> read 32768/32768 bytes at offset 65536
> 32 KiB, 1 ops; 00.00 sec (490.381 MiB/sec and 15692.1822 ops/sec)

That happens because n_old is bounded by n and we read not too small
data chunk (1st subcluster only).  Since we end up writing whole
clusters to the target anyway, the solution would probably be rounding n
up to the cluster size right after the call to bdrv_is_allocated_above():

>             if (prefix_chain_bs) {
>                 uint64_t bytes = n;
>             ...
>             }
> 
>             n = MIN(QEMU_ALIGN_UP(n, bdi.cluster_size), size - offset);
>
Now, if the target also has subclusters, we might end up allocating more
disk space than necessary (i.e. writing whole cluster instead of several
separate subclusters).  I'm not sure whether we should consider this as
well (aligning n to subcluster size?) or leave it as is keeping in mind
the trade-off between disk space and IO ops.

In any case I'll add the above scenario to the tests.

>
> [...]
>>> +                         */
>> +                        start = QEMU_ALIGN_DOWN(offset + written,
>> +                                                bdi.cluster_size);
> 
> Please add an assertion here that `start >= offset`.  I would rather
> have qemu-img crash than to write out-of-bounds memory data to disk.
> 
> I understand the idea is that this is given anyway because `offset`
> starts at 0 and we always check that `n`, by which we increment
> `offset`, is aligned, but it is absolutely critical that we don’t do an
> out-of-bounds access, so I feel an explicit assertion here is warranted.
> 
>> +                        end = QEMU_ALIGN_UP(offset + written + pnum,
>> +                                            bdi.cluster_size);
> 
> Similarly here, please assert that `end - offset` this does not exceed
> the buffer’s bounds.  I know the reasoning is the same, we ensured that
> `n` is aligned, so we can always safely align up `written + pnum`, but
> still.
> 

Agreed. Smth like:

>                          end = QEMU_ALIGN_UP(offset + written + pnum,
>                                              bdi.cluster_size);
>                          end = end > size ? size : end;
> +                        assert(offset <= start && start < end &&
> +                               end <= offset + IO_BUF_SIZE);
>                          ret = blk_pwrite(blk, start, end - start,
>                                           buf_old + (start - offset),
>                                           write_flags);

diff --git a/qemu-img.c b/qemu-img.c
index 60f4c06487..9a469cd609 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -3513,6 +3513,7 @@  static int img_rebase(int argc, char **argv)
     uint8_t *buf_new = NULL;
     BlockDriverState *bs = NULL, *prefix_chain_bs = NULL;
     BlockDriverState *unfiltered_bs;
+    BlockDriverInfo bdi = {0};
     char *filename;
     const char *fmt, *cache, *src_cache, *out_basefmt, *out_baseimg;
     int c, flags, src_flags, ret;
@@ -3646,6 +3647,15 @@  static int img_rebase(int argc, char **argv)
         }
     }
 
+    /* We need overlay cluster size to make sure write requests are aligned */
+    ret = bdrv_get_info(unfiltered_bs, &bdi);
+    if (ret < 0) {
+        error_report("could not get block driver info");
+        goto out;
+    } else if (bdi.cluster_size == 0) {
+        bdi.cluster_size = 1;
+    }
+
     /* For safe rebasing we need to compare old and new backing file */
     if (!unsafe) {
         QDict *options = NULL;
@@ -3744,6 +3754,7 @@  static int img_rebase(int argc, char **argv)
         int64_t new_backing_size = 0;
         uint64_t offset;
         int64_t n;
+        int64_t n_old = 0, n_new = 0;
         float local_progress = 0;
 
         buf_old = blk_blockalign(blk_old_backing, IO_BUF_SIZE);
@@ -3784,7 +3795,7 @@  static int img_rebase(int argc, char **argv)
         }
 
         for (offset = 0; offset < size; offset += n) {
-            bool buf_old_is_zero = false;
+            bool old_backing_eof = false;
 
             /* How many bytes can we handle with the next read? */
             n = MIN(IO_BUF_SIZE, size - offset);
@@ -3829,33 +3840,38 @@  static int img_rebase(int argc, char **argv)
                 }
             }
 
+            /* At this point n must be aligned to the target cluster size. */
+            if (offset + n < size) {
+                assert(n % bdi.cluster_size == 0);
+            }
+
+            /*
+             * Much like the with the target image, we'll try to read as much
+             * of the old and new backings as we can.
+             */
+            n_old = MIN(n, MAX(0, old_backing_size - (int64_t) offset));
+            if (blk_new_backing) {
+                n_new = MIN(n, MAX(0, new_backing_size - (int64_t) offset));
+            }
+
             /*
              * Read old and new backing file and take into consideration that
              * backing files may be smaller than the COW image.
              */
-            if (offset >= old_backing_size) {
-                memset(buf_old, 0, n);
-                buf_old_is_zero = true;
+            memset(buf_old + n_old, 0, n - n_old);
+            if (!n_old) {
+                old_backing_eof = true;
             } else {
-                if (offset + n > old_backing_size) {
-                    n = old_backing_size - offset;
-                }
-
-                ret = blk_pread(blk_old_backing, offset, n, buf_old, 0);
+                ret = blk_pread(blk_old_backing, offset, n_old, buf_old, 0);
                 if (ret < 0) {
                     error_report("error while reading from old backing file");
                     goto out;
                 }
             }
 
-            if (offset >= new_backing_size || !blk_new_backing) {
-                memset(buf_new, 0, n);
-            } else {
-                if (offset + n > new_backing_size) {
-                    n = new_backing_size - offset;
-                }
-
-                ret = blk_pread(blk_new_backing, offset, n, buf_new, 0);
+            memset(buf_new + n_new, 0, n - n_new);
+            if (blk_new_backing && n_new) {
+                ret = blk_pread(blk_new_backing, offset, n_new, buf_new, 0);
                 if (ret < 0) {
                     error_report("error while reading from new backing file");
                     goto out;
@@ -3867,15 +3883,28 @@  static int img_rebase(int argc, char **argv)
 
             while (written < n) {
                 int64_t pnum;
+                int64_t start, end;
 
                 if (compare_buffers(buf_old + written, buf_new + written,
                                     n - written, &pnum))
                 {
-                    if (buf_old_is_zero) {
+                    if (old_backing_eof) {
                         ret = blk_pwrite_zeroes(blk, offset + written, pnum, 0);
                     } else {
-                        ret = blk_pwrite(blk, offset + written, pnum,
-                                         buf_old + written, 0);
+                        /*
+                         * If we've got to this point, it means the cluster
+                         * we're dealing with is unallocated, and any partial
+                         * write will cause COW.  To avoid that, we make sure
+                         * request is aligned to cluster size.
+                         */
+                        start = QEMU_ALIGN_DOWN(offset + written,
+                                                bdi.cluster_size);
+                        end = QEMU_ALIGN_UP(offset + written + pnum,
+                                            bdi.cluster_size);
+                        end = end > size ? size : end;
+                        ret = blk_pwrite(blk, start, end - start,
+                                         buf_old + (start - offset), 0);
+                        pnum = end - (offset + written);
                     }
                     if (ret < 0) {
                         error_report("Error while writing to COW image: %s",
@@ -3885,6 +3914,9 @@  static int img_rebase(int argc, char **argv)
                 }
 
                 written += pnum;
+                if (offset + written >= old_backing_size) {
+                    old_backing_eof = true;
+                }
             }
             qemu_progress_print(local_progress, 100);
         }

[4/6] qemu-img: rebase: avoid unnecessary COW operations

Commit Message

Comments

Patch