Message ID | 1495186480-114192-2-git-send-email-anton.nefedov@virtuozzo.com (mailing list archive) |
---|---|
State | New, archived |
On 05/19/2017 04:34 AM, Anton Nefedov wrote: > From: "Denis V. Lunev" <den@openvz.org> > > Currently each single write operation can result in 3 write operations > if guest offsets are not cluster aligned. One write is performed for the > real payload and two for COW-ed areas. Thus the data possibly lays > non-contiguously on the host filesystem. This will reduce further > sequential read performance significantly. > > The patch allocates the space in the file with cluster granularity, > ensuring > 1. better host offset locality > 2. less space allocation operations > (which can be expensive on distributed storages) s/storages/storage/ > > Signed-off-by: Denis V. Lunev <den@openvz.org> > Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com> > --- > block/qcow2.c | 32 +++++++++++++++++++++++++++++++- > 1 file changed, 31 insertions(+), 1 deletion(-) > > diff --git a/block/qcow2.c b/block/qcow2.c > index a8d61f0..2e6a0ec 100644 > --- a/block/qcow2.c > +++ b/block/qcow2.c > @@ -1575,6 +1575,32 @@ fail: > return ret; > } > > +static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta) > +{ > + BDRVQcow2State *s = bs->opaque; > + BlockDriverState *file = bs->file->bs; > + QCowL2Meta *m; > + int ret; > + > + for (m = l2meta; m != NULL; m = m->next) { > + uint64_t bytes = m->nb_clusters << s->cluster_bits; > + > + if (m->cow_start.nb_bytes == 0 && m->cow_end.nb_bytes == 0) { > + continue; > + } > + > + /* try to alloc host space in one chunk for better locality */ > + ret = file->drv->bdrv_co_pwrite_zeroes(file, m->alloc_offset, bytes, 0); Are we guaranteed that this is a fast operation? (That is, it either results in a hole or an error, and doesn't waste time tediously writing actual zeroes) > + > + if (ret != 0) { > + continue; > + } Supposing we are using a file system that doesn't support holes, then ret will not be zero, and we ended up not allocating anything after all. Is that a problem that we are just blindly continuing the loop as our reaction to the error? /reads further I guess not - you aren't reacting to any error call, but merely using the side effect that an allocation happened for speed when it worked, and ignoring failure (you get the old behavior of the write() now causing the allocation) when it didn't. > + > + file->total_sectors = MAX(file->total_sectors, > + (m->alloc_offset + bytes) / BDRV_SECTOR_SIZE); > + } > +} > + > static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset, > uint64_t bytes, QEMUIOVector *qiov, > int flags) > @@ -1656,8 +1682,12 @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset, > if (ret < 0) { > goto fail; > } > - > qemu_co_mutex_unlock(&s->lock); > + > + if (bs->file->bs->drv->bdrv_co_pwrite_zeroes != NULL) { > + handle_alloc_space(bs, l2meta); > + } Is it really a good idea to be modifying the underlying protocol image outside of the mutex? At any rate, it looks like your patch is doing a best-effort write zeroes as an attempt to trigger consecutive allocation of the entire cluster in the underlying protocol right after a cluster has been allocated at the qcow2 format layer. Which means there are more syscalls now than there were previously, but now when we do three write() calls at offsets B, A, C, those three calls are into file space that was allocated earlier by the write zeroes, rather than fresh calls into unallocated space that is likely to trigger up to three disjoint allocations. 
As a discussion point, wouldn't we achieve the same effect of less fragmentation if we instead collect our data into a bounce buffer, and only then do a single write() (or more likely, a writev() where the iov is set up to reconstruct a single buffer on the syscall, but where the source data is still at different offsets)? We'd be avoiding the extra syscalls of pre-allocating the cluster, and while our write() call is still causing allocations, at least it is now one cluster-aligned write() rather than three sub-cluster out-of-order write()s.
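For reference, a minimal sketch of the merged-request idea described above — this is not code from the posted series; the helper name, the COW bounce buffers and the assumption that head, payload and tail are contiguous within the newly allocated area are all illustrative:

```c
/*
 * Sketch only: build one iovec covering COW head + guest payload + COW
 * tail and submit it as a single cluster-aligned request to bs->file.
 * cow_start_buf/cow_end_buf are assumed to already hold the data read
 * back for the COW regions of QCowL2Meta *m.
 */
static coroutine_fn int qcow2_write_cluster_merged(BlockDriverState *bs,
                                                   QCowL2Meta *m,
                                                   uint8_t *cow_start_buf,
                                                   uint8_t *cow_end_buf,
                                                   QEMUIOVector *guest_qiov)
{
    QEMUIOVector hd_qiov;
    int ret;

    qemu_iovec_init(&hd_qiov, guest_qiov->niov + 2);

    /* head COW region, guest data, tail COW region, in host order */
    qemu_iovec_add(&hd_qiov, cow_start_buf, m->cow_start.nb_bytes);
    qemu_iovec_concat(&hd_qiov, guest_qiov, 0, guest_qiov->size);
    qemu_iovec_add(&hd_qiov, cow_end_buf, m->cow_end.nb_bytes);

    /* one write instead of three; host allocation stays contiguous */
    ret = bdrv_co_pwritev(bs->file, m->alloc_offset + m->cow_start.offset,
                          hd_qiov.size, &hd_qiov, 0);

    qemu_iovec_destroy(&hd_qiov);
    return ret;
}
```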
On 05/22/2017 10:00 PM, Eric Blake wrote: > On 05/19/2017 04:34 AM, Anton Nefedov wrote: >> From: "Denis V. Lunev" <den@openvz.org> >> >> Currently each single write operation can result in 3 write operations >> if guest offsets are not cluster aligned. One write is performed for the >> real payload and two for COW-ed areas. Thus the data possibly lays >> non-contiguously on the host filesystem. This will reduce further >> sequential read performance significantly. >> >> The patch allocates the space in the file with cluster granularity, >> ensuring >> 1. better host offset locality >> 2. less space allocation operations >> (which can be expensive on distributed storages) > > s/storages/storage/ > Done. >> >> Signed-off-by: Denis V. Lunev <den@openvz.org> >> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com> >> --- >> block/qcow2.c | 32 +++++++++++++++++++++++++++++++- >> 1 file changed, 31 insertions(+), 1 deletion(-) >> > >> diff --git a/block/qcow2.c b/block/qcow2.c >> index a8d61f0..2e6a0ec 100644 >> --- a/block/qcow2.c >> +++ b/block/qcow2.c >> @@ -1575,6 +1575,32 @@ fail: >> return ret; >> } >> >> +static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta) >> +{ >> + BDRVQcow2State *s = bs->opaque; >> + BlockDriverState *file = bs->file->bs; >> + QCowL2Meta *m; >> + int ret; >> + >> + for (m = l2meta; m != NULL; m = m->next) { >> + uint64_t bytes = m->nb_clusters << s->cluster_bits; >> + >> + if (m->cow_start.nb_bytes == 0 && m->cow_end.nb_bytes == 0) { >> + continue; >> + } >> + >> + /* try to alloc host space in one chunk for better locality */ >> + ret = file->drv->bdrv_co_pwrite_zeroes(file, m->alloc_offset, bytes, 0); > > Are we guaranteed that this is a fast operation? (That is, it either > results in a hole or an error, and doesn't waste time tediously writing > actual zeroes) > well, block_int.h reads: /* * Efficiently zero a region of the disk image. Typically an image format * would use a compact metadata representation to implement this. This * function pointer may be NULL or return -ENOSUP and .bdrv_co_writev() * will be called instead. */ int coroutine_fn (*bdrv_co_pwrite_zeroes)(BlockDriverState *bs, int64_t offset, int count, BdrvRequestFlags flags); (and that's why the driver function is used directly, bypassing the 'safe' bdrv interface that would try to write zeroes no matter the cost) As far as I checked the drivers mostly follow the idea >> + >> + if (ret != 0) { >> + continue; >> + } > > Supposing we are using a file system that doesn't support holes, then > ret will not be zero, and we ended up not allocating anything after all. > Is that a problem that we are just blindly continuing the loop as our > reaction to the error? > > /reads further > > I guess not - you aren't reacting to any error call, but merely using > the side effect that an allocation happened for speed when it worked, > and ignoring failure (you get the old behavior of the write() now > causing the allocation) when it didn't. 
> yes, exactly >> + >> + file->total_sectors = MAX(file->total_sectors, >> + (m->alloc_offset + bytes) / BDRV_SECTOR_SIZE); >> + } >> +} >> + >> static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset, >> uint64_t bytes, QEMUIOVector *qiov, >> int flags) >> @@ -1656,8 +1682,12 @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset, >> if (ret < 0) { >> goto fail; >> } >> - >> qemu_co_mutex_unlock(&s->lock); >> + >> + if (bs->file->bs->drv->bdrv_co_pwrite_zeroes != NULL) { >> + handle_alloc_space(bs, l2meta); >> + } > > Is it really a good idea to be modifying the underlying protocol image > outside of the mutex? > as far as I understand, qcow2 usually modifies the underlying image outside of the mutex? I guess it's qcow2 metadata that we wouldn't want to touch unlocked > At any rate, it looks like your patch is doing a best-effort write > zeroes as an attempt to trigger consecutive allocation of the entire > cluster in the underlying protocol right after a cluster has been > allocated at the qcow2 format layer. Which means there are more > syscalls now than there were previously, but now when we do three > write() calls at offsets B, A, C, those three calls are into file space > that was allocated earlier by the write zeroes, rather than fresh calls > into unallocated space that is likely to trigger up to three disjoint > allocations. > > As a discussion point, wouldn't we achieve the same effect of less > fragmentation if we instead collect our data into a bounce buffer, and > only then do a single write() (or more likely, a writev() where the iov > is set up to reconstruct a single buffer on the syscall, but where the > source data is still at different offsets)? We'd be avoiding the extra > syscalls of pre-allocating the cluster, and while our write() call is > still causing allocations, at least it is now one cluster-aligned > write() rather than three sub-cluster out-of-order write()s. > I think yes we would achieve the same effect of less fragmentation; but pre-zeroing also makes the following patch possible (to skip COW if there is no backing data) I have follow-up patches which merge initial data and COW padding into a single writev(). After those it should become reasonable to skip cluster pre-zeroing (for cases when there is backing data). /Anton
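A rough illustration of the "skip COW when there is no backing data" direction mentioned above; the helper and its name are hypothetical, not from the posted series, and a complete version would also need to query the allocation status of the backing chain rather than only its length:

```c
/*
 * Illustrative only: once the whole cluster has been pre-zeroed, the COW
 * read/write can be skipped whenever the backing chain cannot contribute
 * data for the copied region, e.g. when there is no backing file at all
 * or the region lies beyond its end.
 */
static bool cow_region_needs_copy(BlockDriverState *bs,
                                  uint64_t guest_offset, uint64_t bytes)
{
    int64_t backing_len;

    if (bytes == 0 || bs->backing == NULL) {
        return false;               /* nothing to copy, reads as zeroes */
    }

    backing_len = bdrv_getlength(bs->backing->bs);
    if (backing_len >= 0 && guest_offset >= (uint64_t)backing_len) {
        return false;               /* beyond EOF of the backing file */
    }

    return true;                    /* backing data may exist: must COW */
}
```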
On 05/22/2017 10:00 PM, Eric Blake wrote: > On 05/19/2017 04:34 AM, Anton Nefedov wrote: >> From: "Denis V. Lunev" <den@openvz.org> >> >> Currently each single write operation can result in 3 write operations >> if guest offsets are not cluster aligned. One write is performed for the >> real payload and two for COW-ed areas. Thus the data possibly lays >> non-contiguously on the host filesystem. This will reduce further >> sequential read performance significantly. >> >> The patch allocates the space in the file with cluster granularity, >> ensuring >> 1. better host offset locality >> 2. less space allocation operations >> (which can be expensive on distributed storages) > s/storages/storage/ > >> Signed-off-by: Denis V. Lunev <den@openvz.org> >> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com> >> --- >> block/qcow2.c | 32 +++++++++++++++++++++++++++++++- >> 1 file changed, 31 insertions(+), 1 deletion(-) >> >> diff --git a/block/qcow2.c b/block/qcow2.c >> index a8d61f0..2e6a0ec 100644 >> --- a/block/qcow2.c >> +++ b/block/qcow2.c >> @@ -1575,6 +1575,32 @@ fail: >> return ret; >> } >> >> +static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta) >> +{ >> + BDRVQcow2State *s = bs->opaque; >> + BlockDriverState *file = bs->file->bs; >> + QCowL2Meta *m; >> + int ret; >> + >> + for (m = l2meta; m != NULL; m = m->next) { >> + uint64_t bytes = m->nb_clusters << s->cluster_bits; >> + >> + if (m->cow_start.nb_bytes == 0 && m->cow_end.nb_bytes == 0) { >> + continue; >> + } >> + >> + /* try to alloc host space in one chunk for better locality */ >> + ret = file->drv->bdrv_co_pwrite_zeroes(file, m->alloc_offset, bytes, 0); > Are we guaranteed that this is a fast operation? (That is, it either > results in a hole or an error, and doesn't waste time tediously writing > actual zeroes) > This is why we call driver directly. We expect that replacing with actualy zeroes write happens only in generic function. >> + >> + if (ret != 0) { >> + continue; >> + } > Supposing we are using a file system that doesn't support holes, then > ret will not be zero, and we ended up not allocating anything after all. > Is that a problem that we are just blindly continuing the loop as our > reaction to the error? > > /reads further > > I guess not - you aren't reacting to any error call, but merely using > the side effect that an allocation happened for speed when it worked, > and ignoring failure (you get the old behavior of the write() now > causing the allocation) when it didn't. good point >> + >> + file->total_sectors = MAX(file->total_sectors, >> + (m->alloc_offset + bytes) / BDRV_SECTOR_SIZE); >> + } >> +} >> + >> static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset, >> uint64_t bytes, QEMUIOVector *qiov, >> int flags) >> @@ -1656,8 +1682,12 @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset, >> if (ret < 0) { >> goto fail; >> } >> - >> qemu_co_mutex_unlock(&s->lock); >> + >> + if (bs->file->bs->drv->bdrv_co_pwrite_zeroes != NULL) { >> + handle_alloc_space(bs, l2meta); >> + } > Is it really a good idea to be modifying the underlying protocol image > outside of the mutex? yes! This is by design, not accidental. The area is protected by the meta and thus writes to not allocated areas will pass. > At any rate, it looks like your patch is doing a best-effort write > zeroes as an attempt to trigger consecutive allocation of the entire > cluster in the underlying protocol right after a cluster has been > allocated at the qcow2 format layer. 
Which means there are more > syscalls now than there were previously, but now when we do three > write() calls at offsets B, A, C, those three calls are into file space > that was allocated earlier by the write zeroes, rather than fresh calls > into unallocated space that is likely to trigger up to three disjoint > allocations. > > As a discussion point, wouldn't we achieve the same effect of less > fragmentation if we instead collect our data into a bounce buffer, and > only then do a single write() (or more likely, a writev() where the iov > is set up to reconstruct a single buffer on the syscall, but where the > source data is still at different offsets)? We'd be avoiding the extra > syscalls of pre-allocating the cluster, and while our write() call is > still causing allocations, at least it is now one cluster-aligned > write() rather than three sub-cluster out-of-order write()s. >
The bigger the cluster gets, the more significant the difference becomes. Actually we have done that already in the original patchset, but those changes are a bit controversial in terms of performance: they work much better on SSD and worse on HDD. Maybe we have done something wrong, but only simple changes that are unquestionable in terms of performance are included here.
Den
Am 19.05.2017 um 11:34 hat Anton Nefedov geschrieben: > From: "Denis V. Lunev" <den@openvz.org> > > Currently each single write operation can result in 3 write operations > if guest offsets are not cluster aligned. One write is performed for the > real payload and two for COW-ed areas. Thus the data possibly lays > non-contiguously on the host filesystem. This will reduce further > sequential read performance significantly. > > The patch allocates the space in the file with cluster granularity, > ensuring > 1. better host offset locality > 2. less space allocation operations > (which can be expensive on distributed storages) > > Signed-off-by: Denis V. Lunev <den@openvz.org> > Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com> I don't think this is the right approach. You end up with two write operations: One write_zeroes and then one data write. If the backend actually supports efficient zero writes, write_zeroes won't necessarily allocate space, but writing data will definitely split the already existing allocation. If anything, we need a new bdrv_allocate() or something that would call fallocate() instead of abusing write_zeroes. It seems much clearer to me that simply unifying the three write requests into a single one is an improvement. And it's easy to do, I even had a patch once to do this. The reason that I didn't send it was that it seemed to conflict with the data cache approach > block/qcow2.c | 32 +++++++++++++++++++++++++++++++- > 1 file changed, 31 insertions(+), 1 deletion(-) > > diff --git a/block/qcow2.c b/block/qcow2.c > index a8d61f0..2e6a0ec 100644 > --- a/block/qcow2.c > +++ b/block/qcow2.c > @@ -1575,6 +1575,32 @@ fail: > return ret; > } > > +static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta) > +{ > + BDRVQcow2State *s = bs->opaque; > + BlockDriverState *file = bs->file->bs; > + QCowL2Meta *m; > + int ret; > + > + for (m = l2meta; m != NULL; m = m->next) { > + uint64_t bytes = m->nb_clusters << s->cluster_bits; > + > + if (m->cow_start.nb_bytes == 0 && m->cow_end.nb_bytes == 0) { > + continue; > + } > + > + /* try to alloc host space in one chunk for better locality */ > + ret = file->drv->bdrv_co_pwrite_zeroes(file, m->alloc_offset, bytes, 0); No. This is what you bypass: * All sanity checks that the block layer does * bdrv_inc/dec_in_flight(), which is required for drain to work correctly. Not doing this will cause crashes. * tracked_request_begin/end(), mark_request_serialising() and wait_serialising_requests(), which are required for serialising requests to work correctly * Ensuring correct request alignment for file. This means that e.g. qcow2 with cluster size 512 on a host with a 4k native disk will break. * blkdebug events * before_write_notifiers. Not calling these will cause corrupted backups if someone backups file. * Dirty bitmap updates * Updating write_gen, wr_highest_offset and total_sectors * Ensuring that bl.max_pwrite_zeroes and bl.pwrite_zeroes_alignment are respected And these are just the obvious things. I'm sure I missed some. > + if (ret != 0) { > + continue; > + } > + > + file->total_sectors = MAX(file->total_sectors, > + (m->alloc_offset + bytes) / BDRV_SECTOR_SIZE); You only compensate for part of a single item in the list above, by duplicating code with block/io.c. This is not how to do things. 
As I said above, I think you don't really want write_zeroes anyway, but if you wanted a write_zeroes "but only if it's efficient" (which I'm not sure is a good thing to want), then a better way might be introducing a new request flag. > + } > +} > + > static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset, > uint64_t bytes, QEMUIOVector *qiov, > int flags) Kevin
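For illustration, the protocol-level piece of the bdrv_allocate() idea could be as small as a Linux fallocate() call that reserves space without writing data and without growing the visible file size. The function name and its placement are assumptions, not existing QEMU code:

```c
#define _GNU_SOURCE          /* for fallocate() and FALLOC_FL_KEEP_SIZE */
#include <fcntl.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical file-posix helper: reserve space for a cluster-sized range. */
static int raw_do_allocate(int fd, int64_t offset, int64_t bytes)
{
    if (fallocate(fd, FALLOC_FL_KEEP_SIZE, offset, bytes) == 0) {
        return 0;
    }
    /* e.g. EOPNOTSUPP on filesystems without fallocate support;
     * the caller would simply fall back to plain data writes */
    return -errno;
}
```

The point of reaching for fallocate() here is that it reserves blocks instead of writing zeroes, which is closer to what the qcow2 layer actually wants from this operation.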
On 05/26/2017 11:11 AM, Kevin Wolf wrote: > Am 19.05.2017 um 11:34 hat Anton Nefedov geschrieben: >> From: "Denis V. Lunev" <den@openvz.org> >> >> Currently each single write operation can result in 3 write operations >> if guest offsets are not cluster aligned. One write is performed for the >> real payload and two for COW-ed areas. Thus the data possibly lays >> non-contiguously on the host filesystem. This will reduce further >> sequential read performance significantly. >> >> The patch allocates the space in the file with cluster granularity, >> ensuring >> 1. better host offset locality >> 2. less space allocation operations >> (which can be expensive on distributed storages) >> >> Signed-off-by: Denis V. Lunev <den@openvz.org> >> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com> > I don't think this is the right approach. You end up with two > write operations: One write_zeroes and then one data write. If the > backend actually supports efficient zero writes, write_zeroes won't > necessarily allocate space, but writing data will definitely split > the already existing allocation. If anything, we need a new > bdrv_allocate() or something that would call fallocate() instead of > abusing write_zeroes. great idea. Very nice then. > It seems much clearer to me that simply unifying the three write > requests into a single one is an improvement. And it's easy to do, I > even had a patch once to do this. The reason that I didn't send it was > that it seemed to conflict with the data cache approach These changes help a lot on 2 patterns: - writing 4Kb into the middle of the block. The bigger the block size, more we gain - sequential write, which is not sector aligned and comes in 2 requests: we perform allocation, which should be significantly faster than actual write and after that both parts of the write can be executed in parallel. At my opinion both cases are frequent and important. But OK, the code should be improved, you are absolutely right here. >> block/qcow2.c | 32 +++++++++++++++++++++++++++++++- >> 1 file changed, 31 insertions(+), 1 deletion(-) >> >> diff --git a/block/qcow2.c b/block/qcow2.c >> index a8d61f0..2e6a0ec 100644 >> --- a/block/qcow2.c >> +++ b/block/qcow2.c >> @@ -1575,6 +1575,32 @@ fail: >> return ret; >> } >> >> +static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta) >> +{ >> + BDRVQcow2State *s = bs->opaque; >> + BlockDriverState *file = bs->file->bs; >> + QCowL2Meta *m; >> + int ret; >> + >> + for (m = l2meta; m != NULL; m = m->next) { >> + uint64_t bytes = m->nb_clusters << s->cluster_bits; >> + >> + if (m->cow_start.nb_bytes == 0 && m->cow_end.nb_bytes == 0) { >> + continue; >> + } >> + >> + /* try to alloc host space in one chunk for better locality */ >> + ret = file->drv->bdrv_co_pwrite_zeroes(file, m->alloc_offset, bytes, 0); > No. This is what you bypass: > > * All sanity checks that the block layer does > > * bdrv_inc/dec_in_flight(), which is required for drain to work > correctly. Not doing this will cause crashes. > > * tracked_request_begin/end(), mark_request_serialising() and > wait_serialising_requests(), which are required for serialising > requests to work correctly > > * Ensuring correct request alignment for file. This means that e.g. > qcow2 with cluster size 512 on a host with a 4k native disk will > break. > > * blkdebug events > > * before_write_notifiers. Not calling these will cause corrupted backups > if someone backups file. 
> > * Dirty bitmap updates > > * Updating write_gen, wr_highest_offset and total_sectors > > * Ensuring that bl.max_pwrite_zeroes and bl.pwrite_zeroes_alignment are > respected > > And these are just the obvious things. I'm sure I missed some. > You seem to be right; I had not thought about it from this angle. >> + if (ret != 0) { >> + continue; >> + } >> + >> + file->total_sectors = MAX(file->total_sectors, >> + (m->alloc_offset + bytes) / BDRV_SECTOR_SIZE); > You only compensate for part of a single item in the list above, by > duplicating code with block/io.c. This is not how to do things. > > As I said above, I think you don't really want write_zeroes anyway, but > if you wanted a write_zeroes "but only if it's efficient" (which I'm not > sure is a good thing to want), then a better way might be introducing a > new request flag. > >> + } >> +} >> + >> static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset, >> uint64_t bytes, QEMUIOVector *qiov, >> int flags) > Kevin Thank you for the review and the ideas ;) Den
On 05/26/2017 11:57 AM, Denis V. Lunev wrote: > On 05/26/2017 11:11 AM, Kevin Wolf wrote: >> Am 19.05.2017 um 11:34 hat Anton Nefedov geschrieben: >>> From: "Denis V. Lunev" <den@openvz.org> >>> >>> Currently each single write operation can result in 3 write operations >>> if guest offsets are not cluster aligned. One write is performed for the >>> real payload and two for COW-ed areas. Thus the data possibly lays >>> non-contiguously on the host filesystem. This will reduce further >>> sequential read performance significantly. >>> >>> The patch allocates the space in the file with cluster granularity, >>> ensuring >>> 1. better host offset locality >>> 2. less space allocation operations >>> (which can be expensive on distributed storages) >>> >>> Signed-off-by: Denis V. Lunev <den@openvz.org> >>> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com> >> I don't think this is the right approach. You end up with two >> write operations: One write_zeroes and then one data write. If the >> backend actually supports efficient zero writes, write_zeroes won't >> necessarily allocate space, but writing data will definitely split >> the already existing allocation. If anything, we need a new >> bdrv_allocate() or something that would call fallocate() instead of >> abusing write_zeroes. > great idea. Very nice then. > hi Kevin, thanks a lot for your comments. The driver's callback is commented as: /* * Efficiently zero a region of the disk image. Typically an image format * would use a compact metadata representation to implement this. This * function pointer may be NULL or return -ENOSUP and .bdrv_co_writev() * will be called instead. */ int coroutine_fn (*bdrv_co_pwrite_zeroes)(BlockDriverState *bs, int64_t offset, int count, BdrvRequestFlags flags); so it looks like one may expect either efficiency or ENOTSUP from that one? but not necessarily from the safe "facade" bdrv_co_pwrite_zeroes() which falls back to pwritev() /* * Efficiently zero a region of the disk image. Note that this is a regular * I/O request like read or write and should have a reasonable size. This * function is not suitable for zeroing the entire image in a single request * because it may allocate memory for the entire region. */ int coroutine_fn bdrv_co_pwrite_zeroes(BdrvChild *child, int64_t offset, int count, BdrvRequestFlags flags); (maybe we should have even more explicit comment here) Then let's indeed create a new bdrv_allocate(), which will result in bdrv_co_pwrite_zeroes(flags|=ZERO_EFFICIENTLY), so we'll reuse most of the code and employ the same driver callback, but return ENOTSUP if it fails? >> It seems much clearer to me that simply unifying the three write >> requests into a single one is an improvement. And it's easy to do, I >> even had a patch once to do this. The reason that I didn't send it was >> that it seemed to conflict with the data cache approach > These changes help a lot on 2 patterns: > - writing 4Kb into the middle of the block. The bigger the block size, > more we gain > - sequential write, which is not sector aligned and comes in 2 requests: > we perform allocation, which should be significantly faster than actual > write and after that both parts of the write can be executed in parallel. > At my opinion both cases are frequent and important. > > But OK, the code should be improved, you are absolutely right here. 
> >>> block/qcow2.c | 32 +++++++++++++++++++++++++++++++- >>> 1 file changed, 31 insertions(+), 1 deletion(-) >>> >>> diff --git a/block/qcow2.c b/block/qcow2.c >>> index a8d61f0..2e6a0ec 100644 >>> --- a/block/qcow2.c >>> +++ b/block/qcow2.c >>> @@ -1575,6 +1575,32 @@ fail: >>> return ret; >>> } >>> >>> +static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta) >>> +{ >>> + BDRVQcow2State *s = bs->opaque; >>> + BlockDriverState *file = bs->file->bs; >>> + QCowL2Meta *m; >>> + int ret; >>> + >>> + for (m = l2meta; m != NULL; m = m->next) { >>> + uint64_t bytes = m->nb_clusters << s->cluster_bits; >>> + >>> + if (m->cow_start.nb_bytes == 0 && m->cow_end.nb_bytes == 0) { >>> + continue; >>> + } >>> + >>> + /* try to alloc host space in one chunk for better locality */ >>> + ret = file->drv->bdrv_co_pwrite_zeroes(file, m->alloc_offset, bytes, 0); >> No. This is what you bypass: >> >> * All sanity checks that the block layer does >> >> * bdrv_inc/dec_in_flight(), which is required for drain to work >> correctly. Not doing this will cause crashes. >> >> * tracked_request_begin/end(), mark_request_serialising() and >> wait_serialising_requests(), which are required for serialising >> requests to work correctly >> >> * Ensuring correct request alignment for file. This means that e.g. >> qcow2 with cluster size 512 on a host with a 4k native disk will >> break. >> >> * blkdebug events >> >> * before_write_notifiers. Not calling these will cause corrupted backups >> if someone backups file. >> >> * Dirty bitmap updates >> >> * Updating write_gen, wr_highest_offset and total_sectors >> >> * Ensuring that bl.max_pwrite_zeroes and bl.pwrite_zeroes_alignment are >> respected >> >> And these are just the obvious things. I'm sure I missed some. >> > > You seems right. I have not though about that from this angle. > >>> + if (ret != 0) { >>> + continue; >>> + } >>> + >>> + file->total_sectors = MAX(file->total_sectors, >>> + (m->alloc_offset + bytes) / BDRV_SECTOR_SIZE); >> You only compensate for part of a single item in the list above, by >> duplicating code with block/io.c. This is not how to do things. >> >> As I said above, I think you don't really want write_zeroes anyway, but >> if you wanted a write_zeroes "but only if it's efficient" (which I'm not >> sure is a good thing to want), then a better way might be introducing a >> new request flag. >> >>> + } >>> +} >>> + >>> static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset, >>> uint64_t bytes, QEMUIOVector *qiov, >>> int flags) >> Kevin > > Thank you for review and ideas ;) > > Den > /Anton
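A sketch of how the flag-based variant discussed above might look from the qcow2 side; the flag name BDRV_REQ_ALLOCATE_ONLY, the chosen bit and the helper are purely illustrative, not existing QEMU API:

```c
/*
 * Illustrative only: a request flag that turns bdrv_co_pwrite_zeroes()
 * into "zero this range, but only if the driver can do it efficiently".
 * The generic layer would return -ENOTSUP instead of falling back to a
 * bounce-buffer write of literal zeroes when this flag is set.
 */
#define BDRV_REQ_ALLOCATE_ONLY (1 << 6)   /* assumed free bit */

static coroutine_fn int qcow2_try_prealloc(BlockDriverState *bs,
                                           QCowL2Meta *m)
{
    BDRVQcow2State *s = bs->opaque;
    uint64_t bytes = m->nb_clusters << s->cluster_bits;

    /*
     * Going through bs->file (the BdrvChild) instead of the driver
     * callback keeps request tracking, draining, alignment handling and
     * the other block-layer services that the original patch bypassed.
     */
    return bdrv_co_pwrite_zeroes(bs->file, m->alloc_offset, bytes,
                                 BDRV_REQ_ALLOCATE_ONLY);
}
```

-ENOTSUP from such a helper would simply mean "no cheap allocation available", and the caller would carry on with ordinary data writes, exactly as handle_alloc_space() does today when the driver call fails.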
On 05/26/2017 11:11 AM, Kevin Wolf wrote:
>
> No. This is what you bypass:
some analysis for the record.
> * All sanity checks that the block layer does
>
> * bdrv_inc/dec_in_flight(), which is required for drain to work
> correctly. Not doing this will cause crashes.
Should not be a problem. We are inside a request that is already accounted for at the parent; drain waits for the parent first and only then switches to the child.
> * tracked_request_begin/end(), mark_request_serialising() and
> wait_serialising_requests(), which are required for serialising
> requests to work correctly
Should not cause a problem, as we are protected by the metadata and this area is not exposed to read operations until the request is fully complete. But yes, not that good.
> * Ensuring correct request alignment for file. This means that e.g.
> qcow2 with cluster size 512 on a host with a 4k native disk will
> break.
We have already checked that the clusters are aligned.
> * blkdebug events
yes
> * before_write_notifiers. Not calling these will cause corrupted backups
> if someone backups file.
No, as there is no user-visible data here; we are writing zeroes where the guest already sees zeroes.
> * Dirty bitmap updates
Same: from the guest's point of view the data is not changed. It is even good that we do not mark these areas as dirty.
> * Updating write_gen, wr_highest_offset and total_sectors
write_gen is not a problem, it will be set on the actual write. total_sectors is changed manually; agreed that this is not really elegant.
> * Ensuring that bl.max_pwrite_zeroes and bl.pwrite_zeroes_alignment are
> respected
yes
> And these are just the obvious things. I'm sure I missed some.
Anyway, this is a good list of things to take into account within bdrv_allocate(). But it is important that, from the guest's point of view, the content of the COW areas is not changed, and thus we can take some important shortcuts.
Den
Am 26.05.2017 um 10:57 hat Denis V. Lunev geschrieben: > On 05/26/2017 11:11 AM, Kevin Wolf wrote: > > It seems much clearer to me that simply unifying the three write > > requests into a single one is an improvement. And it's easy to do, I > > even had a patch once to do this. The reason that I didn't send it was > > that it seemed to conflict with the data cache approach > These changes help a lot on 2 patterns: > - writing 4Kb into the middle of the block. The bigger the block size, > more we gain > - sequential write, which is not sector aligned and comes in 2 requests: > we perform allocation, which should be significantly faster than actual > write and after that both parts of the write can be executed in parallel. > At my opinion both cases are frequent and important. Both cases are helped somewhat by reducing the number of I/O requests. So as a first step, I really think we need to get Berto's patches ready to be merged. As for this series, there is one important restriction that you don't mention: Both cases only benefit from the changes when there is no backing file (or the backing file is sparse at the given cluster). The main reason why people use qcow2 is for snapshots, so qcow2 without a backing file, or with an empty one, is actually not my primary concern. To be honest, I am a bit concerned by the additional complexity brought in by this series (which at first sight also seemed to be tacked on rather than fully integrated), but I'll have to study it in more detail. Maybe another thing to keep in mind is that the sequential write case can completely avoid COW in theory. I wonder whether we should make another attempt at this. Previously, it was the complexity of my own prototypes that stopped me from actually merging them. Kevin
Am 26.05.2017 um 12:57 hat Denis V. Lunev geschrieben: > On 05/26/2017 11:11 AM, Kevin Wolf wrote: > > > > No. This is what you bypass: > some analysis for the record. > [...] > anyway, this is a good list of things to take into account > within bdrv_allocate. But it is important, that from the > guest point of view the content of COW areas is not > changed and thus we can have some important shortcuts. Yes, not everything in this list is an actual problem for this specific case. Some things are, though, and if you have to make an analysis like this, that's already a sign that something is wrong. The real point I wanted to make is that the bdrv_*() wrappers do a whole lot of useful things (half of which everyone, including me, would probably forget when listing them without looking at the code) - and on top of this, when making changes to the block layer, we have an expectation that every request consistently goes through the block/io.c functions. So not wanting one of these useful things in a special case isn't a good enough reason to completely bypass them. If we really need to avoid some part of a function, we can always add another flag or something that makes that part optional. Kevin
```diff
diff --git a/block/qcow2.c b/block/qcow2.c
index a8d61f0..2e6a0ec 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1575,6 +1575,32 @@ fail:
     return ret;
 }
 
+static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
+{
+    BDRVQcow2State *s = bs->opaque;
+    BlockDriverState *file = bs->file->bs;
+    QCowL2Meta *m;
+    int ret;
+
+    for (m = l2meta; m != NULL; m = m->next) {
+        uint64_t bytes = m->nb_clusters << s->cluster_bits;
+
+        if (m->cow_start.nb_bytes == 0 && m->cow_end.nb_bytes == 0) {
+            continue;
+        }
+
+        /* try to alloc host space in one chunk for better locality */
+        ret = file->drv->bdrv_co_pwrite_zeroes(file, m->alloc_offset, bytes, 0);
+
+        if (ret != 0) {
+            continue;
+        }
+
+        file->total_sectors = MAX(file->total_sectors,
+                                  (m->alloc_offset + bytes) / BDRV_SECTOR_SIZE);
+    }
+}
+
 static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
                                          uint64_t bytes, QEMUIOVector *qiov,
                                          int flags)
@@ -1656,8 +1682,12 @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
         if (ret < 0) {
             goto fail;
         }
-
         qemu_co_mutex_unlock(&s->lock);
+
+        if (bs->file->bs->drv->bdrv_co_pwrite_zeroes != NULL) {
+            handle_alloc_space(bs, l2meta);
+        }
+
         BLKDBG_EVENT(bs->file, BLKDBG_WRITE_AIO);
         trace_qcow2_writev_data(qemu_coroutine_self(),
                                 cluster_offset + offset_in_cluster);
```