Message ID | 1495186480-114192-2-git-send-email-anton.nefedov@virtuozzo.com (mailing list archive) |
---|---|
State | New, archived |
On 05/19/2017 04:34 AM, Anton Nefedov wrote: > From: "Denis V. Lunev" <den@openvz.org> > > Currently each single write operation can result in 3 write operations > if guest offsets are not cluster aligned. One write is performed for the > real payload and two for COW-ed areas. Thus the data possibly lays > non-contiguously on the host filesystem. This will reduce further > sequential read performance significantly. > > The patch allocates the space in the file with cluster granularity, > ensuring > 1. better host offset locality > 2. less space allocation operations > (which can be expensive on distributed storages) s/storages/storage/ > > Signed-off-by: Denis V. Lunev <den@openvz.org> > Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com> > --- > block/qcow2.c | 32 +++++++++++++++++++++++++++++++- > 1 file changed, 31 insertions(+), 1 deletion(-) > > diff --git a/block/qcow2.c b/block/qcow2.c > index a8d61f0..2e6a0ec 100644 > --- a/block/qcow2.c > +++ b/block/qcow2.c > @@ -1575,6 +1575,32 @@ fail: > return ret; > } > > +static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta) > +{ > + BDRVQcow2State *s = bs->opaque; > + BlockDriverState *file = bs->file->bs; > + QCowL2Meta *m; > + int ret; > + > + for (m = l2meta; m != NULL; m = m->next) { > + uint64_t bytes = m->nb_clusters << s->cluster_bits; > + > + if (m->cow_start.nb_bytes == 0 && m->cow_end.nb_bytes == 0) { > + continue; > + } > + > + /* try to alloc host space in one chunk for better locality */ > + ret = file->drv->bdrv_co_pwrite_zeroes(file, m->alloc_offset, bytes, 0); Are we guaranteed that this is a fast operation? (That is, it either results in a hole or an error, and doesn't waste time tediously writing actual zeroes) > + > + if (ret != 0) { > + continue; > + } Supposing we are using a file system that doesn't support holes, then ret will not be zero, and we ended up not allocating anything after all. Is that a problem that we are just blindly continuing the loop as our reaction to the error? /reads further I guess not - you aren't reacting to any error call, but merely using the side effect that an allocation happened for speed when it worked, and ignoring failure (you get the old behavior of the write() now causing the allocation) when it didn't. > + > + file->total_sectors = MAX(file->total_sectors, > + (m->alloc_offset + bytes) / BDRV_SECTOR_SIZE); > + } > +} > + > static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset, > uint64_t bytes, QEMUIOVector *qiov, > int flags) > @@ -1656,8 +1682,12 @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset, > if (ret < 0) { > goto fail; > } > - > qemu_co_mutex_unlock(&s->lock); > + > + if (bs->file->bs->drv->bdrv_co_pwrite_zeroes != NULL) { > + handle_alloc_space(bs, l2meta); > + } Is it really a good idea to be modifying the underlying protocol image outside of the mutex? At any rate, it looks like your patch is doing a best-effort write zeroes as an attempt to trigger consecutive allocation of the entire cluster in the underlying protocol right after a cluster has been allocated at the qcow2 format layer. Which means there are more syscalls now than there were previously, but now when we do three write() calls at offsets B, A, C, those three calls are into file space that was allocated earlier by the write zeroes, rather than fresh calls into unallocated space that is likely to trigger up to three disjoint allocations. 
As a discussion point, wouldn't we achieve the same effect of less fragmentation if we instead collect our data into a bounce buffer, and only then do a single write() (or more likely, a writev() where the iov is set up to reconstruct a single buffer on the syscall, but where the source data is still at different offsets)? We'd be avoiding the extra syscalls of pre-allocating the cluster, and while our write() call is still causing allocations, at least it is now one cluster-aligned write() rather than three sub-cluster out-of-order write()s.
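For reference, a minimal sketch of the merged-request idea described above — this is not code from the posted series; the helper name, the COW bounce buffers and the assumption that head, payload and tail are contiguous within the newly allocated area are all illustrative:

```c
/*
 * Sketch only: build one iovec covering COW head + guest payload + COW
 * tail and submit it as a single cluster-aligned request to bs->file.
 * cow_start_buf/cow_end_buf are assumed to already hold the data read
 * back for the COW regions of QCowL2Meta *m.
 */
static coroutine_fn int qcow2_write_cluster_merged(BlockDriverState *bs,
                                                   QCowL2Meta *m,
                                                   uint8_t *cow_start_buf,
                                                   uint8_t *cow_end_buf,
                                                   QEMUIOVector *guest_qiov)
{
    QEMUIOVector hd_qiov;
    int ret;

    qemu_iovec_init(&hd_qiov, guest_qiov->niov + 2);

    /* head COW region, guest data, tail COW region, in host order */
    qemu_iovec_add(&hd_qiov, cow_start_buf, m->cow_start.nb_bytes);
    qemu_iovec_concat(&hd_qiov, guest_qiov, 0, guest_qiov->size);
    qemu_iovec_add(&hd_qiov, cow_end_buf, m->cow_end.nb_bytes);

    /* one write instead of three; host allocation stays contiguous */
    ret = bdrv_co_pwritev(bs->file, m->alloc_offset + m->cow_start.offset,
                          hd_qiov.size, &hd_qiov, 0);

    qemu_iovec_destroy(&hd_qiov);
    return ret;
}
```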
On 05/22/2017 10:00 PM, Eric Blake wrote: > On 05/19/2017 04:34 AM, Anton Nefedov wrote: >> From: "Denis V. Lunev" <den@openvz.org> >> >> Currently each single write operation can result in 3 write operations >> if guest offsets are not cluster aligned. One write is performed for the >> real payload and two for COW-ed areas. Thus the data possibly lays >> non-contiguously on the host filesystem. This will reduce further >> sequential read performance significantly. >> >> The patch allocates the space in the file with cluster granularity, >> ensuring >> 1. better host offset locality >> 2. less space allocation operations >> (which can be expensive on distributed storages) > > s/storages/storage/ > Done. >> >> Signed-off-by: Denis V. Lunev <den@openvz.org> >> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com> >> --- >> block/qcow2.c | 32 +++++++++++++++++++++++++++++++- >> 1 file changed, 31 insertions(+), 1 deletion(-) >> > >> diff --git a/block/qcow2.c b/block/qcow2.c >> index a8d61f0..2e6a0ec 100644 >> --- a/block/qcow2.c >> +++ b/block/qcow2.c >> @@ -1575,6 +1575,32 @@ fail: >> return ret; >> } >> >> +static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta) >> +{ >> + BDRVQcow2State *s = bs->opaque; >> + BlockDriverState *file = bs->file->bs; >> + QCowL2Meta *m; >> + int ret; >> + >> + for (m = l2meta; m != NULL; m = m->next) { >> + uint64_t bytes = m->nb_clusters << s->cluster_bits; >> + >> + if (m->cow_start.nb_bytes == 0 && m->cow_end.nb_bytes == 0) { >> + continue; >> + } >> + >> + /* try to alloc host space in one chunk for better locality */ >> + ret = file->drv->bdrv_co_pwrite_zeroes(file, m->alloc_offset, bytes, 0); > > Are we guaranteed that this is a fast operation? (That is, it either > results in a hole or an error, and doesn't waste time tediously writing > actual zeroes) > well, block_int.h reads: /* * Efficiently zero a region of the disk image. Typically an image format * would use a compact metadata representation to implement this. This * function pointer may be NULL or return -ENOSUP and .bdrv_co_writev() * will be called instead. */ int coroutine_fn (*bdrv_co_pwrite_zeroes)(BlockDriverState *bs, int64_t offset, int count, BdrvRequestFlags flags); (and that's why the driver function is used directly, bypassing the 'safe' bdrv interface that would try to write zeroes no matter the cost) As far as I checked the drivers mostly follow the idea >> + >> + if (ret != 0) { >> + continue; >> + } > > Supposing we are using a file system that doesn't support holes, then > ret will not be zero, and we ended up not allocating anything after all. > Is that a problem that we are just blindly continuing the loop as our > reaction to the error? > > /reads further > > I guess not - you aren't reacting to any error call, but merely using > the side effect that an allocation happened for speed when it worked, > and ignoring failure (you get the old behavior of the write() now > causing the allocation) when it didn't. 
> yes, exactly >> + >> + file->total_sectors = MAX(file->total_sectors, >> + (m->alloc_offset + bytes) / BDRV_SECTOR_SIZE); >> + } >> +} >> + >> static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset, >> uint64_t bytes, QEMUIOVector *qiov, >> int flags) >> @@ -1656,8 +1682,12 @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset, >> if (ret < 0) { >> goto fail; >> } >> - >> qemu_co_mutex_unlock(&s->lock); >> + >> + if (bs->file->bs->drv->bdrv_co_pwrite_zeroes != NULL) { >> + handle_alloc_space(bs, l2meta); >> + } > > Is it really a good idea to be modifying the underlying protocol image > outside of the mutex? > as far as I understand, qcow2 usually modifies the underlying image outside of the mutex? I guess it's qcow2 metadata that we wouldn't want to touch unlocked > At any rate, it looks like your patch is doing a best-effort write > zeroes as an attempt to trigger consecutive allocation of the entire > cluster in the underlying protocol right after a cluster has been > allocated at the qcow2 format layer. Which means there are more > syscalls now than there were previously, but now when we do three > write() calls at offsets B, A, C, those three calls are into file space > that was allocated earlier by the write zeroes, rather than fresh calls > into unallocated space that is likely to trigger up to three disjoint > allocations. > > As a discussion point, wouldn't we achieve the same effect of less > fragmentation if we instead collect our data into a bounce buffer, and > only then do a single write() (or more likely, a writev() where the iov > is set up to reconstruct a single buffer on the syscall, but where the > source data is still at different offsets)? We'd be avoiding the extra > syscalls of pre-allocating the cluster, and while our write() call is > still causing allocations, at least it is now one cluster-aligned > write() rather than three sub-cluster out-of-order write()s. > I think yes we would achieve the same effect of less fragmentation; but pre-zeroing also makes the following patch possible (to skip COW if there is no backing data) I have follow-up patches which merge initial data and COW padding into a single writev(). After those it should become reasonable to skip cluster pre-zeroing (for cases when there is backing data). /Anton
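A rough illustration of the "skip COW when there is no backing data" direction mentioned above; the helper and its name are hypothetical, not from the posted series, and a complete version would also need to query the allocation status of the backing chain rather than only its length:

```c
/*
 * Illustrative only: once the whole cluster has been pre-zeroed, the COW
 * read/write can be skipped whenever the backing chain cannot contribute
 * data for the copied region, e.g. when there is no backing file at all
 * or the region lies beyond its end.
 */
static bool cow_region_needs_copy(BlockDriverState *bs,
                                  uint64_t guest_offset, uint64_t bytes)
{
    int64_t backing_len;

    if (bytes == 0 || bs->backing == NULL) {
        return false;               /* nothing to copy, reads as zeroes */
    }

    backing_len = bdrv_getlength(bs->backing->bs);
    if (backing_len >= 0 && guest_offset >= (uint64_t)backing_len) {
        return false;               /* beyond EOF of the backing file */
    }

    return true;                    /* backing data may exist: must COW */
}
```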
On 05/22/2017 10:00 PM, Eric Blake wrote: > On 05/19/2017 04:34 AM, Anton Nefedov wrote: >> From: "Denis V. Lunev" <den@openvz.org> >> >> Currently each single write operation can result in 3 write operations >> if guest offsets are not cluster aligned. One write is performed for the >> real payload and two for COW-ed areas. Thus the data possibly lays >> non-contiguously on the host filesystem. This will reduce further >> sequential read performance significantly. >> >> The patch allocates the space in the file with cluster granularity, >> ensuring >> 1. better host offset locality >> 2. less space allocation operations >> (which can be expensive on distributed storages) > s/storages/storage/ > >> Signed-off-by: Denis V. Lunev <den@openvz.org> >> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com> >> --- >> block/qcow2.c | 32 +++++++++++++++++++++++++++++++- >> 1 file changed, 31 insertions(+), 1 deletion(-) >> >> diff --git a/block/qcow2.c b/block/qcow2.c >> index a8d61f0..2e6a0ec 100644 >> --- a/block/qcow2.c >> +++ b/block/qcow2.c >> @@ -1575,6 +1575,32 @@ fail: >> return ret; >> } >> >> +static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta) >> +{ >> + BDRVQcow2State *s = bs->opaque; >> + BlockDriverState *file = bs->file->bs; >> + QCowL2Meta *m; >> + int ret; >> + >> + for (m = l2meta; m != NULL; m = m->next) { >> + uint64_t bytes = m->nb_clusters << s->cluster_bits; >> + >> + if (m->cow_start.nb_bytes == 0 && m->cow_end.nb_bytes == 0) { >> + continue; >> + } >> + >> + /* try to alloc host space in one chunk for better locality */ >> + ret = file->drv->bdrv_co_pwrite_zeroes(file, m->alloc_offset, bytes, 0); > Are we guaranteed that this is a fast operation? (That is, it either > results in a hole or an error, and doesn't waste time tediously writing > actual zeroes) > This is why we call driver directly. We expect that replacing with actualy zeroes write happens only in generic function. >> + >> + if (ret != 0) { >> + continue; >> + } > Supposing we are using a file system that doesn't support holes, then > ret will not be zero, and we ended up not allocating anything after all. > Is that a problem that we are just blindly continuing the loop as our > reaction to the error? > > /reads further > > I guess not - you aren't reacting to any error call, but merely using > the side effect that an allocation happened for speed when it worked, > and ignoring failure (you get the old behavior of the write() now > causing the allocation) when it didn't. good point >> + >> + file->total_sectors = MAX(file->total_sectors, >> + (m->alloc_offset + bytes) / BDRV_SECTOR_SIZE); >> + } >> +} >> + >> static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset, >> uint64_t bytes, QEMUIOVector *qiov, >> int flags) >> @@ -1656,8 +1682,12 @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset, >> if (ret < 0) { >> goto fail; >> } >> - >> qemu_co_mutex_unlock(&s->lock); >> + >> + if (bs->file->bs->drv->bdrv_co_pwrite_zeroes != NULL) { >> + handle_alloc_space(bs, l2meta); >> + } > Is it really a good idea to be modifying the underlying protocol image > outside of the mutex? yes! This is by design, not accidental. The area is protected by the meta and thus writes to not allocated areas will pass. > At any rate, it looks like your patch is doing a best-effort write > zeroes as an attempt to trigger consecutive allocation of the entire > cluster in the underlying protocol right after a cluster has been > allocated at the qcow2 format layer. 
Which means there are more > syscalls now than there were previously, but now when we do three > write() calls at offsets B, A, C, those three calls are into file space > that was allocated earlier by the write zeroes, rather than fresh calls > into unallocated space that is likely to trigger up to three disjoint > allocations. > > As a discussion point, wouldn't we achieve the same effect of less > fragmentation if we instead collect our data into a bounce buffer, and > only then do a single write() (or more likely, a writev() where the iov > is set up to reconstruct a single buffer on the syscall, but where the > source data is still at different offsets)? We'd be avoiding the extra > syscalls of pre-allocating the cluster, and while our write() call is > still causing allocations, at least it is now one cluster-aligned > write() rather than three sub-cluster out-of-order write()s. >
The bigger the cluster gets, the more significant the difference becomes. Actually we have done that already in the original patchset, but those changes are a bit controversial in terms of performance: they work much better on SSD and worse on HDD. Maybe we have done something wrong, but only simple changes that are unquestionable in terms of performance are included here.
Den
Am 19.05.2017 um 11:34 hat Anton Nefedov geschrieben: > From: "Denis V. Lunev" <den@openvz.org> > > Currently each single write operation can result in 3 write operations > if guest offsets are not cluster aligned. One write is performed for the > real payload and two for COW-ed areas. Thus the data possibly lays > non-contiguously on the host filesystem. This will reduce further > sequential read performance significantly. > > The patch allocates the space in the file with cluster granularity, > ensuring > 1. better host offset locality > 2. less space allocation operations > (which can be expensive on distributed storages) > > Signed-off-by: Denis V. Lunev <den@openvz.org> > Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com> I don't think this is the right approach. You end up with two write operations: One write_zeroes and then one data write. If the backend actually supports efficient zero writes, write_zeroes won't necessarily allocate space, but writing data will definitely split the already existing allocation. If anything, we need a new bdrv_allocate() or something that would call fallocate() instead of abusing write_zeroes. It seems much clearer to me that simply unifying the three write requests into a single one is an improvement. And it's easy to do, I even had a patch once to do this. The reason that I didn't send it was that it seemed to conflict with the data cache approach > block/qcow2.c | 32 +++++++++++++++++++++++++++++++- > 1 file changed, 31 insertions(+), 1 deletion(-) > > diff --git a/block/qcow2.c b/block/qcow2.c > index a8d61f0..2e6a0ec 100644 > --- a/block/qcow2.c > +++ b/block/qcow2.c > @@ -1575,6 +1575,32 @@ fail: > return ret; > } > > +static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta) > +{ > + BDRVQcow2State *s = bs->opaque; > + BlockDriverState *file = bs->file->bs; > + QCowL2Meta *m; > + int ret; > + > + for (m = l2meta; m != NULL; m = m->next) { > + uint64_t bytes = m->nb_clusters << s->cluster_bits; > + > + if (m->cow_start.nb_bytes == 0 && m->cow_end.nb_bytes == 0) { > + continue; > + } > + > + /* try to alloc host space in one chunk for better locality */ > + ret = file->drv->bdrv_co_pwrite_zeroes(file, m->alloc_offset, bytes, 0); No. This is what you bypass: * All sanity checks that the block layer does * bdrv_inc/dec_in_flight(), which is required for drain to work correctly. Not doing this will cause crashes. * tracked_request_begin/end(), mark_request_serialising() and wait_serialising_requests(), which are required for serialising requests to work correctly * Ensuring correct request alignment for file. This means that e.g. qcow2 with cluster size 512 on a host with a 4k native disk will break. * blkdebug events * before_write_notifiers. Not calling these will cause corrupted backups if someone backups file. * Dirty bitmap updates * Updating write_gen, wr_highest_offset and total_sectors * Ensuring that bl.max_pwrite_zeroes and bl.pwrite_zeroes_alignment are respected And these are just the obvious things. I'm sure I missed some. > + if (ret != 0) { > + continue; > + } > + > + file->total_sectors = MAX(file->total_sectors, > + (m->alloc_offset + bytes) / BDRV_SECTOR_SIZE); You only compensate for part of a single item in the list above, by duplicating code with block/io.c. This is not how to do things. 
As I said above, I think you don't really want write_zeroes anyway, but if you wanted a write_zeroes "but only if it's efficient" (which I'm not sure is a good thing to want), then a better way might be introducing a new request flag. > + } > +} > + > static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset, > uint64_t bytes, QEMUIOVector *qiov, > int flags) Kevin
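For illustration, the protocol-level piece of the bdrv_allocate() idea could be as small as a Linux fallocate() call that reserves space without writing data and without growing the visible file size. The function name and its placement are assumptions, not existing QEMU code:

```c
#define _GNU_SOURCE          /* for fallocate() and FALLOC_FL_KEEP_SIZE */
#include <fcntl.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical file-posix helper: reserve space for a cluster-sized range. */
static int raw_do_allocate(int fd, int64_t offset, int64_t bytes)
{
    if (fallocate(fd, FALLOC_FL_KEEP_SIZE, offset, bytes) == 0) {
        return 0;
    }
    /* e.g. EOPNOTSUPP on filesystems without fallocate support;
     * the caller would simply fall back to plain data writes */
    return -errno;
}
```

The point of reaching for fallocate() here is that it reserves blocks instead of writing zeroes, which is closer to what the qcow2 layer actually wants from this operation.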
On 05/26/2017 11:11 AM, Kevin Wolf wrote: > Am 19.05.2017 um 11:34 hat Anton Nefedov geschrieben: >> From: "Denis V. Lunev" <den@openvz.org> >> >> Currently each single write operation can result in 3 write operations >> if guest offsets are not cluster aligned. One write is performed for the >> real payload and two for COW-ed areas. Thus the data possibly lays >> non-contiguously on the host filesystem. This will reduce further >> sequential read performance significantly. >> >> The patch allocates the space in the file with cluster granularity, >> ensuring >> 1. better host offset locality >> 2. less space allocation operations >> (which can be expensive on distributed storages) >> >> Signed-off-by: Denis V. Lunev <den@openvz.org> >> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com> > I don't think this is the right approach. You end up with two > write operations: One write_zeroes and then one data write. If the > backend actually supports efficient zero writes, write_zeroes won't > necessarily allocate space, but writing data will definitely split > the already existing allocation. If anything, we need a new > bdrv_allocate() or something that would call fallocate() instead of > abusing write_zeroes. great idea. Very nice then. > It seems much clearer to me that simply unifying the three write > requests into a single one is an improvement. And it's easy to do, I > even had a patch once to do this. The reason that I didn't send it was > that it seemed to conflict with the data cache approach These changes help a lot on 2 patterns: - writing 4Kb into the middle of the block. The bigger the block size, more we gain - sequential write, which is not sector aligned and comes in 2 requests: we perform allocation, which should be significantly faster than actual write and after that both parts of the write can be executed in parallel. At my opinion both cases are frequent and important. But OK, the code should be improved, you are absolutely right here. >> block/qcow2.c | 32 +++++++++++++++++++++++++++++++- >> 1 file changed, 31 insertions(+), 1 deletion(-) >> >> diff --git a/block/qcow2.c b/block/qcow2.c >> index a8d61f0..2e6a0ec 100644 >> --- a/block/qcow2.c >> +++ b/block/qcow2.c >> @@ -1575,6 +1575,32 @@ fail: >> return ret; >> } >> >> +static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta) >> +{ >> + BDRVQcow2State *s = bs->opaque; >> + BlockDriverState *file = bs->file->bs; >> + QCowL2Meta *m; >> + int ret; >> + >> + for (m = l2meta; m != NULL; m = m->next) { >> + uint64_t bytes = m->nb_clusters << s->cluster_bits; >> + >> + if (m->cow_start.nb_bytes == 0 && m->cow_end.nb_bytes == 0) { >> + continue; >> + } >> + >> + /* try to alloc host space in one chunk for better locality */ >> + ret = file->drv->bdrv_co_pwrite_zeroes(file, m->alloc_offset, bytes, 0); > No. This is what you bypass: > > * All sanity checks that the block layer does > > * bdrv_inc/dec_in_flight(), which is required for drain to work > correctly. Not doing this will cause crashes. > > * tracked_request_begin/end(), mark_request_serialising() and > wait_serialising_requests(), which are required for serialising > requests to work correctly > > * Ensuring correct request alignment for file. This means that e.g. > qcow2 with cluster size 512 on a host with a 4k native disk will > break. > > * blkdebug events > > * before_write_notifiers. Not calling these will cause corrupted backups > if someone backups file. 
> > * Dirty bitmap updates > > * Updating write_gen, wr_highest_offset and total_sectors > > * Ensuring that bl.max_pwrite_zeroes and bl.pwrite_zeroes_alignment are > respected > > And these are just the obvious things. I'm sure I missed some. > You seem to be right; I had not thought about it from this angle. >> + if (ret != 0) { >> + continue; >> + } >> + >> + file->total_sectors = MAX(file->total_sectors, >> + (m->alloc_offset + bytes) / BDRV_SECTOR_SIZE); > You only compensate for part of a single item in the list above, by > duplicating code with block/io.c. This is not how to do things. > > As I said above, I think you don't really want write_zeroes anyway, but > if you wanted a write_zeroes "but only if it's efficient" (which I'm not > sure is a good thing to want), then a better way might be introducing a > new request flag. > >> + } >> +} >> + >> static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset, >> uint64_t bytes, QEMUIOVector *qiov, >> int flags) > Kevin Thank you for the review and the ideas ;) Den
On 05/26/2017 11:57 AM, Denis V. Lunev wrote: > On 05/26/2017 11:11 AM, Kevin Wolf wrote: >> Am 19.05.2017 um 11:34 hat Anton Nefedov geschrieben: >>> From: "Denis V. Lunev" <den@openvz.org> >>> >>> Currently each single write operation can result in 3 write operations >>> if guest offsets are not cluster aligned. One write is performed for the >>> real payload and two for COW-ed areas. Thus the data possibly lays >>> non-contiguously on the host filesystem. This will reduce further >>> sequential read performance significantly. >>> >>> The patch allocates the space in the file with cluster granularity, >>> ensuring >>> 1. better host offset locality >>> 2. less space allocation operations >>> (which can be expensive on distributed storages) >>> >>> Signed-off-by: Denis V. Lunev <den@openvz.org> >>> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com> >> I don't think this is the right approach. You end up with two >> write operations: One write_zeroes and then one data write. If the >> backend actually supports efficient zero writes, write_zeroes won't >> necessarily allocate space, but writing data will definitely split >> the already existing allocation. If anything, we need a new >> bdrv_allocate() or something that would call fallocate() instead of >> abusing write_zeroes. > great idea. Very nice then. > hi Kevin, thanks a lot for your comments. The driver's callback is commented as: /* * Efficiently zero a region of the disk image. Typically an image format * would use a compact metadata representation to implement this. This * function pointer may be NULL or return -ENOSUP and .bdrv_co_writev() * will be called instead. */ int coroutine_fn (*bdrv_co_pwrite_zeroes)(BlockDriverState *bs, int64_t offset, int count, BdrvRequestFlags flags); so it looks like one may expect either efficiency or ENOTSUP from that one? but not necessarily from the safe "facade" bdrv_co_pwrite_zeroes() which falls back to pwritev() /* * Efficiently zero a region of the disk image. Note that this is a regular * I/O request like read or write and should have a reasonable size. This * function is not suitable for zeroing the entire image in a single request * because it may allocate memory for the entire region. */ int coroutine_fn bdrv_co_pwrite_zeroes(BdrvChild *child, int64_t offset, int count, BdrvRequestFlags flags); (maybe we should have even more explicit comment here) Then let's indeed create a new bdrv_allocate(), which will result in bdrv_co_pwrite_zeroes(flags|=ZERO_EFFICIENTLY), so we'll reuse most of the code and employ the same driver callback, but return ENOTSUP if it fails? >> It seems much clearer to me that simply unifying the three write >> requests into a single one is an improvement. And it's easy to do, I >> even had a patch once to do this. The reason that I didn't send it was >> that it seemed to conflict with the data cache approach > These changes help a lot on 2 patterns: > - writing 4Kb into the middle of the block. The bigger the block size, > more we gain > - sequential write, which is not sector aligned and comes in 2 requests: > we perform allocation, which should be significantly faster than actual > write and after that both parts of the write can be executed in parallel. > At my opinion both cases are frequent and important. > > But OK, the code should be improved, you are absolutely right here. 
> >>> block/qcow2.c | 32 +++++++++++++++++++++++++++++++- >>> 1 file changed, 31 insertions(+), 1 deletion(-) >>> >>> diff --git a/block/qcow2.c b/block/qcow2.c >>> index a8d61f0..2e6a0ec 100644 >>> --- a/block/qcow2.c >>> +++ b/block/qcow2.c >>> @@ -1575,6 +1575,32 @@ fail: >>> return ret; >>> } >>> >>> +static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta) >>> +{ >>> + BDRVQcow2State *s = bs->opaque; >>> + BlockDriverState *file = bs->file->bs; >>> + QCowL2Meta *m; >>> + int ret; >>> + >>> + for (m = l2meta; m != NULL; m = m->next) { >>> + uint64_t bytes = m->nb_clusters << s->cluster_bits; >>> + >>> + if (m->cow_start.nb_bytes == 0 && m->cow_end.nb_bytes == 0) { >>> + continue; >>> + } >>> + >>> + /* try to alloc host space in one chunk for better locality */ >>> + ret = file->drv->bdrv_co_pwrite_zeroes(file, m->alloc_offset, bytes, 0); >> No. This is what you bypass: >> >> * All sanity checks that the block layer does >> >> * bdrv_inc/dec_in_flight(), which is required for drain to work >> correctly. Not doing this will cause crashes. >> >> * tracked_request_begin/end(), mark_request_serialising() and >> wait_serialising_requests(), which are required for serialising >> requests to work correctly >> >> * Ensuring correct request alignment for file. This means that e.g. >> qcow2 with cluster size 512 on a host with a 4k native disk will >> break. >> >> * blkdebug events >> >> * before_write_notifiers. Not calling these will cause corrupted backups >> if someone backups file. >> >> * Dirty bitmap updates >> >> * Updating write_gen, wr_highest_offset and total_sectors >> >> * Ensuring that bl.max_pwrite_zeroes and bl.pwrite_zeroes_alignment are >> respected >> >> And these are just the obvious things. I'm sure I missed some. >> > > You seems right. I have not though about that from this angle. > >>> + if (ret != 0) { >>> + continue; >>> + } >>> + >>> + file->total_sectors = MAX(file->total_sectors, >>> + (m->alloc_offset + bytes) / BDRV_SECTOR_SIZE); >> You only compensate for part of a single item in the list above, by >> duplicating code with block/io.c. This is not how to do things. >> >> As I said above, I think you don't really want write_zeroes anyway, but >> if you wanted a write_zeroes "but only if it's efficient" (which I'm not >> sure is a good thing to want), then a better way might be introducing a >> new request flag. >> >>> + } >>> +} >>> + >>> static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset, >>> uint64_t bytes, QEMUIOVector *qiov, >>> int flags) >> Kevin > > Thank you for review and ideas ;) > > Den > /Anton
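A sketch of how the flag-based variant discussed above might look from the qcow2 side; the flag name BDRV_REQ_ALLOCATE_ONLY, the chosen bit and the helper are purely illustrative, not existing QEMU API:

```c
/*
 * Illustrative only: a request flag that turns bdrv_co_pwrite_zeroes()
 * into "zero this range, but only if the driver can do it efficiently".
 * The generic layer would return -ENOTSUP instead of falling back to a
 * bounce-buffer write of literal zeroes when this flag is set.
 */
#define BDRV_REQ_ALLOCATE_ONLY (1 << 6)   /* assumed free bit */

static coroutine_fn int qcow2_try_prealloc(BlockDriverState *bs,
                                           QCowL2Meta *m)
{
    BDRVQcow2State *s = bs->opaque;
    uint64_t bytes = m->nb_clusters << s->cluster_bits;

    /*
     * Going through bs->file (the BdrvChild) instead of the driver
     * callback keeps request tracking, draining, alignment handling and
     * the other block-layer services that the original patch bypassed.
     */
    return bdrv_co_pwrite_zeroes(bs->file, m->alloc_offset, bytes,
                                 BDRV_REQ_ALLOCATE_ONLY);
}
```

-ENOTSUP from such a helper would simply mean "no cheap allocation available", and the caller would carry on with ordinary data writes, exactly as handle_alloc_space() does today when the driver call fails.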
On 05/26/2017 11:11 AM, Kevin Wolf wrote:
>
> No. This is what you bypass:
some analysis for the record.
> * All sanity checks that the block layer does
>
> * bdrv_inc/dec_in_flight(), which is required for drain to work
> correctly. Not doing this will cause crashes.
Should not be a problem. We are inside a request that is already accounted for at the parent; drain waits for the parent first and only then switches to the child.
> * tracked_request_begin/end(), mark_request_serialising() and
> wait_serialising_requests(), which are required for serialising
> requests to work correctly
Should not cause a problem, as we are protected by the metadata and this area is not exposed to read operations until the request is fully complete. But yes, not that good.
> * Ensuring correct request alignment for file. This means that e.g.
> qcow2 with cluster size 512 on a host with a 4k native disk will
> break.
We have already checked that the clusters are aligned.
> * blkdebug events
yes
> * before_write_notifiers. Not calling these will cause corrupted backups
> if someone backups file.
No, as there is no user-visible data here; we are writing zeroes where the guest already sees zeroes.
> * Dirty bitmap updates
Same: from the guest's point of view the data is not changed. It is even good that we do not mark these areas as dirty.
> * Updating write_gen, wr_highest_offset and total_sectors
write_gen is not a problem, it will be set on the actual write. total_sectors is changed manually; agreed that this is not really elegant.
> * Ensuring that bl.max_pwrite_zeroes and bl.pwrite_zeroes_alignment are
> respected
yes
> And these are just the obvious things. I'm sure I missed some.
Anyway, this is a good list of things to take into account within bdrv_allocate(). But it is important that, from the guest's point of view, the content of the COW areas is not changed, and thus we can take some important shortcuts.
Den
Am 26.05.2017 um 10:57 hat Denis V. Lunev geschrieben: > On 05/26/2017 11:11 AM, Kevin Wolf wrote: > > It seems much clearer to me that simply unifying the three write > > requests into a single one is an improvement. And it's easy to do, I > > even had a patch once to do this. The reason that I didn't send it was > > that it seemed to conflict with the data cache approach > These changes help a lot on 2 patterns: > - writing 4Kb into the middle of the block. The bigger the block size, > more we gain > - sequential write, which is not sector aligned and comes in 2 requests: > we perform allocation, which should be significantly faster than actual > write and after that both parts of the write can be executed in parallel. > At my opinion both cases are frequent and important. Both cases are helped somewhat by reducing the number of I/O requests. So as a first step, I really think we need to get Berto's patches ready to be merged. As for this series, there is one important restriction that you don't mention: Both cases only benefit from the changes when there is no backing file (or the backing file is sparse at the given cluster). The main reason why people use qcow2 is for snapshots, so qcow2 without a backing file, or with an empty one, is actually not my primary concern. To be honest, I am a bit concerned by the additional complexity brought in by this series (which at first sight also seemed to be tacked on rather than fully integrated), but I'll have to study it in more detail. Maybe another thing to keep in mind is that the sequential write case can completely avoid COW in theory. I wonder whether we should make another attempt at this. Previously, it was the complexity of my own prototypes that stopped me from actually merging them. Kevin
Am 26.05.2017 um 12:57 hat Denis V. Lunev geschrieben: > On 05/26/2017 11:11 AM, Kevin Wolf wrote: > > > > No. This is what you bypass: > some analysis for the record. > [...] > anyway, this is a good list of things to take into account > within bdrv_allocate. But it is important, that from the > guest point of view the content of COW areas is not > changed and thus we can have some important shortcuts. Yes, not everything in this list is an actual problem for this specific case. Some things are, though, and if you have to make an analysis like this, that's already a sign that something is wrong. The real point I wanted to make is that the bdrv_*() wrappers do a whole lot of useful things (half of which everyone, including me, would probably forget when listing them without looking at the code) - and on top of this, when making changes to the block layer, we have an expectation that every request consistently goes through the block/io.c functions. So not wanting one of these useful things in a special case isn't a good enough reason to completely bypass them. If we really need to avoid some part of a function, we can always add another flag or something that makes that part optional. Kevin
```diff
diff --git a/block/qcow2.c b/block/qcow2.c
index a8d61f0..2e6a0ec 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1575,6 +1575,32 @@ fail:
     return ret;
 }
 
+static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
+{
+    BDRVQcow2State *s = bs->opaque;
+    BlockDriverState *file = bs->file->bs;
+    QCowL2Meta *m;
+    int ret;
+
+    for (m = l2meta; m != NULL; m = m->next) {
+        uint64_t bytes = m->nb_clusters << s->cluster_bits;
+
+        if (m->cow_start.nb_bytes == 0 && m->cow_end.nb_bytes == 0) {
+            continue;
+        }
+
+        /* try to alloc host space in one chunk for better locality */
+        ret = file->drv->bdrv_co_pwrite_zeroes(file, m->alloc_offset, bytes, 0);
+
+        if (ret != 0) {
+            continue;
+        }
+
+        file->total_sectors = MAX(file->total_sectors,
+                                  (m->alloc_offset + bytes) / BDRV_SECTOR_SIZE);
+    }
+}
+
 static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
                                          uint64_t bytes, QEMUIOVector *qiov,
                                          int flags)
@@ -1656,8 +1682,12 @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
         if (ret < 0) {
             goto fail;
         }
-
         qemu_co_mutex_unlock(&s->lock);
+
+        if (bs->file->bs->drv->bdrv_co_pwrite_zeroes != NULL) {
+            handle_alloc_space(bs, l2meta);
+        }
+
         BLKDBG_EVENT(bs->file, BLKDBG_WRITE_AIO);
         trace_qcow2_writev_data(qemu_coroutine_self(),
                                 cluster_offset + offset_in_cluster);
```