Message ID | 1495186480-114192-4-git-send-email-anton.nefedov@virtuozzo.com (mailing list archive) |
---|---|
State | New, archived |
On 05/19/2017 04:34 AM, Anton Nefedov wrote:
> If COW area of the newly allocated cluster is zeroes, there is no reason
> to write zero sectors in perform_cow() again now as whole clusters are
> zeroed out in single chunks by handle_alloc_space().

But that's only true if you can guarantee that handle_alloc_space()
succeeded at ensuring the cluster reads as zeroes. If you silently
ignore errors (which is what patch 1/13 does), you risk assuming that
the cluster reads as zeroes when in reality it does not, and then you
have corrupted data.

The idea of avoiding a COW of areas that read as zero at the source when
the destination also already reads as zeroes makes sense, but I'm not
convinced that this patch is safe as written.

>
> Introduce QCowL2Meta field "reduced", since the existing fields
> (offset and nb_bytes) still has to keep other write requests from
> simultaneous writing in the area
>
> iotest 060:
> write to the discarded cluster does not trigger COW anymore.
> so, break on write_aio event instead, will work for the test
> (but write won't fail anymore, so update reference output)
>
> iotest 066:
> cluster-alignment areas that were not really COWed are now detected
> as zeroes, hence the initial write has to be exactly the same size for
> the maps to match
>
> performance tests: ===
>
> qemu-io,
>   results in seconds to complete (less is better)
>   random write 4k to empty image, no backing
>   HDD
>     64k cluster
>       128M over 128M image:   160 -> 160 ( x1    )
>       128M over 2G image:      86 -> 84  ( x1    )
>       128M over 8G image:      40 -> 29  ( x1.4  )
>     1M cluster
>       32M over 8G image:       58 -> 23  ( x2.5  )
>
>   SSD
>     64k cluster
>       2G over 2G image:        71 -> 38  ( x1.9  )
>       512M over 8G image:      85 -> 8   ( x10.6 )
>     1M cluster
>       128M over 32G image:    314 -> 2   ( x157  )

At any rate, the benchmark numbers show that there is merit to pursuing
the idea of reducing I/O when partial cluster writes can avoid writing
COW'd zeroes on either side of the data.
On 05/22/2017 10:24 PM, Eric Blake wrote:
> On 05/19/2017 04:34 AM, Anton Nefedov wrote:
>> If COW area of the newly allocated cluster is zeroes, there is no reason
>> to write zero sectors in perform_cow() again now as whole clusters are
>> zeroed out in single chunks by handle_alloc_space().
>
> But that's only true if you can guarantee that handle_alloc_space()
> succeeded at ensuring the cluster reads as zeroes. If you silently
> ignore errors (which is what patch 1/13 does), you risk assuming that
> the cluster reads as zeroes when in reality it does not, and then you
> have corrupted data.
>

Sure; COW is only skipped if pwrite_zeroes() from patch 1/13 succeeds

> The idea of avoiding a COW of areas that read as zero at the source when
> the destination also already reads as zeroes makes sense, but I'm not
> convinced that this patch is safe as written.
>

/Anton
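
The safety condition under discussion (skip COW only when the zeroing step reported success) can be illustrated with a small standalone sketch. This is not the QEMU code: `CowRegion`, `maybe_reduce()` and `cow_needed()` are hypothetical names invented here, and `zero_ret` merely stands in for whatever return value the zeroing call from patch 1/13 produces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for Qcow2COWRegion; not the QEMU definition. */
typedef struct {
    size_t nb_bytes;   /* bytes that would normally be copied by COW */
    bool reduced;      /* true: region is known to read as zeroes */
} CowRegion;

/*
 * Mark a region as reduced only if the earlier zeroing call succeeded
 * (zero_ret == 0) AND the source data for the region reads as zeroes.
 * A failed zeroing call leaves the region untouched, so COW still runs.
 */
static void maybe_reduce(CowRegion *r, int zero_ret, bool source_is_zero)
{
    assert(r != NULL);
    if (zero_ret == 0 && source_is_zero && r->nb_bytes != 0) {
        r->reduced = true;
    }
}

/* Returns true when the COW write for this region must still happen. */
static bool cow_needed(const CowRegion *r)
{
    assert(r != NULL);
    return r->nb_bytes != 0 && !r->reduced;
}
```

In this sketch an error from the zeroing step can never cause a COW to be skipped, which is the property the review asks to be guaranteed.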
On 05/22/2017 10:24 PM, Eric Blake wrote:
> On 05/19/2017 04:34 AM, Anton Nefedov wrote:
>> If COW area of the newly allocated cluster is zeroes, there is no reason
>> to write zero sectors in perform_cow() again now as whole clusters are
>> zeroed out in single chunks by handle_alloc_space().
> But that's only true if you can guarantee that handle_alloc_space()
> succeeded at ensuring the cluster reads as zeroes. If you silently
> ignore errors (which is what patch 1/13 does), you risk assuming that
> the cluster reads as zeroes when in reality it does not, and then you
> have corrupted data.
>
> The idea of avoiding a COW of areas that read as zero at the source when
> the destination also already reads as zeroes makes sense, but I'm not
> convinced that this patch is safe as written.

We will recheck the error path. OK.

>> Introduce QCowL2Meta field "reduced", since the existing fields
>> (offset and nb_bytes) still has to keep other write requests from
>> simultaneous writing in the area
>>
>> iotest 060:
>> write to the discarded cluster does not trigger COW anymore.
>> so, break on write_aio event instead, will work for the test
>> (but write won't fail anymore, so update reference output)
>>
>> iotest 066:
>> cluster-alignment areas that were not really COWed are now detected
>> as zeroes, hence the initial write has to be exactly the same size for
>> the maps to match
>>
>> performance tests: ===
>>
>> qemu-io,
>>   results in seconds to complete (less is better)
>>   random write 4k to empty image, no backing
>>   HDD
>>     64k cluster
>>       128M over 128M image:   160 -> 160 ( x1    )
>>       128M over 2G image:      86 -> 84  ( x1    )
>>       128M over 8G image:      40 -> 29  ( x1.4  )
>>     1M cluster
>>       32M over 8G image:       58 -> 23  ( x2.5  )
>>
>>   SSD
>>     64k cluster
>>       2G over 2G image:        71 -> 38  ( x1.9  )
>>       512M over 8G image:      85 -> 8   ( x10.6 )
>>     1M cluster
>>       128M over 32G image:    314 -> 2   ( x157  )
> At any rate, the benchmark numbers show that there is merit to pursuing
> the idea of reducing I/O when partial cluster writes can avoid writing
> COW'd zeroes on either side of the data.
>

Yes! This is exactly the point. Also, with this approach we would allow
sequential writes that are not cluster-aligned, which is also very good.

Den
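
To make the "zeroes on either side of the data" concrete, here is a minimal sketch of how the head and tail COW regions of a partial write inside one newly allocated cluster could be computed. It mirrors the shape of the `cow_start`/`cow_end` initialisers in the patch, but `Region`, `CowLayout` and `cow_layout()` are hypothetical names, and the sketch assumes the write fits within a single cluster.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical types for illustration; not the QEMU structures. */
typedef struct {
    uint64_t offset;    /* offset of the region inside the cluster */
    uint64_t nb_bytes;  /* length of the region in bytes */
} Region;

typedef struct {
    Region cow_start;   /* bytes before the guest data */
    Region cow_end;     /* bytes after the guest data */
} CowLayout;

/*
 * Compute the COW regions around a write of `len` bytes at `guest_offset`.
 * Assumption: the write does not cross a cluster boundary.
 */
static CowLayout cow_layout(uint64_t cluster_size, uint64_t guest_offset,
                            uint64_t len)
{
    assert(cluster_size > 0 && len > 0);

    /* Equivalent in spirit to offset_into_cluster() */
    uint64_t head = guest_offset % cluster_size;
    assert(head + len <= cluster_size);

    CowLayout l = {
        .cow_start = { .offset = 0, .nb_bytes = head },
        .cow_end = { .offset = head + len,
                     .nb_bytes = cluster_size - (head + len) },
    };
    return l;
}
```

For a 4 KiB write at offset 4 KiB into a 64 KiB cluster, the sketch yields a 4 KiB head and a 56 KiB tail, so 60 KiB of zero writes per cluster would be avoidable whenever both regions read as zeroes.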
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 347d94b..cf18dee 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -758,7 +758,7 @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m, Qcow2COWRegion *r)
     BDRVQcow2State *s = bs->opaque;
     int ret;
 
-    if (r->nb_bytes == 0) {
+    if (r->nb_bytes == 0 || r->reduced) {
         return 0;
     }
 
@@ -1267,10 +1267,12 @@ static int handle_alloc(BlockDriverState *bs, uint64_t guest_offset,
         .cow_start = {
             .offset = 0,
             .nb_bytes = offset_into_cluster(s, guest_offset),
+            .reduced = false,
         },
         .cow_end = {
             .offset = nb_bytes,
             .nb_bytes = avail_bytes - nb_bytes,
+            .reduced = false,
         },
     };
     qemu_co_queue_init(&(*m)->dependent_requests);
diff --git a/block/qcow2.c b/block/qcow2.c
index b885dfc..b438f22 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -64,6 +64,9 @@ typedef struct {
 #define QCOW2_EXT_MAGIC_BACKING_FORMAT 0xE2792ACA
 #define QCOW2_EXT_MAGIC_FEATURE_TABLE 0x6803f857
 
+static bool is_zero_sectors(BlockDriverState *bs, int64_t start,
+                            uint32_t count);
+
 static int qcow2_probe(const uint8_t *buf, int buf_size, const char *filename)
 {
     const QCowHeader *cow_header = (const void *)buf;
@@ -1575,6 +1578,25 @@ fail:
     return ret;
 }
 
+static void handle_cow_reduce(BlockDriverState *bs, QCowL2Meta *m)
+{
+    if (bs->encrypted) {
+        return;
+    }
+    if (!m->cow_start.reduced && m->cow_start.nb_bytes != 0 &&
+        is_zero_sectors(bs,
+                        (m->offset + m->cow_start.offset) >> BDRV_SECTOR_BITS,
+                        m->cow_start.nb_bytes >> BDRV_SECTOR_BITS)) {
+        m->cow_start.reduced = true;
+    }
+    if (!m->cow_end.reduced && m->cow_end.nb_bytes != 0 &&
+        is_zero_sectors(bs,
+                        (m->offset + m->cow_end.offset) >> BDRV_SECTOR_BITS,
+                        m->cow_end.nb_bytes >> BDRV_SECTOR_BITS)) {
+        m->cow_end.reduced = true;
+    }
+}
+
 static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
 {
     BDRVQcow2State *s = bs->opaque;
@@ -1598,6 +1620,7 @@ static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
         file->total_sectors =
             MAX(file->total_sectors,
                 (m->alloc_offset + bytes) / BDRV_SECTOR_SIZE);
+        handle_cow_reduce(bs, m);
     }
 }
 
diff --git a/block/qcow2.h b/block/qcow2.h
index 1801dc3..ba15c08 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -305,6 +305,10 @@ typedef struct Qcow2COWRegion {
 
     /** Number of bytes to copy */
     int nb_bytes;
+
+    /** The region is filled with zeroes and does not require COW
+     */
+    bool reduced;
 } Qcow2COWRegion;
 
 /**
diff --git a/tests/qemu-iotests/060 b/tests/qemu-iotests/060
index 8e95c45..3a0f096 100755
--- a/tests/qemu-iotests/060
+++ b/tests/qemu-iotests/060
@@ -160,7 +160,7 @@ poke_file "$TEST_IMG" '131084' "\x00\x00" # 0x2000c
 # any unallocated cluster, leading to an attempt to overwrite the second L2
 # table. Finally, resume the COW write and see it fail (but not crash).
 echo "open -o file.driver=blkdebug $TEST_IMG
-break cow_read 0
+break write_aio 0
 aio_write 0k 1k
 wait_break 0
 write 64k 64k
diff --git a/tests/qemu-iotests/060.out b/tests/qemu-iotests/060.out
index 9e8f5b9..ea29a32 100644
--- a/tests/qemu-iotests/060.out
+++ b/tests/qemu-iotests/060.out
@@ -107,7 +107,8 @@ qcow2: Marking image as corrupt: Preventing invalid write on metadata (overlaps
 blkdebug: Suspended request '0'
 write failed: Input/output error
 blkdebug: Resuming request '0'
-aio_write failed: No medium found
+wrote 1024/1024 bytes at offset 0
+1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
 === Testing unallocated image header ===
 
diff --git a/tests/qemu-iotests/066 b/tests/qemu-iotests/066
index 8638217..3c216a1 100755
--- a/tests/qemu-iotests/066
+++ b/tests/qemu-iotests/066
@@ -71,7 +71,7 @@ echo
 _make_test_img $IMG_SIZE
 
 # Create data clusters (not aligned to an L2 table)
-$QEMU_IO -c 'write -P 42 1M 256k' "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "write -P 42 $(((1024 + 32) * 1024)) 192k" "$TEST_IMG" | _filter_qemu_io
 orig_map=$($QEMU_IMG map --output=json "$TEST_IMG")
 
 # Convert the data clusters to preallocated zero clusters
diff --git a/tests/qemu-iotests/066.out b/tests/qemu-iotests/066.out
index 3d9da9b..093431e 100644
--- a/tests/qemu-iotests/066.out
+++ b/tests/qemu-iotests/066.out
@@ -19,8 +19,8 @@ Offset Length Mapped to File
 === Writing to preallocated zero clusters ===
 
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67109376
-wrote 262144/262144 bytes at offset 1048576
-256 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 196608/196608 bytes at offset 1081344
+192 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 wrote 262144/262144 bytes at offset 1048576
 256 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 wrote 196608/196608 bytes at offset 1081344