
dm-thin: Improve performance of O_SYNC IOs to mapped data

Message ID 20231029161756.27025-1-yarden.maymon@volumez.com (mailing list archive)
State New, archived
Series dm-thin: Improve performance of O_SYNC IOs to mapped data

Commit Message

Yarden Maymon Oct. 29, 2023, 4:17 p.m. UTC
Running random write fio benchmarks on dm-thin with mapped data
shows a ~50% degradation when using O_SYNC:
* dm-thin without O_SYNC - 438k IOPS
* dm-thin with O_SYNC on mapped data - 204k IOPS
* directly on the underlying disk with O_SYNC - 451k IOPS, showing the
  problem is not the disk.

Since the data is already mapped, comparable results are expected with O_SYNC.

Currently, all O_SYNC IOs are routed to a slower, deferred path. This
happens early in thin_bio_map(), before checking for conflicting
in-flight IOs or whether the IO is already mapped.
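
The early test in question, as it appears in thin_bio_map() before this
patch (op_is_flush() matches bios carrying REQ_PREFLUSH or REQ_FUA,
which is how O_SYNC writes typically surface at the bio layer):

    if (op_is_flush(bio->bi_opf) || bio_op(bio) == REQ_OP_DISCARD) {
        thin_defer_bio_with_throttle(tc, bio);
        return DM_MAPIO_SUBMITTED;
    }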

Remove the early test, and move O_SYNC IOs to the regular data path.
An O_SYNC IO to mapped space that does not conflict with other
in-flight IOs will be remapped and routed to the faster path.
All other O_SYNC IO behavior is maintained (deferred).

The O_SYNC IO will be deferred if:

* It is not mapped - dm_thin_find_block() will return -ENODATA and the
  cell is deferred.

* There is an in-flight IO to the same virtual key - bio_detain() will
  add the IO to a cell and defer it.

    build_virtual_key(tc->td, block, &key);
    if (bio_detain(tc->pool, &key, bio, &virt_cell))
        return DM_MAPIO_SUBMITTED;

* There is an in-flight IO to the same physical key - bio_detain() will
  add the IO to a cell and defer it.

    build_data_key(tc->td, result.block, &key);
    if (bio_detain(tc->pool, &key, bio, &data_cell)) {
        cell_defer_no_holder(tc, virt_cell);
        return DM_MAPIO_SUBMITTED;
    }
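
Putting these pieces together, a condensed sketch of this part of
thin_bio_map() (simplified from drivers/md/dm-thin.c; the shared-block
check under case 0 is elided):

    build_virtual_key(tc->td, block, &key);
    if (bio_detain(tc->pool, &key, bio, &virt_cell))
        return DM_MAPIO_SUBMITTED;      /* conflicting virtual key: deferred */

    r = dm_thin_find_block(td, block, 0, &result);
    switch (r) {
    case 0:
        build_data_key(tc->td, result.block, &key);
        if (bio_detain(tc->pool, &key, bio, &data_cell)) {
            cell_defer_no_holder(tc, virt_cell);
            return DM_MAPIO_SUBMITTED;  /* conflicting data key: deferred */
        }
        inc_all_io_entry(tc->pool, bio);
        cell_defer_no_holder(tc, data_cell);
        cell_defer_no_holder(tc, virt_cell);
        remap(tc, bio, result.block);   /* mapped, conflict-free: fast path */
        return DM_MAPIO_REMAPPED;
    case -ENODATA:
    case -EWOULDBLOCK:
        thin_defer_cell(tc, virt_cell); /* not mapped: deferred */
        return DM_MAPIO_SUBMITTED;
    default:
        bio_io_error(bio);
        cell_defer_no_holder(tc, virt_cell);
        return DM_MAPIO_SUBMITTED;
    }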

-----------------------------------------------------

Benchmark results:

The benchmark was done on top of Ubuntu's 6.2.0-1008 kernel with commit
450e8dee51aa ("dm bufio: improve concurrent IO performance") backported.

fio params: --bs=4k --direct=1 --iodepth=32 --numjobs=8 --time_based
--runtime=5m.
dm-thin chunksize is 128k and allocation/thin_pool_zero=0 is set.
The results are in IOPS, represented as: avg_iops (max_iops).

Performance test on the underlying nvme device for baseline:
+-------------------+-----------------------+
| randwrite         | 446k (455k)           |
| randwrite sync    | 451k (455k)           |
| randrw 50/50      | 227k/227k (300k/300k) |
| randrw sync 50/50 | 227k/227k (300k/300k) |
| randread          | 773k (866k)           |
| randread sync     | 773k (861k)           |
+-------------------+-----------------------+

dm-thin blkdev with all data allocated (16GiB):
+-------------------+-----------------------+-----------------------+
|                   | Pre Patch             | Post Patch            |
+-------------------+-----------------------+-----------------------+
| randwrite         | 438k (442k)           | 450k (453k)           |
| randwrite sync    | 204k (228k)           | 450k (454k)           |
| randrw 50/50      | 224k/224k (236k/235k) | 225k/225k (234k/234k) |
| randrw sync 50/50 | 191k/191k (199k/197k) | 225k/225k (235k/235k) |
| randread          | 650k (703k)           | 661k (705k)           |
| randread sync     | 659k (705k)           | 661k (707k)           |
+-------------------+-----------------------+-----------------------+
There's a notable enhancement in random write performance with sync
compared to previous results. In the 50/50 sync test, there's also a
boost in random read due to the availability of extra resources for
reading. Furthermore, no other aspects appear to be impacted.

dm-thin blkdev with no data allocated and a capacity of 1.6TB (to
increase the chance of randomly hitting a non-allocated block):
+-------------------+-------------------------+------------------------+
|                   | Pre Patch               | Post Patch             |
+-------------------+-------------------------+------------------------+
| randwrite         | 116k (253k)             | 112k (240k)            |
| randwrite sync    | 100k (121k)             | 182k (266k)            |
| randrw 50/50      | 66.7k/66.7k (109k/109k) | 67k/67k (109k/109k)    |
| randrw sync 50/50 | 76.9k/76.8k (101k/101k) | 77.6k/77.6k (122k/122k)|
| randread          | 336k (349k)             | 335k (352k)            |
| randread sync     | 334k (351k)             | 336k (348k)            |
+-------------------+-------------------------+------------------------+
In this case there isn't a marked difference, with the exception of
random write sync, since the unmapped data path has stayed the same. The
boost in random write sync performance can be explained by random IOs
hitting the same space more than once within the test: a first write
maps a whole 128k chunk, and at these rates a 5-minute run issues far
more 4k writes than there are 128k chunks in 1.6TB, so later writes land
on already-mapped chunks and take the fast path.

-----------------------------------------------------

Tests:
I have run the thin tests from https://github.com/jthornber/dmtest-python.
I have run xfstests on top of a thin LVM: https://github.com/kdave/xfstests

I conducted a manual data integrity test:
* Constructed a layout with nvme target -> dm-thin -> nvme device.
* Used vdbench from an initiator host to write to this remote nvme
  device, journaling to a local drive.
* Initiated a reboot on the media host.
* Verified the data using vdbench once the reboot finished.

Signed-off-by: Yarden Maymon <yarden.maymon@volumez.com>
---
 drivers/md/dm-thin.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Mike Snitzer Nov. 15, 2023, 7:18 p.m. UTC | #1
On Sun, Oct 29 2023 at 12:17P -0400,
Yarden Maymon <yarden.maymon@volumez.com> wrote:

> Running random write fio benchmarks on dm-thin with mapped data
> shows a ~50% degradation when using O_SYNC:
> * dm-thin without O_SYNC - 438k IOPS
> * dm-thin with O_SYNC on mapped data - 204k IOPS
> * directly on the underlying disk with O_SYNC - 451k IOPS, showing the
>   problem is not the disk.
> 
> Since the data is already mapped, comparable results are expected with O_SYNC.
> 
> Currently, all O_SYNC IOs are routed to a slower, deferred path. This
> happens early in thin_bio_map(), before checking for conflicting
> in-flight IOs or whether the IO is already mapped.
> 
> Remove the early test, and move O_SYNC IOs to the regular data path.
> An O_SYNC IO to mapped space that does not conflict with other
> in-flight IOs will be remapped and routed to the faster path.
> All other O_SYNC IO behavior is maintained (deferred).
> 
> The O_SYNC IO will be deferred if:
> 
> * It is not mapped - dm_thin_find_block() will return -ENODATA and the
>   cell is deferred.
>
> * There is an in-flight IO to the same virtual key - bio_detain() will
>   add the IO to a cell and defer it.
> 
>     build_virtual_key(tc->td, block, &key);
>     if (bio_detain(tc->pool, &key, bio, &virt_cell))
>         return DM_MAPIO_SUBMITTED;
> 
> * There is an in-flight IO to the same physical key - bio_detain() will
>   add the IO to a cell and defer it.
> 
>     build_data_key(tc->td, result.block, &key);
>     if (bio_detain(tc->pool, &key, bio, &data_cell)) {
>         cell_defer_no_holder(tc, virt_cell);
>         return DM_MAPIO_SUBMITTED;
>     }

One of thinp's biggest reasons for deferring flushes is to coalesce
associated thin-pool metadata commits.

Your change undermines bio_triggers_commit() -- given that a flush
will only trigger a commit if routed through process_deferred_bios().
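
For reference, the trigger in question, as it reads in
drivers/md/dm-thin.c:

    static bool bio_triggers_commit(struct thin_c *tc, struct bio *bio)
    {
            return op_is_flush(bio->bi_opf) &&
                    dm_thin_changed_this_transaction(tc->td);
    }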

This raises doubts about the safety of your change (despite your
documented testing).

> [...]
>
> There's a notable enhancement in random write performance with sync
> compared to previous results. In the 50/50 sync test, there's also a
> boost in random read due to the availability of extra resources for
> reading. Furthermore, no other aspects appear to be impacted.

Not sure what you mean by "boost in random read due to the
availability of extra resources for reading." -- how does your change
make extra resources for reading?

> [...]
> 
> diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> index 07c7f9795b10..ecd429260bee 100644
> --- a/drivers/md/dm-thin.c
> +++ b/drivers/md/dm-thin.c
> @@ -2743,7 +2743,7 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
>  		return DM_MAPIO_SUBMITTED;
>  	}
>  
> -	if (op_is_flush(bio->bi_opf) || bio_op(bio) == REQ_OP_DISCARD) {
> +	if (bio_op(bio) == REQ_OP_DISCARD) {
>  		thin_defer_bio_with_throttle(tc, bio);
>  		return DM_MAPIO_SUBMITTED;
>  	}
> -- 
> 2.25.1
> 

Did you explore still deferring flush bios _but_ without imposing a
throttle?

Doing so would still punt to the workqueue to handle the flushes
(which could be a source of delay) but it avoids completely missing
thinp metadata commits.

Please revert your patch and try something like this instead:

diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index f88646f9f81f..704e9bb75b0f 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -2748,7 +2748,10 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
 	}
 
 	if (op_is_flush(bio->bi_opf) || bio_op(bio) == REQ_OP_DISCARD) {
-		thin_defer_bio_with_throttle(tc, bio);
+		if (bio->bi_opf & REQ_SYNC)
+			thin_defer_bio(tc, bio);
+		else
+			thin_defer_bio_with_throttle(tc, bio);
 		return DM_MAPIO_SUBMITTED;
 	}

Patch

diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 07c7f9795b10..ecd429260bee 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -2743,7 +2743,7 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
 		return DM_MAPIO_SUBMITTED;
 	}
 
-	if (op_is_flush(bio->bi_opf) || bio_op(bio) == REQ_OP_DISCARD) {
+	if (bio_op(bio) == REQ_OP_DISCARD) {
 		thin_defer_bio_with_throttle(tc, bio);
 		return DM_MAPIO_SUBMITTED;
 	}