Message ID | 20240529213921.3166462-1-bvanassche@acm.org
State      | New, archived
Series     | [v7] block: Improve IOPS by removing the fairness code
On Wed, May 29, 2024 at 02:39:20PM -0700, Bart Van Assche wrote:
> There is an algorithm in the block layer for maintaining fairness
> across queues that share a tag set. The sbitmap implementation has
> improved so much that we don't need the block layer fairness algorithm
> anymore and that we can rely on the sbitmap implementation to guarantee
> fairness.
>
> On my test setup (x86 VM with 72 CPU cores) this patch results in 2.9% more
> IOPS. IOPS have been measured as follows:
>
> $ modprobe null_blk nr_devices=1 completion_nsec=0
> $ fio --bs=4096 --disable_clat=1 --disable_slat=1 --group_reporting=1 \
>     --gtod_reduce=1 --invalidate=1 --ioengine=psync --ioscheduler=none \
>     --norandommap --runtime=60 --rw=randread --thread --time_based=1 \
>     --buffered=0 --numjobs=64 --name=/dev/nullb0 --filename=/dev/nullb0
>
> Additionally, it has been verified as follows that all request queues that
> share a tag set process I/O even if the completion times are different (see
> also test block/035):
> - Create a first request queue with completion time 1 ms and queue
>   depth 64.
> - Create a second request queue with completion time 100 ms and that
>   shares the tag set of the first request queue.
> - Submit I/O to both request queues with fio.
>
> Tests have shown that the IOPS for this test case are 29859 and 318 or a
> ratio of about 94. This ratio is close to the completion time ratio.
> While this is unfair, both request queues make progress at a consistent
> pace.

I suggested running a more lopsided workload on a high contention tag
set: here's an example fio profile to exaggerate this:

---
[global]
rw=randread
direct=1
ioengine=io_uring
time_based
runtime=60
ramp_time=10

[zero]
bs=131072
filename=/dev/nvme0n1
iodepth=256
iodepth_batch_submit=64
iodepth_batch_complete=64

[one]
bs=512
filename=/dev/nvme0n2
iodepth=1
--

My test nvme device has 2 namespaces, 1 IO queue, and only 63 tags.

Without your patch:

zero: (groupid=0, jobs=1): err= 0: pid=465: Thu May 30 13:29:43 2024
  read: IOPS=14.0k, BW=1749MiB/s (1834MB/s)(103GiB/60002msec)
    lat (usec): min=2937, max=40980, avg=16990.33, stdev=1732.37
...
one: (groupid=0, jobs=1): err= 0: pid=466: Thu May 30 13:29:43 2024
  read: IOPS=2726, BW=1363KiB/s (1396kB/s)(79.9MiB/60001msec)
    lat (usec): min=45, max=4859, avg=327.52, stdev=335.25

With your patch:

zero: (groupid=0, jobs=1): err= 0: pid=341: Thu May 30 13:36:26 2024
  read: IOPS=14.8k, BW=1852MiB/s (1942MB/s)(109GiB/60004msec)
    lat (usec): min=3103, max=26191, avg=16322.77, stdev=1138.04
...
one: (groupid=0, jobs=1): err= 0: pid=342: Thu May 30 13:36:26 2024
  read: IOPS=1841, BW=921KiB/s (943kB/s)(54.0MiB/60001msec)
    lat (usec): min=51, max=5935, avg=503.81, stdev=608.41

So there's definitely a difference here that harms the lesser used
device for a modest gain on the higher demanding device. Does it matter?
I really don't know if I can answer that. It's just different is all I'm
saying.
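Keith's profile needs an NVMe device that exposes two namespaces behind a
single I/O queue. A rough null_blk approximation of the same shared-tag
contention, not something that was run in this thread, just a sketch that
assumes the documented null_blk module parameters (shared_tags,
submit_queues, hw_queue_depth, completion_nsec) behave as described, could
look like this:

# hw_queue_depth=63 mimics the small 63-tag shared queue from the nvme setup
$ modprobe null_blk nr_devices=2 shared_tags=1 submit_queues=1 \
      hw_queue_depth=63 completion_nsec=0
$ fio --name=zero --filename=/dev/nullb0 --rw=randread --direct=1 \
      --ioengine=io_uring --bs=128k --iodepth=256 --time_based --runtime=60 \
      --name=one --filename=/dev/nullb1 --rw=randread --direct=1 \
      --ioengine=io_uring --bs=512 --iodepth=1 --time_based --runtime=60

The small tag space is what makes the fairness code matter at all; with a
deep tag set the heavy job rarely starves the light one in either kernel.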
On 5/30/24 13:47, Keith Busch wrote:
> I suggested running a more lopsided workload on a high contention tag
> set: here's an example fio profile to exaggerate this:
>
> ---
> [global]
> rw=randread
> direct=1
> ioengine=io_uring
> time_based
> runtime=60
> ramp_time=10
>
> [zero]
> bs=131072
> filename=/dev/nvme0n1
> iodepth=256
> iodepth_batch_submit=64
> iodepth_batch_complete=64
>
> [one]
> bs=512
> filename=/dev/nvme0n2
> iodepth=1
> --
>
> My test nvme device has 2 namespaces, 1 IO queue, and only 63 tags.
>
> Without your patch:
>
> zero: (groupid=0, jobs=1): err= 0: pid=465: Thu May 30 13:29:43 2024
>   read: IOPS=14.0k, BW=1749MiB/s (1834MB/s)(103GiB/60002msec)
>     lat (usec): min=2937, max=40980, avg=16990.33, stdev=1732.37
> ...
> one: (groupid=0, jobs=1): err= 0: pid=466: Thu May 30 13:29:43 2024
>   read: IOPS=2726, BW=1363KiB/s (1396kB/s)(79.9MiB/60001msec)
>     lat (usec): min=45, max=4859, avg=327.52, stdev=335.25
>
> With your patch:
>
> zero: (groupid=0, jobs=1): err= 0: pid=341: Thu May 30 13:36:26 2024
>   read: IOPS=14.8k, BW=1852MiB/s (1942MB/s)(109GiB/60004msec)
>     lat (usec): min=3103, max=26191, avg=16322.77, stdev=1138.04
> ...
> one: (groupid=0, jobs=1): err= 0: pid=342: Thu May 30 13:36:26 2024
>   read: IOPS=1841, BW=921KiB/s (943kB/s)(54.0MiB/60001msec)
>     lat (usec): min=51, max=5935, avg=503.81, stdev=608.41
>
> So there's definitely a difference here that harms the lesser used
> device for a modest gain on the higher demanding device. Does it matter?
> I really don't know if I can answer that. It's just different is all I'm
> saying.

Hi Keith,

Thank you for having run this test. I propose that users who want better
fairness than what my patch supports use an appropriate mechanism for
improving fairness (e.g. blk-iocost or blk-iolat). This leaves the choice
between maximum performance and maximum fairness to the user. Does this
sound good to you?

Thanks,

Bart.
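For reference, the mechanisms Bart points at are blk-iocost and
blk-iolatency, both driven through the cgroup v2 io controller. A minimal
blk-iocost sketch might look like the following; the device number 259:0,
the group names, the weights and the job-file names are placeholders, and
since these controllers arbitrate per device, whether they can restore
fairness across request queues that only share a tag set is exactly the
question Keith and Jens raise below:

# as root; 259:0 is a placeholder for the MAJ:MIN of the device to police,
# repeat the enable line for each device of interest
$ echo +io > /sys/fs/cgroup/cgroup.subtree_control
$ echo "259:0 enable=1" > /sys/fs/cgroup/io.cost.qos
$ mkdir /sys/fs/cgroup/zero /sys/fs/cgroup/one
$ echo "259:0 100" > /sys/fs/cgroup/zero/io.weight
$ echo "259:0 800" > /sys/fs/cgroup/one/io.weight
# zero.fio/one.fio: Keith's profile split into its two jobs (placeholder
# names); run each from a shell placed in the matching cgroup
$ echo $$ > /sys/fs/cgroup/one/cgroup.procs && fio one.fio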
On Thu, May 30, 2024 at 02:02:20PM -0700, Bart Van Assche wrote:
> Thank you for having run this test. I propose that users who want better
> fairness than what my patch supports use an appropriate mechanism for
> improving fairness (e.g. blk-iocost or blk-iolat). This leaves the choice
> between maximum performance and maximum fairness to the user. Does this
> sound good to you?

I really don't know, I generally test with low latency devices and
disable those blk services because their overhead is too high. I'm
probably not the target demographic for those mechanisms. :)

I just wanted to push the edge cases to see where things diverge.
Perhaps Jens can weigh in on the impact and suggested remedies?
On 5/30/24 3:17 PM, Keith Busch wrote:
> On Thu, May 30, 2024 at 02:02:20PM -0700, Bart Van Assche wrote:
>> Thank you for having run this test. I propose that users who want better
>> fairness than what my patch supports use an appropriate mechanism for
>> improving fairness (e.g. blk-iocost or blk-iolat). This leaves the choice
>> between maximum performance and maximum fairness to the user. Does this
>> sound good to you?
>
> I really don't know, I generally test with low latency devices and
> disable those blk services because their overhead is too high. I'm
> probably not the target demographic for those mechanisms. :)

Yeah same. But outside of that, needing to configure something else is
also a bit of a cop out. From the initial posting, it's quoting 2.9%
gain. For lots of cases, adding blk-iocost or blk-iolat would be MORE
than a 2.9% hit.

That said, I'd love to kill the code, but I still don't think we have
good numbers on it. Are yours fully stable? What does the qd=1 test do
_without_ having anyone compete with it? Is the bandwidth nicely
balanced if each does qd=32? I'm again kindly asking for some testing
:-)

> I just wanted to push the edge cases to see where things diverge.
> Perhaps Jens can weigh in on the impact and suggested remedies?

Don't think we have enough data yet to make the call...
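The two extra data points Jens asks for could be gathered with invocations
along these lines (a sketch reusing Keith's namespace names; the equal 4k
block size in the second run is an assumption so that the bandwidth
comparison is like for like):

# 1) the qd=1 job on its own, with nothing competing for tags
$ fio --name=one --filename=/dev/nvme0n2 --rw=randread --direct=1 \
      --ioengine=io_uring --bs=512 --iodepth=1 --time_based --runtime=60

# 2) both namespaces at qd=32, to see whether bandwidth splits evenly
$ fio --name=zero --filename=/dev/nvme0n1 --rw=randread --direct=1 \
      --ioengine=io_uring --bs=4k --iodepth=32 --time_based --runtime=60 \
      --name=one --filename=/dev/nvme0n2 --rw=randread --direct=1 \
      --ioengine=io_uring --bs=4k --iodepth=32 --time_based --runtime=60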
On 5/30/24 16:38, Jens Axboe wrote:
> On 5/30/24 3:17 PM, Keith Busch wrote:
>> On Thu, May 30, 2024 at 02:02:20PM -0700, Bart Van Assche wrote:
>>> Thank you for having run this test. I propose that users who want better
>>> fairness than what my patch supports use an appropriate mechanism for
>>> improving fairness (e.g. blk-iocost or blk-iolat). This leaves the choice
>>> between maximum performance and maximum fairness to the user. Does this
>>> sound good to you?
>>
>> I really don't know, I generally test with low latency devices and
>> disable those blk services because their overhead is too high. I'm
>> probably not the target demographic for those mechanisms. :)
>
> Yeah same. But outside of that, needing to configure something else is
> also a bit of a cop out. From the initial posting, it's quoting 2.9%
> gain. For lots of cases, adding blk-iocost or blk-iolat would be MORE
> than a 2.9% hit.
>
> That said, I'd love to kill the code, but I still don't think we have
> good numbers on it. Are yours fully stable? What does the qd=1 test do
> _without_ having anyone compete with it? Is the bandwidth nicely
> balanced if each does qd=32? I'm again kindly asking for some testing
> :-)
>
>> I just wanted to push the edge cases to see where things diverge.
>> Perhaps Jens can weigh in on the impact and suggested remedies?
>
> Don't think we have enough data yet to make the call...

Hi Jens,

I'm interested in this patch because it solves a UFS performance issue.
Because this patch significantly improves performance for UFS devices, we
have been carrying some form of this patch in the Android kernel tree for
more than two years.

Thanks,

Bart.
diff --git a/block/blk-core.c b/block/blk-core.c
index 82c3ae22d76d..9a11b786ecd4 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -425,8 +425,6 @@ struct request_queue *blk_alloc_queue(struct queue_limits *lim, int node_id)
 
         q->node = node_id;
 
-        atomic_set(&q->nr_active_requests_shared_tags, 0);
-
         timer_setup(&q->timeout, blk_rq_timed_out_timer, 0);
         INIT_WORK(&q->timeout_work, blk_timeout_work);
         INIT_LIST_HEAD(&q->icq_list);
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 770c0c2b72fa..3231e4eee098 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -478,11 +478,31 @@ static int hctx_sched_tags_bitmap_show(void *data, struct seq_file *m)
         return res;
 }
 
+struct count_active_params {
+        struct blk_mq_hw_ctx *hctx;
+        int *active;
+};
+
+static bool hctx_count_active(struct request *rq, void *data)
+{
+        const struct count_active_params *params = data;
+
+        if (rq->mq_hctx == params->hctx)
+                (*params->active)++;
+
+        return true;
+}
+
 static int hctx_active_show(void *data, struct seq_file *m)
 {
         struct blk_mq_hw_ctx *hctx = data;
+        int active = 0;
+        struct count_active_params params = { .hctx = hctx, .active = &active };
+
+        blk_mq_all_tag_iter(hctx->sched_tags ?: hctx->tags, hctx_count_active,
+                            &params);
 
-        seq_printf(m, "%d\n", __blk_mq_active_requests(hctx));
+        seq_printf(m, "%d\n", active);
         return 0;
 }
 
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index cc57e2dd9a0b..25334bfcabf8 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -105,10 +105,6 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
                             struct sbitmap_queue *bt)
 {
-        if (!data->q->elevator && !(data->flags & BLK_MQ_REQ_RESERVED) &&
-            !hctx_may_queue(data->hctx, bt))
-                return BLK_MQ_NO_TAG;
-
         if (data->shallow_depth)
                 return sbitmap_queue_get_shallow(bt, data->shallow_depth);
         else
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3b4df8e5ac9e..7ae2cf08320b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -425,8 +425,6 @@ __blk_mq_alloc_requests_batch(struct blk_mq_alloc_data *data)
                 rq_list_add(data->cached_rq, rq);
                 nr++;
         }
-        if (!(data->rq_flags & RQF_SCHED_TAGS))
-                blk_mq_add_active_requests(data->hctx, nr);
         /* caller already holds a reference, add for remainder */
         percpu_ref_get_many(&data->q->q_usage_counter, nr - 1);
         data->nr_tags -= nr;
@@ -511,8 +509,6 @@ static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
                 goto retry;
         }
 
-        if (!(data->rq_flags & RQF_SCHED_TAGS))
-                blk_mq_inc_active_requests(data->hctx);
         rq = blk_mq_rq_ctx_init(data, blk_mq_tags_from_data(data), tag);
         blk_mq_rq_time_init(rq, alloc_time_ns);
         return rq;
@@ -672,8 +668,6 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
         tag = blk_mq_get_tag(&data);
         if (tag == BLK_MQ_NO_TAG)
                 goto out_queue_exit;
-        if (!(data.rq_flags & RQF_SCHED_TAGS))
-                blk_mq_inc_active_requests(data.hctx);
         rq = blk_mq_rq_ctx_init(&data, blk_mq_tags_from_data(&data), tag);
         blk_mq_rq_time_init(rq, alloc_time_ns);
         rq->__data_len = 0;
@@ -715,10 +709,8 @@ static void __blk_mq_free_request(struct request *rq)
         blk_pm_mark_last_busy(rq);
         rq->mq_hctx = NULL;
 
-        if (rq->tag != BLK_MQ_NO_TAG) {
-                blk_mq_dec_active_requests(hctx);
+        if (rq->tag != BLK_MQ_NO_TAG)
                 blk_mq_put_tag(hctx->tags, ctx, rq->tag);
-        }
         if (sched_tag != BLK_MQ_NO_TAG)
                 blk_mq_put_tag(hctx->sched_tags, ctx, sched_tag);
         blk_mq_sched_restart(hctx);
@@ -1067,8 +1059,6 @@ static inline void blk_mq_flush_tag_batch(struct blk_mq_hw_ctx *hctx,
 {
         struct request_queue *q = hctx->queue;
 
-        blk_mq_sub_active_requests(hctx, nr_tags);
-
         blk_mq_put_tags(hctx->tags, tag_array, nr_tags);
         percpu_ref_put_many(&q->q_usage_counter, nr_tags);
 }
@@ -1758,9 +1748,6 @@ bool __blk_mq_alloc_driver_tag(struct request *rq)
         if (blk_mq_tag_is_reserved(rq->mq_hctx->sched_tags, rq->internal_tag)) {
                 bt = &rq->mq_hctx->tags->breserved_tags;
                 tag_offset = 0;
-        } else {
-                if (!hctx_may_queue(rq->mq_hctx, bt))
-                        return false;
         }
 
         tag = __sbitmap_queue_get(bt);
@@ -1768,7 +1755,6 @@ bool __blk_mq_alloc_driver_tag(struct request *rq)
                 return false;
 
         rq->tag = tag + tag_offset;
-        blk_mq_inc_active_requests(rq->mq_hctx);
         return true;
 }
 
@@ -3738,7 +3724,6 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
         if (!zalloc_cpumask_var_node(&hctx->cpumask, gfp, node))
                 goto free_hctx;
 
-        atomic_set(&hctx->nr_active, 0);
         if (node == NUMA_NO_NODE)
                 node = set->numa_node;
         hctx->numa_node = node;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 260beea8e332..db3b999e433a 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -271,70 +271,9 @@ static inline int blk_mq_get_rq_budget_token(struct request *rq)
         return -1;
 }
 
-static inline void __blk_mq_add_active_requests(struct blk_mq_hw_ctx *hctx,
-                                                int val)
-{
-        if (blk_mq_is_shared_tags(hctx->flags))
-                atomic_add(val, &hctx->queue->nr_active_requests_shared_tags);
-        else
-                atomic_add(val, &hctx->nr_active);
-}
-
-static inline void __blk_mq_inc_active_requests(struct blk_mq_hw_ctx *hctx)
-{
-        __blk_mq_add_active_requests(hctx, 1);
-}
-
-static inline void __blk_mq_sub_active_requests(struct blk_mq_hw_ctx *hctx,
-                                                int val)
-{
-        if (blk_mq_is_shared_tags(hctx->flags))
-                atomic_sub(val, &hctx->queue->nr_active_requests_shared_tags);
-        else
-                atomic_sub(val, &hctx->nr_active);
-}
-
-static inline void __blk_mq_dec_active_requests(struct blk_mq_hw_ctx *hctx)
-{
-        __blk_mq_sub_active_requests(hctx, 1);
-}
-
-static inline void blk_mq_add_active_requests(struct blk_mq_hw_ctx *hctx,
-                                              int val)
-{
-        if (hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
-                __blk_mq_add_active_requests(hctx, val);
-}
-
-static inline void blk_mq_inc_active_requests(struct blk_mq_hw_ctx *hctx)
-{
-        if (hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
-                __blk_mq_inc_active_requests(hctx);
-}
-
-static inline void blk_mq_sub_active_requests(struct blk_mq_hw_ctx *hctx,
-                                              int val)
-{
-        if (hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
-                __blk_mq_sub_active_requests(hctx, val);
-}
-
-static inline void blk_mq_dec_active_requests(struct blk_mq_hw_ctx *hctx)
-{
-        if (hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
-                __blk_mq_dec_active_requests(hctx);
-}
-
-static inline int __blk_mq_active_requests(struct blk_mq_hw_ctx *hctx)
-{
-        if (blk_mq_is_shared_tags(hctx->flags))
-                return atomic_read(&hctx->queue->nr_active_requests_shared_tags);
-        return atomic_read(&hctx->nr_active);
-}
 static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
                                            struct request *rq)
 {
-        blk_mq_dec_active_requests(hctx);
         blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag);
         rq->tag = BLK_MQ_NO_TAG;
 }
@@ -376,45 +315,6 @@ static inline void blk_mq_free_requests(struct list_head *list)
         }
 }
 
-/*
- * For shared tag users, we track the number of currently active users
- * and attempt to provide a fair share of the tag depth for each of them.
- */
-static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
-                                  struct sbitmap_queue *bt)
-{
-        unsigned int depth, users;
-
-        if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
-                return true;
-
-        /*
-         * Don't try dividing an ant
-         */
-        if (bt->sb.depth == 1)
-                return true;
-
-        if (blk_mq_is_shared_tags(hctx->flags)) {
-                struct request_queue *q = hctx->queue;
-
-                if (!test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
-                        return true;
-        } else {
-                if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
-                        return true;
-        }
-
-        users = READ_ONCE(hctx->tags->active_queues);
-        if (!users)
-                return true;
-
-        /*
-         * Allow at least some tags
-         */
-        depth = max((bt->sb.depth + users - 1) / users, 4U);
-        return __blk_mq_active_requests(hctx) < depth;
-}
-
 /* run the code block in @dispatch_ops with rcu/srcu read lock held */
 #define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops)        \
 do {                                                                   \
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 89ba6b16fe8b..daec5471197b 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -398,12 +398,6 @@ struct blk_mq_hw_ctx {
         /** @queue_num: Index of this hardware queue. */
         unsigned int            queue_num;
 
-        /**
-         * @nr_active: Number of active requests. Only used when a tag set is
-         * shared across request queues.
-         */
-        atomic_t                nr_active;
-
         /** @cpuhp_online: List to store request if CPU is going to die */
         struct hlist_node       cpuhp_online;
         /** @cpuhp_dead: List to store request if some CPU die. */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index aefdda9f4ec7..b9faa05a9139 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -453,8 +453,6 @@ struct request_queue {
         struct timer_list       timeout;
         struct work_struct      timeout_work;
 
-        atomic_t                nr_active_requests_shared_tags;
-
         struct blk_mq_tags      *sched_shared_tags;
 
         struct list_head        icq_list;
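The debugfs hunk above is the one place where the counters are replaced
rather than simply deleted: the per-hctx "active" attribute is now computed
by walking the tags instead of reading the removed atomics. A quick way to
sanity-check it while a workload is running might be the following one-liner
(as root, assuming CONFIG_BLK_DEBUG_FS and a mounted debugfs; nullb0 is a
placeholder disk name):

# prints one "<path>: <in-flight count>" line per hardware queue
$ grep . /sys/kernel/debug/block/nullb0/hctx*/active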
There is an algorithm in the block layer for maintaining fairness
across queues that share a tag set. The sbitmap implementation has
improved so much that we don't need the block layer fairness algorithm
anymore and that we can rely on the sbitmap implementation to guarantee
fairness.

On my test setup (x86 VM with 72 CPU cores) this patch results in 2.9% more
IOPS. IOPS have been measured as follows:

$ modprobe null_blk nr_devices=1 completion_nsec=0
$ fio --bs=4096 --disable_clat=1 --disable_slat=1 --group_reporting=1 \
    --gtod_reduce=1 --invalidate=1 --ioengine=psync --ioscheduler=none \
    --norandommap --runtime=60 --rw=randread --thread --time_based=1 \
    --buffered=0 --numjobs=64 --name=/dev/nullb0 --filename=/dev/nullb0

Additionally, it has been verified as follows that all request queues that
share a tag set process I/O even if the completion times are different (see
also test block/035):
- Create a first request queue with completion time 1 ms and queue
  depth 64.
- Create a second request queue with completion time 100 ms and that
  shares the tag set of the first request queue.
- Submit I/O to both request queues with fio.

Tests have shown that the IOPS for this test case are 29859 and 318 or a
ratio of about 94. This ratio is close to the completion time ratio.
While this is unfair, both request queues make progress at a consistent
pace.

This patch removes the following code and structure members:
- The function hctx_may_queue().
- blk_mq_hw_ctx.nr_active and request_queue.nr_active_requests_shared_tags
  and also all the code that modifies these two member variables.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
Changes compared to v6: rebased on top of Jens' for-next branch and improved
patch description.
Changes compared to v5: rebased on top of Jens' for-next branch.
Changes compared to v4: rebased on top of Jens' for-next branch.
Changes compared to v3: removed a "Fixes:" tag that got added accidentally.
Changes compared to v2: improved patch description.
Changes compared to v1: improved the debugfs code.

 block/blk-core.c       |   2 -
 block/blk-mq-debugfs.c |  22 ++++-
 block/blk-mq-tag.c     |   4 --
 block/blk-mq.c         |  17 +------
 block/blk-mq.h         | 100 -----------------------------------------
 include/linux/blk-mq.h |   6 ---
 include/linux/blkdev.h |   2 -
 7 files changed, 22 insertions(+), 131 deletions(-)
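For the two-queue verification described in the patch description, the final
"submit I/O to both request queues" step could look roughly like this; it
assumes the two null_blk devices created in steps 1 and 2 show up as
/dev/nullb0 and /dev/nullb1, and the authoritative recipe is the block/035
test referenced above:

$ fio --name=fast --filename=/dev/nullb0 --rw=randread --direct=1 \
      --ioengine=libaio --iodepth=64 --bs=4k --time_based --runtime=60 \
      --name=slow --filename=/dev/nullb1 --rw=randread --direct=1 \
      --ioengine=libaio --iodepth=64 --bs=4k --time_based --runtime=60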