Message ID: 20240118180541.930783-3-axboe@kernel.dk
State: New, archived
Series: mq-deadline scalability improvements
On Thu, Jan 18, 2024 at 11:04:57AM -0700, Jens Axboe wrote:
> +#define DD_CPU_BUCKETS 32
> +#define DD_CPU_BUCKETS_MASK (DD_CPU_BUCKETS - 1)

A bit of wasted space on machines with < 32 CPUs, no? Sure, it's not
much and the fixed size makes the implementation simpler, but these add
up.
On 1/18/24 11:25 AM, Keith Busch wrote:
> On Thu, Jan 18, 2024 at 11:04:57AM -0700, Jens Axboe wrote:
>> +#define DD_CPU_BUCKETS 32
>> +#define DD_CPU_BUCKETS_MASK (DD_CPU_BUCKETS - 1)
>
> A bit of wasted space on machines with < 32 CPUs, no? Sure, it's not
> much and the fixed size makes the implementation simpler, but these add
> up.

True, could make it allocated instead, 32 was just pulled out of thin
air. Might make sense to make the magic 8 or something instead, that's
probably Good Enough in general. Or we can make it dependent on the
number of online CPUs when initialized. Honestly just didn't want to
overcomplicate this initial RFC.
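As an aside, a minimal sketch of the "size it by the number of online
CPUs" idea mentioned above could look roughly like the following. This
is not part of the posted patch: the helper name dd_alloc_buckets, the
change of bucket_lists to a dynamically allocated array, and the new
bucket_mask field are all assumptions made for illustration, and CPU
hotplug after initialization is ignored here.

/*
 * Hypothetical variant, not from the RFC: allocate the bucket array at
 * init time based on the number of online CPUs, rounded up to a power
 * of two so that "cpu & mask" indexing keeps working. Assumes
 * deadline_data grows a bucket_mask field and bucket_lists becomes a
 * pointer (freed again in dd_exit_sched()).
 */
static int dd_alloc_buckets(struct deadline_data *dd)
{
	unsigned int nr = roundup_pow_of_two(num_online_cpus());
	unsigned int i;

	dd->bucket_lists = kcalloc(nr, sizeof(*dd->bucket_lists), GFP_KERNEL);
	if (!dd->bucket_lists)
		return -ENOMEM;
	dd->bucket_mask = nr - 1;

	for (i = 0; i < nr; i++) {
		INIT_LIST_HEAD(&dd->bucket_lists[i].list);
		spin_lock_init(&dd->bucket_lists[i].lock);
	}
	return 0;
}

The power-of-two rounding is only there so the existing
"get_cpu() & mask" lookup can stay; a modulo would work just as well if
an odd bucket count were preferred.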
On 1/18/24 10:04, Jens Axboe wrote:
> If we attempt to insert a list of requests but someone else is already
> running an insertion, then fall back to queueing it internally and let
> the existing inserter finish the operation.

Because this patch adds significant complexity: what are the use cases
that benefit from this kind of optimization? Are these perhaps workloads
on systems with many CPU cores and fast storage? If the storage is fast,
why use mq-deadline instead of "none" as the I/O scheduler?

Thanks,

Bart.
On 1/18/24 11:31 AM, Bart Van Assche wrote:
> On 1/18/24 10:04, Jens Axboe wrote:
>> If we attempt to insert a list of requests but someone else is already
>> running an insertion, then fall back to queueing it internally and let
>> the existing inserter finish the operation.
>
> Because this patch adds significant complexity: what are the use cases
> that benefit from this kind of optimization? Are these perhaps workloads
> on systems with many CPU cores and fast storage? If the storage is fast,
> why use mq-deadline instead of "none" as the I/O scheduler?

You and others complain that mq-deadline is slow and doesn't scale;
these two patches help improve that situation. Not sure why this is even
a question?
On 1/18/24 10:33, Jens Axboe wrote:
> On 1/18/24 11:31 AM, Bart Van Assche wrote:
>> On 1/18/24 10:04, Jens Axboe wrote:
>>> If we attempt to insert a list of requests but someone else is already
>>> running an insertion, then fall back to queueing it internally and let
>>> the existing inserter finish the operation.
>>
>> Because this patch adds significant complexity: what are the use cases
>> that benefit from this kind of optimization? Are these perhaps workloads
>> on systems with many CPU cores and fast storage? If the storage is fast,
>> why use mq-deadline instead of "none" as the I/O scheduler?
>
> You and others complain that mq-deadline is slow and doesn't scale;
> these two patches help improve that situation. Not sure why this is even
> a question?

How much does this patch improve performance?

Thanks,

Bart.
On 1/18/24 11:53 AM, Bart Van Assche wrote:
> On 1/18/24 10:33, Jens Axboe wrote:
>> On 1/18/24 11:31 AM, Bart Van Assche wrote:
>>> On 1/18/24 10:04, Jens Axboe wrote:
>>>> If we attempt to insert a list of requests but someone else is already
>>>> running an insertion, then fall back to queueing it internally and let
>>>> the existing inserter finish the operation.
>>>
>>> Because this patch adds significant complexity: what are the use cases
>>> that benefit from this kind of optimization? Are these perhaps workloads
>>> on systems with many CPU cores and fast storage? If the storage is fast,
>>> why use mq-deadline instead of "none" as the I/O scheduler?
>>
>> You and others complain that mq-deadline is slow and doesn't scale;
>> these two patches help improve that situation. Not sure why this is even
>> a question?
>
> How much does this patch improve performance?

Do you need me to link the cover letter that you were CC'ed on?
On 1/18/24 10:56, Jens Axboe wrote:
> Do you need me to link the cover letter that you were CC'ed on?
There is no reason to use such an aggressive tone in your emails.
In the cover letter I see performance numbers for the patch series in
its entirety but not for the individual patches. I'd like to know by
how much this patch by itself improves performance because whether or
not I will review this patch will depend on that data.
Thanks,
Bart.
On 1/18/24 1:46 PM, Bart Van Assche wrote:
> On 1/18/24 10:56, Jens Axboe wrote:
>> Do you need me to link the cover letter that you were CC'ed on?
>
> There is no reason to use such an aggressive tone in your emails.

I'm getting frustrated with you because I need to say the same things
multiple times, and it doesn't seem like it's getting received at the
other end. And you had clearly seen the results, which means that rather
than the passive-aggressive question, you could have said "It'd probably
be useful to include some performance numbers in the commit messages
themselves as well," which would have been received very differently
than asking a question you already have the answer to.

> In the cover letter I see performance numbers for the patch series in
> its entirety but not for the individual patches. I'd like to know by
> how much this patch by itself improves performance because whether or
> not I will review this patch will depend on that data.

You can't really split them up, as you need both to see the real
benefit. The only reason they are split is because it makes sense to do
so in terms of the code, as they are two different paths and points of
contention. This obviously makes patch 2 look better than it is, and
that would be the case even if I swapped their order.

As mentioned in a previous reply, I'll edit the commit messages and will
probably just include the various performance results in patch 2 and
reference patch 1 from there too.
On 1/18/24 12:52, Jens Axboe wrote:
> And you had clearly seen the results, which means that rather
> than the passive-aggressive question, you could have said [ ... ]

I'm never passive-aggressive. If that is how my message was received, I
want to apologize.

Bart.
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 9e0ab3ea728a..eeeaaff189e1 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -81,8 +81,17 @@ struct dd_per_prio {
 
 enum {
 	DD_DISPATCHING	= 0,
+	DD_INSERTING	= 1,
 };
 
+#define DD_CPU_BUCKETS		32
+#define DD_CPU_BUCKETS_MASK	(DD_CPU_BUCKETS - 1)
+
+struct dd_bucket_list {
+	struct list_head list;
+	spinlock_t lock;
+} ____cacheline_aligned_in_smp;
+
 struct deadline_data {
 	/*
 	 * run time data
@@ -94,6 +103,9 @@ struct deadline_data {
 
 	unsigned long run_state;
 
+	atomic_t insert_seq;
+	struct dd_bucket_list bucket_lists[DD_CPU_BUCKETS];
+
 	struct dd_per_prio per_prio[DD_PRIO_COUNT];
 
 	/* Data direction of latest dispatched request. */
@@ -711,7 +723,7 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	struct deadline_data *dd;
 	struct elevator_queue *eq;
 	enum dd_prio prio;
-	int ret = -ENOMEM;
+	int i, ret = -ENOMEM;
 
 	eq = elevator_alloc(q, e);
 	if (!eq)
@@ -725,6 +737,12 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 
 	spin_lock_init(&dd->lock);
 	spin_lock_init(&dd->zone_lock);
+	atomic_set(&dd->insert_seq, 0);
+
+	for (i = 0; i < DD_CPU_BUCKETS; i++) {
+		INIT_LIST_HEAD(&dd->bucket_lists[i].list);
+		spin_lock_init(&dd->bucket_lists[i].lock);
+	}
 
 	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
 		struct dd_per_prio *per_prio = &dd->per_prio[prio];
@@ -876,6 +894,67 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	}
 }
 
+static void dd_dispatch_from_buckets(struct deadline_data *dd,
+				     struct list_head *list)
+{
+	int i;
+
+	for (i = 0; i < DD_CPU_BUCKETS; i++) {
+		struct dd_bucket_list *bucket = &dd->bucket_lists[i];
+
+		if (list_empty_careful(&bucket->list))
+			continue;
+		spin_lock(&bucket->lock);
+		list_splice_init(&bucket->list, list);
+		spin_unlock(&bucket->lock);
+	}
+}
+
+/*
+ * If we can grab the dd->lock, then just return and do the insertion as per
+ * usual. If not, add to one of our internal buckets, and afterwards recheck
+ * if we should retry.
+ */
+static bool dd_insert_to_bucket(struct deadline_data *dd,
+				struct list_head *list, int *seq)
+	__acquires(&dd->lock)
+{
+	struct dd_bucket_list *bucket;
+	int next_seq;
+
+	*seq = atomic_read(&dd->insert_seq);
+
+	if (spin_trylock(&dd->lock))
+		return false;
+	if (!test_bit(DD_INSERTING, &dd->run_state)) {
+		spin_lock(&dd->lock);
+		return false;
+	}
+
+	*seq = atomic_inc_return(&dd->insert_seq);
+
+	bucket = &dd->bucket_lists[get_cpu() & DD_CPU_BUCKETS_MASK];
+	spin_lock(&bucket->lock);
+	list_splice_init(list, &bucket->list);
+	spin_unlock(&bucket->lock);
+	put_cpu();
+
+	/*
+	 * If seq still matches, we should be safe to just exit with the
+	 * pending requests queued in a bucket.
+	 */
+	next_seq = atomic_inc_return(&dd->insert_seq);
+	if (next_seq == *seq + 1)
+		return true;
+
+	/*
+	 * Seq changed, be safe and grab the lock and insert. Don't update
+	 * the sequence, so that we flush the buckets too.
+	 */
+	spin_lock(&dd->lock);
+	return false;
+}
+
 /*
  * Called from blk_mq_insert_request() or blk_mq_dispatch_plug_list().
  */
@@ -886,16 +965,39 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
 	struct request_queue *q = hctx->queue;
 	struct deadline_data *dd = q->elevator->elevator_data;
 	LIST_HEAD(free);
+	int seq;
 
-	spin_lock(&dd->lock);
-	while (!list_empty(list)) {
-		struct request *rq;
+	/*
+	 * If dispatch is busy and we ended up adding to our internal bucket,
+	 * then we're done for now.
+	 */
+	if (dd_insert_to_bucket(dd, list, &seq))
+		return;
+
+	set_bit(DD_INSERTING, &dd->run_state);
+	do {
+		int next_seq;
+
+		while (!list_empty(list)) {
+			struct request *rq;
+
+			rq = list_first_entry(list, struct request, queuelist);
+			list_del_init(&rq->queuelist);
+			dd_insert_request(hctx, rq, flags, &free);
+		}
+
+		/*
+		 * If the sequence changed, flush the internal buckets.
+		 */
+		next_seq = atomic_inc_return(&dd->insert_seq);
+		if (next_seq == seq + 1)
+			break;
+		seq = next_seq;
+		dd_dispatch_from_buckets(dd, list);
+	} while (1);
 
-		rq = list_first_entry(list, struct request, queuelist);
-		list_del_init(&rq->queuelist);
-		dd_insert_request(hctx, rq, flags, &free);
-	}
 	spin_unlock(&dd->lock);
+	clear_bit(DD_INSERTING, &dd->run_state);
 
 	blk_mq_free_requests(&free);
 }
If we attempt to insert a list of requests but someone else is already
running an insertion, then fall back to queueing it internally and let
the existing inserter finish the operation.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/mq-deadline.c | 118 +++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 110 insertions(+), 8 deletions(-)
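For readers trying to follow the locking in the diff above, the heart of
the patch is the handshake between a contended inserter and the task
that currently holds dd->lock with DD_INSERTING set. The sketch below is
condensed from the patch itself, with list handling and error paths
trimmed; the helper name bucket_insert and the comments are illustrative
readings of the RFC, not text or code from it.

/* Contended side, dd->lock not held (condensed from dd_insert_to_bucket()). */
static bool bucket_insert(struct deadline_data *dd, struct list_head *list)
{
	int seq, next_seq;

	if (spin_trylock(&dd->lock))
		return false;			/* uncontended: insert as usual */
	if (!test_bit(DD_INSERTING, &dd->run_state)) {
		spin_lock(&dd->lock);		/* holder is dispatching, just wait */
		return false;
	}

	seq = atomic_inc_return(&dd->insert_seq);	/* "a bucket insert started" */
	/* splice *list onto this CPU's bucket under the bucket lock */
	next_seq = atomic_inc_return(&dd->insert_seq);	/* "the bucket insert finished" */

	if (next_seq == seq + 1)
		return true;	/* nothing slipped in between; the active inserter
				 * is expected to see the bump and flush the buckets */

	spin_lock(&dd->lock);	/* it may already have finalized: insert ourselves */
	return false;
}

/* Active side, dd->lock held with DD_INSERTING set (condensed from
 * dd_insert_requests()). */
	do {
		/* insert every request currently on the local list */
		next_seq = atomic_inc_return(&dd->insert_seq);
		if (next_seq == seq + 1)
			break;		/* no bucket activity since we last looked */
		seq = next_seq;
		dd_dispatch_from_buckets(dd, list);	/* pull the buckets, go around */
	} while (1);

The idea appears to be that the two increments on the contended side
bracket the bucket splice: if any other increment lands inside that
window, the contended inserter cannot assume the active one will still
flush the buckets before clearing DD_INSERTING, so it falls back to
taking dd->lock and inserting the requests itself.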