
queue stall with blk-mq-sched

Message ID 1663de5d-cdf7-a6ed-7539-c7d1f5e98f6c@fb.com (mailing list archive)
State New, archived

Commit Message

Jens Axboe Jan. 24, 2017, 7:55 p.m. UTC
On 01/24/2017 11:49 AM, Hannes Reinecke wrote:
> On 01/24/2017 05:09 PM, Jens Axboe wrote:
>> On 01/24/2017 08:54 AM, Hannes Reinecke wrote:
>>> Hi Jens,
>>>
>>> I'm trying to debug a queue stall with your blk-mq-sched branch; with my
>>> latest mpt3sas patches fio stalls almost immediately after starting a
>>> sequential read :-(
>>>
>>> I've debugged things and came up with the attached patch; we need to
>>> restart waiters with blk_mq_tag_idle() after completing a tag.
>>> We're already calling blk_mq_tag_busy() when fetching a tag, so I think
>>> calling blk_mq_tag_idle() is required when retiring a tag.
>>
>> The patch isn't correct; the whole point of the un-idling is that it
>> ISN'T happening for every request completion. Otherwise you throw
>> away scalability. So a queue will go into active mode on the first
>> request, and idle when it's been idle for a bit. The active count
>> is used to divide up the tags.
>>
>> So I'm assuming we're missing a queue run somewhere when we fail
>> getting a driver tag. The latter should only happen if the target
>> has IO in flight already, and the restart marking should take care
>> of it. Obviously there's a case where that is not true, since you
>> are seeing stalls.
>>
> But what is the point in the 'blk_mq_tag_busy()' thingie then?
> When will it be reset?
> The only instances I've seen are it being reset during
> resize and teardown ... hence my patch.

The point is to have some count of how many queues are busy "lately",
which helps in dividing up the tags fairly. Hence we bump it as soon as
the queue goes active, and drop it after some delay. That's working as
expected.
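The accounting described above can be restated as a small toy model (plain C; the struct and function names here are made up for illustration and are not the kernel API): a queue is counted once when it first goes busy, dropped only after an idle delay rather than on every completion, and each active queue then gets roughly an equal slice of the shared tag depth, in the spirit of what hctx_may_queue() does in the real code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model (NOT the kernel implementation) of blk-mq's active-queue
 * accounting. All names below are illustrative only. */

struct tag_set_model {
	unsigned int total_tags;    /* depth of the shared driver tag set */
	unsigned int active_queues; /* queues counted as busy "lately" */
};

/* Counterpart of blk_mq_tag_busy() in this model: a queue is counted
 * once, on its first request, not on every allocation. */
static void model_tag_busy(struct tag_set_model *set, bool *queue_active)
{
	if (!*queue_active) {
		*queue_active = true;
		set->active_queues++;
	}
}

/* Counterpart of blk_mq_tag_idle(): called only after the queue has
 * been idle for a while, not on every completion -- dropping it per
 * completion is exactly what would throw away scalability. */
static void model_tag_idle(struct tag_set_model *set, bool *queue_active)
{
	if (*queue_active) {
		*queue_active = false;
		set->active_queues--;
	}
}

/* Fair share of tags for one queue: divide the depth among the
 * queues currently counted as active. */
static unsigned int model_fair_depth(const struct tag_set_model *set)
{
	if (set->active_queues == 0)
		return set->total_tags;
	return set->total_tags / set->active_queues;
}
```

With no active queues a single submitter may use the whole depth; once a second queue goes active, each is limited to half, which is why the busy/idle pair must bracket periods of activity rather than individual requests.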

>>> However, even with the attached patch I'm seeing some queue stalls;
>>> looks like they're related to the 'stonewall' statement in fio.
>>
>> I think you are heading down the wrong path. Your patch will cause
>> the symptoms to be a bit different, but you'll still run into cases
>> where we fail giving out the tag and then stall.
>>
> Hehe.
> How did you know that?

My crystal ball :-)

> That's indeed what I'm seeing.
> 
> Oh well, back to the drawing board...

Try this patch. We only want to bump it for the driver tags, not the
scheduler side.

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ee69e5e..c905aa1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -230,15 +230,15 @@  struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data,
 
 		rq = tags->static_rqs[tag];
 
-		if (blk_mq_tag_busy(data->hctx)) {
-			rq->rq_flags = RQF_MQ_INFLIGHT;
-			atomic_inc(&data->hctx->nr_active);
-		}
-
 		if (data->flags & BLK_MQ_REQ_INTERNAL) {
 			rq->tag = -1;
 			rq->internal_tag = tag;
 		} else {
+			if (blk_mq_tag_busy(data->hctx)) {
+				rq->rq_flags = RQF_MQ_INFLIGHT;
+				atomic_inc(&data->hctx->nr_active);
+			}
+
 			rq->tag = tag;
 			rq->internal_tag = -1;
 		}
@@ -870,6 +870,10 @@  static bool blk_mq_get_driver_tag(struct request *rq,
 	rq->tag = blk_mq_get_tag(&data);
 	if (rq->tag >= 0) {
 		data.hctx->tags->rqs[rq->tag] = rq;
+		if (blk_mq_tag_busy(data.hctx)) {
+			rq->rq_flags |= RQF_MQ_INFLIGHT;
+			atomic_inc(&data.hctx->nr_active);
+		}
 		goto done;
 	}
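The behavior change in the patch above can be sketched as a standalone toy model (plain C; the struct and function names are invented for illustration, not kernel API): a request that receives only a scheduler-internal tag must leave the active count alone, and the count is bumped only once a real driver tag is assigned, either at allocation time or later at dispatch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model (NOT kernel code) of the fix: nr_active must track
 * driver tags only, never scheduler-internal tags. */

struct hctx_model {
	int nr_active; /* requests currently holding a *driver* tag */
};

struct rq_model {
	int tag;          /* driver tag, -1 if none */
	int internal_tag; /* scheduler tag, -1 if none */
	bool inflight;    /* stands in for RQF_MQ_INFLIGHT */
};

/* Allocation path: with an internal (scheduler) tag, do not touch the
 * active count; the bump is deferred to driver-tag assignment. */
static void model_alloc(struct hctx_model *hctx, struct rq_model *rq,
			int tag, bool internal)
{
	if (internal) {
		rq->tag = -1;
		rq->internal_tag = tag;
	} else {
		hctx->nr_active++;
		rq->inflight = true;
		rq->tag = tag;
		rq->internal_tag = -1;
	}
}

/* Dispatch path: a scheduler-allocated request acquires its driver
 * tag here, and only now is it counted as active. */
static void model_get_driver_tag(struct hctx_model *hctx,
				 struct rq_model *rq, int tag)
{
	rq->tag = tag;
	hctx->nr_active++;
	rq->inflight = true;
}
```

In the pre-patch code the bump happened unconditionally in __blk_mq_alloc_request(), so scheduler-tagged requests inflated the active count without ever holding a driver tag, skewing the fair division of tags that the count exists to provide.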