[v4,6/6] dm rq: Avoid that request processing stalls sporadically

Message ID 20170407181654.27836-7-bart.vanassche@sandisk.com (mailing list archive)
State New, archived

Commit Message

Bart Van Assche April 7, 2017, 6:16 p.m. UTC
While running the srp-test software I noticed that request
processing stalls sporadically at the beginning of a test, namely
when mkfs is run against a dm-mpath device. Every time that happened,
the following command was sufficient to resume request
processing:

    echo run >/sys/kernel/debug/block/dm-0/state

This patch avoids such request processing stalls. The test I ran
is as follows:

    while srp-test/run_tests -d -r 30 -t 02-mq; do :; done

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
---
 drivers/md/dm-rq.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Mike Snitzer April 11, 2017, 4:09 p.m. UTC | #1
On Fri, Apr 07 2017 at  2:16pm -0400,
Bart Van Assche <bart.vanassche@sandisk.com> wrote:

> While running the srp-test software I noticed that request
> processing stalls sporadically at the beginning of a test, namely
> when mkfs is run against a dm-mpath device. Every time that happened,
> the following command was sufficient to resume request
> processing:
> 
>     echo run >/sys/kernel/debug/block/dm-0/state
> 
> This patch avoids such request processing stalls. The test I ran
> is as follows:
> 
>     while srp-test/run_tests -d -r 30 -t 02-mq; do :; done
> 
> Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
> Cc: Mike Snitzer <snitzer@redhat.com>
> Cc: dm-devel@redhat.com
> ---
>  drivers/md/dm-rq.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> index 6886bf160fb2..d19af1d21f4c 100644
> --- a/drivers/md/dm-rq.c
> +++ b/drivers/md/dm-rq.c
> @@ -755,6 +755,7 @@ static int dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
>  		/* Undo dm_start_request() before requeuing */
>  		rq_end_stats(md, rq);
>  		rq_completed(md, rq_data_dir(rq), false);
> +		blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
>  		return BLK_MQ_RQ_QUEUE_BUSY;
>  	}
>  
> -- 
> 2.12.0
> 

I really appreciate your hard work, Bart, but this looks like a cheap
hack.

I'm clearly too late to stop this from going in (given Jens got it
merged for -rc6) but: this has no place in dm-mq (or any blk-mq
driver).  If it is needed it should be elevated to blk-mq core to
trigger blk_mq_delay_run_hw_queue() when BLK_MQ_RQ_QUEUE_BUSY is
returned from blk_mq_ops' .queue_rq.
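
In sketch form, that elevation could look like this in the blk-mq core
dispatch path (hypothetical, not an actual patch):

	ret = q->mq_ops->queue_rq(hctx, &bd);
	if (ret == BLK_MQ_RQ_QUEUE_BUSY)
		blk_mq_delay_run_hw_queue(hctx, 100 /*ms*/);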

If this dm-mq specific commit is justified, the case certainly isn't
spelled out in the commit header.
Bart Van Assche April 11, 2017, 4:26 p.m. UTC | #2
On Tue, 2017-04-11 at 12:09 -0400, Mike Snitzer wrote:
> This has no place in dm-mq (or any blk-mq
> driver).  If it is needed it should be elevated to blk-mq core to
> trigger blk_mq_delay_run_hw_queue() when BLK_MQ_RQ_QUEUE_BUSY is
> returned from blk_mq_ops' .queue_rq.

Hello Mike,

If the blk-mq core had to figure out whether or not a queue is no
longer busy without any cooperation from the blk-mq driver, all the
blk-mq core could do is attempt to rerun that queue from time to time.
But at which intervals should the blk-mq core check whether or not a
queue is still busy? Would it be possible to choose intervals at which
to check the queue state that work well for all block drivers? Consider
e.g. the dm-mpath driver: multipath_busy() returns true as long as path
initialization is in progress, and path initialization can take a long
time. The (indirect) call to blk_mq_run_queue() from pg_init_done()
resumes request processing immediately after path initialization has
finished. Sorry, but I don't think it is possible to invent an
algorithm for the blk-mq core that both guarantees that a queue is
rerun as soon as it is no longer busy and avoids wasting plenty of CPU
cycles on checking whether a queue is still busy.
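
For illustration, a minimal sketch of that contract (path_init_in_progress()
is a hypothetical stand-in for the state that multipath_busy() checks; this
is not the actual dm-mpath code):

	static int example_queue_rq(struct blk_mq_hw_ctx *hctx,
				    const struct blk_mq_queue_data *bd)
	{
		if (path_init_in_progress())	/* hypothetical helper */
			return BLK_MQ_RQ_QUEUE_BUSY;
		/* ... map and dispatch the request ... */
		return BLK_MQ_RQ_QUEUE_OK;
	}

	/* Called once path initialization has finished, cf. pg_init_done(). */
	static void example_pg_init_done(struct request_queue *q)
	{
		blk_mq_run_hw_queues(q, true);	/* resume request processing */
	}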

Bart.
Mike Snitzer April 11, 2017, 5:47 p.m. UTC | #3
On Tue, Apr 11 2017 at 12:26pm -0400,
Bart Van Assche <Bart.VanAssche@sandisk.com> wrote:

> On Tue, 2017-04-11 at 12:09 -0400, Mike Snitzer wrote:
> > This has no place in dm-mq (or any blk-mq
> > driver).  If it is needed it should be elevated to blk-mq core to
> > trigger blk_mq_delay_run_hw_queue() when BLK_MQ_RQ_QUEUE_BUSY is
> > returned from blk_mq_ops' .queue_rq.
> 
> Hello Mike,
> 
> If the blk-mq core had to figure out whether or not a queue is no
> longer busy without any cooperation from the blk-mq driver, all the
> blk-mq core could do is attempt to rerun that queue from time to time.
> But at which intervals should the blk-mq core check whether or not a
> queue is still busy? Would it be possible to choose intervals at which
> to check the queue state that work well for all block drivers? Consider
> e.g. the dm-mpath driver: multipath_busy() returns true as long as path
> initialization is in progress, and path initialization can take a long
> time. The (indirect) call to blk_mq_run_queue() from pg_init_done()
> resumes request processing immediately after path initialization has
> finished. Sorry, but I don't think it is possible to invent an
> algorithm for the blk-mq core that both guarantees that a queue is
> rerun as soon as it is no longer busy and avoids wasting plenty of CPU
> cycles on checking whether a queue is still busy.

Sorry but that isn't a very strong argument for what you've done.

I mean I do appreciate your point that the 2 BLK_MQ_RQ_QUEUE_BUSY
returns in dm_mq_queue_rq() are not equal but that could easily be
conveyed using a new return value.

Anyway, point is, no blk-mq driver should need to have concern about
whether their request will get resubmitted (and the associated hw queue
re-ran) if they return BLK_MQ_RQ_QUEUE_BUSY.

Your change is a means to an end but it just solves the problem in a
very hackish way.  Other drivers will very likely be caught out by
this blk-mq quirk in the future.
Bart Van Assche April 11, 2017, 5:51 p.m. UTC | #4
On Tue, 2017-04-11 at 13:47 -0400, Mike Snitzer wrote:
> Other drivers will very likely be caught out by
> this blk-mq quirk in the future.

Hello Mike,

Are you aware that the requirement that blk-mq drivers rerun the queue
after having returned BLK_MQ_RQ_QUEUE_BUSY is shared with traditional
block drivers? From dm_old_request_fn():

	if (... || (ti->type->busy && ti->type->busy(ti))) {
		blk_delay_queue(q, 10);
		return;
	}

Bart.
Mike Snitzer April 11, 2017, 6:03 p.m. UTC | #5
On Tue, Apr 11 2017 at  1:51pm -0400,
Bart Van Assche <Bart.VanAssche@sandisk.com> wrote:

> On Tue, 2017-04-11 at 13:47 -0400, Mike Snitzer wrote:
> > Other drivers will very likely be caught out by
> > this blk-mq quirk in the future.
> 
> Hello Mike,
> 
> Are you aware that the requirement that blk-mq drivers rerun the queue
> after having returned BLK_MQ_RQ_QUEUE_BUSY is shared with traditional
> block drivers? From dm_old_request_fn():
> 
> 	if (... || (ti->type->busy && ti->type->busy(ti))) {
> 		blk_delay_queue(q, 10);
> 		return;
> 	}

No, and pointing to DM code that does something with the old .request_fn
case to justify why blk-mq requires the same is pretty specious.

Rather than working so hard to use DM code against me, your argument
should be: "blk-mq drivers X, Y and Z rerun the hw queue; this is a well
established pattern"

I see drivers/nvme/host/fc.c:nvme_fc_start_fcp_op() does.  But that is
only one other driver out of ~20 BLK_MQ_RQ_QUEUE_BUSY returns
tree-wide.

Could be there are some others, but hardly a well-established pattern.
Bart Van Assche April 11, 2017, 6:18 p.m. UTC | #6
On Tue, 2017-04-11 at 14:03 -0400, Mike Snitzer wrote:
> Rather than working so hard to use DM code against me, your argument
> should be: "blk-mq drivers X, Y and Z rerun the hw queue; this is a well
> established pattern"
> 
> I see drivers/nvme/host/fc.c:nvme_fc_start_fcp_op() does.  But that is
> only one other driver out of ~20 BLK_MQ_RQ_QUEUE_BUSY returns
> tree-wide.
> 
> Could be there are some others, but hardly a well-established pattern.

Hello Mike,

Several blk-mq drivers that can return BLK_MQ_RQ_QUEUE_BUSY from their
.queue_rq() implementation stop the request queue (blk_mq_stop_hw_queue())
before returning "busy" and restart the queue after the busy condition has
been cleared (blk_mq_start_stopped_hw_queues()). Examples are virtio_blk and
xen-blkfront. However, this approach is not appropriate for the dm-mq core
nor for the scsi core since both drivers already use the "stopped" state for
another purpose than tracking whether or not a hardware queue is busy. Hence
the blk_mq_delay_run_hw_queue() and blk_mq_run_hw_queue() calls in these last
two drivers to rerun a hardware queue after the busy state has been cleared.
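
In sketch form that pattern looks as follows (a paraphrase of the pattern,
not of any one driver; out_of_resources() is a hypothetical predicate):

	static int example_queue_rq(struct blk_mq_hw_ctx *hctx,
				    const struct blk_mq_queue_data *bd)
	{
		if (out_of_resources()) {	/* hypothetical predicate */
			/* Dispatch side: stop the queue before reporting "busy". */
			blk_mq_stop_hw_queue(hctx);
			return BLK_MQ_RQ_QUEUE_BUSY;
		}
		/* ... issue the request to the hardware ... */
		return BLK_MQ_RQ_QUEUE_OK;
	}

	/* Completion side: restart stopped queues once resources are available. */
	static void example_complete(struct request_queue *q)
	{
		blk_mq_start_stopped_hw_queues(q, true);
	}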

Bart.
Ming Lei April 12, 2017, 3:42 a.m. UTC | #7
On Tue, Apr 11, 2017 at 06:18:36PM +0000, Bart Van Assche wrote:
> On Tue, 2017-04-11 at 14:03 -0400, Mike Snitzer wrote:
> > Rather than working so hard to use DM code against me, your argument
> > should be: "blk-mq drivers X, Y and Z rerun the hw queue; this is a well
> > established pattern"
> > 
> > I see drivers/nvme/host/fc.c:nvme_fc_start_fcp_op() does.  But that is
> > only one other driver out of ~20 BLK_MQ_RQ_QUEUE_BUSY returns
> > tree-wide.
> > 
> > Could be there are some others, but hardly a well-established pattern.
> 
> Hello Mike,
> 
> Several blk-mq drivers that can return BLK_MQ_RQ_QUEUE_BUSY from their
> .queue_rq() implementation stop the request queue (blk_mq_stop_hw_queue())
> before returning "busy" and restart the queue after the busy condition has
> been cleared (blk_mq_start_stopped_hw_queues()). Examples are virtio_blk and
> xen-blkfront. However, this approach is not appropriate for the dm-mq core
> nor for the scsi core since both drivers already use the "stopped" state for
> another purpose than tracking whether or not a hardware queue is busy. Hence
> the blk_mq_delay_run_hw_queue() and blk_mq_run_hw_queue() calls in these last
> two drivers to rerun a hardware queue after the busy state has been cleared.

But it looks like this patch just reruns the hw queue after 100ms, which
isn't necessarily after the busy state has been cleared, right?

Actually if BLK_MQ_RQ_QUEUE_BUSY is returned from .queue_rq(), blk-mq
will buffer this request into hctx->dispatch and run the hw queue again,
so it looks like blk_mq_delay_run_hw_queue() shouldn't be needed in this
situation, at least at first glance. Or maybe Bart has more background on
this usage; it would be better to document it in a comment?

Thanks,
Ming
Bart Van Assche April 12, 2017, 6:38 p.m. UTC | #8
On Wed, 2017-04-12 at 11:42 +0800, Ming Lei wrote:
> On Tue, Apr 11, 2017 at 06:18:36PM +0000, Bart Van Assche wrote:
> > On Tue, 2017-04-11 at 14:03 -0400, Mike Snitzer wrote:
> > > Rather than working so hard to use DM code against me, your argument
> > > should be: "blk-mq drivers X, Y and Z rerun the hw queue; this is a well
> > > established pattern"
> > > 
> > > I see drivers/nvme/host/fc.c:nvme_fc_start_fcp_op() does.  But that is
> > > only one other driver out of ~20 BLK_MQ_RQ_QUEUE_BUSY returns
> > > tree-wide.
> > > 
> > > Could be there are some others, but hardly a well-established pattern.
> > 
> > Hello Mike,
> > 
> > Several blk-mq drivers that can return BLK_MQ_RQ_QUEUE_BUSY from their
> > .queue_rq() implementation stop the request queue (blk_mq_stop_hw_queue())
> > before returning "busy" and restart the queue after the busy condition has
> > been cleared (blk_mq_start_stopped_hw_queues()). Examples are virtio_blk and
> > xen-blkfront. However, this approach is not appropriate for the dm-mq core
> > nor for the scsi core since both drivers already use the "stopped" state for
> > another purpose than tracking whether or not a hardware queue is busy. Hence
> > the blk_mq_delay_run_hw_queue() and blk_mq_run_hw_queue() calls in these last
> > two drivers to rerun a hardware queue after the busy state has been cleared.
> 
> But it looks like this patch just reruns the hw queue after 100ms, which
> isn't necessarily after the busy state has been cleared, right?

Hello Ming,

That patch can be considered a first step that can be refined further,
namely by modifying the dm-rq code such that dm-rq queues are only rerun
after the busy condition has been cleared. The patch at the start of this
thread is easier to review and easier to test than any patch that would
only rerun dm-rq queues after the busy condition has been cleared.

> Actually if BLK_MQ_RQ_QUEUE_BUSY is returned from .queue_rq(), blk-mq
> will buffer this request into hctx->dispatch and run the hw queue again,
> so it looks like blk_mq_delay_run_hw_queue() shouldn't be needed in this
> situation, at least at first glance.

If the blk-mq core always reran a hardware queue when a block driver
returns BLK_MQ_RQ_QUEUE_BUSY, that would keep a single CPU core 100%
busy polling the hardware queue until the "busy" condition has been
cleared. One can easily see that that's not what the blk-mq core does.
From blk_mq_sched_dispatch_requests():

	if (!list_empty(&rq_list)) {
		blk_mq_sched_mark_restart_hctx(hctx);
		did_work = blk_mq_dispatch_rq_list(q, &rq_list);
	}

From the end of blk_mq_dispatch_rq_list():

	if (!list_empty(list)) {
		[ ... ]
		if (!blk_mq_sched_needs_restart(hctx) &&
		    !test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state))
			blk_mq_run_hw_queue(hctx, true);
	}

In other words, the BLK_MQ_S_SCHED_RESTART flag is set before the dispatch
list is examined, and the dispatch list is only rerun after a block driver
returned BLK_MQ_RQ_QUEUE_BUSY if that flag got cleared by a concurrent
blk_mq_sched_restart_hctx() call while blk_mq_dispatch_rq_list() was in
progress.
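
For reference, the restart side looks roughly as follows (paraphrased from
blk-mq, not a literal copy):

	static void blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
	{
		if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
			clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
			blk_mq_run_hw_queue(hctx, true);
		}
	}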

Bart.
Ming Lei April 13, 2017, 2:20 a.m. UTC | #9
On Wed, Apr 12, 2017 at 06:38:07PM +0000, Bart Van Assche wrote:
> On Wed, 2017-04-12 at 11:42 +0800, Ming Lei wrote:
> > On Tue, Apr 11, 2017 at 06:18:36PM +0000, Bart Van Assche wrote:
> > > On Tue, 2017-04-11 at 14:03 -0400, Mike Snitzer wrote:
> > > > Rather than working so hard to use DM code against me, your argument
> > > > should be: "blk-mq drivers X, Y and Z rerun the hw queue; this is a well
> > > > established pattern"
> > > > 
> > > > I see drivers/nvme/host/fc.c:nvme_fc_start_fcp_op() does.  But that is
> > > > only one other driver out of ~20 BLK_MQ_RQ_QUEUE_BUSY returns
> > > > tree-wide.
> > > > 
> > > > Could be there are some others, but hardly a well-established pattern.
> > > 
> > > Hello Mike,
> > > 
> > > Several blk-mq drivers that can return BLK_MQ_RQ_QUEUE_BUSY from their
> > > .queue_rq() implementation stop the request queue (blk_mq_stop_hw_queue())
> > > before returning "busy" and restart the queue after the busy condition has
> > > been cleared (blk_mq_start_stopped_hw_queues()). Examples are virtio_blk and
> > > xen-blkfront. However, this approach is not appropriate for the dm-mq core
> > > nor for the scsi core since both drivers already use the "stopped" state for
> > > another purpose than tracking whether or not a hardware queue is busy. Hence
> > > the blk_mq_delay_run_hw_queue() and blk_mq_run_hw_queue() calls in these last
> > > two drivers to rerun a hardware queue after the busy state has been cleared.
> > 
> > But it looks like this patch just reruns the hw queue after 100ms, which
> > isn't necessarily after the busy state has been cleared, right?
> 
> Hello Ming,
> 
> That patch can be considered a first step that can be refined further,
> namely by modifying the dm-rq code such that dm-rq queues are only rerun
> after the busy condition has been cleared. The patch at the start of this
> thread is easier to review and easier to test than any patch that would
> only rerun dm-rq queues after the busy condition has been cleared.

OK, got it. It would have been better to add a comment about this change,
since rerunning the queue after 100ms is actually a workaround instead of
a final solution.

> 
> > Actually if BLK_MQ_RQ_QUEUE_BUSY is returned from .queue_rq(), blk-mq
> > will buffer this request into hctx->dispatch and run the hw queue again,
> > so it looks like blk_mq_delay_run_hw_queue() shouldn't be needed in this
> > situation, at least at first glance.
> 
> If the blk-mq core always reran a hardware queue when a block driver
> returns BLK_MQ_RQ_QUEUE_BUSY, that would keep a single CPU core 100%

It won't cause 100% CPU utilization since we restart the queue in the
completion path, and at that time at least one tag is available, so
progress can be made.

> busy polling the hardware queue until the "busy" condition has been
> cleared. One can easily see that that's not what the blk-mq core does.
> From blk_mq_sched_dispatch_requests():
> 
> 	if (!list_empty(&rq_list)) {
> 		blk_mq_sched_mark_restart_hctx(hctx);
> 		did_work = blk_mq_dispatch_rq_list(q, &rq_list);
> 	}
> 
> From the end of blk_mq_dispatch_rq_list():
> 
> 	if (!list_empty(list)) {
> 		[ ... ]
> 		if (!blk_mq_sched_needs_restart(hctx) &&
> 		    !test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state))
> 			blk_mq_run_hw_queue(hctx, true);
> 	}

That is exactly what I meant: blk-mq already provides this mechanism
to rerun the queue automatically in case of BLK_MQ_RQ_QUEUE_BUSY. If the
mechanism doesn't work well, we need to fix that; so why should drivers
have to work around it?

> 
> In other words, the BLK_MQ_S_SCHED_RESTART flag is set before the dispatch
> list is examined, and the dispatch list is only rerun after a block driver
> returned BLK_MQ_RQ_QUEUE_BUSY if that flag got cleared by a concurrent
> blk_mq_sched_restart_hctx() call while blk_mq_dispatch_rq_list() was in
> progress.

Yes, the queue is rerun either in the completion path when
BLK_MQ_S_SCHED_RESTART is set, or right after .queue_rq() returns _BUSY
if the flag has been cleared concurrently from the completion path.

So in theory we can make sure the queue will be run again if _BUSY
happens; then what is the root cause that forces us to add
blk_mq_delay_run_hw_queue(hctx, 100) in dm's .queue_rq()?

Thanks,
Ming
Bart Van Assche April 13, 2017, 4:59 p.m. UTC | #10
On 04/12/17 19:20, Ming Lei wrote:
> On Wed, Apr 12, 2017 at 06:38:07PM +0000, Bart Van Assche wrote:
>> If the blk-mq core always reran a hardware queue when a block driver
>> returns BLK_MQ_RQ_QUEUE_BUSY, that would keep a single CPU core 100%
> 
> It won't cause 100% CPU utilization since we restart the queue in the
> completion path, and at that time at least one tag is available, so
> progress can be made.

Hello Ming,

Sorry but you are wrong. If .queue_rq() returns BLK_MQ_RQ_QUEUE_BUSY
then it's likely that calling .queue_rq() again after only a few
microseconds will cause it to return BLK_MQ_RQ_QUEUE_BUSY again. If you
don't believe me, change "if (!blk_mq_sched_needs_restart(hctx) &&
!test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state)) blk_mq_run_hw_queue(hctx,
true);" into "blk_mq_run_hw_queue(hctx, true);", trigger a busy
condition for either a SCSI LLD or a dm-rq driver, run top and you will
see that the CPU usage of a kworker thread jumps up to 100%.
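
In diff form, the experiment described above looks like this (sketch):

	-	if (!blk_mq_sched_needs_restart(hctx) &&
	-	    !test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state))
	-		blk_mq_run_hw_queue(hctx, true);
	+	blk_mq_run_hw_queue(hctx, true);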

Bart.
Ming Lei April 14, 2017, 1:13 a.m. UTC | #11
On Thu, Apr 13, 2017 at 09:59:57AM -0700, Bart Van Assche wrote:
> On 04/12/17 19:20, Ming Lei wrote:
> > On Wed, Apr 12, 2017 at 06:38:07PM +0000, Bart Van Assche wrote:
> >> If the blk-mq core always reran a hardware queue when a block driver
> >> returns BLK_MQ_RQ_QUEUE_BUSY, that would keep a single CPU core 100%
> > 
> > It won't cause 100% CPU utilization since we restart the queue in the
> > completion path, and at that time at least one tag is available, so
> > progress can be made.
> 
> Hello Ming,
> 
> Sorry but you are wrong. If .queue_rq() returns BLK_MQ_RQ_QUEUE_BUSY
> then it's likely that calling .queue_rq() again after only a few
> microseconds will cause it to return BLK_MQ_RQ_QUEUE_BUSY again. If you
> don't believe me, change "if (!blk_mq_sched_needs_restart(hctx) &&
> !test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state)) blk_mq_run_hw_queue(hctx,
> true);" into "blk_mq_run_hw_queue(hctx, true);", trigger a busy

Yes, that can be true, but I mean it is still OK to run the queue again
with

	if (!blk_mq_sched_needs_restart(hctx) &&
	    !test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state))
			blk_mq_run_hw_queue(hctx, true);

and to restart the queue in __blk_mq_finish_request() when
BLK_MQ_RQ_QUEUE_BUSY is returned from .queue_rq(). Both are in the current
blk-mq implementation.

Then why do we need blk_mq_delay_run_hw_queue(hctx, 100/*ms*/) in dm?

Thanks,
Ming
Bart Van Assche April 14, 2017, 5:12 p.m. UTC | #12
On Fri, 2017-04-14 at 09:13 +0800, Ming Lei wrote:
> On Thu, Apr 13, 2017 at 09:59:57AM -0700, Bart Van Assche wrote:
> > On 04/12/17 19:20, Ming Lei wrote:
> > > On Wed, Apr 12, 2017 at 06:38:07PM +0000, Bart Van Assche wrote:
> > > > If the blk-mq core always reran a hardware queue when a block driver
> > > > returns BLK_MQ_RQ_QUEUE_BUSY, that would keep a single CPU core 100%
> > > 
> > > It won't cause 100% CPU utilization since we restart the queue in the
> > > completion path, and at that time at least one tag is available, so
> > > progress can be made.
> > 
> > Hello Ming,
> > 
> > Sorry but you are wrong. If .queue_rq() returns BLK_MQ_RQ_QUEUE_BUSY
> > then it's likely that calling .queue_rq() again after only a few
> > microseconds will cause it to return BLK_MQ_RQ_QUEUE_BUSY again. If you
> > don't believe me, change "if (!blk_mq_sched_needs_restart(hctx) &&
> > !test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state)) blk_mq_run_hw_queue(hctx,
> > true);" into "blk_mq_run_hw_queue(hctx, true);", trigger a busy
> 
> Yes, that can be true, but I mean it is still OK to run the queue again
> with
> 
> 	if (!blk_mq_sched_needs_restart(hctx) &&
> 	    !test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state))
> 			blk_mq_run_hw_queue(hctx, true);
> 
> and to restart the queue in __blk_mq_finish_request() when
> BLK_MQ_RQ_QUEUE_BUSY is returned from .queue_rq(). Both are in the current
> blk-mq implementation.
> 
> Then why do we need blk_mq_delay_run_hw_queue(hctx, 100/*ms*/) in dm?

Because if dm_mq_queue_rq() returns BLK_MQ_RQ_QUEUE_BUSY then there is no
guarantee that __blk_mq_finish_request() will be called later on for the
same queue. dm_mq_queue_rq() can e.g. return BLK_MQ_RQ_QUEUE_BUSY while no
dm requests are in progress because the SCSI error handler is active for
all underlying paths. See also scsi_lld_busy() and scsi_host_in_recovery().
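
In sketch form, the failure mode described above (all_paths_blocked() is a
hypothetical stand-in for the multipath_busy() / scsi_host_in_recovery()
condition):

	if (all_paths_blocked()) {	/* hypothetical predicate */
		/*
		 * No dm request may be in flight here, so no completion
		 * will ever clear BLK_MQ_S_SCHED_RESTART and rerun this
		 * queue. The delayed run below is the only remaining
		 * trigger.
		 */
		blk_mq_delay_run_hw_queue(hctx, 100 /*ms*/);
		return BLK_MQ_RQ_QUEUE_BUSY;
	}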

Bart.
Ming Lei April 16, 2017, 10:21 a.m. UTC | #13
On Fri, Apr 14, 2017 at 05:12:50PM +0000, Bart Van Assche wrote:
> On Fri, 2017-04-14 at 09:13 +0800, Ming Lei wrote:
> > On Thu, Apr 13, 2017 at 09:59:57AM -0700, Bart Van Assche wrote:
> > > On 04/12/17 19:20, Ming Lei wrote:
> > > > On Wed, Apr 12, 2017 at 06:38:07PM +0000, Bart Van Assche wrote:
> > > > > If the blk-mq core always reran a hardware queue when a block driver
> > > > > returns BLK_MQ_RQ_QUEUE_BUSY, that would keep a single CPU core 100%
> > > > 
> > > > It won't cause 100% CPU utilization since we restart the queue in the
> > > > completion path, and at that time at least one tag is available, so
> > > > progress can be made.
> > > 
> > > Hello Ming,
> > > 
> > > Sorry but you are wrong. If .queue_rq() returns BLK_MQ_RQ_QUEUE_BUSY
> > > then it's likely that calling .queue_rq() again after only a few
> > > microseconds will cause it to return BLK_MQ_RQ_QUEUE_BUSY again. If you
> > > don't believe me, change "if (!blk_mq_sched_needs_restart(hctx) &&
> > > !test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state)) blk_mq_run_hw_queue(hctx,
> > > true);" into "blk_mq_run_hw_queue(hctx, true);", trigger a busy
> > 
> > Yes, that can be true, but I mean it is still OK to run the queue again
> > with
> > 
> > 	if (!blk_mq_sched_needs_restart(hctx) &&
> > 	    !test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state))
> > 			blk_mq_run_hw_queue(hctx, true);
> > 
> > and to restart the queue in __blk_mq_finish_request() when
> > BLK_MQ_RQ_QUEUE_BUSY is returned from .queue_rq(). Both are in the current
> > blk-mq implementation.
> > 
> > Then why do we need blk_mq_delay_run_hw_queue(hctx, 100/*ms*/) in dm?
> 
> Because if dm_mq_queue_rq() returns BLK_MQ_RQ_QUEUE_BUSY then there is no
> guarantee that __blk_mq_finish_request() will be called later on for the
> same queue. dm_mq_queue_rq() can e.g. return BLK_MQ_RQ_QUEUE_BUSY while no
> dm requests are in progress because the SCSI error handler is active for
> all underlying paths. See also scsi_lld_busy() and scsi_host_in_recovery().

OK, thanks Bart for the explanation.

Looks like a very interesting BLK_MQ_RQ_QUEUE_BUSY case that isn't caused
by too many pending I/Os; I will study this case more.


Thanks,
Ming

Patch

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 6886bf160fb2..d19af1d21f4c 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -755,6 +755,7 @@ static int dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
 		/* Undo dm_start_request() before requeuing */
 		rq_end_stats(md, rq);
 		rq_completed(md, rq_data_dir(rq), false);
+		blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
 		return BLK_MQ_RQ_QUEUE_BUSY;
 	}