SCSI: run queue if SCSI device queue isn't ready and queue is idle

Message ID 1512490099.2660.6.camel@sandisk.com (mailing list archive)
State Not Applicable

Commit Message

Bart Van Assche Dec. 5, 2017, 4:08 p.m. UTC
On Tue, 2017-12-05 at 15:52 +0800, Ming Lei wrote:
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index db9556662e27..1816dd8259b3 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1967,6 +1967,8 @@ static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
>  out_put_device:
>  	put_device(&sdev->sdev_gendev);
>  out:
> +	if (atomic_read(&sdev->device_busy) == 0 && !scsi_device_blocked(sdev))
> +		blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
>  	return false;
>  }

This cannot work since multiple threads can call scsi_mq_get_budget()
concurrently and hence it can happen that none of them sees
atomic_read(&sdev->device_busy) == 0. BTW, the above patch is something I
consider a workaround. Have you considered fixing the root cause and adding
blk_mq_sched_mark_restart_hctx() calls where they are missing, e.g. in
blk_mq_sched_dispatch_requests()? The patch below is not a full solution
but resulted in a significant improvement in my tests:


Bart.

Comments

Ming Lei Dec. 5, 2017, 4:28 p.m. UTC | #1
On Tue, Dec 05, 2017 at 04:08:20PM +0000, Bart Van Assche wrote:
> On Tue, 2017-12-05 at 15:52 +0800, Ming Lei wrote:
> > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > index db9556662e27..1816dd8259b3 100644
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -1967,6 +1967,8 @@ static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
> >  out_put_device:
> >  	put_device(&sdev->sdev_gendev);
> >  out:
> > +	if (atomic_read(&sdev->device_busy) == 0 && !scsi_device_blocked(sdev))
> > +		blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
> >  	return false;
> >  }
> 
> This cannot work since multiple threads can call scsi_mq_get_budget()

That is exactly how we handled these cases before commit 0df21c86bdbf ("scsi:
implement .get_budget and .put_budget for blk-mq"), so if it can't work,
that is not the fault of commit 0df21c86bdbf.

> concurrently and hence it can happen that none of them sees
> atomic_read(&sdev->device_busy) == 0. BTW, the above patch is something I

If there are concurrent .get_budget() calls, one of them will eventually see
the counter become zero, because sdev->device_busy is incremented and
decremented atomically; otherwise scsi_dev_queue_ready() returns true.
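
To make that counting argument concrete, here is a minimal user-space sketch
(NOT the kernel code: the counter stands in for sdev->device_busy, queue_depth
for sdev->queue_depth, and run_queue_delayed() for blk_mq_delay_run_hw_queue();
builds with any C11 compiler):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_int device_busy;		/* stands in for sdev->device_busy */
static const int queue_depth = 1;	/* stands in for sdev->queue_depth */

static void run_queue_delayed(void)
{
	printf("schedule a delayed queue run\n");
}

/* roughly models scsi_dev_queue_ready(): take a busy slot or back off */
static bool dev_queue_ready(void)
{
	if (atomic_fetch_add(&device_busy, 1) + 1 <= queue_depth)
		return true;
	atomic_fetch_sub(&device_busy, 1);	/* undo our increment */
	return false;
}

/* roughly models the failure path of scsi_mq_get_budget() with the hunk above */
static bool get_budget(void)
{
	if (dev_queue_ready())
		return true;
	/*
	 * Every increment is paired with an atomic decrement, so whichever
	 * failing caller reads the counter after the final decrement sees
	 * zero and schedules the delayed run.
	 */
	if (atomic_load(&device_busy) == 0)
		run_queue_delayed();
	return false;
}

int main(void)
{
	printf("first caller: %d\n", get_budget());	/* 1: gets the only slot */
	printf("second caller: %d\n", get_budget());	/* 0: busy != 0, no rerun needed yet */
	atomic_fetch_sub(&device_busy, 1);		/* the in-flight request completes */
	printf("after completion: %d\n", get_budget());	/* 1: budget available again */
	return 0;
}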

Anyway, we need this patch to avoid a possible regression. If you think
there are bugs in the blk-mq RESTART mechanism, please root-cause and fix them.

> consider a workaround. Have you considered fixing the root cause and adding
> blk_mq_sched_mark_restart_hctx() calls where they are missing, e.g. in
> blk_mq_sched_dispatch_requests()? The patch below is not a full solution
> but resulted in a significant improvement in my tests:
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 69e3226e66ca..9d86876ec503 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -226,6 +226,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
>  		 * TODO: get more budgets, and dequeue more requests in
>  		 * one time.
>  		 */
> +		blk_mq_sched_mark_restart_hctx(hctx);
>  		blk_mq_do_dispatch_ctx(hctx);
>  	} else {
>  		blk_mq_flush_busy_ctxs(hctx, &rq_list);

This is still a workaround for RESTART; see my earlier comment:

	https://marc.info/?l=linux-block&m=151217500929341&w=2
Bart Van Assche Dec. 5, 2017, 4:41 p.m. UTC | #2
On Wed, 2017-12-06 at 00:28 +0800, Ming Lei wrote:
> This is still a workaround for RESTART; see my earlier comment:
> 
> 	https://marc.info/?l=linux-block&m=151217500929341&w=2


A quote from that e-mail: "The theory about using BLK_MQ_S_SCHED_RESTART in
current way is that we mark it after requests are added to hctx->dispatch".
Reading that makes me wonder whether you understand the purpose of the
BLK_MQ_S_SCHED_RESTART flag? That flag is not set after requests are added
to the dispatch list but after requests have been *removed*. The purpose of
that flag is to detect whether another thread has run the queue after
requests were removed from the dispatch list and before they were re-added.
If so, the queue needs to be rerun.
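
Spelled out as a rough, self-contained sketch (user-space C, not the blk-mq
code: the atomic flag models BLK_MQ_S_SCHED_RESTART, rerun_queue() stands in
for blk_mq_run_hw_queue(), and all request/list handling is elided):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_bool sched_restart;	/* models BLK_MQ_S_SCHED_RESTART */

static void rerun_queue(const char *who)
{
	printf("queue rerun triggered by %s\n", who);
}

/* dispatch side: runs while requests are temporarily off hctx->dispatch */
static void dispatch(bool out_of_resources)
{
	/* set the flag right after removing the requests, before dispatching */
	atomic_store(&sched_restart, true);

	if (out_of_resources) {
		/* ... the leftover requests are put back on the list here ... */

		/*
		 * If a completion cleared the flag while the requests were off
		 * the list, its rerun found an empty list and was wasted, so
		 * the dispatcher must rerun the queue itself.
		 */
		if (!atomic_load(&sched_restart))
			rerun_queue("dispatcher (lost wakeup detected)");
	}
}

/* completion side: a finished request may have freed up resources */
static void request_completed(void)
{
	bool expected = true;

	if (atomic_compare_exchange_strong(&sched_restart, &expected, false))
		rerun_queue("completion");
}

int main(void)
{
	dispatch(true);		/* dispatch runs out of resources; flag stays set */
	request_completed();	/* completion sees the flag and reruns the queue */
	return 0;
}

The point is that exactly one side performs the rerun: either a completion
clears the flag and reruns the queue, or the dispatcher notices the flag was
cleared while its requests were off the list and reruns it itself.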

Bart.
Ming Lei Dec. 5, 2017, 4:45 p.m. UTC | #3
On Tue, Dec 05, 2017 at 04:41:46PM +0000, Bart Van Assche wrote:
> On Wed, 2017-12-06 at 00:28 +0800, Ming Lei wrote:
> > This is still a workaround for RESTART; see my earlier comment:
> > 
> > 	https://marc.info/?l=linux-block&m=151217500929341&w=2
> 
> A quote from that e-mail: "The theory about using BLK_MQ_S_SCHED_RESTART in
> current way is that we mark it after requests are added to hctx->dispatch".
> Reading that makes me wonder whether you understand the purpose of the
> BLK_MQ_S_SCHED_RESTART flag? That flag is not set after requests are added
> to the dispatch list but after requests have been *removed*. The purpose of
> that flag is to detect whether another thread has run the queue after
> requests were removed from the dispatch list and before they were re-added.
> If so, the queue needs to be rerun.

If you want to discuss that, please reply in the '[PATCH 4/7] blk-mq:
Avoid that request processing sta' thread; I will reply to you there too.
Ming Lei Dec. 6, 2017, 1:52 a.m. UTC | #4
On Wed, Dec 06, 2017 at 12:28:25AM +0800, Ming Lei wrote:
> On Tue, Dec 05, 2017 at 04:08:20PM +0000, Bart Van Assche wrote:
> > On Tue, 2017-12-05 at 15:52 +0800, Ming Lei wrote:
> > > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > > index db9556662e27..1816dd8259b3 100644
> > > --- a/drivers/scsi/scsi_lib.c
> > > +++ b/drivers/scsi/scsi_lib.c
> > > @@ -1967,6 +1967,8 @@ static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
> > >  out_put_device:
> > >  	put_device(&sdev->sdev_gendev);
> > >  out:
> > > +	if (atomic_read(&sdev->device_busy) == 0 && !scsi_device_blocked(sdev))
> > > +		blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
> > >  	return false;
> > >  }
> > 
> > This cannot work since multiple threads can call scsi_mq_get_budget()
> 
> That is exactly how we handled these cases before commit 0df21c86bdbf ("scsi:
> implement .get_budget and .put_budget for blk-mq"), so if it can't work,
> that is not the fault of commit 0df21c86bdbf.
> 
> > concurrently and hence it can happen that none of them sees
> > atomic_read(&sdev->device_busy) == 0. BTW, the above patch is something I
> 
> If there are concurrent .get_budget() calls, one of them will eventually see
> the counter become zero, because sdev->device_busy is incremented and
> decremented atomically; otherwise scsi_dev_queue_ready() returns true.
> 
> Anyway, we need this patch to avoid a possible regression. If you think
> there are bugs in the blk-mq RESTART mechanism, please root-cause and fix them.
> 
> > consider a workaround. Have you considered fixing the root cause and adding
> > blk_mq_sched_mark_restart_hctx() calls where they are missing, e.g. in
> > blk_mq_sched_dispatch_requests()? The patch below is not a full solution
> > but resulted in a significant improvement in my tests:
> > 
> > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > index 69e3226e66ca..9d86876ec503 100644
> > --- a/block/blk-mq-sched.c
> > +++ b/block/blk-mq-sched.c
> > @@ -226,6 +226,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> >  		 * TODO: get more budgets, and dequeue more requests in
> >  		 * one time.
> >  		 */
> > +		blk_mq_sched_mark_restart_hctx(hctx);
> >  		blk_mq_do_dispatch_ctx(hctx);
> >  	} else {
> >  		blk_mq_flush_busy_ctxs(hctx, &rq_list);
> 

BTW, this kind of change can't cover scsi_set_blocked(), which is
triggered by timeouts and SCSI dispatch failures. You will see that
easily if you run the SCSI test script I provided in the commit log.
Bart Van Assche Dec. 6, 2017, 4:07 p.m. UTC | #5
On Wed, 2017-12-06 at 09:52 +0800, Ming Lei wrote:
> On Wed, Dec 06, 2017 at 12:28:25AM +0800, Ming Lei wrote:
> > On Tue, Dec 05, 2017 at 04:08:20PM +0000, Bart Van Assche wrote:
> > > The patch below is not a full solution but resulted in a significant
> > > improvement in my tests:
> > > 
> > > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > > index 69e3226e66ca..9d86876ec503 100644
> > > --- a/block/blk-mq-sched.c
> > > +++ b/block/blk-mq-sched.c
> > > @@ -226,6 +226,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> > >  		 * TODO: get more budgets, and dequeue more requests in
> > >  		 * one time.
> > >  		 */
> > > +		blk_mq_sched_mark_restart_hctx(hctx);
> > >  		blk_mq_do_dispatch_ctx(hctx);
> > >  	} else {
> > >  		blk_mq_flush_busy_ctxs(hctx, &rq_list);
> 
> BTW, this kind of change can't cover scsi_set_blocked(), which is
> triggered by timeouts and SCSI dispatch failures. You will see that
> easily if you run the SCSI test script I provided in the commit log.

Hello Ming,

I am aware that the above change does not cover all cases. That's why I wrote
in my previous e-mail that that patch is not a full solution. The reason I
posted that change anyway is that I prefer a solution that is not based on
delayed queue runs (blk_mq_delay_run_hw_queue()) over one that is. My concern
is that the performance of a solution based on delayed queue runs will be
suboptimal.

Bart.
Ming Lei Dec. 7, 2017, 1:31 a.m. UTC | #6
On Wed, Dec 06, 2017 at 04:07:17PM +0000, Bart Van Assche wrote:
> On Wed, 2017-12-06 at 09:52 +0800, Ming Lei wrote:
> > On Wed, Dec 06, 2017 at 12:28:25AM +0800, Ming Lei wrote:
> > > On Tue, Dec 05, 2017 at 04:08:20PM +0000, Bart Van Assche wrote:
> > > > The patch below is not a full solution but resulted in a significant
> > > > improvement in my tests:
> > > > 
> > > > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > > > index 69e3226e66ca..9d86876ec503 100644
> > > > --- a/block/blk-mq-sched.c
> > > > +++ b/block/blk-mq-sched.c
> > > > @@ -226,6 +226,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> > > >  		 * TODO: get more budgets, and dequeue more requests in
> > > >  		 * one time.
> > > >  		 */
> > > > +		blk_mq_sched_mark_restart_hctx(hctx);
> > > >  		blk_mq_do_dispatch_ctx(hctx);
> > > >  	} else {
> > > >  		blk_mq_flush_busy_ctxs(hctx, &rq_list);
> > 
> > BTW, this kind of change can't cover scsi_set_blocked(), which is
> > triggered by timeouts and SCSI dispatch failures. You will see that
> > easily if you run the SCSI test script I provided in the commit log.
> 
> Hello Ming,
> 
> I am aware that the above change does not cover all cases. That's why I wrote
> in my previous e-mail that that patch is not a full solution. The reason I
> posted that change anyway is that I prefer a solution that is not based on
> delayed queue runs (blk_mq_delay_run_hw_queue()) over one that is. My concern
> is that the performance of a solution based on delayed queue runs will be
> suboptimal.

Hi,

The patch I posted won't cause any performance regression because it only
triggers when the queue is becoming idle, and that is also exactly how we
dealt with these cases before.

But if you always call blk_mq_sched_mark_restart_hctx() before a new
dispatch, that may affect performance on NVMe, which may never trigger
BLK_STS_RESOURCE.
Bart Van Assche Dec. 7, 2017, 9:06 p.m. UTC | #7
On Wed, 2017-12-06 at 00:28 +0800, Ming Lei wrote:
> On Tue, Dec 05, 2017 at 04:08:20PM +0000, Bart Van Assche wrote:
> > On Tue, 2017-12-05 at 15:52 +0800, Ming Lei wrote:
> > > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > > index db9556662e27..1816dd8259b3 100644
> > > --- a/drivers/scsi/scsi_lib.c
> > > +++ b/drivers/scsi/scsi_lib.c
> > > @@ -1967,6 +1967,8 @@ static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
> > >  out_put_device:
> > >  	put_device(&sdev->sdev_gendev);
> > >  out:
> > > +	if (atomic_read(&sdev->device_busy) == 0 && !scsi_device_blocked(sdev))
> > > +		blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
> > >  	return false;
> > >  }
> > 
> > This cannot work since multiple threads can call scsi_mq_get_budget()
> 
> That is exactly how we handled these cases before commit 0df21c86bdbf ("scsi:
> implement .get_budget and .put_budget for blk-mq"), so if it can't work,
> that is not the fault of commit 0df21c86bdbf.
> 
> > concurrently and hence it can happen that none of them sees
> > atomic_read(&sdev->device_busy) == 0. BTW, the above patch is something I
> 
> If there are concurrent .get_budget() calls, one of them will eventually see
> the counter become zero, because sdev->device_busy is incremented and
> decremented atomically; otherwise scsi_dev_queue_ready() returns true.
> 
> Anyway, we need this patch to avoid a possible regression. If you think
> there are bugs in the blk-mq RESTART mechanism, please root-cause and fix them.


Hello Ming,

When I looked at the patch at the start of this thread for the first time, I
got frustrated because I didn't see how it could fix the queue stall I ran
into myself. Today I started to realize that what Holger reported is probably
a different issue from the one I ran into. Since this patch by itself now
looks useful to me:

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>


BTW, do you think the patch at the start of this thread also fixes the issue
that resulted in commit 826a70a08b12 ("SCSI: don't get target/host busy_count
in scsi_mq_get_budget()")? Do you think we still need that patch?

Bart.
Bart Van Assche Dec. 7, 2017, 9:11 p.m. UTC | #8
On Thu, 2017-12-07 at 09:31 +0800, Ming Lei wrote:
> But if you always call blk_mq_sched_mark_restart_hctx() before a new
> dispatch, that may affect performance on NVMe, which may never trigger
> BLK_STS_RESOURCE.

Hmm ... only the SCSI core implements .get_budget() and .put_budget(), and
I proposed to insert a blk_mq_sched_mark_restart_hctx() call under "if
(q->mq_ops->get_budget)". In other words, I proposed to insert a
blk_mq_sched_mark_restart_hctx() call in a code path that is never triggered
by the NVMe driver. So I don't see how the change I proposed could affect
the performance of the NVMe driver.
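
For reference, the proposed call sits roughly in this branch of
blk_mq_sched_dispatch_requests(). This is an annotated fragment reconstructed
from the hunk quoted earlier in the thread plus the description above, not a
compilable excerpt of the tree:

	} else if (q->mq_ops->get_budget) {
		/*
		 * Only drivers that implement .get_budget() reach this
		 * branch; today that is the SCSI core, not NVMe, so marking
		 * restart here should not influence NVMe dispatching.
		 */
		blk_mq_sched_mark_restart_hctx(hctx);	/* proposed addition */
		blk_mq_do_dispatch_ctx(hctx);
	} else {
		blk_mq_flush_busy_ctxs(hctx, &rq_list);
		/* ... direct dispatch of the collected requests ... */
	}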

Bart.
Ming Lei Dec. 8, 2017, 12:36 a.m. UTC | #9
On Thu, Dec 07, 2017 at 09:11:54PM +0000, Bart Van Assche wrote:
> On Thu, 2017-12-07 at 09:31 +0800, Ming Lei wrote:
> > But if you always call blk_mq_sched_mark_restart_hctx() before a new
> > dispatch, that may affect performance on NVMe, which may never trigger
> > BLK_STS_RESOURCE.
> 
> Hmm ... only the SCSI core implements .get_budget() and .put_budget(), and
> I proposed to insert a blk_mq_sched_mark_restart_hctx() call under "if
> (q->mq_ops->get_budget)". In other words, I proposed to insert a
> blk_mq_sched_mark_restart_hctx() call in a code path that is never triggered
> by the NVMe driver. So I don't see how the change I proposed could affect
> the performance of the NVMe driver.

You only add the check for the 'none' scheduler, right?

But this race isn't related to the scheduler, which means your change can't
fix the race with other schedulers.

I have test cases that trigger this issue with both 'none' and mq-deadline,
and my patch fixes them all.

Thanks,
Ming
Ming Lei Dec. 8, 2017, 12:50 a.m. UTC | #10
On Thu, Dec 07, 2017 at 09:06:58PM +0000, Bart Van Assche wrote:
> On Wed, 2017-12-06 at 00:28 +0800, Ming Lei wrote:
> > On Tue, Dec 05, 2017 at 04:08:20PM +0000, Bart Van Assche wrote:
> > > On Tue, 2017-12-05 at 15:52 +0800, Ming Lei wrote:
> > > > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > > > index db9556662e27..1816dd8259b3 100644
> > > > --- a/drivers/scsi/scsi_lib.c
> > > > +++ b/drivers/scsi/scsi_lib.c
> > > > @@ -1967,6 +1967,8 @@ static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
> > > >  out_put_device:
> > > >  	put_device(&sdev->sdev_gendev);
> > > >  out:
> > > > +	if (atomic_read(&sdev->device_busy) == 0 && !scsi_device_blocked(sdev))
> > > > +		blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
> > > >  	return false;
> > > >  }
> > > 
> > > This cannot work since multiple threads can call scsi_mq_get_budget()
> > 
> > That is exactly how we handled these cases before commit 0df21c86bdbf ("scsi:
> > implement .get_budget and .put_budget for blk-mq"), so if it can't work,
> > that is not the fault of commit 0df21c86bdbf.
> > 
> > > concurrently and hence it can happen that none of them sees
> > > atomic_read(&sdev->device_busy) == 0. BTW, the above patch is something I
> > 
> > If there are concurrent .get_budget() calls, one of them will eventually see
> > the counter become zero, because sdev->device_busy is incremented and
> > decremented atomically; otherwise scsi_dev_queue_ready() returns true.
> > 
> > Anyway, we need this patch to avoid a possible regression. If you think
> > there are bugs in the blk-mq RESTART mechanism, please root-cause and fix them.
> 
> Hello Ming,
> 
> When I looked at the patch at the start of this thread for the first time, I
> got frustrated because I didn't see how it could fix the queue stall I ran
> into myself. Today I started to realize that what Holger reported is probably
> a different issue from the one I ran into. Since this patch by itself now
> looks useful to me:
> 
> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>

Hi Bart,

Thanks for your Review!

> 
> BTW, do you think the patch at the start of this thread also fixes the issue
> that resulted in commit 826a70a08b12 ("SCSI: don't get target/host busy_count
> in scsi_mq_get_budget()")? Do you think we still need that patch?

Yeah, once the patch in this thread is in, the IO hang shouldn't happen any
more, even without 826a70a08b12.

But that commit is still the correct thing to do, since it lets blk-mq's sbitmap
queue respect the per-host queue depth, which is much more efficient than the
simple atomic operations used in scsi_host_queue_ready(). So I think that commit
is still useful.

When I was working out the patch for 826a70a08b12, I intentionally ignored
.get_budget() for requests from hctx->dispatch_list, because we don't need to
deal with queue idle there, given that .get_budget() is called from both
blk_mq_do_dispatch_sched() and blk_mq_do_dispatch_ctx().

Thanks,
Ming

Patch

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 69e3226e66ca..9d86876ec503 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -226,6 +226,7 @@  void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 		 * TODO: get more budgets, and dequeue more requests in
 		 * one time.
 		 */
+		blk_mq_sched_mark_restart_hctx(hctx);
 		blk_mq_do_dispatch_ctx(hctx);
 	} else {
 		blk_mq_flush_busy_ctxs(hctx, &rq_list);