Message ID: 20180120134813.4446-1-ming.lei@redhat.com (mailing list archive)
State: Not Applicable
On Sat, Jan 20 2018 at  8:48am -0500,
Ming Lei <ming.lei@redhat.com> wrote:

> This status is returned from driver to block layer if device related
> resource is run out of, but driver can guarantee that IO dispatch is
> triggered in future when the resource is available.
>
> This patch converts some drivers to use this return value. Meantime
> if driver returns BLK_STS_RESOURCE and SCHED_RESTART is marked, run
> queue after 10ms for avoiding IO hang.
>
> Suggested-by: Jens Axboe <axboe@kernel.dk>
> Cc: Mike Snitzer <snitzer@redhat.com>
> Cc: Laurence Oberman <loberman@redhat.com>
> Cc: Bart Van Assche <bart.vanassche@sandisk.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-core.c             |  1 +
>  block/blk-mq.c               | 20 ++++++++++++++++----
>  drivers/block/null_blk.c     |  2 +-
>  drivers/block/virtio_blk.c   |  2 +-
>  drivers/block/xen-blkfront.c |  2 +-
>  drivers/scsi/scsi_lib.c      |  6 +++---
>  include/linux/blk_types.h    |  7 +++++++
>  7 files changed, 30 insertions(+), 10 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 01f271d40825..6e97e0bf8178 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1226,7 +1226,8 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
>  		}
>
>  		ret = q->mq_ops->queue_rq(hctx, &bd);
> -		if (ret == BLK_STS_RESOURCE) {
> +		if ((ret == BLK_STS_RESOURCE) ||
> +				(ret == BLK_STS_DEV_RESOURCE)) {
>  			/*
>  			 * If an I/O scheduler has been configured and we got a
>  			 * driver tag for the next request already, free it

Just a nit, but this should be on one line.

> @@ -1764,6 +1775,7 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
>  		*cookie = new_cookie;
>  		break;
>  	case BLK_STS_RESOURCE:
> +	case BLK_STS_DEV_RESOURCE:
>  		__blk_mq_requeue_request(rq);
>  		break;
>  	default:

It seems the strategy for BLK_STS_DEV_RESOURCE and BLK_STS_RESOURCE is
too muddled: calling __blk_mq_requeue_request() for both will cause the
underlying blk-mq driver to retain the request, won't it?
> @@ -1826,7 +1838,7 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>  	hctx_lock(hctx, &srcu_idx);
>
>  	ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
> -	if (ret == BLK_STS_RESOURCE)
> +	if ((ret == BLK_STS_RESOURCE) || (ret == BLK_STS_DEV_RESOURCE))
>  		blk_mq_sched_insert_request(rq, false, true, false);
>  	else if (ret != BLK_STS_OK)
>  		blk_mq_end_request(rq, ret);

For this normal (non dm-mpath) case the request gets re-inserted;
dm-mpath must avoid that.

But with dm-mpath, which instead uses blk_mq_request_issue_directly(),
we're driving IO with stacked blk-mq drivers.  If the underlying blk-mq
driver (e.g. scsi-mq or nvme) is made to retain the request, using
__blk_mq_issue_directly()'s call to __blk_mq_requeue_request() above,
then dm-mpath will not have the ability to requeue the request without
conflicting with the underlying blk-mq driver, will it?

Or am I misunderstanding what __blk_mq_requeue_request() is doing?

dm_mq_queue_rq
-> multipath_clone_and_map
   -> blk_get_request (scsi_mq)
   -> if error, dm-mpath conditionally requeues (w/ or w/o delay)
-> if BLK_STS_OK then blk_mq_request_issue_directly() gets called
   -> dm_dispatch_clone_request
      -> blk_mq_request_issue_directly
         -> __blk_mq_try_issue_directly
            -> __blk_mq_issue_directly
               -> q->mq_ops->queue_rq (this is the underlying scsi_mq)
               -> a BLK_STS_RESOURCE return here is how Bart was able to cause stalls
               -> __blk_mq_requeue_request, if BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE  **1
-> (return from blk_mq_request_issue_directly)
   -> if BLK_STS_RESOURCE, the dm-mpath request is released using blk_put_request();
      and DM_MAPIO_REQUEUE is returned to dm_mq_queue_rq  **2
-> if DM_MAPIO_REQUEUE return from map_request()'s call to dm_dispatch_clone_request:
   BLK_STS_RESOURCE is returned from dm-mpath's dm_mq_queue_rq

The redundant queueing (both to underlying blk-mq at **1 above, and to
upper layer blk-mq at **2 above) is what I'm concerned about.

Hope this is clear.  I'd love to be missing something, please advise.

Thanks,
Mike
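The propagation Mike traces above, ending at the returns marked **1 and **2, can be modeled in a few lines of userspace C. This is a toy sketch only: the enum values mirror kernel names, but the functions, the `clone_refs` counter, and the simplified two-level state are invented for illustration and are not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { BLK_STS_OK, BLK_STS_RESOURCE, BLK_STS_DEV_RESOURCE } blk_status_t;
typedef enum { DM_MAPIO_REMAPPED, DM_MAPIO_REQUEUE } dm_mapio_t;

/* Stand-in for the underlying (e.g. scsi-mq) .queue_rq result. */
static blk_status_t underlying_queue_rq(bool device_busy)
{
	return device_busy ? BLK_STS_RESOURCE : BLK_STS_OK;
}

/* Models dm_dispatch_clone_request(): on a busy return the clone is
 * dropped (blk_put_request in the real code) and the busy status is
 * propagated upward as DM_MAPIO_REQUEUE -- point **2 in the mail. */
static dm_mapio_t dm_dispatch_clone(bool device_busy, int *clone_refs)
{
	blk_status_t ret = underlying_queue_rq(device_busy);

	if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
		(*clone_refs)--;	/* clone released, not retained below */
		return DM_MAPIO_REQUEUE;
	}
	return DM_MAPIO_REMAPPED;
}

/* Models dm_mq_queue_rq(): DM_MAPIO_REQUEUE becomes BLK_STS_RESOURCE
 * toward the upper blk-mq layer. */
static blk_status_t dm_mq_queue_rq(bool device_busy, int *clone_refs)
{
	(*clone_refs)++;		/* clone allocated via blk_get_request */
	return dm_dispatch_clone(device_busy, clone_refs) == DM_MAPIO_REQUEUE
		? BLK_STS_RESOURCE : BLK_STS_OK;
}
```

The point of the model: when the underlying queue is busy, no layer is left holding the clone, so the "redundant queueing" question is really about whether the lower layer retains the request at **1.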
On Sat, Jan 20, 2018 at 12:30:02PM -0500, Mike Snitzer wrote:
> On Sat, Jan 20 2018 at  8:48am -0500,
> Ming Lei <ming.lei@redhat.com> wrote:
>
> > [...]
> >
> >  		ret = q->mq_ops->queue_rq(hctx, &bd);
> > -		if (ret == BLK_STS_RESOURCE) {
> > +		if ((ret == BLK_STS_RESOURCE) ||
> > +				(ret == BLK_STS_DEV_RESOURCE)) {
> > [...]
>
> Just a nit, but this should be on one line.
It is too long, and my editor starts to highlight/complain about it, :-)

> > @@ -1764,6 +1775,7 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
> >  		*cookie = new_cookie;
> >  		break;
> >  	case BLK_STS_RESOURCE:
> > +	case BLK_STS_DEV_RESOURCE:
> >  		__blk_mq_requeue_request(rq);
> >  		break;
> >  	default:
>
> It seems the strategy for BLK_STS_DEV_RESOURCE and BLK_STS_RESOURCE is
> too muddled: calling __blk_mq_requeue_request() for both will cause the
> underlying blk-mq driver to retain the request, won't it?

blk_mq_request_issue_directly() is used by a driver (dm-rq) on the
underlying queue, and that driver needs to deal with the underlying
queue being busy; now we simply free the (underlying) request and feed
the busy status back to blk-mq via dm-rq.

Except for blk_mq_request_issue_directly(), the request needs to be
requeued, and is retained by blk-mq in hctx->dispatch_list.

The difference is that if the driver returns BLK_STS_DEV_RESOURCE, the
queue will be rerun when the resource is available, so there is no need
to explicitly run the queue after a delay to avoid an IO hang.

> > @@ -1826,7 +1838,7 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> >  	hctx_lock(hctx, &srcu_idx);
> >
> >  	ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
> > -	if (ret == BLK_STS_RESOURCE)
> > +	if ((ret == BLK_STS_RESOURCE) || (ret == BLK_STS_DEV_RESOURCE))
> >  		blk_mq_sched_insert_request(rq, false, true, false);
> >  	else if (ret != BLK_STS_OK)
> >  		blk_mq_end_request(rq, ret);
>
> For this normal (non dm-mpath) case the request gets re-inserted;
> dm-mpath must avoid that.
>
> But with dm-mpath, which instead uses blk_mq_request_issue_directly(),
> we're driving IO with stacked blk-mq drivers.  If the underlying blk-mq
> driver (e.g. scsi-mq or nvme) is made to retain the request, using
> __blk_mq_issue_directly()'s call to __blk_mq_requeue_request() above,
> then dm-mpath will not have the ability to requeue the request without
> conflicting with the underlying blk-mq driver, will it?

No, as I explained, the exception is blk_mq_request_issue_directly(),
and now dm-rq simply frees it (and in my original version, this request
is cached for the underlying queue and reused in the next dispatch); for
the others, the request is retained in hctx->dispatch_list and owned by
blk-mq.

> Or am I misunderstanding what __blk_mq_requeue_request() is doing?
>
> dm_mq_queue_rq
> -> multipath_clone_and_map
>    -> blk_get_request (scsi_mq)
>    -> if error, dm-mpath conditionally requeues (w/ or w/o delay)

Yes, with this patch, most of the time blk-mq will run the queue w/
delay because SCHED_RESTART is set after the 1st STS_RESOURCE from
dm-rq's .queue_rq().

> -> if BLK_STS_OK then blk_mq_request_issue_directly() gets called
>    -> dm_dispatch_clone_request
>       -> blk_mq_request_issue_directly
>          -> __blk_mq_try_issue_directly
>             -> __blk_mq_issue_directly
>                -> q->mq_ops->queue_rq (this is the underlying scsi_mq)
>                -> a BLK_STS_RESOURCE return here is how Bart was able to cause stalls

The stall only happens when SCHED_RESTART is set and the dm-rq queue is
idle (no in-flight requests); that is exactly what this patch tries to
address, as suggested by Jens.

> -> __blk_mq_requeue_request, if BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE  **1
> -> (return from blk_mq_request_issue_directly)
>    -> if BLK_STS_RESOURCE, the dm-mpath request is released using blk_put_request();
>       and DM_MAPIO_REQUEUE is returned to dm_mq_queue_rq  **2

Right.

> -> if DM_MAPIO_REQUEUE return from map_request()'s call to dm_dispatch_clone_request:
>    BLK_STS_RESOURCE is returned from dm-mpath's dm_mq_queue_rq

Right.

> The redundant queueing (both to underlying blk-mq at **1 above, and to
> upper layer blk-mq at **2 above) is what I'm concerned about.
>
> Hope this is clear.
Yeah, it is quite clear.

I also have other dm-mpath specific questions:

1) when any STS_RESOURCE is returned from the underlying queue for a
while, either because of blk_get_request() or the underlying
.queue_rq(), will dm-mpath try to switch to another path?

2) what is the basic path switch policy of dm-mpath?

3) is it possible to move the check of 'ti->type->busy' into
.clone_and_map_rq()? If it is possible, this way might be more
effective at detecting underlying queue busy.

Actually this patch may have another issue: if the default run queue
delay (in this patch, it is 10ms) is too short, the timer may expire
before any in-flight underlying request completes, then we may dequeue
too quickly, and sequential IO performance can be hurt too.

But my previous patch on github doesn't have this issue:

https://github.com/ming1/linux/commit/dfd672c998283a110247152237a9916b8264f3ec

Jens, what do you think of this issue? Or do we need to worry about it?
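The rerun policy the thread is debating can be written down as a small decision table: BLK_STS_DEV_RESOURCE relies on the driver's own promise to restart the queue, while BLK_STS_RESOURCE with SCHED_RESTART set falls back to the 10ms safety timer from the patch. The sketch below is a userspace model with invented names (the enum, the helper, the constant), not the kernel logic verbatim.

```c
#include <assert.h>
#include <stdbool.h>

enum action { RERUN_NOW, RERUN_ON_COMPLETION, RERUN_AFTER_DELAY };

#define RESOURCE_DELAY_MS 10	/* the 10ms from the patch under review */

/* Decide how the hw queue gets rerun after a busy dispatch.
 * - BLK_STS_DEV_RESOURCE: driver guarantees a rerun once its resource
 *   frees up, so no timer is needed.
 * - BLK_STS_RESOURCE with SCHED_RESTART marked: arm the delayed rerun
 *   to close the IO-hang window the thread describes.
 * - otherwise: rerun the queue right away. */
static enum action after_busy_dispatch(bool dev_resource, bool sched_restart)
{
	if (dev_resource)
		return RERUN_ON_COMPLETION;
	return sched_restart ? RERUN_AFTER_DELAY : RERUN_NOW;
}
```

Ming's delay concern then maps onto the `RERUN_AFTER_DELAY` arm: if `RESOURCE_DELAY_MS` expires before any in-flight request completes, the queue is rerun "too quickly" for sequential IO.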
On Sun, Jan 21, 2018 at 08:57:41AM +0800, Ming Lei wrote:
> On Sat, Jan 20, 2018 at 12:30:02PM -0500, Mike Snitzer wrote:
> > [...]
>
> [...]
>
> Actually this patch may have another issue: if the default run queue
> delay (in this patch, it is 10ms) is too short, the timer may expire
> before any in-flight underlying request completes, then we may dequeue
> too quickly, and sequential IO performance can be hurt too.
>
> But my previous patch on github doesn't have this issue:
>
> https://github.com/ming1/linux/commit/dfd672c998283a110247152237a9916b8264f3ec
>
> Jens, what do you think of this issue? Or do we need to worry about
> it?

I just forgot that we have discussed introducing blk_get_request_notify(),
which can address this issue too, by returning BLK_STS_DEV_RESOURCE in
case blk_get_request_notify() gets NULL.
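blk_get_request_notify() does not exist at this point in the thread; the following is a purely hypothetical userspace sketch of the idea as described here, with an invented pool shape and invented names: when allocation fails, arm a notification instead of relying on a timer, and rerun the queue from the free path once a request becomes available again.

```c
#include <assert.h>
#include <stdbool.h>

struct req_pool {
	int free_requests;	/* available requests/tags */
	bool waiter;		/* a caller got NULL and wants a rerun */
	int reruns;		/* queue reruns triggered by frees */
};

/* Models blk_get_request_notify(): on failure, remember that the
 * caller wants to be notified; the caller would then return
 * BLK_STS_DEV_RESOURCE, trusting the free path to rerun the queue. */
static bool get_request_notify(struct req_pool *p)
{
	if (p->free_requests > 0) {
		p->free_requests--;
		return true;
	}
	p->waiter = true;
	return false;
}

/* Models the free side: a completed request returns to the pool, and
 * any armed waiter gets its queue rerun -- no arbitrary delay needed. */
static void put_request(struct req_pool *p)
{
	p->free_requests++;
	if (p->waiter) {
		p->waiter = false;
		p->reruns++;	/* stand-in for blk_mq_run_hw_queue() */
	}
}
```

This is exactly why the notify variant sidesteps the "how long is the right delay" problem: the rerun is driven by a completion event, not a timer.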
On Sat, Jan 20 2018 at  7:57pm -0500,
Ming Lei <ming.lei@redhat.com> wrote:

> On Sat, Jan 20, 2018 at 12:30:02PM -0500, Mike Snitzer wrote:
> > [...]
> > Just a nit, but this should be on one line.
>
> It is too long, and my editor starts to highlight/complain about it, :-)

Look at the lines immediately following it, yours isn't longer than
them..

> > It seems the strategy for BLK_STS_DEV_RESOURCE and BLK_STS_RESOURCE is
> > too muddled: calling __blk_mq_requeue_request() for both will cause the
> > underlying blk-mq driver to retain the request, won't it?
>
> blk_mq_request_issue_directly() is used by a driver (dm-rq) on the
> underlying queue, and that driver needs to deal with the underlying
> queue being busy; now we simply free the (underlying) request and feed
> the busy status back to blk-mq via dm-rq.
>
> Except for blk_mq_request_issue_directly(), the request needs to be
> requeued, and is retained by blk-mq in hctx->dispatch_list.
>
> The difference is that if the driver returns BLK_STS_DEV_RESOURCE, the
> queue will be rerun when the resource is available, so there is no need
> to explicitly run the queue after a delay to avoid an IO hang.

Yes, I understand the intent.

> > But with dm-mpath, which instead uses blk_mq_request_issue_directly(),
> > we're driving IO with stacked blk-mq drivers.  If the underlying blk-mq
> > driver (e.g. scsi-mq or nvme) is made to retain the request, using
> > __blk_mq_issue_directly()'s call to __blk_mq_requeue_request() above,
> > then dm-mpath will not have the ability to requeue the request without
> > conflicting with the underlying blk-mq driver, will it?
>
> No, as I explained, the exception is blk_mq_request_issue_directly(),
> and now dm-rq simply frees it (and in my original version, this request
> is cached for the underlying queue and reused in the next dispatch); for
> the others, the request is retained in hctx->dispatch_list and owned by
> blk-mq.
>
> > Or am I misunderstanding what __blk_mq_requeue_request() is doing?

OK, I was misunderstanding __blk_mq_requeue_request().  Seems
__blk_mq_requeue_request() is effectively resetting a request for
reuse, not retaining the request for reissue.

> > dm_mq_queue_rq
> > -> multipath_clone_and_map
> > [...]
>
> The stall only happens when SCHED_RESTART is set and the dm-rq queue is
> idle (no in-flight requests); that is exactly what this patch tries to
> address, as suggested by Jens.
>
> > -> __blk_mq_requeue_request, if BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE  **1
> > [...]
>
> Right.
> > -> if DM_MAPIO_REQUEUE return from map_request()'s call to dm_dispatch_clone_request:
> >    BLK_STS_RESOURCE is returned from dm-mpath's dm_mq_queue_rq
>
> Right.
>
> > The redundant queueing (both to underlying blk-mq at **1 above, and to
> > upper layer blk-mq at **2 above) is what I'm concerned about.
> >
> > Hope this is clear.
>
> Yeah, it is quite clear.

Well, I thought there was a problem with __blk_mq_issue_directly calling
__blk_mq_requeue_request for dm-rq on scsi-mq.. guess not.

> I also have other dm-mpath specific questions:
>
> 1) when any STS_RESOURCE is returned from the underlying queue for a
> while, either because of blk_get_request() or the underlying
> .queue_rq(), will dm-mpath try to switch to another path?

If dm_requeue_original_request() is used, and then we reenter
multipath_clone_and_map(), then a new path may be selected, depending on
whether there are multiple paths and which path selector is being used.

> 2) what is the basic path switch policy of dm-mpath?

It is selectable; there are 3:
round-robin, queue-length and service-time

A path will be failed in multipath_end_io() if IO returns with a
retryable error (as determined by blk_path_error()).  This ensures the
path that experienced the failure will not be selected again until
reinstated.

> 3) is it possible to move the check of 'ti->type->busy' into
> .clone_and_map_rq()? If it is possible, this way might be more
> effective at detecting underlying queue busy.

Can you elaborate further on why it'd make a difference?

It'd be somewhat odd to push that busy check down into
.clone_and_map_rq().

Are you looking to check the selected path's blk-mq queue before calling
blk_get_request()?  Why is that any different than blk_get_request()
returning busy?
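Of the three selector policies named in this exchange, round-robin is the simplest to model. The sketch below is illustrative only: it is not the dm round-robin selector code, just a toy showing the two behaviors the thread relies on: cycle through the paths, and never hand out a path that has been failed until it is reinstated.

```c
#include <assert.h>
#include <stdbool.h>

#define NPATHS 3

struct rr_state {
	bool failed[NPATHS];	/* set when a path is failed on IO error */
	int last;		/* index of the last path handed out */
};

/* Return the next usable path index, or -1 if every path has failed
 * (the all-paths-down case, where dm-mpath must queue or error). */
static int rr_select(struct rr_state *s)
{
	for (int i = 1; i <= NPATHS; i++) {
		int p = (s->last + i) % NPATHS;

		if (!s->failed[p]) {
			s->last = p;
			return p;
		}
	}
	return -1;
}
```

Failing a path (as multipath_end_io() does on a retryable error) just sets `failed[p]`; the selector then skips it on every subsequent pick, which is the "not selected again until reinstated" guarantee in selector-independent form.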
> Actually this patch may have another issue: if the default run queue
> delay (in this patch, it is 10ms) is too short, the timer may expire
> before any in-flight underlying request completes, then we may dequeue
> too quickly, and sequential IO performance can be hurt too.

That's the problem with any arbitrary timer-based solution.  No value
that you pick is going to be correct for all hardware.

> But my previous patch on github doesn't have this issue:
>
> https://github.com/ming1/linux/commit/dfd672c998283a110247152237a9916b8264f3ec
>
> Jens, what do you think of this issue? Or do we need to worry about
> it?

As your follow-up reply pointed out, blk_get_request_notify() is likely
the better way forward to address this race.

Mike
On Sat, Jan 20, 2018 at 10:52:01PM -0500, Mike Snitzer wrote:
> On Sat, Jan 20 2018 at  7:57pm -0500,
> Ming Lei <ming.lei@redhat.com> wrote:
>
> > [...]
> > It is too long, and my editor starts to highlight/complain about it, :-)
>
> Look at the lines immediately following it, yours isn't longer than
> them..

OK.

> > 1) when any STS_RESOURCE is returned from the underlying queue for a
> > while, either because of blk_get_request() or the underlying
> > .queue_rq(), will dm-mpath try to switch to another path?
>
> If dm_requeue_original_request() is used, and then we reenter
> multipath_clone_and_map(), then a new path may be selected, depending
> on whether there are multiple paths and which path selector is being
> used.

OK.

> > 3) is it possible to move the check of 'ti->type->busy' into
> > .clone_and_map_rq()? If it is possible, this way might be more
> > effective at detecting underlying queue busy.
>
> Can you elaborate further on why it'd make a difference?
>
> It'd be somewhat odd to push that busy check down into
> .clone_and_map_rq().
From the source code of multipath_busy() and multipath_clone_and_map():

1) they both check whether a path is available, so this part of
multipath_busy() could be done in multipath_clone_and_map() too, I guess.

2) multipath_busy() also checks whether there is any non-busy path
(underlying queue), but never records the non-busy path for further use:

- pgpath_busy() calls blk_lld_busy() to check if the underlying queue is
  busy; this is inherently racy, because the queue can become non-busy
  at the same time. Also, blk_lld_busy() always returns false in the
  blk-mq case.

- that means multipath_busy() isn't needed for blk-mq; for non-mq, the
  busy check may avoid wasting CPU on unnecessary allocation, which can
  be a bit expensive, but even that depends on the path selected in
  multipath_clone_and_map() not being busy.

That is why I raised the question in the previous email.

> Are you looking to check the selected path's blk-mq queue before
> calling blk_get_request()? Why is that any different than
> blk_get_request() returning busy?

Maybe the check of 'ti->type->busy' can be killed, at least in
dm_mq_queue_rq().

> > Actually this patch may have another issue: if the default run-queue
> > delay (in this patch, 10ms) is too short, the timer may expire before
> > any in-flight underlying request completes; then we may dequeue too
> > quickly, and sequential IO performance can be hurt too.
>
> That's the problem with any arbitrary timer-based solution. No value
> you pick is going to be correct for all hardware.
>
> > But my previous patch on github doesn't have this issue:
> >
> > https://github.com/ming1/linux/commit/dfd672c998283a110247152237a9916b8264f3ec
> >
> > Jens, what do you think of this issue? Or do we need to worry about
> > it?
>
> As your follow-up reply pointed out, blk_get_request_notify() is likely
> the better way forward to address this race.
Yeah, please ignore this patch; I will post the new patches soon for review, but they have to be against both the dm and block trees.
diff --git a/block/blk-core.c b/block/blk-core.c
index a2005a485335..ac4789c5e329 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -145,6 +145,7 @@ static const struct {
 	[BLK_STS_MEDIUM]	= { -ENODATA,	"critical medium" },
 	[BLK_STS_PROTECTION]	= { -EILSEQ,	"protection" },
 	[BLK_STS_RESOURCE]	= { -ENOMEM,	"kernel resource" },
+	[BLK_STS_DEV_RESOURCE]	= { -ENOMEM,	"device resource" },
 	[BLK_STS_AGAIN]		= { -EAGAIN,	"nonblocking retry" },
 
 	/* device mapper special case, should not leak out: */
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 01f271d40825..6e97e0bf8178 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1169,6 +1169,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	struct request *rq, *nxt;
 	bool no_tag = false;
 	int errors, queued;
+	blk_status_t ret = BLK_STS_OK;
 
 	if (list_empty(list))
 		return false;
@@ -1181,7 +1182,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	errors = queued = 0;
 	do {
 		struct blk_mq_queue_data bd;
-		blk_status_t ret;
 
 		rq = list_first_entry(list, struct request, queuelist);
 		if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
@@ -1226,7 +1226,8 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		}
 
 		ret = q->mq_ops->queue_rq(hctx, &bd);
-		if (ret == BLK_STS_RESOURCE) {
+		if ((ret == BLK_STS_RESOURCE) ||
+		    (ret == BLK_STS_DEV_RESOURCE)) {
 			/*
 			 * If an I/O scheduler has been configured and we got a
 			 * driver tag for the next request already, free it
@@ -1257,6 +1258,8 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	 * that is where we will continue on next queue run.
 	 */
 	if (!list_empty(list)) {
+		bool needs_restart;
+
 		spin_lock(&hctx->lock);
 		list_splice_init(list, &hctx->dispatch);
 		spin_unlock(&hctx->lock);
@@ -1280,10 +1283,18 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		 * - Some but not all block drivers stop a queue before
 		 *   returning BLK_STS_RESOURCE. Two exceptions are scsi-mq
 		 *   and dm-rq.
+		 *
+		 * If drivers return BLK_STS_RESOURCE and the SCHED_RESTART
+		 * bit is set, run queue after 10ms for avoiding IO hang
+		 * because the queue may be idle and the RESTART mechanism
+		 * can't work any more.
 		 */
-		if (!blk_mq_sched_needs_restart(hctx) ||
+		needs_restart = blk_mq_sched_needs_restart(hctx);
+		if (!needs_restart ||
 		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
 			blk_mq_run_hw_queue(hctx, true);
+		else if (needs_restart && (ret == BLK_STS_RESOURCE))
+			blk_mq_delay_run_hw_queue(hctx, 10);
 	}
 
 	return (queued + errors) != 0;
@@ -1764,6 +1775,7 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
 		*cookie = new_cookie;
 		break;
 	case BLK_STS_RESOURCE:
+	case BLK_STS_DEV_RESOURCE:
 		__blk_mq_requeue_request(rq);
 		break;
 	default:
@@ -1826,7 +1838,7 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 
 	hctx_lock(hctx, &srcu_idx);
 	ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
-	if (ret == BLK_STS_RESOURCE)
+	if ((ret == BLK_STS_RESOURCE) || (ret == BLK_STS_DEV_RESOURCE))
 		blk_mq_sched_insert_request(rq, false, true, false);
 	else if (ret != BLK_STS_OK)
 		blk_mq_end_request(rq, ret);
diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 6655893a3a7a..287a09611c0f 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -1230,7 +1230,7 @@ static blk_status_t null_handle_cmd(struct nullb_cmd *cmd)
 			return BLK_STS_OK;
 		} else
 			/* requeue request */
-			return BLK_STS_RESOURCE;
+			return BLK_STS_DEV_RESOURCE;
 	}
 }
 
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 68846897d213..79908e6ddbf2 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -276,7 +276,7 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
 		/* Out of mem doesn't actually happen, since we fall back
 		 * to direct descriptors */
 		if (err == -ENOMEM || err == -ENOSPC)
-			return BLK_STS_RESOURCE;
+			return BLK_STS_DEV_RESOURCE;
 		return BLK_STS_IOERR;
 	}
 
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 891265acb10e..e126e4cac2ca 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -911,7 +911,7 @@ static blk_status_t blkif_queue_rq(struct blk_mq_hw_ctx *hctx,
 out_busy:
 	blk_mq_stop_hw_queue(hctx);
 	spin_unlock_irqrestore(&rinfo->ring_lock, flags);
-	return BLK_STS_RESOURCE;
+	return BLK_STS_DEV_RESOURCE;
 }
 
 static void blkif_complete_rq(struct request *rq)
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index d9ca1dfab154..55be2550c555 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2030,9 +2030,9 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 	case BLK_STS_OK:
 		break;
 	case BLK_STS_RESOURCE:
-		if (atomic_read(&sdev->device_busy) == 0 &&
-		    !scsi_device_blocked(sdev))
-			blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
+		if (atomic_read(&sdev->device_busy) ||
+		    scsi_device_blocked(sdev))
+			ret = BLK_STS_DEV_RESOURCE;
 		break;
 	default:
 		/*
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index c5d3db0d83f8..d76f1744a7ca 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -39,6 +39,13 @@ typedef u8 __bitwise blk_status_t;
 
 #define BLK_STS_AGAIN		((__force blk_status_t)12)
 
+/*
+ * This status is returned from driver to block layer if device related
+ * resource is run out of, but driver can guarantee that IO dispatch is
+ * triggered in future when the resource is available.
+ */
+#define BLK_STS_DEV_RESOURCE	((__force blk_status_t)13)
+
 /**
  * blk_path_error - returns true if error may be path related
  * @error: status the request was completed with