Message ID: 20180120134813.4446-1-ming.lei@redhat.com (mailing list archive)
State: Not Applicable
On Sat, Jan 20 2018 at  8:48am -0500,
Ming Lei <ming.lei@redhat.com> wrote:

> This status is returned from driver to block layer if device related
> resource is run out of, but driver can guarantee that IO dispatch is
> triggered in future when the resource is available.
>
> This patch converts some drivers to use this return value. Meantime
> if driver returns BLK_STS_RESOURCE and SCHED_RESTART is marked, run
> queue after 10ms for avoiding IO hang.
>
> Suggested-by: Jens Axboe <axboe@kernel.dk>
> Cc: Mike Snitzer <snitzer@redhat.com>
> Cc: Laurence Oberman <loberman@redhat.com>
> Cc: Bart Van Assche <bart.vanassche@sandisk.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-core.c             |  1 +
>  block/blk-mq.c               | 20 ++++++++++++++++----
>  drivers/block/null_blk.c     |  2 +-
>  drivers/block/virtio_blk.c   |  2 +-
>  drivers/block/xen-blkfront.c |  2 +-
>  drivers/scsi/scsi_lib.c      |  6 +++---
>  include/linux/blk_types.h    |  7 +++++++
>  7 files changed, 30 insertions(+), 10 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 01f271d40825..6e97e0bf8178 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1226,7 +1226,8 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
>  		}
>
>  		ret = q->mq_ops->queue_rq(hctx, &bd);
> -		if (ret == BLK_STS_RESOURCE) {
> +		if ((ret == BLK_STS_RESOURCE) ||
> +				(ret == BLK_STS_DEV_RESOURCE)) {
>  			/*
>  			 * If an I/O scheduler has been configured and we got a
>  			 * driver tag for the next request already, free it

Just a nit, but this should be on one line.

> @@ -1764,6 +1775,7 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
>  		*cookie = new_cookie;
>  		break;
>  	case BLK_STS_RESOURCE:
> +	case BLK_STS_DEV_RESOURCE:
>  		__blk_mq_requeue_request(rq);
>  		break;
>  	default:

It seems the strategy for BLK_STS_DEV_RESOURCE and BLK_STS_RESOURCE is
too muddled: calling __blk_mq_requeue_request() for both will cause the
underlying blk-mq driver to retain the request, won't it?
> @@ -1826,7 +1838,7 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>  	hctx_lock(hctx, &srcu_idx);
>
>  	ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
> -	if (ret == BLK_STS_RESOURCE)
> +	if ((ret == BLK_STS_RESOURCE) || (ret == BLK_STS_DEV_RESOURCE))
>  		blk_mq_sched_insert_request(rq, false, true, false);
>  	else if (ret != BLK_STS_OK)
>  		blk_mq_end_request(rq, ret);

For this normal (non dm-mpath) case the request gets re-inserted;
dm-mpath must avoid that.

But with dm-mpath, which instead uses blk_mq_request_issue_directly(),
we're driving IO with stacked blk-mq drivers.  If the underlying blk-mq
driver (e.g. scsi-mq or nvme) is made to retain the request, using
__blk_mq_issue_directly()'s call to __blk_mq_requeue_request() above,
then dm-mpath will not have the ability to requeue the request without
conflicting with the underlying blk-mq driver, will it?

Or am I misunderstanding what __blk_mq_requeue_request() is doing?

dm_mq_queue_rq
-> multipath_clone_and_map
   -> blk_get_request (scsi_mq)
   -> if error, dm-mpath conditionally requeues (w/ or w/o delay)
-> if BLK_STS_OK then blk_mq_request_issue_directly() gets called
   -> dm_dispatch_clone_request
      -> blk_mq_request_issue_directly
         -> __blk_mq_try_issue_directly
            -> __blk_mq_issue_directly
               -> q->mq_ops->queue_rq (this is the underlying scsi_mq)
               -> a BLK_STS_RESOURCE return here is how Bart was able to cause stalls
               -> __blk_mq_requeue_request, if BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE  **1
-> (return from blk_mq_request_issue_directly)
   -> if BLK_STS_RESOURCE, the dm-mpath request is released using blk_put_request();
      and DM_MAPIO_REQUEUE is returned to dm_mq_queue_rq  **2
-> if DM_MAPIO_REQUEUE return from map_request()'s call to dm_dispatch_clone_request:
   BLK_STS_RESOURCE is returned from dm-mpath's dm_mq_queue_rq

The redundant queueing (both to underlying blk-mq at **1 above, and to
upper layer blk-mq at **2 above) is what I'm concerned about.

Hope this is clear.  I'd love to be missing something, please advise.

Thanks,
Mike
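The propagation Mike traces above, ending at the returns marked **1 and **2, can be modeled in a few lines of userspace C. This is a toy sketch only: the enum values mirror kernel names, but the functions, the `clone_refs` counter, and the simplified two-level state are invented for illustration and are not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { BLK_STS_OK, BLK_STS_RESOURCE, BLK_STS_DEV_RESOURCE } blk_status_t;
typedef enum { DM_MAPIO_REMAPPED, DM_MAPIO_REQUEUE } dm_mapio_t;

/* Stand-in for the underlying (e.g. scsi-mq) .queue_rq result. */
static blk_status_t underlying_queue_rq(bool device_busy)
{
	return device_busy ? BLK_STS_RESOURCE : BLK_STS_OK;
}

/* Models dm_dispatch_clone_request(): on a busy return the clone is
 * dropped (blk_put_request in the real code) and the busy status is
 * propagated upward as DM_MAPIO_REQUEUE -- point **2 in the mail. */
static dm_mapio_t dm_dispatch_clone(bool device_busy, int *clone_refs)
{
	blk_status_t ret = underlying_queue_rq(device_busy);

	if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
		(*clone_refs)--;	/* clone released, not retained below */
		return DM_MAPIO_REQUEUE;
	}
	return DM_MAPIO_REMAPPED;
}

/* Models dm_mq_queue_rq(): DM_MAPIO_REQUEUE becomes BLK_STS_RESOURCE
 * toward the upper blk-mq layer. */
static blk_status_t dm_mq_queue_rq(bool device_busy, int *clone_refs)
{
	(*clone_refs)++;		/* clone allocated via blk_get_request */
	return dm_dispatch_clone(device_busy, clone_refs) == DM_MAPIO_REQUEUE
		? BLK_STS_RESOURCE : BLK_STS_OK;
}
```

The point of the model: when the underlying queue is busy, no layer is left holding the clone, so the "redundant queueing" question is really about whether the lower layer retains the request at **1.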
On Sat, Jan 20, 2018 at 12:30:02PM -0500, Mike Snitzer wrote:
> On Sat, Jan 20 2018 at  8:48am -0500,
> Ming Lei <ming.lei@redhat.com> wrote:
>
> > [...]
> >
> >  		ret = q->mq_ops->queue_rq(hctx, &bd);
> > -		if (ret == BLK_STS_RESOURCE) {
> > +		if ((ret == BLK_STS_RESOURCE) ||
> > +				(ret == BLK_STS_DEV_RESOURCE)) {
> > [...]
>
> Just a nit, but this should be on one line.
It is too long, and my editor starts to highlight/complain about it, :-)

> > @@ -1764,6 +1775,7 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
> >  		*cookie = new_cookie;
> >  		break;
> >  	case BLK_STS_RESOURCE:
> > +	case BLK_STS_DEV_RESOURCE:
> >  		__blk_mq_requeue_request(rq);
> >  		break;
> >  	default:
>
> It seems the strategy for BLK_STS_DEV_RESOURCE and BLK_STS_RESOURCE is
> too muddled: calling __blk_mq_requeue_request() for both will cause the
> underlying blk-mq driver to retain the request, won't it?

blk_mq_request_issue_directly() is used by a driver (dm-rq) on the
underlying queue, and that driver needs to deal with the underlying
queue being busy; now we simply free the (underlying) request and feed
the busy status back to blk-mq via dm-rq.

Except for blk_mq_request_issue_directly(), the request needs to be
requeued, and is retained by blk-mq in hctx->dispatch_list.

The difference is that if the driver returns BLK_STS_DEV_RESOURCE, the
queue will be rerun when the resource is available, so there is no need
to explicitly run the queue after a delay to avoid an IO hang.

> > @@ -1826,7 +1838,7 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> >  	hctx_lock(hctx, &srcu_idx);
> >
> >  	ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
> > -	if (ret == BLK_STS_RESOURCE)
> > +	if ((ret == BLK_STS_RESOURCE) || (ret == BLK_STS_DEV_RESOURCE))
> >  		blk_mq_sched_insert_request(rq, false, true, false);
> >  	else if (ret != BLK_STS_OK)
> >  		blk_mq_end_request(rq, ret);
>
> For this normal (non dm-mpath) case the request gets re-inserted;
> dm-mpath must avoid that.
>
> But with dm-mpath, which instead uses blk_mq_request_issue_directly(),
> we're driving IO with stacked blk-mq drivers.  If the underlying blk-mq
> driver (e.g. scsi-mq or nvme) is made to retain the request, using
> __blk_mq_issue_directly()'s call to __blk_mq_requeue_request() above,
> then dm-mpath will not have the ability to requeue the request without
> conflicting with the underlying blk-mq driver, will it?

No, as I explained, the exception is blk_mq_request_issue_directly(),
and now dm-rq simply frees it (and in my original version, this request
is cached for the underlying queue and reused in the next dispatch); for
the others, the request is retained in hctx->dispatch_list and owned by
blk-mq.

> Or am I misunderstanding what __blk_mq_requeue_request() is doing?
>
> dm_mq_queue_rq
> -> multipath_clone_and_map
>    -> blk_get_request (scsi_mq)
>    -> if error, dm-mpath conditionally requeues (w/ or w/o delay)

Yes, with this patch, most of the time blk-mq will run the queue w/
delay because SCHED_RESTART is set after the 1st STS_RESOURCE from
dm-rq's .queue_rq().

> -> if BLK_STS_OK then blk_mq_request_issue_directly() gets called
>    -> dm_dispatch_clone_request
>       -> blk_mq_request_issue_directly
>          -> __blk_mq_try_issue_directly
>             -> __blk_mq_issue_directly
>                -> q->mq_ops->queue_rq (this is the underlying scsi_mq)
>                -> a BLK_STS_RESOURCE return here is how Bart was able to cause stalls

The stall only happens when SCHED_RESTART is set and the dm-rq queue is
idle (no in-flight requests); that is exactly what this patch tries to
address, as suggested by Jens.

> -> __blk_mq_requeue_request, if BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE  **1
> -> (return from blk_mq_request_issue_directly)
>    -> if BLK_STS_RESOURCE, the dm-mpath request is released using blk_put_request();
>       and DM_MAPIO_REQUEUE is returned to dm_mq_queue_rq  **2

Right.

> -> if DM_MAPIO_REQUEUE return from map_request()'s call to dm_dispatch_clone_request:
>    BLK_STS_RESOURCE is returned from dm-mpath's dm_mq_queue_rq

Right.

> The redundant queueing (both to underlying blk-mq at **1 above, and to
> upper layer blk-mq at **2 above) is what I'm concerned about.
>
> Hope this is clear.
Yeah, it is quite clear.

I also have other dm-mpath specific questions:

1) when any STS_RESOURCE is returned from the underlying queue for a
while, either because of blk_get_request() or the underlying
.queue_rq(), will dm-mpath try to switch to another path?

2) what is the basic path switch policy of dm-mpath?

3) is it possible to move the check of 'ti->type->busy' into
.clone_and_map_rq()? If it is possible, this way might be more
effective at detecting underlying queue busy.

Actually this patch may have another issue: if the default run queue
delay (in this patch, it is 10ms) is too short, the timer may expire
before any in-flight underlying request completes, then we may dequeue
too quickly, and sequential IO performance can be hurt too.

But my previous patch on github doesn't have this issue:

https://github.com/ming1/linux/commit/dfd672c998283a110247152237a9916b8264f3ec

Jens, what do you think of this issue? Or do we need to worry about it?
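The rerun policy the thread is debating can be written down as a small decision table: BLK_STS_DEV_RESOURCE relies on the driver's own promise to restart the queue, while BLK_STS_RESOURCE with SCHED_RESTART set falls back to the 10ms safety timer from the patch. The sketch below is a userspace model with invented names (the enum, the helper, the constant), not the kernel logic verbatim.

```c
#include <assert.h>
#include <stdbool.h>

enum action { RERUN_NOW, RERUN_ON_COMPLETION, RERUN_AFTER_DELAY };

#define RESOURCE_DELAY_MS 10	/* the 10ms from the patch under review */

/* Decide how the hw queue gets rerun after a busy dispatch.
 * - BLK_STS_DEV_RESOURCE: driver guarantees a rerun once its resource
 *   frees up, so no timer is needed.
 * - BLK_STS_RESOURCE with SCHED_RESTART marked: arm the delayed rerun
 *   to close the IO-hang window the thread describes.
 * - otherwise: rerun the queue right away. */
static enum action after_busy_dispatch(bool dev_resource, bool sched_restart)
{
	if (dev_resource)
		return RERUN_ON_COMPLETION;
	return sched_restart ? RERUN_AFTER_DELAY : RERUN_NOW;
}
```

Ming's delay concern then maps onto the `RERUN_AFTER_DELAY` arm: if `RESOURCE_DELAY_MS` expires before any in-flight request completes, the queue is rerun "too quickly" for sequential IO.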
On Sun, Jan 21, 2018 at 08:57:41AM +0800, Ming Lei wrote:
> On Sat, Jan 20, 2018 at 12:30:02PM -0500, Mike Snitzer wrote:
> > [...]
>
> [...]
>
> Actually this patch may have another issue: if the default run queue
> delay (in this patch, it is 10ms) is too short, the timer may expire
> before any in-flight underlying request completes, then we may dequeue
> too quickly, and sequential IO performance can be hurt too.
>
> But my previous patch on github doesn't have this issue:
>
> https://github.com/ming1/linux/commit/dfd672c998283a110247152237a9916b8264f3ec
>
> Jens, what do you think of this issue? Or do we need to worry about
> it?

I just forgot that we have discussed introducing blk_get_request_notify(),
which can address this issue too, by returning BLK_STS_DEV_RESOURCE in
case blk_get_request_notify() gets NULL.
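blk_get_request_notify() does not exist at this point in the thread; the following is a purely hypothetical userspace sketch of the idea as described here, with an invented pool shape and invented names: when allocation fails, arm a notification instead of relying on a timer, and rerun the queue from the free path once a request becomes available again.

```c
#include <assert.h>
#include <stdbool.h>

struct req_pool {
	int free_requests;	/* available requests/tags */
	bool waiter;		/* a caller got NULL and wants a rerun */
	int reruns;		/* queue reruns triggered by frees */
};

/* Models blk_get_request_notify(): on failure, remember that the
 * caller wants to be notified; the caller would then return
 * BLK_STS_DEV_RESOURCE, trusting the free path to rerun the queue. */
static bool get_request_notify(struct req_pool *p)
{
	if (p->free_requests > 0) {
		p->free_requests--;
		return true;
	}
	p->waiter = true;
	return false;
}

/* Models the free side: a completed request returns to the pool, and
 * any armed waiter gets its queue rerun -- no arbitrary delay needed. */
static void put_request(struct req_pool *p)
{
	p->free_requests++;
	if (p->waiter) {
		p->waiter = false;
		p->reruns++;	/* stand-in for blk_mq_run_hw_queue() */
	}
}
```

This is exactly why the notify variant sidesteps the "how long is the right delay" problem: the rerun is driven by a completion event, not a timer.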
On Sat, Jan 20 2018 at  7:57pm -0500,
Ming Lei <ming.lei@redhat.com> wrote:

> On Sat, Jan 20, 2018 at 12:30:02PM -0500, Mike Snitzer wrote:
> > [...]
> > Just a nit, but this should be on one line.
>
> It is too long, and my editor starts to highlight/complain about it, :-)

Look at the lines immediately following it, yours isn't longer than
them..

> > It seems the strategy for BLK_STS_DEV_RESOURCE and BLK_STS_RESOURCE is
> > too muddled: calling __blk_mq_requeue_request() for both will cause the
> > underlying blk-mq driver to retain the request, won't it?
>
> blk_mq_request_issue_directly() is used by a driver (dm-rq) on the
> underlying queue, and that driver needs to deal with the underlying
> queue being busy; now we simply free the (underlying) request and feed
> the busy status back to blk-mq via dm-rq.
>
> Except for blk_mq_request_issue_directly(), the request needs to be
> requeued, and is retained by blk-mq in hctx->dispatch_list.
>
> The difference is that if the driver returns BLK_STS_DEV_RESOURCE, the
> queue will be rerun when the resource is available, so there is no need
> to explicitly run the queue after a delay to avoid an IO hang.

Yes, I understand the intent.

> > But with dm-mpath, which instead uses blk_mq_request_issue_directly(),
> > we're driving IO with stacked blk-mq drivers.  If the underlying blk-mq
> > driver (e.g. scsi-mq or nvme) is made to retain the request, using
> > __blk_mq_issue_directly()'s call to __blk_mq_requeue_request() above,
> > then dm-mpath will not have the ability to requeue the request without
> > conflicting with the underlying blk-mq driver, will it?
>
> No, as I explained, the exception is blk_mq_request_issue_directly(),
> and now dm-rq simply frees it (and in my original version, this request
> is cached for the underlying queue and reused in the next dispatch); for
> the others, the request is retained in hctx->dispatch_list and owned by
> blk-mq.
>
> > Or am I misunderstanding what __blk_mq_requeue_request() is doing?

OK, I was misunderstanding __blk_mq_requeue_request().  Seems
__blk_mq_requeue_request() is effectively resetting a request for
reuse, not retaining the request for reissue.

> > dm_mq_queue_rq
> > -> multipath_clone_and_map
> > [...]
>
> The stall only happens when SCHED_RESTART is set and the dm-rq queue is
> idle (no in-flight requests); that is exactly what this patch tries to
> address, as suggested by Jens.
>
> > -> __blk_mq_requeue_request, if BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE  **1
> > [...]
>
> Right.
> > -> if DM_MAPIO_REQUEUE return from map_request()'s call to dm_dispatch_clone_request:
> >    BLK_STS_RESOURCE is returned from dm-mpath's dm_mq_queue_rq
>
> Right.
>
> > The redundant queueing (both to underlying blk-mq at **1 above, and to
> > upper layer blk-mq at **2 above) is what I'm concerned about.
> >
> > Hope this is clear.
>
> Yeah, it is quite clear.

Well, I thought there was a problem with __blk_mq_issue_directly calling
__blk_mq_requeue_request for dm-rq on scsi-mq.. guess not.

> I also have other dm-mpath specific questions:
>
> 1) when any STS_RESOURCE is returned from the underlying queue for a
> while, either because of blk_get_request() or the underlying
> .queue_rq(), will dm-mpath try to switch to another path?

If dm_requeue_original_request() is used, and then we reenter
multipath_clone_and_map(), then a new path may be selected, depending on
whether there are multiple paths and which path selector is being used.

> 2) what is the basic path switch policy of dm-mpath?

It is selectable; there are 3:
round-robin, queue-length and service-time

A path will be failed in multipath_end_io() if IO returns with a
retryable error (as determined by blk_path_error()).  This ensures the
path that experienced the failure will not be selected again until
reinstated.

> 3) is it possible to move the check of 'ti->type->busy' into
> .clone_and_map_rq()? If it is possible, this way might be more
> effective at detecting underlying queue busy.

Can you elaborate further on why it'd make a difference?

It'd be somewhat odd to push that busy check down into
.clone_and_map_rq().

Are you looking to check the selected path's blk-mq queue before calling
blk_get_request()?  Why is that any different than blk_get_request()
returning busy?
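Of the three selector policies named in this exchange, round-robin is the simplest to model. The sketch below is illustrative only: it is not the dm round-robin selector code, just a toy showing the two behaviors the thread relies on: cycle through the paths, and never hand out a path that has been failed until it is reinstated.

```c
#include <assert.h>
#include <stdbool.h>

#define NPATHS 3

struct rr_state {
	bool failed[NPATHS];	/* set when a path is failed on IO error */
	int last;		/* index of the last path handed out */
};

/* Return the next usable path index, or -1 if every path has failed
 * (the all-paths-down case, where dm-mpath must queue or error). */
static int rr_select(struct rr_state *s)
{
	for (int i = 1; i <= NPATHS; i++) {
		int p = (s->last + i) % NPATHS;

		if (!s->failed[p]) {
			s->last = p;
			return p;
		}
	}
	return -1;
}
```

Failing a path (as multipath_end_io() does on a retryable error) just sets `failed[p]`; the selector then skips it on every subsequent pick, which is the "not selected again until reinstated" guarantee in selector-independent form.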
> Actually this patch may have another issue: if the default run queue
> delay (in this patch, it is 10ms) is too short, the timer may expire
> before any in-flight underlying request completes, then we may dequeue
> too quickly, and sequential IO performance can be hurt too.

That's the problem with any arbitrary timer-based solution.  No value
that you pick is going to be correct for all hardware.

> But my previous patch on github doesn't have this issue:
>
> https://github.com/ming1/linux/commit/dfd672c998283a110247152237a9916b8264f3ec
>
> Jens, what do you think of this issue? Or do we need to worry about
> it?

As your follow-up reply pointed out, blk_get_request_notify() is likely
the better way forward to address this race.

Mike
On Sat, Jan 20, 2018 at 10:52:01PM -0500, Mike Snitzer wrote:
> On Sat, Jan 20 2018 at  7:57pm -0500,
> Ming Lei <ming.lei@redhat.com> wrote:
>
> > [...]
> > It is too long, and my editor starts to highlight/complain about it, :-)
>
> Look at the lines immediately following it, yours isn't longer than
> them..

OK.

> > 1) when any STS_RESOURCE is returned from the underlying queue for a
> > while, either because of blk_get_request() or the underlying
> > .queue_rq(), will dm-mpath try to switch to another path?
>
> If dm_requeue_original_request() is used, and then we reenter
> multipath_clone_and_map(), then a new path may be selected, depending
> on whether there are multiple paths and which path selector is being
> used.

OK.

> > 3) is it possible to move the check of 'ti->type->busy' into
> > .clone_and_map_rq()? If it is possible, this way might be more
> > effective at detecting underlying queue busy.
>
> Can you elaborate further on why it'd make a difference?
>
> It'd be somewhat odd to push that busy check down into
> .clone_and_map_rq().
From the source code of multipath_busy() and multipath_clone_and_map():

1) they both check whether a path is available, so this part of
multipath_busy() could be done in multipath_clone_and_map() too, I guess.

2) multipath_busy() also checks whether there is any non-busy path
(underlying queue), but never records the non-busy path for further use:

- pgpath_busy() calls blk_lld_busy() to check if the underlying queue is
  busy; this is inherently racy, because the queue can become non-busy
  at the same time. Also, blk_lld_busy() always returns false in the
  blk-mq case.

- that means multipath_busy() isn't needed for blk-mq; for non-mq, the
  busy check may avoid wasting CPU on unnecessary allocation, which can
  be a bit expensive, but even that depends on the path selected in
  multipath_clone_and_map() not being busy.

That is why I raised the question in the previous email.

> Are you looking to check the selected path's blk-mq queue before
> calling blk_get_request()? Why is that any different than
> blk_get_request() returning busy?

Maybe the check of 'ti->type->busy' can be killed, at least in
dm_mq_queue_rq().

> > Actually this patch may have another issue: if the default run-queue
> > delay (in this patch, 10ms) is too short, the timer may expire before
> > any in-flight underlying request completes; then we may dequeue too
> > quickly, and sequential IO performance can be hurt too.
>
> That's the problem with any arbitrary timer-based solution. No value
> you pick is going to be correct for all hardware.
>
> > But my previous patch on github doesn't have this issue:
> >
> > https://github.com/ming1/linux/commit/dfd672c998283a110247152237a9916b8264f3ec
> >
> > Jens, what do you think of this issue? Or do we need to worry about
> > it?
>
> As your follow-up reply pointed out, blk_get_request_notify() is likely
> the better way forward to address this race.
Yeah, please ignore this patch; I will post the new patches soon for review, but they have to be against both the dm and block trees.
diff --git a/block/blk-core.c b/block/blk-core.c
index a2005a485335..ac4789c5e329 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -145,6 +145,7 @@ static const struct {
 	[BLK_STS_MEDIUM]	= { -ENODATA,	"critical medium" },
 	[BLK_STS_PROTECTION]	= { -EILSEQ,	"protection" },
 	[BLK_STS_RESOURCE]	= { -ENOMEM,	"kernel resource" },
+	[BLK_STS_DEV_RESOURCE]	= { -ENOMEM,	"device resource" },
 	[BLK_STS_AGAIN]		= { -EAGAIN,	"nonblocking retry" },
 
 	/* device mapper special case, should not leak out: */
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 01f271d40825..6e97e0bf8178 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1169,6 +1169,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	struct request *rq, *nxt;
 	bool no_tag = false;
 	int errors, queued;
+	blk_status_t ret = BLK_STS_OK;
 
 	if (list_empty(list))
 		return false;
@@ -1181,7 +1182,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	errors = queued = 0;
 	do {
 		struct blk_mq_queue_data bd;
-		blk_status_t ret;
 
 		rq = list_first_entry(list, struct request, queuelist);
 		if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
@@ -1226,7 +1226,8 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		}
 
 		ret = q->mq_ops->queue_rq(hctx, &bd);
-		if (ret == BLK_STS_RESOURCE) {
+		if ((ret == BLK_STS_RESOURCE) ||
+		    (ret == BLK_STS_DEV_RESOURCE)) {
 			/*
 			 * If an I/O scheduler has been configured and we got a
 			 * driver tag for the next request already, free it
@@ -1257,6 +1258,8 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	 * that is where we will continue on next queue run.
 	 */
 	if (!list_empty(list)) {
+		bool needs_restart;
+
 		spin_lock(&hctx->lock);
 		list_splice_init(list, &hctx->dispatch);
 		spin_unlock(&hctx->lock);
@@ -1280,10 +1283,18 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		 * - Some but not all block drivers stop a queue before
 		 *   returning BLK_STS_RESOURCE. Two exceptions are scsi-mq
 		 *   and dm-rq.
+		 *
+		 * If drivers return BLK_STS_RESOURCE and the SCHED_RESTART
+		 * bit is set, run queue after 10ms for avoiding IO hang
+		 * because the queue may be idle and the RESTART mechanism
+		 * can't work any more.
 		 */
-		if (!blk_mq_sched_needs_restart(hctx) ||
+		needs_restart = blk_mq_sched_needs_restart(hctx);
+		if (!needs_restart ||
 		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
 			blk_mq_run_hw_queue(hctx, true);
+		else if (needs_restart && (ret == BLK_STS_RESOURCE))
+			blk_mq_delay_run_hw_queue(hctx, 10);
 	}
 
 	return (queued + errors) != 0;
@@ -1764,6 +1775,7 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
 		*cookie = new_cookie;
 		break;
 	case BLK_STS_RESOURCE:
+	case BLK_STS_DEV_RESOURCE:
 		__blk_mq_requeue_request(rq);
 		break;
 	default:
@@ -1826,7 +1838,7 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 
 	hctx_lock(hctx, &srcu_idx);
 	ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
-	if (ret == BLK_STS_RESOURCE)
+	if ((ret == BLK_STS_RESOURCE) || (ret == BLK_STS_DEV_RESOURCE))
 		blk_mq_sched_insert_request(rq, false, true, false);
 	else if (ret != BLK_STS_OK)
 		blk_mq_end_request(rq, ret);
diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 6655893a3a7a..287a09611c0f 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -1230,7 +1230,7 @@ static blk_status_t null_handle_cmd(struct nullb_cmd *cmd)
 			return BLK_STS_OK;
 		} else
 			/* requeue request */
-			return BLK_STS_RESOURCE;
+			return BLK_STS_DEV_RESOURCE;
 	}
 }
 
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 68846897d213..79908e6ddbf2 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -276,7 +276,7 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
 		/* Out of mem doesn't actually happen, since we fall back
 		 * to direct descriptors */
 		if (err == -ENOMEM || err == -ENOSPC)
-			return BLK_STS_RESOURCE;
+			return BLK_STS_DEV_RESOURCE;
 		return BLK_STS_IOERR;
 	}
 
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 891265acb10e..e126e4cac2ca 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -911,7 +911,7 @@ static blk_status_t blkif_queue_rq(struct blk_mq_hw_ctx *hctx,
 out_busy:
 	blk_mq_stop_hw_queue(hctx);
 	spin_unlock_irqrestore(&rinfo->ring_lock, flags);
-	return BLK_STS_RESOURCE;
+	return BLK_STS_DEV_RESOURCE;
 }
 
 static void blkif_complete_rq(struct request *rq)
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index d9ca1dfab154..55be2550c555 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2030,9 +2030,9 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 	case BLK_STS_OK:
 		break;
 	case BLK_STS_RESOURCE:
-		if (atomic_read(&sdev->device_busy) == 0 &&
-		    !scsi_device_blocked(sdev))
-			blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
+		if (atomic_read(&sdev->device_busy) ||
+		    scsi_device_blocked(sdev))
+			ret = BLK_STS_DEV_RESOURCE;
 		break;
 	default:
 		/*
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index c5d3db0d83f8..d76f1744a7ca 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -39,6 +39,13 @@ typedef u8 __bitwise blk_status_t;
 
 #define BLK_STS_AGAIN		((__force blk_status_t)12)
 
+/*
+ * This status is returned from driver to block layer if device related
+ * resource is run out of, but driver can guarantee that IO dispatch is
+ * triggered in future when the resource is available.
+ */
+#define BLK_STS_DEV_RESOURCE	((__force blk_status_t)13)
+
 /**
  * blk_path_error - returns true if error may be path related
  * @error: status the request was completed with