Message ID | 20180118163707.11825-1-bart.vanassche@wdc.com (mailing list archive) |
---|---|
State | New, archived |
On Thu, Jan 18 2018 at 11:37am -0500, Bart Van Assche <bart.vanassche@wdc.com> wrote:

> If the .queue_rq() implementation of a block driver returns
> BLK_STS_RESOURCE then that block driver is responsible for
> rerunning the queue once the condition that caused it to return
> BLK_STS_RESOURCE has been cleared. The dm-mpath driver tells the
> dm core to requeue a request if e.g. not enough memory is
> available for cloning a request or if the underlying path is
> busy. Since the dm-mpath driver does not receive any kind of
> notification if the condition that caused it to return "requeue"
> is cleared, the only solution to avoid that dm-mpath request
> processing stalls is to call blk_mq_delay_run_hw_queue(). Hence
> this patch.
>
> Fixes: ec3eaf9a6731 ("dm mpath: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE")
> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/md/dm-rq.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> index f16096af879a..c59c59cfd2a5 100644
> --- a/drivers/md/dm-rq.c
> +++ b/drivers/md/dm-rq.c
> @@ -761,6 +761,7 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
>  		/* Undo dm_start_request() before requeuing */
>  		rq_end_stats(md, rq);
>  		rq_completed(md, rq_data_dir(rq), false);
> +		blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
>  		return BLK_STS_RESOURCE;
>  	}
>
> --
> 2.15.1
>

Sorry but we need to understand why you still need this.

The issue you say it was originally intended to fix _should_ be
addressed with this change:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=4dd6edd23e7ea971efddc303f9e67eb79e95808e

So if you still feel you need to blindly kick the queue then it is very
likely a bug in blk-mq (either its RESTART heuristics or whatever
internal implementation is lacking).

Did you try Ming's RFC patch to "fixup RESTART" before resorting to the
above again? See:
https://patchwork.kernel.org/patch/10172315/

Thanks,
Mike
On Thu, 2018-01-18 at 11:50 -0500, Mike Snitzer wrote:
> The issue you say it was originally intended to fix _should_ be
> addressed with this change:
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=4dd6edd23e7ea971efddc303f9e67eb79e95808e

Hello Mike,

Sorry but I'm not convinced that that patch is sufficient. That patch helps
if .end_io() is called with status BLK_STS_RESOURCE and also if
blk_insert_cloned_request() returns the .queue_rq() return value. It does not
help if .queue_rq() returns BLK_STS_RESOURCE and that return value gets
ignored.

I think that can happen as follows:
- Request cloning in multipath_clone_and_map() succeeds and that function
  returns DM_MAPIO_REMAPPED.
- dm_dispatch_clone_request() calls blk_insert_cloned_request().
- blk_insert_cloned_request() calls blk_mq_request_direct_issue(), which
  results in a call of __blk_mq_try_issue_directly().
- __blk_mq_try_issue_directly() calls blk_mq_sched_insert_request(). In this
  case the BLK_STS_RESOURCE returned by the .queue_rq() implementation of the
  underlying path will be ignored.

Please note that I have not tried to find all paths that ignore the
.queue_rq() return value.

Thanks,

Bart.
On Thu, Jan 18, 2018 at 08:37:07AM -0800, Bart Van Assche wrote:
> If the .queue_rq() implementation of a block driver returns
> BLK_STS_RESOURCE then that block driver is responsible for
> rerunning the queue once the condition that caused it to return
> BLK_STS_RESOURCE has been cleared. The dm-mpath driver tells the
> dm core to requeue a request if e.g. not enough memory is
> available for cloning a request or if the underlying path is
> busy. Since the dm-mpath driver does not receive any kind of
> notification if the condition that caused it to return "requeue"
> is cleared, the only solution to avoid that dm-mpath request
> processing stalls is to call blk_mq_delay_run_hw_queue(). Hence
> this patch.
>
> Fixes: ec3eaf9a6731 ("dm mpath: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE")
> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/md/dm-rq.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> index f16096af879a..c59c59cfd2a5 100644
> --- a/drivers/md/dm-rq.c
> +++ b/drivers/md/dm-rq.c
> @@ -761,6 +761,7 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
>  		/* Undo dm_start_request() before requeuing */
>  		rq_end_stats(md, rq);
>  		rq_completed(md, rq_data_dir(rq), false);
> +		blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
>  		return BLK_STS_RESOURCE;
>  	}
>

Nak.

It still takes a bit of time to add this request to hctx->dispatch_list
after this point, so suppose that time ends up longer than 100ms because of
an interrupt, preemption or whatever: then this request won't be seen by the
scheduled queue run (__blk_mq_run_hw_queue).

Not to mention it is just an ugly workaround, which degrades performance a
lot.
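In other words, a rough sketch of the ordering Ming is describing (illustrative only; both functions are simplified paraphrases, not the actual dm-rq or blk-mq source):

/* Callee side (dm-rq), as in the patch above: */
static blk_status_t queue_rq_requeue_sketch(struct blk_mq_hw_ctx *hctx)
{
        blk_mq_delay_run_hw_queue(hctx, 100);  /* run fires ~100ms from now */
        return BLK_STS_RESOURCE;               /* caller requeues... later */
}

/* Caller side (blk-mq dispatch loop), much simplified: */
static void dispatch_one_rq_sketch(struct blk_mq_hw_ctx *hctx,
                                   struct request *rq, struct list_head *list)
{
        struct blk_mq_queue_data bd = { .rq = rq, .last = true };
        blk_status_t ret = hctx->queue->mq_ops->queue_rq(hctx, &bd);

        if (ret == BLK_STS_RESOURCE) {
                /*
                 * If we are held up here (interrupt, preemption) for longer
                 * than the 100ms above, the scheduled run fires first, sees
                 * nothing to dispatch, and the request sits until some other
                 * event kicks the queue.
                 */
                list_add(&rq->queuelist, list);
        }
}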
On Fri, 2018-01-19 at 08:11 +0800, Ming Lei wrote:
> On Thu, Jan 18, 2018 at 08:37:07AM -0800, Bart Van Assche wrote:
> > diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> > index f16096af879a..c59c59cfd2a5 100644
> > --- a/drivers/md/dm-rq.c
> > +++ b/drivers/md/dm-rq.c
> > @@ -761,6 +761,7 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
> >  		/* Undo dm_start_request() before requeuing */
> >  		rq_end_stats(md, rq);
> >  		rq_completed(md, rq_data_dir(rq), false);
> > +		blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
> >  		return BLK_STS_RESOURCE;
> >  	}
> >
>
> Nak.

This patch fixes a regression that was introduced by you. You should know
that regressions are not acceptable. If you don't agree with this patch,
please fix the root cause.

Bart.
On Fri, Jan 19, 2018 at 12:14:24AM +0000, Bart Van Assche wrote:
> On Fri, 2018-01-19 at 08:11 +0800, Ming Lei wrote:
> > On Thu, Jan 18, 2018 at 08:37:07AM -0800, Bart Van Assche wrote:
> > > diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> > > index f16096af879a..c59c59cfd2a5 100644
> > > --- a/drivers/md/dm-rq.c
> > > +++ b/drivers/md/dm-rq.c
> > > @@ -761,6 +761,7 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
> > >  		/* Undo dm_start_request() before requeuing */
> > >  		rq_end_stats(md, rq);
> > >  		rq_completed(md, rq_data_dir(rq), false);
> > > +		blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
> > >  		return BLK_STS_RESOURCE;
> > >  	}
> > >
> >
> > Nak.
>
> This patch fixes a regression that was introduced by you. You should know
> that regressions are not acceptable. If you don't agree with this patch,
> please fix the root cause.

Yesterday I sent a patch, did you test that?
On Fri, 2018-01-19 at 08:18 +0800, Ming Lei wrote:
> On Fri, Jan 19, 2018 at 12:14:24AM +0000, Bart Van Assche wrote:
> > On Fri, 2018-01-19 at 08:11 +0800, Ming Lei wrote:
> > > On Thu, Jan 18, 2018 at 08:37:07AM -0800, Bart Van Assche wrote:
> > > > diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> > > > index f16096af879a..c59c59cfd2a5 100644
> > > > --- a/drivers/md/dm-rq.c
> > > > +++ b/drivers/md/dm-rq.c
> > > > @@ -761,6 +761,7 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
> > > >  		/* Undo dm_start_request() before requeuing */
> > > >  		rq_end_stats(md, rq);
> > > >  		rq_completed(md, rq_data_dir(rq), false);
> > > > +		blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
> > > >  		return BLK_STS_RESOURCE;
> > > >  	}
> > > >
> > >
> > > Nak.
> >
> > This patch fixes a regression that was introduced by you. You should know
> > that regressions are not acceptable. If you don't agree with this patch,
> > please fix the root cause.
>
> Yesterday I sent a patch, did you test that?

Yes, I did. It caused queue stalls that were so bad that sending "kick" to
the debugfs "state" attribute was not sufficient to resolve the queue stall.

Bart.
On 1/18/18 5:18 PM, Ming Lei wrote:
> On Fri, Jan 19, 2018 at 12:14:24AM +0000, Bart Van Assche wrote:
>> On Fri, 2018-01-19 at 08:11 +0800, Ming Lei wrote:
>>> On Thu, Jan 18, 2018 at 08:37:07AM -0800, Bart Van Assche wrote:
>>>> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
>>>> index f16096af879a..c59c59cfd2a5 100644
>>>> --- a/drivers/md/dm-rq.c
>>>> +++ b/drivers/md/dm-rq.c
>>>> @@ -761,6 +761,7 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
>>>>  		/* Undo dm_start_request() before requeuing */
>>>>  		rq_end_stats(md, rq);
>>>>  		rq_completed(md, rq_data_dir(rq), false);
>>>> +		blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
>>>>  		return BLK_STS_RESOURCE;
>>>>  	}
>>>>
>>>
>>> Nak.
>>
>> This patch fixes a regression that was introduced by you. You should know
>> that regressions are not acceptable. If you don't agree with this patch,
>> please fix the root cause.
>
> Yesterday I sent a patch, did you test that?

That patch isn't going to be of much help. It might prevent you from
completely stalling, but it might take you tens of seconds to get there.

On top of that, it's a rolling timer that gets added when IO is queued,
and re-added if IO is pending when it fires. If the latter case is not
true, then it won't get armed again. So if IO fails to queue without
any other IO pending, you're pretty much in the same situation as if
you marked the queue as restart. Nobody is going to be noticing either
of those conditions.
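For readers without the RFC patch at hand, here is a rough model of the rolling-timer scheme Jens is describing. The names, the 100ms period and the pending counter are illustrative assumptions, not the RFC's actual code:

#include <linux/blk-mq.h>
#include <linux/timer.h>
#include <linux/jiffies.h>
#include <linux/atomic.h>

struct queue_watchdog {
        struct timer_list       timer;
        struct blk_mq_hw_ctx    *hctx;
        atomic_t                nr_pending;  /* IO queued but not completed */
};

static void queue_watchdog_fire(struct timer_list *t)
{
        struct queue_watchdog *wd = from_timer(wd, t, timer);

        if (atomic_read(&wd->nr_pending)) {
                /* IO still pending: kick the queue and roll the timer. */
                blk_mq_run_hw_queue(wd->hctx, true);
                mod_timer(&wd->timer, jiffies + msecs_to_jiffies(100));
        }
        /*
         * Otherwise the timer is not re-armed -- which is the gap Jens points
         * out: a later failed queue attempt with no other IO pending leaves
         * nothing around to kick the queue again.
         */
}

static void queue_watchdog_io_queued(struct queue_watchdog *wd)
{
        atomic_inc(&wd->nr_pending);
        /* (Re)arm the timer on each queued IO. */
        mod_timer(&wd->timer, jiffies + msecs_to_jiffies(100));
}

static void queue_watchdog_init(struct queue_watchdog *wd,
                                struct blk_mq_hw_ctx *hctx)
{
        wd->hctx = hctx;
        atomic_set(&wd->nr_pending, 0);
        timer_setup(&wd->timer, queue_watchdog_fire, 0);
}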
On Thu, Jan 18, 2018 at 05:13:53PM +0000, Bart Van Assche wrote:
> On Thu, 2018-01-18 at 11:50 -0500, Mike Snitzer wrote:
> > The issue you say it was originally intended to fix _should_ be
> > addressed with this change:
> > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=4dd6edd23e7ea971efddc303f9e67eb79e95808e
>
> Hello Mike,
>
> Sorry but I'm not convinced that that patch is sufficient. That patch helps
> if .end_io() is called with status BLK_STS_RESOURCE and also if
> blk_insert_cloned_request() returns the .queue_rq() return value. It does not
> help if .queue_rq() returns BLK_STS_RESOURCE and that return value gets
> ignored.

The return value from .queue_rq() is handled by blk-mq, so why do you think
it can be ignored? Please see blk_mq_dispatch_rq_list().

> I think that can happen as follows:
> - Request cloning in multipath_clone_and_map() succeeds and that function
>   returns DM_MAPIO_REMAPPED.
> - dm_dispatch_clone_request() calls blk_insert_cloned_request().
> - blk_insert_cloned_request() calls blk_mq_request_direct_issue(), which
>   results in a call of __blk_mq_try_issue_directly().
> - __blk_mq_try_issue_directly() calls blk_mq_sched_insert_request(). In this

This only happens if the queue is stopped or quiesced, and in that case we
return BLK_STS_OK to blk-mq via .queue_rq(); please see
__blk_mq_try_issue_directly(). How does this cause an IO hang?

> case the BLK_STS_RESOURCE returned by the .queue_rq() implementation of the
> underlying path will be ignored.

No, this case won't return BLK_STS_RESOURCE.
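For reference, a condensed sketch of the decision Ming is pointing at in __blk_mq_try_issue_directly(). This is a paraphrase of the era's blk-mq logic as described in the thread, not a verbatim copy of the kernel source, and the helper name is made up:

static blk_status_t try_issue_directly_sketch(struct blk_mq_hw_ctx *hctx,
                                              struct request *rq)
{
        struct blk_mq_queue_data bd = { .rq = rq, .last = true };

        if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(rq->q)) {
                /*
                 * Fall back to normal insertion (blk_mq_sched_insert_request())
                 * and report success: the lower device's .queue_rq() was never
                 * called, so no BLK_STS_RESOURCE can be lost on this branch.
                 */
                return BLK_STS_OK;
        }

        /*
         * Otherwise issue now; the driver's status goes back to the caller,
         * which requeues the request on BLK_STS_RESOURCE (see
         * blk_mq_dispatch_rq_list()).
         */
        return hctx->queue->mq_ops->queue_rq(hctx, &bd);
}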
On Fri, 2018-01-19 at 08:26 +0800, Ming Lei wrote:
> No, this case won't return BLK_STS_RESOURCE.
It's possible that my attempt to reverse engineer the latest blk-mq changes
was wrong. But the queue stall is real and needs to be fixed.
Bart.
On Thu, Jan 18, 2018 at 05:24:29PM -0700, Jens Axboe wrote:
> On 1/18/18 5:18 PM, Ming Lei wrote:
> > On Fri, Jan 19, 2018 at 12:14:24AM +0000, Bart Van Assche wrote:
> >> On Fri, 2018-01-19 at 08:11 +0800, Ming Lei wrote:
> >>> On Thu, Jan 18, 2018 at 08:37:07AM -0800, Bart Van Assche wrote:
> >>>> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> >>>> index f16096af879a..c59c59cfd2a5 100644
> >>>> --- a/drivers/md/dm-rq.c
> >>>> +++ b/drivers/md/dm-rq.c
> >>>> @@ -761,6 +761,7 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
> >>>>  		/* Undo dm_start_request() before requeuing */
> >>>>  		rq_end_stats(md, rq);
> >>>>  		rq_completed(md, rq_data_dir(rq), false);
> >>>> +		blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
> >>>>  		return BLK_STS_RESOURCE;
> >>>>  	}
> >>>>
> >>>
> >>> Nak.
> >>
> >> This patch fixes a regression that was introduced by you. You should know
> >> that regressions are not acceptable. If you don't agree with this patch,
> >> please fix the root cause.
> >
> > Yesterday I sent a patch, did you test that?
>
> That patch isn't going to be of much help. It might prevent you from
> completely stalling, but it might take you tens of seconds to get there.

Yeah, that is true, and that is why I marked it as RFC, but the case
rarely happens.

> On top of that, it's a rolling timer that gets added when IO is queued,
> and re-added if IO is pending when it fires. If the latter case is not
> true, then it won't get armed again. So if IO fails to queue without
> any other IO pending, you're pretty much in the same situation as if

Detecting that no IO is pending may take a bit of time, but we can do it
after BLK_STS_RESOURCE is returned and SCHED_RESTART is set. I have done
it that way in the following patch, what do you think of it?

https://github.com/ming1/linux/commit/f866ce0b97b0ae22d033881da7eb07706fd458b4
On 1/18/18 5:35 PM, Ming Lei wrote:
> On Thu, Jan 18, 2018 at 05:24:29PM -0700, Jens Axboe wrote:
>> On 1/18/18 5:18 PM, Ming Lei wrote:
>>> On Fri, Jan 19, 2018 at 12:14:24AM +0000, Bart Van Assche wrote:
>>>> On Fri, 2018-01-19 at 08:11 +0800, Ming Lei wrote:
>>>>> On Thu, Jan 18, 2018 at 08:37:07AM -0800, Bart Van Assche wrote:
>>>>>> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
>>>>>> index f16096af879a..c59c59cfd2a5 100644
>>>>>> --- a/drivers/md/dm-rq.c
>>>>>> +++ b/drivers/md/dm-rq.c
>>>>>> @@ -761,6 +761,7 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
>>>>>>  		/* Undo dm_start_request() before requeuing */
>>>>>>  		rq_end_stats(md, rq);
>>>>>>  		rq_completed(md, rq_data_dir(rq), false);
>>>>>> +		blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
>>>>>>  		return BLK_STS_RESOURCE;
>>>>>>  	}
>>>>>>
>>>>>
>>>>> Nak.
>>>>
>>>> This patch fixes a regression that was introduced by you. You should know
>>>> that regressions are not acceptable. If you don't agree with this patch,
>>>> please fix the root cause.
>>>
>>> Yesterday I sent a patch, did you test that?
>>
>> That patch isn't going to be of much help. It might prevent you from
>> completely stalling, but it might take you tens of seconds to get there.
>
> Yeah, that is true, and that is why I marked it as RFC, but the case
> rarely happens.

You don't know that. If the resource is very limited, or someone else
is gobbling up all of it, then it may trigger quite often. Granted,
in that case, you need some way of signaling the freeing of those
resources to make it efficient.

>> On top of that, it's a rolling timer that gets added when IO is queued,
>> and re-added if IO is pending when it fires. If the latter case is not
>> true, then it won't get armed again. So if IO fails to queue without
>> any other IO pending, you're pretty much in the same situation as if
>
> Detecting that no IO is pending may take a bit of time, but we can do it
> after BLK_STS_RESOURCE is returned and SCHED_RESTART is set. I have done
> it that way in the following patch, what do you think of it?
>
> https://github.com/ming1/linux/commit/f866ce0b97b0ae22d033881da7eb07706fd458b4

I think it's overly complicated. As I wrote in a lengthier email earlier
today, just ensure that we have a unique return code from the driver for
when it aborts queuing an IO due to a resource shortage that isn't tied
to its own consumption of it. When we get that return, do a delayed
queue run after X amount of time to ensure we always try again. If you
want to get fancy (later on), you could track the frequency of such
returns and complain if it's too high.

There's no point in adding a lot of code to check whether we need to
run the queue or not. Just always do it. We won't be doing it more than
100 times per second for the worst case of the condition taking a while
to clear, if we stick to the 10ms re-run time.
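To make the scheme Jens sketches concrete, here is a minimal illustration of how such a status might be handled in the dispatch path. The status name BLK_STS_DEV_RESOURCE, its value, and the 10ms delay are assumptions made for this sketch only; this is not code from the tree under discussion:

#include <linux/blk-mq.h>
#include <linux/blkdev.h>

/* Assumed new status code -- not defined in the tree under discussion. */
#define BLK_STS_DEV_RESOURCE    ((blk_status_t)13)

static void handle_queue_rq_status(struct blk_mq_hw_ctx *hctx,
                                   struct request *rq, blk_status_t ret,
                                   struct list_head *dispatch_list)
{
        switch (ret) {
        case BLK_STS_OK:
                break;
        case BLK_STS_RESOURCE:
                /*
                 * Driver-owned resource: a completion on this queue will set
                 * off SCHED_RESTART, so rely on that to re-run.
                 */
                list_add(&rq->queuelist, dispatch_list);
                break;
        case BLK_STS_DEV_RESOURCE:
                /*
                 * Resource outside this queue's control: no completion will
                 * ever kick us, so always schedule a delayed re-run.
                 */
                list_add(&rq->queuelist, dispatch_list);
                blk_mq_delay_run_hw_queue(hctx, 10 /* ms, assumed */);
                break;
        default:
                blk_mq_end_request(rq, ret);    /* fail the request */
                break;
        }
}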
On Thu, Jan 18, 2018 at 05:49:10PM -0700, Jens Axboe wrote:
> On 1/18/18 5:35 PM, Ming Lei wrote:
> > On Thu, Jan 18, 2018 at 05:24:29PM -0700, Jens Axboe wrote:
> >> On 1/18/18 5:18 PM, Ming Lei wrote:
> >>> On Fri, Jan 19, 2018 at 12:14:24AM +0000, Bart Van Assche wrote:
> >>>> On Fri, 2018-01-19 at 08:11 +0800, Ming Lei wrote:
> >>>>> On Thu, Jan 18, 2018 at 08:37:07AM -0800, Bart Van Assche wrote:
> >>>>>> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> >>>>>> index f16096af879a..c59c59cfd2a5 100644
> >>>>>> --- a/drivers/md/dm-rq.c
> >>>>>> +++ b/drivers/md/dm-rq.c
> >>>>>> @@ -761,6 +761,7 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
> >>>>>>  		/* Undo dm_start_request() before requeuing */
> >>>>>>  		rq_end_stats(md, rq);
> >>>>>>  		rq_completed(md, rq_data_dir(rq), false);
> >>>>>> +		blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
> >>>>>>  		return BLK_STS_RESOURCE;
> >>>>>>  	}
> >>>>>>
> >>>>>
> >>>>> Nak.
> >>>>
> >>>> This patch fixes a regression that was introduced by you. You should know
> >>>> that regressions are not acceptable. If you don't agree with this patch,
> >>>> please fix the root cause.
> >>>
> >>> Yesterday I sent a patch, did you test that?
> >>
> >> That patch isn't going to be of much help. It might prevent you from
> >> completely stalling, but it might take you tens of seconds to get there.
> >
> > Yeah, that is true, and that is why I marked it as RFC, but the case
> > rarely happens.
>
> You don't know that. If the resource is very limited, or someone else
> is gobbling up all of it, then it may trigger quite often. Granted,
> in that case, you need some way of signaling the freeing of those
> resources to make it efficient.
>
> >> On top of that, it's a rolling timer that gets added when IO is queued,
> >> and re-added if IO is pending when it fires. If the latter case is not
> >> true, then it won't get armed again. So if IO fails to queue without
> >> any other IO pending, you're pretty much in the same situation as if
> >
> > Detecting that no IO is pending may take a bit of time, but we can do it
> > after BLK_STS_RESOURCE is returned and SCHED_RESTART is set. I have done
> > it that way in the following patch, what do you think of it?
> >
> > https://github.com/ming1/linux/commit/f866ce0b97b0ae22d033881da7eb07706fd458b4
>
> I think it's overly complicated. As I wrote in a lengthier email earlier
> today, just ensure that we have a unique return code from the driver for
> when it aborts queuing an IO due to a resource shortage that isn't tied
> to its own consumption of it. When we get that return, do a delayed
> queue run after X amount of time to ensure we always try again. If you
> want to get fancy (later on), you could track the frequency of such
> returns and complain if it's too high.

Suppose the newly introduced return code is BLK_STS_NO_DEV_RESOURCE. This
approach may degrade performance. For example, DM-MPATH returns
BLK_STS_NO_DEV_RESOURCE when blk_get_request() returns NULL, and blk-mq
handles it with blk_mq_delay_run_hw_queue(10ms). blk_mq_sched_restart() is
only triggered when one DM-MPATH request is completed, which means one
request of the underlying queue has completed too, but blk_mq_sched_restart()
still can't make progress during that 10ms.

That means we should only run blk_mq_delay_run_hw_queue(10ms/100ms) when the
queue is idle.

I will post the patch on github for discussion.

Thanks,
Ming
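A sketch of the refinement Ming suggests here, again illustrative only: the queue_idle parameter stands in for whatever in-flight accounting blk-mq would actually consult before deciding that no completion is coming to restart the queue.

static void requeue_dev_resource(struct blk_mq_hw_ctx *hctx,
                                 struct request *rq,
                                 struct list_head *dispatch_list,
                                 bool queue_idle)
{
        list_add(&rq->queuelist, dispatch_list);

        if (queue_idle)
                /* Nothing in flight can restart us: fall back to polling. */
                blk_mq_delay_run_hw_queue(hctx, 10 /* ms, assumed */);
        /* else: SCHED_RESTART via the next completion re-runs the queue. */
}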
If the .queue_rq() implementation of a block driver returns
BLK_STS_RESOURCE then that block driver is responsible for
rerunning the queue once the condition that caused it to return
BLK_STS_RESOURCE has been cleared. The dm-mpath driver tells the
dm core to requeue a request if e.g. not enough memory is
available for cloning a request or if the underlying path is
busy. Since the dm-mpath driver does not receive any kind of
notification if the condition that caused it to return "requeue"
is cleared, the only solution to avoid that dm-mpath request
processing stalls is to call blk_mq_delay_run_hw_queue(). Hence
this patch.

Fixes: ec3eaf9a6731 ("dm mpath: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>
---
 drivers/md/dm-rq.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index f16096af879a..c59c59cfd2a5 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -761,6 +761,7 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
 		/* Undo dm_start_request() before requeuing */
 		rq_end_stats(md, rq);
 		rq_completed(md, rq_data_dir(rq), false);
+		blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
 		return BLK_STS_RESOURCE;
 	}