Message ID | 20180130142459.52668-1-snitzer@redhat.com (mailing list archive) |
---|---|
State | Not Applicable |
Headers | show |
On 01/30/18 06:24, Mike Snitzer wrote: > + * > + * If driver returns BLK_STS_RESOURCE and SCHED_RESTART > + * bit is set, run queue after a delay to avoid IO stalls > + * that could otherwise occur if the queue is idle. > */ > - if (!blk_mq_sched_needs_restart(hctx) || > + needs_restart = blk_mq_sched_needs_restart(hctx); > + if (!needs_restart || > (no_tag && list_empty_careful(&hctx->dispatch_wait.entry))) > blk_mq_run_hw_queue(hctx, true); > + else if (needs_restart && (ret == BLK_STS_RESOURCE)) > + blk_mq_delay_run_hw_queue(hctx, BLK_MQ_QUEUE_DELAY); > } If a request completes concurrently with execution of the above code then the request completion will trigger a call of blk_mq_sched_restart_hctx() and that call will clear the BLK_MQ_S_SCHED_RESTART bit. If that bit is cleared before the above code tests it then the above code will schedule an asynchronous call of __blk_mq_run_hw_queue(). If the .queue_rq() call triggered by the new queue run returns again BLK_STS_RESOURCE then the above code will be executed again. In other words, a loop occurs. That loop will repeat as long as the described race occurs. The current (kernel v4.15) block layer behavior is simpler: only block drivers call blk_mq_delay_run_hw_queue() and the block layer core never calls that function. Hence that loop cannot occur with the v4.15 block layer core and block drivers. A motivation of why that loop is preferred compared to the current behavior (no loop) is missing. Does this mean that that loop is a needless complication of the block layer core? Sorry but I still prefer the v4.15 block layer approach because this patch has in my view the following disadvantages: - It involves a blk-mq API change. API changes are always risky and need some time to stabilize. - The delay after which to rerun the queue is moved from block layer drivers into the block layer core. I think that's wrong because only the block driver authors can make a good choice for this constant. - This patch makes block drivers harder to understand. Anyone who sees return BLK_STS_RESOURCE / return BLK_STS_DEV_RESOURCE for the first time will have to look up the meaning of these constants. An explicit blk_mq_delay_run_hw_queue() call is easier to understand. - This patch makes the blk-mq core harder to understand because of the loop mentioned above. - This patch does not fix any bugs nor makes block drivers easier to read or to implement. So why is this patch considered useful? Thanks, Bart.
On Tue, 2018-01-30 at 09:52 -0800, Bart Van Assche wrote: > On 01/30/18 06:24, Mike Snitzer wrote: > > + * > > + * If driver returns BLK_STS_RESOURCE and > > SCHED_RESTART > > + * bit is set, run queue after a delay to avoid IO > > stalls > > + * that could otherwise occur if the queue is > > idle. > > */ > > - if (!blk_mq_sched_needs_restart(hctx) || > > + needs_restart = blk_mq_sched_needs_restart(hctx); > > + if (!needs_restart || > > (no_tag && list_empty_careful(&hctx- > > >dispatch_wait.entry))) > > blk_mq_run_hw_queue(hctx, true); > > + else if (needs_restart && (ret == > > BLK_STS_RESOURCE)) > > + blk_mq_delay_run_hw_queue(hctx, > > BLK_MQ_QUEUE_DELAY); > > } > > If a request completes concurrently with execution of the above code > then the request completion will trigger a call of > blk_mq_sched_restart_hctx() and that call will clear the > BLK_MQ_S_SCHED_RESTART bit. If that bit is cleared before the above > code > tests it then the above code will schedule an asynchronous call of > __blk_mq_run_hw_queue(). If the .queue_rq() call triggered by the > new > queue run returns again BLK_STS_RESOURCE then the above code will be > executed again. In other words, a loop occurs. That loop will repeat > as > long as the described race occurs. The current (kernel v4.15) block > layer behavior is simpler: only block drivers call > blk_mq_delay_run_hw_queue() and the block layer core never calls > that > function. Hence that loop cannot occur with the v4.15 block layer > core > and block drivers. A motivation of why that loop is preferred > compared > to the current behavior (no loop) is missing. Does this mean that > that > loop is a needless complication of the block layer core? > > Sorry but I still prefer the v4.15 block layer approach because this > patch has in my view the following disadvantages: > - It involves a blk-mq API change. API changes are always risky and > need > some time to stabilize. > - The delay after which to rerun the queue is moved from block layer > drivers into the block layer core. I think that's wrong because > only > the block driver authors can make a good choice for this constant. > - This patch makes block drivers harder to understand. Anyone who > sees > return BLK_STS_RESOURCE / return BLK_STS_DEV_RESOURCE for the > first > time will have to look up the meaning of these constants. An > explicit > blk_mq_delay_run_hw_queue() call is easier to understand. > - This patch makes the blk-mq core harder to understand because of > the > loop mentioned above. > - This patch does not fix any bugs nor makes block drivers easier to > read or to implement. So why is this patch considered useful? > > Thanks, > > Bart. Hello Bart What is the performance implication of your method versus this patch above. Is there more of a delay in your approach or in Ming's approach from your own testing. I guess it seems we will never get consensus on this so is it time to take a vote. I respect and trust your inputs, you know that, but are you perhaps prepared to accept the approach above and agree to it and if it turns out to expose more issues it can be addressed later. Is that not a better way to progress this because to me it looks like you and Ming will continue to disagree on which is the better approach. With much respect Laurence
On Tue, Jan 30 2018 at 12:52pm -0500, Bart Van Assche <bart.vanassche@wdc.com> wrote: > On 01/30/18 06:24, Mike Snitzer wrote: > >+ * > >+ * If driver returns BLK_STS_RESOURCE and SCHED_RESTART > >+ * bit is set, run queue after a delay to avoid IO stalls > >+ * that could otherwise occur if the queue is idle. > > */ > >- if (!blk_mq_sched_needs_restart(hctx) || > >+ needs_restart = blk_mq_sched_needs_restart(hctx); > >+ if (!needs_restart || > > (no_tag && list_empty_careful(&hctx->dispatch_wait.entry))) > > blk_mq_run_hw_queue(hctx, true); > >+ else if (needs_restart && (ret == BLK_STS_RESOURCE)) > >+ blk_mq_delay_run_hw_queue(hctx, BLK_MQ_QUEUE_DELAY); > > } > > If a request completes concurrently with execution of the above code > then the request completion will trigger a call of > blk_mq_sched_restart_hctx() and that call will clear the > BLK_MQ_S_SCHED_RESTART bit. If that bit is cleared before the above > code tests it then the above code will schedule an asynchronous call > of __blk_mq_run_hw_queue(). If the .queue_rq() call triggered by the > new queue run returns again BLK_STS_RESOURCE then the above code > will be executed again. In other words, a loop occurs. That loop > will repeat as long as the described race occurs. The current > (kernel v4.15) block layer behavior is simpler: only block drivers > call blk_mq_delay_run_hw_queue() and the block layer core never > calls that function. Hence that loop cannot occur with the v4.15 > block layer core and block drivers. A motivation of why that loop is > preferred compared to the current behavior (no loop) is missing. > Does this mean that that loop is a needless complication of the > block layer core? No it means the loop is an internal blk-mq concern. And that drivers needn't worry about kicking the queue. And it affords blk-mq latitude to change how it responds to BLK_STS_RESOURCE in the future (without needing to change every driver). But even v4.15 has a similar loop. It just happens to extend into the .queue_rq() where the driver is completely blind to SCHED_RESTART. And the driver may just repeatedly kick the queue after a delay (via blk_mq_delay_run_hw_queue). This cycle should be a very rare occurrence regardless of which approach is taken (V5 vs 4.15). The fact that you engineered your SRP initiator and target code to pathologically trigger this worst case (via target_can_queue) doesn't mean it is the fast path for a properly configured and functioning storage network. > Sorry but I still prefer the v4.15 block layer approach because this > patch has in my view the following disadvantages: > - It involves a blk-mq API change. API changes are always risky and need > some time to stabilize. The number of blk-mq API changes that have occurred since blk-mq was introduced is a very long list. Seems contrived to make this the one that is breaking the camel's back. > - The delay after which to rerun the queue is moved from block layer > drivers into the block layer core. I think that's wrong because only > the block driver authors can make a good choice for this constant. Unsubstantiated. 3ms (scsi-mq, nvmefc) vs 100ms (dm-mq mpath): which is better? Pretty sure if the underlying storage network is 1) performant 2) properly configured -- then a shorter delay is preferable. > - This patch makes block drivers harder to understand. Anyone who sees > return BLK_STS_RESOURCE / return BLK_STS_DEV_RESOURCE for the first > time will have to look up the meaning of these constants. An explicit > blk_mq_delay_run_hw_queue() call is easier to understand. No it doesn't make blk-mq harder to understand. But even if it did it actually acknowledges that there is existing blk-mq SCHED_RESTART heuristic for how to deal with the need for blk-mq to back-off in the face of BLK_STS_RESOURCE returns. By just having each blk-mq driver blindly kick the queue you're actively ignoring, and defeating, that entire design element of blk-mq (SCHED_RESTART). It is an instance where blk-mq can and does know better. Acknowledging that fact is moving blk-mq in a better direction. > - This patch makes the blk-mq core harder to understand because of the > loop mentioned above. You've said your peace. But you've taken on this campaign to undermine this line of development with such passion that we're now in a place where Jens is shell-shocked by all the repeat campaigning and noise. Bart you keep saying the same things over and over. Yet you cannot show the patch to actively be a problem with testing-based detail. Seems you'd rather refuse to even test it than open yourself up to the possibility that this concern of yours has been making a mountain out of a mole hill. > - This patch does not fix any bugs nor makes block drivers easier to > read or to implement. So why is this patch considered useful? It enables blk-mq core to own the problem that individual drivers should have no business needing to worry about. Period.
On Tue, 2018-01-30 at 14:33 -0500, Mike Snitzer wrote: > On Tue, Jan 30 2018 at 12:52pm -0500, > Bart Van Assche <bart.vanassche@wdc.com> wrote: > > > - This patch does not fix any bugs nor makes block drivers easier to > > read or to implement. So why is this patch considered useful? > > It enables blk-mq core to own the problem that individual drivers should > have no business needing to worry about. Period. Thanks for having confirmed that this patch is an API-change only and does not fix any bugs. Bart.
On Tue, Jan 30 2018 at 2:42pm -0500, Bart Van Assche <Bart.VanAssche@wdc.com> wrote: > On Tue, 2018-01-30 at 14:33 -0500, Mike Snitzer wrote: > > On Tue, Jan 30 2018 at 12:52pm -0500, > > Bart Van Assche <bart.vanassche@wdc.com> wrote: > > > > > - This patch does not fix any bugs nor makes block drivers easier to > > > read or to implement. So why is this patch considered useful? > > > > It enables blk-mq core to own the problem that individual drivers should > > have no business needing to worry about. Period. > > Thanks for having confirmed that this patch is an API-change only and does > not fix any bugs. No, it is an API change that enables blk-mq drivers to make forward progress without compromising the potential benefits associated with blk-mq's SCHED_RESTART functionality. (If we're going to beat this horse to death it might as well be with precision)
On Tue, Jan 30, 2018 at 09:52:31AM -0800, Bart Van Assche wrote: > On 01/30/18 06:24, Mike Snitzer wrote: > > + * > > + * If driver returns BLK_STS_RESOURCE and SCHED_RESTART > > + * bit is set, run queue after a delay to avoid IO stalls > > + * that could otherwise occur if the queue is idle. > > */ > > - if (!blk_mq_sched_needs_restart(hctx) || > > + needs_restart = blk_mq_sched_needs_restart(hctx); > > + if (!needs_restart || > > (no_tag && list_empty_careful(&hctx->dispatch_wait.entry))) > > blk_mq_run_hw_queue(hctx, true); > > + else if (needs_restart && (ret == BLK_STS_RESOURCE)) > > + blk_mq_delay_run_hw_queue(hctx, BLK_MQ_QUEUE_DELAY); > > } > > If a request completes concurrently with execution of the above code then > the request completion will trigger a call of blk_mq_sched_restart_hctx() > and that call will clear the BLK_MQ_S_SCHED_RESTART bit. If that bit is > cleared before the above code tests it then the above code will schedule an > asynchronous call of __blk_mq_run_hw_queue(). If the .queue_rq() call Right. > triggered by the new queue run returns again BLK_STS_RESOURCE then the above > code will be executed again. In other words, a loop occurs. That loop will This patch doesn't change anything about that. When BLK_STS_RESOURCE is returned, this request is added to hctx->dispatch, next time, before dispatching this request, SCHED_RESTART is set and the loop is cut. > repeat as long as the described race occurs. The current (kernel v4.15) > block layer behavior is simpler: only block drivers call > blk_mq_delay_run_hw_queue() and the block layer core never calls that > function. Hence that loop cannot occur with the v4.15 block layer core and That way isn't safe, I have explained to you in the following link: https://marc.info/?l=dm-devel&m=151727454815018&w=2 > block drivers. A motivation of why that loop is preferred compared to the > current behavior (no loop) is missing. Does this mean that that loop is a > needless complication of the block layer core? No such loop as I explained above. > > Sorry but I still prefer the v4.15 block layer approach because this patch > has in my view the following disadvantages: > - It involves a blk-mq API change. API changes are always risky and need > some time to stabilize. > - The delay after which to rerun the queue is moved from block layer > drivers into the block layer core. I think that's wrong because only > the block driver authors can make a good choice for this constant. > - This patch makes block drivers harder to understand. Anyone who sees > return BLK_STS_RESOURCE / return BLK_STS_DEV_RESOURCE for the first > time will have to look up the meaning of these constants. An explicit > blk_mq_delay_run_hw_queue() call is easier to understand. > - This patch makes the blk-mq core harder to understand because of the > loop mentioned above. > - This patch does not fix any bugs nor makes block drivers easier to > read or to implement. So why is this patch considered useful? It does avoid the race I mentioned in the following link: https://marc.info/?l=dm-devel&m=151727454815018&w=2 More importantly, every driver need this change, if you have better idea to fix them all, please post a patch, then we can compare both. Thanks, Ming
On 1/30/18 7:24 AM, Mike Snitzer wrote: > From: Ming Lei <ming.lei@redhat.com> > > This status is returned from driver to block layer if device related > resource is unavailable, but driver can guarantee that IO dispatch > will be triggered in future when the resource is available. > > Convert some drivers to return BLK_STS_DEV_RESOURCE. Also, if driver > returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun queue after > a delay (BLK_MQ_DELAY_QUEUE) to avoid IO stalls. BLK_MQ_DELAY_QUEUE is > 3 ms because both scsi-mq and nvmefc are using that magic value. > > If a driver can make sure there is in-flight IO, it is safe to return > BLK_STS_DEV_RESOURCE because: > > 1) If all in-flight IOs complete before examining SCHED_RESTART in > blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so queue > is run immediately in this case by blk_mq_dispatch_rq_list(); > > 2) if there is any in-flight IO after/when examining SCHED_RESTART > in blk_mq_dispatch_rq_list(): > - if SCHED_RESTART isn't set, queue is run immediately as handled in 1) > - otherwise, this request will be dispatched after any in-flight IO is > completed via blk_mq_sched_restart() > > 3) if SCHED_RESTART is set concurently in context because of > BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above two > cases and make sure IO hang can be avoided. > > One invariant is that queue will be rerun if SCHED_RESTART is set. This looks pretty good to me. I'm waffling a bit on whether to retain the current BLK_STS_RESOURCE behavior and name the new one something else, but I do like using the DEV name in there to signify the difference between a global and device resource. Just a few small nits below - can you roll a v6 with the changes? > diff --git a/block/blk-core.c b/block/blk-core.c > index cdae69be68e9..38279d4ae08b 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -145,6 +145,7 @@ static const struct { > [BLK_STS_MEDIUM] = { -ENODATA, "critical medium" }, > [BLK_STS_PROTECTION] = { -EILSEQ, "protection" }, > [BLK_STS_RESOURCE] = { -ENOMEM, "kernel resource" }, > + [BLK_STS_DEV_RESOURCE] = { -ENOMEM, "device resource" }, > [BLK_STS_AGAIN] = { -EAGAIN, "nonblocking retry" }, I think we should make BLK_STS_DEV_RESOURCE be -EBUSY, that more closely matches the result and makes it distinctly different than the global shortage that is STS_RESOURCE/ENOMEM. > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 43e7449723e0..e39b4e2a63db 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -1160,6 +1160,8 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx **hctx, > return true; > } > > +#define BLK_MQ_QUEUE_DELAY 3 /* ms units */ Make that BLK_MQ_RESOURCE_DELAY or something like that. The name is too generic right now. > diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h > index 2d973ac54b09..f41d2057215f 100644 > --- a/include/linux/blk_types.h > +++ b/include/linux/blk_types.h > @@ -39,6 +39,23 @@ typedef u8 __bitwise blk_status_t; > > #define BLK_STS_AGAIN ((__force blk_status_t)12) > > +/* > + * BLK_STS_DEV_RESOURCE is returned from driver to block layer if device > + * related resource is unavailable, but driver can guarantee that queue > + * will be rerun in future once the resource is available (whereby > + * dispatching requests). > + * > + * To safely return BLK_STS_DEV_RESOURCE, and allow forward progress, a > + * driver just needs to make sure there is in-flight IO. > + * > + * Difference with BLK_STS_RESOURCE: > + * If driver isn't sure if the queue will be rerun once device resource > + * is made available, please return BLK_STS_RESOURCE. For example: when > + * memory allocation, DMA Mapping or other system resource allocation > + * fails and IO can't be submitted to device. > + */ > +#define BLK_STS_DEV_RESOURCE ((__force blk_status_t)13) I'd rephrase that as: BLK_STS_DEV_RESOURCE is returned from the driver to the block layer if device related resource are unavailable, but the driver can guarantee that the queue will be rerun in the future once resources become available again. This is typically the case for device specific resources that are consumed for IO. If the driver fails allocating these resources, we know that inflight (or pending) IO will free these resource upon completion. This is different from BLK_STS_RESOURCE in that it explicitly references device specific resource. For resources of wider scope, allocation failure can happen without having pending IO. This means that we can't rely on request completions freeing these resources, as IO may not be in flight. Examples of that are kernel memory allocations, DMA mappings, or any other system wide resources.
On Tue, Jan 30 2018 at 9:44P -0500, Jens Axboe <axboe@kernel.dk> wrote: > On 1/30/18 7:24 AM, Mike Snitzer wrote: > > From: Ming Lei <ming.lei@redhat.com> > > > > This status is returned from driver to block layer if device related > > resource is unavailable, but driver can guarantee that IO dispatch > > will be triggered in future when the resource is available. > > > > Convert some drivers to return BLK_STS_DEV_RESOURCE. Also, if driver > > returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun queue after > > a delay (BLK_MQ_DELAY_QUEUE) to avoid IO stalls. BLK_MQ_DELAY_QUEUE is > > 3 ms because both scsi-mq and nvmefc are using that magic value. > > > > If a driver can make sure there is in-flight IO, it is safe to return > > BLK_STS_DEV_RESOURCE because: > > > > 1) If all in-flight IOs complete before examining SCHED_RESTART in > > blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so queue > > is run immediately in this case by blk_mq_dispatch_rq_list(); > > > > 2) if there is any in-flight IO after/when examining SCHED_RESTART > > in blk_mq_dispatch_rq_list(): > > - if SCHED_RESTART isn't set, queue is run immediately as handled in 1) > > - otherwise, this request will be dispatched after any in-flight IO is > > completed via blk_mq_sched_restart() > > > > 3) if SCHED_RESTART is set concurently in context because of > > BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above two > > cases and make sure IO hang can be avoided. > > > > One invariant is that queue will be rerun if SCHED_RESTART is set. > > This looks pretty good to me. I'm waffling a bit on whether to retain > the current BLK_STS_RESOURCE behavior and name the new one something > else, but I do like using the DEV name in there to signify the > difference between a global and device resource. > > Just a few small nits below - can you roll a v6 with the changes? Folded in all your feedback and just replied with v6. > > diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h > > index 2d973ac54b09..f41d2057215f 100644 > > --- a/include/linux/blk_types.h > > +++ b/include/linux/blk_types.h > > @@ -39,6 +39,23 @@ typedef u8 __bitwise blk_status_t; > > > > #define BLK_STS_AGAIN ((__force blk_status_t)12) > > > > +/* > > + * BLK_STS_DEV_RESOURCE is returned from driver to block layer if device > > + * related resource is unavailable, but driver can guarantee that queue > > + * will be rerun in future once the resource is available (whereby > > + * dispatching requests). > > + * > > + * To safely return BLK_STS_DEV_RESOURCE, and allow forward progress, a > > + * driver just needs to make sure there is in-flight IO. > > + * > > + * Difference with BLK_STS_RESOURCE: > > + * If driver isn't sure if the queue will be rerun once device resource > > + * is made available, please return BLK_STS_RESOURCE. For example: when > > + * memory allocation, DMA Mapping or other system resource allocation > > + * fails and IO can't be submitted to device. > > + */ > > +#define BLK_STS_DEV_RESOURCE ((__force blk_status_t)13) > > I'd rephrase that as: > > BLK_STS_DEV_RESOURCE is returned from the driver to the block layer if > device related resource are unavailable, but the driver can guarantee > that the queue will be rerun in the future once resources become > available again. This is typically the case for device specific > resources that are consumed for IO. If the driver fails allocating these > resources, we know that inflight (or pending) IO will free these > resource upon completion. > > This is different from BLK_STS_RESOURCE in that it explicitly references > device specific resource. For resources of wider scope, allocation > failure can happen without having pending IO. This means that we can't > rely on request completions freeing these resources, as IO may not be in > flight. Examples of that are kernel memory allocations, DMA mappings, or > any other system wide resources. Thanks for that, definitely clearer, nice job. Mike
On 1/30/18 10:52 AM, Bart Van Assche wrote: > On 01/30/18 06:24, Mike Snitzer wrote: >> + * >> + * If driver returns BLK_STS_RESOURCE and SCHED_RESTART >> + * bit is set, run queue after a delay to avoid IO stalls >> + * that could otherwise occur if the queue is idle. >> */ >> - if (!blk_mq_sched_needs_restart(hctx) || >> + needs_restart = blk_mq_sched_needs_restart(hctx); >> + if (!needs_restart || >> (no_tag && list_empty_careful(&hctx->dispatch_wait.entry))) >> blk_mq_run_hw_queue(hctx, true); >> + else if (needs_restart && (ret == BLK_STS_RESOURCE)) >> + blk_mq_delay_run_hw_queue(hctx, BLK_MQ_QUEUE_DELAY); >> } > > If a request completes concurrently with execution of the above code > then the request completion will trigger a call of > blk_mq_sched_restart_hctx() and that call will clear the > BLK_MQ_S_SCHED_RESTART bit. If that bit is cleared before the above code > tests it then the above code will schedule an asynchronous call of > __blk_mq_run_hw_queue(). If the .queue_rq() call triggered by the new > queue run returns again BLK_STS_RESOURCE then the above code will be > executed again. In other words, a loop occurs. That loop will repeat as > long as the described race occurs. The current (kernel v4.15) block > layer behavior is simpler: only block drivers call > blk_mq_delay_run_hw_queue() and the block layer core never calls that > function. Hence that loop cannot occur with the v4.15 block layer core > and block drivers. A motivation of why that loop is preferred compared > to the current behavior (no loop) is missing. Does this mean that that > loop is a needless complication of the block layer core? The dispatch, and later restart check, is within the hctx lock. The completions should be as well. > Sorry but I still prefer the v4.15 block layer approach because this > patch has in my view the following disadvantages: > - It involves a blk-mq API change. API changes are always risky and need > some time to stabilize. > - The delay after which to rerun the queue is moved from block layer > drivers into the block layer core. I think that's wrong because only > the block driver authors can make a good choice for this constant. It's exactly the right place to put it, as drivers cannot make a good decision for when to run the queue again. You get NULL on allocating some piece of memory, when do you run it again? That's the last thing I want driver writers to make a decision on, because a novice device driver writer will just think that he should run the queue again asap. In reality we are screwed. Decisions like that SHOULD be in shared and generic code, not in driver private code. > - This patch makes block drivers harder to understand. Anyone who sees > return BLK_STS_RESOURCE / return BLK_STS_DEV_RESOURCE for the first > time will have to look up the meaning of these constants. An explicit > blk_mq_delay_run_hw_queue() call is easier to understand. BLK_STS_RESOURCE should always be safe to return, and it should work the same as STS_DEV_RESOURCE, except it may cause an extra queue run. Well written drivers should use STS_DEV_RESOURCE where it makes sense. > - This patch does not fix any bugs nor makes block drivers easier to > read or to implement. So why is this patch considered useful? It does fix the run bug on global resources, I'm assuming you mean it doesn't fix any EXTRA bugs compared to just use the delayed run?
On Tue, 2018-01-30 at 20:17 -0700, Jens Axboe wrote: > BLK_STS_RESOURCE should always be safe to return, and it should work > the same as STS_DEV_RESOURCE, except it may cause an extra queue > run. > > Well written drivers should use STS_DEV_RESOURCE where it makes > sense. Hello Jens, I would appreciate it if other names would be chosen than BLK_STS_RESOURCE and BLK_STS_DEV_RESOURCE, e.g. names that directly refer to the fact that one of the two status codes causes the queue to be rerun and the other not. I'm afraid that the currently chosen names will cause confusion. Thanks, Bart.
On 1/30/18 8:21 PM, Bart Van Assche wrote: > On Tue, 2018-01-30 at 20:17 -0700, Jens Axboe wrote: >> BLK_STS_RESOURCE should always be safe to return, and it should work >> the same as STS_DEV_RESOURCE, except it may cause an extra queue >> run. >> >> Well written drivers should use STS_DEV_RESOURCE where it makes >> sense. > > Hello Jens, > > I would appreciate it if other names would be chosen than BLK_STS_RESOURCE > and BLK_STS_DEV_RESOURCE, e.g. names that directly refer to the fact that > one of the two status codes causes the queue to be rerun and the other not. > I'm afraid that the currently chosen names will cause confusion. DEV_RESOURCE is pretty clear I think, but I agree that STS_RESOURCE could perhaps be better. STS_SYSTEM_RESOURCE? It makes the distinction a bit clearer.
On Tue, 2018-01-30 at 20:22 -0700, Jens Axboe wrote: > On 1/30/18 8:21 PM, Bart Van Assche wrote: > > On Tue, 2018-01-30 at 20:17 -0700, Jens Axboe wrote: > > > BLK_STS_RESOURCE should always be safe to return, and it should work > > > the same as STS_DEV_RESOURCE, except it may cause an extra queue > > > run. > > > > > > Well written drivers should use STS_DEV_RESOURCE where it makes > > > sense. > > > > Hello Jens, > > > > I would appreciate it if other names would be chosen than BLK_STS_RESOURCE > > and BLK_STS_DEV_RESOURCE, e.g. names that directly refer to the fact that > > one of the two status codes causes the queue to be rerun and the other not. > > I'm afraid that the currently chosen names will cause confusion. > > DEV_RESOURCE is pretty clear I think, but I agree that STS_RESOURCE > could perhaps be better. STS_SYSTEM_RESOURCE? It makes the distinction > a bit clearer. I'm concerned about both. A block driver can namely run out of device resources without receiving a notification if that resource becomes available again. In that case the appropriate return code is BLK_STS_RESOURCE. Hence my concern ... Bart.
On 1/30/18 8:27 PM, Bart Van Assche wrote: > On Tue, 2018-01-30 at 20:22 -0700, Jens Axboe wrote: >> On 1/30/18 8:21 PM, Bart Van Assche wrote: >>> On Tue, 2018-01-30 at 20:17 -0700, Jens Axboe wrote: >>>> BLK_STS_RESOURCE should always be safe to return, and it should work >>>> the same as STS_DEV_RESOURCE, except it may cause an extra queue >>>> run. >>>> >>>> Well written drivers should use STS_DEV_RESOURCE where it makes >>>> sense. >>> >>> Hello Jens, >>> >>> I would appreciate it if other names would be chosen than BLK_STS_RESOURCE >>> and BLK_STS_DEV_RESOURCE, e.g. names that directly refer to the fact that >>> one of the two status codes causes the queue to be rerun and the other not. >>> I'm afraid that the currently chosen names will cause confusion. >> >> DEV_RESOURCE is pretty clear I think, but I agree that STS_RESOURCE >> could perhaps be better. STS_SYSTEM_RESOURCE? It makes the distinction >> a bit clearer. > > I'm concerned about both. A block driver can namely run out of device > resources without receiving a notification if that resource becomes > available again. That is true, and that is why I don't want to driver to make guesses on when to re-run. The saving grace here is that it should happen extremely rarely. I went over this in a previous email. If it's not a rare occurence, then there's no other way around it then wiring up a notification mechanism to kick the queue when the shortage clears. One way to deal with that is making the assumption that other IO will clear the issue. That means we can confine it to the blk-mq layer. Similarly to how we currently sometimes have to track starvation across hardware queues and restart queue X when Y completes IO, we may have to have a broader scope of shortage. That would probably fix all bug pathological cases. > In that case the appropriate return code is BLK_STS_RESOURCE. Hence my > concern ... But this isn't a new thing, the only real change here is making it so that drivers don't have to think about that case.
On Tue, Jan 30, 2018 at 08:22:27PM -0700, Jens Axboe wrote: > On 1/30/18 8:21 PM, Bart Van Assche wrote: > > On Tue, 2018-01-30 at 20:17 -0700, Jens Axboe wrote: > >> BLK_STS_RESOURCE should always be safe to return, and it should work > >> the same as STS_DEV_RESOURCE, except it may cause an extra queue > >> run. > >> > >> Well written drivers should use STS_DEV_RESOURCE where it makes > >> sense. > > > > Hello Jens, > > > > I would appreciate it if other names would be chosen than BLK_STS_RESOURCE > > and BLK_STS_DEV_RESOURCE, e.g. names that directly refer to the fact that > > one of the two status codes causes the queue to be rerun and the other not. > > I'm afraid that the currently chosen names will cause confusion. > > DEV_RESOURCE is pretty clear I think, but I agree that STS_RESOURCE > could perhaps be better. STS_SYSTEM_RESOURCE? It makes the distinction I guess it still can't cover all, for example, .queue_rq() may not submit rq to hardware successfully because of tansport busy, such FC,..
diff --git a/block/blk-core.c b/block/blk-core.c index cdae69be68e9..38279d4ae08b 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -145,6 +145,7 @@ static const struct { [BLK_STS_MEDIUM] = { -ENODATA, "critical medium" }, [BLK_STS_PROTECTION] = { -EILSEQ, "protection" }, [BLK_STS_RESOURCE] = { -ENOMEM, "kernel resource" }, + [BLK_STS_DEV_RESOURCE] = { -ENOMEM, "device resource" }, [BLK_STS_AGAIN] = { -EAGAIN, "nonblocking retry" }, /* device mapper special case, should not leak out: */ diff --git a/block/blk-mq.c b/block/blk-mq.c index 43e7449723e0..e39b4e2a63db 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1160,6 +1160,8 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx **hctx, return true; } +#define BLK_MQ_QUEUE_DELAY 3 /* ms units */ + bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, bool got_budget) { @@ -1167,6 +1169,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, struct request *rq, *nxt; bool no_tag = false; int errors, queued; + blk_status_t ret = BLK_STS_OK; if (list_empty(list)) return false; @@ -1179,7 +1182,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, errors = queued = 0; do { struct blk_mq_queue_data bd; - blk_status_t ret; rq = list_first_entry(list, struct request, queuelist); if (!blk_mq_get_driver_tag(rq, &hctx, false)) { @@ -1224,7 +1226,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, } ret = q->mq_ops->queue_rq(hctx, &bd); - if (ret == BLK_STS_RESOURCE) { + if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) { /* * If an I/O scheduler has been configured and we got a * driver tag for the next request already, free it @@ -1255,6 +1257,8 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, * that is where we will continue on next queue run. */ if (!list_empty(list)) { + bool needs_restart; + spin_lock(&hctx->lock); list_splice_init(list, &hctx->dispatch); spin_unlock(&hctx->lock); @@ -1278,10 +1282,17 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, * - Some but not all block drivers stop a queue before * returning BLK_STS_RESOURCE. Two exceptions are scsi-mq * and dm-rq. + * + * If driver returns BLK_STS_RESOURCE and SCHED_RESTART + * bit is set, run queue after a delay to avoid IO stalls + * that could otherwise occur if the queue is idle. */ - if (!blk_mq_sched_needs_restart(hctx) || + needs_restart = blk_mq_sched_needs_restart(hctx); + if (!needs_restart || (no_tag && list_empty_careful(&hctx->dispatch_wait.entry))) blk_mq_run_hw_queue(hctx, true); + else if (needs_restart && (ret == BLK_STS_RESOURCE)) + blk_mq_delay_run_hw_queue(hctx, BLK_MQ_QUEUE_DELAY); } return (queued + errors) != 0; @@ -1762,6 +1773,7 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx, *cookie = new_cookie; break; case BLK_STS_RESOURCE: + case BLK_STS_DEV_RESOURCE: __blk_mq_requeue_request(rq); break; default: @@ -1824,7 +1836,7 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, hctx_lock(hctx, &srcu_idx); ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false); - if (ret == BLK_STS_RESOURCE) + if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) blk_mq_sched_insert_request(rq, false, true, false); else if (ret != BLK_STS_OK) blk_mq_end_request(rq, ret); diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c index 5b94e530570c..4bc25fc4e73c 100644 --- a/drivers/block/null_blk.c +++ b/drivers/block/null_blk.c @@ -1230,7 +1230,7 @@ static blk_status_t null_handle_cmd(struct nullb_cmd *cmd) return BLK_STS_OK; } else /* requeue request */ - return BLK_STS_RESOURCE; + return BLK_STS_DEV_RESOURCE; } } diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c index 68846897d213..79908e6ddbf2 100644 --- a/drivers/block/virtio_blk.c +++ b/drivers/block/virtio_blk.c @@ -276,7 +276,7 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx, /* Out of mem doesn't actually happen, since we fall back * to direct descriptors */ if (err == -ENOMEM || err == -ENOSPC) - return BLK_STS_RESOURCE; + return BLK_STS_DEV_RESOURCE; return BLK_STS_IOERR; } diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c index 891265acb10e..e126e4cac2ca 100644 --- a/drivers/block/xen-blkfront.c +++ b/drivers/block/xen-blkfront.c @@ -911,7 +911,7 @@ static blk_status_t blkif_queue_rq(struct blk_mq_hw_ctx *hctx, out_busy: blk_mq_stop_hw_queue(hctx); spin_unlock_irqrestore(&rinfo->ring_lock, flags); - return BLK_STS_RESOURCE; + return BLK_STS_DEV_RESOURCE; } static void blkif_complete_rq(struct request *rq) diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c index b7d175e94a02..348a0cb6963a 100644 --- a/drivers/md/dm-rq.c +++ b/drivers/md/dm-rq.c @@ -404,7 +404,7 @@ static blk_status_t dm_dispatch_clone_request(struct request *clone, struct requ clone->start_time = jiffies; r = blk_insert_cloned_request(clone->q, clone); - if (r != BLK_STS_OK && r != BLK_STS_RESOURCE) + if (r != BLK_STS_OK && r != BLK_STS_RESOURCE && r != BLK_STS_DEV_RESOURCE) /* must complete clone in terms of original request */ dm_complete_request(rq, r); return r; @@ -496,7 +496,7 @@ static int map_request(struct dm_rq_target_io *tio) trace_block_rq_remap(clone->q, clone, disk_devt(dm_disk(md)), blk_rq_pos(rq)); ret = dm_dispatch_clone_request(clone, rq); - if (ret == BLK_STS_RESOURCE) { + if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) { blk_rq_unprep_clone(clone); tio->ti->type->release_clone_rq(clone); tio->clone = NULL; @@ -769,7 +769,6 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx, /* Undo dm_start_request() before requeuing */ rq_end_stats(md, rq); rq_completed(md, rq_data_dir(rq), false); - blk_mq_delay_run_hw_queue(hctx, 100/*ms*/); return BLK_STS_RESOURCE; } diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c index b76ba4629e02..54e679541ad6 100644 --- a/drivers/nvme/host/fc.c +++ b/drivers/nvme/host/fc.c @@ -35,8 +35,6 @@ enum nvme_fc_queue_flags { NVME_FC_Q_LIVE, }; -#define NVMEFC_QUEUE_DELAY 3 /* ms units */ - #define NVME_FC_DEFAULT_DEV_LOSS_TMO 60 /* seconds */ struct nvme_fc_queue { @@ -2231,7 +2229,7 @@ nvme_fc_start_fcp_op(struct nvme_fc_ctrl *ctrl, struct nvme_fc_queue *queue, * the target device is present */ if (ctrl->rport->remoteport.port_state != FC_OBJSTATE_ONLINE) - goto busy; + return BLK_STS_RESOURCE; if (!nvme_fc_ctrl_get(ctrl)) return BLK_STS_IOERR; @@ -2311,16 +2309,10 @@ nvme_fc_start_fcp_op(struct nvme_fc_ctrl *ctrl, struct nvme_fc_queue *queue, ret != -EBUSY) return BLK_STS_IOERR; - goto busy; + return BLK_STS_RESOURCE; } return BLK_STS_OK; - -busy: - if (!(op->flags & FCOP_FLAGS_AEN) && queue->hctx) - blk_mq_delay_run_hw_queue(queue->hctx, NVMEFC_QUEUE_DELAY); - - return BLK_STS_RESOURCE; } static inline blk_status_t nvme_fc_is_ready(struct nvme_fc_queue *queue, diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index d9ca1dfab154..55be2550c555 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -2030,9 +2030,9 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx, case BLK_STS_OK: break; case BLK_STS_RESOURCE: - if (atomic_read(&sdev->device_busy) == 0 && - !scsi_device_blocked(sdev)) - blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY); + if (atomic_read(&sdev->device_busy) || + scsi_device_blocked(sdev)) + ret = BLK_STS_DEV_RESOURCE; break; default: /* diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index 2d973ac54b09..f41d2057215f 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -39,6 +39,23 @@ typedef u8 __bitwise blk_status_t; #define BLK_STS_AGAIN ((__force blk_status_t)12) +/* + * BLK_STS_DEV_RESOURCE is returned from driver to block layer if device + * related resource is unavailable, but driver can guarantee that queue + * will be rerun in future once the resource is available (whereby + * dispatching requests). + * + * To safely return BLK_STS_DEV_RESOURCE, and allow forward progress, a + * driver just needs to make sure there is in-flight IO. + * + * Difference with BLK_STS_RESOURCE: + * If driver isn't sure if the queue will be rerun once device resource + * is made available, please return BLK_STS_RESOURCE. For example: when + * memory allocation, DMA Mapping or other system resource allocation + * fails and IO can't be submitted to device. + */ +#define BLK_STS_DEV_RESOURCE ((__force blk_status_t)13) + /** * blk_path_error - returns true if error may be path related * @error: status the request was completed with