Message ID | 1505772186-4601-1-git-send-email-keith.busch@intel.com (mailing list archive) |
---|---|
State | New, archived |
On Mon, 2017-09-18 at 18:03 -0400, Keith Busch wrote:
> I think we've always known it's possible to lose a request during timeout
> handling, but just accepted that possibility. It seems to be causing
> problems, though, leading to unnecessary error escalation and IO failures.
>
> The possibility arises when the block layer marks the request complete
> prior to running the timeout handler. If that request happens to complete
> while the handler is running, the request will be lost, inevitably
> triggering a second timeout.
>
> This patch attempts to shorten the window for this race condition by
> clearing the started flag when the driver completes a request. The block
> layer's timeout handler will then complete the command if it observes
> the started flag is no longer set.
>
> Note it's possible to lose the command even with this patch. It's just
> less likely to happen.

Hello Keith,

Are you sure the root cause of this race condition is in the blk-mq core?
I've never observed such behavior in any of my numerous scsi-mq tests (which
trigger timeouts). Are you sure the race you observed is not caused by a
blk_mq_reinit_tagset() call, a function that is only used by the NVMe driver
and not by any other block driver?

Bart.
On Mon, Sep 18, 2017 at 10:07:58PM +0000, Bart Van Assche wrote:
> On Mon, 2017-09-18 at 18:03 -0400, Keith Busch wrote:
> > I think we've always known it's possible to lose a request during timeout
> > handling, but just accepted that possibility. It seems to be causing
> > problems, though, leading to unnecessary error escalation and IO failures.
> >
> > The possibility arises when the block layer marks the request complete
> > prior to running the timeout handler. If that request happens to complete
> > while the handler is running, the request will be lost, inevitably
> > triggering a second timeout.
> >
> > This patch attempts to shorten the window for this race condition by
> > clearing the started flag when the driver completes a request. The block
> > layer's timeout handler will then complete the command if it observes
> > the started flag is no longer set.
> >
> > Note it's possible to lose the command even with this patch. It's just
> > less likely to happen.
>
> Hello Keith,
>
> Are you sure the root cause of this race condition is in the blk-mq core?
> I've never observed such behavior in any of my numerous scsi-mq tests (which
> trigger timeouts). Are you sure the race you observed is not caused by a
> blk_mq_reinit_tagset() call, a function that is only used by the NVMe driver
> and not by any other block driver?

Hi Bart,

The nvme driver's use of blk_mq_reinit_tagset only happens during
controller initialisation, but I'm seeing lost commands well after that
during normal and stable running.

The timing is pretty narrow to hit, but I'm pretty sure this is what's
happening. For nvme, this occurs when nvme_timeout() runs concurrently
with nvme_handle_cqe() for the same struct request. For scsi-mq, the same
situation may arise if scsi_mq_done() runs concurrently with
scsi_times_out().

I don't really like the proposed "fix" though, since it only makes the race
less likely, but I didn't see a way to close it without introducing locks.
If someone knows of a better way, that would be awesome.

Thanks,
Keith
On Mon, 2017-09-18 at 18:39 -0400, Keith Busch wrote:
> The nvme driver's use of blk_mq_reinit_tagset only happens during
> controller initialisation, but I'm seeing lost commands well after that
> during normal and stable running.
>
> The timing is pretty narrow to hit, but I'm pretty sure this is what's
> happening. For nvme, this occurs when nvme_timeout() runs concurrently
> with nvme_handle_cqe() for the same struct request. For scsi-mq,
> the same situation may arise if scsi_mq_done() runs concurrently with
> scsi_times_out().

Hello Keith,

Are you sure that scenario can happen? The blk-mq core calls
test_and_set_bit() for the REQ_ATOM_COMPLETE flag before any completion or
timeout handler is called. See also blk_mark_rq_complete(). This prevents
the .complete() and .timeout() functions from running concurrently.

In case this wouldn't be clear, a block driver must not call
blk_mq_end_request() after the timeout handler has finished because that
would trigger a use-after-free of a request structure.

I noticed that your patch includes changes for blk_mq_start_request(). No
timeout or completion handler should be running while blk_mq_start_request()
is being executed. If these changes make a difference in your tests then I
think that means that there is something wrong with the NVMe driver.

Bart.
On Mon, Sep 18, 2017 at 10:53:12PM +0000, Bart Van Assche wrote:
> On Mon, 2017-09-18 at 18:39 -0400, Keith Busch wrote:
> > The nvme driver's use of blk_mq_reinit_tagset only happens during
> > controller initialisation, but I'm seeing lost commands well after that
> > during normal and stable running.
> >
> > The timing is pretty narrow to hit, but I'm pretty sure this is what's
> > happening. For nvme, this occurs when nvme_timeout() runs concurrently
> > with nvme_handle_cqe() for the same struct request. For scsi-mq,
> > the same situation may arise if scsi_mq_done() runs concurrently with
> > scsi_times_out().
>
> Hello Keith,
>
> Are you sure that scenario can happen? The blk-mq core calls
> test_and_set_bit() for the REQ_ATOM_COMPLETE flag before any completion or
> timeout handler is called. See also blk_mark_rq_complete(). This prevents
> the .complete() and .timeout() functions from running concurrently.

Indeed that prevents .complete from running concurrently with the
timeout handler, but scsi_mq_done and nvme_handle_cqe are not .complete
callbacks. These are the LLD functions that call blk_mq_complete_request
well before .complete. If the driver calls blk_mq_complete_request on
a request that blk-mq is timing out, the request is lost because blk-mq
already called blk_mark_rq_complete. Nothing prevents these LLD functions
from running at the same time as the timeout handler.

> In case this wouldn't be clear, a block driver must not call
> blk_mq_end_request() after the timeout handler has finished because that
> would trigger a use-after-free of a request structure.
>
> I noticed that your patch includes changes for blk_mq_start_request(). No
> timeout or completion handler should be running while blk_mq_start_request()
> is being executed. If these changes make a difference in your tests then I
> think that means that there is something wrong with the NVMe driver.

The reason for changing blk_mq_start_request() was because of how the
requeue path was clearing STARTED. I had to remove that since it would
otherwise have been impossible for blk_mq_rq_timed_out() to know whether the
request was requeued vs. completed.
On Mon, 2017-09-18 at 19:08 -0400, Keith Busch wrote:
> On Mon, Sep 18, 2017 at 10:53:12PM +0000, Bart Van Assche wrote:
> > Are you sure that scenario can happen? The blk-mq core calls
> > test_and_set_bit() for the REQ_ATOM_COMPLETE flag before any completion
> > or timeout handler is called. See also blk_mark_rq_complete(). This
> > prevents the .complete() and .timeout() functions from running
> > concurrently.
>
> Indeed that prevents .complete from running concurrently with the
> timeout handler, but scsi_mq_done and nvme_handle_cqe are not .complete
> callbacks. These are the LLD functions that call blk_mq_complete_request
> well before .complete. If the driver calls blk_mq_complete_request on
> a request that blk-mq is timing out, the request is lost because blk-mq
> already called blk_mark_rq_complete. Nothing prevents these LLD functions
> from running at the same time as the timeout handler.

Can you explain how you define "request is lost"? If a timeout occurs for a
SCSI request then scsi_times_out() calls scsi_abort_command() (if no
.eh_timed_out() callback has been defined by the LLD). It is the
responsibility of the SCSI LLD to call .scsi_done() before its
.eh_abort_handler() returns SUCCESS. If .eh_abort_handler() returns a value
other than SUCCESS then the SCSI core will escalate the error further until
.scsi_done() has been called for the command that timed out. See also
scsi_abort_eh_cmnd(). So I think what you wrote is not correct for the SCSI
core and a properly implemented SCSI LLD.

Bart.
On Mon, Sep 18, 2017 at 11:14:38PM +0000, Bart Van Assche wrote:
> On Mon, 2017-09-18 at 19:08 -0400, Keith Busch wrote:
> > On Mon, Sep 18, 2017 at 10:53:12PM +0000, Bart Van Assche wrote:
> > > Are you sure that scenario can happen? The blk-mq core calls
> > > test_and_set_bit() for the REQ_ATOM_COMPLETE flag before any completion
> > > or timeout handler is called. See also blk_mark_rq_complete(). This
> > > prevents the .complete() and .timeout() functions from running
> > > concurrently.
> >
> > Indeed that prevents .complete from running concurrently with the
> > timeout handler, but scsi_mq_done and nvme_handle_cqe are not .complete
> > callbacks. These are the LLD functions that call blk_mq_complete_request
> > well before .complete. If the driver calls blk_mq_complete_request on
> > a request that blk-mq is timing out, the request is lost because blk-mq
> > already called blk_mark_rq_complete. Nothing prevents these LLD functions
> > from running at the same time as the timeout handler.
>
> Can you explain how you define "request is lost"?

Sure, what I mean by "lost" is when nothing will call
__blk_mq_complete_request, which is required to make forward progress on
that request. If a driver calls blk_mq_complete_request on a request being
checked by the timeout handler, blk-mq will return immediately instead of
making progress toward completion, since blk-mq already set
REQ_ATOM_COMPLETE while running the timeout handler.

> If a timeout occurs for a SCSI request then scsi_times_out() calls
> scsi_abort_command() (if no .eh_timed_out() callback has been defined by
> the LLD). It is the responsibility of the SCSI LLD to call .scsi_done()
> before its .eh_abort_handler() returns SUCCESS. If .eh_abort_handler()
> returns a value other than SUCCESS then the SCSI core will escalate the
> error further until .scsi_done() has been called for the command that
> timed out. See also scsi_abort_eh_cmnd(). So I think what you wrote is not
> correct for the SCSI core and a properly implemented SCSI LLD.

Once blk-mq's blk_mq_check_expired calls blk_mark_rq_complete, an LLD
calling blk_mq_complete_request does absolutely nothing because
REQ_ATOM_COMPLETE is already set. There's nothing stopping scsi_mq_done
from running at the same time as blk-mq's timeout handler.

It doesn't matter what path .scsi_done is called from now. That will just
call scsi_mq_done -> blk_mq_complete_request, and since REQ_ATOM_COMPLETE is
already set, the command won't be completed. The only way to complete that
request now is if the timeout handler returns BLK_EH_HANDLED, but the
scsi-mq abort path returns BLK_EH_NOT_HANDLED on success (very few drivers
actually return BLK_EH_HANDLED).

After the timeout handler runs, it's once again possible for the driver to
complete the request if it returned BLK_EH_RESET_TIMER, but if the driver
already completed the command during the timeout handler, how is the driver
supposed to know it needs to complete the request a second time?
On Tue, Sep 19, 2017 at 7:08 AM, Keith Busch <keith.busch@intel.com> wrote:
> On Mon, Sep 18, 2017 at 10:53:12PM +0000, Bart Van Assche wrote:
>> On Mon, 2017-09-18 at 18:39 -0400, Keith Busch wrote:
>> > The nvme driver's use of blk_mq_reinit_tagset only happens during
>> > controller initialisation, but I'm seeing lost commands well after that
>> > during normal and stable running.
>> >
>> > The timing is pretty narrow to hit, but I'm pretty sure this is what's
>> > happening. For nvme, this occurs when nvme_timeout() runs concurrently
>> > with nvme_handle_cqe() for the same struct request. For scsi-mq,
>> > the same situation may arise if scsi_mq_done() runs concurrently with
>> > scsi_times_out().
>>
>> Hello Keith,
>>
>> Are you sure that scenario can happen? The blk-mq core calls
>> test_and_set_bit() for the REQ_ATOM_COMPLETE flag before any completion
>> or timeout handler is called. See also blk_mark_rq_complete(). This
>> prevents the .complete() and .timeout() functions from running
>> concurrently.
>
> Indeed that prevents .complete from running concurrently with the
> timeout handler, but scsi_mq_done and nvme_handle_cqe are not .complete
> callbacks. These are the LLD functions that call blk_mq_complete_request
> well before .complete. If the driver calls blk_mq_complete_request on
> a request that blk-mq is timing out, the request is lost because blk-mq
> already called blk_mark_rq_complete. Nothing prevents these LLD functions

That shouldn't happen, because only one caller of blk_mark_rq_complete() can
win, and it is the winner's responsibility to complete the request, so no
request should be lost. Especially in your case, it is the responsibility of
the timeout path to really complete the rq.

Also, your patch removes test_and_clear_bit(REQ_ATOM_STARTED) from
__blk_mq_requeue_request(), which will make timeouts easier to trigger: the
period between starting the requeue and restarting the request may be long
enough that a timeout fires because STARTED isn't cleared.
On Mon, 2017-09-18 at 21:55 -0400, Keith Busch wrote:
> The only way to complete that request now is if the timeout
> handler returns BLK_EH_HANDLED, but the scsi-mq abort path returns
> BLK_EH_NOT_HANDLED on success (very few drivers actually return
> BLK_EH_HANDLED).
>
> After the timeout handler runs, it's once again possible for the driver to
> complete the request if it returned BLK_EH_RESET_TIMER, but if the driver
> already completed the command during the timeout handler, how is the
> driver supposed to know it needs to complete the request a second time?

Hello Keith,

If you have a look at scmd_eh_abort_handler() then you will see that it
either calls scsi_queue_insert(), scsi_finish_command() or
scsi_eh_scmd_add(). scsi_queue_insert() reinserts the request.
scsi_eh_scmd_add() wakes up the error handler thread to perform further
error handling escalation. scsi_finish_command() calls scsi_end_request().
That last function in turn calls __blk_mq_end_request().
__blk_mq_end_request() does not check REQ_ATOM_COMPLETE.

Sorry, but the summary is that I do not agree with you that the SCSI core
can lose requests.

Bart.
On Tue, Sep 19, 2017 at 12:16:31PM +0800, Ming Lei wrote:
> On Tue, Sep 19, 2017 at 7:08 AM, Keith Busch <keith.busch@intel.com> wrote:
> >
> > Indeed that prevents .complete from running concurrently with the
> > timeout handler, but scsi_mq_done and nvme_handle_cqe are not .complete
> > callbacks. These are the LLD functions that call blk_mq_complete_request
> > well before .complete. If the driver calls blk_mq_complete_request on
> > a request that blk-mq is timing out, the request is lost because blk-mq
> > already called blk_mark_rq_complete. Nothing prevents these LLD functions
>
> That shouldn't happen, because only one caller of blk_mark_rq_complete()
> can win, and it is the winner's responsibility to complete the request, so
> no request should be lost. Especially in your case, it is the
> responsibility of the timeout path to really complete the rq.

Hm, either I'm explaining this poorly, or I'm missing something that's
obvious to everyone else.

The driver's IRQ handler has no idea it's racing with the blk-mq timeout
handler, and there's nothing indicating it lost the race. The IRQ handler
just calls blk_mq_complete_request. As far as the driver is concerned,
it has done its part to complete the request at that point.

The problem is when blk-mq's timeout handler prevents the request from
completing, and doesn't leave any indication the driver requested to
complete it. Who is responsible for completing that request now?
On Tue, 2017-09-19 at 11:07 -0400, Keith Busch wrote:
> The problem is when blk-mq's timeout handler prevents the request from
> completing, and doesn't leave any indication the driver requested to
> complete it. Who is responsible for completing that request now?

Hello Keith,

My understanding is that block drivers are responsible for completing timed
out requests by using one of the following approaches:
* By returning BLK_EH_HANDLED or BLK_EH_RESET_TIMER from inside the timeout
  handler.
* By returning BLK_EH_NOT_HANDLED and by calling blk_mq_end_request() or
  __blk_mq_end_request() for the request that timed out.

Bart.
On Tue, Sep 19, 2017 at 11:07 PM, Keith Busch <keith.busch@intel.com> wrote:
> On Tue, Sep 19, 2017 at 12:16:31PM +0800, Ming Lei wrote:
>> On Tue, Sep 19, 2017 at 7:08 AM, Keith Busch <keith.busch@intel.com> wrote:
>> >
>> > Indeed that prevents .complete from running concurrently with the
>> > timeout handler, but scsi_mq_done and nvme_handle_cqe are not .complete
>> > callbacks. These are the LLD functions that call blk_mq_complete_request
>> > well before .complete. If the driver calls blk_mq_complete_request on
>> > a request that blk-mq is timing out, the request is lost because blk-mq
>> > already called blk_mark_rq_complete. Nothing prevents these LLD functions
>>
>> That shouldn't happen, because only one caller of blk_mark_rq_complete()
>> can win, and it is the winner's responsibility to complete the request,
>> so no request should be lost. Especially in your case, it is the
>> responsibility of the timeout path to really complete the rq.
>
> Hm, either I'm explaining this poorly, or I'm missing something that's
> obvious to everyone else.
>
> The driver's IRQ handler has no idea it's racing with the blk-mq timeout
> handler, and there's nothing indicating it lost the race. The IRQ handler
> just calls blk_mq_complete_request. As far as the driver is concerned,
> it has done its part to complete the request at that point.

Both blk_mq_complete_request() and blk_mq_check_expired() call
blk_mark_rq_complete() to try to complete the req, but only one of them can
actually do it, right?

> The problem is when blk-mq's timeout handler prevents the request from
> completing, and doesn't leave any indication the driver requested to
> complete it. Who is responsible for completing that request now?

Whoever sets ATOM_COMPLETE successfully is responsible for completing the
request. In this case it should be the timeout handler, and the irq handler
shouldn't touch the request any more, otherwise a use-after-free may happen.
On Tue, Sep 19, 2017 at 11:22:20PM +0800, Ming Lei wrote:
> On Tue, Sep 19, 2017 at 11:07 PM, Keith Busch <keith.busch@intel.com> wrote:
> > On Tue, Sep 19, 2017 at 12:16:31PM +0800, Ming Lei wrote:
> >> On Tue, Sep 19, 2017 at 7:08 AM, Keith Busch <keith.busch@intel.com> wrote:
> >> >
> >> > Indeed that prevents .complete from running concurrently with the
> >> > timeout handler, but scsi_mq_done and nvme_handle_cqe are not .complete
> >> > callbacks. These are the LLD functions that call blk_mq_complete_request
> >> > well before .complete. If the driver calls blk_mq_complete_request on
> >> > a request that blk-mq is timing out, the request is lost because blk-mq
> >> > already called blk_mark_rq_complete. Nothing prevents these LLD functions
> >>
> >> That shouldn't happen, because only one caller of blk_mark_rq_complete()
> >> can win, and it is the winner's responsibility to complete the request,
> >> so no request should be lost. Especially in your case, it is the
> >> responsibility of the timeout path to really complete the rq.
> >
> > Hm, either I'm explaining this poorly, or I'm missing something that's
> > obvious to everyone else.
> >
> > The driver's IRQ handler has no idea it's racing with the blk-mq timeout
> > handler, and there's nothing indicating it lost the race. The IRQ handler
> > just calls blk_mq_complete_request. As far as the driver is concerned,
> > it has done its part to complete the request at that point.
>
> Both blk_mq_complete_request() and blk_mq_check_expired() call
> blk_mark_rq_complete() to try to complete the req, but only one of them
> can actually do it, right?

Yes, only one can successfully call that. The problem I'm asking about is
that the driver's IRQ handler doesn't know it lost the race. Only blk-mq
knows that. At the point the driver's timeout handler runs, the driver
believes it already completed the request.

> > The problem is when blk-mq's timeout handler prevents the request from
> > completing, and doesn't leave any indication the driver requested to
> > complete it. Who is responsible for completing that request now?
>
> Whoever sets ATOM_COMPLETE successfully is responsible for completing
> the request. In this case it should be the timeout handler, and the irq
> handler shouldn't touch the request any more, otherwise a use-after-free
> may happen.

Blk-mq sets ATOM_COMPLETE well before the driver's timeout handler is
executed. The IRQ handler doesn't know blk-mq did that, so suggesting the
IRQ handler can't touch the request doesn't make sense.
On Tue, Sep 19, 2017 at 03:18:45PM +0000, Bart Van Assche wrote:
> On Tue, 2017-09-19 at 11:07 -0400, Keith Busch wrote:
> > The problem is when blk-mq's timeout handler prevents the request from
> > completing, and doesn't leave any indication the driver requested to
> > complete it. Who is responsible for completing that request now?
>
> Hello Keith,
>
> My understanding is that block drivers are responsible for completing timed
> out requests by using one of the following approaches:
> * By returning BLK_EH_HANDLED or BLK_EH_RESET_TIMER from inside the timeout
>   handler.

No problem with BLK_EH_HANDLED when the timeout handler completes the
request. That usage at least makes sense.

In NVMe, we use BLK_EH_RESET_TIMER if the driver takes an asynchronous
action to reclaim the request. If the request is returned very soon, though
(before blk-mq clears ATOM_COMPLETE), blk_mq_complete_request will still do
nothing.

> * By returning BLK_EH_NOT_HANDLED and by calling blk_mq_end_request() or
>   __blk_mq_end_request() for the request that timed out.

You want to bypass __blk_mq_complete_request? Doesn't that actually do
important things with queue and scheduler stats? If it's not important, then
this sounds like the piece I'm looking for, but this also puts a burden on
the driver to track the state of its requests that blk-mq could do for all
drivers.
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 98a1860..37144ef 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -566,6 +566,7 @@ void blk_mq_complete_request(struct request *rq)

 	if (unlikely(blk_should_fake_timeout(q)))
 		return;
+	clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
 	if (!blk_mark_rq_complete(rq))
 		__blk_mq_complete_request(rq);
 }
@@ -605,19 +606,19 @@ void blk_mq_start_request(struct request *rq)
 	 * complete. So be sure to clear complete again when we start
 	 * the request, otherwise we'll ignore the completion event.
 	 */
-	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
+	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
 		set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
+		if (q->dma_drain_size && blk_rq_bytes(rq)) {
+			/*
+			 * Make sure space for the drain appears. We know we can do
+			 * this because max_hw_segments has been adjusted to be one
+			 * fewer than the device can handle.
+			 */
+			rq->nr_phys_segments++;
+		}
+	}
 	if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
 		clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
-
-	if (q->dma_drain_size && blk_rq_bytes(rq)) {
-		/*
-		 * Make sure space for the drain appears. We know we can do
-		 * this because max_hw_segments has been adjusted to be one
-		 * fewer than the device can handle.
-		 */
-		rq->nr_phys_segments++;
-	}
 }
 EXPORT_SYMBOL(blk_mq_start_request);
@@ -637,11 +638,6 @@ static void __blk_mq_requeue_request(struct request *rq)
 	trace_block_rq_requeue(q, rq);
 	wbt_requeue(q->rq_wb, &rq->issue_stat);
 	blk_mq_sched_requeue_request(rq);
-
-	if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
-		if (q->dma_drain_size && blk_rq_bytes(rq))
-			rq->nr_phys_segments--;
-	}
 }

 void blk_mq_requeue_request(struct request *rq, bool kick_requeue_list)
@@ -763,10 +759,15 @@ void blk_mq_rq_timed_out(struct request *req, bool reserved)
 		__blk_mq_complete_request(req);
 		break;
 	case BLK_EH_RESET_TIMER:
-		blk_add_timer(req);
-		blk_clear_rq_complete(req);
-		break;
+		if (test_bit(REQ_ATOM_STARTED, &req->atomic_flags)) {
+			blk_add_timer(req);
+			blk_clear_rq_complete(req);
+			break;
+		}
+		/* Fall through */
 	case BLK_EH_NOT_HANDLED:
+		if (!test_bit(REQ_ATOM_STARTED, &req->atomic_flags))
+			__blk_mq_complete_request(req);
 		break;
 	default:
 		printk(KERN_ERR "block: bad eh return: %d\n", ret);
I think we've always known it's possible to lose a request during timeout
handling, but just accepted that possibility. It seems to be causing
problems, though, leading to unnecessary error escalation and IO failures.

The possibility arises when the block layer marks the request complete
prior to running the timeout handler. If that request happens to complete
while the handler is running, the request will be lost, inevitably
triggering a second timeout.

This patch attempts to shorten the window for this race condition by
clearing the started flag when the driver completes a request. The block
layer's timeout handler will then complete the command if it observes
the started flag is no longer set.

Note it's possible to lose the command even with this patch. It's just
less likely to happen.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 block/blk-mq.c | 37 +++++++++++++++++++------------------
 1 file changed, 19 insertions(+), 18 deletions(-)