Message ID | 1505772186-4601-1-git-send-email-keith.busch@intel.com (mailing list archive) |
---|---|
State | New, archived |
On Mon, 2017-09-18 at 18:03 -0400, Keith Busch wrote:
> I think we've always known it's possible to lose a request during timeout
> handling, but just accepted that possibility. It seems to be causing
> problems, though, leading to unnecessary error escalation and IO failures.
>
> The possibility arises when the block layer marks the request complete
> prior to running the timeout handler. If that request happens to complete
> while the handler is running, the request will be lost, inevitably
> triggering a second timeout.
>
> This patch attempts to shorten the window for this race condition by
> clearing the started flag when the driver completes a request. The block
> layer's timeout handler will then complete the command if it observes
> the started flag is no longer set.
>
> Note it's possible to lose the command even with this patch. It's just
> less likely to happen.

Hello Keith,

Are you sure the root cause of this race condition is in the blk-mq core?
I've never observed such behavior in any of my numerous scsi-mq tests (which
trigger timeouts). Are you sure the race you observed is not caused by a
blk_mq_reinit_tagset() call, a function that is only used by the NVMe driver
and not by any other block driver?

Bart.
On Mon, Sep 18, 2017 at 10:07:58PM +0000, Bart Van Assche wrote:
> On Mon, 2017-09-18 at 18:03 -0400, Keith Busch wrote:
> > I think we've always known it's possible to lose a request during timeout
> > handling, but just accepted that possibility. It seems to be causing
> > problems, though, leading to unnecessary error escalation and IO failures.
> >
> > The possibility arises when the block layer marks the request complete
> > prior to running the timeout handler. If that request happens to complete
> > while the handler is running, the request will be lost, inevitably
> > triggering a second timeout.
> >
> > This patch attempts to shorten the window for this race condition by
> > clearing the started flag when the driver completes a request. The block
> > layer's timeout handler will then complete the command if it observes
> > the started flag is no longer set.
> >
> > Note it's possible to lose the command even with this patch. It's just
> > less likely to happen.
>
> Hello Keith,
>
> Are you sure the root cause of this race condition is in the blk-mq core?
> I've never observed such behavior in any of my numerous scsi-mq tests (which
> trigger timeouts). Are you sure the race you observed is not caused by a
> blk_mq_reinit_tagset() call, a function that is only used by the NVMe driver
> and not by any other block driver?

Hi Bart,

The nvme driver's use of blk_mq_reinit_tagset only happens during
controller initialisation, but I'm seeing lost commands well after that
during normal and stable running.

The timing is pretty narrow to hit, but I'm pretty sure this is what's
happening. For nvme, this occurs when nvme_timeout() runs concurrently
with nvme_handle_cqe() for the same struct request. For scsi-mq, the same
situation may arise if scsi_mq_done() runs concurrently with
scsi_times_out().

I don't really like the proposed "fix" though, since it only makes the race
less likely, but I didn't see a way to close it without introducing locks.
If someone knows of a better way, that would be awesome.

Thanks,
Keith
On Mon, 2017-09-18 at 18:39 -0400, Keith Busch wrote:
> The nvme driver's use of blk_mq_reinit_tagset only happens during
> controller initialisation, but I'm seeing lost commands well after that
> during normal and stable running.
>
> The timing is pretty narrow to hit, but I'm pretty sure this is what's
> happening. For nvme, this occurs when nvme_timeout() runs concurrently
> with nvme_handle_cqe() for the same struct request. For scsi-mq,
> the same situation may arise if scsi_mq_done() runs concurrently with
> scsi_times_out().

Hello Keith,

Are you sure that scenario can happen? The blk-mq core calls
test_and_set_bit() for the REQ_ATOM_COMPLETE flag before any completion or
timeout handler is called. See also blk_mark_rq_complete(). This prevents
the .complete() and .timeout() functions from running concurrently.

In case this wouldn't be clear, a block driver must not call
blk_mq_end_request() after the timeout handler has finished because that
would trigger a use-after-free of a request structure.

I noticed that your patch includes changes for blk_mq_start_request(). No
timeout or completion handler should be running while blk_mq_start_request()
is being executed. If these changes make a difference in your tests then I
think that means that there is something wrong with the NVMe driver.

Bart.
On Mon, Sep 18, 2017 at 10:53:12PM +0000, Bart Van Assche wrote:
> On Mon, 2017-09-18 at 18:39 -0400, Keith Busch wrote:
> > The nvme driver's use of blk_mq_reinit_tagset only happens during
> > controller initialisation, but I'm seeing lost commands well after that
> > during normal and stable running.
> >
> > The timing is pretty narrow to hit, but I'm pretty sure this is what's
> > happening. For nvme, this occurs when nvme_timeout() runs concurrently
> > with nvme_handle_cqe() for the same struct request. For scsi-mq,
> > the same situation may arise if scsi_mq_done() runs concurrently with
> > scsi_times_out().
>
> Hello Keith,
>
> Are you sure that scenario can happen? The blk-mq core calls
> test_and_set_bit() for the REQ_ATOM_COMPLETE flag before any completion or
> timeout handler is called. See also blk_mark_rq_complete(). This prevents
> the .complete() and .timeout() functions from running concurrently.

Indeed that prevents .complete from running concurrently with the
timeout handler, but scsi_mq_done and nvme_handle_cqe are not .complete
callbacks. These are the LLD functions that call blk_mq_complete_request
well before .complete. If the driver calls blk_mq_complete_request on
a request that blk-mq is timing out, the request is lost because blk-mq
already called blk_mark_rq_complete. Nothing prevents these LLD functions
from running at the same time as the timeout handler.

> In case this wouldn't be clear, a block driver must not call
> blk_mq_end_request() after the timeout handler has finished because that
> would trigger a use-after-free of a request structure.
>
> I noticed that your patch includes changes for blk_mq_start_request(). No
> timeout or completion handler should be running while blk_mq_start_request()
> is being executed. If these changes make a difference in your tests then I
> think that means that there is something wrong with the NVMe driver.

The reason for changing blk_mq_start_request() was because of how the
requeue path was clearing STARTED. I had to remove that since it would
otherwise have been impossible for blk_mq_rq_timed_out() to know whether the
request was requeued vs. completed.
On Mon, 2017-09-18 at 19:08 -0400, Keith Busch wrote:
> On Mon, Sep 18, 2017 at 10:53:12PM +0000, Bart Van Assche wrote:
> > Are you sure that scenario can happen? The blk-mq core calls
> > test_and_set_bit() for the REQ_ATOM_COMPLETE flag before any completion
> > or timeout handler is called. See also blk_mark_rq_complete(). This
> > prevents the .complete() and .timeout() functions from running
> > concurrently.
>
> Indeed that prevents .complete from running concurrently with the
> timeout handler, but scsi_mq_done and nvme_handle_cqe are not .complete
> callbacks. These are the LLD functions that call blk_mq_complete_request
> well before .complete. If the driver calls blk_mq_complete_request on
> a request that blk-mq is timing out, the request is lost because blk-mq
> already called blk_mark_rq_complete. Nothing prevents these LLD functions
> from running at the same time as the timeout handler.

Can you explain how you define "request is lost"? If a timeout occurs for a
SCSI request then scsi_times_out() calls scsi_abort_command() (if no
.eh_timed_out() callback has been defined by the LLD). It is the
responsibility of the SCSI LLD to call .scsi_done() before its
.eh_abort_handler() returns SUCCESS. If .eh_abort_handler() returns a value
other than SUCCESS then the SCSI core will escalate the error further until
.scsi_done() has been called for the command that timed out. See also
scsi_abort_eh_cmnd(). So I think what you wrote is not correct for the SCSI
core and a properly implemented SCSI LLD.

Bart.
On Mon, Sep 18, 2017 at 11:14:38PM +0000, Bart Van Assche wrote:
> On Mon, 2017-09-18 at 19:08 -0400, Keith Busch wrote:
> > On Mon, Sep 18, 2017 at 10:53:12PM +0000, Bart Van Assche wrote:
> > > Are you sure that scenario can happen? The blk-mq core calls
> > > test_and_set_bit() for the REQ_ATOM_COMPLETE flag before any completion
> > > or timeout handler is called. See also blk_mark_rq_complete(). This
> > > prevents the .complete() and .timeout() functions from running
> > > concurrently.
> >
> > Indeed that prevents .complete from running concurrently with the
> > timeout handler, but scsi_mq_done and nvme_handle_cqe are not .complete
> > callbacks. These are the LLD functions that call blk_mq_complete_request
> > well before .complete. If the driver calls blk_mq_complete_request on
> > a request that blk-mq is timing out, the request is lost because blk-mq
> > already called blk_mark_rq_complete. Nothing prevents these LLD functions
> > from running at the same time as the timeout handler.
>
> Can you explain how you define "request is lost"?

Sure, what I mean by "lost" is when nothing will call
__blk_mq_complete_request, which is required to make forward progress on
that request. If a driver calls blk_mq_complete_request on a request being
checked by the timeout handler, blk-mq will return immediately instead of
making progress toward completion, since blk-mq already set
REQ_ATOM_COMPLETE while running the timeout handler.

> If a timeout occurs for a SCSI request then scsi_times_out() calls
> scsi_abort_command() (if no .eh_timed_out() callback has been defined by
> the LLD). It is the responsibility of the SCSI LLD to call .scsi_done()
> before its .eh_abort_handler() returns SUCCESS. If .eh_abort_handler()
> returns a value other than SUCCESS then the SCSI core will escalate the
> error further until .scsi_done() has been called for the command that
> timed out. See also scsi_abort_eh_cmnd(). So I think what you wrote is not
> correct for the SCSI core and a properly implemented SCSI LLD.

Once blk-mq's blk_mq_check_expired calls blk_mark_rq_complete, an LLD
calling blk_mq_complete_request does absolutely nothing because
REQ_ATOM_COMPLETE is already set. There's nothing stopping scsi_mq_done
from running at the same time as blk-mq's timeout handler.

It doesn't matter what path .scsi_done is called from now. That will just
call scsi_mq_done -> blk_mq_complete_request, and since REQ_ATOM_COMPLETE is
already set, the command won't be completed. The only way to complete that
request now is if the timeout handler returns BLK_EH_HANDLED, but the
scsi-mq abort path returns BLK_EH_NOT_HANDLED on success (very few drivers
actually return BLK_EH_HANDLED).

After the timeout handler runs, it's once again possible for the driver to
complete the request if it returned BLK_EH_RESET_TIMER, but if the driver
already completed the command during the timeout handler, how is the driver
supposed to know it needs to complete the request a second time?
On Tue, Sep 19, 2017 at 7:08 AM, Keith Busch <keith.busch@intel.com> wrote:
> On Mon, Sep 18, 2017 at 10:53:12PM +0000, Bart Van Assche wrote:
>> On Mon, 2017-09-18 at 18:39 -0400, Keith Busch wrote:
>> > The nvme driver's use of blk_mq_reinit_tagset only happens during
>> > controller initialisation, but I'm seeing lost commands well after that
>> > during normal and stable running.
>> >
>> > The timing is pretty narrow to hit, but I'm pretty sure this is what's
>> > happening. For nvme, this occurs when nvme_timeout() runs concurrently
>> > with nvme_handle_cqe() for the same struct request. For scsi-mq,
>> > the same situation may arise if scsi_mq_done() runs concurrently with
>> > scsi_times_out().
>>
>> Hello Keith,
>>
>> Are you sure that scenario can happen? The blk-mq core calls
>> test_and_set_bit() for the REQ_ATOM_COMPLETE flag before any completion
>> or timeout handler is called. See also blk_mark_rq_complete(). This
>> prevents the .complete() and .timeout() functions from running
>> concurrently.
>
> Indeed that prevents .complete from running concurrently with the
> timeout handler, but scsi_mq_done and nvme_handle_cqe are not .complete
> callbacks. These are the LLD functions that call blk_mq_complete_request
> well before .complete. If the driver calls blk_mq_complete_request on
> a request that blk-mq is timing out, the request is lost because blk-mq
> already called blk_mark_rq_complete. Nothing prevents these LLD functions

That shouldn't happen, because only one caller of blk_mark_rq_complete() can
win, and it is the winner's responsibility to complete the request, so no
request should be lost. Especially in your case, it is the responsibility of
the timeout path to really complete the rq.

Also, your patch removes test_and_clear_bit(REQ_ATOM_STARTED) from
__blk_mq_requeue_request(), which will make timeouts easier to trigger: the
period between starting the requeue and restarting the request may be long
enough that a timeout fires because STARTED isn't cleared.
On Mon, 2017-09-18 at 21:55 -0400, Keith Busch wrote:
> The only way to complete that request now is if the timeout
> handler returns BLK_EH_HANDLED, but the scsi-mq abort path returns
> BLK_EH_NOT_HANDLED on success (very few drivers actually return
> BLK_EH_HANDLED).
>
> After the timeout handler runs, it's once again possible for the driver to
> complete the request if it returned BLK_EH_RESET_TIMER, but if the driver
> already completed the command during the timeout handler, how is the
> driver supposed to know it needs to complete the request a second time?

Hello Keith,

If you have a look at scmd_eh_abort_handler() then you will see that it
either calls scsi_queue_insert(), scsi_finish_command() or
scsi_eh_scmd_add(). scsi_queue_insert() reinserts the request.
scsi_eh_scmd_add() wakes up the error handler thread to perform further
error handling escalation. scsi_finish_command() calls scsi_end_request().
That last function in turn calls __blk_mq_end_request().
__blk_mq_end_request() does not check REQ_ATOM_COMPLETE.

Sorry, but the summary is that I do not agree with you that the SCSI core
can lose requests.

Bart.
On Tue, Sep 19, 2017 at 12:16:31PM +0800, Ming Lei wrote:
> On Tue, Sep 19, 2017 at 7:08 AM, Keith Busch <keith.busch@intel.com> wrote:
> >
> > Indeed that prevents .complete from running concurrently with the
> > timeout handler, but scsi_mq_done and nvme_handle_cqe are not .complete
> > callbacks. These are the LLD functions that call blk_mq_complete_request
> > well before .complete. If the driver calls blk_mq_complete_request on
> > a request that blk-mq is timing out, the request is lost because blk-mq
> > already called blk_mark_rq_complete. Nothing prevents these LLD functions
>
> That shouldn't happen, because only one caller of blk_mark_rq_complete()
> can win, and it is the winner's responsibility to complete the request, so
> no request should be lost. Especially in your case, it is the
> responsibility of the timeout path to really complete the rq.

Hm, either I'm explaining this poorly, or I'm missing something that's
obvious to everyone else.

The driver's IRQ handler has no idea it's racing with the blk-mq timeout
handler, and there's nothing indicating it lost the race. The IRQ handler
just calls blk_mq_complete_request. As far as the driver is concerned,
it has done its part to complete the request at that point.

The problem is when blk-mq's timeout handler prevents the request from
completing, and doesn't leave any indication the driver requested to
complete it. Who is responsible for completing that request now?
On Tue, 2017-09-19 at 11:07 -0400, Keith Busch wrote:
> The problem is when blk-mq's timeout handler prevents the request from
> completing, and doesn't leave any indication the driver requested to
> complete it. Who is responsible for completing that request now?

Hello Keith,

My understanding is that block drivers are responsible for completing timed
out requests by using one of the following approaches:
* By returning BLK_EH_HANDLED or BLK_EH_RESET_TIMER from inside the timeout
  handler.
* By returning BLK_EH_NOT_HANDLED and by calling blk_mq_end_request() or
  __blk_mq_end_request() for the request that timed out.

Bart.
On Tue, Sep 19, 2017 at 11:07 PM, Keith Busch <keith.busch@intel.com> wrote:
> On Tue, Sep 19, 2017 at 12:16:31PM +0800, Ming Lei wrote:
>> On Tue, Sep 19, 2017 at 7:08 AM, Keith Busch <keith.busch@intel.com> wrote:
>> >
>> > Indeed that prevents .complete from running concurrently with the
>> > timeout handler, but scsi_mq_done and nvme_handle_cqe are not .complete
>> > callbacks. These are the LLD functions that call blk_mq_complete_request
>> > well before .complete. If the driver calls blk_mq_complete_request on
>> > a request that blk-mq is timing out, the request is lost because blk-mq
>> > already called blk_mark_rq_complete. Nothing prevents these LLD functions
>>
>> That shouldn't happen, because only one caller of blk_mark_rq_complete()
>> can win, and it is the winner's responsibility to complete the request,
>> so no request should be lost. Especially in your case, it is the
>> responsibility of the timeout path to really complete the rq.
>
> Hm, either I'm explaining this poorly, or I'm missing something that's
> obvious to everyone else.
>
> The driver's IRQ handler has no idea it's racing with the blk-mq timeout
> handler, and there's nothing indicating it lost the race. The IRQ handler
> just calls blk_mq_complete_request. As far as the driver is concerned,
> it has done its part to complete the request at that point.

Both blk_mq_complete_request() and blk_mq_check_expired() call
blk_mark_rq_complete() to try to complete the req, but only one of them can
actually do it, right?

> The problem is when blk-mq's timeout handler prevents the request from
> completing, and doesn't leave any indication the driver requested to
> complete it. Who is responsible for completing that request now?

Whoever sets ATOM_COMPLETE successfully is responsible for completing the
request. In this case it should be the timeout handler, and the irq handler
shouldn't touch the request any more, otherwise a use-after-free may happen.
On Tue, Sep 19, 2017 at 11:22:20PM +0800, Ming Lei wrote:
> On Tue, Sep 19, 2017 at 11:07 PM, Keith Busch <keith.busch@intel.com> wrote:
> > On Tue, Sep 19, 2017 at 12:16:31PM +0800, Ming Lei wrote:
> >> On Tue, Sep 19, 2017 at 7:08 AM, Keith Busch <keith.busch@intel.com> wrote:
> >> >
> >> > Indeed that prevents .complete from running concurrently with the
> >> > timeout handler, but scsi_mq_done and nvme_handle_cqe are not .complete
> >> > callbacks. These are the LLD functions that call blk_mq_complete_request
> >> > well before .complete. If the driver calls blk_mq_complete_request on
> >> > a request that blk-mq is timing out, the request is lost because blk-mq
> >> > already called blk_mark_rq_complete. Nothing prevents these LLD functions
> >>
> >> That shouldn't happen, because only one caller of blk_mark_rq_complete()
> >> can win, and it is the winner's responsibility to complete the request,
> >> so no request should be lost. Especially in your case, it is the
> >> responsibility of the timeout path to really complete the rq.
> >
> > Hm, either I'm explaining this poorly, or I'm missing something that's
> > obvious to everyone else.
> >
> > The driver's IRQ handler has no idea it's racing with the blk-mq timeout
> > handler, and there's nothing indicating it lost the race. The IRQ handler
> > just calls blk_mq_complete_request. As far as the driver is concerned,
> > it has done its part to complete the request at that point.
>
> Both blk_mq_complete_request() and blk_mq_check_expired() call
> blk_mark_rq_complete() to try to complete the req, but only one of them
> can actually do it, right?

Yes, only one can successfully call that. The problem I'm asking about is
that the driver's IRQ handler doesn't know it lost the race. Only blk-mq
knows that. At the point the driver's timeout handler runs, the driver
believes it already completed the request.

> > The problem is when blk-mq's timeout handler prevents the request from
> > completing, and doesn't leave any indication the driver requested to
> > complete it. Who is responsible for completing that request now?
>
> Whoever sets ATOM_COMPLETE successfully is responsible for completing
> the request. In this case it should be the timeout handler, and the irq
> handler shouldn't touch the request any more, otherwise a use-after-free
> may happen.

Blk-mq sets ATOM_COMPLETE well before the driver's timeout handler is
executed. The IRQ handler doesn't know blk-mq did that, so suggesting the
IRQ handler can't touch the request doesn't make sense.
On Tue, Sep 19, 2017 at 03:18:45PM +0000, Bart Van Assche wrote:
> On Tue, 2017-09-19 at 11:07 -0400, Keith Busch wrote:
> > The problem is when blk-mq's timeout handler prevents the request from
> > completing, and doesn't leave any indication the driver requested to
> > complete it. Who is responsible for completing that request now?
>
> Hello Keith,
>
> My understanding is that block drivers are responsible for completing timed
> out requests by using one of the following approaches:
> * By returning BLK_EH_HANDLED or BLK_EH_RESET_TIMER from inside the timeout
>   handler.

No problem with BLK_EH_HANDLED when the timeout handler completes the
request. That usage at least makes sense.

In NVMe, we use BLK_EH_RESET_TIMER if the driver takes an asynchronous
action to reclaim the request. If the request is returned very soon, though
(before blk-mq clears ATOM_COMPLETE), blk_mq_complete_request will still do
nothing.

> * By returning BLK_EH_NOT_HANDLED and by calling blk_mq_end_request() or
>   __blk_mq_end_request() for the request that timed out.

You want to bypass __blk_mq_complete_request? Doesn't that actually do
important things with queue and scheduler stats? If it's not important, then
this sounds like the piece I'm looking for, but this also puts a burden on
the driver to track the state of its requests that blk-mq could do for all
drivers.
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 98a1860..37144ef 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -566,6 +566,7 @@ void blk_mq_complete_request(struct request *rq)

 	if (unlikely(blk_should_fake_timeout(q)))
 		return;
+	clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
 	if (!blk_mark_rq_complete(rq))
 		__blk_mq_complete_request(rq);
 }
@@ -605,19 +606,19 @@ void blk_mq_start_request(struct request *rq)
 	 * complete. So be sure to clear complete again when we start
 	 * the request, otherwise we'll ignore the completion event.
 	 */
-	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
+	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
 		set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
+		if (q->dma_drain_size && blk_rq_bytes(rq)) {
+			/*
+			 * Make sure space for the drain appears. We know we can do
+			 * this because max_hw_segments has been adjusted to be one
+			 * fewer than the device can handle.
+			 */
+			rq->nr_phys_segments++;
+		}
+	}
 	if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
 		clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
-
-	if (q->dma_drain_size && blk_rq_bytes(rq)) {
-		/*
-		 * Make sure space for the drain appears. We know we can do
-		 * this because max_hw_segments has been adjusted to be one
-		 * fewer than the device can handle.
-		 */
-		rq->nr_phys_segments++;
-	}
 }
 EXPORT_SYMBOL(blk_mq_start_request);
@@ -637,11 +638,6 @@ static void __blk_mq_requeue_request(struct request *rq)
 	trace_block_rq_requeue(q, rq);
 	wbt_requeue(q->rq_wb, &rq->issue_stat);
 	blk_mq_sched_requeue_request(rq);
-
-	if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
-		if (q->dma_drain_size && blk_rq_bytes(rq))
-			rq->nr_phys_segments--;
-	}
 }

 void blk_mq_requeue_request(struct request *rq, bool kick_requeue_list)
@@ -763,10 +759,15 @@ void blk_mq_rq_timed_out(struct request *req, bool reserved)
 		__blk_mq_complete_request(req);
 		break;
 	case BLK_EH_RESET_TIMER:
-		blk_add_timer(req);
-		blk_clear_rq_complete(req);
-		break;
+		if (test_bit(REQ_ATOM_STARTED, &req->atomic_flags)) {
+			blk_add_timer(req);
+			blk_clear_rq_complete(req);
+			break;
+		}
+		/* Fall through */
 	case BLK_EH_NOT_HANDLED:
+		if (!test_bit(REQ_ATOM_STARTED, &req->atomic_flags))
+			__blk_mq_complete_request(req);
 		break;
 	default:
 		printk(KERN_ERR "block: bad eh return: %d\n", ret);
I think we've always known it's possible to lose a request during timeout
handling, but just accepted that possibility. It seems to be causing
problems, though, leading to unnecessary error escalation and IO failures.

The possibility arises when the block layer marks the request complete
prior to running the timeout handler. If that request happens to complete
while the handler is running, the request will be lost, inevitably
triggering a second timeout.

This patch attempts to shorten the window for this race condition by
clearing the started flag when the driver completes a request. The block
layer's timeout handler will then complete the command if it observes
the started flag is no longer set.

Note it's possible to lose the command even with this patch. It's just
less likely to happen.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 block/blk-mq.c | 37 +++++++++++++++++++------------------
 1 file changed, 19 insertions(+), 18 deletions(-)