
[RFC,3/3] blk-mq: Remove generation sequence

Message ID 20180718205321.GA32160@localhost.localdomain (mailing list archive)
State New, archived

Commit Message

Keith Busch July 18, 2018, 8:53 p.m. UTC
On Wed, Jul 18, 2018 at 09:56:50PM +0200, hch@lst.de wrote:
> On Fri, Jul 13, 2018 at 05:58:08PM -0600, Keith Busch wrote:
> > Of the two you mentioned, yours is preferable IMO. While I appreciate
> > Jianchao's detailed analysis, it's hard to take a proposal seriously
> > that so colourfully calls everyone else "dangerous" while advocating
> > for silently losing requests on purpose.
> > 
> > But where's the option that fixes scsi to handle hardware completions
> > concurrently with arbitrary timeout software? Propping up that house of
> > cards can't be the only recourse.
> 
> The important bit is that we need to fix this issue quickly.  We are
> past -rc5 so I'm rather concerned about anything too complicated.
> 
> I'm not even sure SCSI has a problem with multiple completions happening
> at the same time, but it certainly has a problem with bypassing
> blk_mq_complete_request from the EH path.
> 
> I think we can solve this properly, but I also think we are way too late
> in the 4.18 cycle to fix it properly.  For now I fear we'll just have
> to revert the changes and try again for 4.19 or even 4.20 if we don't
> act quickly enough.

If scsi needs this behavior, why not just put that behavior in scsi? It
can set the state to complete and then everything can play out as
before.


Comments

Bart Van Assche July 18, 2018, 8:58 p.m. UTC | #1
On Wed, 2018-07-18 at 14:53 -0600, Keith Busch wrote:
> If scsi needs this behavior, why not just put that behavior in scsi? It
> can set the state to complete and then everything can play out as
> before.
> [ ... ]

There may be other drivers that need the same protection the SCSI core needs
so I think the patch at the end of your previous e-mail is a step in the wrong
direction.

Bart.
Keith Busch July 18, 2018, 9:17 p.m. UTC | #2
On Wed, Jul 18, 2018 at 08:58:48PM +0000, Bart Van Assche wrote:
> On Wed, 2018-07-18 at 14:53 -0600, Keith Busch wrote:
> > If scsi needs this behavior, why not just put that behavior in scsi? It
> > can set the state to complete and then everything can play out as
> > before.
> > [ ... ]
> 
> There may be other drivers that need the same protection the SCSI core needs
> so I think the patch at the end of your previous e-mail is a step in the wrong
> direction.
> 
> Bart.

And there may be other drivers that don't want their completions
ignored, so breaking them again is also a step in the wrong direction.

There are not that many blk-mq drivers, so we can go through them all.

Most don't even implement .timeout, so they never know that condition
ever happened. Others always return BLK_EH_RESET_TIMER without doing
anything else, so the 'new' behavior would have to be better for those,
too.

The following don't implement .timeout:

  loop, rbd, virtio, xen, dm, ubi, scm

The following always return RESET_TIMER:

  null, skd

The following is safe with the new way:

  mtip

And now ones I am not sure about:

  nbd, mmc, dasd

I don't know, reverting looks worse than just fixing the drivers.
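
For context, a timeout handler in the "always return RESET_TIMER" category
has roughly the following shape. This is a minimal sketch assuming the
4.18-era .timeout signature, not code taken from the null_blk or skd
sources:

  #include <linux/blk-mq.h>

  /*
   * Sketch only: the driver never fails the request from its timeout
   * handler, it just asks blk-mq to re-arm the timer and keeps waiting
   * for a real completion to arrive.
   */
  static enum blk_eh_timer_return example_timeout(struct request *rq,
                                                  bool reserved)
  {
          return BLK_EH_RESET_TIMER;
  }

  static const struct blk_mq_ops example_mq_ops = {
          /* .queue_rq and the other mandatory ops elided */
          .timeout = example_timeout,
  };

A driver of this shape depends entirely on a real completion eventually
arriving, which is why silently dropping one leaves it with no way to
recover.
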
Bart Van Assche July 18, 2018, 9:30 p.m. UTC | #3
On Wed, 2018-07-18 at 15:17 -0600, Keith Busch wrote:
> There are not that many blk-mq drivers, so we can go through them all.

I'm not sure that's the right approach. I think handling races between
completion handling and timeout handling is the block layer's
responsibility, not that of individual block drivers. The legacy block
layer, for example, takes care of this race so that legacy block drivers
do not have to worry about it.

Bart.
Keith Busch July 18, 2018, 9:33 p.m. UTC | #4
On Wed, Jul 18, 2018 at 09:30:11PM +0000, Bart Van Assche wrote:
> On Wed, 2018-07-18 at 15:17 -0600, Keith Busch wrote:
> > There are not that many blk-mq drivers, so we can go through them all.
> 
> I'm not sure that's the right approach. I think handling races between
> completion handling and timeout handling is the block layer's
> responsibility, not that of individual block drivers. The legacy block
> layer, for example, takes care of this race so that legacy block drivers
> do not have to worry about it.

Reverting doesn't handle the race at all. It just ignores completions
and puts the responsibility on the drivers to handle the race because
that's what scsi wants to happen.
Christoph Hellwig July 19, 2018, 1:19 p.m. UTC | #5
On Wed, Jul 18, 2018 at 03:17:11PM -0600, Keith Busch wrote:
> And there may be other drivers that don't want their completions
> ignored, so breaking them again is also a step in the wrong direction.
> 
> There are not that many blk-mq drivers, so we can go through them all.

I think the point is that SCSI is the biggest user by both the number
of low-level drivers sitting under the midlayer, and also by usage.

We need to be very careful not to break it.  Note that this doesn't
mean that I don't want to eventually move away from just ignoring
completions in timeout state for SCSI.  I'd just rather revert 4.18
to a known clean state instead of doctoring around late in the rc
phase.

> Most don't even implement .timeout, so they never know that condition
> ever happened. Others always return BLK_EH_RESET_TIMER without doing
> anything else, so the 'new' behavior would have to be better for those,
> too.

And we should never even hit the timeout handler for those as it
is rather pointless (although it looks like we currently do...).
Christoph Hellwig July 19, 2018, 1:22 p.m. UTC | #6
On Wed, Jul 18, 2018 at 02:53:21PM -0600, Keith Busch wrote:
> If scsi needs this behavior, why not just put that behavior in scsi? It
> can set the state to complete and then everything can play out as
> before.

I think even with this we are missing handling for the somewhat
degenerate blk_abort_request case.

But most importantly we'll need some good test coverage.  Please
do some basic testing (e.g. with a version of the hack from
Jianchao, who seems to keep getting dropped from this thread for
some reason) and send it out to the block and scsi lists.
Keith Busch July 19, 2018, 2:59 p.m. UTC | #7
On Thu, Jul 19, 2018 at 03:19:04PM +0200, hch@lst.de wrote:
> On Wed, Jul 18, 2018 at 03:17:11PM -0600, Keith Busch wrote:
> > And there may be other drivers that don't want their completions
> > ignored, so breaking them again is also a step in the wrong direction.
> > 
> > There are not that many blk-mq drivers, so we can go through them all.
> 
> I think the point is that SCSI is the biggest user by both the number
> of low-level drivers sitting under the midlayer, and also by usage.
> 
> We need to be very careful not to break it.  Note that this doesn't
> mean that I don't want to eventually move away from just ignoring
> completions in timeout state for SCSI.  I'd just rather revert 4.18
> to a known clean state instead of doctoring around late in the rc
> phase.

I definitely do not want to break scsi. I just don't want to break every
one else either, and I think scsi can get the behavior it wants without
forcing others to subscribe to it.
 
> > Most don't even implement .timeout, so they never know that condition
> > ever happened. Others always return BLK_EH_RESET_TIMER without doing
> > anything else, so the 'new' behavior would have to be better for those,
> > too.
> 
> And we should never even hit the timeout handler for those as it
> is rather pointless (although it looks like we currently do...).

I don't see why we'd expect to never hit timeout for at least some of
these. It's not a stretch to see, for example, that virtio-blk or loop
could have their requests lost with no way to recover if we revert. I've
wasted too much time debugging hardware for such lost commands when it
was in fact functioning perfectly fine. So reintroducing that behavior
is a bit distressing.
Keith Busch July 19, 2018, 3:56 p.m. UTC | #8
On Thu, Jul 19, 2018 at 08:59:31AM -0600, Keith Busch wrote:
> > And we should never even hit the timeout handler for those as it
> is rather pointless (although it looks like we currently do...).
> 
> I don't see why we'd expect to never hit timeout for at least some of
> these. It's not a stretch to see, for example, that virtio-blk or loop
> could have their requests lost with no way to recover if we revert. I've
> wasted too much time debugging hardware for such lost commands when it
> was in fact functioning perfectly fine. So reintroducing that behavior
> is a bit distressing.

Even some scsi drivers are susceptible to losing their requests with the
reverted behavior: take virtio-scsi for example, which returns RESET_TIMER
from its timeout handler. With the behavior everyone seems to want,
a natural completion at or around the same time is lost forever because
it was blocked from completion with no way to recover.

While the window in which requests may be lost is quite narrow, I've
seen it often enough in very hard to reproduce scenarios that hardware
devs no longer trust that IO timeouts are their problem, because Linux
loses their completions.
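
To make the window concrete, the interleaving described above plays out
roughly as follows (an illustrative reconstruction with simplified state
names, not a captured trace):

  LLD irq context                     timeout handling (reverted behavior)
  ---------------                     ------------------------------------
                                      request fenced off (treated as
                                      completed) before .timeout is called
  blk_mq_complete_request(rq)
    state is no longer in flight,
    completion is silently dropped
                                      .timeout returns BLK_EH_RESET_TIMER
                                      request re-armed and back in flight
  (no second completion will arrive;
   the request can only time out again)
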
Bart Van Assche July 19, 2018, 4:04 p.m. UTC | #9
On Thu, 2018-07-19 at 09:56 -0600, Keith Busch wrote:
> Even some scsi drivers are susceptible to losing their requests with the
> reverted behavior: take virtio-scsi for example, which returns RESET_TIMER
> from its timeout handler. With the behavior everyone seems to want,
> a natural completion at or around the same time is lost forever because
> it was blocked from completion with no way to recover.

The patch I had posted properly handles a completion that occurs while a
timeout is being handled. From https://www.mail-archive.com/linux-block@vger.kernel.org/msg22196.html:

 void blk_mq_complete_request(struct request *rq)
[ ... ]
+               if (blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT,
+                                          MQ_RQ_COMPLETE)) {
+                       __blk_mq_complete_request(rq);
+                       break;
+               }
+               if (blk_mq_change_rq_state(rq, MQ_RQ_TIMED_OUT, MQ_RQ_COMPLETE))
+                       break;
[ ... ]
@@ -838,25 +838,42 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
[ ... ]
        case BLK_EH_RESET_TIMER:
[ ... ]
+                       if (blk_mq_rq_state(req) == MQ_RQ_COMPLETE) {
+                               __blk_mq_complete_request(req);
+                               break;
+                       }

Bart.
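
The key point in the excerpt: a completion that races with timeout handling
moves the request from MQ_RQ_TIMED_OUT to MQ_RQ_COMPLETE instead of being
dropped, and the BLK_EH_RESET_TIMER path then sees MQ_RQ_COMPLETE and calls
__blk_mq_complete_request() itself, so the racing completion is delivered
rather than lost.
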
Keith Busch July 19, 2018, 4:22 p.m. UTC | #10
On Thu, Jul 19, 2018 at 04:04:46PM +0000, Bart Van Assche wrote:
> On Thu, 2018-07-19 at 09:56 -0600, Keith Busch wrote:
> > Even some scsi drivers are susceptible to losing their requests with the
> > reverted behavior: take virtio-scsi for example, which returns RESET_TIMER
> > from its timeout handler. With the behavior everyone seems to want,
> > a natural completion at or around the same time is lost forever because
> > it was blocked from completion with no way to recover.
> 
> The patch I had posted properly handles a completion that occurs while a
> timeout is being handled. From https://www.mail-archive.com/linux-block@vger.kernel.org/msg22196.html:

Sounds like a win-win to me.
Christoph Hellwig July 19, 2018, 4:29 p.m. UTC | #11
On Thu, Jul 19, 2018 at 10:22:40AM -0600, Keith Busch wrote:
> On Thu, Jul 19, 2018 at 04:04:46PM +0000, Bart Van Assche wrote:
> > On Thu, 2018-07-19 at 09:56 -0600, Keith Busch wrote:
> > > Even some scsi drivers are susceptible to losing their requests with the
> > > reverted behavior: take virtio-scsi for example, which returns RESET_TIMER
> > > from its timeout handler. With the behavior everyone seems to want,
> > > a natural completion at or around the same time is lost forever because
> > > it was blocked from completion with no way to recover.
> > 
> > The patch I had posted properly handles a completion that occurs while a
> > timeout is being handled. From https://www.mail-archive.com/linux-block@vger.kernel.org/msg22196.html:
> 
> Sounds like a win-win to me.

How do we get a fix into 4.18 at this part of the cycle?  I think that
is the most important priority right now.
Keith Busch July 19, 2018, 8:18 p.m. UTC | #12
On Thu, Jul 19, 2018 at 06:29:24PM +0200, hch@lst.de wrote:
> How do we get a fix into 4.18 at this part of the cycle?  I think that
> is the most important priority right now.

Even if you were okay at this point with incorporating the concepts from
Bart's patch, it still looks like trouble for scsi (will elaborate
separately).

But reverting breaks other things we finally got working, so I'd
still vote for isolating the old behavior to scsi if that isn't too
unpalatable. I'll send a small patch shortly and see what happens.

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 22326612a5d3..f50559718b71 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -558,10 +558,8 @@  static void __blk_mq_complete_request(struct request *rq)
 	bool shared = false;
 	int cpu;
 
-	if (cmpxchg(&rq->state, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE) !=
-			MQ_RQ_IN_FLIGHT)
+	if (blk_mq_mark_complete(rq))
 		return;
-
 	if (rq->internal_tag != -1)
 		blk_mq_sched_completed_request(rq);
 
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 8932ae81a15a..a5d05fab24a7 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -286,6 +286,14 @@  enum blk_eh_timer_return scsi_times_out(struct request *req)
 	enum blk_eh_timer_return rtn = BLK_EH_DONE;
 	struct Scsi_Host *host = scmd->device->host;
 
+	/*
+	 * Mark complete now so lld can't post a completion during error
+	 * handling. Return immediately if it was already marked complete, as
+	 * that means the lld posted a completion already.
+	 */
+	if (req->q->mq_ops && blk_mq_mark_complete(req))
+		return rtn;
+
 	trace_scsi_dispatch_cmd_timeout(scmd);
 	scsi_log_completion(scmd, TIMEOUT_ERROR);
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 9b0fd11ce89a..0ce587c9c27b 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -289,6 +289,15 @@  void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
 
 void blk_mq_quiesce_queue_nowait(struct request_queue *q);
 
+/*
+ * Returns true if request is not in flight.
+ */
+static inline bool blk_mq_mark_complete(struct request *rq)
+{
+	return (cmpxchg(&rq->state, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE) !=
+			MQ_RQ_IN_FLIGHT);
+}
+
 /*
  * Driver command data is immediately after the request. So subtract request
  * size to get back to the original request, add request size to get the PDU.
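
The cmpxchg in blk_mq_mark_complete() is what makes the claim atomic:
whichever of the normal completion path and the scsi timeout path reaches
the request first wins, and the other side sees the state change and backs
off. The following standalone sketch shows the same semantics with C11
atomics; the names are hypothetical stand-ins, not kernel code:

  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stdio.h>

  enum rq_state { MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE };

  struct request { _Atomic int state; };

  /* Mirrors blk_mq_mark_complete(): returns true if the request was not
   * in flight, i.e. the other side already claimed it. */
  static bool mark_complete(struct request *rq)
  {
          int expected = MQ_RQ_IN_FLIGHT;

          return !atomic_compare_exchange_strong(&rq->state, &expected,
                                                 MQ_RQ_COMPLETE);
  }

  static void *completion_path(void *arg)  /* stands in for the LLD irq */
  {
          if (!mark_complete(arg))
                  puts("completion path claimed the request");
          return NULL;
  }

  static void *timeout_path(void *arg)  /* stands in for scsi_times_out() */
  {
          if (!mark_complete(arg))
                  puts("timeout path claimed the request");
          return NULL;
  }

  int main(void)
  {
          struct request rq = { MQ_RQ_IN_FLIGHT };
          pthread_t a, b;

          pthread_create(&a, NULL, completion_path, &rq);
          pthread_create(&b, NULL, timeout_path, &rq);
          pthread_join(a, NULL);
          pthread_join(b, NULL);
          /* Exactly one of the two lines prints, never both. */
          return 0;
  }

Under this scheme the loser of the race knows it lost, which is what lets
the scsi hunk above return early from scsi_times_out() when the lld has
already posted a completion.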