Message ID | 20180718205321.GA32160@localhost.localdomain (mailing list archive) |
---|---|
State | New, archived |
On Wed, 2018-07-18 at 14:53 -0600, Keith Busch wrote:
> If scsi needs this behavior, why not just put that behavior in scsi? It
> can set the state to complete and then everything can play out as
> before.
> [ ... ]

There may be other drivers that need the same protection the SCSI core needs
so I think the patch at the end of your previous e-mail is a step in the wrong
direction.

Bart.
On Wed, Jul 18, 2018 at 08:58:48PM +0000, Bart Van Assche wrote:
> On Wed, 2018-07-18 at 14:53 -0600, Keith Busch wrote:
> > If scsi needs this behavior, why not just put that behavior in scsi? It
> > can set the state to complete and then everything can play out as
> > before.
> > [ ... ]
>
> There may be other drivers that need the same protection the SCSI core needs
> so I think the patch at the end of your previous e-mail is a step in the wrong
> direction.
>
> Bart.

And there may be other drivers that don't want their completions
ignored, so breaking them again is also a step in the wrong direction.

There are not that many blk-mq drivers, so we can go through them all.
Most don't even implement .timeout, so they never know that condition
ever happened. Others always return BLK_EH_RESET_TIMER without doing
anything else, so the 'new' behavior would have to be better for those,
too.

The following don't implement .timeout: loop, rbd, virtio, xen, dm, ubi, scm

The following always return RESET_TIMER: null, skd

The following is safe with the new way: mtip

And now the ones I am not sure about: nbd, mmc, dasd

I don't know, reverting looks worse than just fixing the drivers.
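For reference, a driver in the "always returns BLK_EH_RESET_TIMER" class ends
up with a timeout handler along the following lines. This is a minimal sketch
against the 4.18-era blk_mq_ops interface, not any particular driver's code.

#include <linux/blk-mq.h>

/*
 * Sketch of a trivial .timeout handler: the driver performs no recovery of
 * its own and simply asks blk-mq to re-arm the request's timer. It would be
 * hooked up through the .timeout member of the driver's struct blk_mq_ops.
 */
static enum blk_eh_timer_return example_timeout(struct request *rq,
						bool reserved)
{
	/* The request stays in flight; blk-mq re-arms its timer. */
	return BLK_EH_RESET_TIMER;
}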
On Wed, 2018-07-18 at 15:17 -0600, Keith Busch wrote:
> There are not that many blk-mq drivers, so we can go through them all.
I'm not sure that's the right approach. I think it is the responsibility of
the block layer to handle races between completion handling and timeout
handling and that this is not the responsibility of e.g. a block driver. If
you look at e.g. the legacy block layer then you will see that it takes care
of this race and that legacy block drivers do not have to worry about this
race.
Bart.
On Wed, Jul 18, 2018 at 09:30:11PM +0000, Bart Van Assche wrote:
> On Wed, 2018-07-18 at 15:17 -0600, Keith Busch wrote:
> > There are not that many blk-mq drivers, so we can go through them all.
>
> I'm not sure that's the right approach. I think it is the responsibility of
> the block layer to handle races between completion handling and timeout
> handling and that this is not the responsibility of e.g. a block driver. If
> you look at e.g. the legacy block layer then you will see that it takes care
> of this race and that legacy block drivers do not have to worry about this
> race.

Reverting doesn't handle the race at all. It just ignores completions and
puts the responsibility on the drivers to handle the race because that's
what scsi wants to happen.
On Wed, Jul 18, 2018 at 03:17:11PM -0600, Keith Busch wrote:
> And there may be other drivers that don't want their completions
> ignored, so breaking them again is also a step in the wrong direction.
>
> There are not that many blk-mq drivers, so we can go through them all.

I think the point is that SCSI is the biggest user, both by the number
of low-level drivers sitting under the midlayer and by usage.

We need to be very careful not to break it. Note that this doesn't
mean that I don't want to eventually move away from just ignoring
completions in timeout state for SCSI. I'd just rather revert 4.18
to a clean known state instead of doctoring around late in the rc
phase.

> Most don't even implement .timeout, so they never know that condition
> ever happened. Others always return BLK_EH_RESET_TIMER without doing
> anything else, so the 'new' behavior would have to be better for those,
> too.

And we should never even hit the timeout handler for those, as it
is rather pointless (although it looks like we currently do...).
On Wed, Jul 18, 2018 at 02:53:21PM -0600, Keith Busch wrote:
> If scsi needs this behavior, why not just put that behavior in scsi? It
> can set the state to complete and then everything can play out as
> before.

I think even with this we are missing handling for the somewhat
degenerate blk_abort_request case. But most importantly we'll need some
good test coverage. Please do some basic testing (e.g. with a version of
the hack from Jianchao, who seems to keep getting dropped from this
thread for some reason) and send it out to the block and scsi lists.
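For context, blk_abort_request() is the block-layer export that lets a driver
force timeout handling for a request immediately instead of waiting for its
timer to expire, which is the "degenerate" case mentioned above. Here is a
minimal sketch of how a driver might use it; everything named "example_" is
hypothetical, only blk_abort_request() is the real interface:

#include <linux/blkdev.h>

/*
 * Hypothetical driver error path: the hardware reported that it dropped a
 * command, so kick the block layer's timeout machinery for that request
 * right away. This ends up in the same .timeout handler (scsi_times_out()
 * for SCSI) that a normal timer expiry would reach.
 */
static void example_handle_dropped_command(struct request *rq)
{
	blk_abort_request(rq);
}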
On Thu, Jul 19, 2018 at 03:19:04PM +0200, hch@lst.de wrote:
> On Wed, Jul 18, 2018 at 03:17:11PM -0600, Keith Busch wrote:
> > And there may be other drivers that don't want their completions
> > ignored, so breaking them again is also a step in the wrong direction.
> >
> > There are not that many blk-mq drivers, so we can go through them all.
>
> I think the point is that SCSI is the biggest user, both by the number
> of low-level drivers sitting under the midlayer and by usage.
>
> We need to be very careful not to break it. Note that this doesn't
> mean that I don't want to eventually move away from just ignoring
> completions in timeout state for SCSI. I'd just rather revert 4.18
> to a clean known state instead of doctoring around late in the rc
> phase.

I definitely do not want to break scsi. I just don't want to break
everyone else either, and I think scsi can get the behavior it wants
without forcing others to subscribe to it.

> > Most don't even implement .timeout, so they never know that condition
> > ever happened. Others always return BLK_EH_RESET_TIMER without doing
> > anything else, so the 'new' behavior would have to be better for those,
> > too.
>
> And we should never even hit the timeout handler for those, as it
> is rather pointless (although it looks like we currently do...).

I don't see why we'd expect to never hit the timeout handler for at least
some of these. It's not a stretch to see, for example, that virtio-blk or
loop could have their requests lost with no way to recover if we revert.
I've wasted too much time debugging hardware for such lost commands when
it was in fact functioning perfectly fine. So reintroducing that behavior
is a bit distressing.
On Thu, Jul 19, 2018 at 08:59:31AM -0600, Keith Busch wrote:
> > And we should never even hit the timeout handler for those, as it
> > is rather pointless (although it looks like we currently do...).
>
> I don't see why we'd expect to never hit the timeout handler for at least
> some of these. It's not a stretch to see, for example, that virtio-blk or
> loop could have their requests lost with no way to recover if we revert.
> I've wasted too much time debugging hardware for such lost commands when
> it was in fact functioning perfectly fine. So reintroducing that behavior
> is a bit distressing.

Even some scsi drivers are susceptible to losing their requests with the
reverted behavior: take virtio-scsi for example, which returns RESET_TIMER
from its timeout handler. With the behavior everyone seems to want,
a natural completion at or around the same time is lost forever because
it was blocked from completion with no way to recover.

While the window in which requests may be lost is quite narrow, I've seen
it enough in very difficult to reproduce scenarios that hardware devs no
longer trust that IO timeouts are their problem, because Linux loses their
completions.
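The lost-completion scenario described above comes from the way the pre-4.18
blk-mq code (the behavior a revert would restore) serialized completions
against timeouts with an atomic "complete" flag. Roughly, as a sketch
reconstructed from memory rather than copied from the old source:

/*
 * From-memory sketch of the pre-4.18 blk-mq logic; the helpers approximate
 * the old REQ_ATOM_COMPLETE-based blk_mark_rq_complete() and
 * blk_clear_rq_complete().
 */

/* Completion path, e.g. called from the driver's interrupt handler. */
void blk_mq_complete_request(struct request *rq)
{
	/* blk_mark_rq_complete() atomically test-and-sets the complete bit. */
	if (!blk_mark_rq_complete(rq))
		__blk_mq_complete_request(rq);
	/* else: a timeout is in progress and this completion is dropped. */
}

/* Timeout path, once the driver's .timeout handler has been called. */
static void blk_mq_rq_timed_out(struct request *req, bool reserved)
{
	switch (req->q->mq_ops->timeout(req, reserved)) {
	case BLK_EH_RESET_TIMER:
		blk_add_timer(req);		/* re-arm the timer ... */
		blk_clear_rq_complete(req);	/* ... and accept completions again */
		/*
		 * A completion that raced with the timeout handler was already
		 * dropped above, so the re-armed request can only time out
		 * again later: the command is effectively lost.
		 */
		break;
	default:
		break;
	}
}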
On Thu, 2018-07-19 at 09:56 -0600, Keith Busch wrote:
> Even some scsi drivers are susceptible to losing their requests with the
> reverted behavior: take virtio-scsi for example, which returns RESET_TIMER
> from its timeout handler. With the behavior everyone seems to want,
> a natural completion at or around the same time is lost forever because
> it was blocked from completion with no way to recover.

The patch I had posted properly handles a completion that occurs while a
timeout is being handled. From
https://www.mail-archive.com/linux-block@vger.kernel.org/msg22196.html:

 void blk_mq_complete_request(struct request *rq)
 [ ... ]
+	if (blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT,
+				   MQ_RQ_COMPLETE)) {
+		__blk_mq_complete_request(rq);
+		break;
+	}
+	if (blk_mq_change_rq_state(rq, MQ_RQ_TIMED_OUT, MQ_RQ_COMPLETE))
+		break;
 [ ... ]

@@ -838,25 +838,42 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 [ ... ]
 	case BLK_EH_RESET_TIMER:
 [ ... ]
+		if (blk_mq_rq_state(req) == MQ_RQ_COMPLETE) {
+			__blk_mq_complete_request(req);
+			break;
+		}

Bart.
On Thu, Jul 19, 2018 at 04:04:46PM +0000, Bart Van Assche wrote:
> On Thu, 2018-07-19 at 09:56 -0600, Keith Busch wrote:
> > Even some scsi drivers are susceptible to losing their requests with the
> > reverted behavior: take virtio-scsi for example, which returns RESET_TIMER
> > from its timeout handler. With the behavior everyone seems to want,
> > a natural completion at or around the same time is lost forever because
> > it was blocked from completion with no way to recover.
>
> The patch I had posted properly handles a completion that occurs while a
> timeout is being handled. From
> https://www.mail-archive.com/linux-block@vger.kernel.org/msg22196.html:

Sounds like a win-win to me.
On Thu, Jul 19, 2018 at 10:22:40AM -0600, Keith Busch wrote:
> On Thu, Jul 19, 2018 at 04:04:46PM +0000, Bart Van Assche wrote:
> > On Thu, 2018-07-19 at 09:56 -0600, Keith Busch wrote:
> > > Even some scsi drivers are susceptible to losing their requests with the
> > > reverted behavior: take virtio-scsi for example, which returns RESET_TIMER
> > > from its timeout handler. With the behavior everyone seems to want,
> > > a natural completion at or around the same time is lost forever because
> > > it was blocked from completion with no way to recover.
> >
> > The patch I had posted properly handles a completion that occurs while a
> > timeout is being handled. From
> > https://www.mail-archive.com/linux-block@vger.kernel.org/msg22196.html:
>
> Sounds like a win-win to me.

How do we get a fix into 4.18 at this point in the cycle? I think that
is the most important priority right now.
On Thu, Jul 19, 2018 at 06:29:24PM +0200, hch@lst.de wrote:
> How do we get a fix into 4.18 at this point in the cycle? I think that
> is the most important priority right now.

Even if you were okay at this point with incorporating the concepts from
Bart's patch, it still looks like trouble for scsi (I will elaborate
separately). But reverting breaks other things we finally got working, so
I'd still vote for isolating the old behavior to scsi if that isn't too
unpalatable. I'll send a small patch shortly and see what happens.
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 22326612a5d3..f50559718b71 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -558,10 +558,8 @@ static void __blk_mq_complete_request(struct request *rq)
 	bool shared = false;
 	int cpu;
 
-	if (cmpxchg(&rq->state, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE) !=
-			MQ_RQ_IN_FLIGHT)
+	if (blk_mq_mark_complete(rq))
 		return;
-
 	if (rq->internal_tag != -1)
 		blk_mq_sched_completed_request(rq);
 
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 8932ae81a15a..a5d05fab24a7 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -286,6 +286,14 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
 	enum blk_eh_timer_return rtn = BLK_EH_DONE;
 	struct Scsi_Host *host = scmd->device->host;
 
+	/*
+	 * Mark complete now so lld can't post a completion during error
+	 * handling. Return immediately if it was already marked complete, as
+	 * that means the lld posted a completion already.
+	 */
+	if (req->q->mq_ops && blk_mq_mark_complete(req))
+		return rtn;
+
 	trace_scsi_dispatch_cmd_timeout(scmd);
 	scsi_log_completion(scmd, TIMEOUT_ERROR);
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 9b0fd11ce89a..0ce587c9c27b 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -289,6 +289,15 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
 
 void blk_mq_quiesce_queue_nowait(struct request_queue *q);
 
+/*
+ * Returns true if request is not in flight.
+ */
+static inline bool blk_mq_mark_complete(struct request *rq)
+{
+	return (cmpxchg(&rq->state, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE) !=
+			MQ_RQ_IN_FLIGHT);
+}
+
 /*
  * Driver command data is immediately after the request. So subtract request
  * size to get back to the original request, add request size to get the PDU.
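To illustrate the driver side of this (a minimal sketch, not part of the patch
itself; blk_mq_complete_request() is the real blk-mq entry point and the
"example_" name is made up for illustration):

#include <linux/blk-mq.h>

/*
 * Hypothetical low-level driver completion path. With the patch above, if
 * scsi_times_out() already claimed the request via blk_mq_mark_complete(),
 * the cmpxchg in __blk_mq_complete_request() no longer sees MQ_RQ_IN_FLIGHT
 * and the call returns early, so this late completion cannot race with SCSI
 * error handling.
 */
static void example_complete_rq(struct request *rq)
{
	blk_mq_complete_request(rq);
}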