Message ID | 1448037342-18384-5-git-send-email-hch@lst.de (mailing list archive) |
---|---|
State | New, archived |
Hi Christoph,

Christoph Hellwig <hch@lst.de> writes:

> This marks the request as one that's not actually completed yet, but
> should be reaped next time blk_mq_complete_request comes in. This is
> useful if the abort handler kicked off a reset that will complete all
> pending requests.

What's the purpose, though? Is this an optimization?

We've had "fun" problems with races between completion and timeout
before. I can't say I'm too keen on adding more complexity to this code
path. Have you considered what happens in your new code when this race
occurs? I don't expect it to cause any issues in the mq case, since the
timeout handler should run on the same CPU as the completion code for a
given request (right?). However, for the old code path, they could run
in parallel.

blk_complete_request:
A	if (!blk_mark_rq_complete(rq) ||
B	    test_and_clear_bit(REQ_ATOM_QUIESCED, &req->atomic_flags)) {
C		__blk_mq_complete_request(rq);

could run alongside of:

blk_rq_check_expired:
1	if (!blk_mark_rq_complete(rq))
2		blk_rq_timed_out(rq);

So, if 1 comes before A, we have two cases to consider:

i.  the expiration path has not yet set REQ_ATOM_QUIESCED before the
    completion code runs, and so the completion code does nothing.

ii. the expiration path *does* set REQ_ATOM_QUIESCED. In this instance,
    will we get yet another completion for the request when the command
    is ultimately retired by the adapter reset?
Cheers,
Jeff

> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  block/blk-mq.c         | 6 +++++-
>  block/blk-softirq.c    | 3 ++-
>  block/blk-timeout.c    | 3 +++
>  block/blk.h            | 1 +
>  include/linux/blkdev.h | 1 +
>  5 files changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 8354601..76773dc 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -383,7 +383,8 @@ void blk_mq_complete_request(struct request *rq, int error)
>
>  	if (unlikely(blk_should_fake_timeout(q)))
>  		return;
> -	if (!blk_mark_rq_complete(rq)) {
> +	if (!blk_mark_rq_complete(rq) ||
> +	    test_and_clear_bit(REQ_ATOM_QUIESCED, &rq->atomic_flags)) {
>  		rq->errors = error;
>  		__blk_mq_complete_request(rq);
>  	}
> @@ -586,6 +587,9 @@ void blk_mq_rq_timed_out(struct request *req, bool reserved)
>  		break;
>  	case BLK_EH_NOT_HANDLED:
>  		break;
> +	case BLK_EH_QUIESCED:
> +		set_bit(REQ_ATOM_QUIESCED, &req->atomic_flags);
> +		break;
>  	default:
>  		printk(KERN_ERR "block: bad eh return: %d\n", ret);
>  		break;
> diff --git a/block/blk-softirq.c b/block/blk-softirq.c
> index 53b1737..9d47fbc 100644
> --- a/block/blk-softirq.c
> +++ b/block/blk-softirq.c
> @@ -167,7 +167,8 @@ void blk_complete_request(struct request *req)
>  {
>  	if (unlikely(blk_should_fake_timeout(req->q)))
>  		return;
> -	if (!blk_mark_rq_complete(req))
> +	if (!blk_mark_rq_complete(req) ||
> +	    test_and_clear_bit(REQ_ATOM_QUIESCED, &req->atomic_flags))
>  		__blk_complete_request(req);
>  }
>  EXPORT_SYMBOL(blk_complete_request);
> diff --git a/block/blk-timeout.c b/block/blk-timeout.c
> index aedd128..b3a7f20 100644
> --- a/block/blk-timeout.c
> +++ b/block/blk-timeout.c
> @@ -96,6 +96,9 @@ static void blk_rq_timed_out(struct request *req)
>  		blk_add_timer(req);
>  		blk_clear_rq_complete(req);
>  		break;
> +	case BLK_EH_QUIESCED:
> +		set_bit(REQ_ATOM_QUIESCED, &req->atomic_flags);
> +		break;
>  	case BLK_EH_NOT_HANDLED:
>  		/*
>  		 * LLD handles this for now but in the future
> diff --git a/block/blk.h b/block/blk.h
> index 37b9165..f4c98f8 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -120,6 +120,7 @@ void blk_account_io_done(struct request *req);
>  enum rq_atomic_flags {
>  	REQ_ATOM_COMPLETE = 0,
>  	REQ_ATOM_STARTED,
> +	REQ_ATOM_QUIESCED,
>  };
>
>  /*
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 9a8424a..5df5fb13 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -223,6 +223,7 @@ enum blk_eh_timer_return {
>  	BLK_EH_NOT_HANDLED,
>  	BLK_EH_HANDLED,
>  	BLK_EH_RESET_TIMER,
> +	BLK_EH_QUIESCED,
>  };
>
>  typedef enum blk_eh_timer_return (rq_timed_out_fn)(struct request *);

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Nov 24, 2015 at 10:16:51AM -0500, Jeff Moyer wrote:
> Hi Christoph,
>
> Christoph Hellwig <hch@lst.de> writes:
>
> > This marks the request as one that's not actually completed yet, but
> > should be reaped next time blk_mq_complete_request comes in. This is
> > useful if the abort handler kicked off a reset that will complete all
> > pending requests.
>
> What's the purpose, though? Is this an optimization?

It allows us to correctly implement controller reset (like SCSI target
resets) from the timeout handler. The current HANDLED/NOT_HANDLED
returns are not very useful if you want to eventually kick off a reset
that will abort all requests but need to ensure the requests don't get
reused before that. Only SCSI handles that for now, and it needs its own
per-LUN command list and a lot of complex code for that - something we'd
like to avoid for NVMe and other new drivers.

> We've had "fun" problems with races between completion and timeout
> before. I can't say I'm too keen on adding more complexity to this
> code path. Have you considered what happens in your new code when this
> race occurs? I don't expect it to cause any issues in the mq case,
> since the timeout handler should run on the same CPU as the completion
> code for a given request (right?). However, for the old code path,
> they could run in parallel.
>
> blk_complete_request:
> A	if (!blk_mark_rq_complete(rq) ||
> B	    test_and_clear_bit(REQ_ATOM_QUIESCED, &req->atomic_flags)) {
> C		__blk_mq_complete_request(rq);
>
> could run alongside of:
>
> blk_rq_check_expired:
> 1	if (!blk_mark_rq_complete(rq))
> 2		blk_rq_timed_out(rq);
>
> So, if 1 comes before A, we have two cases to consider:
>
> i.  the expiration path has not yet set REQ_ATOM_QUIESCED before the
>     completion code runs, and so the completion code does nothing.

The command has timed out and sets REQ_ATOM_COMPLETE first, then the
actual completion comes in and does indeed nothing. We now set
REQ_ATOM_QUIESCED and kick off a controller reset, which will ultimately
complete all commands using blk_mq_complete_request. Now
REQ_ATOM_QUIESCED is set on the command that caused the timeout, so it
will be completed as well.

> ii. the expiration path *does* set REQ_ATOM_QUIESCED. In this
>     instance, will we get yet another completion for the request when
>     the command is ultimately retired by the adapter reset?

The command has timed out and sets REQ_ATOM_COMPLETE first, then
REQ_ATOM_QUIESCED as well. Now the actual completion comes in and does
nothing because REQ_ATOM_COMPLETE was set. We will then kick off the
controller reset, which will ultimately complete all commands using
blk_mq_complete_request. Now REQ_ATOM_QUIESCED is set on the command
that caused the timeout, so it will be completed as well.
Christoph Hellwig <hch@lst.de> writes:

> On Tue, Nov 24, 2015 at 10:16:51AM -0500, Jeff Moyer wrote:
>> Hi Christoph,
>>
>> Christoph Hellwig <hch@lst.de> writes:
>>
>> > This marks the request as one that's not actually completed yet,
>> > but should be reaped next time blk_mq_complete_request comes in.
>> > This is useful if the abort handler kicked off a reset that will
>> > complete all pending requests.
>>
>> What's the purpose, though? Is this an optimization?
>
> It allows us to correctly implement controller reset (like SCSI target
> resets) from the timeout handler. The current HANDLED/NOT_HANDLED
> returns are not very useful if you want to eventually kick off a reset
> that will abort all requests but need to ensure the requests don't get
> reused before that. Only SCSI handles that for now, and it needs its
> own per-LUN command list and a lot of complex code for that -
> something we'd like to avoid for NVMe and other new drivers.

Thanks for the explanation. One more question below.

>> We've had "fun" problems with races between completion and timeout
>> before. I can't say I'm too keen on adding more complexity to this
>> code path. Have you considered what happens in your new code when
>> this race occurs? I don't expect it to cause any issues in the mq
>> case, since the timeout handler should run on the same CPU as the
>> completion code for a given request (right?). However, for the old
>> code path, they could run in parallel.
>>
>> blk_complete_request:
>> A	if (!blk_mark_rq_complete(rq) ||
>> B	    test_and_clear_bit(REQ_ATOM_QUIESCED, &req->atomic_flags)) {
>> C		__blk_mq_complete_request(rq);
>>
>> could run alongside of:
>>
>> blk_rq_check_expired:
>> 1	if (!blk_mark_rq_complete(rq))
>> 2		blk_rq_timed_out(rq);
>>
>> So, if 1 comes before A, we have two cases to consider:
>>
>> i.  the expiration path has not yet set REQ_ATOM_QUIESCED before the
>>     completion code runs, and so the completion code does nothing.
>
> The command has timed out and sets REQ_ATOM_COMPLETE first, then the
> actual completion comes in and does indeed nothing. We now set
> REQ_ATOM_QUIESCED and kick off a controller reset, which will
> ultimately complete all commands using blk_mq_complete_request. Now
> REQ_ATOM_QUIESCED is set on the command that caused the timeout, so it
> will be completed as well.
>
>> ii. the expiration path *does* set REQ_ATOM_QUIESCED. In this
>>     instance, will we get yet another completion for the request when
>>     the command is ultimately retired by the adapter reset?
>
> The command has timed out and sets REQ_ATOM_COMPLETE first, then
> REQ_ATOM_QUIESCED as well. Now the actual completion comes in and does
> nothing because REQ_ATOM_COMPLETE was set. We will then kick off the

See B above. REQ_ATOM_COMPLETE is set, so the first half of that
statement is false, but then test_and_clear_bit(REQ_ATOM_QUIESCED...)
returns true, so we call __blk_complete_request. So the question is,
will we get a double completion for that request after the reset is
performed?

-Jeff

p.s. That should be __blk_complete_request up there in 'C'.

> controller reset, which will ultimately complete all commands using
> blk_mq_complete_request. Now REQ_ATOM_QUIESCED is set on the command
> that caused the timeout, so it will be completed as well.
On Tue, Nov 24, 2015 at 10:51:04AM -0500, Jeff Moyer wrote:
> >> blk_complete_request:
> >> A	if (!blk_mark_rq_complete(rq) ||
> >> B	    test_and_clear_bit(REQ_ATOM_QUIESCED, &req->atomic_flags)) {
> >> C		__blk_mq_complete_request(rq);
> >>
> >> could run alongside of:
> >>
> >> blk_rq_check_expired:
> >> 1	if (!blk_mark_rq_complete(rq))
> >> 2		blk_rq_timed_out(rq);
> >>
> >> So, if 1 comes before A, we have two cases to consider:
> >>
> >> i.  the expiration path has not yet set REQ_ATOM_QUIESCED before
> >>     the completion code runs, and so the completion code does
> >>     nothing.
> >
> > The command has timed out and sets REQ_ATOM_COMPLETE first, then the
> > actual completion comes in and does indeed nothing. We now set
> > REQ_ATOM_QUIESCED and kick off a controller reset, which will
> > ultimately complete all commands using blk_mq_complete_request. Now
> > REQ_ATOM_QUIESCED is set on the command that caused the timeout, so
> > it will be completed as well.
> >
> >> ii. the expiration path *does* set REQ_ATOM_QUIESCED. In this
> >>     instance, will we get yet another completion for the request
> >>     when the command is ultimately retired by the adapter reset?
> >
> > The command has timed out and sets REQ_ATOM_COMPLETE first, then
> > REQ_ATOM_QUIESCED as well. Now the actual completion comes in and
> > does nothing because REQ_ATOM_COMPLETE was set. We will then kick
> > off the
>
> See B above. REQ_ATOM_COMPLETE is set, so the first half of that
> statement is false, but then test_and_clear_bit(REQ_ATOM_QUIESCED...)
> returns true, so we call __blk_complete_request. So the question is,
> will we get a double completion for that request after the reset is
> performed?

We always set REQ_ATOM_COMPLETE before calling into ->timeout and thus
before even having a chance to set REQ_ATOM_QUIESCED. Maybe we're
talking past each other, so if it feels like I'm off track, try to
explain the race in a bit more detail.
Christoph Hellwig <hch@lst.de> writes:

> We always set REQ_ATOM_COMPLETE before calling into ->timeout and thus
> before even having a chance to set REQ_ATOM_QUIESCED. Maybe we're
> talking past each other, so if it feels like I'm off track, try to
> explain the race in a bit more detail.

Sure. CPU 0 executes the following:

blk_rq_check_expired():
	if (!blk_mark_rq_complete(rq))	// this succeeds, so we call blk_rq_timed_out
		blk_rq_timed_out(rq);
blk_rq_timed_out():
	if (q->rq_timed_out_fn)
		ret = q->rq_timed_out_fn(req);
	switch (ret) {			// QUIESCED is returned
	...
	case BLK_EH_QUIESCED:
		set_bit(REQ_ATOM_QUIESCED, &req->atomic_flags);
		break;

CPU 1 takes an interrupt for the completion of the same request:

blk_complete_request():
	if (!blk_mark_rq_complete(req) ||	// this fails, as it's already marked complete
	    test_and_clear_bit(REQ_ATOM_QUIESCED, &req->atomic_flags))	// this succeeds
		__blk_complete_request(req);	// so we complete the request

Later, after the adapter reset is finished, you mentioned that all
requests will be completed. My question is: will this result in a
double completion for this particular request?

I hope that's more clear.

-Jeff
On Tue, Nov 24, 2015 at 11:34:22AM -0500, Jeff Moyer wrote:
> CPU 0 executes the following:
>
> blk_rq_check_expired():
> 	if (!blk_mark_rq_complete(rq))	// this succeeds, so we call blk_rq_timed_out
> 		blk_rq_timed_out(rq);
> blk_rq_timed_out():
> 	if (q->rq_timed_out_fn)
> 		ret = q->rq_timed_out_fn(req);
> 	switch (ret) {			// QUIESCED is returned
> 	...
> 	case BLK_EH_QUIESCED:
> 		set_bit(REQ_ATOM_QUIESCED, &req->atomic_flags);
> 		break;
>
> CPU 1 takes an interrupt for the completion of the same request:
>
> blk_complete_request():
> 	if (!blk_mark_rq_complete(req) ||	// this fails, as it's already marked complete
> 	    test_and_clear_bit(REQ_ATOM_QUIESCED, &req->atomic_flags))	// this succeeds
> 		__blk_complete_request(req);	// so we complete the request
>
> Later, after the adapter reset is finished, you mentioned that all
> requests will be completed. My question is: will this result in a
> double completion for this particular request?

The reset action clears the queue's busy tags only after the adapter is
disabled. At that point, no interrupts can happen and h/w will never
complete more active requests.

If an interrupt completes the request while the reset is in progress,
the request tag won't be "busy" when the final clean-up occurs.
On Tue, Nov 24, 2015 at 11:34:22AM -0500, Jeff Moyer wrote:
> CPU 1 takes an interrupt for the completion of the same request:
>
> blk_complete_request():
> 	if (!blk_mark_rq_complete(req) ||	// this fails, as it's already marked complete
> 	    test_and_clear_bit(REQ_ATOM_QUIESCED, &req->atomic_flags))	// this succeeds

and clears the flag, so we'd need a race between this call to
blk_mq_complete_request and the later completion of all outstanding
commands from the reset. For NVMe we ensure this by not taking
completions once we start a reset, but we probably need to document
this better. I will ensure all this is properly documented in the next
version!
Christoph Hellwig <hch@infradead.org> writes:

> On Tue, Nov 24, 2015 at 11:34:22AM -0500, Jeff Moyer wrote:
>> CPU 1 takes an interrupt for the completion of the same request:
>>
>> blk_complete_request():
>> 	if (!blk_mark_rq_complete(req) ||	// this fails, as it's already marked complete
>> 	    test_and_clear_bit(REQ_ATOM_QUIESCED, &req->atomic_flags))	// this succeeds
>
> and clears the flag, so we'd need a race between this call to
> blk_mq_complete_request and the later completion of all outstanding
> commands from the reset. For NVMe we ensure this by not taking
> completions

But we're completing the request, so the request will actually be freed.
Any references to this request are bogus at this point, no?

> once we start a reset, but we probably need to document this better. I
> will ensure all this is properly documented in the next version!

Thanks, I think that will help a lot.

Cheers,
Jeff
On Tue, Nov 24, 2015 at 01:12:52PM -0500, Jeff Moyer wrote:
> >> blk_complete_request():
> >> 	if (!blk_mark_rq_complete(req) ||	// this fails, as it's already marked complete
> >> 	    test_and_clear_bit(REQ_ATOM_QUIESCED, &req->atomic_flags))	// this succeeds
> >
> > and clears the flag, so we'd need a race between this call to
> > blk_mq_complete_request and the later completion of all outstanding
> > commands from the reset. For NVMe we ensure this by not taking
> > completions
>
> But we're completing the request, so the request will actually be
> freed. Any references to this request are bogus at this point, no?

For blk-mq it won't be freed. For the non-blk-mq case the tag-to-request
lookup will have to handle that case, but that path isn't exercised by
nvme.
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8354601..76773dc 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -383,7 +383,8 @@ void blk_mq_complete_request(struct request *rq, int error)
 
 	if (unlikely(blk_should_fake_timeout(q)))
 		return;
-	if (!blk_mark_rq_complete(rq)) {
+	if (!blk_mark_rq_complete(rq) ||
+	    test_and_clear_bit(REQ_ATOM_QUIESCED, &rq->atomic_flags)) {
 		rq->errors = error;
 		__blk_mq_complete_request(rq);
 	}
@@ -586,6 +587,9 @@ void blk_mq_rq_timed_out(struct request *req, bool reserved)
 		break;
 	case BLK_EH_NOT_HANDLED:
 		break;
+	case BLK_EH_QUIESCED:
+		set_bit(REQ_ATOM_QUIESCED, &req->atomic_flags);
+		break;
 	default:
 		printk(KERN_ERR "block: bad eh return: %d\n", ret);
 		break;
diff --git a/block/blk-softirq.c b/block/blk-softirq.c
index 53b1737..9d47fbc 100644
--- a/block/blk-softirq.c
+++ b/block/blk-softirq.c
@@ -167,7 +167,8 @@ void blk_complete_request(struct request *req)
 {
 	if (unlikely(blk_should_fake_timeout(req->q)))
 		return;
-	if (!blk_mark_rq_complete(req))
+	if (!blk_mark_rq_complete(req) ||
+	    test_and_clear_bit(REQ_ATOM_QUIESCED, &req->atomic_flags))
 		__blk_complete_request(req);
 }
 EXPORT_SYMBOL(blk_complete_request);
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index aedd128..b3a7f20 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -96,6 +96,9 @@ static void blk_rq_timed_out(struct request *req)
 		blk_add_timer(req);
 		blk_clear_rq_complete(req);
 		break;
+	case BLK_EH_QUIESCED:
+		set_bit(REQ_ATOM_QUIESCED, &req->atomic_flags);
+		break;
 	case BLK_EH_NOT_HANDLED:
 		/*
 		 * LLD handles this for now but in the future
diff --git a/block/blk.h b/block/blk.h
index 37b9165..f4c98f8 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -120,6 +120,7 @@ void blk_account_io_done(struct request *req);
 enum rq_atomic_flags {
 	REQ_ATOM_COMPLETE = 0,
 	REQ_ATOM_STARTED,
+	REQ_ATOM_QUIESCED,
 };
 
 /*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 9a8424a..5df5fb13 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -223,6 +223,7 @@ enum blk_eh_timer_return {
 	BLK_EH_NOT_HANDLED,
 	BLK_EH_HANDLED,
 	BLK_EH_RESET_TIMER,
+	BLK_EH_QUIESCED,
 };
 
 typedef enum blk_eh_timer_return (rq_timed_out_fn)(struct request *);
This marks the request as one that's not actually completed yet, but
should be reaped the next time blk_mq_complete_request comes in. This is
useful if the abort handler kicked off a reset that will complete all
pending requests.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c         | 6 +++++-
 block/blk-softirq.c    | 3 ++-
 block/blk-timeout.c    | 3 +++
 block/blk.h            | 1 +
 include/linux/blkdev.h | 1 +
 5 files changed, 12 insertions(+), 2 deletions(-)