Message ID | 65187802-413a-7425-234e-68cf0b281a91@kernel.dk (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | block: pop cached rq before potentially blocking rq_qos_throttle() | expand |
On Tue, 2022-06-21 at 10:07 -0600, Jens Axboe wrote: > If rq_qos_throttle() ends up blocking, then we will have invalidated > and > flushed our current plug. Since blk_mq_get_cached_request() hasn't > popped the cached request off the plug list just yet, we end holding > a > pointer to a request that is no longer valid. This insta-crashes with > rq->mq_hctx being NULL in the validity checks just after. > > Pop the request off the cached list before doing rq_qos_throttle() to > avoid using a potentially stale request. > > Fixes: 0a5aa8d161d1 ("block: fix blk_mq_attempt_bio_merge and > rq_qos_throttle protection") > Reported-by: Dylan Yudaken <dylany@fb.com> > Signed-off-by: Jens Axboe <axboe@kernel.dk> > > --- > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 33145ba52c96..93d9d60980fb 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -2765,15 +2765,20 @@ static inline struct request > *blk_mq_get_cached_request(struct request_queue *q, > return NULL; > } > > - rq_qos_throttle(q, *bio); > - > if (blk_mq_get_hctx_type((*bio)->bi_opf) != rq->mq_hctx- > >type) > return NULL; > if (op_is_flush(rq->cmd_flags) != op_is_flush((*bio)- > >bi_opf)) > return NULL; > > - rq->cmd_flags = (*bio)->bi_opf; > + /* > + * If any qos ->throttle() end up blocking, we will have > flushed the > + * plug and hence killed the cached_rq list as well. Pop this > entry > + * before we throttle. > + */ > plug->cached_rq = rq_list_next(rq); > + rq_qos_throttle(q, *bio); > + > + rq->cmd_flags = (*bio)->bi_opf; > INIT_LIST_HEAD(&rq->queuelist); > return rq; > } This fixes the problems I was seeing Tested-by: Dylan Yudaken <dylany@fb.com>
On Jun 21, 2022 / 10:07, Jens Axboe wrote: > If rq_qos_throttle() ends up blocking, then we will have invalidated and > flushed our current plug. Since blk_mq_get_cached_request() hasn't > popped the cached request off the plug list just yet, we end holding a > pointer to a request that is no longer valid. This insta-crashes with > rq->mq_hctx being NULL in the validity checks just after. > > Pop the request off the cached list before doing rq_qos_throttle() to > avoid using a potentially stale request. > > Fixes: 0a5aa8d161d1 ("block: fix blk_mq_attempt_bio_merge and rq_qos_throttle protection") > Reported-by: Dylan Yudaken <dylany@fb.com> > Signed-off-by: Jens Axboe <axboe@kernel.dk> Thank you for the fix and sorry for the trouble. The patch passed my test set: Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> BTW, may I know the workload or script which triggers the failure? I'm interested in if we can add a blktests test case to exercises qos throttle and prevent similar bugs in the future.
On 6/23/22 2:56 AM, Shinichiro Kawasaki wrote: > On Jun 21, 2022 / 10:07, Jens Axboe wrote: >> If rq_qos_throttle() ends up blocking, then we will have invalidated and >> flushed our current plug. Since blk_mq_get_cached_request() hasn't >> popped the cached request off the plug list just yet, we end holding a >> pointer to a request that is no longer valid. This insta-crashes with >> rq->mq_hctx being NULL in the validity checks just after. >> >> Pop the request off the cached list before doing rq_qos_throttle() to >> avoid using a potentially stale request. >> >> Fixes: 0a5aa8d161d1 ("block: fix blk_mq_attempt_bio_merge and rq_qos_throttle protection") >> Reported-by: Dylan Yudaken <dylany@fb.com> >> Signed-off-by: Jens Axboe <axboe@kernel.dk> > > Thank you for the fix and sorry for the trouble. The patch passed my test set: > > Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> > > > BTW, may I know the workload or script which triggers the failure? I'm > interested in if we can add a blktests test case to exercises qos throttle > and prevent similar bugs in the future. Not sure if there are others, but specifically blk-iocost will do an explicit schedule() for some conditions. See the bottom of ioc_rqos_throttle().
On Jun 23, 2022 / 06:05, Jens Axboe wrote: > On 6/23/22 2:56 AM, Shinichiro Kawasaki wrote: > > On Jun 21, 2022 / 10:07, Jens Axboe wrote: > >> If rq_qos_throttle() ends up blocking, then we will have invalidated and > >> flushed our current plug. Since blk_mq_get_cached_request() hasn't > >> popped the cached request off the plug list just yet, we end holding a > >> pointer to a request that is no longer valid. This insta-crashes with > >> rq->mq_hctx being NULL in the validity checks just after. > >> > >> Pop the request off the cached list before doing rq_qos_throttle() to > >> avoid using a potentially stale request. > >> > >> Fixes: 0a5aa8d161d1 ("block: fix blk_mq_attempt_bio_merge and rq_qos_throttle protection") > >> Reported-by: Dylan Yudaken <dylany@fb.com> > >> Signed-off-by: Jens Axboe <axboe@kernel.dk> > > > > Thank you for the fix and sorry for the trouble. The patch passed my test set: > > > > Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> > > > > > > BTW, may I know the workload or script which triggers the failure? I'm > > interested in if we can add a blktests test case to exercises qos throttle > > and prevent similar bugs in the future. > > Not sure if there are others, but specifically blk-iocost will do an > explicit schedule() for some conditions. See the bottom of > ioc_rqos_throttle(). Thanks. Will try to create a script which triggers the schedule() in ioc_rqos_throttle().
diff --git a/block/blk-mq.c b/block/blk-mq.c index 33145ba52c96..93d9d60980fb 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2765,15 +2765,20 @@ static inline struct request *blk_mq_get_cached_request(struct request_queue *q, return NULL; } - rq_qos_throttle(q, *bio); - if (blk_mq_get_hctx_type((*bio)->bi_opf) != rq->mq_hctx->type) return NULL; if (op_is_flush(rq->cmd_flags) != op_is_flush((*bio)->bi_opf)) return NULL; - rq->cmd_flags = (*bio)->bi_opf; + /* + * If any qos ->throttle() end up blocking, we will have flushed the + * plug and hence killed the cached_rq list as well. Pop this entry + * before we throttle. + */ plug->cached_rq = rq_list_next(rq); + rq_qos_throttle(q, *bio); + + rq->cmd_flags = (*bio)->bi_opf; INIT_LIST_HEAD(&rq->queuelist); return rq; }
If rq_qos_throttle() ends up blocking, then we will have invalidated and flushed our current plug. Since blk_mq_get_cached_request() hasn't popped the cached request off the plug list just yet, we end holding a pointer to a request that is no longer valid. This insta-crashes with rq->mq_hctx being NULL in the validity checks just after. Pop the request off the cached list before doing rq_qos_throttle() to avoid using a potentially stale request. Fixes: 0a5aa8d161d1 ("block: fix blk_mq_attempt_bio_merge and rq_qos_throttle protection") Reported-by: Dylan Yudaken <dylany@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> ---