diff mbox series

block: pop cached rq before potentially blocking rq_qos_throttle()

Message ID 65187802-413a-7425-234e-68cf0b281a91@kernel.dk (mailing list archive)
State New, archived
Headers show
Series block: pop cached rq before potentially blocking rq_qos_throttle() | expand

Commit Message

Jens Axboe June 21, 2022, 4:07 p.m. UTC
If rq_qos_throttle() ends up blocking, then we will have invalidated and
flushed our current plug. Since blk_mq_get_cached_request() hasn't
popped the cached request off the plug list just yet, we end holding a
pointer to a request that is no longer valid. This insta-crashes with
rq->mq_hctx being NULL in the validity checks just after.

Pop the request off the cached list before doing rq_qos_throttle() to
avoid using a potentially stale request.

Fixes: 0a5aa8d161d1 ("block: fix blk_mq_attempt_bio_merge and rq_qos_throttle protection")
Reported-by: Dylan Yudaken <dylany@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

---

Comments

Dylan Yudaken June 21, 2022, 4:58 p.m. UTC | #1
On Tue, 2022-06-21 at 10:07 -0600, Jens Axboe wrote:
> If rq_qos_throttle() ends up blocking, then we will have invalidated
> and
> flushed our current plug. Since blk_mq_get_cached_request() hasn't
> popped the cached request off the plug list just yet, we end holding
> a
> pointer to a request that is no longer valid. This insta-crashes with
> rq->mq_hctx being NULL in the validity checks just after.
> 
> Pop the request off the cached list before doing rq_qos_throttle() to
> avoid using a potentially stale request.
> 
> Fixes: 0a5aa8d161d1 ("block: fix blk_mq_attempt_bio_merge and
> rq_qos_throttle protection")
> Reported-by: Dylan Yudaken <dylany@fb.com>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> 
> ---
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 33145ba52c96..93d9d60980fb 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2765,15 +2765,20 @@ static inline struct request
> *blk_mq_get_cached_request(struct request_queue *q,
>                 return NULL;
>         }
>  
> -       rq_qos_throttle(q, *bio);
> -
>         if (blk_mq_get_hctx_type((*bio)->bi_opf) != rq->mq_hctx-
> >type)
>                 return NULL;
>         if (op_is_flush(rq->cmd_flags) != op_is_flush((*bio)-
> >bi_opf))
>                 return NULL;
>  
> -       rq->cmd_flags = (*bio)->bi_opf;
> +       /*
> +        * If any qos ->throttle() end up blocking, we will have
> flushed the
> +        * plug and hence killed the cached_rq list as well. Pop this
> entry
> +        * before we throttle.
> +        */
>         plug->cached_rq = rq_list_next(rq);
> +       rq_qos_throttle(q, *bio);
> +
> +       rq->cmd_flags = (*bio)->bi_opf;
>         INIT_LIST_HEAD(&rq->queuelist);
>         return rq;
>  }

This fixes the problems I was seeing

Tested-by: Dylan Yudaken <dylany@fb.com>
Shin'ichiro Kawasaki June 23, 2022, 8:56 a.m. UTC | #2
On Jun 21, 2022 / 10:07, Jens Axboe wrote:
> If rq_qos_throttle() ends up blocking, then we will have invalidated and
> flushed our current plug. Since blk_mq_get_cached_request() hasn't
> popped the cached request off the plug list just yet, we end holding a
> pointer to a request that is no longer valid. This insta-crashes with
> rq->mq_hctx being NULL in the validity checks just after.
> 
> Pop the request off the cached list before doing rq_qos_throttle() to
> avoid using a potentially stale request.
> 
> Fixes: 0a5aa8d161d1 ("block: fix blk_mq_attempt_bio_merge and rq_qos_throttle protection")
> Reported-by: Dylan Yudaken <dylany@fb.com>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>

Thank you for the fix and sorry for the trouble. The patch passed my test set:

Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>


BTW, may I know the workload or script which triggers the failure? I'm
interested in if we can add a blktests test case to exercises qos throttle
and prevent similar bugs in the future.
Jens Axboe June 23, 2022, 12:05 p.m. UTC | #3
On 6/23/22 2:56 AM, Shinichiro Kawasaki wrote:
> On Jun 21, 2022 / 10:07, Jens Axboe wrote:
>> If rq_qos_throttle() ends up blocking, then we will have invalidated and
>> flushed our current plug. Since blk_mq_get_cached_request() hasn't
>> popped the cached request off the plug list just yet, we end holding a
>> pointer to a request that is no longer valid. This insta-crashes with
>> rq->mq_hctx being NULL in the validity checks just after.
>>
>> Pop the request off the cached list before doing rq_qos_throttle() to
>> avoid using a potentially stale request.
>>
>> Fixes: 0a5aa8d161d1 ("block: fix blk_mq_attempt_bio_merge and rq_qos_throttle protection")
>> Reported-by: Dylan Yudaken <dylany@fb.com>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> 
> Thank you for the fix and sorry for the trouble. The patch passed my test set:
> 
> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> 
> 
> BTW, may I know the workload or script which triggers the failure? I'm
> interested in if we can add a blktests test case to exercises qos throttle
> and prevent similar bugs in the future.

Not sure if there are others, but specifically blk-iocost will do an
explicit schedule() for some conditions. See the bottom of
ioc_rqos_throttle().
Shin'ichiro Kawasaki June 23, 2022, 12:33 p.m. UTC | #4
On Jun 23, 2022 / 06:05, Jens Axboe wrote:
> On 6/23/22 2:56 AM, Shinichiro Kawasaki wrote:
> > On Jun 21, 2022 / 10:07, Jens Axboe wrote:
> >> If rq_qos_throttle() ends up blocking, then we will have invalidated and
> >> flushed our current plug. Since blk_mq_get_cached_request() hasn't
> >> popped the cached request off the plug list just yet, we end holding a
> >> pointer to a request that is no longer valid. This insta-crashes with
> >> rq->mq_hctx being NULL in the validity checks just after.
> >>
> >> Pop the request off the cached list before doing rq_qos_throttle() to
> >> avoid using a potentially stale request.
> >>
> >> Fixes: 0a5aa8d161d1 ("block: fix blk_mq_attempt_bio_merge and rq_qos_throttle protection")
> >> Reported-by: Dylan Yudaken <dylany@fb.com>
> >> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> > 
> > Thank you for the fix and sorry for the trouble. The patch passed my test set:
> > 
> > Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> > 
> > 
> > BTW, may I know the workload or script which triggers the failure? I'm
> > interested in if we can add a blktests test case to exercises qos throttle
> > and prevent similar bugs in the future.
> 
> Not sure if there are others, but specifically blk-iocost will do an
> explicit schedule() for some conditions. See the bottom of
> ioc_rqos_throttle().

Thanks. Will try to create a script which triggers the schedule() in
ioc_rqos_throttle().
diff mbox series

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 33145ba52c96..93d9d60980fb 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2765,15 +2765,20 @@  static inline struct request *blk_mq_get_cached_request(struct request_queue *q,
 		return NULL;
 	}
 
-	rq_qos_throttle(q, *bio);
-
 	if (blk_mq_get_hctx_type((*bio)->bi_opf) != rq->mq_hctx->type)
 		return NULL;
 	if (op_is_flush(rq->cmd_flags) != op_is_flush((*bio)->bi_opf))
 		return NULL;
 
-	rq->cmd_flags = (*bio)->bi_opf;
+	/*
+	 * If any qos ->throttle() end up blocking, we will have flushed the
+	 * plug and hence killed the cached_rq list as well. Pop this entry
+	 * before we throttle.
+	 */
 	plug->cached_rq = rq_list_next(rq);
+	rq_qos_throttle(q, *bio);
+
+	rq->cmd_flags = (*bio)->bi_opf;
 	INIT_LIST_HEAD(&rq->queuelist);
 	return rq;
 }