Message ID | f2f17f46-ff3a-01c4-bfd4-8dec836ec343@kernel.dk (mailing list archive) |
---|---|
State | New, archived |
Series | block: don't dereference request after flush insertion |
On Sat, Oct 16, 2021 at 07:35:39PM -0600, Jens Axboe wrote:
> We could have a race here, where the request gets freed before we call
> into blk_mq_run_hw_queue(). If this happens, we cannot rely on the state
> of the request.
>
> Grab the hardware context before inserting the flush.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>
> ---
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 2197cfbf081f..22b30a89bf3a 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2468,9 +2468,10 @@ void blk_mq_submit_bio(struct bio *bio)
>  	}
>
>  	if (unlikely(is_flush_fua)) {
> +		struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
>  		/* Bypass scheduler for flush requests */
>  		blk_insert_flush(rq);
> -		blk_mq_run_hw_queue(rq->mq_hctx, true);
> +		blk_mq_run_hw_queue(hctx, true);

If the request is freed before running queue, the request queue could
be released and the hctx may be freed.

Thanks,
Ming
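For readers outside blk-mq, the hazard under discussion is the general pattern of reading a field from an object after ownership has been handed off. A minimal userspace sketch of that pattern and of the pointer-caching fix in the patch above (the types and helpers below are illustrative stand-ins, not the kernel code):

```c
#include <stdio.h>
#include <stdlib.h>

struct hw_ctx { int id; };

struct request {
	struct hw_ctx *mq_hctx;
};

/* Models blk_insert_flush(): once called, the request may already have
 * been completed and freed by another context. */
static void insert_flush(struct request *rq)
{
	free(rq);
}

static void run_hw_queue(struct hw_ctx *hctx)
{
	printf("running hctx %d\n", hctx->id);
}

int main(void)
{
	struct hw_ctx ctx = { .id = 0 };
	struct request *rq = malloc(sizeof(*rq));

	if (!rq)
		return 1;
	rq->mq_hctx = &ctx;

	/* The fix: read the field while we still own rq ... */
	struct hw_ctx *hctx = rq->mq_hctx;

	insert_flush(rq);	/* rq must not be touched after this */
	run_hw_queue(hctx);	/* ... so the freed rq is never dereferenced */
	return 0;
}
```

Ming's objection is that this only protects the rq dereference: if the cached hctx's owning queue can itself be torn down in the same window, caching the pointer is not enough.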
On 10/17/21 7:49 PM, Ming Lei wrote:
> On Sat, Oct 16, 2021 at 07:35:39PM -0600, Jens Axboe wrote:
>> We could have a race here, where the request gets freed before we call
>> into blk_mq_run_hw_queue(). If this happens, we cannot rely on the state
>> of the request.
>>
>> Grab the hardware context before inserting the flush.
[...]
>>  	if (unlikely(is_flush_fua)) {
>> +		struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
>>  		/* Bypass scheduler for flush requests */
>>  		blk_insert_flush(rq);
>> -		blk_mq_run_hw_queue(rq->mq_hctx, true);
>> +		blk_mq_run_hw_queue(hctx, true);
>
> If the request is freed before running queue, the request queue could
> be released and the hctx may be freed.

No, we still hold a queue enter ref.
On Sun, Oct 17, 2021 at 07:50:24PM -0600, Jens Axboe wrote:
> On 10/17/21 7:49 PM, Ming Lei wrote:
> > On Sat, Oct 16, 2021 at 07:35:39PM -0600, Jens Axboe wrote:
> >> We could have a race here, where the request gets freed before we call
> >> into blk_mq_run_hw_queue(). If this happens, we cannot rely on the state
> >> of the request.
[...]
> >>  	if (unlikely(is_flush_fua)) {
> >> +		struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
> >>  		/* Bypass scheduler for flush requests */
> >>  		blk_insert_flush(rq);
> >> -		blk_mq_run_hw_queue(rq->mq_hctx, true);
> >> +		blk_mq_run_hw_queue(hctx, true);
> >
> > If the request is freed before running queue, the request queue could
> > be released and the hctx may be freed.
>
> No, we still hold a queue enter ref.

But that one is released when rq is freed since ac7c5675fa45 ("blk-mq:
allow blk_mq_make_request to consume the q_usage_counter reference"),
isn't it?

Thanks,
Ming
On 10/17/21 8:02 PM, Ming Lei wrote:
> On Sun, Oct 17, 2021 at 07:50:24PM -0600, Jens Axboe wrote:
>> On 10/17/21 7:49 PM, Ming Lei wrote:
>>> On Sat, Oct 16, 2021 at 07:35:39PM -0600, Jens Axboe wrote:
>>>> We could have a race here, where the request gets freed before we call
>>>> into blk_mq_run_hw_queue(). If this happens, we cannot rely on the state
>>>> of the request.
[...]
>>> If the request is freed before running queue, the request queue could
>>> be released and the hctx may be freed.
>>
>> No, we still hold a queue enter ref.
>
> But that one is released when rq is freed since ac7c5675fa45 ("blk-mq:
> allow blk_mq_make_request to consume the q_usage_counter reference"),
> isn't it?

Yes I think you're right, we need to grab an extra ref in the flush case
as we're using it after it may potentially have completed. We could
probably make it smarter, but there's little point for just handling a
flush.

commit ea0f672e7cc66e7ec12468ff907de6064656b6e7
Author: Jens Axboe <axboe@kernel.dk>
Date:   Sat Oct 16 07:34:49 2021 -0600

    block: grab extra reference for flush insertion

    We could have a race here, where the request gets freed before we call
    into blk_mq_run_hw_queue(). If this happens, we cannot rely on the
    state of the request, nor can we rely on the queue still being alive.

    Grab an extra queue reference before inserting the flush and then
    running the queue, to ensure that it is still valid.

    Signed-off-by: Jens Axboe <axboe@kernel.dk>

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 87dc2debedfb..d28423ccfe2b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2284,9 +2284,18 @@ blk_qc_t blk_mq_submit_bio(struct bio *bio)
 	}

 	if (unlikely(is_flush_fua)) {
+		struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
+
+		/*
+		 * Our queue ref may disappear as soon as the flush is
+		 * inserted, grab an extra one.
+		 */
+		percpu_ref_tryget_live(&q->q_usage_counter);
+
 		/* Bypass scheduler for flush requests */
 		blk_insert_flush(rq);
-		blk_mq_run_hw_queue(rq->mq_hctx, true);
+		blk_mq_run_hw_queue(hctx, true);
+		blk_queue_exit(q);
 	} else if (plug && (q->nr_hw_queues == 1 ||
 		   blk_mq_is_shared_tags(rq->mq_hctx->flags) ||
 		   q->mq_ops->commit_rqs || !blk_queue_nonrot(q))) {
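The shape of this v2 fix is a reference bracket: take one extra reference on the queue before the handoff that may consume the submitter's original one, and drop it only after the last use. A minimal userspace sketch, with a plain atomic counter standing in for q->q_usage_counter (the kernel uses a percpu_ref; all names below are illustrative, not the kernel API):

```c
#include <stdatomic.h>
#include <stdio.h>

/* Stand-in for q->q_usage_counter, starting with the submitter's
 * queue-enter reference. */
static atomic_int q_usage_counter = 1;

/* Models blk_queue_exit(): drop one reference; the last one releases
 * the queue. */
static void queue_exit(void)
{
	if (atomic_fetch_sub(&q_usage_counter, 1) == 1)
		printf("queue released\n");
}

/* Models the ac7c5675fa45 behaviour Ming points out: freeing the
 * request consumes the submitter's original enter reference. */
static void insert_flush_and_free_rq(void)
{
	queue_exit();
}

int main(void)
{
	/* Models percpu_ref_tryget_live(): pin the queue up front. */
	atomic_fetch_add(&q_usage_counter, 1);

	insert_flush_and_free_rq();	/* original ref may be gone now */

	/* The queue is still pinned here, so running it is safe ... */
	printf("refs still held: %d\n", atomic_load(&q_usage_counter));

	queue_exit();			/* ... then drop the extra ref */
	return 0;
}
```

Without the extra reference up front, the counter could hit zero inside insert_flush_and_free_rq(), and the subsequent queue run would touch released state — the UAF Ming describes.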
On Mon, Oct 18, 2021 at 10:02:32AM +0800, Ming Lei wrote:
> On Sun, Oct 17, 2021 at 07:50:24PM -0600, Jens Axboe wrote:
> > On 10/17/21 7:49 PM, Ming Lei wrote:
> > > On Sat, Oct 16, 2021 at 07:35:39PM -0600, Jens Axboe wrote:
> > >> We could have a race here, where the request gets freed before we call
> > >> into blk_mq_run_hw_queue(). If this happens, we cannot rely on the state
> > >> of the request.
[...]
> > > If the request is freed before running queue, the request queue could
> > > be released and the hctx may be freed.
> >
> > No, we still hold a queue enter ref.
>
> But that one is released when rq is freed since ac7c5675fa45 ("blk-mq:
> allow blk_mq_make_request to consume the q_usage_counter reference"),
> isn't it?

With commit ac7c5675fa45, any reference to hctx after queueing the
request could lead to a UAF in the code path of blk_mq_submit_bio().
Maybe we need to grab two refs on queue enter, and release one after
the bio is submitted.

Thanks,
Ming
On 10/17/21 8:11 PM, Ming Lei wrote:
> On Mon, Oct 18, 2021 at 10:02:32AM +0800, Ming Lei wrote:
[...]
> With commit ac7c5675fa45, any reference to hctx after queueing the
> request could lead to a UAF in the code path of blk_mq_submit_bio().
> Maybe we need to grab two refs on queue enter, and release one after
> the bio is submitted.

I'd rather audit and see if there are any, because an extra get+put
isn't exactly free.
On Sun, Oct 17, 2021 at 08:16:25PM -0600, Jens Axboe wrote:
> On 10/17/21 8:11 PM, Ming Lei wrote:
> > On Mon, Oct 18, 2021 at 10:02:32AM +0800, Ming Lei wrote:
[...]
> > With commit ac7c5675fa45, any reference to hctx after queueing the
> > request could lead to a UAF in the code path of blk_mq_submit_bio().
> > Maybe we need to grab two refs on queue enter, and release one after
> > the bio is submitted.
>
> I'd rather audit and see if there are any, because an extra get+put
> isn't exactly free.

Only direct issue doesn't need that; it looks like the other paths do
need to grab one extra ref if the request has to be queued somewhere
before dispatch.

Thanks,
Ming
On Sat, Oct 16, 2021 at 07:35:39PM -0600, Jens Axboe wrote:
> We could have a race here, where the request gets freed before we call
> into blk_mq_run_hw_queue(). If this happens, we cannot rely on the state
> of the request.
>
> Grab the hardware context before inserting the flush.
[...]
>  	if (unlikely(is_flush_fua)) {
> +		struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
>  		/* Bypass scheduler for flush requests */
>  		blk_insert_flush(rq);
> -		blk_mq_run_hw_queue(rq->mq_hctx, true);
> +		blk_mq_run_hw_queue(hctx, true);
>  	} else if (plug && (q->nr_hw_queues == 1 ||
>  		   blk_mq_is_shared_tags(rq->mq_hctx->flags) ||
>  		   q->mq_ops->commit_rqs || !blk_queue_nonrot(q))) {

From the report in [1], no device close & queue release is involved,
and request freeing could be much easier to trigger than queue release,
so this looks fine:

Reviewed-by: Ming Lei <ming.lei@redhat.com>

Fixes: f328476e373a ("blk-mq: cleanup blk_mq_submit_bio")

[1] https://lore.kernel.org/linux-block/23531d29-9d96-6744-bab9-797e65379037@kernel.dk/T/#t

thanks,
Ming
On 10/17/21 8:42 PM, Ming Lei wrote:
> On Sat, Oct 16, 2021 at 07:35:39PM -0600, Jens Axboe wrote:
[...]
> From the report in [1], no device close & queue release is involved,
> and request freeing could be much easier to trigger than queue release,
> so this looks fine:
>
> Reviewed-by: Ming Lei <ming.lei@redhat.com>
>
> Fixes: f328476e373a ("blk-mq: cleanup blk_mq_submit_bio")

The Fixes tag wasn't in the one I sent out, but I do have it as well.
Thanks, I'll add your reviewed-by.
I think this can be done much more simply. The only place in
blk_insert_flush that actually needs to run the queue is the case where
no flushes are needed, as all the others are handled via the flush state
machine and the requeue list. So something like this should work:

diff --git a/block/blk-flush.c b/block/blk-flush.c
index 4201728bf3a5a..1fce6d16e6d3a 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -421,7 +421,7 @@ void blk_insert_flush(struct request *rq)
 	 */
 	if ((policy & REQ_FSEQ_DATA) &&
 	    !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
-		blk_mq_request_bypass_insert(rq, false, false);
+		blk_mq_request_bypass_insert(rq, false, true);
 		return;
 	}

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f296edff47246..89a142b61f456 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2314,7 +2314,6 @@ void blk_mq_submit_bio(struct bio *bio)
 	if (unlikely(is_flush_fua)) {
 		/* Bypass scheduler for flush requests */
 		blk_insert_flush(rq);
-		blk_mq_run_hw_queue(rq->mq_hctx, true);
 	} else if (plug && (q->nr_hw_queues == 1 ||
 		   blk_mq_is_shared_tags(rq->mq_hctx->flags) ||
 		   q->mq_ops->commit_rqs || !blk_queue_nonrot(q))) {
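The appeal of this variant is ownership rather than refcounting: with run_queue set to true, the insert path kicks the queue itself while it still holds a valid request pointer, so the submitter never touches rq or its hctx after the handoff. A small userspace sketch of that shape (hypothetical names and a deliberately pessimistic free() to model early completion, not the kernel implementation):

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct hw_ctx { int id; };
struct request { struct hw_ctx *mq_hctx; };

static void run_hw_queue(struct hw_ctx *hctx)
{
	printf("running hctx %d\n", hctx->id);
}

/* Models the blk_mq_request_bypass_insert(rq, at_head, run_queue)
 * shape: the helper caches what it needs and kicks the queue itself,
 * while the request is still valid. */
static void bypass_insert(struct request *rq, bool at_head, bool run_queue)
{
	struct hw_ctx *hctx = rq->mq_hctx;	/* rq still owned here */

	(void)at_head;
	free(rq);		/* models handoff and early completion */
	if (run_queue)
		run_hw_queue(hctx);
}

int main(void)
{
	struct hw_ctx ctx = { .id = 0 };
	struct request *rq = malloc(sizeof(*rq));

	if (!rq)
		return 1;
	rq->mq_hctx = &ctx;

	bypass_insert(rq, false, true);	/* caller never touches rq again */
	return 0;
}
```

Because the submitter's last use of rq is the insertion call itself, both hazards raised earlier in the thread (the freed request and the post-insert queue run) disappear without any extra get+put.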
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2197cfbf081f..22b30a89bf3a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2468,9 +2468,10 @@ void blk_mq_submit_bio(struct bio *bio)
 	}

 	if (unlikely(is_flush_fua)) {
+		struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
 		/* Bypass scheduler for flush requests */
 		blk_insert_flush(rq);
-		blk_mq_run_hw_queue(rq->mq_hctx, true);
+		blk_mq_run_hw_queue(hctx, true);
 	} else if (plug && (q->nr_hw_queues == 1 ||
 		   blk_mq_is_shared_tags(rq->mq_hctx->flags) ||
 		   q->mq_ops->commit_rqs || !blk_queue_nonrot(q))) {
We could have a race here, where the request gets freed before we call
into blk_mq_run_hw_queue(). If this happens, we cannot rely on the state
of the request.

Grab the hardware context before inserting the flush.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

---