[RFC] blk_mq: clear rq mapping in driver tags before freeing rqs in sched tags

Message ID 20210817022306.1622027-1-yukuai3@huawei.com (mailing list archive)
State New, archived

Commit Message

Yu Kuai Aug. 17, 2021, 2:23 a.m. UTC
If the I/O scheduler is not none, hctx->tags->rq[tag] will point to
hctx->sched_tags->static_rq[internal_tag] in blk_mq_get_driver_tag().
However, the static_rq of sched_tags might be freed through switching
the elevator or increasing nr_requests. This leaves a window for some
drivers to get the freed request through blk_mq_tag_to_rq(tags, tag).

It's difficult to fix this UAF from the driver side, so I'm thinking about
the following solutions:

a. clear the rq mapping in driver tags before freeing rqs in sched tags
b. provide a new interface to replace blk_mq_tag_to_rq(); the new
interface will make sure it never returns a freed rq (sketched below).
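
A minimal sketch of what such an interface might look like (illustrative
only: blk_mq_tag_to_rq_safe() is a made-up name, and this assumes
tags->lock serializes lookups against the code that clears tags->rq[]
before freeing requests, with rq->ref pinning the request):

	/*
	 * Illustrative sketch, not a real kernel interface: return the
	 * request with an extra reference held, so a concurrently freed
	 * static_rq is never handed back to the driver.
	 */
	static struct request *blk_mq_tag_to_rq_safe(struct blk_mq_tags *tags,
						     unsigned int tag)
	{
		struct request *rq = NULL;
		unsigned long flags;

		if (tag >= tags->nr_tags)
			return NULL;

		spin_lock_irqsave(&tags->lock, flags);
		rq = tags->rqs[tag];
		/* only hand out a request whose reference is still live */
		if (rq && !refcount_inc_not_zero(&rq->ref))
			rq = NULL;
		spin_unlock_irqrestore(&tags->lock, flags);

		return rq;
	}

The caller would then be responsible for dropping the extra reference
once it is done with the request.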

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-sched.c | 10 +++++++++-
 block/blk-mq.c       | 13 +++++++++++--
 block/blk-mq.h       |  2 ++
 3 files changed, 22 insertions(+), 3 deletions(-)

Comments

Ming Lei Aug. 18, 2021, 12:52 a.m. UTC | #1
On Tue, Aug 17, 2021 at 10:23:06AM +0800, Yu Kuai wrote:
> If the I/O scheduler is not none, hctx->tags->rq[tag] will point to
> hctx->sched_tags->static_rq[internal_tag] in blk_mq_get_driver_tag().
> However, the static_rq of sched_tags might be freed through switching
> the elevator or increasing nr_requests. This leaves a window for some
> drivers to get the freed request through blk_mq_tag_to_rq(tags, tag).

I believe I have explained that this is a bug of the driver, which has
knowledge of whether the passed tag is valid or not. We are clear that
the driver needs to cover the race between normal completion and
timeout/error handling.

> 
> It's difficult to fix this UAF from the driver side, so I'm thinking about

So far I have not seen any analysis of why the UAF is triggered; care to
investigate the reason?

> the following solutions:
> 
> a. clear the rq mapping in driver tags before freeing rqs in sched tags

We have done that already, see blk_mq_free_rqs().

> b. provide a new interface to replace blk_mq_tag_to_rq(); the new
> interface will make sure it never returns a freed rq.

Option b) in your previous patch can't avoid the UAF:

https://lore.kernel.org/linux-block/YRHa%2FkeJ4pHP3hnL@T590/

> 
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  block/blk-mq-sched.c | 10 +++++++++-
>  block/blk-mq.c       | 13 +++++++++++--
>  block/blk-mq.h       |  2 ++
>  3 files changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 0f006cabfd91..9f11f17b8380 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -662,8 +662,16 @@ void blk_mq_sched_free_requests(struct request_queue *q)
>  	int i;
>  
>  	queue_for_each_hw_ctx(q, hctx, i) {
> -		if (hctx->sched_tags)
> +		if (hctx->sched_tags) {
> +			/*
> +			 * We are about to free requests in 'sched_tags[]',
> +			 * however, 'tags[]' may still point to these requests.
> +			 * Thus we need to clear rq mapping in 'tags[]' before
> +			 * freeing requests in sched_tags[].
> +			 */
> +			blk_mq_clear_rq_mapping(q->tag_set, hctx->tags, i);
>  			blk_mq_free_rqs(q->tag_set, hctx->sched_tags, i);

blk_mq_clear_rq_mapping() is already called in blk_mq_free_rqs()
to clear the request reference.

In theory, we only need to clear it in the case of a real I/O scheduler,
but so far we do it for both an I/O scheduler and none.

> +		}
>  	}
>  }
>  
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index d185be64c85f..b1e30464f87f 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2314,8 +2314,8 @@ static size_t order_to_size(unsigned int order)
>  }
>  
>  /* called before freeing request pool in @tags */
> -static void blk_mq_clear_rq_mapping(struct blk_mq_tag_set *set,
> -		struct blk_mq_tags *tags, unsigned int hctx_idx)
> +void blk_mq_clear_rq_mapping(struct blk_mq_tag_set *set,
> +			     struct blk_mq_tags *tags, unsigned int hctx_idx)
>  {
>  	struct blk_mq_tags *drv_tags = set->tags[hctx_idx];
>  	struct page *page;
> @@ -3632,6 +3632,15 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
>  			if (!ret && blk_mq_is_sbitmap_shared(set->flags))
>  				blk_mq_tag_resize_shared_sbitmap(set, nr);
>  		} else {
> +			/*
> +			 * We are about to free requests in 'sched_tags[]',
> +			 * however, 'tags[]' may still point to these requests.
> +			 * Thus we need to clear rq mapping in 'tags[]' before
> +			 * freeing requests in sched_tags[].
> +			 */
> +			if (nr > hctx->sched_tags->nr_tags)
> +				blk_mq_clear_rq_mapping(set, hctx->tags, i);
> +
>  			ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
>  							nr, true);

The request reference has been cleared too in blk_mq_tag_update_depth():

	blk_mq_tag_update_depth
		blk_mq_free_rqs
			blk_mq_clear_rq_mapping


Thanks, 
Ming
Yu Kuai Aug. 18, 2021, 2:02 a.m. UTC | #2
On 2021/08/18 8:52, Ming Lei wrote:
> On Tue, Aug 17, 2021 at 10:23:06AM +0800, Yu Kuai wrote:
>> If the I/O scheduler is not none, hctx->tags->rq[tag] will point to
>> hctx->sched_tags->static_rq[internal_tag] in blk_mq_get_driver_tag().
>> However, the static_rq of sched_tags might be freed through switching
>> the elevator or increasing nr_requests. This leaves a window for some
>> drivers to get the freed request through blk_mq_tag_to_rq(tags, tag).
> 
> I believe I have explained that this is a bug of the driver, which has
> knowledge of whether the passed tag is valid or not. We are clear that
> the driver needs to cover the race between normal completion and
> timeout/error handling.
> 
>>
>> It's difficult to fix this UAF from the driver side, so I'm thinking about
> 
> So far I have not seen any analysis of why the UAF is triggered; care to
> investigate the reason?

Hi, Ming

I'm sorry if I didn't explain the UAF clearly.

1) At first, a normal I/O is submitted and completed with a scheduler:

internal_tag = blk_mq_get_tag -> get tag from sched_tags
  blk_mq_rq_ctx_init
   sched_tags->rq[internal_tag] = sched_tag->static_rq[internal_tag]
...
blk_mq_get_driver_tag
  __blk_mq_get_driver_tag -> get tag from tags
  tags->rq[tag] = sched_tag->static_rq[internal_tag]

So, both tags->rq[tag] and sched_tags->rq[internal_tag] are pointing
to the request: sched_tags->static_rq[internal_tag].

2) Then, if the sched_tags->static_rq is freed:

blk_mq_sched_free_requests
  blk_mq_free_rqs(q->tag_set, hctx->sched_tags, i)
   blk_mq_clear_rq_mapping(set, tags, hctx_idx);
    -> sched_tags->rq[internal_tag] is set to NULL here

After switching the elevator, tags->rq[tag] still points to the request
that was just freed.

3) The nbd server sends a reply with a random tag directly:

recv_work
  nbd_read_stat
   blk_mq_tag_to_rq(tags, tag)
    rq = tags->rq[tag] -> rq is freed

Usually, nbd will get a tag and send a request to the server first, and
then handle the reply. However, if that request is skipped, such a UAF
problem can be triggered.
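
For reference, blk_mq_tag_to_rq() is roughly the following: it returns
whatever tags->rq[tag] currently holds, with no check that the request
is still alive, which is why the stale pointer escapes here:

	struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
	{
		if (tag < tags->nr_tags) {
			prefetch(tags->rqs[tag]);
			return tags->rqs[tag];
		}

		return NULL;
	}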

> 
> The request reference has been cleared too in blk_mq_tag_update_depth():
> 
> 	blk_mq_tag_update_depth
> 		blk_mq_free_rqs
> 			blk_mq_clear_rq_mapping
> 

What I'm trying to do is to clear the rq mapping in both
hctx->sched_tags->rq and hctx->tags->rq when sched_tags->static_rq
is freed. However, I forgot about the case where tags are shared among
multiple devices. Thus what this patch does is clearly wrong...

So, what do you think about adding a new interface to iterate the
requests in tags->rq[], find the ones pointing into
sched_tags->static_rq[], and use cmpxchg() to clear them?

Thanks
Kuai
Ming Lei Aug. 18, 2021, 2:45 a.m. UTC | #3
On Wed, Aug 18, 2021 at 10:02:09AM +0800, yukuai (C) wrote:
> On 2021/08/18 8:52, Ming Lei wrote:
> > On Tue, Aug 17, 2021 at 10:23:06AM +0800, Yu Kuai wrote:
> > > If the I/O scheduler is not none, hctx->tags->rq[tag] will point to
> > > hctx->sched_tags->static_rq[internal_tag] in blk_mq_get_driver_tag().
> > > However, the static_rq of sched_tags might be freed through switching
> > > the elevator or increasing nr_requests. This leaves a window for some
> > > drivers to get the freed request through blk_mq_tag_to_rq(tags, tag).
> > 
> > I believe I have explained that this is a bug of the driver, which has
> > knowledge of whether the passed tag is valid or not. We are clear that
> > the driver needs to cover the race between normal completion and
> > timeout/error handling.
> > 
> > > 
> > > It's difficult to fix this UAF from the driver side, so I'm thinking about
> > 
> > So far I have not seen any analysis of why the UAF is triggered; care to
> > investigate the reason?
> 
> Hi, Ming
> 
> I'm sorry if I didn't explain the UAF clearly.
> 
> 1) At first, a normal I/O is submitted and completed with a scheduler:
> 
> internal_tag = blk_mq_get_tag -> get tag from sched_tags
>  blk_mq_rq_ctx_init
>   sched_tags->rq[internal_tag] = sched_tag->static_rq[internal_tag]
> ...
> blk_mq_get_driver_tag
>  __blk_mq_get_driver_tag -> get tag from tags
>  tags->rq[tag] = sched_tag->static_rq[internal_tag]
> 
> So, both tags->rq[tag] and sched_tags->rq[internal_tag] are pointing
> to the request: sched_tags->static_rq[internal_tag].
> 
> 2) Then, if the sched_tags->static_rq is freed:
> 
> blk_mq_sched_free_requests
>  blk_mq_free_rqs(q->tag_set, hctx->sched_tags, i)
>   blk_mq_clear_rq_mapping(set, tags, hctx_idx);
>    -> sched_tags->rq[internal_tag] is set to NULL here

Please take a look at blk_mq_clear_rq_mapping():

	//drv_tags points to set->tags[] which is shared host-wide
	struct blk_mq_tags *drv_tags = set->tags[hctx_idx];
	...

	//tags points to sched_tags
	list_for_each_entry(page, &tags->page_list, lru) {
		unsigned long start = (unsigned long)page_address(page);
		unsigned long end = start + order_to_size(page->private);
		int i;

		/* clear drv_tags->rq[i] in case it is from these sched tags */
		for (i = 0; i < set->queue_depth; i++) {
			struct request *rq = drv_tags->rqs[i];
			unsigned long rq_addr = (unsigned long)rq;

			if (rq_addr >= start && rq_addr < end) {
				WARN_ON_ONCE(refcount_read(&rq->ref) != 0);
				cmpxchg(&drv_tags->rqs[i], rq, NULL);
			}
		}
	}

So we do clear tags->rq[] instead of sched_tag->rq[].

> 
> After switching the elevator, tags->rq[tag] still points to the request
> that was just freed.

No.

> 
> 3) The nbd server sends a reply with a random tag directly:
> 
> recv_work
>  nbd_read_stat
>   blk_mq_tag_to_rq(tags, tag)
>    rq = tags->rq[tag] -> rq is freed
> 
> Usually, nbd will get a tag and send a request to the server first, and
> then handle the reply. However, if that request is skipped, such a UAF
> problem can be triggered.

When or how is such a reply with a random tag sent to the nbd client? Is it
possible for the nbd client to detect such an unexpected/bad situation?
What if blk_mq_tag_to_rq() is called just before/while we clear tags->rq[]?

> 
> > 
> > The request reference has been cleared too in blk_mq_tag_update_depth():
> > 
> > 	blk_mq_tag_update_depth
> > 		blk_mq_free_rqs
> > 			blk_mq_clear_rq_mapping
> > 
> 
> What I'm trying to do is to clear the rq mapping in both
> hctx->sched_tags->rq and hctx->tags->rq when sched_tags->static_rq
> is freed. However, I forgot about the case where tags are shared among
> multiple devices. Thus what this patch does is clearly wrong...
> 
> So, what do you think about adding a new interface to iterate the
> requests in tags->rq[], find the ones pointing into
> sched_tags->static_rq[], and use cmpxchg() to clear them?

See above: we already clear tags->rq[] if the rq is from the to-be-freed
sched_tags.


Thanks,
Ming
Yu Kuai Aug. 18, 2021, 3:13 a.m. UTC | #4
On 2021/08/18 10:45, Ming Lei wrote:

> Please take a look at blk_mq_clear_rq_mapping():
> 
> 	//drv_tags points to set->tags[] which is shared host-wide
> 	struct blk_mq_tags *drv_tags = set->tags[hctx_idx];
> 	...
> 
> 	//tags points to sched_tags
> 	list_for_each_entry(page, &tags->page_list, lru) {
> 		unsigned long start = (unsigned long)page_address(page);
> 		unsigned long end = start + order_to_size(page->private);
> 		int i;
> 
> 		/* clear drv_tags->rq[i] in case it is from these sched tags */
> 		for (i = 0; i < set->queue_depth; i++) {
> 			struct request *rq = drv_tags->rqs[i];
> 			unsigned long rq_addr = (unsigned long)rq;
> 
> 			if (rq_addr >= start && rq_addr < end) {
> 				WARN_ON_ONCE(refcount_read(&rq->ref) != 0);
> 				cmpxchg(&drv_tags->rqs[i], rq, NULL);
> 			}
> 		}
> 	}
> 
> So we do clear tags->rq[] instead of sched_tag->rq[].

Thanks for the correction, I misunderstood the code, my bad...

> 
>>
>> After switching the elevator, tags->rq[tag] still points to the request
>> that was just freed.
> 
> No.
> 
>>
>> 3) The nbd server sends a reply with a random tag directly:
>>
>> recv_work
>>   nbd_read_stat
>>    blk_mq_tag_to_rq(tags, tag)
>>     rq = tags->rq[tag] -> rq is freed
>>
>> Usually, nbd will get a tag and send a request to the server first, and
>> then handle the reply. However, if that request is skipped, such a UAF
>> problem can be triggered.
> 
> When or how is such a reply with a random tag sent to the nbd client? Is it
> possible for the nbd client to detect such an unexpected/bad situation?

We saw the random tag replied to the nbd client in our syzkaller test;
I guess it will not happen in the common case.

nbd_read_stat() tries to get a valid request from the tag and complete
the request, since the server sent such a reply to the client.

There is a way that I can think of to detect the situation where the
server replies to the client without a request from the client:

1) get the tag from the reply message

2) check that the tag is really set in the bitmap of nbd->tag_set.tags[].
If the client didn't send the request message, the driver tag has not
been acquired yet, so this check can detect the situation.

3) call blk_mq_tag_to_rq() to get the request (sketched below)
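
A hypothetical sketch of those three steps (nbd_handle_to_tag() and the
blk_mq_unique_tag_*() helpers already exist; blk_mq_tag_allocated() is
an assumed helper, not an existing API, standing in for a test of the
driver tag bitmap):

	/*
	 * Hypothetical sketch; blk_mq_tag_allocated() is assumed, not a
	 * real API.
	 */
	static struct request *nbd_reply_to_rq(struct nbd_device *nbd, u64 handle)
	{
		u32 utag = nbd_handle_to_tag(handle);	/* step 1 */
		u32 hwq = blk_mq_unique_tag_to_hwq(utag);
		u32 tag = blk_mq_unique_tag_to_tag(utag);
		struct blk_mq_tags *tags;

		if (hwq >= nbd->tag_set.nr_hw_queues)
			return NULL;
		tags = nbd->tag_set.tags[hwq];

		/* step 2: reject a tag the client never allocated */
		if (!blk_mq_tag_allocated(tags, tag))
			return NULL;

		return blk_mq_tag_to_rq(tags, tag);	/* step 3 */
	}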

> What if blk_mq_tag_to_rq() is called just before/while we clear tags->rq[]?

The concurrent scenario is still possible currently.
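
Roughly, the problematic interleaving would be (assumed for
illustration):

	/*
	 * nbd recv_work (CPU0)           elevator switch (CPU1)
	 * --------------------           ----------------------
	 * rq = tags->rqs[tag];
	 *                                blk_mq_clear_rq_mapping()
	 *                                blk_mq_free_rqs()  <- rq freed
	 * dereference rq
	 *   -> use-after-free
	 */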

Thanks
Kuai

Patch

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 0f006cabfd91..9f11f17b8380 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -662,8 +662,16 @@  void blk_mq_sched_free_requests(struct request_queue *q)
 	int i;
 
 	queue_for_each_hw_ctx(q, hctx, i) {
-		if (hctx->sched_tags)
+		if (hctx->sched_tags) {
+			/*
+			 * We are about to free requests in 'sched_tags[]',
+			 * however, 'tags[]' may still point to these requests.
+			 * Thus we need to clear rq mapping in 'tags[]' before
+			 * freeing requests in sched_tags[].
+			 */
+			blk_mq_clear_rq_mapping(q->tag_set, hctx->tags, i);
 			blk_mq_free_rqs(q->tag_set, hctx->sched_tags, i);
+		}
 	}
 }
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d185be64c85f..b1e30464f87f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2314,8 +2314,8 @@  static size_t order_to_size(unsigned int order)
 }
 
 /* called before freeing request pool in @tags */
-static void blk_mq_clear_rq_mapping(struct blk_mq_tag_set *set,
-		struct blk_mq_tags *tags, unsigned int hctx_idx)
+void blk_mq_clear_rq_mapping(struct blk_mq_tag_set *set,
+			     struct blk_mq_tags *tags, unsigned int hctx_idx)
 {
 	struct blk_mq_tags *drv_tags = set->tags[hctx_idx];
 	struct page *page;
@@ -3632,6 +3632,15 @@  int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 			if (!ret && blk_mq_is_sbitmap_shared(set->flags))
 				blk_mq_tag_resize_shared_sbitmap(set, nr);
 		} else {
+			/*
+			 * We are about to free requests in 'sched_tags[]',
+			 * however, 'tags[]' may still point to these requests.
+			 * Thus we need to clear rq mapping in 'tags[]' before
+			 * freeing requests in sched_tags[].
+			 */
+			if (nr > hctx->sched_tags->nr_tags)
+				blk_mq_clear_rq_mapping(set, hctx->tags, i);
+
 			ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
 							nr, true);
 			if (blk_mq_is_sbitmap_shared(set->flags)) {
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 9e646ade81a8..d31f96eca71e 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -77,6 +77,8 @@  void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 blk_status_t blk_mq_request_issue_directly(struct request *rq, bool last);
 void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
 				    struct list_head *list);
+void blk_mq_clear_rq_mapping(struct blk_mq_tag_set *set,
+			     struct blk_mq_tags *tags, unsigned int hctx_idx);
 
 /*
  * CPU -> queue mappings