| Message ID | 20210817022306.1622027-1-yukuai3@huawei.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | [RFC] blk_mq: clear rq mapping in driver tags before freeing rqs in sched tags |
On Tue, Aug 17, 2021 at 10:23:06AM +0800, Yu Kuai wrote:
> If ioscheduler is not none, hctx->tags->rq[tag] will point to
> hctx->sched_tags->static_rq[internel_tag] in blk_mq_get_driver_tag().
> However, static_rq of sched_tags might be freed through switching
> the elevator or increasing nr_requests. This leaves a window for some
> drivers to get the freed request through blk_mq_tag_to_rq(tags, tag).

I believe I have explained that it is a bug of the driver, which knows
whether the passed tag is valid or not. We are clear that the driver
needs to cover the race between normal completion and timeout/error
handling.

>
> It's difficult to fix this uaf from the driver side; I'm thinking about

So far I haven't seen any analysis of why the uaf is triggered; care to
investigate the reason?

> the following solution:
>
> a. clear rq mapping in driver tags before freeing rqs in sched tags

We have done that already, see blk_mq_free_rqs().

> b. provide a new interface to replace blk_mq_tag_to_rq(); the new
>    interface will make sure it won't return a freed rq.

b) in your previous patch can't avoid the uaf:

https://lore.kernel.org/linux-block/YRHa%2FkeJ4pHP3hnL@T590/

>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  block/blk-mq-sched.c | 10 +++++++++-
>  block/blk-mq.c       | 13 +++++++++++--
>  block/blk-mq.h       |  2 ++
>  3 files changed, 22 insertions(+), 3 deletions(-)
>
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 0f006cabfd91..9f11f17b8380 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -662,8 +662,16 @@ void blk_mq_sched_free_requests(struct request_queue *q)
>  	int i;
>
>  	queue_for_each_hw_ctx(q, hctx, i) {
> -		if (hctx->sched_tags)
> +		if (hctx->sched_tags) {
> +			/*
> +			 * We are about to free requests in 'sched_tags[]',
> +			 * however, 'tags[]' may still point to these requests.
> +			 * Thus we need to clear rq mapping in 'tags[]' before
> +			 * freeing requests in sched_tags[].
> +			 */
> +			blk_mq_clear_rq_mapping(q->tag_set, hctx->tags, i);
>  			blk_mq_free_rqs(q->tag_set, hctx->sched_tags, i);

blk_mq_clear_rq_mapping() has already been called in blk_mq_free_rqs()
for clearing the request reference. In theory, we only need to clear it
in case of a real io scheduler, but so far we do it for both io sched
and none.

> +		}
>  	}
>  }
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index d185be64c85f..b1e30464f87f 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2314,8 +2314,8 @@ static size_t order_to_size(unsigned int order)
>  }
>
>  /* called before freeing request pool in @tags */
> -static void blk_mq_clear_rq_mapping(struct blk_mq_tag_set *set,
> -		struct blk_mq_tags *tags, unsigned int hctx_idx)
> +void blk_mq_clear_rq_mapping(struct blk_mq_tag_set *set,
> +		struct blk_mq_tags *tags, unsigned int hctx_idx)
>  {
>  	struct blk_mq_tags *drv_tags = set->tags[hctx_idx];
>  	struct page *page;
> @@ -3632,6 +3632,15 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
>  		if (!ret && blk_mq_is_sbitmap_shared(set->flags))
>  			blk_mq_tag_resize_shared_sbitmap(set, nr);
>  	} else {
> +		/*
> +		 * We are about to free requests in 'sched_tags[]',
> +		 * however, 'tags[]' may still point to these requests.
> +		 * Thus we need to clear rq mapping in 'tags[]' before
> +		 * freeing requests in sched_tags[].
> +		 */
> +		if (nr > hctx->sched_tags->nr_tags)
> +			blk_mq_clear_rq_mapping(set, hctx->tags, i);
> +
>  		ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
>  						nr, true);

The request reference has been cleared too in blk_mq_tag_update_depth():

	blk_mq_tag_update_depth
		blk_mq_free_rqs
			blk_mq_clear_rq_mapping

Thanks,
Ming
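For reference, a condensed sketch of blk_mq_free_rqs() as it looked around v5.13/v5.14 (details may differ slightly between trees); the point Ming makes is that the driver-tag mapping is cleared before any request page is returned:

```c
/* Condensed from block/blk-mq.c of that era; not a verbatim copy. */
void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
		     unsigned int hctx_idx)
{
	struct page *page;

	/* let the driver tear down its per-request data first */
	if (tags->static_rqs && set->ops->exit_request) {
		int i;

		for (i = 0; i < tags->nr_tags; i++) {
			struct request *rq = tags->static_rqs[i];

			if (!rq)
				continue;
			set->ops->exit_request(set, rq, hctx_idx);
			tags->static_rqs[i] = NULL;
		}
	}

	/*
	 * Clear stale references in the driver tags (set->tags[hctx_idx])
	 * that point into this tags' request pages...
	 */
	blk_mq_clear_rq_mapping(set, tags, hctx_idx);

	/* ...and only then free the pages backing the requests. */
	while (!list_empty(&tags->page_list)) {
		page = list_first_entry(&tags->page_list, struct page, lru);
		list_del_init(&page->lru);
		__free_pages(page, page->private);
	}
}
```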
On 2021/08/18 8:52, Ming Lei wrote:
> On Tue, Aug 17, 2021 at 10:23:06AM +0800, Yu Kuai wrote:
>> If ioscheduler is not none, hctx->tags->rq[tag] will point to
>> hctx->sched_tags->static_rq[internel_tag] in blk_mq_get_driver_tag().
>> However, static_rq of sched_tags might be freed through switching
>> the elevator or increasing nr_requests. This leaves a window for some
>> drivers to get the freed request through blk_mq_tag_to_rq(tags, tag).
>
> I believe I have explained that it is a bug of the driver, which knows
> whether the passed tag is valid or not. We are clear that the driver
> needs to cover the race between normal completion and timeout/error
> handling.
>
>>
>> It's difficult to fix this uaf from the driver side; I'm thinking about
>
> So far I haven't seen any analysis of why the uaf is triggered; care to
> investigate the reason?

Hi, Ming

I'm sorry if I didn't explain the uaf clearly.

1) At first, a normal io is submitted and completed with a scheduler:

internel_tag = blk_mq_get_tag		-> get tag from sched_tags
blk_mq_rq_ctx_init
	sched_tags->rq[internel_tag] = sched_tag->static_rq[internel_tag]
...
blk_mq_get_driver_tag
	__blk_mq_get_driver_tag		-> get tag from tags
	tags->rq[tag] = sched_tag->static_rq[internel_tag]

So, both tags->rq[tag] and sched_tags->rq[internel_tag] are pointing
to the request: sched_tags->static_rq[internal_tag].

2) Then, if the sched_tags->static_rq is freed:

blk_mq_sched_free_requests
	blk_mq_free_rqs(q->tag_set, hctx->sched_tags, i)
		blk_mq_clear_rq_mapping(set, tags, hctx_idx);
		-> sched_tags->rq[internel_tag] is set to null here

After switching the elevator, tags->rq[tag] still points to the request
that was just freed.

3) The nbd server sends a reply with a random tag directly:

recv_work
	nbd_read_stat
		blk_mq_tag_to_rq(tags, tag)
			rq = tags->rq[tag]	-> rq is freed

Usually, nbd will get a tag and send a request to the server first, and
then handle the reply. However, if that request step is skipped, such a
uaf problem can be triggered.

>
> The request reference has been cleared too in blk_mq_tag_update_depth():
>
> 	blk_mq_tag_update_depth
> 		blk_mq_free_rqs
> 			blk_mq_clear_rq_mapping
>

What I'm trying to do is to clear the rq mapping in both
hctx->sched_tags->rq and hctx->tags->rq when sched_tags->static_rq
is freed. However, I forgot about the case when tags is shared among
multiple devices. Thus what this patch does is clearly wrong...

So, what do you think about adding a new interface to iterate the
requests in tags->rq[], find what is pointing to the
sched_tags->static_rq[], and use cmpxchg() to clear them?

Thanks
Kuai
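For context on step 3): blk_mq_tag_to_rq() in this era of the tree is essentially a bare array load (sketch below, modulo minor differences between versions), so a stale tags->rqs[] entry is handed straight back to the driver:

```c
/* Essentially as in block/blk-mq.c of this era. */
struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
{
	if (tag < tags->nr_tags) {
		/*
		 * Plain load of the slot; there is no synchronization
		 * here against the tag map being torn down concurrently.
		 */
		prefetch(tags->rqs[tag]);
		return tags->rqs[tag];
	}

	return NULL;
}
```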
On Wed, Aug 18, 2021 at 10:02:09AM +0800, yukuai (C) wrote:
> On 2021/08/18 8:52, Ming Lei wrote:
> > On Tue, Aug 17, 2021 at 10:23:06AM +0800, Yu Kuai wrote:
> > > If ioscheduler is not none, hctx->tags->rq[tag] will point to
> > > hctx->sched_tags->static_rq[internel_tag] in blk_mq_get_driver_tag().
> > > However, static_rq of sched_tags might be freed through switching
> > > the elevator or increasing nr_requests. This leaves a window for some
> > > drivers to get the freed request through blk_mq_tag_to_rq(tags, tag).
> >
> > I believe I have explained that it is a bug of the driver, which knows
> > whether the passed tag is valid or not. We are clear that the driver
> > needs to cover the race between normal completion and timeout/error
> > handling.
> >
> > >
> > > It's difficult to fix this uaf from the driver side; I'm thinking about
> >
> > So far I haven't seen any analysis of why the uaf is triggered; care to
> > investigate the reason?
>
> Hi, Ming
>
> I'm sorry if I didn't explain the uaf clearly.
>
> 1) At first, a normal io is submitted and completed with a scheduler:
>
> internel_tag = blk_mq_get_tag		-> get tag from sched_tags
> blk_mq_rq_ctx_init
> 	sched_tags->rq[internel_tag] = sched_tag->static_rq[internel_tag]
> ...
> blk_mq_get_driver_tag
> 	__blk_mq_get_driver_tag		-> get tag from tags
> 	tags->rq[tag] = sched_tag->static_rq[internel_tag]
>
> So, both tags->rq[tag] and sched_tags->rq[internel_tag] are pointing
> to the request: sched_tags->static_rq[internal_tag].
>
> 2) Then, if the sched_tags->static_rq is freed:
>
> blk_mq_sched_free_requests
> 	blk_mq_free_rqs(q->tag_set, hctx->sched_tags, i)
> 		blk_mq_clear_rq_mapping(set, tags, hctx_idx);
> 		-> sched_tags->rq[internel_tag] is set to null here

Please take a look at blk_mq_clear_rq_mapping():

	//drv_tags points to set->tags[] which is shared host-wide
	struct blk_mq_tags *drv_tags = set->tags[hctx_idx];
	...

	//tags points to sched_tags
	list_for_each_entry(page, &tags->page_list, lru) {
		unsigned long start = (unsigned long)page_address(page);
		unsigned long end = start + order_to_size(page->private);
		int i;

		/* clear drv_tags->rq[i] in case it is from this sched tags */
		for (i = 0; i < set->queue_depth; i++) {
			struct request *rq = drv_tags->rqs[i];
			unsigned long rq_addr = (unsigned long)rq;

			if (rq_addr >= start && rq_addr < end) {
				WARN_ON_ONCE(refcount_read(&rq->ref) != 0);
				cmpxchg(&drv_tags->rqs[i], rq, NULL);
			}
		}
	}

So we do clear tags->rq[] instead of sched_tags->rq[].

>
> After switching the elevator, tags->rq[tag] still points to the request
> that was just freed.

No.

>
> 3) The nbd server sends a reply with a random tag directly:
>
> recv_work
> 	nbd_read_stat
> 		blk_mq_tag_to_rq(tags, tag)
> 			rq = tags->rq[tag]	-> rq is freed
>
> Usually, nbd will get a tag and send a request to the server first, and
> then handle the reply. However, if that request step is skipped, such a
> uaf problem can be triggered.

When or how is such a reply with a random tag sent to the nbd client? Is
it possible for the nbd client to detect such an unexpected/bad
situation?

What if blk_mq_tag_to_rq() is called just before/while we clear
tags->rq[]?

>
> >
> > The request reference has been cleared too in blk_mq_tag_update_depth():
> >
> > 	blk_mq_tag_update_depth
> > 		blk_mq_free_rqs
> > 			blk_mq_clear_rq_mapping
> >
>
> What I'm trying to do is to clear the rq mapping in both
> hctx->sched_tags->rq and hctx->tags->rq when sched_tags->static_rq
> is freed. However, I forgot about the case when tags is shared among
> multiple devices. Thus what this patch does is clearly wrong...
>
> So, what do you think about adding a new interface to iterate the
> requests in tags->rq[], find what is pointing to the
> sched_tags->static_rq[], and use cmpxchg() to clear them?

See above, we already clear tags->rq[] if the rq is from the to-be-freed
sched_tags.

Thanks,
Ming
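To make the closing question concrete: the window being asked about is a plain reader racing with the cmpxchg()-based clear and the subsequent page free. An illustrative interleaving (not code from the tree):

```c
/*
 * nbd recv_work (reader)            elevator switch (freeing side)
 * ----------------------            ------------------------------
 * rq = tags->rqs[tag];
 *                                   cmpxchg(&tags->rqs[tag], rq, NULL);
 *                                   __free_pages(...);  <- rq memory returned
 * dereference rq                    <- use-after-free
 *
 * The cmpxchg() keeps the clear consistent against concurrent updates
 * of the rqs[] slot, but it cannot help a reader that loaded the stale
 * pointer just before the slot was cleared; this is the "concurrent
 * scenario" confirmed in the next reply.
 */
```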
On 2021/08/18 10:45, Ming Lei wrote:
> Please take a look at blk_mq_clear_rq_mapping():
>
> 	//drv_tags points to set->tags[] which is shared host-wide
> 	struct blk_mq_tags *drv_tags = set->tags[hctx_idx];
> 	...
>
> 	//tags points to sched_tags
> 	list_for_each_entry(page, &tags->page_list, lru) {
> 		unsigned long start = (unsigned long)page_address(page);
> 		unsigned long end = start + order_to_size(page->private);
> 		int i;
>
> 		/* clear drv_tags->rq[i] in case it is from this sched tags */
> 		for (i = 0; i < set->queue_depth; i++) {
> 			struct request *rq = drv_tags->rqs[i];
> 			unsigned long rq_addr = (unsigned long)rq;
>
> 			if (rq_addr >= start && rq_addr < end) {
> 				WARN_ON_ONCE(refcount_read(&rq->ref) != 0);
> 				cmpxchg(&drv_tags->rqs[i], rq, NULL);
> 			}
> 		}
> 	}
>
> So we do clear tags->rq[] instead of sched_tags->rq[].

Thanks for the correction, I misunderstood the code, my bad...

>
>> After switching the elevator, tags->rq[tag] still points to the request
>> that was just freed.
>
> No.
>
>>
>> 3) The nbd server sends a reply with a random tag directly:
>>
>> recv_work
>> 	nbd_read_stat
>> 		blk_mq_tag_to_rq(tags, tag)
>> 			rq = tags->rq[tag]	-> rq is freed
>>
>> Usually, nbd will get a tag and send a request to the server first, and
>> then handle the reply. However, if that request step is skipped, such a
>> uaf problem can be triggered.
>
> When or how is such a reply with a random tag sent to the nbd client? Is
> it possible for the nbd client to detect such an unexpected/bad
> situation?

We saw the random tag replied to the nbd client in our syzkaller test; I
guess it will not happen in the common case. nbd_read_stat() tries to
get a valid request from the tag and complete the request, since the
server sent such a reply to the client.

There is a way that I can think of to detect the situation where the
server replies to the client without a matching request from the client:

1) get the tag from the reply message
2) check that the tag is really set in the bitmap of nbd->tag_set.tags[];
   if the client didn't send the request message, the driver tag has not
   been acquired yet, so this check can detect the situation
3) call blk_mq_tag_to_rq() to get the request

> What if blk_mq_tag_to_rq() is called just before/while we clear
> tags->rq[]?

The concurrent scenario is still possible currently.

Thanks
Kuai
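A minimal sketch of the detection flow Kuai proposes in steps 1)-3) above, assuming a hypothetical nbd_tag_in_flight() helper (the driver-tag bitmap is internal to blk-mq, so a real implementation would need a proper blk-mq interface); this is not code from the nbd driver:

```c
bool nbd_tag_in_flight(struct blk_mq_tags *tags, u32 tag); /* hypothetical */

static struct request *nbd_reply_to_rq(struct blk_mq_tags *tags, u32 tag)
{
	/*
	 * Steps 1) and 2): reject replies whose tag was never handed out.
	 * If the client never sent the request, the driver tag has not
	 * been acquired, so the reply must be bogus.
	 */
	if (!nbd_tag_in_flight(tags, tag))
		return NULL;

	/* Step 3): only then resolve the tag to a request. */
	return blk_mq_tag_to_rq(tags, tag);
}
```

Note that this only filters tags that were never allocated; as the last line of the reply says, it does not close the window where a reply races with the clearing of tags->rqs[].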
```diff
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 0f006cabfd91..9f11f17b8380 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -662,8 +662,16 @@ void blk_mq_sched_free_requests(struct request_queue *q)
 	int i;
 
 	queue_for_each_hw_ctx(q, hctx, i) {
-		if (hctx->sched_tags)
+		if (hctx->sched_tags) {
+			/*
+			 * We are about to free requests in 'sched_tags[]',
+			 * however, 'tags[]' may still point to these requests.
+			 * Thus we need to clear rq mapping in 'tags[]' before
+			 * freeing requests in sched_tags[].
+			 */
+			blk_mq_clear_rq_mapping(q->tag_set, hctx->tags, i);
 			blk_mq_free_rqs(q->tag_set, hctx->sched_tags, i);
+		}
 	}
 }
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d185be64c85f..b1e30464f87f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2314,8 +2314,8 @@ static size_t order_to_size(unsigned int order)
 }
 
 /* called before freeing request pool in @tags */
-static void blk_mq_clear_rq_mapping(struct blk_mq_tag_set *set,
-		struct blk_mq_tags *tags, unsigned int hctx_idx)
+void blk_mq_clear_rq_mapping(struct blk_mq_tag_set *set,
+		struct blk_mq_tags *tags, unsigned int hctx_idx)
 {
 	struct blk_mq_tags *drv_tags = set->tags[hctx_idx];
 	struct page *page;
@@ -3632,6 +3632,15 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 		if (!ret && blk_mq_is_sbitmap_shared(set->flags))
 			blk_mq_tag_resize_shared_sbitmap(set, nr);
 	} else {
+		/*
+		 * We are about to free requests in 'sched_tags[]',
+		 * however, 'tags[]' may still point to these requests.
+		 * Thus we need to clear rq mapping in 'tags[]' before
+		 * freeing requests in sched_tags[].
+		 */
+		if (nr > hctx->sched_tags->nr_tags)
+			blk_mq_clear_rq_mapping(set, hctx->tags, i);
+
 		ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
 						nr, true);
 		if (blk_mq_is_sbitmap_shared(set->flags)) {
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 9e646ade81a8..d31f96eca71e 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -77,6 +77,8 @@ void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 blk_status_t blk_mq_request_issue_directly(struct request *rq, bool last);
 void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
 				    struct list_head *list);
+void blk_mq_clear_rq_mapping(struct blk_mq_tag_set *set,
+		struct blk_mq_tags *tags, unsigned int hctx_idx);
 
 /*
  * CPU -> queue mappings
```
If ioscheduler is not none, hctx->tags->rq[tag] will point to
hctx->sched_tags->static_rq[internel_tag] in blk_mq_get_driver_tag().
However, static_rq of sched_tags might be freed through switching
the elevator or increasing nr_requests. This leaves a window for some
drivers to get the freed request through blk_mq_tag_to_rq(tags, tag).

It's difficult to fix this uaf from the driver side; I'm thinking about
the following solution:

a. clear rq mapping in driver tags before freeing rqs in sched tags
b. provide a new interface to replace blk_mq_tag_to_rq(); the new
   interface will make sure it won't return a freed rq.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-sched.c | 10 +++++++++-
 block/blk-mq.c       | 13 +++++++++++--
 block/blk-mq.h       |  2 ++
 3 files changed, 22 insertions(+), 3 deletions(-)
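For illustration only, option (b) might look something like the sketch below, borrowing the rq->ref / tags->lock pattern that blk-mq already uses when iterating busy tags; the name blk_mq_tag_to_rq_safe() is hypothetical, not an existing API:

```c
/*
 * Hypothetical replacement for blk_mq_tag_to_rq() that refuses to
 * return a request whose reference has dropped to zero. On success
 * the caller owns a reference and must drop it when done.
 */
static struct request *blk_mq_tag_to_rq_safe(struct blk_mq_tags *tags,
					     unsigned int tag)
{
	struct request *rq = NULL;
	unsigned long flags;

	/* tags->lock would serialize against clearing/freeing the pool */
	spin_lock_irqsave(&tags->lock, flags);
	if (tag < tags->nr_tags) {
		rq = tags->rqs[tag];
		/* a freed/recycled request has ref == 0; don't return it */
		if (rq && !refcount_inc_not_zero(&rq->ref))
			rq = NULL;
	}
	spin_unlock_irqrestore(&tags->lock, flags);

	return rq;
}
```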