Message ID: 1614957294-188540-4-git-send-email-john.garry@huawei.com (mailing list archive)
State: New, archived
Series: blk-mq: Avoid use-after-free for accessing old requests
On 3/5/21 7:14 AM, John Garry wrote:
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 7ff1b20d58e7..5950fee490e8 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -358,11 +358,16 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>  {
>  	int i;
>  
> +	if (!atomic_inc_not_zero(&tagset->iter_usage_counter))
> +		return;
> +
>  	for (i = 0; i < tagset->nr_hw_queues; i++) {
>  		if (tagset->tags && tagset->tags[i])
>  			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
>  					      BT_TAG_ITER_STARTED);
>  	}
> +
> +	atomic_dec(&tagset->iter_usage_counter);
>  }
>  EXPORT_SYMBOL(blk_mq_tagset_busy_iter);

This changes the behavior of blk_mq_tagset_busy_iter(). What will e.g.
happen if the mtip driver calls blk_mq_tagset_busy_iter(&dd->tags,
mtip_abort_cmd, dd) concurrently with another blk_mq_tagset_busy_iter()
call and if that causes all mtip_abort_cmd() calls to be skipped?

> +	while (atomic_cmpxchg(&set->iter_usage_counter, 1, 0) != 1);

Isn't it recommended to call cpu_relax() inside busy-waiting loops?

> 	blk_mq_sched_free_requests(q);
> 	__elevator_exit(q, e);
> 
> +	atomic_set(&set->iter_usage_counter, 1);

Can it happen that the above atomic_set() call happens while a
blk_mq_tagset_busy_iter() call is in progress? Should that atomic_set()
call perhaps be changed into an atomic_inc() call?

Thanks,

Bart.
On 06/03/2021 04:43, Bart Van Assche wrote:
> On 3/5/21 7:14 AM, John Garry wrote:
>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>> index 7ff1b20d58e7..5950fee490e8 100644
>> --- a/block/blk-mq-tag.c
>> +++ b/block/blk-mq-tag.c
>> @@ -358,11 +358,16 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>>  {
>>  	int i;
>>  
>> +	if (!atomic_inc_not_zero(&tagset->iter_usage_counter))
>> +		return;
>> +
>>  	for (i = 0; i < tagset->nr_hw_queues; i++) {
>>  		if (tagset->tags && tagset->tags[i])
>>  			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
>>  					      BT_TAG_ITER_STARTED);
>>  	}
>> +
>> +	atomic_dec(&tagset->iter_usage_counter);
>>  }
>>  EXPORT_SYMBOL(blk_mq_tagset_busy_iter);

Hi Bart,

> This changes the behavior of blk_mq_tagset_busy_iter(). What will e.g.
> happen if the mtip driver calls blk_mq_tagset_busy_iter(&dd->tags,
> mtip_abort_cmd, dd) concurrently with another blk_mq_tagset_busy_iter()
> call and if that causes all mtip_abort_cmd() calls to be skipped?

I'm not sure that I understand this problem you describe. So if
blk_mq_tagset_busy_iter(&dd->tags, mtip_abort_cmd, dd) is called, either
can happen:

a. normal operation, iter_usage_counter initially holds >= 1, and then
iter_usage_counter is incremented in blk_mq_tagset_busy_iter() and we
iter the busy tags. Any parallel call to blk_mq_tagset_busy_iter() will
also increase iter_usage_counter.

b. we're switching IO scheduler. In this scenario, first we quiesce all
queues. After that, there should be no active requests. At that point,
we ensure any calls to blk_mq_tagset_busy_iter() are finished and block
(or discard may be a better term) any more calls. Blocking any more
calls should be safe as there are no requests to iter. atomic_cmpxchg()
is used to set iter_usage_counter to 0, blocking any more calls.

>> +	while (atomic_cmpxchg(&set->iter_usage_counter, 1, 0) != 1);
>
> Isn't it recommended to call cpu_relax() inside busy-waiting loops?

Maybe, but I am considering changing this patch to use percpu_refcnt() -
I need to check it further.

>>  	blk_mq_sched_free_requests(q);
>>  	__elevator_exit(q, e);
>>  
>> +	atomic_set(&set->iter_usage_counter, 1);
>
> Can it happen that the above atomic_set() call happens while a
> blk_mq_tagset_busy_iter() call is in progress?

No, as at this point it should be ensured that iter_usage_counter holds
0 from the atomic_cmpxchg(), so there should be no active processes in
the blk_mq_tagset_busy_iter() sensitive region. Calls to
blk_mq_tagset_busy_iter() are blocked while iter_usage_counter holds 0.

> Should that atomic_set() call perhaps be changed into an atomic_inc()
> call?

They have the same effect in practice, but we use atomic_set() in
blk_mq_alloc_tag_set(), so this is at least consistent.

Thanks,
John
On 3/8/21 3:17 AM, John Garry wrote:
> On 06/03/2021 04:43, Bart Van Assche wrote:
>> On 3/5/21 7:14 AM, John Garry wrote:
>>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>>> index 7ff1b20d58e7..5950fee490e8 100644
>>> --- a/block/blk-mq-tag.c
>>> +++ b/block/blk-mq-tag.c
>>> @@ -358,11 +358,16 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>>>  {
>>>  	int i;
>>>  
>>> +	if (!atomic_inc_not_zero(&tagset->iter_usage_counter))
>>> +		return;
>>> +
>>>  	for (i = 0; i < tagset->nr_hw_queues; i++) {
>>>  		if (tagset->tags && tagset->tags[i])
>>>  			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
>>>  					      BT_TAG_ITER_STARTED);
>>>  	}
>>> +
>>> +	atomic_dec(&tagset->iter_usage_counter);
>>>  }
>>>  EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
>
> Hi Bart,
>
>> This changes the behavior of blk_mq_tagset_busy_iter(). What will e.g.
>> happen if the mtip driver calls blk_mq_tagset_busy_iter(&dd->tags,
>> mtip_abort_cmd, dd) concurrently with another blk_mq_tagset_busy_iter()
>> call and if that causes all mtip_abort_cmd() calls to be skipped?
>
> I'm not sure that I understand this problem you describe. So if
> blk_mq_tagset_busy_iter(&dd->tags, mtip_abort_cmd, dd) is called, either
> can happen:
> a. normal operation, iter_usage_counter initially holds >= 1, and then
> iter_usage_counter is incremented in blk_mq_tagset_busy_iter() and we
> iter the busy tags. Any parallel call to blk_mq_tagset_busy_iter() will
> also increase iter_usage_counter.
> b. we're switching IO scheduler. In this scenario, first we quiesce all
> queues. After that, there should be no active requests. At that point,
> we ensure any calls to blk_mq_tagset_busy_iter() are finished and block
> (or discard may be a better term) any more calls. Blocking any more
> calls should be safe as there are no requests to iter. atomic_cmpxchg()
> is used to set iter_usage_counter to 0, blocking any more calls.

Hi John,

My concern is about the insertion of the early return statement in
blk_mq_tagset_busy_iter(). Although most blk_mq_tagset_busy_iter()
callers can handle skipping certain blk_mq_tagset_busy_iter() calls
(e.g. when gathering statistics), I'm not sure this is safe for all
blk_mq_tagset_busy_iter() callers. The example I cited is an example of
a blk_mq_tagset_busy_iter() call with side effects.

The mtip driver allocates one tag set per request queue so quiescing
queues should be sufficient to address my concern for the mtip driver.

The NVMe core and SCSI core however share a single tag set across
multiple namespaces / LUNs. In the error path of nvme_rdma_setup_ctrl()
I found a call to nvme_cancel_tagset(). nvme_cancel_tagset() calls
blk_mq_tagset_busy_iter(ctrl->tagset, nvme_cancel_request, ctrl). I'm
not sure it is safe to skip the nvme_cancel_request() calls if the I/O
scheduler for another NVMe namespace is being modified.

Thanks,

Bart.
On 08/03/2021 19:59, Bart Van Assche wrote:
>>> This changes the behavior of blk_mq_tagset_busy_iter(). What will e.g.
>>> happen if the mtip driver calls blk_mq_tagset_busy_iter(&dd->tags,
>>> mtip_abort_cmd, dd) concurrently with another blk_mq_tagset_busy_iter()
>>> call and if that causes all mtip_abort_cmd() calls to be skipped?
>>
>> I'm not sure that I understand this problem you describe. So if
>> blk_mq_tagset_busy_iter(&dd->tags, mtip_abort_cmd, dd) is called,
>> either can happen:
>> a. normal operation, iter_usage_counter initially holds >= 1, and then
>> iter_usage_counter is incremented in blk_mq_tagset_busy_iter() and we
>> iter the busy tags. Any parallel call to blk_mq_tagset_busy_iter()
>> will also increase iter_usage_counter.
>> b. we're switching IO scheduler. In this scenario, first we quiesce
>> all queues. After that, there should be no active requests. At that
>> point, we ensure any calls to blk_mq_tagset_busy_iter() are finished
>> and block (or discard may be a better term) any more calls. Blocking
>> any more calls should be safe as there are no requests to iter.
>> atomic_cmpxchg() is used to set iter_usage_counter to 0, blocking any
>> more calls.

Hi Bart,

> My concern is about the insertion of the early return statement in
> blk_mq_tagset_busy_iter().

So I take this approach as I don't see any way to use a mutual
exclusion waiting mechanism to block calls to blk_mq_tagset_busy_iter()
while the IO scheduler is being switched. The reason is that
blk_mq_tagset_busy_iter() can be called from any context, including
hardirq.

> Although most blk_mq_tagset_busy_iter() callers can handle skipping
> certain blk_mq_tagset_busy_iter() calls (e.g. when gathering
> statistics), I'm not sure this is safe for all
> blk_mq_tagset_busy_iter() callers. The example I cited is an example
> of a blk_mq_tagset_busy_iter() call with side effects.

I don't like to think that we're skipping it, which may imply that there
are some active requests to iter and we're just ignoring them. It's more
like: we know that there are no requests active, so don't bother trying
to iterate.

> The mtip driver allocates one tag set per request queue so quiescing
> queues should be sufficient to address my concern for the mtip driver.
>
> The NVMe core and SCSI core however share a single tag set across
> multiple namespaces / LUNs. In the error path of nvme_rdma_setup_ctrl()
> I found a call to nvme_cancel_tagset(). nvme_cancel_tagset() calls
> blk_mq_tagset_busy_iter(ctrl->tagset, nvme_cancel_request, ctrl). I'm
> not sure it is safe to skip the nvme_cancel_request() calls if the I/O
> scheduler for another NVMe namespace is being modified.

Again, I would be relying on all request_queues associated with that
tagset to be quiesced when switching IO scheduler at the point
blk_mq_tagset_busy_iter() is called and returns early. Now if there were
active requests, I am relying on the request queue quiescing to flush
them. So blk_mq_tagset_busy_iter() could be called during that quiescing
period, and would continue to iter the requests.

This does fall over if some tags are allocated without associated
request queue, which I do not know exists.

Thanks,
John
On 3/9/21 9:47 AM, John Garry wrote:
> This does fall over if some tags are allocated without associated
> request queue, which I do not know exists.

The only tag allocation mechanism I know of is blk_mq_get_tag(). The
only blk_mq_get_tag() callers I know of are __blk_mq_alloc_request() and
blk_mq_alloc_request_hctx(). So I think all allocated tags are
associated with a request queue.

Regarding this patch series, I have shared the feedback I wanted to
share so I would appreciate it if someone else could also take a look.

Thanks,

Bart.
On 09/03/2021 19:21, Bart Van Assche wrote:
> On 3/9/21 9:47 AM, John Garry wrote:
>> This does fall over if some tags are allocated without associated
>> request queue, which I do not know exists.

Hi Bart,

> The only tag allocation mechanism I know of is blk_mq_get_tag(). The
> only blk_mq_get_tag() callers I know of are __blk_mq_alloc_request()
> and blk_mq_alloc_request_hctx(). So I think all allocated tags are
> associated with a request queue.

ok, good.

> Regarding this patch series, I have shared the feedback I wanted to
> share so I would appreciate it if someone else could also take a look.

So I can incorporate any changes and suggestions so far and send a
non-RFC version - that may get more attention if none extra comes.

As mentioned on the cover letter, if patch 2+3/3 are accepted, then
patch 1/3 could be simplified. But I plan to leave as is.

BTW, any issue with putting your suggested-by on patch 2/3?

Thanks,
John
On 3/10/21 12:52 AM, John Garry wrote:
> On 09/03/2021 19:21, Bart Van Assche wrote:
>> Regarding this patch series, I have shared the feedback I wanted to
>> share so I would appreciate it if someone else could also take a look.
>
> So I can incorporate any changes and suggestions so far and send a
> non-RFC version - that may get more attention if none extra comes.
>
> As mentioned on the cover letter, if patch 2+3/3 are accepted, then
> patch 1/3 could be simplified. But I plan to leave as is.
>
> BTW, any issue with putting your suggested-by on patch 2/3?

Hi John,

I have added my Reviewed-by to patch 2/3.

Regarding the other two patches in this series: I do not agree with
patch 3/3. As I have explained, I am concerned that that patch breaks
existing block drivers.

Are patches 1/3 and 3/3 necessary? Or in other words, is patch 2/3
sufficient to fix the use-after-free?

Thanks,

Bart.
On 10/03/2021 16:00, Bart Van Assche wrote:
>> So I can incorporate any changes and suggestions so far and send a
>> non-RFC version - that may get more attention if none extra comes.
>>
>> As mentioned on the cover letter, if patch 2+3/3 are accepted, then
>> patch 1/3 could be simplified. But I plan to leave as is.
>>
>> BTW, any issue with putting your suggested-by on patch 2/3?

Hi Bart,

> I have added my Reviewed-by to patch 2/3.

OK, thanks. Please note that I still want to check further whether some
of Ming's series "blk-mq: implement queue quiesce via percpu_ref for
BLK_MQ_F_BLOCKING" can be used.

> Regarding the other two patches in this series: I do not agree with
> patch 3/3. As I have explained, I am concerned that that patch breaks
> existing block drivers.

Understood. I need to check your concern further to allay any fears. So
I could probably change that patch to drop the early return. Instead we
just need to ensure that we complete any existing calls to
blk_mq_tagset_busy_iter() prior to freeing the IO scheduler requests.
Then we don't need to return early and can iter as before - but, as I
said previously, there should be no active tags to iter.

> Are patches 1/3 and 3/3 necessary? Or in other words, is patch 2/3
> sufficient to fix the use-after-free?

No, we need them all in some form. So far, reports are that 1/3 solves
the most commonly seen UAF. It is pretty easy to trigger. But the
scenarios associated with 2/3 and 3/3 are much harder to trigger, and I
needed to add delays in the code just to trigger them.

Thanks,
John
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 7ff1b20d58e7..5950fee490e8 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -358,11 +358,16 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 {
 	int i;
 
+	if (!atomic_inc_not_zero(&tagset->iter_usage_counter))
+		return;
+
 	for (i = 0; i < tagset->nr_hw_queues; i++) {
 		if (tagset->tags && tagset->tags[i])
 			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
 					      BT_TAG_ITER_STARTED);
 	}
+
+	atomic_dec(&tagset->iter_usage_counter);
 }
 EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9cb60bf7ac24..326e1b0e5b83 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3493,6 +3493,7 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 			goto out_free_mq_rq_maps;
 		}
 	}
+	atomic_set(&set->iter_usage_counter, 1);
 
 	mutex_init(&set->tag_list_lock);
 	INIT_LIST_HEAD(&set->tag_list);
diff --git a/block/blk.h b/block/blk.h
index 1a948bfd91e4..461e5b54eb5f 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -214,9 +214,13 @@ static inline void elevator_exit(struct request_queue *q,
 		blk_mq_quiesce_queue(tmp);
 	}
 
+	while (atomic_cmpxchg(&set->iter_usage_counter, 1, 0) != 1);
+
 	blk_mq_sched_free_requests(q);
 	__elevator_exit(q, e);
 
+	atomic_set(&set->iter_usage_counter, 1);
+
 	list_for_each_entry(tmp, &set->tag_list, tag_set_list) {
 		if (tmp == q)
 			continue;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2c473c9b8990..30a21335767b 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -263,6 +263,7 @@ struct blk_mq_tag_set {
 
 	struct mutex tag_list_lock;
 	struct list_head tag_list;
+	atomic_t iter_usage_counter;
 };
 
 /**
All queues associated with a tagset are frozen when one queue is exiting
an elevator. This is to ensure that one queue running
blk_mq_queue_tag_busy_iter() cannot hold a stale request reference for
the queue that is exiting the elevator.

However, there is nothing to stop blk_mq_all_tag_iter() being run for
the tagset and, again, getting hold of a stale request reference. A
kasan UAF can be triggered in this scenario:

BUG: KASAN: use-after-free in bt_tags_iter+0xe0/0x128
Read of size 4 at addr ffff001085330fcc by task more/3038

CPU: 1 PID: 3038 Comm: more Not tainted 5.12.0-rc1-11926-g7359e4a1604d-dirty #750
Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon D05 IT21 Nemo 2.0 RC0 04/18/2018
Call trace:
 dump_backtrace+0x0/0x2d0
 show_stack+0x18/0x68
 dump_stack+0x100/0x16c
 print_address_description.constprop.13+0x68/0x30c
 kasan_report+0x1d8/0x240
 __asan_load4+0x9c/0xd8
 bt_tags_iter+0xe0/0x128
 __blk_mq_all_tag_iter+0x320/0x3a8
 blk_mq_tagset_busy_iter+0x84/0xb8
 scsi_host_busy+0x88/0xb8
 show_host_busy+0x1c/0x48
 dev_attr_show+0x44/0x90
 sysfs_kf_seq_show+0x128/0x1c8
 kernfs_seq_show+0xa0/0xb8
 seq_read_iter+0x210/0x660
 kernfs_fop_read_iter+0x208/0x2b0
 new_sync_read+0x1ec/0x2d0
 vfs_read+0x188/0x248
 ksys_read+0xc8/0x178
 __arm64_sys_read+0x44/0x58
 el0_svc_common.constprop.1+0xc4/0x190
 do_el0_svc+0x90/0xa0
 el0_svc+0x24/0x38
 el0_sync_handler+0x90/0xb8
 el0_sync+0x154/0x180

To avoid this, reject the tagset iterators when the queue is exiting the
elevator. This should not break any semantics in blk_mq_all_tag_iter(),
as, since all queues are frozen, there should be no active tags to
iterate.

Signed-off-by: John Garry <john.garry@huawei.com>
---
 block/blk-mq-tag.c     | 5 +++++
 block/blk-mq.c         | 1 +
 block/blk.h            | 4 ++++
 include/linux/blk-mq.h | 1 +
 4 files changed, 11 insertions(+)