Message ID | 1614957294-188540-3-git-send-email-john.garry@huawei.com
---|---
State | New, archived
Series | blk-mq: Avoid use-after-free for accessing old requests
On 3/5/21 7:14 AM, John Garry wrote:
> diff --git a/block/blk.h b/block/blk.h
> index 3b53e44b967e..1a948bfd91e4 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -201,10 +201,29 @@ void elv_unregister_queue(struct request_queue *q);
>  static inline void elevator_exit(struct request_queue *q,
>  		struct elevator_queue *e)
>  {
> +	struct blk_mq_tag_set *set = q->tag_set;
> +	struct request_queue *tmp;
> +
>  	lockdep_assert_held(&q->sysfs_lock);
>
> +	mutex_lock(&set->tag_list_lock);
> +	list_for_each_entry(tmp, &set->tag_list, tag_set_list) {
> +		if (tmp == q)
> +			continue;
> +		blk_mq_freeze_queue(tmp);
> +		blk_mq_quiesce_queue(tmp);
> +	}
> +
>  	blk_mq_sched_free_requests(q);
>  	__elevator_exit(q, e);
> +
> +	list_for_each_entry(tmp, &set->tag_list, tag_set_list) {
> +		if (tmp == q)
> +			continue;
> +		blk_mq_unquiesce_queue(tmp);
> +		blk_mq_unfreeze_queue(tmp);
> +	}
> +	mutex_unlock(&set->tag_list_lock);
>  }

This patch introduces nesting of tag_list_lock inside sysfs_lock. The
latter is per request queue while the former can be shared across
multiple request queues. Has it been analyzed whether this is safe?

Thanks,

Bart.
On 06/03/2021 04:32, Bart Van Assche wrote:
> On 3/5/21 7:14 AM, John Garry wrote:
>> diff --git a/block/blk.h b/block/blk.h
>> index 3b53e44b967e..1a948bfd91e4 100644
>> --- a/block/blk.h
>> +++ b/block/blk.h
>> @@ -201,10 +201,29 @@ void elv_unregister_queue(struct request_queue *q);
>>  static inline void elevator_exit(struct request_queue *q,
>>  		struct elevator_queue *e)
>>  {
>> +	struct blk_mq_tag_set *set = q->tag_set;
>> +	struct request_queue *tmp;
>> +
>>  	lockdep_assert_held(&q->sysfs_lock);
>>
>> +	mutex_lock(&set->tag_list_lock);
>> +	list_for_each_entry(tmp, &set->tag_list, tag_set_list) {
>> +		if (tmp == q)
>> +			continue;
>> +		blk_mq_freeze_queue(tmp);
>> +		blk_mq_quiesce_queue(tmp);
>> +	}
>> +
>>  	blk_mq_sched_free_requests(q);
>>  	__elevator_exit(q, e);
>> +
>> +	list_for_each_entry(tmp, &set->tag_list, tag_set_list) {
>> +		if (tmp == q)
>> +			continue;
>> +		blk_mq_unquiesce_queue(tmp);
>> +		blk_mq_unfreeze_queue(tmp);
>> +	}
>> +	mutex_unlock(&set->tag_list_lock);
>>  }

Hi Bart,

> This patch introduces nesting of tag_list_lock inside sysfs_lock. The
> latter is per request queue while the former can be shared across
> multiple request queues. Has it been analyzed whether this is safe?

Firstly - ignoring implementation details for a moment - this patch is to
check whether the concept, as per your suggestion, is sound.

As for the nested locking, I can analyze it more, but I did assume that we
don't care about locking out sysfs intervention during this time. And it
seems pretty difficult to avoid nesting the locks.

Further to this, I see that
https://lore.kernel.org/linux-block/3aa5407c-0800-2482-597b-4264781a7eac@grimberg.me/T/#mc3e3175642660578c0ae2a6c32185b1e34ec4b8a
has a new interface for tagset quiesce, which could make this process more
efficient.

Please let me know further thoughts.

Thanks,
John
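[Editorial sketch for context on that last point: a tag-set-wide quiesce would
let elevator_exit() pay for a single grace period instead of quiescing every
queue in the tag set one by one. The sketch below assumes helpers named
blk_mq_quiesce_tagset()/blk_mq_unquiesce_tagset() along the lines of the
interface proposed in the linked series; those names and the helper itself are
assumptions, not part of the posted patch.]

/*
 * Rough sketch only: the sibling-queue freeze from the original patch is
 * kept, and only the per-queue quiesce loop is replaced by an assumed
 * tag-set-wide quiesce helper.
 */
static inline void elevator_exit(struct request_queue *q,
				 struct elevator_queue *e)
{
	struct blk_mq_tag_set *set = q->tag_set;
	struct request_queue *tmp;

	lockdep_assert_held(&q->sysfs_lock);

	mutex_lock(&set->tag_list_lock);
	list_for_each_entry(tmp, &set->tag_list, tag_set_list) {
		if (tmp == q)
			continue;
		blk_mq_freeze_queue(tmp);
	}
	blk_mq_quiesce_tagset(set);	/* assumed helper: one grace period for the whole set */

	blk_mq_sched_free_requests(q);
	__elevator_exit(q, e);

	blk_mq_unquiesce_tagset(set);	/* assumed helper */
	list_for_each_entry(tmp, &set->tag_list, tag_set_list) {
		if (tmp == q)
			continue;
		blk_mq_unfreeze_queue(tmp);
	}
	mutex_unlock(&set->tag_list_lock);
}

[This only batches the quiesce step; the cost of freezing the sibling queues,
and hence the impact on other LUNs, would remain.]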
On 3/8/21 2:50 AM, John Garry wrote:
> Please let me know further thoughts.
Hi John,
My guess is that it is safe to nest these two locks. I was asking
because I had not found any information about the nesting in the patch
description.
Bart.
On 3/5/21 7:14 AM, John Garry wrote:
> diff --git a/block/blk.h b/block/blk.h
> index 3b53e44b967e..1a948bfd91e4 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -201,10 +201,29 @@ void elv_unregister_queue(struct request_queue *q);
>  static inline void elevator_exit(struct request_queue *q,
>  		struct elevator_queue *e)
>  {
> +	struct blk_mq_tag_set *set = q->tag_set;
> +	struct request_queue *tmp;
> +
>  	lockdep_assert_held(&q->sysfs_lock);
>
> +	mutex_lock(&set->tag_list_lock);
> +	list_for_each_entry(tmp, &set->tag_list, tag_set_list) {
> +		if (tmp == q)
> +			continue;
> +		blk_mq_freeze_queue(tmp);
> +		blk_mq_quiesce_queue(tmp);
> +	}
> +
>  	blk_mq_sched_free_requests(q);
>  	__elevator_exit(q, e);
> +
> +	list_for_each_entry(tmp, &set->tag_list, tag_set_list) {
> +		if (tmp == q)
> +			continue;
> +		blk_mq_unquiesce_queue(tmp);
> +		blk_mq_unfreeze_queue(tmp);
> +	}
> +	mutex_unlock(&set->tag_list_lock);
>  }

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
On Fri, Mar 05, 2021 at 11:14:53PM +0800, John Garry wrote:
> A use-after-free may occur if blk_mq_queue_tag_busy_iter() is run on a
> queue when another queue associated with the same tagset is switching IO
> scheduler:
>
> BUG: KASAN: use-after-free in bt_iter+0xa0/0x120
> Read of size 8 at addr ffff0410285e7e00 by task fio/2302
>
> CPU: 24 PID: 2302 Comm: fio Not tainted 5.12.0-rc1-11925-g29a317e228d9 #747
> Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon D05 IT21 Nemo 2.0 RC0 04/18/2018
> Call trace:
>  dump_backtrace+0x0/0x2d8
>  show_stack+0x18/0x68
>  dump_stack+0x124/0x1a0
>  print_address_description.constprop.13+0x68/0x30c
>  kasan_report+0x1e8/0x258
>  __asan_load8+0x9c/0xd8
>  bt_iter+0xa0/0x120
>  blk_mq_queue_tag_busy_iter+0x348/0x5d8
>  blk_mq_in_flight+0x80/0xb8
>  part_stat_show+0xcc/0x210
>  dev_attr_show+0x44/0x90
>  sysfs_kf_seq_show+0x120/0x1c0
>  kernfs_seq_show+0x9c/0xb8
>  seq_read_iter+0x214/0x668
>  kernfs_fop_read_iter+0x204/0x2c0
>  new_sync_read+0x1ec/0x2d0
>  vfs_read+0x18c/0x248
>  ksys_read+0xc8/0x178
>  __arm64_sys_read+0x44/0x58
>  el0_svc_common.constprop.1+0xc8/0x1a8
>  do_el0_svc+0x90/0xa0
>  el0_svc+0x24/0x38
>  el0_sync_handler+0x90/0xb8
>  el0_sync+0x154/0x180
>
> Indeed, blk_mq_queue_tag_busy_iter() already does take a reference to its
> queue usage counter when called, and the queue cannot be frozen to switch
> IO scheduler until all refs are dropped. This ensures no stale references
> to IO scheduler requests will be seen by blk_mq_queue_tag_busy_iter().
>
> However, there is nothing to stop blk_mq_queue_tag_busy_iter() being
> run for another queue associated with the same tagset, and it seeing
> a stale IO scheduler request from the other queue after they are freed.
>
> To stop this happening, freeze and quiesce all queues associated with the
> tagset as the elevator is exited.

I think this way can't be accepted since switching one queue's scheduler
is nothing to do with other request queues attached to same HBA.

This patch will cause performance regression because userspace may
switch scheduler according to medium or workloads, at that time other
LUNs will be affected by this patch.
On 11/03/2021 00:58, Ming Lei wrote:
>> Indeed, blk_mq_queue_tag_busy_iter() already does take a reference to its
>> queue usage counter when called, and the queue cannot be frozen to switch
>> IO scheduler until all refs are dropped. This ensures no stale references
>> to IO scheduler requests will be seen by blk_mq_queue_tag_busy_iter().
>>
>> However, there is nothing to stop blk_mq_queue_tag_busy_iter() being
>> run for another queue associated with the same tagset, and it seeing
>> a stale IO scheduler request from the other queue after they are freed.
>>
>> To stop this happening, freeze and quiesce all queues associated with the
>> tagset as the elevator is exited.
> I think this way can't be accepted since switching one queue's scheduler
> is nothing to do with other request queues attached to same HBA.
>
> This patch will cause performance regression because userspace may
> switch scheduler according to medium or workloads, at that time other
> LUNs will be affected by this patch.

Hmmm..that was my concern also. Do you think that it may cause a big
impact? Depends totally on the workload, I suppose.

FWIW, it is useful though for solving both iterator problems.

Thanks,
John
On 3/11/21 12:21 AM, John Garry wrote:
> On 11/03/2021 00:58, Ming Lei wrote:
>> I think this way can't be accepted since switching one queue's scheduler
>> is nothing to do with other request queues attached to same HBA.
>>
>> This patch will cause performance regression because userspace may
>> switch scheduler according to medium or workloads, at that time other
>> LUNs will be affected by this patch.
>
> Hmmm..that was my concern also. Do you think that it may cause a big
> impact? Depends totally on the workload, I suppose.
>
> FWIW, it is useful though for solving both iterator problems.

Hi John,

How about replacing the entire patch series with the patch below? The
patch below has the following advantages:
- Instead of making the race window smaller, the race is fixed
  completely.
- No new atomic instructions are added to the block layer code.
- No early return is inserted in blk_mq_tagset_busy_iter().

Thanks,

Bart.

From a0e534012a766bd6e53cdd466eec0a811164c12a Mon Sep 17 00:00:00 2001
From: Bart Van Assche <bvanassche@acm.org>
Date: Wed, 10 Mar 2021 19:11:47 -0800
Subject: [PATCH] blk-mq: Fix races between iterating over requests and freeing requests

Multiple users have reported use-after-free complaints similar to the
following (see also
https://lore.kernel.org/linux-block/1545261885.185366.488.camel@acm.org/):

BUG: KASAN: use-after-free in bt_iter+0x86/0xf0
Read of size 8 at addr ffff88803b335240 by task fio/21412

CPU: 0 PID: 21412 Comm: fio Tainted: G W 4.20.0-rc6-dbg+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Call Trace:
 dump_stack+0x86/0xca
 print_address_description+0x71/0x239
 kasan_report.cold.5+0x242/0x301
 __asan_load8+0x54/0x90
 bt_iter+0x86/0xf0
 blk_mq_queue_tag_busy_iter+0x373/0x5e0
 blk_mq_in_flight+0x96/0xb0
 part_in_flight+0x40/0x140
 part_round_stats+0x18e/0x370
 blk_account_io_start+0x3d7/0x670
 blk_mq_bio_to_request+0x19c/0x3a0
 blk_mq_make_request+0x7a9/0xcb0
 generic_make_request+0x41d/0x960
 submit_bio+0x9b/0x250
 do_blockdev_direct_IO+0x435c/0x4c70
 __blockdev_direct_IO+0x79/0x88
 ext4_direct_IO+0x46c/0xc00
 generic_file_direct_write+0x119/0x210
 __generic_file_write_iter+0x11c/0x280
 ext4_file_write_iter+0x1b8/0x6f0
 aio_write+0x204/0x310
 io_submit_one+0x9d3/0xe80
 __x64_sys_io_submit+0x115/0x340
 do_syscall_64+0x71/0x210

When multiple request queues share a tag set and when switching the I/O
scheduler for one of the request queues that uses this tag set, the
following race can happen:
* blk_mq_tagset_busy_iter() calls bt_tags_iter() and bt_tags_iter() assigns
  a pointer to a scheduler request to the local variable 'rq'.
* blk_mq_sched_free_requests() is called to free hctx->sched_tags.
* blk_mq_tagset_busy_iter() dereferences 'rq' and triggers a use-after-free.

Fix this race as follows:
* Use rcu_assign_pointer() and rcu_dereference() to access hctx->tags->rqs[].
* Protect hctx->tags->rqs[] reads with an RCU read-side lock.
* No new rcu_barrier() call has been added because clearing the request
  pointer in hctx->tags->rqs[] happens before blk_queue_exit() and the
  blk_freeze_queue() call in blk_cleanup_queue() triggers an RCU barrier
  after all scheduler request pointers associated with a request queue
  have been removed from hctx->tags->rqs[] and before these scheduler
  requests are freed.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-mq-tag.c | 27 +++++++++++++++++----------
 block/blk-mq-tag.h |  2 +-
 block/blk-mq.c     | 10 ++++++----
 block/blk-mq.h     |  1 +
 4 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 9c92053e704d..8351c3f2fe2d 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -206,18 +206,23 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 	struct blk_mq_tags *tags = hctx->tags;
 	bool reserved = iter_data->reserved;
 	struct request *rq;
+	bool res = true;
 
 	if (!reserved)
 		bitnr += tags->nr_reserved_tags;
-	rq = tags->rqs[bitnr];
+
+	rcu_read_lock();
+	rq = rcu_dereference(tags->rqs[bitnr]);
 
 	/*
 	 * We can hit rq == NULL here, because the tagging functions
 	 * test and set the bit before assigning ->rqs[].
 	 */
 	if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx)
-		return iter_data->fn(hctx, rq, iter_data->data, reserved);
-	return true;
+		res = iter_data->fn(hctx, rq, iter_data->data, reserved);
+	rcu_read_unlock();
+
+	return res;
 }
 
 /**
@@ -264,10 +269,12 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 	struct blk_mq_tags *tags = iter_data->tags;
 	bool reserved = iter_data->flags & BT_TAG_ITER_RESERVED;
 	struct request *rq;
+	bool res = true;
 
 	if (!reserved)
 		bitnr += tags->nr_reserved_tags;
 
+	rcu_read_lock();
 	/*
 	 * We can hit rq == NULL here, because the tagging functions
 	 * test and set the bit before assigning ->rqs[].
@@ -275,13 +282,13 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 	if (iter_data->flags & BT_TAG_ITER_STATIC_RQS)
 		rq = tags->static_rqs[bitnr];
 	else
-		rq = tags->rqs[bitnr];
-	if (!rq)
-		return true;
-	if ((iter_data->flags & BT_TAG_ITER_STARTED) &&
-	    !blk_mq_request_started(rq))
-		return true;
-	return iter_data->fn(rq, iter_data->data, reserved);
+		rq = rcu_dereference(tags->rqs[bitnr]);
+	if (rq && (!(iter_data->flags & BT_TAG_ITER_STARTED) ||
+		   blk_mq_request_started(rq)))
+		res = iter_data->fn(rq, iter_data->data, reserved);
+	rcu_read_unlock();
+
+	return res;
 }
 
 /**
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 7d3e6b333a4a..7a6d04733261 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -17,7 +17,7 @@ struct blk_mq_tags {
 	struct sbitmap_queue __bitmap_tags;
 	struct sbitmap_queue __breserved_tags;
 
-	struct request **rqs;
+	struct request __rcu **rqs;
 	struct request **static_rqs;
 	struct list_head page_list;
 };
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d4d7c1caa439..594bf7f4ed9a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -495,8 +495,10 @@ static void __blk_mq_free_request(struct request *rq)
 	blk_crypto_free_request(rq);
 	blk_pm_mark_last_busy(rq);
 	rq->mq_hctx = NULL;
-	if (rq->tag != BLK_MQ_NO_TAG)
+	if (rq->tag != BLK_MQ_NO_TAG) {
 		blk_mq_put_tag(hctx->tags, ctx, rq->tag);
+		rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);
+	}
 	if (sched_tag != BLK_MQ_NO_TAG)
 		blk_mq_put_tag(hctx->sched_tags, ctx, sched_tag);
 	blk_mq_sched_restart(hctx);
@@ -839,8 +841,8 @@ EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list);
 struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
 {
 	if (tag < tags->nr_tags) {
-		prefetch(tags->rqs[tag]);
-		return tags->rqs[tag];
+		prefetch((__force void *)tags->rqs[tag]);
+		return rcu_dereference_check(tags->rqs[tag], true);
 	}
 
 	return NULL;
@@ -1111,7 +1113,7 @@ static bool blk_mq_get_driver_tag(struct request *rq)
 		rq->rq_flags |= RQF_MQ_INFLIGHT;
 		__blk_mq_inc_active_requests(hctx);
 	}
-	hctx->tags->rqs[rq->tag] = rq;
+	rcu_assign_pointer(hctx->tags->rqs[rq->tag], rq);
 	return true;
 }
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 3616453ca28c..9ccb1818303b 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -226,6 +226,7 @@ static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
 					  struct request *rq)
 {
 	blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag);
+	rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);
 	rq->tag = BLK_MQ_NO_TAG;
 
 	if (rq->rq_flags & RQF_MQ_INFLIGHT) {
Hi Bart,

I'll have a look at this ASAP - a bit busy.

But a quick scan and I notice this:

> @@ -226,6 +226,7 @@ static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
>  					  struct request *rq)
>  {
>  	blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag);
> +	rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);

Wasn't a requirement to not touch the fastpath at all, including even if
only NULLifying a pointer?

IIRC, Kashyap some time ago had a patch like above (but without RCU
usage), but the request from Jens was to not touch the fastpath. Maybe
I'm mistaken - I will try to dig up the thread.

Thanks,
John

> How about replacing the entire patch series with the patch below? The
> patch below has the following advantages:
> - Instead of making the race window smaller, the race is fixed
>   completely.
> - No new atomic instructions are added to the block layer code.
> - No early return is inserted in blk_mq_tagset_busy_iter().
>
> Thanks,
>
> Bart.
>
> From a0e534012a766bd6e53cdd466eec0a811164c12a Mon Sep 17 00:00:00 2001
> From: Bart Van Assche <bvanassche@acm.org>
> Date: Wed, 10 Mar 2021 19:11:47 -0800
> Subject: [PATCH] blk-mq: Fix races between iterating over requests and freeing
>  requests
>
> Multiple users have reported use-after-free complaints similar to the
> following (see also
> https://lore.kernel.org/linux-block/1545261885.185366.488.camel@acm.org/):
>
> BUG: KASAN: use-after-free in bt_iter+0x86/0xf0
> Read of size 8 at addr ffff88803b335240 by task fio/21412
>
> CPU: 0 PID: 21412 Comm: fio Tainted: G W 4.20.0-rc6-dbg+ #3
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> Call Trace:
>  dump_stack+0x86/0xca
>  print_address_description+0x71/0x239
>  kasan_report.cold.5+0x242/0x301
>  __asan_load8+0x54/0x90
>  bt_iter+0x86/0xf0
>  blk_mq_queue_tag_busy_iter+0x373/0x5e0
>  blk_mq_in_flight+0x96/0xb0
>  part_in_flight+0x40/0x140
>  part_round_stats+0x18e/0x370
>  blk_account_io_start+0x3d7/0x670
>  blk_mq_bio_to_request+0x19c/0x3a0
>  blk_mq_make_request+0x7a9/0xcb0
>  generic_make_request+0x41d/0x960
>  submit_bio+0x9b/0x250
>  do_blockdev_direct_IO+0x435c/0x4c70
>  __blockdev_direct_IO+0x79/0x88
>  ext4_direct_IO+0x46c/0xc00
>  generic_file_direct_write+0x119/0x210
>  __generic_file_write_iter+0x11c/0x280
>  ext4_file_write_iter+0x1b8/0x6f0
>  aio_write+0x204/0x310
>  io_submit_one+0x9d3/0xe80
>  __x64_sys_io_submit+0x115/0x340
>  do_syscall_64+0x71/0x210
>
> When multiple request queues share a tag set and when switching the I/O
> scheduler for one of the request queues that uses this tag set, the
> following race can happen:
> * blk_mq_tagset_busy_iter() calls bt_tags_iter() and bt_tags_iter() assigns
>   a pointer to a scheduler request to the local variable 'rq'.
> * blk_mq_sched_free_requests() is called to free hctx->sched_tags.
> * blk_mq_tagset_busy_iter() dereferences 'rq' and triggers a use-after-free.
>
> Fix this race as follows:
> * Use rcu_assign_pointer() and rcu_dereference() to access hctx->tags->rqs[].
> * Protect hctx->tags->rqs[] reads with an RCU read-side lock.
> * No new rcu_barrier() call has been added because clearing the request
>   pointer in hctx->tags->rqs[] happens before blk_queue_exit() and the
>   blk_freeze_queue() call in blk_cleanup_queue() triggers an RCU barrier
>   after all scheduler request pointers associated with a request queue
>   have been removed from hctx->tags->rqs[] and before these scheduler
>   requests are freed.
>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/blk-mq-tag.c | 27 +++++++++++++++++----------
>  block/blk-mq-tag.h |  2 +-
>  block/blk-mq.c     | 10 ++++++----
>  block/blk-mq.h     |  1 +
>  4 files changed, 25 insertions(+), 15 deletions(-)
>
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 9c92053e704d..8351c3f2fe2d 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -206,18 +206,23 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>  	struct blk_mq_tags *tags = hctx->tags;
>  	bool reserved = iter_data->reserved;
>  	struct request *rq;
> +	bool res = true;
>
>  	if (!reserved)
>  		bitnr += tags->nr_reserved_tags;
> -	rq = tags->rqs[bitnr];
> +
> +	rcu_read_lock();
> +	rq = rcu_dereference(tags->rqs[bitnr]);
>
>  	/*
>  	 * We can hit rq == NULL here, because the tagging functions
>  	 * test and set the bit before assigning ->rqs[].
>  	 */
>  	if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx)
> -		return iter_data->fn(hctx, rq, iter_data->data, reserved);
> -	return true;
> +		res = iter_data->fn(hctx, rq, iter_data->data, reserved);
> +	rcu_read_unlock();
> +
> +	return res;
>  }
>
>  /**
> @@ -264,10 +269,12 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>  	struct blk_mq_tags *tags = iter_data->tags;
>  	bool reserved = iter_data->flags & BT_TAG_ITER_RESERVED;
>  	struct request *rq;
> +	bool res = true;
>
>  	if (!reserved)
>  		bitnr += tags->nr_reserved_tags;
>
> +	rcu_read_lock();
>  	/*
>  	 * We can hit rq == NULL here, because the tagging functions
>  	 * test and set the bit before assigning ->rqs[].
> @@ -275,13 +282,13 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>  	if (iter_data->flags & BT_TAG_ITER_STATIC_RQS)
>  		rq = tags->static_rqs[bitnr];
>  	else
> -		rq = tags->rqs[bitnr];
> -	if (!rq)
> -		return true;
> -	if ((iter_data->flags & BT_TAG_ITER_STARTED) &&
> -	    !blk_mq_request_started(rq))
> -		return true;
> -	return iter_data->fn(rq, iter_data->data, reserved);
> +		rq = rcu_dereference(tags->rqs[bitnr]);
> +	if (rq && (!(iter_data->flags & BT_TAG_ITER_STARTED) ||
> +		   blk_mq_request_started(rq)))
> +		res = iter_data->fn(rq, iter_data->data, reserved);
> +	rcu_read_unlock();
> +
> +	return res;
>  }
>
>  /**
> diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
> index 7d3e6b333a4a..7a6d04733261 100644
> --- a/block/blk-mq-tag.h
> +++ b/block/blk-mq-tag.h
> @@ -17,7 +17,7 @@ struct blk_mq_tags {
>  	struct sbitmap_queue __bitmap_tags;
>  	struct sbitmap_queue __breserved_tags;
>
> -	struct request **rqs;
> +	struct request __rcu **rqs;
>  	struct request **static_rqs;
>  	struct list_head page_list;
>  };
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index d4d7c1caa439..594bf7f4ed9a 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -495,8 +495,10 @@ static void __blk_mq_free_request(struct request *rq)
>  	blk_crypto_free_request(rq);
>  	blk_pm_mark_last_busy(rq);
>  	rq->mq_hctx = NULL;
> -	if (rq->tag != BLK_MQ_NO_TAG)
> +	if (rq->tag != BLK_MQ_NO_TAG) {
>  		blk_mq_put_tag(hctx->tags, ctx, rq->tag);
> +		rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);
> +	}
>  	if (sched_tag != BLK_MQ_NO_TAG)
>  		blk_mq_put_tag(hctx->sched_tags, ctx, sched_tag);
>  	blk_mq_sched_restart(hctx);
> @@ -839,8 +841,8 @@ EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list);
>  struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
>  {
>  	if (tag < tags->nr_tags) {
> -		prefetch(tags->rqs[tag]);
> -		return tags->rqs[tag];
> +		prefetch((__force void *)tags->rqs[tag]);
> +		return rcu_dereference_check(tags->rqs[tag], true);
>  	}
>
>  	return NULL;
> @@ -1111,7 +1113,7 @@ static bool blk_mq_get_driver_tag(struct request *rq)
>  		rq->rq_flags |= RQF_MQ_INFLIGHT;
>  		__blk_mq_inc_active_requests(hctx);
>  	}
> -	hctx->tags->rqs[rq->tag] = rq;
> +	rcu_assign_pointer(hctx->tags->rqs[rq->tag], rq);
>  	return true;
>  }
>
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 3616453ca28c..9ccb1818303b 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -226,6 +226,7 @@ static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
>  					  struct request *rq)
>  {
>  	blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag);
> +	rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);
>  	rq->tag = BLK_MQ_NO_TAG;
>
>  	if (rq->rq_flags & RQF_MQ_INFLIGHT) {
On 3/16/21 9:15 AM, John Garry wrote:
> I'll have a look at this ASAP - a bit busy.
>
> But a quick scan and I notice this:
>
> > @@ -226,6 +226,7 @@ static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
> >  					  struct request *rq)
> >  {
> >  	blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag);
> > +	rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);
>
> Wasn't a requirement to not touch the fastpath at all, including even if
> only NULLifying a pointer?
>
> IIRC, Kashyap some time ago had a patch like above (but without RCU
> usage), but the request from Jens was to not touch the fastpath.
>
> Maybe I'm mistaken - I will try to dig up the thread.

Hi John,

I agree that Jens asked at the end of 2018 not to touch the fast path to
fix this use-after-free (maybe that request has been repeated more
recently). If Jens or anyone else feels strongly about not clearing
hctx->tags->rqs[rq->tag] from the fast path then I will make that change.

My motivation for clearing these pointers from the fast path is as
follows:
- This results in code that is easier to read and easier to maintain.
- Every modern CPU pipelines store instructions so the performance
  impact of adding an additional store should be small.
- Since the block layer has a tendency to reuse tags that have been
  freed recently, it is likely that hctx->tags->rqs[rq->tag] will be used
  for a next request and hence that it will have to be loaded into the CPU
  cache anyway.

Bart.
On 16/03/2021 17:00, Bart Van Assche wrote:
> On 3/16/21 9:15 AM, John Garry wrote:
>> I'll have a look at this ASAP - a bit busy.
>>
>> But a quick scan and I notice this:
>>
>> > @@ -226,6 +226,7 @@ static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
>> >  					  struct request *rq)
>> >  {
>> >  	blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag);
>> > +	rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);
>>
>> Wasn't a requirement to not touch the fastpath at all, including even
>> if only NULLifying a pointer?
>>
>> IIRC, Kashyap some time ago had a patch like above (but without RCU
>> usage), but the request from Jens was to not touch the fastpath.
>>
>> Maybe I'm mistaken - I will try to dig up the thread.

Hi Bart,

> I agree that Jens asked at the end of 2018 not to touch the fast path to
> fix this use-after-free (maybe that request has been repeated more
> recently). If Jens or anyone else feels strongly about not clearing
> hctx->tags->rqs[rq->tag] from the fast path then I will make that
> change.

Is that possible for this same approach? I need to check the code more..

And don't we still have the problem that some iter callbacks may
sleep/block, which is not allowed in an RCU read-side critical section?

> My motivation for clearing these pointers from the fast path is
> as follows:
> - This results in code that is easier to read and easier to maintain.
> - Every modern CPU pipelines store instructions so the performance
>   impact of adding an additional store should be small.
> - Since the block layer has a tendency to reuse tags that have been
>   freed recently, it is likely that hctx->tags->rqs[rq->tag] will be used
>   for a next request and hence that it will have to be loaded into the CPU
>   cache anyway.

Those points make sense to me, but obviously it's the maintainers' call.

Thanks,
John
On 3/16/21 10:43 AM, John Garry wrote:
> On 16/03/2021 17:00, Bart Van Assche wrote:
>> I agree that Jens asked at the end of 2018 not to touch the fast path
>> to fix this use-after-free (maybe that request has been repeated more
>> recently). If Jens or anyone else feels strongly about not clearing
>> hctx->tags->rqs[rq->tag] from the fast path then I will make that change.
>
> Is that possible for this same approach? I need to check the code more..

If the fast path should not be modified, I'm considering to borrow patch
1/3 from your patch series and to add an rcu_barrier() between the code
that clears the request pointers and that frees the scheduler requests.

> And don't we still have the problem that some iter callbacks may
> sleep/block, which is not allowed in an RCU read-side critical section?

Thanks for having brought this up. Since none of the functions that
iterate over requests should be called from the hot path of a block
driver, I think that we can use srcu_read_(un|)lock() inside bt_iter()
and bt_tags_iter() instead of rcu_read_(un|)lock().

Bart.
On 16/03/2021 19:59, Bart Van Assche wrote:
> On 3/16/21 10:43 AM, John Garry wrote:
>> On 16/03/2021 17:00, Bart Van Assche wrote:
>>> I agree that Jens asked at the end of 2018 not to touch the fast path
>>> to fix this use-after-free (maybe that request has been repeated more
>>> recently). If Jens or anyone else feels strongly about not clearing
>>> hctx->tags->rqs[rq->tag] from the fast path then I will make that change.

Hi Bart,

>> Is that possible for this same approach? I need to check the code more..
> If the fast path should not be modified, I'm considering to borrow patch
> 1/3 from your patch series

Fine

> and to add an rcu_barrier() between the code
> that clears the request pointers and that frees the scheduler requests.
>
>> And don't we still have the problem that some iter callbacks may
>> sleep/block, which is not allowed in an RCU read-side critical section?
> Thanks for having brought this up. Since none of the functions that
> iterate over requests should be called from the hot path of a block
> driver, I think that we can use srcu_read_(un|)lock() inside bt_iter()
> and bt_tags_iter() instead of rcu_read_(un|)lock().

OK, but TBH, I am not so familiar with srcu - were you going to try this?

Thanks,
John
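[Editorial sketch for background on that question: SRCU (sleepable RCU)
allows readers to block inside the read-side critical section, at the cost
of an explicit srcu_struct and a read-lock index. The sketch below shows
what an SRCU-protected lookup in an iterator could look like; the
sketch_tags structure, the iter_srcu member, the helper names and the
callback signature are illustrative assumptions, not taken from any posted
patch.]

#include <linux/srcu.h>

struct request;

/* Illustrative only: a tags structure whose rqs[] readers are SRCU protected. */
struct sketch_tags {
	struct request __rcu **rqs;
	struct srcu_struct iter_srcu;	/* assumed member, protects rqs[] readers */
};

/* Reader side: the iteration callback is allowed to sleep under SRCU. */
static bool sketch_bt_iter(struct sketch_tags *tags, unsigned int bitnr,
			   bool (*fn)(struct request *rq, void *data), void *data)
{
	struct request *rq;
	bool res = true;
	int idx;

	idx = srcu_read_lock(&tags->iter_srcu);
	rq = srcu_dereference(tags->rqs[bitnr], &tags->iter_srcu);
	if (rq)
		res = fn(rq, data);		/* fn() may block here */
	srcu_read_unlock(&tags->iter_srcu, idx);

	return res;
}

/*
 * Writer side: after the rqs[] entries pointing at scheduler requests have
 * been cleared, wait for all in-flight readers before freeing those requests.
 */
static void sketch_wait_for_iterators(struct sketch_tags *tags)
{
	synchronize_srcu(&tags->iter_srcu);
}

[In real code the srcu_struct would also need init_srcu_struct() and
cleanup_srcu_struct() calls tied to the tag set's lifetime.]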
On 3/19/21 11:19 AM, John Garry wrote:
> OK, but TBH, I am not so familiar with srcu - were you going to try this?
Hi John,
Have you received the following patch: "[PATCH] blk-mq: Fix races
between iterating over requests and freeing requests"
(https://lore.kernel.org/linux-block/20210319010009.10041-1-bvanassche@acm.org/)?
Thanks,
Bart.
diff --git a/block/blk.h b/block/blk.h
index 3b53e44b967e..1a948bfd91e4 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -201,10 +201,29 @@ void elv_unregister_queue(struct request_queue *q);
 static inline void elevator_exit(struct request_queue *q,
 		struct elevator_queue *e)
 {
+	struct blk_mq_tag_set *set = q->tag_set;
+	struct request_queue *tmp;
+
 	lockdep_assert_held(&q->sysfs_lock);
 
+	mutex_lock(&set->tag_list_lock);
+	list_for_each_entry(tmp, &set->tag_list, tag_set_list) {
+		if (tmp == q)
+			continue;
+		blk_mq_freeze_queue(tmp);
+		blk_mq_quiesce_queue(tmp);
+	}
+
 	blk_mq_sched_free_requests(q);
 	__elevator_exit(q, e);
+
+	list_for_each_entry(tmp, &set->tag_list, tag_set_list) {
+		if (tmp == q)
+			continue;
+		blk_mq_unquiesce_queue(tmp);
+		blk_mq_unfreeze_queue(tmp);
+	}
+	mutex_unlock(&set->tag_list_lock);
 }
 
 ssize_t part_size_show(struct device *dev, struct device_attribute *attr,
A use-after-free may occur if blk_mq_queue_tag_busy_iter() is run on a
queue when another queue associated with the same tagset is switching IO
scheduler:

BUG: KASAN: use-after-free in bt_iter+0xa0/0x120
Read of size 8 at addr ffff0410285e7e00 by task fio/2302

CPU: 24 PID: 2302 Comm: fio Not tainted 5.12.0-rc1-11925-g29a317e228d9 #747
Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon D05 IT21 Nemo 2.0 RC0 04/18/2018
Call trace:
 dump_backtrace+0x0/0x2d8
 show_stack+0x18/0x68
 dump_stack+0x124/0x1a0
 print_address_description.constprop.13+0x68/0x30c
 kasan_report+0x1e8/0x258
 __asan_load8+0x9c/0xd8
 bt_iter+0xa0/0x120
 blk_mq_queue_tag_busy_iter+0x348/0x5d8
 blk_mq_in_flight+0x80/0xb8
 part_stat_show+0xcc/0x210
 dev_attr_show+0x44/0x90
 sysfs_kf_seq_show+0x120/0x1c0
 kernfs_seq_show+0x9c/0xb8
 seq_read_iter+0x214/0x668
 kernfs_fop_read_iter+0x204/0x2c0
 new_sync_read+0x1ec/0x2d0
 vfs_read+0x18c/0x248
 ksys_read+0xc8/0x178
 __arm64_sys_read+0x44/0x58
 el0_svc_common.constprop.1+0xc8/0x1a8
 do_el0_svc+0x90/0xa0
 el0_svc+0x24/0x38
 el0_sync_handler+0x90/0xb8
 el0_sync+0x154/0x180

Indeed, blk_mq_queue_tag_busy_iter() already does take a reference to its
queue usage counter when called, and the queue cannot be frozen to switch
IO scheduler until all refs are dropped. This ensures no stale references
to IO scheduler requests will be seen by blk_mq_queue_tag_busy_iter().

However, there is nothing to stop blk_mq_queue_tag_busy_iter() being
run for another queue associated with the same tagset, and it seeing
a stale IO scheduler request from the other queue after they are freed.

To stop this happening, freeze and quiesce all queues associated with the
tagset as the elevator is exited.

Signed-off-by: John Garry <john.garry@huawei.com>
---
I think that this patch is what Bart suggested:
https://lore.kernel.org/linux-block/c0d127a9-9320-6e1c-4e8d-412aa9ea9ca6@acm.org/

 block/blk.h | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)