
[RFC,v3,2/3] blk-mq: Freeze and quiesce all queues for tagset in elevator_exit()

Message ID 1614957294-188540-3-git-send-email-john.garry@huawei.com (mailing list archive)
State New, archived
Series blk-mq: Avoid use-after-free for accessing old requests

Commit Message

John Garry March 5, 2021, 3:14 p.m. UTC
A use-after-free may occur if blk_mq_queue_tag_busy_iter() is run on a
queue when another queue associated with the same tagset is switching IO
scheduler:

BUG: KASAN: use-after-free in bt_iter+0xa0/0x120
Read of size 8 at addr ffff0410285e7e00 by task fio/2302

CPU: 24 PID: 2302 Comm: fio Not tainted 5.12.0-rc1-11925-g29a317e228d9 #747
Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon D05 IT21 Nemo 2.0 RC0 04/18/2018 
 Call trace:
dump_backtrace+0x0/0x2d8 
show_stack+0x18/0x68
dump_stack+0x124/0x1a0
print_address_description.constprop.13+0x68/0x30c
kasan_report+0x1e8/0x258 
__asan_load8+0x9c/0xd8
bt_iter+0xa0/0x120 
blk_mq_queue_tag_busy_iter+0x348/0x5d8
blk_mq_in_flight+0x80/0xb8
part_stat_show+0xcc/0x210
dev_attr_show+0x44/0x90
sysfs_kf_seq_show+0x120/0x1c0
kernfs_seq_show+0x9c/0xb8
seq_read_iter+0x214/0x668
kernfs_fop_read_iter+0x204/0x2c0
new_sync_read+0x1ec/0x2d0
vfs_read+0x18c/0x248
ksys_read+0xc8/0x178
__arm64_sys_read+0x44/0x58
el0_svc_common.constprop.1+0xc8/0x1a8
do_el0_svc+0x90/0xa0
el0_svc+0x24/0x38
el0_sync_handler+0x90/0xb8
el0_sync+0x154/0x180

Indeed, blk_mq_queue_tag_busy_iter() already does take a reference to its
queue usage counter when called, and the queue cannot be frozen to switch
IO scheduler until all refs are dropped. This ensures no stale references
to IO scheduler requests will be seen by blk_mq_queue_tag_busy_iter().

However, there is nothing to stop blk_mq_queue_tag_busy_iter() being
run for another queue associated with the same tagset, and that
iteration seeing stale IO scheduler requests from the queue whose
scheduler is being switched after they have been freed.

To stop this happening, freeze and quiesce all queues associated with the
tagset as the elevator is exited.

Signed-off-by: John Garry <john.garry@huawei.com>
---

I think that this patch is what Bart suggested:
https://lore.kernel.org/linux-block/c0d127a9-9320-6e1c-4e8d-412aa9ea9ca6@acm.org/

 block/blk.h | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)
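
For context, the per-queue protection referred to in the commit message is the
usage-counter guard at the top of blk_mq_queue_tag_busy_iter(). A simplified
paraphrase (the exact code may differ between trees) looks roughly like this:

/*
 * Simplified sketch of blk_mq_queue_tag_busy_iter(): the iteration only
 * pins the queue being iterated, not the other queues sharing the same
 * tagset, which is why freeing another queue's scheduler requests can
 * still race with it.
 */
void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
		void *priv)
{
	struct blk_mq_hw_ctx *hctx;
	int i;

	/* Bail out if the queue is frozen or being torn down. */
	if (!percpu_ref_tryget(&q->q_usage_counter))
		return;

	queue_for_each_hw_ctx(q, hctx, i) {
		/* walk hctx->tags via bt_iter(), reserved tags first */
	}

	blk_queue_exit(q);	/* drops q->q_usage_counter again */
}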

Comments

Bart Van Assche March 6, 2021, 4:32 a.m. UTC | #1
On 3/5/21 7:14 AM, John Garry wrote:
> diff --git a/block/blk.h b/block/blk.h
> index 3b53e44b967e..1a948bfd91e4 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -201,10 +201,29 @@ void elv_unregister_queue(struct request_queue *q);
>  static inline void elevator_exit(struct request_queue *q,
>  		struct elevator_queue *e)
>  {
> +	struct blk_mq_tag_set *set = q->tag_set;
> +	struct request_queue *tmp;
> +
>  	lockdep_assert_held(&q->sysfs_lock);
>  
> +	mutex_lock(&set->tag_list_lock);
> +	list_for_each_entry(tmp, &set->tag_list, tag_set_list) {
> +		if (tmp == q)
> +			continue;
> +		blk_mq_freeze_queue(tmp);
> +		blk_mq_quiesce_queue(tmp);
> +	}
> +
>  	blk_mq_sched_free_requests(q);
>  	__elevator_exit(q, e);
> +
> +	list_for_each_entry(tmp, &set->tag_list, tag_set_list) {
> +		if (tmp == q)
> +			continue;
> +		blk_mq_unquiesce_queue(tmp);
> +		blk_mq_unfreeze_queue(tmp);
> +	}
> +	mutex_unlock(&set->tag_list_lock);
>  }

This patch introduces nesting of tag_list_lock inside sysfs_lock. The
latter is per request queue while the former can be shared across
multiple request queues. Has it been analyzed whether this is safe?
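
For illustration, the AB-BA pattern that the question is probing for would
look like the sketch below; path 2 is hypothetical, and whether any real code
path takes the locks in that order is exactly what would need to be analyzed:

/* Path 1: elevator switch with this patch applied. */
static void elevator_switch_path(struct request_queue *q,
				 struct blk_mq_tag_set *set)
{
	mutex_lock(&q->sysfs_lock);		/* taken by the caller */
	mutex_lock(&set->tag_list_lock);	/* sysfs_lock -> tag_list_lock */
	/* freeze/quiesce the sibling queues, switch the elevator, ... */
	mutex_unlock(&set->tag_list_lock);
	mutex_unlock(&q->sysfs_lock);
}

/*
 * Path 2 (hypothetical): any code path that holds tag_list_lock while
 * taking some queue's sysfs_lock would give lockdep an AB-BA report
 * against path 1.
 */
static void conflicting_path(struct request_queue *other_q,
			     struct blk_mq_tag_set *set)
{
	mutex_lock(&set->tag_list_lock);
	mutex_lock(&other_q->sysfs_lock);	/* tag_list_lock -> sysfs_lock */
	mutex_unlock(&other_q->sysfs_lock);
	mutex_unlock(&set->tag_list_lock);
}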

Thanks,

Bart.
John Garry March 8, 2021, 10:50 a.m. UTC | #2
On 06/03/2021 04:32, Bart Van Assche wrote:
> On 3/5/21 7:14 AM, John Garry wrote:
>> diff --git a/block/blk.h b/block/blk.h
>> index 3b53e44b967e..1a948bfd91e4 100644
>> --- a/block/blk.h
>> +++ b/block/blk.h
>> @@ -201,10 +201,29 @@ void elv_unregister_queue(struct request_queue *q);
>>   static inline void elevator_exit(struct request_queue *q,
>>   		struct elevator_queue *e)
>>   {
>> +	struct blk_mq_tag_set *set = q->tag_set;
>> +	struct request_queue *tmp;
>> +
>>   	lockdep_assert_held(&q->sysfs_lock);
>>   
>> +	mutex_lock(&set->tag_list_lock);
>> +	list_for_each_entry(tmp, &set->tag_list, tag_set_list) {
>> +		if (tmp == q)
>> +			continue;
>> +		blk_mq_freeze_queue(tmp);
>> +		blk_mq_quiesce_queue(tmp);
>> +	}
>> +
>>   	blk_mq_sched_free_requests(q);
>>   	__elevator_exit(q, e);
>> +
>> +	list_for_each_entry(tmp, &set->tag_list, tag_set_list) {
>> +		if (tmp == q)
>> +			continue;
>> +		blk_mq_unquiesce_queue(tmp);
>> +		blk_mq_unfreeze_queue(tmp);
>> +	}
>> +	mutex_unlock(&set->tag_list_lock);
>>   }

Hi Bart,

> This patch introduces nesting of tag_list_lock inside sysfs_lock. The
> latter is per request queue while the former can be shared across
> multiple request queues. Has it been analyzed whether this is safe?

Firstly - ignoring implementation details for a moment - this patch is 
meant to check that the concept is consistent with your suggestion and 
whether the approach is sound.

As for nested locking, I can analyze more, but I did assume that we 
don't care about locking-out sysfs intervention during this time. And it 
seems pretty difficult to avoid nesting the locks.

And further to this, I see that 
https://lore.kernel.org/linux-block/3aa5407c-0800-2482-597b-4264781a7eac@grimberg.me/T/#mc3e3175642660578c0ae2a6c32185b1e34ec4b8a 
has a new interface for tagset quiesce, which could make this process 
more efficient.
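
If helpers along those lines were available (something like
blk_mq_quiesce_tagset()/blk_mq_unquiesce_tagset() as proposed in that thread -
not in mainline at this point, so the following is only a sketch), the quiesce
half of the loop could be batched per tagset:

static inline void elevator_exit(struct request_queue *q,
		struct elevator_queue *e)
{
	struct blk_mq_tag_set *set = q->tag_set;
	struct request_queue *tmp;

	lockdep_assert_held(&q->sysfs_lock);

	mutex_lock(&set->tag_list_lock);
	list_for_each_entry(tmp, &set->tag_list, tag_set_list)
		if (tmp != q)
			blk_mq_freeze_queue(tmp);

	/* One grace period for the whole tagset instead of one per queue. */
	blk_mq_quiesce_tagset(set);

	blk_mq_sched_free_requests(q);
	__elevator_exit(q, e);

	blk_mq_unquiesce_tagset(set);

	list_for_each_entry(tmp, &set->tag_list, tag_set_list)
		if (tmp != q)
			blk_mq_unfreeze_queue(tmp);
	mutex_unlock(&set->tag_list_lock);
}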

Please let me know further thoughts.

Thanks,
John
Bart Van Assche March 8, 2021, 7:35 p.m. UTC | #3
On 3/8/21 2:50 AM, John Garry wrote:
> Please let me know further thoughts.

Hi John,

My guess is that it is safe to nest these two locks. I was asking 
because I had not found any information about the nesting in the patch 
description.

Bart.
Bart Van Assche March 10, 2021, 3:57 p.m. UTC | #4
On 3/5/21 7:14 AM, John Garry wrote:
> diff --git a/block/blk.h b/block/blk.h
> index 3b53e44b967e..1a948bfd91e4 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -201,10 +201,29 @@ void elv_unregister_queue(struct request_queue *q);
>   static inline void elevator_exit(struct request_queue *q,
>   		struct elevator_queue *e)
>   {
> +	struct blk_mq_tag_set *set = q->tag_set;
> +	struct request_queue *tmp;
> +
>   	lockdep_assert_held(&q->sysfs_lock);
>   
> +	mutex_lock(&set->tag_list_lock);
> +	list_for_each_entry(tmp, &set->tag_list, tag_set_list) {
> +		if (tmp == q)
> +			continue;
> +		blk_mq_freeze_queue(tmp);
> +		blk_mq_quiesce_queue(tmp);
> +	}
> +
>   	blk_mq_sched_free_requests(q);
>   	__elevator_exit(q, e);
> +
> +	list_for_each_entry(tmp, &set->tag_list, tag_set_list) {
> +		if (tmp == q)
> +			continue;
> +		blk_mq_unquiesce_queue(tmp);
> +		blk_mq_unfreeze_queue(tmp);
> +	}
> +	mutex_unlock(&set->tag_list_lock);
>   }

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Ming Lei March 11, 2021, 12:58 a.m. UTC | #5
On Fri, Mar 05, 2021 at 11:14:53PM +0800, John Garry wrote:
> A use-after-free may occur if blk_mq_queue_tag_busy_iter() is run on a
> queue when another queue associated with the same tagset is switching IO
> scheduler:
> 
> BUG: KASAN: use-after-free in bt_iter+0xa0/0x120
> Read of size 8 at addr ffff0410285e7e00 by task fio/2302
> 
> CPU: 24 PID: 2302 Comm: fio Not tainted 5.12.0-rc1-11925-g29a317e228d9 #747
> Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon D05 IT21 Nemo 2.0 RC0 04/18/2018 
>  Call trace:
> dump_backtrace+0x0/0x2d8 
> show_stack+0x18/0x68
> dump_stack+0x124/0x1a0
> print_address_description.constprop.13+0x68/0x30c
> kasan_report+0x1e8/0x258 
> __asan_load8+0x9c/0xd8
> bt_iter+0xa0/0x120 
> blk_mq_queue_tag_busy_iter+0x348/0x5d8
> blk_mq_in_flight+0x80/0xb8
> part_stat_show+0xcc/0x210
> dev_attr_show+0x44/0x90
> sysfs_kf_seq_show+0x120/0x1c0
> kernfs_seq_show+0x9c/0xb8
> seq_read_iter+0x214/0x668
> kernfs_fop_read_iter+0x204/0x2c0
> new_sync_read+0x1ec/0x2d0
> vfs_read+0x18c/0x248
> ksys_read+0xc8/0x178
> __arm64_sys_read+0x44/0x58
> el0_svc_common.constprop.1+0xc8/0x1a8
> do_el0_svc+0x90/0xa0
> el0_svc+0x24/0x38
> el0_sync_handler+0x90/0xb8
> el0_sync+0x154/0x180
> 
> Indeed, blk_mq_queue_tag_busy_iter() already does take a reference to its
> queue usage counter when called, and the queue cannot be frozen to switch
> IO scheduler until all refs are dropped. This ensures no stale references
> to IO scheduler requests will be seen by blk_mq_queue_tag_busy_iter().
> 
> However, there is nothing to stop blk_mq_queue_tag_busy_iter() being
> run for another queue associated with the same tagset, and it seeing
> a stale IO scheduler request from the other queue after they are freed.
> 
> To stop this happening, freeze and quiesce all queues associated with the
> tagset as the elevator is exited.

I think this way can't be accepted since switching one queue's scheduler
has nothing to do with the other request queues attached to the same HBA.

This patch will cause a performance regression because userspace may
switch the scheduler according to the medium or workload, and at that
time other LUNs will be affected by this patch.
John Garry March 11, 2021, 8:21 a.m. UTC | #6
On 11/03/2021 00:58, Ming Lei wrote:
>> Indeed, blk_mq_queue_tag_busy_iter() already does take a reference to its
>> queue usage counter when called, and the queue cannot be frozen to switch
>> IO scheduler until all refs are dropped. This ensures no stale references
>> to IO scheduler requests will be seen by blk_mq_queue_tag_busy_iter().
>>
>> However, there is nothing to stop blk_mq_queue_tag_busy_iter() being
>> run for another queue associated with the same tagset, and it seeing
>> a stale IO scheduler request from the other queue after they are freed.
>>
>> To stop this happening, freeze and quiesce all queues associated with the
>> tagset as the elevator is exited.
> I think this way can't be accepted since switching one queue's scheduler
> is nothing to do with other request queues attached to same HBA.
> 
> This patch will cause performance regression because userspace may
> switch scheduler according to medium or workloads, at that time other
> LUNs will be affected by this patch.

Hmmm.. that was my concern also. Do you think that it may cause a big 
impact? Depends totally on the workload, I suppose.

FWIW, it is useful though for solving both iterator problems.

Thanks,
John
Bart Van Assche March 12, 2021, 11:05 p.m. UTC | #7
On 3/11/21 12:21 AM, John Garry wrote:
> On 11/03/2021 00:58, Ming Lei wrote:
>> I think this way can't be accepted since switching one queue's scheduler
>> is nothing to do with other request queues attached to same HBA.
>>
>> This patch will cause performance regression because userspace may
>> switch scheduler according to medium or workloads, at that time other
>> LUNs will be affected by this patch.
> 
> Hmmm..that was my concern also. Do you think that it may cause a big
> impact? Depends totally on the workload, I suppose.
> 
> FWIW, it is useful though for solving both iterator problems.

Hi John,

How about replacing the entire patch series with the patch below? The
patch below has the following advantages:
- Instead of making the race window smaller, the race is fixed
  completely.
- No new atomic instructions are added to the block layer code.
- No early return is inserted in blk_mq_tagset_busy_iter().

Thanks,

Bart.

From a0e534012a766bd6e53cdd466eec0a811164c12a Mon Sep 17 00:00:00 2001
From: Bart Van Assche <bvanassche@acm.org>
Date: Wed, 10 Mar 2021 19:11:47 -0800
Subject: [PATCH] blk-mq: Fix races between iterating over requests and freeing
 requests

Multiple users have reported use-after-free complaints similar to the
following (see also https://lore.kernel.org/linux-block/1545261885.185366.488.camel@acm.org/):

BUG: KASAN: use-after-free in bt_iter+0x86/0xf0
Read of size 8 at addr ffff88803b335240 by task fio/21412

CPU: 0 PID: 21412 Comm: fio Tainted: G        W         4.20.0-rc6-dbg+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Call Trace:
 dump_stack+0x86/0xca
 print_address_description+0x71/0x239
 kasan_report.cold.5+0x242/0x301
 __asan_load8+0x54/0x90
 bt_iter+0x86/0xf0
 blk_mq_queue_tag_busy_iter+0x373/0x5e0
 blk_mq_in_flight+0x96/0xb0
 part_in_flight+0x40/0x140
 part_round_stats+0x18e/0x370
 blk_account_io_start+0x3d7/0x670
 blk_mq_bio_to_request+0x19c/0x3a0
 blk_mq_make_request+0x7a9/0xcb0
 generic_make_request+0x41d/0x960
 submit_bio+0x9b/0x250
 do_blockdev_direct_IO+0x435c/0x4c70
 __blockdev_direct_IO+0x79/0x88
 ext4_direct_IO+0x46c/0xc00
 generic_file_direct_write+0x119/0x210
 __generic_file_write_iter+0x11c/0x280
 ext4_file_write_iter+0x1b8/0x6f0
 aio_write+0x204/0x310
 io_submit_one+0x9d3/0xe80
 __x64_sys_io_submit+0x115/0x340
 do_syscall_64+0x71/0x210

When multiple request queues share a tag set and when switching the I/O
scheduler for one of the request queues that uses this tag set, the
following race can happen:
* blk_mq_tagset_busy_iter() calls bt_tags_iter() and bt_tags_iter() assigns
  a pointer to a scheduler request to the local variable 'rq'.
* blk_mq_sched_free_requests() is called to free hctx->sched_tags.
* blk_mq_tagset_busy_iter() dereferences 'rq' and triggers a use-after-free.

Fix this race as follows:
* Use rcu_assign_pointer() and rcu_dereference() to access hctx->tags->rqs[].
* Protect hctx->tags->rqs[] reads with an RCU read-side lock.
* No new rcu_barrier() call has been added because clearing the request
  pointer in hctx->tags->rqs[] happens before blk_queue_exit() and the
  blk_freeze_queue() call in blk_cleanup_queue() triggers an RCU barrier
 *   after all scheduler request pointers associated with a request queue
  have been removed from hctx->tags->rqs[] and before these scheduler
  requests are freed.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-mq-tag.c | 27 +++++++++++++++++----------
 block/blk-mq-tag.h |  2 +-
 block/blk-mq.c     | 10 ++++++----
 block/blk-mq.h     |  1 +
 4 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 9c92053e704d..8351c3f2fe2d 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -206,18 +206,23 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 	struct blk_mq_tags *tags = hctx->tags;
 	bool reserved = iter_data->reserved;
 	struct request *rq;
+	bool res = true;

 	if (!reserved)
 		bitnr += tags->nr_reserved_tags;
-	rq = tags->rqs[bitnr];
+
+	rcu_read_lock();
+	rq = rcu_dereference(tags->rqs[bitnr]);

 	/*
 	 * We can hit rq == NULL here, because the tagging functions
 	 * test and set the bit before assigning ->rqs[].
 	 */
 	if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx)
-		return iter_data->fn(hctx, rq, iter_data->data, reserved);
-	return true;
+		res = iter_data->fn(hctx, rq, iter_data->data, reserved);
+	rcu_read_unlock();
+
+	return res;
 }

 /**
@@ -264,10 +269,12 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 	struct blk_mq_tags *tags = iter_data->tags;
 	bool reserved = iter_data->flags & BT_TAG_ITER_RESERVED;
 	struct request *rq;
+	bool res = true;

 	if (!reserved)
 		bitnr += tags->nr_reserved_tags;

+	rcu_read_lock();
 	/*
 	 * We can hit rq == NULL here, because the tagging functions
 	 * test and set the bit before assigning ->rqs[].
@@ -275,13 +282,13 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 	if (iter_data->flags & BT_TAG_ITER_STATIC_RQS)
 		rq = tags->static_rqs[bitnr];
 	else
-		rq = tags->rqs[bitnr];
-	if (!rq)
-		return true;
-	if ((iter_data->flags & BT_TAG_ITER_STARTED) &&
-	    !blk_mq_request_started(rq))
-		return true;
-	return iter_data->fn(rq, iter_data->data, reserved);
+		rq = rcu_dereference(tags->rqs[bitnr]);
+	if (rq && (!(iter_data->flags & BT_TAG_ITER_STARTED) ||
+		   blk_mq_request_started(rq)))
+		res = iter_data->fn(rq, iter_data->data, reserved);
+	rcu_read_unlock();
+
+	return res;
 }

 /**
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 7d3e6b333a4a..7a6d04733261 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -17,7 +17,7 @@ struct blk_mq_tags {
 	struct sbitmap_queue __bitmap_tags;
 	struct sbitmap_queue __breserved_tags;

-	struct request **rqs;
+	struct request __rcu **rqs;
 	struct request **static_rqs;
 	struct list_head page_list;
 };
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d4d7c1caa439..594bf7f4ed9a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -495,8 +495,10 @@ static void __blk_mq_free_request(struct request *rq)
 	blk_crypto_free_request(rq);
 	blk_pm_mark_last_busy(rq);
 	rq->mq_hctx = NULL;
-	if (rq->tag != BLK_MQ_NO_TAG)
+	if (rq->tag != BLK_MQ_NO_TAG) {
 		blk_mq_put_tag(hctx->tags, ctx, rq->tag);
+		rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);
+	}
 	if (sched_tag != BLK_MQ_NO_TAG)
 		blk_mq_put_tag(hctx->sched_tags, ctx, sched_tag);
 	blk_mq_sched_restart(hctx);
@@ -839,8 +841,8 @@ EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list);
 struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
 {
 	if (tag < tags->nr_tags) {
-		prefetch(tags->rqs[tag]);
-		return tags->rqs[tag];
+		prefetch((__force void *)tags->rqs[tag]);
+		return rcu_dereference_check(tags->rqs[tag], true);
 	}

 	return NULL;
@@ -1111,7 +1113,7 @@ static bool blk_mq_get_driver_tag(struct request *rq)
 		rq->rq_flags |= RQF_MQ_INFLIGHT;
 		__blk_mq_inc_active_requests(hctx);
 	}
-	hctx->tags->rqs[rq->tag] = rq;
+	rcu_assign_pointer(hctx->tags->rqs[rq->tag], rq);
 	return true;
 }

diff --git a/block/blk-mq.h b/block/blk-mq.h
index 3616453ca28c..9ccb1818303b 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -226,6 +226,7 @@ static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
 					   struct request *rq)
 {
 	blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag);
+	rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);
 	rq->tag = BLK_MQ_NO_TAG;

 	if (rq->rq_flags & RQF_MQ_INFLIGHT) {
John Garry March 16, 2021, 4:15 p.m. UTC | #8
Hi Bart,

I'll have a look at this ASAP -  a bit busy.

But a quick scan and I notice this:

 > @@ -226,6 +226,7 @@ static inline void __blk_mq_put_driver_tag(struct 
blk_mq_hw_ctx *hctx,
 >   					   struct request *rq)
 >   {
 >   	blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag);
 > +	rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);

Wasn't there a requirement to not touch the fastpath at all, even if 
only NULLifying a pointer?

IIRC, Kashyap some time ago had a patch like above (but without RCU 
usage), but the request from Jens was to not touch the fastpath.

Maybe I'm mistaken - I will try to dig up the thread.

Thanks,
John

> How about replacing the entire patch series with the patch below? The
> patch below has the following advantages:
> - Instead of making the race window smaller, the race is fixed
>    completely.
> - No new atomic instructions are added to the block layer code.
> - No early return is inserted in blk_mq_tagset_busy_iter().
> 
> Thanks,
> 
> Bart.
> 
>  From a0e534012a766bd6e53cdd466eec0a811164c12a Mon Sep 17 00:00:00 2001
> From: Bart Van Assche <bvanassche@acm.org>
> Date: Wed, 10 Mar 2021 19:11:47 -0800
> Subject: [PATCH] blk-mq: Fix races between iterating over requests and freeing
>   requests
> 
> Multiple users have reported use-after-free complaints similar to the
> following (see also https://lore.kernel.org/linux-block/1545261885.185366.488.camel@acm.org/):
> 
> BUG: KASAN: use-after-free in bt_iter+0x86/0xf0
> Read of size 8 at addr ffff88803b335240 by task fio/21412
> 
> CPU: 0 PID: 21412 Comm: fio Tainted: G        W         4.20.0-rc6-dbg+ #3
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> Call Trace:
>   dump_stack+0x86/0xca
>   print_address_description+0x71/0x239
>   kasan_report.cold.5+0x242/0x301
>   __asan_load8+0x54/0x90
>   bt_iter+0x86/0xf0
>   blk_mq_queue_tag_busy_iter+0x373/0x5e0
>   blk_mq_in_flight+0x96/0xb0
>   part_in_flight+0x40/0x140
>   part_round_stats+0x18e/0x370
>   blk_account_io_start+0x3d7/0x670
>   blk_mq_bio_to_request+0x19c/0x3a0
>   blk_mq_make_request+0x7a9/0xcb0
>   generic_make_request+0x41d/0x960
>   submit_bio+0x9b/0x250
>   do_blockdev_direct_IO+0x435c/0x4c70
>   __blockdev_direct_IO+0x79/0x88
>   ext4_direct_IO+0x46c/0xc00
>   generic_file_direct_write+0x119/0x210
>   __generic_file_write_iter+0x11c/0x280
>   ext4_file_write_iter+0x1b8/0x6f0
>   aio_write+0x204/0x310
>   io_submit_one+0x9d3/0xe80
>   __x64_sys_io_submit+0x115/0x340
>   do_syscall_64+0x71/0x210
> 
> When multiple request queues share a tag set and when switching the I/O
> scheduler for one of the request queues that uses this tag set, the
> following race can happen:
> * blk_mq_tagset_busy_iter() calls bt_tags_iter() and bt_tags_iter() assigns
>    a pointer to a scheduler request to the local variable 'rq'.
> * blk_mq_sched_free_requests() is called to free hctx->sched_tags.
> * blk_mq_tagset_busy_iter() dereferences 'rq' and triggers a use-after-free.
> 
> Fix this race as follows:
> * Use rcu_assign_pointer() and rcu_dereference() to access hctx->tags->rqs[].
> * Protect hctx->tags->rqs[] reads with an RCU read-side lock.
> * No new rcu_barrier() call has been added because clearing the request
>    pointer in hctx->tags->rqs[] happens before blk_queue_exit() and the
>    blk_freeze_queue() call in blk_cleanup_queue() triggers an RCU barrier
>    after all scheduler request pointers associated with a request queue
>    have been removed from hctx->tags->rqs[] and before these scheduler
>    requests are freed.
> 
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>   block/blk-mq-tag.c | 27 +++++++++++++++++----------
>   block/blk-mq-tag.h |  2 +-
>   block/blk-mq.c     | 10 ++++++----
>   block/blk-mq.h     |  1 +
>   4 files changed, 25 insertions(+), 15 deletions(-)
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 9c92053e704d..8351c3f2fe2d 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -206,18 +206,23 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>   	struct blk_mq_tags *tags = hctx->tags;
>   	bool reserved = iter_data->reserved;
>   	struct request *rq;
> +	bool res = true;
> 
>   	if (!reserved)
>   		bitnr += tags->nr_reserved_tags;
> -	rq = tags->rqs[bitnr];
> +
> +	rcu_read_lock();
> +	rq = rcu_dereference(tags->rqs[bitnr]);
> 
>   	/*
>   	 * We can hit rq == NULL here, because the tagging functions
>   	 * test and set the bit before assigning ->rqs[].
>   	 */
>   	if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx)
> -		return iter_data->fn(hctx, rq, iter_data->data, reserved);
> -	return true;
> +		res = iter_data->fn(hctx, rq, iter_data->data, reserved);
> +	rcu_read_unlock();
> +
> +	return res;
>   }
> 
>   /**
> @@ -264,10 +269,12 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>   	struct blk_mq_tags *tags = iter_data->tags;
>   	bool reserved = iter_data->flags & BT_TAG_ITER_RESERVED;
>   	struct request *rq;
> +	bool res = true;
> 
>   	if (!reserved)
>   		bitnr += tags->nr_reserved_tags;
> 
> +	rcu_read_lock();
>   	/*
>   	 * We can hit rq == NULL here, because the tagging functions
>   	 * test and set the bit before assigning ->rqs[].
> @@ -275,13 +282,13 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>   	if (iter_data->flags & BT_TAG_ITER_STATIC_RQS)
>   		rq = tags->static_rqs[bitnr];
>   	else
> -		rq = tags->rqs[bitnr];
> -	if (!rq)
> -		return true;
> -	if ((iter_data->flags & BT_TAG_ITER_STARTED) &&
> -	    !blk_mq_request_started(rq))
> -		return true;
> -	return iter_data->fn(rq, iter_data->data, reserved);
> +		rq = rcu_dereference(tags->rqs[bitnr]);
> +	if (rq && (!(iter_data->flags & BT_TAG_ITER_STARTED) ||
> +		   blk_mq_request_started(rq)))
> +		res = iter_data->fn(rq, iter_data->data, reserved);
> +	rcu_read_unlock();
> +
> +	return res;
>   }
> 
>   /**
> diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
> index 7d3e6b333a4a..7a6d04733261 100644
> --- a/block/blk-mq-tag.h
> +++ b/block/blk-mq-tag.h
> @@ -17,7 +17,7 @@ struct blk_mq_tags {
>   	struct sbitmap_queue __bitmap_tags;
>   	struct sbitmap_queue __breserved_tags;
> 
> -	struct request **rqs;
> +	struct request __rcu **rqs;
>   	struct request **static_rqs;
>   	struct list_head page_list;
>   };
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index d4d7c1caa439..594bf7f4ed9a 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -495,8 +495,10 @@ static void __blk_mq_free_request(struct request *rq)
>   	blk_crypto_free_request(rq);
>   	blk_pm_mark_last_busy(rq);
>   	rq->mq_hctx = NULL;
> -	if (rq->tag != BLK_MQ_NO_TAG)
> +	if (rq->tag != BLK_MQ_NO_TAG) {
>   		blk_mq_put_tag(hctx->tags, ctx, rq->tag);
> +		rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);
> +	}
>   	if (sched_tag != BLK_MQ_NO_TAG)
>   		blk_mq_put_tag(hctx->sched_tags, ctx, sched_tag);
>   	blk_mq_sched_restart(hctx);
> @@ -839,8 +841,8 @@ EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list);
>   struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
>   {
>   	if (tag < tags->nr_tags) {
> -		prefetch(tags->rqs[tag]);
> -		return tags->rqs[tag];
> +		prefetch((__force void *)tags->rqs[tag]);
> +		return rcu_dereference_check(tags->rqs[tag], true);
>   	}
> 
>   	return NULL;
> @@ -1111,7 +1113,7 @@ static bool blk_mq_get_driver_tag(struct request *rq)
>   		rq->rq_flags |= RQF_MQ_INFLIGHT;
>   		__blk_mq_inc_active_requests(hctx);
>   	}
> -	hctx->tags->rqs[rq->tag] = rq;
> +	rcu_assign_pointer(hctx->tags->rqs[rq->tag], rq);
>   	return true;
>   }
> 
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 3616453ca28c..9ccb1818303b 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -226,6 +226,7 @@ static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
>   					   struct request *rq)
>   {
>   	blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag);
> +	rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);
>   	rq->tag = BLK_MQ_NO_TAG;
> 
>   	if (rq->rq_flags & RQF_MQ_INFLIGHT) {
Bart Van Assche March 16, 2021, 5 p.m. UTC | #9
On 3/16/21 9:15 AM, John Garry wrote:
> I'll have a look at this ASAP -  a bit busy.
> 
> But a quick scan and I notice this:
> 
>  > @@ -226,6 +226,7 @@ static inline void __blk_mq_put_driver_tag(struct 
> blk_mq_hw_ctx *hctx,
>  >                          struct request *rq)
>  >   {
>  >       blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag);
>  > +    rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);
> 
> Wasn't a requirement to not touch the fastpath at all, including even if 
> only NULLifying a pointer?
> 
> IIRC, Kashyap some time ago had a patch like above (but without RCU 
> usage), but the request from Jens was to not touch the fastpath.
> 
> Maybe I'm mistaken - I will try to dig up the thread.

Hi John,

I agree that Jens asked at the end of 2018 not to touch the fast path to 
fix this use-after-free (maybe that request has been repeated more 
recently). If Jens or anyone else feels strongly about not clearing 
hctx->tags->rqs[rq->tag] from the fast path then I will make that 
change. My motivation for clearing these pointers from the fast path is 
as follows:
- This results in code that is easier to read and easier to maintain.
- Every modern CPU pipelines store instructions so the performance 
impact of adding an additional store should be small.
- Since the block layer has a tendency to reuse tags that have been 
freed recently, it is likely that hctx->tags->rqs[rq->tag] will be used 
for a next request and hence that it will have to be loaded into the CPU 
cache anyway.

Bart.
John Garry March 16, 2021, 5:43 p.m. UTC | #10
On 16/03/2021 17:00, Bart Van Assche wrote:
> On 3/16/21 9:15 AM, John Garry wrote:
>> I'll have a look at this ASAP -  a bit busy.
>>
>> But a quick scan and I notice this:
>>
>>  > @@ -226,6 +226,7 @@ static inline void 
>> __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
>>  >                          struct request *rq)
>>  >   {
>>  >       blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag);
>>  > +    rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);
>>
>> Wasn't a requirement to not touch the fastpath at all, including even 
>> if only NULLifying a pointer?
>>
>> IIRC, Kashyap some time ago had a patch like above (but without RCU 
>> usage), but the request from Jens was to not touch the fastpath.
>>
>> Maybe I'm mistaken - I will try to dig up the thread.
> 

Hi Bart,

> 
> I agree that Jens asked at the end of 2018 not to touch the fast path to 
> fix this use-after-free (maybe that request has been repeated more 
> recently). If Jens or anyone else feels strongly about not clearing 
> hctx->tags->rqs[rq->tag] from the fast path then I will make that 
> change. 

Is that possible for this same approach? I need to check the code more..

And don't we still have the problem that some iter callbacks may 
sleep/block, which is not allowed in an RCU read-side critical section?

> My motivation for clearing these pointers from the fast path is 
> as follows:
> - This results in code that is easier to read and easier to maintain.
> - Every modern CPU pipelines store instructions so the performance 
> impact of adding an additional store should be small.
> - Since the block layer has a tendency to reuse tags that have been 
> freed recently, it is likely that hctx->tags->rqs[rq->tag] will be used 
> for a next request and hence that it will have to be loaded into the CPU 
> cache anyway.
> 

Those points make sense to me, but obviously it's the maintainers call.

Thanks,
john
Bart Van Assche March 16, 2021, 7:59 p.m. UTC | #11
On 3/16/21 10:43 AM, John Garry wrote:
> On 16/03/2021 17:00, Bart Van Assche wrote:
>> I agree that Jens asked at the end of 2018 not to touch the fast path
>> to fix this use-after-free (maybe that request has been repeated more
>> recently). If Jens or anyone else feels strongly about not clearing
>> hctx->tags->rqs[rq->tag] from the fast path then I will make that change. 
> 
> Is that possible for this same approach? I need to check the code more..

If the fast path should not be modified, I'm considering borrowing patch
1/3 from your patch series and adding an rcu_barrier() between the code
that clears the request pointers and the code that frees the scheduler
requests.
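
A rough sketch of that ordering (blk_mq_clear_rq_mapping() here is a
placeholder for whatever patch 1/3 provides to clear the stale tags->rqs[]
entries):

void blk_mq_sched_free_requests(struct request_queue *q)
{
	struct blk_mq_hw_ctx *hctx;
	int i;

	queue_for_each_hw_ctx(q, hctx, i) {
		if (!hctx->sched_tags)
			continue;
		/* 1. Drop the tags->rqs[] pointers that still reference this
		 *    queue's scheduler requests (patch 1/3, placeholder name).
		 */
		blk_mq_clear_rq_mapping(hctx->tags, hctx->sched_tags);
	}

	/*
	 * 2. Wait out any concurrent bt_iter()/bt_tags_iter() reader that
	 *    may still hold one of the old pointers before freeing them.
	 */
	rcu_barrier();

	queue_for_each_hw_ctx(q, hctx, i) {
		if (hctx->sched_tags)
			blk_mq_free_rqs(q->tag_set, hctx->sched_tags, i);
	}
}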

> And don't we still have the problem that some iter callbacks may
> sleep/block, which is not allowed in an RCU read-side critical section?

Thanks for having brought this up. Since none of the functions that
iterate over requests should be called from the hot path of a block
driver, I think that we can use srcu_read_(un|)lock() inside bt_iter()
and bt_tags_iter() instead of rcu_read_(un|)lock().
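
In bt_iter() that substitution could look roughly like the following; the
iter_srcu field is made up for the sketch and would have to be added somewhere
(e.g. struct blk_mq_tags) and initialized with init_srcu_struct(), with the
freeing side then calling synchronize_srcu():

static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
{
	struct bt_iter_data *iter_data = data;
	struct blk_mq_hw_ctx *hctx = iter_data->hctx;
	struct blk_mq_tags *tags = hctx->tags;
	bool reserved = iter_data->reserved;
	struct request *rq;
	bool res = true;
	int srcu_idx;

	if (!reserved)
		bitnr += tags->nr_reserved_tags;

	srcu_idx = srcu_read_lock(&tags->iter_srcu);	/* readers may sleep */
	rq = srcu_dereference(tags->rqs[bitnr], &tags->iter_srcu);
	/*
	 * rq may be NULL because the tag bit is set before ->rqs[] is
	 * assigned.
	 */
	if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx)
		res = iter_data->fn(hctx, rq, iter_data->data, reserved);
	srcu_read_unlock(&tags->iter_srcu, srcu_idx);

	return res;
}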

Bart.
John Garry March 19, 2021, 6:19 p.m. UTC | #12
On 16/03/2021 19:59, Bart Van Assche wrote:
> On 3/16/21 10:43 AM, John Garry wrote:
>> On 16/03/2021 17:00, Bart Van Assche wrote:
>>> I agree that Jens asked at the end of 2018 not to touch the fast path
>>> to fix this use-after-free (maybe that request has been repeated more
>>> recently). If Jens or anyone else feels strongly about not clearing
>>> hctx->tags->rqs[rq->tag] from the fast path then I will make that change.

Hi Bart,

>> Is that possible for this same approach? I need to check the code more..
> If the fast path should not be modified, I'm considering to borrow patch
> 1/3 from your patch series

Fine

> and to add an rcu_barrier() between the code
> that clears the request pointers and that frees the scheduler requests.
> 
>> And don't we still have the problem that some iter callbacks may
>> sleep/block, which is not allowed in an RCU read-side critical section?
> Thanks for having brought this up. Since none of the functions that
> iterate over requests should be called from the hot path of a block
> driver, I think that we can use srcu_read_(un|)lock() inside bt_iter()
> and bt_tags_iter() instead of rcu_read_(un|)lock().

OK, but TBH, I am not so familiar with srcu - were you going to try this?

Thanks,
John
Bart Van Assche March 19, 2021, 6:32 p.m. UTC | #13
On 3/19/21 11:19 AM, John Garry wrote:
> OK, but TBH, I am not so familiar with srcu - were you going to try this?

Hi John,

Have you received the following patch: "[PATCH] blk-mq: Fix races 
between iterating over requests and freeing requests" 
(https://lore.kernel.org/linux-block/20210319010009.10041-1-bvanassche@acm.org/)?

Thanks,

Bart.

Patch

diff --git a/block/blk.h b/block/blk.h
index 3b53e44b967e..1a948bfd91e4 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -201,10 +201,29 @@  void elv_unregister_queue(struct request_queue *q);
 static inline void elevator_exit(struct request_queue *q,
 		struct elevator_queue *e)
 {
+	struct blk_mq_tag_set *set = q->tag_set;
+	struct request_queue *tmp;
+
 	lockdep_assert_held(&q->sysfs_lock);
 
+	mutex_lock(&set->tag_list_lock);
+	list_for_each_entry(tmp, &set->tag_list, tag_set_list) {
+		if (tmp == q)
+			continue;
+		blk_mq_freeze_queue(tmp);
+		blk_mq_quiesce_queue(tmp);
+	}
+
 	blk_mq_sched_free_requests(q);
 	__elevator_exit(q, e);
+
+	list_for_each_entry(tmp, &set->tag_list, tag_set_list) {
+		if (tmp == q)
+			continue;
+		blk_mq_unquiesce_queue(tmp);
+		blk_mq_unfreeze_queue(tmp);
+	}
+	mutex_unlock(&set->tag_list_lock);
 }
 
 ssize_t part_size_show(struct device *dev, struct device_attribute *attr,