block: don't grab elevator lock during queue initialization

Message ID 20250403105402.1334206-1-ming.lei@redhat.com (mailing list archive)
State New
Series: block: don't grab elevator lock during queue initialization

Commit Message

Ming Lei April 3, 2025, 10:54 a.m. UTC
->elevator_lock depends on the queue freeze lock, see block/blk-sysfs.c.

The queue freeze lock in turn depends on fs_reclaim.

So don't grab the elevator lock during queue initialization, which needs
to call kmalloc(GFP_KERNEL); this cuts the dependency between
->elevator_lock and fs_reclaim, and the lockdep warning can be killed.

This is safe because elevator switching cannot run yet during queue
initialization.
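
Schematically, the dependency being cut looks like this (an illustrative
sketch of the init-path pattern, not the literal code):

	mutex_lock(&q->elevator_lock);
	/*
	 * A GFP_KERNEL allocation under ->elevator_lock records
	 * elevator_lock -> fs_reclaim in lockdep; combined with the
	 * existing fs_reclaim -> freeze_lock and
	 * freeze_lock -> elevator_lock dependencies, that closes a cycle.
	 */
	hctx = kzalloc_node(sizeof(*hctx), GFP_KERNEL, node);
	mutex_unlock(&q->elevator_lock);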

There is no such issue in __blk_mq_update_nr_hw_queues() because
memalloc_noio_save() is called before acquiring the elevator lock.

Fixes the following lockdep warning:

https://lore.kernel.org/linux-block/67e6b425.050a0220.2f068f.007b.GAE@google.com/

Reported-by: syzbot+4c7e0f9b94ad65811efb@syzkaller.appspotmail.com
Cc: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

Comments

Nilay Shroff April 3, 2025, 1:19 p.m. UTC | #1
On 4/3/25 4:24 PM, Ming Lei wrote:
> ->elevator_lock depends on the queue freeze lock, see block/blk-sysfs.c.
> 
> The queue freeze lock in turn depends on fs_reclaim.
> 
> So don't grab the elevator lock during queue initialization, which needs
> to call kmalloc(GFP_KERNEL); this cuts the dependency between
> ->elevator_lock and fs_reclaim, and the lockdep warning can be killed.
> 
> This is safe because elevator switching cannot run yet during queue
> initialization.
> 
> There is no such issue in __blk_mq_update_nr_hw_queues() because
> memalloc_noio_save() is called before acquiring the elevator lock.
> 
> Fixes the following lockdep warning:
> 
> https://lore.kernel.org/linux-block/67e6b425.050a0220.2f068f.007b.GAE@google.com/
> 
> Reported-by: syzbot+4c7e0f9b94ad65811efb@syzkaller.appspotmail.com
> Cc: Nilay Shroff <nilay@linux.ibm.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-mq.c | 24 +++++++++++++++++-------
>  1 file changed, 17 insertions(+), 7 deletions(-)
> 
I think you posted this same patch earlier here:
https://lore.kernel.org/linux-block/Z-dUCLvf06SfTOHy@fedora/

I tested that earlier patch and ran into another lockdep splat, as reported here:
https://lore.kernel.org/linux-block/462d4e8a-dd95-48fe-b9fe-a558057f9595@linux.ibm.com/

I don't understand why we think your earlier fix, which cut the dependency
between ->elevator_lock and ->freeze_lock, doesn't work. But anyway, my view
is that we get into these lock chains from two different code paths:

path1: elevator_lock 
         -> fs_reclaim (GFP_KERNEL)
           -> freeze_lock

path2: freeze_lock(memalloc_noio) 
         -> elevator_lock 
           -> fs_reclaim (this becomes NOP in this case due to memalloc_noio)

As you can see above, we've got a circular dependency between
->elevator_lock and ->freeze_lock. I thought your earlier patch was able
to cut that dependency using blk_mq_enter_no_io() and blk_mq_exit_no_io().
But I'm not sure why we think that won't work.

Thanks,
--Nilay
Ming Lei April 3, 2025, 2:24 p.m. UTC | #2
On Thu, Apr 03, 2025 at 06:49:05PM +0530, Nilay Shroff wrote:
> 
> 
> On 4/3/25 4:24 PM, Ming Lei wrote:
> > ->elevator_lock depends on the queue freeze lock, see block/blk-sysfs.c.
> > 
> > The queue freeze lock in turn depends on fs_reclaim.
> > 
> > So don't grab the elevator lock during queue initialization, which needs
> > to call kmalloc(GFP_KERNEL); this cuts the dependency between
> > ->elevator_lock and fs_reclaim, and the lockdep warning can be killed.
> > 
> > This is safe because elevator switching cannot run yet during queue
> > initialization.
> > 
> > There is no such issue in __blk_mq_update_nr_hw_queues() because
> > memalloc_noio_save() is called before acquiring the elevator lock.
> > 
> > Fixes the following lockdep warning:
> > 
> > https://lore.kernel.org/linux-block/67e6b425.050a0220.2f068f.007b.GAE@google.com/
> > 
> > Reported-by: syzbot+4c7e0f9b94ad65811efb@syzkaller.appspotmail.com
> > Cc: Nilay Shroff <nilay@linux.ibm.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  block/blk-mq.c | 24 +++++++++++++++++-------
> >  1 file changed, 17 insertions(+), 7 deletions(-)
> > 
> I think you posted this same patch earlier here:
> https://lore.kernel.org/linux-block/Z-dUCLvf06SfTOHy@fedora/
> 
> I tested that earlier patch and ran into another lockdep splat, as reported here:
> https://lore.kernel.org/linux-block/462d4e8a-dd95-48fe-b9fe-a558057f9595@linux.ibm.com/

That is a different one; let's fix this one first.

The ->elevator_lock in blk_register_queue() should be for avoiding a race
with updating nr_hw_queues, right?

That is why I think the dependency between the elevator lock and the freeze
lock is one source of trouble.

> 
> I don't understand why we think your earlier fix, which cut the dependency
> between ->elevator_lock and ->freeze_lock, doesn't work. But anyway, my view
> is that we get into these lock chains from two different code paths:

As I explained, blk_mq_enter_no_io() is the same as freezing the queue; it is
just that the lock in __bio_queue_enter() isn't modeled for lockdep. If that
modeling were added, every lockdep warning would be re-triggered too.

> 
> path1: elevator_lock 
>          -> fs_reclaim (GFP_KERNEL)
>            -> freeze_lock
> 
> path2: freeze_lock(memalloc_noio) 
>          -> elevator_lock 
>            -> fs_reclaim (this becomes NOP in this case due to memalloc_noio)

No, there is no fs_reclaim in path2; memalloc_noio() avoids it.
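
For reference, the pattern __blk_mq_update_nr_hw_queues() relies on looks
roughly like this (a minimal sketch using the <linux/sched/mm.h> helpers,
not the exact blk-mq code):

	unsigned int memflags;

	memflags = memalloc_noio_save();	/* GFP_KERNEL acts as GFP_NOIO */
	mutex_lock(&q->elevator_lock);
	/* allocations in here cannot recurse into fs_reclaim */
	mutex_unlock(&q->elevator_lock);
	memalloc_noio_restore(memflags);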


Thanks,
Ming
Jens Axboe April 3, 2025, 2:32 p.m. UTC | #3
On Thu, 03 Apr 2025 18:54:02 +0800, Ming Lei wrote:
> ->elevator_lock depends on the queue freeze lock, see block/blk-sysfs.c.
> 
> The queue freeze lock in turn depends on fs_reclaim.
> 
> So don't grab the elevator lock during queue initialization, which needs
> to call kmalloc(GFP_KERNEL); this cuts the dependency between
> ->elevator_lock and fs_reclaim, and the lockdep warning can be killed.
> 
> [...]

Applied, thanks!

[1/1] block: don't grab elevator lock during queue initialization
      commit: 01b91bf14f6d4893e03e357006e7af3a20c03fee

Best regards,
Christoph Hellwig April 4, 2025, 9:10 a.m. UTC | #4
On Thu, Apr 03, 2025 at 06:54:02PM +0800, Ming Lei wrote:
> Fixes the following lockdep warning:

Please spell the actual dependency out here; links are not permanent
and are also not readable when going through the commit logs offline.

> +static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
> +				   struct request_queue *q, bool lock)
> +{
> +	if (lock) {

bool lock(ed) arguments are an anti-pattern, and regularly get Linus
screaming at you (in this case even for the right reason :))

> +		/* protect against switching io scheduler  */
> +		mutex_lock(&q->elevator_lock);
> +		__blk_mq_realloc_hw_ctxs(set, q);
> +		mutex_unlock(&q->elevator_lock);
> +	} else {
> +		__blk_mq_realloc_hw_ctxs(set, q);
> +	}

I think the problem here is again that, because of all the other
dependencies, elevator_lock really needs to be per-set instead of
per-queue, which will allow us to have much saner locking hierarchies.
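
The usual alternative to a bool-lock flag is a pair of helpers, along these
lines (a sketch of the shape only, not the code that was merged):

	/* update paths: exclude concurrent elevator switching */
	static void blk_mq_realloc_hw_ctxs_locked(struct blk_mq_tag_set *set,
						  struct request_queue *q)
	{
		mutex_lock(&q->elevator_lock);
		__blk_mq_realloc_hw_ctxs(set, q);
		mutex_unlock(&q->elevator_lock);
	}

	/* queue initialization: no elevator can be set yet */
	static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
					   struct request_queue *q)
	{
		__blk_mq_realloc_hw_ctxs(set, q);
	}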
Ming Lei April 4, 2025, 12:09 p.m. UTC | #5
On Fri, Apr 04, 2025 at 11:10:37AM +0200, Christoph Hellwig wrote:
> On Thu, Apr 03, 2025 at 06:54:02PM +0800, Ming Lei wrote:
> > Fixes the following lockdep warning:
> 
> Please spell the actual dependency out here, links are not permanent
> and also not readable for any offline reading of the commit logs.
> 
> > +static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
> > +				   struct request_queue *q, bool lock)
> > +{
> > +	if (lock) {
> 
> bool lock(ed) arguments are an anti-pattern, and regularly get Linus
> screaming at you (in this case even for the right reason :))
> 
> > +		/* protect against switching io scheduler  */
> > +		mutex_lock(&q->elevator_lock);
> > +		__blk_mq_realloc_hw_ctxs(set, q);
> > +		mutex_unlock(&q->elevator_lock);
> > +	} else {
> > +		__blk_mq_realloc_hw_ctxs(set, q);
> > +	}
> 
> I think the problem here is again that because of all the other
> dependencies elevator_lock really needs to be per-set instead of
> per-queue which will allows us to have much saner locking hierarchies.

A per-set lock is required in the error handling code path; anywhere it is
used together with queue freezing becomes a deadlock risk if a hardware
error happens.

Or can you explain the idea in detail?

Thanks,
Ming
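
For illustration only, the per-set idea would presumably move the lock up
one level, roughly like this (hypothetical field and code, not from any
posted patch):

	struct blk_mq_tag_set {
		...
		/* would serialize elevator changes for all queues in the set */
		struct mutex elevator_lock;
	};

	/* update paths would then take the set-wide lock once */
	mutex_lock(&set->elevator_lock);
	list_for_each_entry(q, &set->tag_list, tag_set_list)
		blk_mq_realloc_hw_ctxs(set, q);
	mutex_unlock(&set->elevator_lock);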

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ae8494d88897..d7a103dc258b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4465,14 +4465,12 @@  static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
 	return NULL;
 }
 
-static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
-						struct request_queue *q)
+static void __blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
+				     struct request_queue *q)
 {
 	struct blk_mq_hw_ctx *hctx;
 	unsigned long i, j;
 
-	/* protect against switching io scheduler  */
-	mutex_lock(&q->elevator_lock);
 	for (i = 0; i < set->nr_hw_queues; i++) {
 		int old_node;
 		int node = blk_mq_get_hctx_node(set, i);
@@ -4505,7 +4503,19 @@  static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 
 	xa_for_each_start(&q->hctx_table, j, hctx, j)
 		blk_mq_exit_hctx(q, set, hctx, j);
-	mutex_unlock(&q->elevator_lock);
+}
+
+static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
+				   struct request_queue *q, bool lock)
+{
+	if (lock) {
+		/* protect against switching io scheduler  */
+		mutex_lock(&q->elevator_lock);
+		__blk_mq_realloc_hw_ctxs(set, q);
+		mutex_unlock(&q->elevator_lock);
+	} else {
+		__blk_mq_realloc_hw_ctxs(set, q);
+	}
 
 	/* unregister cpuhp callbacks for exited hctxs */
 	blk_mq_remove_hw_queues_cpuhp(q);
@@ -4537,7 +4547,7 @@  int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 
 	xa_init(&q->hctx_table);
 
-	blk_mq_realloc_hw_ctxs(set, q);
+	blk_mq_realloc_hw_ctxs(set, q, false);
 	if (!q->nr_hw_queues)
 		goto err_hctxs;
 
@@ -5033,7 +5043,7 @@  static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 fallback:
 	blk_mq_update_queue_map(set);
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
-		blk_mq_realloc_hw_ctxs(set, q);
+		blk_mq_realloc_hw_ctxs(set, q, true);
 
 		if (q->nr_hw_queues != set->nr_hw_queues) {
 			int i = prev_nr_hw_queues;