diff mbox series

[1/2] block: avoid to hold q->limits_lock across APIs for atomic update queue limits

Message ID 20241216080206.2850773-2-ming.lei@redhat.com (mailing list archive)
State New, archived
Headers show
Series block: fix deadlock caused by atomic limits update | expand

Commit Message

Ming Lei Dec. 16, 2024, 8:02 a.m. UTC
Commit d690cb8ae14b ("block: add an API to atomically update queue limits")
added APIs for updating queue limits atomically: q->limits_lock is grabbed
in queue_limits_start_update() and released in queue_limits_commit_update().

This approach is fragile and makes it easy to introduce deadlocks [1][2].

More importantly, queue_limits_start_update() returns a local copy of
q->limits; the API user then modifies that local copy and finally commits
it via queue_limits_commit_update().

So holding q->limits_lock protects nothing for the local copy, and I do
not see any real benefit in preventing a new update & commit from
happening, because what matters is that validation & commit are done
atomically.

Change the APIs to not hold q->limits_lock across the atomic queue limits
update calls, fixing the deadlock and making the APIs easier to use.

[1] https://lore.kernel.org/linux-block/Z1A8fai9_fQFhs1s@hovoldconsulting.com/
[2] https://lore.kernel.org/linux-scsi/ZxG38G9BuFdBpBHZ@fedora/

Fixes: d690cb8ae14b ("block: add an API to atomically update queue limits")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-settings.c   | 2 +-
 include/linux/blkdev.h | 8 ++++++--
 2 files changed, 7 insertions(+), 3 deletions(-)

Comments

Christoph Hellwig Dec. 16, 2024, 3:49 p.m. UTC | #1
On Mon, Dec 16, 2024 at 04:02:03PM +0800, Ming Lei wrote:
> More importantly, queue_limits_start_update() returns one local copy of
> q->limits, then the API user overwrites the local copy, and commit the
> copy by queue_limits_commit_update() finally.
> 
> So holding q->limits_lock protects nothing for the local copy, and not see
> any real help by preventing new update & commit from happening, cause
> what matters is that we do validation & commit atomically.

It protects against someone else changing the copy in the queue while
the current thread is updating the local copy, and thus incoherent
updates.  So no, we can't just remove it.
Ming Lei Dec. 17, 2024, 1:52 a.m. UTC | #2
On Mon, Dec 16, 2024 at 04:49:01PM +0100, Christoph Hellwig wrote:
> On Mon, Dec 16, 2024 at 04:02:03PM +0800, Ming Lei wrote:
> > More importantly, queue_limits_start_update() returns one local copy of
> > q->limits, then the API user overwrites the local copy, and commit the
> > copy by queue_limits_commit_update() finally.
> > 
> > So holding q->limits_lock protects nothing for the local copy, and not see
> > any real help by preventing new update & commit from happening, cause
> > what matters is that we do validation & commit atomically.
> 
> It protects against someone else changing the copy in the queue while
> the current thread is updating the local copy, and thus incoherent
> updates.  So no, we can't just remove it.

The local copy can be updated in any way with any data, so does another
concurrent update on q->limits really matter?


thanks,
Ming
Christoph Hellwig Dec. 17, 2024, 4:40 a.m. UTC | #3
On Tue, Dec 17, 2024 at 09:52:51AM +0800, Ming Lei wrote:
> The local copy can be updated in any way with any data, so does another
> concurrent update on q->limits really matter?

Yes, because that means one of the updates get lost even if it is
for entirely separate fields.
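Christoph's point — a whole-struct write-back can lose a concurrent change even to a completely different field — can be shown with a small userspace sketch. The struct and helper names below are made up for illustration and are not the kernel API; they just mimic the copy-out/copy-in shape of queue_limits_start_update()/queue_limits_commit_update() with no lock held in between:

```c
#include <assert.h>

/* Hypothetical miniature of struct queue_limits. */
struct limits {
	unsigned int max_sectors;
	unsigned int logical_block_size;
};

/* Copy-out, like queue_limits_start_update() returning a local copy. */
static struct limits snapshot(const struct limits *shared)
{
	return *shared;
}

/* Copy-in, like queue_limits_commit_update(): whole-struct write-back. */
static void commit(struct limits *shared, const struct limits *local)
{
	*shared = *local;
}

/* Interleave two updaters: A bumps max_sectors, B bumps
 * logical_block_size.  Both snapshot before either commits, so B's
 * commit overwrites A's change even though the fields are disjoint. */
static struct limits demo_lost_update(void)
{
	struct limits shared = { .max_sectors = 512, .logical_block_size = 512 };
	struct limits a = snapshot(&shared);	/* updater A starts */
	struct limits b = snapshot(&shared);	/* updater B starts */

	a.max_sectors = 1024;
	commit(&shared, &a);

	b.logical_block_size = 4096;
	commit(&shared, &b);	/* b still carries max_sectors == 512 */

	return shared;		/* A's max_sectors update is lost */
}
```

Updater A's max_sectors change is silently discarded by B's commit even though B never touched that field — the incoherent update that holding q->limits_lock across the start/commit window is meant to prevent.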
Ming Lei Dec. 17, 2024, 7:05 a.m. UTC | #4
On Tue, Dec 17, 2024 at 05:40:56AM +0100, Christoph Hellwig wrote:
> On Tue, Dec 17, 2024 at 09:52:51AM +0800, Ming Lei wrote:
> > The local copy can be updated in any way with any data, so does another
> > concurrent update on q->limits really matter?
> 
> Yes, because that means one of the updates get lost even if it is
> for entirely separate fields.

Right, but the limits are still valid anytime.

Any suggestion for fixing this deadlock?

One idea is to add queue_limits_start_try_update() and apply it in the
sysfs ->store() handlers, where an update failure is tolerable.


Thanks,
Ming
Christoph Hellwig Dec. 17, 2024, 7:19 a.m. UTC | #5
On Tue, Dec 17, 2024 at 03:05:48PM +0800, Ming Lei wrote:
> On Tue, Dec 17, 2024 at 05:40:56AM +0100, Christoph Hellwig wrote:
> > On Tue, Dec 17, 2024 at 09:52:51AM +0800, Ming Lei wrote:
> > > The local copy can be updated in any way with any data, so does another
> > > concurrent update on q->limits really matter?
> > 
> > Yes, because that means one of the updates get lost even if it is
> > for entirely separate fields.
> 
> Right, but the limits are still valid anytime.
> 
> Any suggestion for fixing this deadlock?

What is "this deadlock"?
Ming Lei Dec. 17, 2024, 7:30 a.m. UTC | #6
On Tue, Dec 17, 2024 at 08:19:28AM +0100, Christoph Hellwig wrote:
> On Tue, Dec 17, 2024 at 03:05:48PM +0800, Ming Lei wrote:
> > On Tue, Dec 17, 2024 at 05:40:56AM +0100, Christoph Hellwig wrote:
> > > On Tue, Dec 17, 2024 at 09:52:51AM +0800, Ming Lei wrote:
> > > > The local copy can be updated in any way with any data, so does another
> > > > concurrent update on q->limits really matter?
> > > 
> > > Yes, because that means one of the updates get lost even if it is
> > > for entirely separate fields.
> > 
> > Right, but the limits are still valid anytime.
> > 
> > Any suggestion for fixing this deadlock?
> 
> What is "this deadlock"?

The commit log provides two reports:

- lockdep warning

https://lore.kernel.org/linux-block/Z1A8fai9_fQFhs1s@hovoldconsulting.com/

- real deadlock report

https://lore.kernel.org/linux-scsi/ZxG38G9BuFdBpBHZ@fedora/

It is actually one simple ABBA lock:

1) queue_attr_store()

      blk_mq_freeze_queue(q);					//queue freeze lock
      res = entry->store(disk, page, length);
	  			queue_limits_start_update		//->limits_lock
				...
				queue_limits_commit_update
      blk_mq_unfreeze_queue(q);

2) sd_revalidate_disk()

queue_limits_start_update					//->limits_lock
	sd_read_capacity()
		scsi_execute_cmd
			scsi_alloc_request
				blk_queue_enter					//queue freeze lock


Thanks,
Ming
Damien Le Moal Dec. 17, 2024, 4:07 p.m. UTC | #7
On 2024/12/16 23:30, Ming Lei wrote:
> On Tue, Dec 17, 2024 at 08:19:28AM +0100, Christoph Hellwig wrote:
>> On Tue, Dec 17, 2024 at 03:05:48PM +0800, Ming Lei wrote:
>>> On Tue, Dec 17, 2024 at 05:40:56AM +0100, Christoph Hellwig wrote:
>>>> On Tue, Dec 17, 2024 at 09:52:51AM +0800, Ming Lei wrote:
>>>>> The local copy can be updated in any way with any data, so does another
>>>>> concurrent update on q->limits really matter?
>>>>
>>>> Yes, because that means one of the updates get lost even if it is
>>>> for entirely separate fields.
>>>
>>> Right, but the limits are still valid anytime.
>>>
>>> Any suggestion for fixing this deadlock?
>>
>> What is "this deadlock"?
> 
> The commit log provides two reports:
> 
> - lockdep warning
> 
> https://lore.kernel.org/linux-block/Z1A8fai9_fQFhs1s@hovoldconsulting.com/
> 
> - real deadlock report
> 
> https://lore.kernel.org/linux-scsi/ZxG38G9BuFdBpBHZ@fedora/
> 
> It is actually one simple ABBA lock:
> 
> 1) queue_attr_store()
> 
>       blk_mq_freeze_queue(q);					//queue freeze lock
>       res = entry->store(disk, page, length);
> 	  			queue_limits_start_update		//->limits_lock
> 				...
> 				queue_limits_commit_update
>       blk_mq_unfreeze_queue(q);

The locking + freeze pattern should be:

	lim = queue_limits_start_update(q);
	...
	blk_mq_freeze_queue(q);
	ret = queue_limits_commit_update(q, &lim);
	blk_mq_unfreeze_queue(q);

This pattern is used in most places and anything that does not use it is likely
susceptible to a similar ABBA deadlock. We should probably look into trying to
integrate the freeze/unfreeze calls directly into queue_limits_commit_update().

Fixing queue_attr_store() to use this pattern seems simpler than trying to fix
sd_revalidate_disk().
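As a rough single-threaded sketch of why this ordering avoids the inversion (mutexes again standing in for the freeze lock and q->limits_lock; all names are made up): with the freeze confined to the commit step, command submission from inside the limits-lock critical section still finds the queue unfrozen:

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t freeze_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t limits_lock = PTHREAD_MUTEX_INITIALIZER;

struct limits { unsigned int max_sectors; };
static struct limits q_limits = { .max_sectors = 512 };

/* The corrected ordering: hold limits_lock for the whole update, but
 * freeze only around the final write-back.  While only the limits lock
 * is held the queue is not frozen, so command submission (modelled by
 * a trylock on the freeze lock) still succeeds.  Returns whether the
 * simulated command could be issued. */
static int update_limits(unsigned int new_max)
{
	struct limits lim;
	int could_submit;

	pthread_mutex_lock(&limits_lock);	/* queue_limits_start_update() */
	lim = q_limits;
	lim.max_sectors = new_max;

	/* e.g. sd_read_capacity() issuing a command via blk_queue_enter() */
	could_submit = (pthread_mutex_trylock(&freeze_lock) == 0);
	if (could_submit)
		pthread_mutex_unlock(&freeze_lock);

	pthread_mutex_lock(&freeze_lock);	/* blk_mq_freeze_queue() */
	q_limits = lim;				/* queue_limits_commit_update() */
	pthread_mutex_unlock(&freeze_lock);	/* blk_mq_unfreeze_queue() */
	pthread_mutex_unlock(&limits_lock);

	return could_submit;
}
```

Both acquisition orders are now limits_lock before freeze, so no cycle can form, and the queue is only frozen for the brief commit window.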

> 
> 2) sd_revalidate_disk()
> 
> queue_limits_start_update					//->limits_lock
> 	sd_read_capacity()
> 		scsi_execute_cmd
> 			scsi_alloc_request
> 				blk_queue_enter					//queue freeze lock
> 
> 
> Thanks,
> Ming
> 
>
Ming Lei Dec. 18, 2024, 2:09 a.m. UTC | #8
On Tue, Dec 17, 2024 at 08:07:06AM -0800, Damien Le Moal wrote:
> On 2024/12/16 23:30, Ming Lei wrote:
> > On Tue, Dec 17, 2024 at 08:19:28AM +0100, Christoph Hellwig wrote:
> >> On Tue, Dec 17, 2024 at 03:05:48PM +0800, Ming Lei wrote:
> >>> On Tue, Dec 17, 2024 at 05:40:56AM +0100, Christoph Hellwig wrote:
> >>>> On Tue, Dec 17, 2024 at 09:52:51AM +0800, Ming Lei wrote:
> >>>>> The local copy can be updated in any way with any data, so does another
> >>>>> concurrent update on q->limits really matter?
> >>>>
> >>>> Yes, because that means one of the updates get lost even if it is
> >>>> for entirely separate fields.
> >>>
> >>> Right, but the limits are still valid anytime.
> >>>
> >>> Any suggestion for fixing this deadlock?
> >>
> >> What is "this deadlock"?
> > 
> > The commit log provides two reports:
> > 
> > - lockdep warning
> > 
> > https://lore.kernel.org/linux-block/Z1A8fai9_fQFhs1s@hovoldconsulting.com/
> > 
> > - real deadlock report
> > 
> > https://lore.kernel.org/linux-scsi/ZxG38G9BuFdBpBHZ@fedora/
> > 
> > It is actually one simple ABBA lock:
> > 
> > 1) queue_attr_store()
> > 
> >       blk_mq_freeze_queue(q);					//queue freeze lock
> >       res = entry->store(disk, page, length);
> > 	  			queue_limits_start_update		//->limits_lock
> > 				...
> > 				queue_limits_commit_update
> >       blk_mq_unfreeze_queue(q);
> 
> The locking + freeze pattern should be:
> 
> 	lim = queue_limits_start_update(q);
> 	...
> 	blk_mq_freeze_queue(q);
> 	ret = queue_limits_commit_update(q, &lim);
> 	blk_mq_unfreeze_queue(q);
> 
> This pattern is used in most places and anything that does not use it is likely
> susceptible to a similar ABBA deadlock. We should probably look into trying to
> integrate the freeze/unfreeze calls directly into queue_limits_commit_update().
> 
> Fixing queue_attr_store() to use this pattern seems simpler than trying to fix
> sd_revalidate_disk().

This way looks good, just commit af2814149883 ("block: freeze the queue in
queue_attr_store") needs to be reverted, and freeze/unfreeze has to be
added to each queue attribute .store() handler.



Thanks,
Ming
Nilay Shroff Dec. 18, 2024, 11:33 a.m. UTC | #9
On 12/18/24 07:39, Ming Lei wrote:
> On Tue, Dec 17, 2024 at 08:07:06AM -0800, Damien Le Moal wrote:
>> On 2024/12/16 23:30, Ming Lei wrote:
>>> On Tue, Dec 17, 2024 at 08:19:28AM +0100, Christoph Hellwig wrote:
>>>> On Tue, Dec 17, 2024 at 03:05:48PM +0800, Ming Lei wrote:
>>>>> On Tue, Dec 17, 2024 at 05:40:56AM +0100, Christoph Hellwig wrote:
>>>>>> On Tue, Dec 17, 2024 at 09:52:51AM +0800, Ming Lei wrote:
>>>>>>> The local copy can be updated in any way with any data, so does another
>>>>>>> concurrent update on q->limits really matter?
>>>>>>
>>>>>> Yes, because that means one of the updates get lost even if it is
>>>>>> for entirely separate fields.
>>>>>
>>>>> Right, but the limits are still valid anytime.
>>>>>
>>>>> Any suggestion for fixing this deadlock?
>>>>
>>>> What is "this deadlock"?
>>>
>>> The commit log provides two reports:
>>>
>>> - lockdep warning
>>>
>>> https://lore.kernel.org/linux-block/Z1A8fai9_fQFhs1s@hovoldconsulting.com/
>>>
>>> - real deadlock report
>>>
>>> https://lore.kernel.org/linux-scsi/ZxG38G9BuFdBpBHZ@fedora/
>>>
>>> It is actually one simple ABBA lock:
>>>
>>> 1) queue_attr_store()
>>>
>>>       blk_mq_freeze_queue(q);					//queue freeze lock
>>>       res = entry->store(disk, page, length);
>>> 	  			queue_limits_start_update		//->limits_lock
>>> 				...
>>> 				queue_limits_commit_update
>>>       blk_mq_unfreeze_queue(q);
>>
>> The locking + freeze pattern should be:
>>
>> 	lim = queue_limits_start_update(q);
>> 	...
>> 	blk_mq_freeze_queue(q);
>> 	ret = queue_limits_commit_update(q, &lim);
>> 	blk_mq_unfreeze_queue(q);
>>
>> This pattern is used in most places and anything that does not use it is likely
>> susceptible to a similar ABBA deadlock. We should probably look into trying to
>> integrate the freeze/unfreeze calls directly into queue_limits_commit_update().
>>
>> Fixing queue_attr_store() to use this pattern seems simpler than trying to fix
>> sd_revalidate_disk().
> 
> This way looks good, just commit af2814149883 ("block: freeze the queue in
> queue_attr_store") needs to be reverted, and freeze/unfreeze has to be
> added to each queue attribute .store() handler.
> 
Wouldn't it be feasible to add blk-mq freeze in queue_limits_start_update()
and blk-mq unfreeze in queue_limits_commit_update()? If we do this then 
the pattern would be, 

queue_limits_start_update(): limit-lock + freeze
queue_limits_commit_update() : unfreeze + limit-unlock  

Then in queue_attr_store() we shall just remove freeze/unfreeze.

We also need to fix a few call sites that have this code block:

{
    blk_mq_freeze_queue()
    ...
    queue_limits_start_update()
    ...    
    queue_limits_commit_update()
    ...
    blk_mq_unfreeze_queue()
    
}

In the above code block, we may then replace blk_mq_freeze_queue() with
queue_limits_start_update() and similarly replace blk_mq_unfreeze_queue()
with queue_limits_commit_update():

{
    queue_limits_start_update()
    ...
    ...
    ...
    queue_limits_commit_update()

}

Thanks,
--Nilay
Ming Lei Dec. 18, 2024, 1:40 p.m. UTC | #10
On Wed, Dec 18, 2024 at 05:03:00PM +0530, Nilay Shroff wrote:
> 
> 
> On 12/18/24 07:39, Ming Lei wrote:
> > On Tue, Dec 17, 2024 at 08:07:06AM -0800, Damien Le Moal wrote:
> >> On 2024/12/16 23:30, Ming Lei wrote:
> >>> On Tue, Dec 17, 2024 at 08:19:28AM +0100, Christoph Hellwig wrote:
> >>>> On Tue, Dec 17, 2024 at 03:05:48PM +0800, Ming Lei wrote:
> >>>>> On Tue, Dec 17, 2024 at 05:40:56AM +0100, Christoph Hellwig wrote:
> >>>>>> On Tue, Dec 17, 2024 at 09:52:51AM +0800, Ming Lei wrote:
> >>>>>>> The local copy can be updated in any way with any data, so does another
> >>>>>>> concurrent update on q->limits really matter?
> >>>>>>
> >>>>>> Yes, because that means one of the updates get lost even if it is
> >>>>>> for entirely separate fields.
> >>>>>
> >>>>> Right, but the limits are still valid anytime.
> >>>>>
> >>>>> Any suggestion for fixing this deadlock?
> >>>>
> >>>> What is "this deadlock"?
> >>>
> >>> The commit log provides two reports:
> >>>
> >>> - lockdep warning
> >>>
> >>> https://lore.kernel.org/linux-block/Z1A8fai9_fQFhs1s@hovoldconsulting.com/
> >>>
> >>> - real deadlock report
> >>>
> >>> https://lore.kernel.org/linux-scsi/ZxG38G9BuFdBpBHZ@fedora/
> >>>
> >>> It is actually one simple ABBA lock:
> >>>
> >>> 1) queue_attr_store()
> >>>
> >>>       blk_mq_freeze_queue(q);					//queue freeze lock
> >>>       res = entry->store(disk, page, length);
> >>> 	  			queue_limits_start_update		//->limits_lock
> >>> 				...
> >>> 				queue_limits_commit_update
> >>>       blk_mq_unfreeze_queue(q);
> >>
> >> The locking + freeze pattern should be:
> >>
> >> 	lim = queue_limits_start_update(q);
> >> 	...
> >> 	blk_mq_freeze_queue(q);
> >> 	ret = queue_limits_commit_update(q, &lim);
> >> 	blk_mq_unfreeze_queue(q);
> >>
> >> This pattern is used in most places and anything that does not use it is likely
> >> susceptible to a similar ABBA deadlock. We should probably look into trying to
> >> integrate the freeze/unfreeze calls directly into queue_limits_commit_update().
> >>
> >> Fixing queue_attr_store() to use this pattern seems simpler than trying to fix
> >> sd_revalidate_disk().
> > 
> > This way looks good, just commit af2814149883 ("block: freeze the queue in
> > queue_attr_store") needs to be reverted, and freeze/unfreeze has to be
> > added to each queue attribute .store() handler.
> > 
> Wouldn't it be feasible to add blk-mq freeze in queue_limits_start_update()
> and blk-mq unfreeze in queue_limits_commit_update()? If we do this then 
> the pattern would be, 
> 
> queue_limits_start_update(): limit-lock + freeze
> queue_limits_commit_update() : unfreeze + limit-unlock  
> 
> Then in queue_attr_store() we shall just remove freeze/unfreeze.
> 
> We also need to fix few call sites where we've code block,
> 
> {
>     blk_mq_freeze_queue()
>     ...
>     queue_limits_start_update()
>     ...    
>     queue_limits_commit_update()
>     ...
>     blk_mq_unfreeze_queue()
>     
> }
> 
> In the above code block, we may then replace blk_mq_freeze_queue() with
> queue_limits_commit_update() and similarly replace blk_mq_unfreeze_queue() 
> with queue_limits_commit_update().
> 
> {
>     queue_limits_start_update()
>     ...
>     ...
>     ...
>     queue_limits_commit_update()

In sd_revalidate_disk(), a blk-mq request is allocated under
queue_limits_start_update(), so the ABBA deadlock is triggered since
blk_queue_enter() takes the same lock (the freeze lock) as
blk_mq_freeze_queue().


Thanks,
Ming
Nilay Shroff Dec. 18, 2024, 2:05 p.m. UTC | #11
On 12/18/24 19:10, Ming Lei wrote:
> On Wed, Dec 18, 2024 at 05:03:00PM +0530, Nilay Shroff wrote:
>>
>>
>> On 12/18/24 07:39, Ming Lei wrote:
>>> On Tue, Dec 17, 2024 at 08:07:06AM -0800, Damien Le Moal wrote:
>>>> On 2024/12/16 23:30, Ming Lei wrote:
>>>>> On Tue, Dec 17, 2024 at 08:19:28AM +0100, Christoph Hellwig wrote:
>>>>>> On Tue, Dec 17, 2024 at 03:05:48PM +0800, Ming Lei wrote:
>>>>>>> On Tue, Dec 17, 2024 at 05:40:56AM +0100, Christoph Hellwig wrote:
>>>>>>>> On Tue, Dec 17, 2024 at 09:52:51AM +0800, Ming Lei wrote:
>>>>>>>>> The local copy can be updated in any way with any data, so does another
>>>>>>>>> concurrent update on q->limits really matter?
>>>>>>>>
>>>>>>>> Yes, because that means one of the updates get lost even if it is
>>>>>>>> for entirely separate fields.
>>>>>>>
>>>>>>> Right, but the limits are still valid anytime.
>>>>>>>
>>>>>>> Any suggestion for fixing this deadlock?
>>>>>>
>>>>>> What is "this deadlock"?
>>>>>
>>>>> The commit log provides two reports:
>>>>>
>>>>> - lockdep warning
>>>>>
>>>>> https://lore.kernel.org/linux-block/Z1A8fai9_fQFhs1s@hovoldconsulting.com/
>>>>>
>>>>> - real deadlock report
>>>>>
>>>>> https://lore.kernel.org/linux-scsi/ZxG38G9BuFdBpBHZ@fedora/
>>>>>
>>>>> It is actually one simple ABBA lock:
>>>>>
>>>>> 1) queue_attr_store()
>>>>>
>>>>>       blk_mq_freeze_queue(q);					//queue freeze lock
>>>>>       res = entry->store(disk, page, length);
>>>>> 	  			queue_limits_start_update		//->limits_lock
>>>>> 				...
>>>>> 				queue_limits_commit_update
>>>>>       blk_mq_unfreeze_queue(q);
>>>>
>>>> The locking + freeze pattern should be:
>>>>
>>>> 	lim = queue_limits_start_update(q);
>>>> 	...
>>>> 	blk_mq_freeze_queue(q);
>>>> 	ret = queue_limits_commit_update(q, &lim);
>>>> 	blk_mq_unfreeze_queue(q);
>>>>
>>>> This pattern is used in most places and anything that does not use it is likely
>>>> susceptible to a similar ABBA deadlock. We should probably look into trying to
>>>> integrate the freeze/unfreeze calls directly into queue_limits_commit_update().
>>>>
>>>> Fixing queue_attr_store() to use this pattern seems simpler than trying to fix
>>>> sd_revalidate_disk().
>>>
>>> This way looks good, just commit af2814149883 ("block: freeze the queue in
>>> queue_attr_store") needs to be reverted, and freeze/unfreeze has to be
>>> added to each queue attribute .store() handler.
>>>
>> Wouldn't it be feasible to add blk-mq freeze in queue_limits_start_update()
>> and blk-mq unfreeze in queue_limits_commit_update()? If we do this then 
>> the pattern would be, 
>>
>> queue_limits_start_update(): limit-lock + freeze
>> queue_limits_commit_update() : unfreeze + limit-unlock  
>>
>> Then in queue_attr_store() we shall just remove freeze/unfreeze.
>>
>> We also need to fix few call sites where we've code block,
>>
>> {
>>     blk_mq_freeze_queue()
>>     ...
>>     queue_limits_start_update()
>>     ...    
>>     queue_limits_commit_update()
>>     ...
>>     blk_mq_unfreeze_queue()
>>     
>> }
>>
>> In the above code block, we may then replace blk_mq_freeze_queue() with
>> queue_limits_commit_update() and similarly replace blk_mq_unfreeze_queue() 
>> with queue_limits_commit_update().
>>
>> {
>>     queue_limits_start_update()
>>     ...
>>     ...
>>     ...
>>     queue_limits_commit_update()
> 
> In sd_revalidate_disk(), blk-mq request is allocated under queue_limits_start_update(),
> then ABBA deadlock is triggered since blk_queue_enter() implies same lock(freeze lock)
> from blk_mq_freeze_queue().
> 
Yeah, agreed, but sd_revalidate_disk() is probably the only exception
that allocates a blk-mq request. Can't we fix it?

One simple way I can think of would be updating the queue_limits_xxx()
APIs and then using them:

queue_limits_start_update(struct request_queue *q, bool freeze) 
{
    ...
    mutex_lock(&q->limits_lock);
    if (freeze)
        blk_mq_freeze_queue(q);
    ...
}

queue_limits_commit_update(struct request_queue *q,
		struct queue_limits *lim, bool unfreeze)
{
    ...
    ...
    if (unfreeze)
        blk_mq_unfreeze_queue(q);
    mutex_unlock(&q->limits_lock);
}

sd_revalidate_disk()
{
    ...
    ...
    queue_limits_start_update(q, false);

    ...
    ...

    blk_mq_freeze_queue(q);
    queue_limits_commit_update(q, lim, true);        
}

Thanks,
--Nilay
Damien Le Moal Dec. 18, 2024, 2:57 p.m. UTC | #12
On 2024/12/18 6:05, Nilay Shroff wrote:
>>> In the above code block, we may then replace blk_mq_freeze_queue() with
>>> queue_limits_commit_update() and similarly replace blk_mq_unfreeze_queue() 
>>> with queue_limits_commit_update().
>>>
>>> {
>>>     queue_limits_start_update()
>>>     ...
>>>     ...
>>>     ...
>>>     queue_limits_commit_update()
>>
>> In sd_revalidate_disk(), blk-mq request is allocated under queue_limits_start_update(),
>> then ABBA deadlock is triggered since blk_queue_enter() implies same lock(freeze lock)
>> from blk_mq_freeze_queue().
>>
> Yeah agreed but I see sd_revalidate_disk() is probably the only exception 
> which allocates the blk-mq request. Can't we fix it? 

If we change where limits_lock is taken now, we will again introduce races
between user config and discovery/revalidation, which is what
queue_limits_start_update() and queue_limits_commit_update() intended to fix in
the first place.

So changing sd_revalidate_disk() is not the right approach.

> One simple way I could think of would be updating queue_limit_xxx() APIs and
> then use it,
> 
> queue_limits_start_update(struct request_queue *q, bool freeze) 
> {
>     ...
>     mutex_lock(&q->limits_lock);
>     if(freeze)
>         blk_mq_freeze_queue(q);
>     ...
> }
> 
> queue_limits_commit_update(struct request_queue *q,
> 		struct queue_limits *lim, bool unfreeze)
> {
>     ...
>     ...
>     if(unfreeze)
>         blk_mq_unfreeze_queue(q);
>     mutex_unlock(&q->limits_lock);
> }
> 
> sd_revalidate_disk()
> {
>     ...
>     ...
>     queue_limits_start_update(q, false);
> 
>     ...
>     ...
> 
>     blk_mq_freeze_queue(q);
>     queue_limits_commit_update(q, lim, true);    

This is overly complicated ... As I suggested, I think that a simpler approach
is to call blk_mq_freeze_queue() and blk_mq_unfreeze_queue() inside
queue_limits_commit_update(). Doing so, no driver should need to directly call
freeze/unfreeze. But that would be a cleanup. Let's first fix the few instances
that have the update/freeze order wrong. As mentioned, the pattern simply needs
to be:

	queue_limits_start_update();
	...
	blk_mq_freeze_queue();
	queue_limits_commit_update();
	blk_mq_unfreeze_queue();

I fixed blk-zoned code last week like this. But I had not noticed that other
places had a similar issue as well.
Christoph Hellwig Dec. 19, 2024, 6:15 a.m. UTC | #13
On Wed, Dec 18, 2024 at 10:09:24AM +0800, Ming Lei wrote:
> This way looks good, just commit af2814149883 ("block: freeze the queue in
> queue_attr_store") needs to be reverted, and freeze/unfreeze has to be
> added to each queue attribute .store() handler.

Actually the sysfs handlers need a lot more work, including sorting
out the sysfs_lock mess.  Gimme a little time to work on it.
Christoph Hellwig Dec. 19, 2024, 6:17 a.m. UTC | #14
On Wed, Dec 18, 2024 at 07:35:36PM +0530, Nilay Shroff wrote:
> One simple way I could think of would be updating queue_limit_xxx() APIs and
> then use it,
> 
> queue_limits_start_update(struct request_queue *q, bool freeze) 
> {
>     ...
>     mutex_lock(&q->limits_lock);
>     if(freeze)
>         blk_mq_freeze_queue(q);
>     ...

Conditional locking (and the freeze sort of is a lock) is always a bad
idea.  But we can do a lower-level no-freeze API for this use case.
Christoph Hellwig Dec. 19, 2024, 6:20 a.m. UTC | #15
On Wed, Dec 18, 2024 at 06:57:45AM -0800, Damien Le Moal wrote:
> > Yeah agreed but I see sd_revalidate_disk() is probably the only exception 
> > which allocates the blk-mq request. Can't we fix it? 
> 
> If we change where limits_lock is taken now, we will again introduce races
> between user config and discovery/revalidation, which is what
> queue_limits_start_update() and queue_limits_commit_update() intended to fix in
> the first place.
> 
> So changing sd_revalidate_disk() is not the right approach.

Well, sd_revalidate_disk is a bit special in that it needs a command
on the same queue to query the information.  So it needs to be able
to issue commands without the queue frozen.  Freezing the queue inside
the limits lock supports that; sd just can't use the convenience helpers
that lock and freeze.

> This is overly complicated ... As I suggested, I think that a simpler approach
> is to call blk_mq_freeze_queue() and blk_mq_unfreeze_queue() inside
> queue_limits_commit_update(). Doing so, no driver should need to directly call
> freeze/unfreeze. But that would be a cleanup. Let's first fix the few instances
> that have the update/freeze order wrong. As mentioned, the pattern simply needs

Yes, the queue only needs to be frozen for the actual update,
which would remove the need for the locking.  The big question for both
variants is if we can get rid of all the callers that have the queue
already frozen and then start an update.
Nilay Shroff Dec. 19, 2024, 7:16 a.m. UTC | #16
On 12/19/24 11:50, Christoph Hellwig wrote:
> On Wed, Dec 18, 2024 at 06:57:45AM -0800, Damien Le Moal wrote:
>>> Yeah agreed but I see sd_revalidate_disk() is probably the only exception 
>>> which allocates the blk-mq request. Can't we fix it? 
>>
>> If we change where limits_lock is taken now, we will again introduce races
>> between user config and discovery/revalidation, which is what
>> queue_limits_start_update() and queue_limits_commit_update() intended to fix in
>> the first place.
>>
>> So changing sd_revalidate_disk() is not the right approach.
> 
> Well, sd_revalidate_disk is a bit special in that it needs a command
> on the same queue to query the information.  So it needs to be able
> to issue commands without the queue frozen.  Freezing the queue inside
> the limits lock support that, sd just can't use the convenience helpers
> that lock and freeze.
> 
>> This is overly complicated ... As I suggested, I think that a simpler approach
>> is to call blk_mq_freeze_queue() and blk_mq_unfreeze_queue() inside
>> queue_limits_commit_update(). Doing so, no driver should need to directly call
>> freeze/unfreeze. But that would be a cleanup. Let's first fix the few instances
>> that have the update/freeze order wrong. As mentioned, the pattern simply needs
> 
> Yes, the queue only needs to be frozen for the actual update,
> which would remove the need for the locking.  The big question for both
> variants is if we can get rid of all the callers that have the queue
> already frozen and then start an update.
> 
Yes, agreed that in most cases the queue only needs to be frozen while
committing the update; however, we do have a few call sites (in the nvme
driver) where the queue is frozen before actually starting the update.
Looking at those call sites, it seems that we probably do require
freezing the queue there. One example from the NVMe driver:

nvme_update_ns_info_block()
{
    ...
    ...

    blk_mq_freeze_queue(ns->disk->queue);
    ns->head->lba_shift = id->lbaf[lbaf].ds;
    ns->head->nuse = le64_to_cpu(id->nuse);
    capacity = nvme_lba_to_sect(ns->head, le64_to_cpu(id->nsze));

    lim = queue_limits_start_update(ns->disk->queue);
    ...
    ...
    queue_limits_commit_update();
    ...
    set_capacity_and_notify(ns->disk, capacity);
    ...
    set_disk_ro(ns->disk, nvme_ns_is_readonly(ns, info));
    set_bit(NVME_NS_READY, &ns->flags);
    blk_mq_unfreeze_queue(ns->disk->queue);
    ...
}

So looking at the above example, I earlier proposed freezing the queue
in queue_limits_start_update() and unfreezing it in
queue_limits_commit_update(). In the above code we could then replace
blk_mq_freeze_queue() with queue_limits_start_update() and
blk_mq_unfreeze_queue() with queue_limits_commit_update(), and get rid
of the original call sites of the start/commit update APIs. Having said
that, I am open to better suggestions; one, from Damien, is to call
blk_mq_freeze_queue() and blk_mq_unfreeze_queue() inside
queue_limits_commit_update(). But then I wonder how we would fix call
sites such as the one shown above with that approach.

Thanks,
--Nilay
Nilay Shroff Dec. 21, 2024, 1:03 p.m. UTC | #17
On 12/19/24 11:50, Christoph Hellwig wrote:
> On Wed, Dec 18, 2024 at 06:57:45AM -0800, Damien Le Moal wrote:
>>> Yeah agreed but I see sd_revalidate_disk() is probably the only exception 
>>> which allocates the blk-mq request. Can't we fix it? 
>>
>> If we change where limits_lock is taken now, we will again introduce races
>> between user config and discovery/revalidation, which is what
>> queue_limits_start_update() and queue_limits_commit_update() intended to fix in
>> the first place.
>>
>> So changing sd_revalidate_disk() is not the right approach.
> 
> Well, sd_revalidate_disk is a bit special in that it needs a command
> on the same queue to query the information.  So it needs to be able
> to issue commands without the queue frozen.  Freezing the queue inside
> the limits lock support that, sd just can't use the convenience helpers
> that lock and freeze.
> 
>> This is overly complicated ... As I suggested, I think that a simpler approach
>> is to call blk_mq_freeze_queue() and blk_mq_unfreeze_queue() inside
>> queue_limits_commit_update(). Doing so, no driver should need to directly call
>> freeze/unfreeze. But that would be a cleanup. Let's first fix the few instances
>> that have the update/freeze order wrong. As mentioned, the pattern simply needs
> 
> Yes, the queue only needs to be frozen for the actual update,
> which would remove the need for the locking.  The big question for both
> variants is if we can get rid of all the callers that have the queue
> already frozen and then start an update.
> 
After thinking for a while, I found that broadly we have four categories
of users that need this pattern of limits-lock and/or queue-freeze:

1. Callers that need to acquire the limits-lock when starting the update, and to
   freeze the queue only when committing the update:
   - sd_revalidate_disk
   - nvme_init_identify
   - loop_clear_limits
   - few more...

2. Callers that need both to freeze the queue and to acquire the limits-lock when
   starting the update:
   - nvme_update_ns_info_block
   - nvme_update_ns_info_generic
   - few more... 

3. Callers that need neither to acquire the limits-lock nor to freeze the queue,
   because in their call stacks the limits-lock is already acquired and the queue
   is already frozen:
   - __blk_mq_update_nr_hw_queues
   - queue_xxx_store and helpers

4. Callers that only need to acquire the limits-lock; freezing the queue may not
   be needed for these callers even while committing the update:
   - scsi_add_lun
   - nvme_init_identify
   - few more...

IMO, we may convert category #4 users into category #1, as it should not
hurt even if we momentarily freeze the queue while committing the update.

Then, for each of the above categories, we may define the helpers shown below:

// For category-3:

static inline struct queue_limits
get_queue_limits(struct request_queue *q)
{
	return q->limits;
}
int set_queue_limits(struct request_queue *q,
		struct queue_limits *lim)
{
	int error;

	error = blk_validate_limits(lim);
	...
	...
	q->limits = *lim;
	if (q->disk)
		blk_apply_bdi_limits(q->disk->bdi, lim);

	return error;
}

// For category-1:

static inline struct queue_limits
__queue_limits_start_update(struct request_queue *q)
{
	mutex_lock(&q->limits_lock);
	return q->limits;
}
int __queue_limits_commit_update(struct request_queue *q,
		struct queue_limits *lim)
{
	int error;

	blk_mq_freeze_queue(q);
	error = set_queue_limits(q, lim);
	blk_mq_unfreeze_queue(q);
	mutex_unlock(&q->limits_lock);

	return error;
}

// For category-2 :
static inline struct queue_limits
queue_limits_start_update(struct request_queue *q)
{
	mutex_lock(&q->limits_lock);
	blk_mq_freeze_queue(q);
	return q->limits;
}
int queue_limits_commit_update(struct request_queue *q,
		struct queue_limits *lim)
{
	int error;

	error = set_queue_limits(q, lim);
	blk_mq_unfreeze_queue(q);
	mutex_unlock(&q->limits_lock);

	return error;
}

With the above helpers, I updated each caller based on which category it fits in.
For reference, I attached the full diff. With this change, I ran blktests to ensure
that we don't see any lockdep splats or failures.

Thanks,
--Nilay
Ming Lei Dec. 30, 2024, 9:02 a.m. UTC | #18
On Sat, Dec 21, 2024 at 06:33:13PM +0530, Nilay Shroff wrote:
> 
> 
> On 12/19/24 11:50, Christoph Hellwig wrote:
> > On Wed, Dec 18, 2024 at 06:57:45AM -0800, Damien Le Moal wrote:
> >>> Yeah agreed but I see sd_revalidate_disk() is probably the only exception 
> >>> which allocates the blk-mq request. Can't we fix it? 
> >>
> >> If we change where limits_lock is taken now, we will again introduce races
> >> between user config and discovery/revalidation, which is what
> >> queue_limits_start_update() and queue_limits_commit_update() intended to fix in
> >> the first place.
> >>
> >> So changing sd_revalidate_disk() is not the right approach.
> > 
> > Well, sd_revalidate_disk is a bit special in that it needs a command
> > on the same queue to query the information.  So it needs to be able
> > to issue commands without the queue frozen.  Freezing the queue inside
> > the limits lock support that, sd just can't use the convenience helpers
> > that lock and freeze.
> > 
> >> This is overly complicated ... As I suggested, I think that a simpler approach
> >> is to call blk_mq_freeze_queue() and blk_mq_unfreeze_queue() inside
> >> queue_limits_commit_update(). Doing so, no driver should need to directly call
> >> freeze/unfreeze. But that would be a cleanup. Let's first fix the few instances
> >> that have the update/freeze order wrong. As mentioned, the pattern simply needs
> > 
> > Yes, the queue only needs to be frozen for the actual update,
> > which would remove the need for the locking.  The big question for both
> > variants is if we can get rid of all the callers that have the queue
> > already frozen and then start an update.
> > 
> After thinking for a while I found that broadly we've four categories of users
> which need this pattern of limits-lock and/or queue-freeze:
> 
> 1. Callers which need acquiring limits-lock while starting the update; and freezing 
>    queue only when committing the update:
>    - sd_revalidate_disk

sd_revalidate_disk() should be the strangest one: it has to issue passthrough IO
commands, so a dependency between queue freeze and a lock such as q->limits_lock
cannot be added there.

Actually the current queue limits structure isn't well organized, otherwise the
limits lock wouldn't be needed for reading queue limits from hardware, since
sd_revalidate_disk() just overwrites part of the limits. Alternatively it could
be done by refactoring sd_revalidate_disk(). However, that change might be a
little big, and I guess that is the reason why Damien doesn't like it.

>    - nvme_init_identify
>    - loop_clear_limits
>    - few more...
> 
> 2. Callers which need both freezing the queue and acquiring limits-lock while starting
>    the update:
>    - nvme_update_ns_info_block
>    - nvme_update_ns_info_generic
>    - few more... 
> 
> 3. Callers which neither need acquiring limits-lock nor require freezing queue as for 
>    these set of callers in the call stack limits-lock is already acquired and queue is 
>    already frozen:
>    - __blk_mq_update_nr_hw_queues
>    - queue_xxx_store and helpers

I think it isn't correct.

The queue limits are applied in the fast IO path; in theory, anywhere that updates
q->limits needs to drain IOs in the submission path, at least after the gendisk is
added.

Also, Christoph added the limits-lock to avoid losing other concurrent updates,
which makes the problem harder to solve.



Thanks, 
Ming
Damien Le Moal Dec. 30, 2024, 11:29 p.m. UTC | #19
On 2024/12/30 18:02, Ming Lei wrote:
>> 1. Callers which need acquiring limits-lock while starting the update; and freezing 
>>    queue only when committing the update:
>>    - sd_revalidate_disk
> 
> sd_revalidate_disk() should be the most strange one, in which
> passthrough io command is required, so dependency on queue freeze lock
> can't be added, such as, q->limits_lock
> 
> Actually the current queue limits structure aren't well-organized, otherwise
> limit lock isn't needed for reading queue limits from hardware, since
> sd_revalidate_disk() just overwrites partial limits. Or it can be
> done by refactoring sd_revalidate_disk(). However, the change might
> be a little big, and I guess that is the reason why Damien don't like
> it.

That was not the reason, but yes, modifying sd_revalidate_disk() is not without the
risk of introducing regressions. The reason I proposed to simply move the queue
freeze around or inside queue_limits_commit_update() is that:

1) It is the right thing to do as that is the only place where it is actually
needed to avoid losing concurrent limits changes.

2) It clarifies the locking order between queue freeze and the limits lock.

3) The current issues should mostly all be solved with some refactoring of the
->store() calls in blk-sysfs.c, resolving the current ABBA deadlocks between
queue freeze and limits lock.

With that, we should be able to fix the issue for all block drivers with changes
to the block layer sysfs code only. But... I have not looked into the details of
all limits commit calls in all block drivers. So there may be some bad apples in
there that will also need some tweaking.

>>    - nvme_init_identify
>>    - loop_clear_limits
>>    - few more...
>>
>> 2. Callers which need both freezing the queue and acquiring limits-lock while starting
>>    the update:
>>    - nvme_update_ns_info_block
>>    - nvme_update_ns_info_generic
>>    - few more... 

The queue freeze should not be necessary anywhere when starting the update. The
queue freeze is only needed when applying the limits so that IOs that are in
flight are not affected by the limits change while still being processed.

>>
>> 3. Callers which neither need acquiring limits-lock nor require freezing queue as for 
>>    these set of callers in the call stack limits-lock is already acquired and queue is 
>>    already frozen:
>>    - __blk_mq_update_nr_hw_queues
>>    - queue_xxx_store and helpers
> 
> I think it isn't correct.
> 
> The queue limits are applied on fast IO path, in theory anywhere
> updating q->limits need to drain IOs in submission path at least
> after gendisk is added.

Yes !
Nilay Shroff Jan. 1, 2025, 11:17 a.m. UTC | #20
On 12/31/24 04:59, Damien Le Moal wrote:
> On 2024/12/30 18:02, Ming Lei wrote:
>>> 1. Callers which need acquiring limits-lock while starting the update; and freezing 
>>>    queue only when committing the update:
>>>    - sd_revalidate_disk
>>
>> sd_revalidate_disk() should be the most strange one, in which
>> passthrough io command is required, so dependency on queue freeze lock
>> can't be added, such as, q->limits_lock
>>
>> Actually the current queue limits structure aren't well-organized, otherwise
>> limit lock isn't needed for reading queue limits from hardware, since
>> sd_revalidate_disk() just overwrites partial limits. Or it can be
>> done by refactoring sd_revalidate_disk(). However, the change might
>> be a little big, and I guess that is the reason why Damien don't like
>> it.
> 
> That was not the reason, but yes, modifying sd_revalidate_disk() is not without
> risks of introducing regressions. The reason I proposed to simply move the queue
> freeze around or inside queue_limits_commit_update() is that:
> 
> 1) It is the right thing to do as that is the only place where it is actually
> needed to avoid losing concurrent limits changes.
> 
> 2) It clarifies the locking order between queue freeze and the limits lock.
> 
> 3) The current issues should mostly all be solved with some refactoring of the
> ->store() calls in blk-sysfs.c, resolving the current ABBA deadlocks between
> queue freeze and limits lock.
> 
> With that, we should be able to fix the issue for all block drivers with changes
> to the block layer sysfs code only. But... I have not looked into the details of
> all limits commit calls in all block drivers. So there may be some bad apples in
> there that will also need some tweaking.
Yes, I think we have places other than blk-sysfs, like nvme, that need fixing.
> 
>>>    - nvme_init_identify
>>>    - loop_clear_limits
>>>    - few more...
>>>
>>> 2. Callers which need both freezing the queue and acquiring limits-lock while starting
>>>    the update:
>>>    - nvme_update_ns_info_block
>>>    - nvme_update_ns_info_generic
>>>    - few more... 
> 
> The queue freeze should not be necessary anywhere when starting the update. The
> queue freeze is only needed when applying the limits so that IOs that are in
> flight are not affected by the limits change while still being processed.
Hmm, but as mentioned, for nvme limits updates this is not always true, so this needs to be tweaked.
>>>
>>> 3. Callers which neither need acquiring limits-lock nor require freezing queue as for 
>>>    these set of callers in the call stack limits-lock is already acquired and queue is 
>>>    already frozen:
>>>    - __blk_mq_update_nr_hw_queues
>>>    - queue_xxx_store and helpers
>>
>> I think it isn't correct.
>>
>> The queue limits are applied on fast IO path, in theory anywhere
>> updating q->limits need to drain IOs in submission path at least
>> after gendisk is added.
> 
> Yes !
> 
OK, I think it'd be tricky to fix __blk_mq_update_nr_hw_queues and queue_xxx_store
with the current proposal of acquiring the limits lock followed by a queue freeze
while committing the limits update.

Thanks,
--Nilay

Patch

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 8f09e33f41f6..b737428c6084 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -422,6 +422,7 @@  int queue_limits_commit_update(struct request_queue *q,
 {
 	int error;
 
+	mutex_lock(&q->limits_lock);
 	error = blk_validate_limits(lim);
 	if (error)
 		goto out_unlock;
@@ -456,7 +457,6 @@  EXPORT_SYMBOL_GPL(queue_limits_commit_update);
  */
 int queue_limits_set(struct request_queue *q, struct queue_limits *lim)
 {
-	mutex_lock(&q->limits_lock);
 	return queue_limits_commit_update(q, lim);
 }
 EXPORT_SYMBOL_GPL(queue_limits_set);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 378d3a1a22fc..6cc20ca79adc 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -944,8 +944,13 @@  static inline unsigned int blk_boundary_sectors_left(sector_t offset,
 static inline struct queue_limits
 queue_limits_start_update(struct request_queue *q)
 {
+	struct queue_limits lim;
+
 	mutex_lock(&q->limits_lock);
-	return q->limits;
+	lim = q->limits;
+	mutex_unlock(&q->limits_lock);
+
+	return lim;
 }
 int queue_limits_commit_update(struct request_queue *q,
 		struct queue_limits *lim);
@@ -962,7 +967,6 @@  int blk_validate_limits(struct queue_limits *lim);
  */
 static inline void queue_limits_cancel_update(struct request_queue *q)
 {
-	mutex_unlock(&q->limits_lock);
 }
 
 /*